Pruning Blobs from BOSH Releases

August 6, 2015

Over time, as blobs are continually added to BOSH releases, the files can start consuming lots of disk space. Blobs are frequently abandoned because newer versions replace them, or sometimes the original packages referencing them are removed. Unfortunately, freeing the disk space isn't as simple as rm blobs/elasticsearch-1.5.2.tar.gz because BOSH keeps track of blobs in the config/blobs.yml file and uses symlinks to cached copies.

To help keep a lean workspace, I remove references to blobs which are no longer needed in my release. The blobs remain untouched in the blobstore/S3, but as far as my local bosh command cares about, it doesn't need to keep local copies. One option for pruning is to manually edit config/blobs.yml and remove the old references (and then run bosh sync blobs to update blobs/). However, I tend to go the other direction - interactively or with shell scripts – removing files from blobs/ and then updating blobs.yml with this command...

for FILE in $( grep -E '^[^ ].+:$' config/blobs.yml | tr -d ':' ) ; do
  [ -e "blobs/${FILE}" ] || sed -i '' -E -e "\\#^${FILE}:\$#{N;N;N;d;}" config/blobs.yml
done

Once they're gone from blobs.yml I can commit the changes and know that the next time I need to clone/sync into a new workspace it'll be faster.

git commit -om 'Prune old blob references' config/blobs.yml

But... while those blobs are no longer listed in config/blobs.yml and they are no longer in blobs/, the blob still exists in the .blobs directory where bosh keeps an original copy. I can remove unreferenced blobs from .blobs with this command...

for BLOBSHA in $( find .blobs -type f ) ; do
  grep -qE "^  sha:\s+$( basename $BLOBSHA )" config/blobs.yml || rm "$BLOBSHA"
done

Even though the blobs are now effectively gone, their references still exist in repository history. For example, if you wanted to rebuild your .blobs cache directory you could loop through changes to blobs.yml and rerun bosh sync blobs to restore local copies...

for COMMIT in $( git rev-list --parents HEAD -- config/blobs.yml | cut -d" " -f1 ; git rev-parse HEAD ) ; do
  git checkout "$COMMIT" config/blobs.yml
  bosh sync blobs
done

As an example, here's a before and after of cleaning up blobs in my long-running logsearch-boshrelease workspace...

$ 
du -sh .blobs/ | cut -f1
904M
...cleanup...
$ 
du -sh .blobs/ | cut -f1
168M