I have come up with a moderately complex solution to a particular use case that I have and am posting it here in case it is useful to someone else, and to get suggestions on how to improve it.

The problem:

I have a large number of files that are accessed infrequently and stored off-line on DVD-Rs. I need to keep track of which files are on which disc so that when I want a file I can find it.

The solution:

I currently keep a text file to track which files are on which discs. I would like to organize all the files in a proper filesystem using git annex, allowing me better organization and the ability to keep some smaller related files online near the annexed large files.

Requirements:

1) Easily locate the DVD-R containing any specific offline file

This is easily taken care of with git annex whereis

2) Automatically de-duplicate stored files with the same contents

This is taken care of by using one of the hash backends (e.g. SHA256)
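
As a rough illustration of why a hash backend de-duplicates, the sketch below hashes three throwaway files with plain sha256sum. It does not use git-annex itself; the file names and the hash_of helper are made up for the example.

```shell
# Two files with different names but identical contents, plus one that differs.
tmp=$(mktemp -d)
printf 'same payload\n'  > "$tmp/holiday.bin"
printf 'same payload\n'  > "$tmp/copy-of-holiday.bin"
printf 'other payload\n' > "$tmp/other.bin"

# sha256sum prints "<hash>  <name>"; keep only the hash (hypothetical helper).
hash_of() { sha256sum "$1" | cut -d' ' -f1; }

h1=$(hash_of "$tmp/holiday.bin")
h2=$(hash_of "$tmp/copy-of-holiday.bin")
h3=$(hash_of "$tmp/other.bin")

# Identical contents give the same hash regardless of the file name.
[ "$h1" = "$h2" ] && echo "identical contents: same hash"
[ "$h1" != "$h3" ] && echo "different contents: different hash"

rm -rf "$tmp"
```

Because the key is derived from the contents rather than the name, two copies of the same file only need to be stored once.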

3) The DVD-Rs still need to be usable without git or git-annex (e.g. the stored files should retain their normal human-readable names)

This requirement rules out the directory and rsync special remotes, because they store files under names derived from their hashes. I have settled on making each disc a separate repo, which satisfies this requirement.

Future goals:

4) Easily incorporate the current DVD-Rs into the new system

I haven't found a way to fulfill this goal yet. I have some convoluted ideas, but nothing so easy as mount disc, run git annex command.

The solution in detail

Suppose you have the following tree:

~/mainrepo/thing1/file1.bin
~/mainrepo/thing1/description1.txt
~/mainrepo/thing2/file2.bin
~/mainrepo/thing2/description2.txt

You want to store thing1 on disc1 and thing2 on disc2, but you'd like to keep the descriptions online because they are small and useful for figuring out which thing you want later.

1) Create the main repo and annex the files:

cd ~/mainrepo
git init
git annex init mainrepo
git annex add .
git commit -m 'added files'

2) Create two new unrelated repos and populate them with their respective data and annex:

cd /tmp
mkdir disc1repo disc2repo
cd disc1repo
cp ~/mainrepo/thing1/* .
git init
git annex init disc1
git annex add .
git commit -m 'added files'
cd ../disc2repo
cp ~/mainrepo/thing2/* .
git init
git annex init disc2
git annex add .
git commit -m 'added files'

3) This is optional, but after annexing the files in these new repos, I replace the symlinks pointing into the .git/annex/objects/ directory with hard links. This makes the DVD-Rs usable from operating systems that can't deal with symlinks. (mkisofs handles hard links correctly.)

cd /tmp
find disc1repo/ disc2repo/ -type l -execdir sh -c 'mv -iv "$1" "$1.symlink" && ln -L "$1.symlink" "$1" && rm "$1.symlink"' sh {} \;
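
To see what the symlink-to-hard-link conversion does, here is a small self-contained sketch that runs the same idea in a scratch directory. The store/object layout is only a stand-in for .git/annex/objects/, and it assumes an ln that supports -L (GNU coreutils does) and a test that supports -ef.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/repo/store"
printf 'payload\n' > "$tmp/repo/store/object"
# A relative symlink, similar in spirit to what git annex add leaves behind.
ln -s store/object "$tmp/repo/file1.bin"

# For each symlink: move it aside, hard-link its target under the original
# name (ln -L follows the symlink), then remove the moved symlink.
find "$tmp/repo" -type l -execdir sh -c '
    mv "$1" "$1.symlink" && ln -L "$1.symlink" "$1" && rm "$1.symlink"
' sh {} \;

# The name is now a regular file that is the same file as the object.
converted=no
if [ ! -L "$tmp/repo/file1.bin" ] &&
   [ "$tmp/repo/file1.bin" -ef "$tmp/repo/store/object" ]; then
    converted=yes
fi
echo "converted: $converted"

rm -rf "$tmp"
```

Since both names refer to the same data on disk, the conversion should not make the disc image noticeably larger.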

4) Burn these repos onto DVD-Rs:

cd /tmp
#make isos
mkisofs -volid disc1 -rational-rock -joliet -joliet-long -udf -full-iso9660-filenames -iso-level 3 -o disc1.iso disc1repo/
mkisofs -volid disc2 -rational-rock -joliet -joliet-long -udf -full-iso9660-filenames -iso-level 3 -o disc2.iso disc2repo/
#burn the isos (untested command)
cdrecord -v -dao disc1.iso
cdrecord -v -dao disc2.iso

5) Mount each DVD-R, add it as a remote and fetch from it, then drop the files from the main repo:

cd ~/mainrepo
#disc1
mount /mnt/cdrom
git remote add disc1 /mnt/cdrom
git fetch disc1
git annex drop thing1/file1.bin
umount /mnt/cdrom
#disc2
mount /mnt/cdrom
git remote add disc2 /mnt/cdrom
git fetch disc2
git annex drop thing2/file2.bin
umount /mnt/cdrom

6) Enjoy! You can now find out what disc things are on simply using git annex whereis, and you can git annex get them or simply use them directly from the disc.

I'd appreciate any comments and helpful suggestions, especially on how to simplify the process or easily integrate all the things I already have stored on discs.

Maybe it would be possible to create a special remote using the hooks for the DVD-Rs.

Even though it is a bit tedious and complicated, the current process could be automated using a script.
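
A minimal sketch of what such a script might look like for steps 2 and 4, under some assumptions: the run and make_disc helpers and the DRY_RUN switch are made up for this example, git -C needs a reasonably recent git, and the hard-link and burning steps are left out. With DRY_RUN=1 it only prints the commands, so it can be inspected without git-annex or mkisofs installed.

```shell
# Print the command instead of running it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# make_disc LABEL SRC WORK: build WORK/LABELrepo from SRC and an ISO from it.
make_disc() {
    label=$1    # git-annex description and ISO volume id, e.g. disc1
    src=$2      # directory whose files go on the disc
    work=$3     # scratch directory for the repo and the ISO
    repo="$work/${label}repo"

    run mkdir -p "$repo"
    run cp -R "$src/." "$repo/"
    run git -C "$repo" init
    run git -C "$repo" annex init "$label"
    run git -C "$repo" annex add .
    run git -C "$repo" commit -m 'added files'
    # Same mkisofs options as in step 4.
    run mkisofs -volid "$label" -rational-rock -joliet -joliet-long -udf \
        -full-iso9660-filenames -iso-level 3 -o "$work/$label.iso" "$repo/"
}

# Dry run: show what would be done for disc1 without touching anything.
DRY_RUN=1
make_disc disc1 "$HOME/mainrepo/thing1" /tmp
```

Setting DRY_RUN=0 would run the commands for real; I would only do that after checking the printed commands against the manual steps.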

http://dar.linux.free.fr/doc/index.html

Would be nice to have this as another remote option for git-annex, since I too would like to have static (and possibly incrementally extended) remotes that span multiple DVDs

Comment by https://me.yahoo.com/a/2grhJvAC049fJnvALDXek.6MRZMTlg--#eec89 Sat Oct 20 19:03:37 2012

dar looks familiar, I'm sure I have run across it in the past. However, it is not suitable in this case; see requirement #3 above that the DVD-Rs be usable without git or git-annex.

What would work would be some sort of special remote that allows free-form data. Imagine that you create the DVD-R with the files on it, then you mount it and add the mount directory as a free-form special remote. git-annex checksums all the files under the specified directory and stores the relative path to each file somewhere. Then, when you want to fetch a specific hash from the remote, it looks up the relative path, appends it to the base directory, and transfers the file into the local .git/annex/objects/ store.
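
The checksum-and-lookup part of that idea can be sketched with ordinary tools. This is only an illustration of the bookkeeping, not anything git-annex provides: build_index and lookup_path are made-up helpers, the index is a plain text file, and file names containing newlines or backslashes are not handled.

```shell
# build_index DIR: print "<sha256>  <relative path>" for every file under DIR.
build_index() {
    ( cd "$1" && find . -type f -exec sha256sum {} + | sed 's|  \./|  |' | sort )
}

# lookup_path INDEX HASH: print the relative path(s) recorded for HASH.
# A sha256 hex digest is 64 characters, followed by two spaces, so the
# path starts at column 67. HASH is assumed to be a well-formed hex digest.
lookup_path() {
    grep "^$2  " "$1" | cut -c67-
}

# Stand-in for a mounted disc with plain, human-readable file names.
disc=$(mktemp -d)
mkdir -p "$disc/thing1"
printf 'big file\n' > "$disc/thing1/file1.bin"
printf 'notes\n'    > "$disc/thing1/description1.txt"

index=$(mktemp)
build_index "$disc" > "$index"

wanted=$(sha256sum "$disc/thing1/file1.bin" | cut -d' ' -f1)
found=$(lookup_path "$index" "$wanted")
echo "$found"    # prints thing1/file1.bin
```

Fetching a file would then amount to copying "$disc/$found" into place once the right disc is mounted.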

Comment by Steve Sat Oct 20 22:11:23 2012

I have already stored a lot of large files on DVDs. I did that for archiving, so I cared that there are several copies. But I want this to be more automated.

I take my disc (or one created by someone else, without any knowledge of Git), checksum its contents in git-annex, and in the projects where I'm using this content, I can check that the file is archived on at least N discs.

Also, I might enhance the content. This would be reflected in a Git commit, so I also want to be able to check that the new version has been archived on several discs.

A special remote for such free-form read-only media would be very convenient.

Comment by http://lj.rossia.org/users/imz/ Sat Oct 20 23:58:45 2012

This is starting to get interesting. A free-form remote would definitely simplify my use case, and also solve the "future goal" of easily incorporating my already existing DVD-Rs.

I haven't really looked into the git-annex internals up to this point, but looking at the hook page there doesn't seem to be a hook for init which would be needed to populate git-annex's index of files in the remote. (git-annex seems to assume that new special remotes are empty)

Another problem is where to store the hash to path relation information. On a RW remote it would be stored in the remote, but here we need to keep it in the repo somehow. This could be in the git-annex branch, or possibly another branch created specifically for this purpose.

1) initremote needs to:

  • hash the contents of all the remote's files
  • update git-annex's index of the remote's contents
  • store the paths to the hashes in the repo

2) store and remove should just fail.

3) retrieve and checkpresent seem straightforward.
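
A rough sketch of how the four hook actions might behave for read-only media, written as shell functions. Several things here are assumptions: the hook_* names, the INDEX file of "<sha256>  <relative path>" lines, the DISC_ROOT mount point, and the guess that a SHA256 key ends in "--<hash>" with an optional extension. ANNEX_KEY and ANNEX_FILE are the variables I understand the hook special remote to pass; check its documentation before relying on this.

```shell
# Extract the hex digest from a key such as SHA256-s9--<hash> (assumed format).
key_hash() { h=${ANNEX_KEY##*--}; printf '%s\n' "${h%%.*}"; }

# Relative path on the disc for the current key, or nothing if unknown.
disc_path() { grep "^$(key_hash)  " "$INDEX" | head -n 1 | cut -c67-; }

# 2) store and remove just fail on read-only media.
hook_store()  { echo "read-only media: cannot store" >&2; return 1; }
hook_remove() { echo "read-only media: cannot remove" >&2; return 1; }

# 3) retrieve copies the file from the disc to where git-annex wants it.
hook_retrieve() {
    p=$(disc_path)
    [ -n "$p" ] && cp "$DISC_ROOT/$p" "$ANNEX_FILE"
}

# checkpresent prints the key only if the file is really on the mounted disc.
hook_checkpresent() {
    p=$(disc_path)
    if [ -n "$p" ] && [ -f "$DISC_ROOT/$p" ]; then
        printf '%s\n' "$ANNEX_KEY"
    fi
}

# Demo against a scratch directory standing in for the mounted disc.
DISC_ROOT=$(mktemp -d)
printf 'big file\n' > "$DISC_ROOT/file1.bin"
INDEX=$(mktemp)
digest=$(sha256sum "$DISC_ROOT/file1.bin" | cut -d' ' -f1)
printf '%s  %s\n' "$digest" "file1.bin" > "$INDEX"

ANNEX_KEY="SHA256-s9--$digest"
ANNEX_FILE=$(mktemp)
hook_retrieve && echo "retrieved to $ANNEX_FILE"
present=$(hook_checkpresent)
```

The missing piece is still step 1: something has to build the index and tell git-annex which keys the disc holds, since there is no hook for that.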

The assistant blog mentions adding support for read only remotes but I don't know anything about it: day 65 transfer polish (I'm still on 3.20120605)

Let me know if there is anything I haven't thought of yet.

Comment by Steve Sun Oct 21 02:07:40 2012

I encourage playing around with the hook special remote and see how far you can make it go.

I may be doing something vaguely like this for desymlink, although I'm pretty sure it would still have a git repository associated with the directory of regular files.

One option is to use the web special remote, with file:// urls. Assuming a given disc will always end up mounted somewhere stable, such as /media/dvd1, /media/dvd2, etc, you could then just git annex addurl file:///media/dvd1/$file. git annex whereis will show the url, which has enough info to work out the disk to mount.
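
As a sketch of how that could be applied to a whole disc, the helper below prints one addurl command per file under a mount point instead of running anything. addurl_commands is a made-up name, and the sketch assumes file names without spaces or characters that would need URL-encoding or shell quoting.

```shell
# addurl_commands MOUNT: print a "git annex addurl" line for each file
# under MOUNT, using the file's relative path as the name in the repo.
addurl_commands() {
    mount=$1
    ( cd "$mount" && find . -type f | sed 's|^\./||' | sort ) |
    while IFS= read -r rel; do
        printf 'git annex addurl --file=%s file://%s/%s\n' "$rel" "$mount" "$rel"
    done
}

# Scratch directory standing in for a stable mount point such as /media/dvd1.
mnt=$(mktemp -d)
mkdir -p "$mnt/thing1"
printf 'big file\n' > "$mnt/thing1/file1.bin"
printf 'notes\n'    > "$mnt/thing1/description1.txt"

cmds=$(addurl_commands "$mnt")
printf '%s\n' "$cmds"
```

After reviewing the printed lines, they could be run from inside the main repo while the disc is mounted.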

The web special remote did not support file:// urls, but I've just fixed that. The only downside is that, while it will identify files duplicated across disks, and whereis will show multiple urls for such files, there's only one web special remote, and so it only counts as 1 copy. This could perhaps be improved; git-annex may eventually get support for remotes reporting how many copies of a file they contain.

Comment by http://joeyh.name/ Sun Oct 21 05:36:36 2012

This works great! I first tried it with WORM, no-go. I can see why the SHA backends are so powerful, they appear to circumvent the commits which git usually uses for merging. When I first do the merge, it reports this:

warning: no common commits

Compared to how I've managed CD/DVD backups in the past, this is a quantum leap forward, and I don't find it convoluted in comparison. Yes, there is dar, but I prefer this method. In my case, it's the perfect solution for original files, which in general are treated as immutable and not accessed very often. They are usually large, too! I'm using them for digital pictures.

Comment by http://www.openid.albertlash.com/openid/ Wed Oct 24 22:00:31 2012

Hi Joey,

Thanks for the advice. I had thought of the web special remote, but as you may have noticed from my example, I don't use automount, so my DVDs and CDs all get mounted in the same place (/mnt/cdrom). The web special remote therefore won't work for me.

I'll try to play around with the hook special remote this weekend. I had a thought it might be interesting to have it search for the DVDs in some common places or even by parsing the mounted file systems, and allow an override or augmentation through git config.
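
The search itself could be as simple as the sketch below, which checks a list of candidate directories for a file that should be on the disc. find_disc is a made-up helper; where the candidate list comes from (common mount points, the mounted file systems, or a git config override) is left open.

```shell
# find_disc REL_PATH CANDIDATE...: print the first candidate directory
# that contains REL_PATH; fail if none of them does.
find_disc() {
    rel=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$rel" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Two scratch directories standing in for mount points; only the second
# one "has the disc" with the wanted file.
a=$(mktemp -d)
b=$(mktemp -d)
mkdir -p "$b/thing1"
: > "$b/thing1/file1.bin"

found=$(find_disc thing1/file1.bin "$a" "$b")
echo "disc found at: $found"
```

In practice the candidates might be something like /mnt/cdrom plus whatever is under /media, with an override list tried first.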

Comment by Steve Wed Oct 24 23:26:53 2012

Albert,

Thanks for the feedback! I'm glad that somebody else found the method I worked out useful. As I'm going to try to turn it into a proper special remote, let me know if there is any particular use case or feature you'd like me to address.

Note that in my testing, I found that you don't actually need to merge the DVD's branch into the local branch you are using for git annex to be able to find the files on it that are identical to files in your local branch.

I haven't played around with cloning the repo, but I will try that this weekend. I'm thinking it might be necessary to create local branches from the DVD remotes so that they'll get carried along when you clone the repo.

As for the repos on the DVDs not having a shared ancestry with the main repo, that was a conscious choice. I wanted to add as little extra data to the DVDs as possible, since I usually fill them to the brim anyway. I didn't feel that it would be beneficial for the DVDs to know about the history of the main repo and other files that they don't contain. Furthermore, besides all the links and history, you'd be replicating all the files in the main repo that aren't annexed.

If you want to avoid the error, but still have a local branch for the DVD repos you should be able to do something like the following:

WARNING: these commands are untested!

git checkout -b disc1 disc1/master
git checkout -b disc2 disc2/master

Working from the original example, you should then get local branches for the DVDs that don't have a common ancestor with your master local repo. I haven't actually tested that though. Testing will have to wait for this weekend.

Comment by Steve Wed Oct 24 23:52:30 2012
@Steve, it seems to me you could still use the web special remote, just pointing it at a URL that goes through a symlink to the mount point.
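
A small sketch of that idea, using scratch directories in place of the real locations: one symlink per disc label, all pointing at the single mount point, so that the URL names the disc even though every disc is mounted in the same place. The directory names are stand-ins chosen for the example.

```shell
links=$(mktemp -d)        # stands in for a directory of per-disc symlinks
mount_point=$(mktemp -d)  # stands in for the one real mount point
printf 'payload\n' > "$mount_point/file1.bin"

# disc1 is just a name; it resolves to whatever is currently mounted.
ln -s "$mount_point" "$links/disc1"

# The URL records the disc label in its path.
url="file://$links/disc1/file1.bin"
echo "$url"

# Reading the path part of the URL goes through the symlink.
content=$(cat "${url#file://}")
echo "$content"
```

The URL shown by whereis would then say which disc to mount, while the symlink takes care of where it is actually mounted.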
Comment by http://joeyh.name/ Thu Oct 25 03:33:29 2012