What steps will reproduce the problem?
Take a large sub-directory in a repository (e.g. ccash) with some files within,
$ tar -xzf ccash.tar.gz
$ du -sh ccash
59M ccash
$ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
-rw-r--r-- 1 dietz dietz 1748 Jul 27 2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java
-rw-r--r-- 1 dietz dietz 313898 May 22 18:36 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
Annex it,
$ git annex add ccash
...
$ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
lrwxrwxrwx 1 dietz dietz 215 Jul 27 2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java -> ../../../../../../../../../../../.git/annex/objects/mv/zf/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486
lrwxrwxrwx 1 dietz dietz 210 Jul 27 2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar -> ../../../../../../../../.git/annex/objects/8G/gQ/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73
Unannex it (before or after committing),
$ git annex unannex ccash
Note that some fraction of the files will still be symbolic links, now pointing to non-existent files. This data has apparently been lost forever.
$ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
-rw-r--r-- 1 dietz dietz 1748 Jul 27 2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java
lrwxrwxrwx 1 dietz dietz 210 Jul 27 2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar -> ../../../../../../../../.git/annex/objects/8G/gQ/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73
It is unclear why some files are affected while others are not. That being said, unannexing small numbers of files at a time appears to avoid the issue,
$ tar -zxf ccash.tar.gz
$ git annex add ccash
$ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
lrwxrwxrwx 1 dietz dietz 215 Jul 27 2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java -> ../../../../../../../../../../../.git/annex/objects/mv/zf/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486
lrwxrwxrwx 1 dietz dietz 210 Jul 27 2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar -> ../../../../../../../../.git/annex/objects/8G/gQ/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73/SHA256-s313898--593552ffea3c5823c6602478b5002a7c525fd904a3c44f1abe4065c22edfac73
$ git annex unannex ccash/trunk/DataProvider/WebContent/WEB-INF
...
$ ls -l ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
lrwxrwxrwx 1 dietz dietz 215 Jul 27 2011 ccash/trunk/annotationinterface/src/edu/byu/nlp/annotationinterface/java/BasicAnnotation.java -> ../../../../../../../../../../../.git/annex/objects/mv/zf/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486/SHA256-s1748--5c0d1cbf104214b6d0ab85c53a85cadb975ec208f42a7b33a76d85e175352486
-rw-r--r-- 1 dietz dietz 313898 Jul 27 2011 ccash/trunk/DataProvider/WebContent/WEB-INF/lib/dom4j.jar
For this reason, it seems likely this is due to some sort of race condition.
What version of git-annex are you using? On what operating system?
This is on Ubuntu 12.04 with git-annex revision a1e2bc4.
Here is a quick script which reproduces the issue on another Ubuntu 12.04 machine,
This results in dozens of dead symlinks.
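To count dead symlinks like these after an unannex, one possible check is sketched below. It is not the reporter's reproduction script; the function name list_dead_symlinks is made up for illustration, and it relies only on standard find and test.

```shell
# list_dead_symlinks DIR
# Print every symlink under DIR whose target does not exist.
# (Hypothetical helper name, not part of git-annex.)
list_dead_symlinks() {
    # -type l selects symlinks; "test -e" follows the link,
    # so it fails exactly when the link target is missing.
    find "$1" -type l ! -exec test -e {} \; -print
}
```

For example, `list_dead_symlinks ccash | wc -l` would give the number of dangling links under ccash.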
What's going on here is you have multiple files with the same content, so the symlinks point to the same annexed file. When unannex processes the first symlink, it moves the annexed file to replace it. This breaks the other symlink that pointed to it. Notice that if you then re-add the file to the annex, the broken symlink automatically gets fixed -- there's no actual data loss going on here.
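The mechanism can be imitated without git-annex at all. The following is only a sketch under simplifying assumptions: the objects/KEY layout stands in for .git/annex/objects, the file names are invented, and simulate_unannex is a hypothetical helper that mimics "replace the symlink with the annexed file" rather than calling git-annex itself.

```shell
# simulate_unannex DIR
# Two symlinks share one "annexed" object; "unannexing" the first
# one moves the object away and leaves the second one dangling.
simulate_unannex() {
    dir=$1
    mkdir -p "$dir/objects"
    printf 'same content\n' > "$dir/objects/KEY"

    # Two files with identical content point at the same object.
    ln -s objects/KEY "$dir/a.jar"
    ln -s objects/KEY "$dir/b.jar"

    # "Unannex" a.jar: drop the symlink, move the object into its place.
    rm "$dir/a.jar"
    mv "$dir/objects/KEY" "$dir/a.jar"

    # Report what each name is now.
    for f in a.jar b.jar; do
        if [ -L "$dir/$f" ] && [ ! -e "$dir/$f" ]; then
            echo "$f: dangling symlink"
        elif [ -f "$dir/$f" ] && [ ! -L "$dir/$f" ]; then
            echo "$f: regular file"
        fi
    done
}
```

Run against an empty scratch directory, this reports a.jar as a regular file and b.jar as a dangling symlink. The content itself is intact in a.jar, which matches the point that no data is actually lost.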
This problem can be avoided by using git annex unannex --fast, which makes hard links to the annexed file. But then you are also left with the hard links in .git/annex/objects. git annex unused can find and remove them.
It may make sense to make the current "--fast" behavior the default for unannex.
If unannex makes the file a hard link to the annexed content, it will be mode 444 or so. But if the user changes the permissions and modifies it, that will corrupt the content still in the annex!
So the current --fast behavior seems no worse than the proposed behavior. And it's not at all clear to me that this would be a better default behavior for unannex than the current behavior, which at least ensures that data left in the annex (and referred to by another annexed file) cannot be corrupted.
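The hard link corruption risk described above can be shown with plain files. This is a sketch, not git-annex's behavior verbatim: the objects/KEY path and file.txt are invented stand-ins, and simulate_hardlink_edit is a hypothetical name.

```shell
# simulate_hardlink_edit DIR
# A work tree file that is a hard link to an "annexed" object shares
# its inode, so an in-place edit changes the object as well.
simulate_hardlink_edit() {
    dir=$1
    mkdir -p "$dir/objects"
    printf 'original\n' > "$dir/objects/KEY"
    chmod 444 "$dir/objects/KEY"

    # Hard link the object into the "work tree".
    ln "$dir/objects/KEY" "$dir/file.txt"

    # The user makes the file writable and overwrites it in place.
    chmod 644 "$dir/file.txt"
    printf 'edited\n' > "$dir/file.txt"

    # Print what the "annexed" object now contains.
    cat "$dir/objects/KEY"
}
```

Here the object ends up containing "edited", so its content no longer matches the key it is stored under. Tools that save by writing a new file and renaming it over the old one would break the link instead of altering the object, so the outcome depends on how the file is modified.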