git-annex can transfer data to and from configured git remotes. Normally those remotes are normal git repositories (bare and non-bare; local and remote) that store the file contents in their own git-annex directory.
But git-annex also extends git's concept of remotes with these special types of remotes. These can be used by git-annex just like any normal remote; they cannot be used by other git commands, though.
- S3 (Amazon S3, and other compatible services)
- Amazon Glacier
- bup
- directory
- rsync
- webdav
- web
- xmpp
- hook
The above special remotes can be used to tie git-annex into many cloud services. Here are specific instructions for various cloud things:
- Amazon S3
- Amazon Glacier
- Internet Archive via S3
- tahoe-lafs
- Box.com
- Google drive
- Google Cloud Storage
- Mega.co.nz
- SkyDrive
- OwnCloud
- Flickr
- IMAP
- Usenet
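A special remote of one of the types above is typically created with `git annex initremote`, and an existing special remote can be enabled in other clones with `git annex enableremote`. A minimal sketch, assuming an S3 remote named "mys3" (the bucket name, the file name, and the choice of shared encryption are only illustrative):

    # a sketch only: create an S3 special remote named "mys3"; the bucket name
    # and the choice of shared encryption are example values, not requirements
    git annex initremote mys3 type=S3 encryption=shared bucket=my-annex-bucket
    # copy annexed content to it and check where the content now lives
    git annex copy somefile --to mys3
    git annex whereis somefile
    # in another clone of the repository, the existing remote can be enabled with
    git annex enableremote mys3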
Unused content on special remotes
Over time, special remotes can accumulate file content that is no longer referred to by files in git. Normally, unused content in the current repository is found by running `git annex unused`. To detect unused content on special remotes, instead use `git annex unused --from`. Example:
    $ git annex unused --from mys3
    unused mys3 (checking for unused data...)
      Some annexed data on mys3 is not used by any files in this repository.
        NUMBER  KEY
        1       WORM-s3-m1301674316--foo
      (To see where data was previously used, try: git log --stat -S'KEY')
      (To remove unwanted data: git-annex dropunused --from mys3 NUMBER)
    $ git annex dropunused --from mys3 1
    dropunused 1 (from mys3...) ok
Similar to a JBOD (just a bunch of disks), this would be Just A Bunch Of Files. I already have a NAS with a file structure conducive to serving media to my TV. However, it is not currently capable of running git-annex locally. It would be great to be able to tell annex the path to a file there as a remote, much like a web remote from "git annex addurl". That way I could safely drop all the files I took with me on my trip, while annex still verifies and counts the copy on the NAS as a location.
There are some interesting things to figure out for this to be efficient; for example, the SHAs of the files. Maybe store them in a metadata file in the directory of the files? Or perhaps use the WORM backend by default?
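Just to illustrate the metadata-file idea, a rough sketch of pre-computing checksums for the files already sitting on the NAS (nothing here is a git-annex feature; the mount point and manifest location are hypothetical):

    # not a git-annex feature, only an illustration of the metadata-file idea:
    # record SHA256 sums for everything on the NAS share so a hypothetical
    # "bunch of files" remote could verify content without re-reading it each time
    find /mnt/nas/media -type f -print0 | xargs -0 sha256sum > ~/nas-manifest.sha256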
Would it be possible to support Rapidshare as a new special remote? They offer unlimited storage for 6-10€ per month. It would be great for larger backups. Their API can be found here: http://images.rapidshare.com/apidoc.txt
Is there any chance of a special remote that functions like a hybrid of 'web' and 'hook'? At least in theory, it should be relatively simple, since it would only support 'get', and the only meaningful parameters to pass would be the URL and the output file name.
Maybe make it something like `git config annex.myprogram-webhook 'myprogram $ANNEX_URL $ANNEX_FILE'`, and fetching could work by adding a --handler or --type parameter to addurl.
The use case here is anywhere that a simple 'fetch the file over HTTP/FTP/etc' isn't workable -- maybe it's on rapidshare and you need to use plowshare to download it; maybe it's a youtube video and you want to use youtube-dl; maybe it's a chapter of a manga and you want to turn it into a CBZ file when you fetch it.
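To make the proposal concrete, a sketch of how it might look (none of this exists in git-annex today; the config key, the --handler option, and the ANNEX_URL/ANNEX_FILE variables are all part of the suggestion above):

    # proposed, not implemented: register a handler that fetches via youtube-dl;
    # the config key, --handler option and ANNEX_* variables are hypothetical
    git config annex.youtube-webhook 'youtube-dl -o "$ANNEX_FILE" "$ANNEX_URL"'
    # addurl would then use the handler instead of plain wget/curl
    git annex addurl --handler=youtube http://www.youtube.com/watch?v=example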
Sorry if it is RTFM... If I have multiple original (reachable) remotes, how could I establish my preference for which one to be used in any given location?
usecase: if I clone a repository within an Amazon cloud instance -- I would prefer that this repository (or all repositories, via some user-wide configuration) 'get's its data from URLs originating in the cloud of this zone (e.g. having us-east-1.s3.amazonaws.com/ in their URLs).
This should be implemented with costs.
I refer you to: http://git-annex.branchable.com/design/assistant/blog/day_213__costs/
This has been implemented in the assistant, so if you use that, changing priority should be as simple as changing the order of the remotes in the web interface. Whichever remote is highest on the list is the one your client will fetch from.
Otherwise, you can set `remote.<name>.annex-cost` to appropriate values. See also the documentation for the `remote.<name>.annex-cost-command` setting, which allows your own code to calculate costs.
Thank you -- that is nice!
Could costs be presented in the 'whereis' and 'status' commands? E.g., like we can see APT repository priorities with apt-cache policy -- right now I do not see them (at least in 4.20130501... updating to sid's 0521 now).
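For reference, a minimal sketch of setting costs by hand (the remote names, numbers, and script path are illustrative; a lower cost means the remote is preferred):

    # illustrative names and numbers: prefer the in-zone S3 remote over a distant NAS
    git config remote.cloud-s3.annex-cost 150
    git config remote.nas.annex-cost 250
    # or let a script of your own compute the cost at run time
    git config remote.cloud-s3.annex-cost-command '/usr/local/bin/annex-cost cloud-s3'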
Is there any remote which would not only compress during transfer (I believe rsync does that, right?) but also store objects compressed?
I thought bup would do both -- but it seems that git-annex receives data uncompressed from a bup remote, and a bup remote requires ssh access.
In my case I want to make publicly available files which are binary blobs that compress very well. It would be a pity if I waste storage on my end and also incur significant traffic, both of which could be avoided if the data were transferred compressed. Maybe HTTP compression (http://en.wikipedia.org/wiki/HTTP_compression) could somehow be used efficiently for this purpose (not sure whether the data could already reside in compressed form, to avoid spending server time re-compressing it)?
ha -- apparently it is trivial to configure apache to serve pre-compressed files (e.g. see http://stackoverflow.com/questions/75482/how-can-i-pre-compress-files-with-mod-deflate-in-apache-2-x), and they arrive at the client compressed, with

    Content-Encoding: gzip

but unfortunately git-annex doesn't like those (it fails to "verify" them) -- do you think this could be implemented for web special remotes? That would be really nice -- then I could store such load on another website and addurl links to the compressed content.
All special remotes store files compressed when you enable encryption. Not otherwise, though.
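Since encryption is what triggers compressed storage, a sketch of creating such a remote (the remote name and rsync location are illustrative):

    # with encryption enabled, content is compressed before being stored;
    # the remote name and rsync URL here are only examples
    git annex initremote myrsync type=rsync rsyncurl=example.com:/srv/annex encryption=shared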
As far as the web special remote and pre-compressed files: files are downloaded from the web using wget or (if wget is not available) curl. So if you can make it work with those commands, it should work.
FWIW -- eh -- unfortunately it seems not that transparent. wget seems to not support decompression at all; curl can do it with an explicit --compressed, BUT it doesn't distinguish a URL pointing at a "natively" .gz file from one serving pre-compressed content. And I am not sure it is possible to reliably distinguish the two URLs. In the case of obtaining the pre-compressed file from my sample apache server, the only difference in the HTTP response headers is a "compound" ETag: compare ETag: "3acb0e-17b38-4dd5343744660" (when asking directly for zeros100.gz) vs "3acb0e-17b38-4dd5343744660;4dd5344e1537e" (when requesting zeros100), where the portion past the ";" I guess signals the caching tag for gzipping, but I am not exactly sure about that since it does not seem to be part of the standard. Also for zeros100 I am getting "TCN: choice"... once again, not sure whether that is in any way a reliable indicator for my purpose. So I guess there is no good way ATM via a Content-Type request.
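For what it's worth, a small sketch of the kind of header comparison described above (the host and file names are placeholders standing in for my test server):

    # HEAD requests against a test server (host and file names are placeholders);
    # compare the ETag, Content-Encoding and TCN headers of the two responses
    curl -sI http://example.com/zeros100.gz | grep -iE 'etag|content-encoding|tcn'
    curl -sI -H 'Accept-Encoding: gzip' http://example.com/zeros100 | grep -iE 'etag|content-encoding|tcn'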