The Internet Archive allows members to upload collections using an Amazon S3 compatible API, and this can be used with git-annex's S3 support.

So, you can locally archive things with git-annex, define remotes that correspond to "items" at the Internet Archive, and use git-annex to upload your files to there. Of course, your use of the Internet Archive must comply with their terms of service.

A nice added feature is that whenever git-annex sends a file to the Internet Archive, it records its url, the same as if you'd run git annex addurl. So any users who can clone your repository can download the files from, without needing any login or password info. The url to the content in the Internet Archive is also displayed by git annex whereis. This makes the Internet Archive a nice way to publish the large files associated with a public git repository.

webapp setup

Just go to "Add Another Repository", pick "Internet Archive", and you're on your way.

basic setup

Sign up for an account, and get your access keys here:

# export AWS_ACCESS_KEY_ID=blahblah
# export AWS_SECRET_ACCESS_KEY=xxxxxxx

Specify when doing initremote to set up a remote at the Archive. This will enable a special Internet Archive mode: Encryption is not allowed; you are required to specify a bucket name rather than having git-annex pick a random one; and you can optionally specify x-archive-meta* headers to add metadata as explained in their documentation.

# git annex initremote archive-panama type=S3 \ bucket=panama-canal-lock-blueprints \
    x-archive-meta-mediatype=texts x-archive-meta-language=eng \
            x-archive-meta-collection=test_collection \
    x-archive-meta-title="original Panama Canal lock design blueprints"
initremote archive-panama (Internet Archive mode) ok
# git annex describe archive-panama "a man, a plan, a canal: panama"
describe archive-panama ok

The above uploads to the test collection where items are removed after thirty days. Uploads can persist by changing to another writable collection.

Then you can annex files and copy them to the remote as usual:

# git annex add photo1.jpeg --backend=SHA256E
add photo1.jpeg (checksum...) ok
# git annex copy photo1.jpeg --fast --to archive-panama
copy (to archive-panama...) ok

update lag

It may take a while for to make files publically visible after they've been uploaded.

While files can be removed from the Internet Archive, derived versions of some files may continued to be stored there for a while after the originals were removed.

exporting trees

By default, files stored in the Internet Archive will show up there named by their git-annex key, not the original filename. If the filenames are important, you can run git annex initremote with an additional parameter "exporttree=yes", and then use git-annex-export to publish a tree of files to the Internet Archive.

Note that the Internet Archive may not support certian characters in filenames (see FAQ). If exporting a filename fails due to such limitations, you would need to rename it in your git annex repository in order to export it.