Adding a URL from wayback machine

Is there a nice way to add a URL from the Wayback machine? I don't want to add HTML documents but rather PDFs etc. The problem is that web.archive.org doesn't send a Content-Length header:

$ curl -I http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg
HTTP/1.1 200 OK
Server: Tengine/2.1.0
Date: Fri, 17 Nov 2017 12:09:35 GMT
Content-Type: audio/ogg
Connection: keep-alive
X-Archive-Orig-content-length: 29784324
X-Archive-Orig-accept-ranges: bytes
X-Archive-Orig-server: Apache/2.4.29 (Debian)
X-Archive-Orig-last-modified: Wed, 11 Feb 2015 01:31:47 GMT
X-Archive-Orig-connection: close
X-Archive-Orig-etag: "1c67904-50ec5f783257c"
X-Archive-Orig-date: Fri, 17 Nov 2017 12:08:49 GMT
Cache-Control: max-age=1800
X-Archive-Guessed-Content-Type: audio/ogg
Memento-Datetime: Fri, 17 Nov 2017 12:08:47 GMT
Link: <https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="original", <http://web.archive.org/web/timemap/link/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="timemap"; type="application/link-format", <http://web.archive.org/web/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="timegate", <http://web.archive.org/web/20140426045733/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="first memento"; datetime="Sat, 26 Apr 2014 04:57:33 GMT", <http://web.archive.org/web/20170822174608/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="prev memento"; datetime="Tue, 22 Aug 2017 17:46:08 GMT", <http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="memento"; datetime="Fri, 17 Nov 2017 12:08:47 GMT", <http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg>; rel="last memento"; datetime="Fri, 17 Nov 2017 12:08:47 GMT"
Content-Security-Policy: default-src 'self' 'unsafe-eval' 'unsafe-inline' data: archive.org web.archive.org analytics.archive.org
X-App-Server: wwwb-app39
X-ts: ----
X-Archive-Playback: 0
X-location: All
X-Page-Cache: MISS

Therefore, I'd have to pass the --relaxed flag when calling addurl. Is there maybe a way to tell git-annex to download the file before checking its size? Or is there a way to use the Wayback machine via the S3 special remote?

Eventually, I'd like to have a script that automatically saves all web content in my repo into the Wayback machine and adds the new URLs for redundancy.

This is what I've got so far:

#! /usr/bin/env python3

import json
from subprocess import check_output, Popen, PIPE
from sys import stdout
from urllib import request

# UUID that identifies the web remote in the whereis output.
WEB_UUID = '00000000-0000-0000-0000-000000000001'

# List every annexed file that has a copy on the web.
output = check_output(['git-annex', 'find', '--in', 'web', '--json'])
files = [json.loads(line)['file'] for line in output.split(b'\n')[:-1]]

# Ask git-annex where each file lives; the web entry carries its URLs.
p = Popen(['git-annex', 'whereis', '--batch', '--json'], stdin=PIPE, stdout=PIPE)
p.stdin.writelines(str.encode(f + '\n') for f in files)
out, err = p.communicate()

urls = []
for line in out.split(b'\n')[:-1]:
    for loc in json.loads(line)['untrusted']:
        if loc['uuid'] == WEB_UUID:
            url = loc['urls'][0]
            # Ask the Wayback Machine to save the URL and keep the URL
            # we end up at after redirects.
            print('GET https://web.archive.org/save/' + url)
            r = request.urlopen('https://web.archive.org/save/' + url)
            assert r.getcode() == 200
            urls.append(r.geturl())
            break
assert len(files) == len(urls)

# Feed one "url file" pair per line to addurl; each line needs its newline.
p = Popen(['git-annex', 'addurl', '--batch', '--with-files', '--relaxed'], stdin=PIPE, stdout=stdout)
p.stdin.writelines(str.encode(u + ' ' + f + '\n') for u, f in zip(urls, files))
p.communicate()


comment 1

Lack of a Content-Length header does not prevent git annex addurl from working as far as I can see. Perhaps you should show what the problem you're experiencing looks like.

Comment by joey — Tue Dec 5 17:37:21 2017


comment 2

I'm adding a file and later want to add a URL to it, where the server for that URL doesn't send a Content-Length header.

It is essentially the following:

mkdir test
cd test
git init
git annex init
wget http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg
git annex add
git annex addurl "http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg" --file git-annex_views_demo.ogg

and the last command gives me the result

addurl git-annex_views_demo.ogg 
  while adding a new url to an already annexed file, url does not have expected file size (use --relaxed to bypass this check) http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg
failed
git-annex: addurl: 1 failed

Comment by robert.schuetz — Sat Dec 9 11:50:09 2017


comment 3

Ah, gotcha.

When adding a URL to an already annexed file, git-annex doesn't download the content again, because that could be a lot of work. Instead it checks whether the size is the same, which is the check git-annex always uses to see if the web still seems to have the content of a file.

It doesn't feel safe to relax the size check when the web server doesn't send a Content-Length, because then there's no indication at all that this url really has the same content as the file. It might make sense for git-annex to download all the content again in that case, and check if it's the expected content and only then add it. However, that could be a whole lot of unwanted work to do.
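
To make the cases above concrete, here is a rough Python sketch of a size comparison a script could run before deciding on --relaxed. It is not git-annex's implementation. The function names are made up for illustration, and treating the X-Archive-Orig-content-length header from the curl output as a size hint is purely an assumption of this sketch: it only says what the original server reported when the snapshot was taken.

```python
from typing import Mapping, Optional

# Header seen in the curl output of the original post. Using it as a size
# hint is an assumption of this sketch, not something git-annex does.
WAYBACK_ORIG_LENGTH = "x-archive-orig-content-length"


def header_size(headers: Mapping[str, str], name: str) -> Optional[int]:
    """Return the integer value of header `name`, or None if absent or malformed."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {key.lower(): value.strip() for key, value in headers.items()}
    value = lowered.get(name.lower(), "")
    return int(value) if value.isdecimal() else None


def size_check(headers: Mapping[str, str], expected_size: int) -> str:
    """Classify a response as 'match', 'mismatch', 'hint-match' or 'unknown'."""
    size = header_size(headers, "content-length")
    if size is not None:
        return "match" if size == expected_size else "mismatch"
    # No Content-Length: fall back to the archived original's length. This is
    # only a hint, because the archive could serve different content.
    hint = header_size(headers, WAYBACK_ORIG_LENGTH)
    if hint is not None and hint == expected_size:
        return "hint-match"
    return "unknown"
```

With the headers from the original post and an expected size of 29784324, this returns 'hint-match' rather than 'match', which is a reminder that nothing has actually been verified and that a full download is still the only real check.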

If the user is sure that the url has the same content, it does make sense for them to add --relaxed. So, perhaps what's needed is a --strict for the users who are not sure and want to force a full download to check.

Here's how to check it yourself, without any changes to git-annex.

# Download a fresh copy, then add the URL only if it hashes to the same key.
wget -O tmp.git-annex_views_demo.ogg "http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg"
if [ "$(git annex calckey tmp.git-annex_views_demo.ogg)" = "$(git annex find --format '${key}' git-annex_views_demo.ogg)" ]; then
    git annex addurl "http://web.archive.org/web/20171117120847/https://downloads.kitenet.net/videos/git-annex/git-annex_views_demo.ogg" --file git-annex_views_demo.ogg --relaxed
fi
rm tmp.git-annex_views_demo.ogg
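
For a script like the one in the original post, the same check could be wrapped in a small Python function. This is only a sketch under some assumptions: add_verified_url, run and the runner parameter are hypothetical names, the caller is expected to have downloaded the copy already, and the copy should keep the file's extension (as the tmp. prefix does above), because keys from the E backends include it. The git annex invocations are the same three used in the shell version.

```python
import subprocess
from typing import Callable, List

# A runner takes a command line and returns its stdout as text.
Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout with surrounding whitespace removed."""
    return subprocess.check_output(cmd, text=True).strip()


def add_verified_url(url: str, annexed_file: str, downloaded_copy: str,
                     runner: Runner = run) -> bool:
    """Add `url` to `annexed_file` only if `downloaded_copy` has the same key."""
    # Key the downloaded copy would get if it were added to the annex.
    downloaded_key = runner(["git", "annex", "calckey", downloaded_copy])
    # Key of the file that is already annexed.
    annexed_key = runner(["git", "annex", "find", "--format", "${key}", annexed_file])
    # An empty key means one of the lookups found nothing, so refuse to add.
    if not downloaded_key or downloaded_key != annexed_key:
        return False
    # Content is verified, so bypassing the size check is safe here.
    runner(["git", "annex", "addurl", url, "--file", annexed_file, "--relaxed"])
    return True
```

Passing the runner as a parameter is just a convenience so the logic can be tried without a repository; in real use the default runs git annex directly and raises if a command fails.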

Comment by joey — Mon Dec 11 17:56:53 2017

