NAME
git-annex addurl - add urls to annex
SYNOPSIS
git annex addurl [url ...]
DESCRIPTION
Downloads each url to its own file, which is added to the annex.
When yt-dlp is installed, it can be used to check for a video embedded in a web page at the url, and that is added to the annex instead.
(However, this is disabled by default as it can be a security risk.
See the documentation of annex.security.allowed-ip-addresses
in git-annex(1) for details.)
Special remotes can add other special handling of particular urls. For example, the bittorrent special remote makes urls to torrent files (including magnet links) download the content of the torrent, using aria2c.
Normally the filename is based on the full url, so will look like "www.example.com_dir_subdir_bigfile". In some cases, addurl is able to come up with a better filename based on other information. Options can also be used to get better filenames.
OPTIONS
--fast
Avoid immediately downloading the url. The url is still checked (via HEAD) to verify that it exists, and to get its size if possible.
--relaxed
Don't immediately download the url, and avoid storing the size of the url's content. This makes git-annex accept whatever content is there at a future point.
This is the fastest option, but it still has to access the network to check if the url contains embedded media. When adding large numbers of urls, using --relaxed --raw is much faster.
--verifiable
-V
This can be used with the --fast or --relaxed option. It improves the safety of the resulting annexed file, by letting its content be verified with a checksum when it is transferred between git-annex repositories, as well as by things like git-annex fsck.
When used with --relaxed, content from the web will always be accepted, even if it has changed, and the checksum is recorded for later verification.
When used with --fast, the checksum is recorded the first time the content is downloaded from the web. Once a checksum has been recorded, subsequent downloads from the web must have the same checksum.
When addurl was used without this option before, the file it added can be converted to be verifiable by migrating it to the VURL backend. For example:
git-annex migrate foo --backend=VURL
--raw
Prevent special handling of urls by yt-dlp, and by bittorrent and other special remotes. This will, for example, make addurl download the .torrent file and not the contents it points to.
--no-raw
Require content pointed to by the url to be downloaded using yt-dlp or a special remote, rather than the raw content of the url. If that cannot be done, the add will fail.
--raw-except=remote
Prevent special handling of urls by all special remotes except for the specified one. To allow special handling only by yt-dlp, use --raw-except=web.
--file=name
Use with a filename that does not yet exist to add a new file with the specified name and the content downloaded from the url.
If the file already exists, addurl will record that it can be downloaded from the specified url(s).
--preserve-filename
When the web server (or torrent, etc) provides a filename, use it as-is, avoiding sanitizing unusual characters, or truncating it to length, or any other modifications.
git-annex will still check the filename for safety, and if the filename has a security problem such as path traversal or a control character, it will refuse to add it.
--pathdepth=N
Rather than basing the filename on the whole url, this causes a path to be constructed, starting at the specified depth within the path of the url.
For example, adding the url http://www.example.com/dir/subdir/bigfile with --pathdepth=1 will use "dir/subdir/bigfile", while --pathdepth=3 will use "bigfile".
It can also be negative; --pathdepth=-2 will use the last two parts of the url.
--prefix=foo
--suffix=bar
Use to adjust the filenames that are created by addurl. For example, --suffix=.mp3 can be used to add an extension to the file.
--no-check-gitignore
By default, gitignores are honored and it will refuse to download an url to a file that would be ignored. This makes such files be added despite any ignores.
--jobs=N
-JN
Enables parallel downloads when multiple urls are being added. For example:
-J4
Setting this to "cpus" will run one job per CPU core.
--batch
Enables batch mode, in which lines containing urls to add are read from stdin.
-z
Makes the --batch input be delimited by nulls instead of the usual newlines.
--with-files
When batch mode is enabled, makes it parse lines of the form: "$url $file"
That adds the specified url to the specified file, downloading its content if the file does not yet exist; the same as
git annex addurl $url --file $file
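The "$url $file" batch format can be generated by a script. The following is a minimal sketch, not part of git-annex: the helper names batch_input and addurl_batch are made up for illustration, and it assumes that urls contain no spaces (properly encoded urls do not) and that -z changes only the record terminator.

```python
import subprocess


def batch_input(pairs, nul_delimited=False):
    """Format (url, file) pairs as "$url $file" records.

    Records end with a newline, or with a NUL byte when nul_delimited
    is set (matching the -z option).
    """
    terminator = "\0" if nul_delimited else "\n"
    return "".join(f"{url} {file}{terminator}" for url, file in pairs)


def addurl_batch(pairs, nul_delimited=False):
    """Feed the records to `git annex addurl --batch --with-files`.

    Must be run inside a git-annex repository; raises
    CalledProcessError if git-annex exits nonzero.
    """
    cmd = ["git", "annex", "addurl", "--batch", "--with-files"]
    if nul_delimited:
        cmd.append("-z")
    return subprocess.run(
        cmd,
        input=batch_input(pairs, nul_delimited),
        text=True,
        check=True,
    )
```

Calling addurl_batch([("http://www.example.com/dir/subdir/bigfile", "bigfile")]) would then behave like the single-url --file form above. Passing nul_delimited=True is mainly useful when filenames may contain newlines.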
--json
Enable JSON output. This is intended to be parsed by programs that use git-annex. Each line of output is a JSON object.
--json-progress
Include progress objects in JSON output.
--json-error-messages
Messages that would normally be output to standard error are included in the JSON instead.
--backend
Specifies which key-value backend to use.
Also the git-annex-common-options(1) can be used.
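The --pathdepth examples given above can be reproduced with a small model. This is only a sketch inferred from those examples, not git-annex's actual code: the function name pathdepth_filename is hypothetical, the hostname is assumed to count as the first path component, and the sanitizing, truncation and query-string handling that git-annex applies are left out.

```python
from urllib.parse import urlsplit


def pathdepth_filename(url, depth):
    """Model the filename chosen for a url with --pathdepth=depth."""
    if depth == 0:
        raise ValueError("depth must be nonzero")
    parts = urlsplit(url)
    # Assumption: the hostname is component 0, followed by path components.
    bits = [b for b in (parts.netloc + parts.path).split("/") if b]
    if depth > 0:
        # Drop the leading components. Keeping everything when depth is
        # too large is an assumption, not documented behaviour.
        kept = bits if depth >= len(bits) else bits[depth:]
    else:
        # Negative depth keeps only the last -depth components.
        kept = bits[depth:]
    return "/".join(kept)


url = "http://www.example.com/dir/subdir/bigfile"
print(pathdepth_filename(url, 1))   # dir/subdir/bigfile
print(pathdepth_filename(url, 3))   # bigfile
print(pathdepth_filename(url, -2))  # subdir/bigfile
```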
CAVEATS
If annex.largefiles is configured, and does not match a file, git annex addurl will add the non-large file directly to the git repository, instead of to the annex. However, this is not done when --fast or --relaxed is used.
SEE ALSO
git-annex(1)
AUTHOR
Joey Hess id@joeyh.name
Warning: Automatically converted into a man page by mdwn2man. Edit with care.
I have been trying to figure out how to use addurl to get this video. I have this in my mscourtstuff annex as a large binary, but I would really like to use the web as a remote for this.
Hughes v Hosemann 2010-CA-01949-SCT-43112001.mp4 youtube-dl --referer 'http://judicial.mc.edu/case.php?id=24206' http://player.vimeo.com/video/43112001
There's not currently a way to do per-file youtube-dl options. The difficulty is that we don't know which youtube-dl options might be unsafe, and such a feature could make e.g. git annex get use them when run by a different user.
I feel that this needs some support in youtube-dl to avoid git-annex needing to know about all its safe options. Especially since which options are available, or safe, could vary between versions of youtube-dl.
In using git-annex in the past, I've always found it counterintuitive that rmurl uses the following form to remove a URL from a file:
While, in contrast, addurl uses a flag to designate the file whose list of URLs the URL should be added to:
It would make sense (at least to me) to make the syntax for these more congruous so that both commands use either two positional arguments or one positional argument and one keyword argument / flag.
@john, the difference is that while addurl can make up a filename to use if you do not provide one, rmurl needs you to specify a filename.
So, yes, "git annex rmurl --file=whatever url" would be more consistent, but it requires typing more by making something that is not actually optional into an option. And "git annex addurl file url" would make the command more consistent with rmurl, but harder to use.
Consistency is not everything.
(Also, the rmurl batch interface would then be less consistent with its command-line interface.)
@gan, there's not much point in providing flags that are only used in the initial download; the main point in adding the url to git-annex is so you can download the same content from it again later.
Hi @joey,
Thanks for your continued work on git-annex and for responding to my last comment. I agree that consistency is not everything, but I do think that it's also important to balance functionality against the amount of cognitive load that an interface places on an end-user. My perspective is no doubt influenced by my cognitive science background, but whenever I find an interface where it's easy to confuse two similar operations that use different syntaxes, I'm reminded of Don Norman's rant about the early Unix UI. I'm also reminded of common phenomena described in memory research such as retroactive interference, wherein a more recently learned memory interferes with something that was learned previously. In this case, if I were to learn the syntax for addurl first, and then learned the syntax for rmurl much later, my internal representation of rmurl would to some degree "overwrite" my previous knowledge of addurl and compete with it. Making the two syntaxes consistent with each other in this case would eliminate any competition between internal mental representations of how the two commands are structured.
I'm also not entirely sure why a positional argument can't be optional. If there's a good reason for this not to be so then I won't argue my point anymore, but something like the following syntax would make the most sense from my view:
git annex rmurl [url] [file]
git annex addurl [url] [file; optional positional argument]
@m15 this page is not a bug tracking system. File bug reports over at bugs.
If you add a file to your repo first via addurl --fast, it writes the filename as a symlink to a file that incorporates the URL, rather than the file hash. This is expected, since git-annex can't know the file hash until it's actually downloaded the file.
If you then git annex get that file, it downloads the file to the path that uses the URL. Is the hash ever recorded for these files? If you were to drop and re-download the file, would git-annex accept a different file?
Hash is not recorded, but file size is. You can disable the size check with --relaxed. See using the web as a special remote. After getting the file with git-annex-get, you can use git-annex-migrate to record it under a new checksum-based hash, then use git-annex-unused to find and remove the old key.
Sometimes you can get the hash without downloading the file, e.g. if the hash is stored next to the file at http://my/file.md5, or if the file is stored in the Google Cloud. Then you can use the plumbing commands git-annex-registerurl to associate the checksum-based key with the URL, and git-annex-setpresentkey to record the key's presence in the (web) remote.
Related discussion: alternate keys for same content
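The registerurl and setpresentkey steps mentioned above might be scripted roughly as follows. This is a sketch under stated assumptions rather than a tested recipe: the helper names are made up, the key layout "MD5-s<size>--<hash>" and the web remote's UUID reflect my understanding of git-annex and should be checked against a real repository, and the MD5 hash and size are assumed to have been obtained already (for instance from http://my/file.md5 and a HEAD request).

```python
# Assumption: the UUID git-annex uses for the built-in web special remote.
WEB_UUID = "00000000-0000-0000-0000-000000000001"


def md5_key(md5_hex, size):
    """Build an MD5-backend key string (assumed layout, see lead-in)."""
    return f"MD5-s{size}--{md5_hex}"


def plumbing_commands(md5_hex, size, url):
    """Return the two command lines as argv lists, without running them."""
    key = md5_key(md5_hex, size)
    return [
        # Associate the checksum-based key with the url.
        ["git", "annex", "registerurl", key, url],
        # Record that the key is present ("1") in the web remote.
        ["git", "annex", "setpresentkey", key, WEB_UUID, "1"],
    ]
```

The argv lists can be passed to subprocess.run inside the repository once they have been checked by hand. No file in the working tree points at the key yet after these two steps.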
scheme:<arbitrary json>, but I am not sure if this might become an issue later. I could also encode the data with base64 or something similar, in which case size limitations would still be relevant; if there are any. Although, the json variant has the added benefit of being much more easily readable in whereis output.
@matthias.risze length is not an issue. You should avoid characters that are not usually in urls, particularly whitespace and newline.
It seems to me though that your special remote would perhaps be better served by using the SETSTATE and GETSTATE commands (see external special remote protocol)
Turning on securehashesonly seems to disable the addurl command:
Does this have something to do with the URL prefix that the annex object has?