Please describe the problem.
Original complaints could be found mentioned in the comments of the importfeed page: when using addurl
, and even when the server provides Content-Disposition field with the filename, git-annex seems (BTW -- no Content-Disposition was mentioned in the --debug output) to take that filename value and obfuscates it (replaces '-' with '_' etc) to what supposed to be the original filename.
$> mkdir /tmp/testrepo; cd /tmp/testrepo; git init; git annex init;
mkdir: cannot create directory ‘/tmp/testrepo’: File exists
E: could not determine git repository root
Initialized empty Git repository in /tmp/testrepo/.git/
init ok
(recording state in git...)
$> git annex addurl --fast https://girder.dandiarchive.org/api/v1/item/5e9f9588b5c9745bad9f58ff/download
addurl https://girder.dandiarchive.org/api/v1/item/5e9f9588b5c9745bad9f58ff/download (to sub_mouse_AAYYT_ses_20180420_sample_2_slice_20180420_slice_2_cell_20180420_sample_2.nwb) ok
(recording state in git...)
$> ls -l
total 4
lrwxrwxrwx 1 yoh yoh 184 May 7 17:02 sub_mouse_AAYYT_ses_20180420_sample_2_slice_20180420_slice_2_cell_20180420_sample_2.nwb -> .git/annex/objects/Gj/9z/URL-s9335000--https&c%%girder.dandiarchive.org-48163bc503cb7181516be86ef215f923/URL-s9335000--https&c%%girder.dandiarchive.org-48163bc503cb7181516be86ef215f923
]
whenever original content-disposition was having "-" in the filename, which are perfectly safe the filename AFAIK:
$> wget -S https://girder.dandiarchive.org/api/v1/item/5e9f9588b5c9745bad9f58ff/download
... bunch of forwards to the final one with the content disposition field
Resolving dandiarchive.s3.amazonaws.com (dandiarchive.s3.amazonaws.com)... 52.219.101.51
Connecting to dandiarchive.s3.amazonaws.com (dandiarchive.s3.amazonaws.com)|52.219.101.51|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
x-amz-id-2: VgJE1jV5XUkBQXZDWgR5WEDfmHJp4Fj6fGo6z2tYkLfyTsxDWC+m92B2qOSVppCuiRFu2QpNV5M=
x-amz-request-id: 1221CAC30E3931CF
Date: Thu, 07 May 2020 21:02:52 GMT
Last-Modified: Wed, 22 Apr 2020 00:54:32 GMT
ETag: "acf3b4f5951435245a0efcd4a518e77d"
Content-Disposition: attachment; filename="sub-mouse-AAYYT_ses-20180420-sample-2_slice-20180420-slice-2_cell-20180420-sample-2.nwb"
...
$> git annex version
git-annex version: 7.20190708+git9-gfa3524b95-1~ndall+1
This is due to the filename being passed through sanitizeFilePath.
There are security concerns here. If the filename contains "../" it absolutely has to be modified, or the command would have to fail and refuse the import it.
If the filename contains an ANSI escape sequence, it could potentially lead to a security hole. Or if the filename starts with "-" it could be somewhere between a possible security hole and just very annoying to work with. As could a filename that contains a newline, which will break large quantities of shell pipelines. While generally git repos can have these problems with files in them too, the exposure seems larger when talking to some random web server than when pulling from a repo.
Also, cross filesystem compatibility is a concern. It used to allow "|" in the filename, but a bug pointed out that cannot be used on fat filesystems. And "\" means different things on linux and windows, so probably best to avoid filenames containing it on linux too.
Finally, it's somewhat opinionated, since it replaces spaces with underscores. That's certainly the least defensible thing.
(git-annex may also truncate the filename if it's longer than what the filesystem supports.)
So, it's clearly wrong that it should be taken as-is without obfuscation, IMHO. Maybe there's a way to improve it to meet some use case though.
I could see having a config that avoids sanitizing the filename, but makes addurl fail if the filename looks like a security problem.
Though that has the downside that git-annex would then need to comprehensively track, going forward, all the ways that people find to make filenames be a security problem; the current method, by being strict in what it lets through, probably limits expoits to ones involving a) unicode or b) the user's wetware.
git-annex import
does not do any sanitization, and that could be considered inconsistent, particularly when importing from a remote like S3.A difference with that is, it creates a remote tracking branch for the imported files. (That happens to avoid "../" path traversal because git generally avoids it.) Maybe the real difference is, import from a special remote is completely analagous to fetching from a git remote. So it feels different to me than adding an url does.
If I sync with a S3 bucket and it turns out it imported a escape sequence file, well I could have looked at the bucket first, or imported and reviewed the branch before merging it. And if I was syncing with a git remote the same thing could happen. So it feels like I should have no expectation git-annex would protect me. Whereis, if I add an url and the web server uses an obscure-ish http header to surprise me with a similar malicious filename, I had no way before hand to know that would happen, and so it does feel like git-annex should protect me.
(Although if git did prevent that, git-annex should too, and I'd be fine with git preventing that.)
Implemented git-annex addurl --preserve-filename, which will do what you want.
Leaving this bug open because I only implemented it for web urls, not yet for torrents and other special remotes that have their own url scheme. The sanitization for those is currently done at a lower level than addurl, and so that will take a bit more work to implement.
(importfeed does not, I think, need to implement this option, because the filenames are based on information from the rss feed, and it's perfectly fine to sanitize eg a podcast episode title to get a reasonable filename.)
IMHO those indeed are ok to target for sanitization
So why not to sanitize it only at the beginning of the filename?
-
is a very common and a safe character to use within filename. For that matter we VERY frequently use-
in filenames. It even became part of our BIDS standard in neuroimaging: https://bids-specification.readthedocs.io where we separate_key
fromvalue
, e.g.in ` . I really do not see why git-annex should so aggressively sanitize filenames as replacing "-" within filenames -- it makes nothing more secure or convenient.Well, not sure about ansi characters and new line symbols, but typically files are saved by the browsers with the name suggested by the server.
I agree that it may as well allow non-leading '-'. But, if you are relying on getting the unsanitized filename generally, you should use --preserve-filename
Web browsers do do some santization, particulary of '/'. Chrome removes leading "." as well. Often files are downloaded without the user confirming it. I suspect there is enough insecurity in that area that someone could make a living injecting bitcoin miners into dotfiles.