Please describe the problem.
When using git-annex-import to import from a directory remote, if the directory tree has directory symlinks pointing to directories outside the directory tree of the directory remote, the targets of these symlinks get imported. This can lead to importing much more than was intended; such symlinks should probably get imported as symlinks by default, with a command-line option to import their targets. There might even be a security issue with unexpectedly importing and sharing content outside the explicitly specified directory tree.
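For anyone who wants to check a tree before importing it, here is a minimal Python sketch that lists symlinks whose targets resolve outside the tree. It is not part of git-annex; the function name `escaping_symlinks` is made up for illustration, and it only reflects the state of the tree at the moment it runs.

```python
import os


def escaping_symlinks(root):
    """Yield (link_path, resolved_target) for symlinks under root
    whose resolved target lies outside root.

    Hypothetical helper for illustration; not a git-annex API.
    """
    root_real = os.path.realpath(root)
    # followlinks=False: symlinked directories are listed but not descended into.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            # The target is inside the tree only if root is a path prefix of it.
            if os.path.commonpath([root_real, target]) != root_real:
                yield path, target
```

Running it over the directory to be imported and printing the pairs would show what an import that follows directory symlinks could pull in.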
What version of git-annex are you using? On what operating system?
# If you can, paste a complete transcript of the problem occurring here.
# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log
$ git annex version
git-annex version: 10.20220322-g7b64dea
build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV
dependency versions: aws-0.22 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.0 ghc-8.10.7 http-client-0.7.9 persistent-sqlite-2.13.0.3 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.1.2
key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL X*
remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg hook external
operating system: linux x86_64
supported repository versions: 8 9 10
upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10
local repository version: 8
$ uname -a
Linux ctchpcpx163.merck.com 3.10.0-1160.6.1.el7.x86_64 #1 SMP Tue Nov 17 13:59:11 UTC 2020 x86_64 GNU/Linux
# End of transcript or log.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
That I use it enough to run into corner-case issues shows its continued usefulness
git-annex import $dir also follows symlinks inside $dir. So importing has been behaving this way since long before the directory special remote supported importtree.
This is not a security hole, because if an attacker wants to make you import /foo when importing /bar, and they have write access to bar, they are not limited to making a /bar/foo -> /foo symlink. They can just cp -a /foo /bar instead.
I don't really think it would make much sense for any import to import symlinks as symlinks. If the symlink points outside the imported directory, that would result in a symlink that points outside the git repository, which is not something one often wants to check into a git repository.
I don't know if I would really consider this a bug either. It at least seems plausible that there might be users who import from ~/disk which is a symlink to /media/somethinglong, and rely on it following the symlink. I often make symlink aliases for mount points like that, though I have not imported from them.
I meant "security hole" more in the sense of the user themselves inadvertently importing (and then sharing) more than they meant to. E.g. I was importing a large subtree created by others, and had no clue it included symlinks outside the subtree; I only noticed this by accident, when the import started taking too long.
In conjunction with the mv semantics, this seems risky... had I been using the original directory import, I'd have inadvertently deleted a large dataset (to which there was a symlink in the imported tree) from a place others expect to find it. The unix mv command doesn't even have a flag to follow symlinks (never mind defaulting to that).
It's certainly plausible; the question is, should it be the default? It's not the default in cp (you have to use -L) or in tar (you have to use -h). I think most people assume that importing from a directory remote is akin to doing a cp -r or tar cf from it.
One other scenario where the result might not be what users expect is the following: if a directory special remote is configured with both importtree=yes and exporttree=yes, and the directory contains symlinks pointing outside the tree, then an import followed by an export will replace the symlink in the original directory with a copy of the content.
Such a symlink is already not something one often has. But if one does, then the repo is likely for one's own usage, or for the usage by people with access to the shared filesystem where the link works, so adding the link to git as-is makes sense. Logically, it's likely that the out-of-tree link target represents some separate tree of files that you don't think of as part of the tree (or you'd have put them under the tree); if you did want to import them, you'd make a separate repo for them and import them as a submodule.
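The tar comparison above can be tried out with Python's tarfile module, whose dereference option plays the role of tar -h. This is only a sketch of the two behaviours being contrasted, not of anything git-annex does; `archive_tree` and `member_types` are illustrative names.

```python
import os
import tarfile


def archive_tree(src_dir, tar_path, follow_symlinks=False):
    """Archive src_dir into tar_path (which should live outside src_dir).

    follow_symlinks=False stores symlinks as links (tar's default);
    follow_symlinks=True stores the targets' content instead (like tar -h).
    """
    arcname = os.path.basename(os.path.normpath(src_dir))
    with tarfile.open(tar_path, "w", dereference=follow_symlinks) as tar:
        tar.add(src_dir, arcname=arcname)


def member_types(tar_path):
    """Map each archive member name to 'symlink', 'dir', or 'file'."""
    with tarfile.open(tar_path) as tar:
        return {
            m.name: "symlink" if m.issym() else "dir" if m.isdir() else "file"
            for m in tar.getmembers()
        }
```

With the default, a directory symlink pointing out of the tree is stored as a single link entry; with follow_symlinks=True, everything under its target ends up in the archive.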
Also, what happens if the target tree of the out-of-tree link has a symlink back to the original tree -- could this cause infinite recursion?
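I don't know how git-annex handles that case. For illustration only, one common guard in a traversal that follows directory symlinks is to remember each directory's (device, inode) pair and never enter the same directory twice. The sketch below assumes a POSIX filesystem, and `walk_following_links` is a made-up name.

```python
import os


def walk_following_links(root):
    """Yield paths of files under root, following directory symlinks
    but entering each real directory at most once."""
    seen = set()      # (st_dev, st_ino) of directories already entered
    stack = [root]
    while stack:
        directory = stack.pop()
        st = os.stat(directory)  # follows symlinks, so aliases share a key
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue  # reached again via a symlink loop or a second alias
        seen.add(key)
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=True):
                    yield entry.path
```

Without the seen set, a tree whose out-of-tree link target links back into the tree would keep producing ever-longer paths until the OS refuses to resolve them.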
The git-annex-import man page says the command imports "a tree of files". It seems simplest if this description was always strictly true, regardless of what's in the tree. But if you decide to keep the current default, maybe clarify the web page?
Thanks again for all your work.
Hey, I just got bitten by this. Is there an option to not follow symlinks? If not (and I couldn't find one), it would be nice to add that. I have no strong opinion on whether it should be the default or not.
Maybe it is stupid and there is a much better way, but here is what I did. I decided to move all my data into git annex. I had a total mess of old hard drives with data, backups or both. Some 10+ years old. So I started to use the directory import with deduplication to import each Hdd into a separate folder of my annex. That worked very well and left me with a much smaller amount of data to organize and empty Hdds ready to wipe+dispose.
Where things went wrong was when I imported an old external Hdd with a backup of a home directory from 2011 on it. Apparently I had wine installed at that time which had created symlinks in .wine to the following directories.
/
/home/USERNAME
Conveniently I use the same username now as I used back then. I noticed that something was wrong because the import took very long and the KDE desktop started to misbehave. Probably git annex deleted some files KDE needed. I don't know why but at first glance there is not much data missing from my home directory. But now there are a lot of files in my annex that I don't want.
I think it is not too painful to sort this out but it would have been easy to avoid. My use case for this will probably go away once all the old Hdds are gone. But scenarios like this might not be that rare for new users coming from a world without numcopies and location tracking.
I have been eyeing git-annex for a couple of years now and finally got around to actually using it. I am very impressed and a huge fan. Thanks for all the work.
@skcin I'm very sorry that happened to you. I suppose it's not data loss, but it sounds like a mess. You should be able to examine git log to find what got imported, and run git-annex unannex on it, and then move it back to the right place.
Seems like I underestimated the chance this would be a foot bomb. I now think that git-annex import and the directory special remote should skip over symlinks. Probably with an informative message to avoid silently doing nothing in cases where users had been using them intentionally to follow symlinks.
Such a check will be race prone, but that is only likely to matter if an attacker is racing it to replace a file with a symlink, and as I discussed in previous comments, such an attacker seems like they would be able to accomplish the same thing with the write permission they must have.
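As a rough Python illustration of the idea (not the actual git-annex implementation, which is written in Haskell), a walker can lstat each entry, skip symlinks with a message, and still follow the top-level directory if it is itself a symlink. The name `files_to_import` and the message wording are assumptions.

```python
import os
import stat
import sys


def files_to_import(root, warn=None):
    """Yield regular files under root, skipping symlinks found inside it."""
    if warn is None:
        warn = lambda msg: print(msg, file=sys.stderr)
    # The directory being imported may itself be a symlink; resolve it once.
    top = os.path.realpath(root)
    for dirpath, dirnames, filenames in os.walk(top, followlinks=False):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                # os.walk lists it but will not descend into it.
                warn(f"skipping symlink to directory: {path}")
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode  # lstat does not follow the link
            if stat.S_ISLNK(mode):
                warn(f"skipping symlink: {path}")
            elif stat.S_ISREG(mode):
                # Racy: the file could be swapped for a symlink after this check.
                yield path
```

The lstat-then-use gap is the race mentioned above; closing it would need something like opening with O_NOFOLLOW, which this sketch does not attempt.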
It should still import them as symlinks into git, just not change them to annexed files. If these are relative symlinks pointing within the tree being imported, they'll still work in the repo. If these are absolute symlinks to common files mounted on a filesystem shared by all users of the repo, you still want them in the imported repo as absolute symlinks. Absolute symlinks that happen to point within the tree being imported should be imported as annexed files.
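To make the proposed rule concrete, here is a small Python sketch of how a symlink could be classified under it. This is only my reading of the proposal, not existing git-annex behaviour; the function name and the two labels are invented for the example.

```python
import os


def classify_symlink(root, link_path):
    """Return 'annex-content' or 'git-symlink' for a symlink under root.

    Hypothetical policy: keep symlinks as git symlinks, except absolute
    symlinks whose target resolves inside the imported tree.
    """
    target = os.readlink(link_path)          # the link text as stored
    root_real = os.path.realpath(root)
    resolved = os.path.realpath(link_path)   # where the link actually leads
    inside = os.path.commonpath([root_real, resolved]) == root_real
    if os.path.isabs(target) and inside:
        # An absolute link into the tree would break once the tree is moved,
        # so import what it points to.
        return "annex-content"
    # Relative links, and absolute links to shared out-of-tree paths, stay links.
    return "git-symlink"
```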
Ok, I've changed git-annex import to skip symbolic links inside the directory being imported. The directory being imported can still itself be a symlink and that will be followed.
Re: new manpage description that says
In my use cases, skipping symlinks (instead of importing them as git symlinks) would make the imported tree unusable by tools that expect specific filenames in specific subdirs of the tree, when these filenames were symlinks in the original tree. Could the symlinks be imported as standard git symlinks instead of skipping them? Worst case, a checkout of the repo will have symlinks to non-existing targets -- this can be fixed by mounting volumes with the right paths. OTOH, having filenames missing from the imported tree because they happened to be symlinks in the original tree can cause all sorts of errors. Import is much simpler to think about if it's guaranteed to replicate the full original tree structure, like tar does.
@Ilya the old git-annex import already skipped symlinks to files. (Along with not importing sockets, device files, etc.) It only followed symlinks to directories.
Then the old import was problematic as well
Sorry for harping on this, but since this affects my use cases (and I suspect others' as well), what's the rationale for not doing what tar does? I assumed that an import is similar to doing a tar or a cp -pr, except that the files get annexed and a connection to the files' origin is recorded.
For things like song collections the symlinks might not matter, but for datasets they do. E.g. symlinks are used to ensure tools that expect certain filenames at certain paths can find them. The tools are often written by other people, so as a user you might not know all details of these expectations. Files from related analyses are symlinked to keep the relation explicit. Absolute symlinks are used to point to large shared resources (e.g. genome files) on a filesystem that all users of the repo share. I can give more detailed illustrations if it would help. But I'm not sure I understand the downside of importing symlinks the way tar does.