Hi,
Thank you very much for this software. I'm working at a research institute and we are very interested in using git-annex with DataLad to manage our datasets.
We aim to provide a dataset repository accessible through the local network on a single file system. Some of our datasets are multiple TB with a few million files. The repository will be managed by a few people, but the primary users, the researchers, will only have read access. We would like to use hardlinks everywhere, both to avoid the occasional reading errors we've seen with symlinks and to save space when we want to offer different versions of a dataset with slight changes. The file system is backed up, so we don't really need multiple copies of the same files on a single file system.
We seem to be able to achieve this using direct mode in git-annex version 5, but it seems that unlocking in version 7 makes copies instead of hardlinks. I'm wondering how we could achieve the same behaviour as in version 5. I believe I've read in the documentation that there's a maximum of 2 hardlinks for a single file, but I can't remember where, or whether that is still the case. If it is, I couldn't find a setting to raise or remove this maximum.
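A quick way to tell whether unlocking gave you a hardlink or a copy is to compare inodes. The sketch below uses plain coreutils (not git-annex) to illustrate the check; after `git annex unlock file`, the same `stat` comparison between the worktree file and the corresponding object under `.git/annex/objects` would show which one you got. File names here are made up for the demo.

```shell
set -e
tmp=$(mktemp -d) && cd "$tmp"

echo data > object
ln object as-hardlink   # hardlink: shares the object's inode
cp object as-copy       # copy: gets a fresh inode

# Same inode => hardlink; different inode => copy.
[ "$(stat -c %i object)" = "$(stat -c %i as-hardlink)" ] && echo "as-hardlink: same inode"
[ "$(stat -c %i object)" != "$(stat -c %i as-copy)" ] && echo "as-copy: different inode"
```

(`stat -c %i` is the GNU coreutils syntax, as found on Ubuntu.)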
We've tested with git-annex local version 5 / build 7.20190819, local version 7 / build 7.20190819, and local version 7 / build 7.20191106. Here is a gist containing test scripts for each setup. The .annex-cache part can be ignored for this topic. I've used Miniconda3-4.3.30 on Ubuntu 18.04.2 LTS to set up the environments.
Thank you,
Satya
I don't have a full answer, but local caching of annexed files might have relevant info.
There is also the annex.thin setting; but check some caveats related to it.

Thanks for your comment. I've looked into local caching of annexed files, and most of it can be found in the scenario described in the test gist.
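For reference, the two settings discussed here are ordinary git config keys, set per repository (a sketch; whether they fit depends on the caveats mentioned above):

```shell
git config annex.thin true      # let unlocked files hardlink to annex objects
git config annex.hardlink true  # prefer hardlinks when getting objects from local repositories
```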
The two settings annex.thin and annex.hardlink are also set in the two git-annex repositories of the test. Thanks for letting me know about the caveats. Based on the tests I've executed, it would seem that git-annex unlock now copies the file to avoid the mentioned issue, as I noticed different inodes? I understand that this prevents unwanted loss of data while using git-annex, but I would actually like a hardlink instead of a copy. I'm wondering if that's possible.

I again don't have a full answer, but maybe you could customize git's post-checkout hook? (You'd need to still call the hook that git-annex installs.)
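A sketch of that hook idea: wrap git's post-checkout hook with a custom step, then chain to the hook git-annex installed. This assumes you first rename git-annex's hook to `post-checkout.annex`; that name and the hardlinking step are my conventions, not anything git-annex provides.

```shell
set -e
# Demo repository so the sketch is self-contained; in practice you'd
# do this in your existing annex repository.
git init -q hookdemo && cd hookdemo
hooks="$(git rev-parse --git-dir)/hooks"

cat > "$hooks/post-checkout" <<'EOF'
#!/bin/sh
# Custom step would go here, e.g. re-creating hardlinks for chosen files.
hooks="$(git rev-parse --git-dir)/hooks"
# Chain to git-annex's original hook (renamed beforehand), if present.
if [ -x "$hooks/post-checkout.annex" ]; then
    exec "$hooks/post-checkout.annex" "$@"
fi
EOF
chmod +x "$hooks/post-checkout"
```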
Also, I've thrown together a FUSE filesystem that fetches git-annexed files on demand; maybe that could be adapted. It only works with locked files and symlinks, though.
What is the reason you don't want to use locked files? You can have different worktrees with symlinks of locked files pointing into the same annex.
Thanks for your help. Yes, I believe the post-checkout hook could do the trick, but I really like your idea of using a FUSE filesystem. Thanks a lot for sharing. I also believe this could be the basis for progressively fetching the content of an indexed archive (like a .zip) as it's needed.
The worktree feature is very interesting, but I'm also using DataLad 0.11.8, which is unfortunately incompatible with it for the moment.
As for my objective of not using locked files: I initially thought that a script from a library I was using to preprocess some data was failing because the files were symlinks, but I couldn't reproduce the problem. Unfortunately, too many factors changed, so I'm just going to assume I was doing something wrong. Still, it would sometimes be useful to work with unlocked files when doing multi-phase (multi-commit) preprocessing of a big file. In that case, each phase would modify the file, trigger a copy by unlocking it, and annex the modified file. I would be interested in skipping the copy to save a significant amount of time and space, since the intermediate states of the file are only temporary. The checksums are still useful for verifying that each phase executed correctly. But that is very specific and will not happen too often, so I'm fine with workarounds.