Scenario
On multiple Windows systems, I have several directories with related, but somewhat different content.
    E:\Dir\V1\...
          \V2\...
          ...
          \Untracked
    E:\V1-Related
    E:\V2-Related
    ...
V1 and V2 each contain dozens of directories with hundreds of files totaling tens of gigabytes. The content is very similar, but may have a few differences. Some, but not all, of the directories in E:\Dir should be tracked.
The V1- and V2-related directories each contain 10-20 thousand subdirectories with 100K-200K files totaling 50-200GB per base directory. Again, there is a lot of overlap between the V*-related directories.
Also, the *-related directory content is related to the V# directories, but cannot be stored in the same place. There may be 5-10GB of overlap between them.
Then there is my main storage system, which runs FreeNAS. I'd like it to retain copies of all content for protection, backup, and redundancy; that way I can drop stuff locally from various computers with much smaller drives and still have it. This system can be accessed via SMB, SSH, or rsync. It should be possible for me to build git-annex for FreeBSD or to run a Linux VM on it.
Questions
Is direct mode supported at all on NTFS? I'd like to avoid the file duplication and copying. I see NTFS gets flagged as a broken file system. (No need to explain; I worked in the guts of Cygwin.)
Is using git-annex and the available tools even realistic with so many files? I've read a lot of the forum and tip posts, but my head is still swimming.
How might I best track V# and V#-related together? Or should I keep them in separate annexes and just script them together?
What is the best way to keep the V#-related directories in the drive root? I've considered adding an E:\annex directory, because a .git directory in the drive root can be problematic. If I do that, how would I associate those directories with the annex? GIT_DIR?
I've noticed that using git-annex over a locally mounted drive letter or a UNC path is slow. I assume the share is being treated as local storage and that multiple passes are made across the network while reading, writing, checksumming, and verifying files; I base this on seeing both uplink and downlink saturation while syncing. What would be the best, most performant way to access it?
Finally, is there any way to tip, donate, or contribute to you?
Direct mode was only supported by old versions of git-annex. It did work on windows. The replacement (adjusted unlocked branches + annex.thin) is better in every way except one: on windows (and some filesystems like FAT), it is not able to avoid storing 2 copies of files, because git-annex isn't able to hard link files there. If there were a reasonable way to do that on windows, it could be a big improvement, but I have not dug into whether windows has anything similar enough to a hard link for git-annex to use.
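A minimal sketch of enabling that replacement in an existing repository (the repository path is hypothetical):

    cd /path/to/repo            # hypothetical repository location
    git annex init
    git config annex.thin true  # let unlocked files be hard links to the annex, where the filesystem supports it
    git annex adjust --unlock   # check out an adjusted branch with all files unlocked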
git-annex scales to hundreds of thousands of files.
If I had two directories like your V# and V#-related, I might make them each into their own repository, and set them each as a git remote of the other. That would let git-annex know that identical files have two copies. (Or, the parent directory could be made into a git-annex repository, which would let git-annex deduplicate identical files, but since that needs symlink support, it won't happen on windows.)
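A rough sketch of that two-repository setup, with hypothetical paths (E:\ shows up as /e/ in Git Bash):

    cd /e/V1-Related
    git init && git annex init "v1-related"
    git annex add .                  # annex the existing files
    cd /e/V2-Related
    git init && git annex init "v2-related"
    git annex add .
    git remote add v1 /e/V1-Related
    git annex sync                   # merge git-annex branches; identical content now counts as two copies
    cd /e/V1-Related
    git remote add v2 /e/V2-Related
    git annex sync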
You can certainly use GIT_DIR with git-annex.
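A sketch of one way to wire that up, assuming a hypothetical E:\annex\v1-related.git git dir and using GIT_WORK_TREE plus core.worktree alongside GIT_DIR:

    # hypothetical layout: work tree at E:\V1-Related, git dir kept under E:\annex
    export GIT_DIR=/e/annex/v1-related.git
    export GIT_WORK_TREE=/e/V1-Related
    git init
    git config core.worktree "$GIT_WORK_TREE"  # persist the association beyond this shell session
    git annex init "v1-related"
    git annex add .                            # no .git directory ever appears in the drive root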
Generally the best thing to do with storage on the other side of a network connection is to run git-annex on it locally, or use it as some kind of special remote.
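For example, the FreeNAS box could be set up as an rsync special remote; the hostname and path below are made up:

    # hypothetical host and path on the FreeNAS system
    git annex initremote freenas type=rsync rsyncurl=user@freenas:/mnt/tank/annex encryption=none
    git annex copy --to freenas    # send content over rsync+ssh
    git annex fsck --from freenas  # optionally verify what the remote holds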
Direct mode doesn't exist anymore; it has been replaced by the annex.thin config (see "using less disk space"). And yes, it works on NTFS.
It definitely isn't going to be fast; the numbers you gave suggest there will be ~1,000,000 files per repository (for the V*-related dirs). Still, you should try it and see if it's fast enough for you. Some tips to improve performance:

- Don't use include=/exclude= in preferred content expressions, and see the "Repositories with large number of files" tips page. My experimental script here might also be worth a try.
- Having the root of the repo at the root of the drive and then excluding everything that shouldn't be tracked via .gitignore can be a viable approach (see the sketch after this list). But with that many files I'd create one repo per directory. It could also be done with git worktrees.
- Don't use git-annex on top of a network share; run it directly on the server instead. git-annex is designed to run on local drives/storage. Also, git-annex on windows is way slower than on linux.
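A sketch of the allowlist-style .gitignore that approach implies, using the directory names from the scenario as hypothetical entries:

    # E:\.gitignore: ignore everything at the top level...
    /*
    # ...then re-include only what should be tracked
    !/.gitignore
    !/Dir/
    !/V1-Related/
    !/V2-Related/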
You can donate via Patreon and Liberapay.