Hi,
I was just setting up my git-annex repository and started to sync all my stuff into it.
Background: I have chosen git-annex to sync all my stuff (pictures, mp3s, documents, etc.) between my PC, notebook and a home server.
My problems: 1) When I start the git-annex daemon, the "Performing startup scan" message stays up for hours. 2) git-annex synchronizes folders from the server which are already on my PC, and it does that every time I restart the daemon on the client.
My questions: For 1) Is git-annex, when running a single repository, suitable for managing > 100 GB and > 50000 files? For 2) Do I have to wait until all tasks are completed (everything is committed) to avoid multiple downloads of the same folders/files? 3) What is the best scheme for syncing between more than 2 devices: should I use a mesh or a star topology (where my server is in the middle)?
Thank You in advance! Regards J
It sounds like you are using git-annex to sync a lot of files. The startup scan is a basic check that there are no new, deleted, or modified files since the last time it ran. This requires a little work, like statting the files, but won't take long for reasonable numbers of files.
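To get a rough feel for how much work such a scan has to do, one can count the entries it would need to stat. This is only a sketch and not what the assistant itself runs: the function name count_annex_entries is made up for illustration, and it simply walks the tree with find, skipping the .git directory.

```shell
# Hypothetical helper: count regular files and symlinks under a directory,
# ignoring anything inside .git. Every counted entry is one the startup
# scan would at least have to stat.
# Note: file names containing newlines would be counted more than once.
count_annex_entries() {
    find "${1:-.}" -name .git -prune -o \( -type f -o -type l \) -print \
        | wc -l | tr -d ' '
}
```

Running it under time in the top of the repository (for example: time count_annex_entries .) gives a lower bound on how long a full tree walk takes on that disk.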
The size of files doesn't affect the length of the scan. I have a repository with on the order of 50k files, and the startup scan takes only a few minutes. One thing that could matter is that my repository is in indirect mode. Direct mode is less efficient. You could try to switch your repository to indirect mode:
git annex indirect
(you can always switch back: git annex direct)
It would be possible to disable that scan, but at the expense of not being able to sync changes made while git-annex is not running. There's also a trick you can use: start the assistant running in a subdirectory and it will only scan that subdirectory (it will only notice new files in that subdirectory too).
I don't quite understand what you mean by problem #2. If files that have already been sent were repeatedly being uploaded or downloaded, that would be a bug. Please file a bug report with full debug logs if that is the case.
Which topology is best? I think the best way is to start with the one you like, and if it doesn't work well, add more links between repositories. A star topology will certainly work ok. A mesh can work ok but can be hard to maintain.
I would like to add that this startup check has probably been a blocker for my use case for a long, long time. I tried to use git-annex to synchronize a huge number of files, most of them never changing. My plan was to have a few tens of GB of data which more or less never change in an archive directory, and then from time to time add new data to the annex (in batches of a few hundred files, each of them not necessarily very large). Once this new data has been processed or otherwise become less immediately useful, it would be shifted to the archive. It would have been very useful to have such a setup, because the amount of data is too large to be replicated everywhere, especially on a laptop.
After finding this post I finally understand that the seemingly never-ending "performing startup scan" that I observed is probably not due to the assistant somehow hanging, contrary to what I thought. It seems it is just normal operation. The problem is that this normal operation makes git-annex unusable for the use case I was considering, since it does not make much sense to have it scan about 10^6 files or links on every boot of a laptop. On my workstation this "startup scan" has been running for close to one hour now and is not finished yet; that is unthinkable at laptop boot.
Maybe an analysis of how well git-annex scales with the number of files should be part of the documentation, since "large files" is not the only issue when trying to sync different computers. One finds references to a "very large number of files" in connection with annex.queuesize, but "very large" has no clear meaning. One also finds a reference to "1 million files" being a bit of a git limitation in the comments of a bug report: https://git-annex.branchable.com/bugs/Stress_test/
Orders of magnitude of the number of files that git-annex is supposed to be able to handle would be very useful.
@maurizio, that's a good motivating example.
So, I have made
git config annex.startupscan false
disable the startup scan (except for a minor tree walk that needs to be done to get inotify working).
Of course, if you set that, the assistant won't notice any changes that are made when it's not running. It would work well to set it if the assistant is started at boot and left running until shutdown.
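For anyone who wants to script this across several repositories, here is a minimal sketch. The function name disable_startup_scan is made up for illustration; it only wraps the git config call above and reads the value back so you can see what was stored.

```shell
# Hypothetical wrapper: set annex.startupscan to false in the given
# repository (default: current directory) and print the stored value.
# To undo it later: git -C "$repo" config --unset annex.startupscan
disable_startup_scan() {
    repo=${1:-.}
    git -C "$repo" config annex.startupscan false &&
        git -C "$repo" config --get annex.startupscan
}
```

The setting is stored per repository in .git/config, so it has to be applied in each clone where the assistant runs.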
My goal for git-annex is for it to scale to however many files git scales to. So I try to always make git be the limiting factor. I feel that git scales fairly well into the 100k file range.
Something to keep in mind for some point in the future: btrfs supports an efficient method by which one can inquire exactly which files have changed since a specified point in time, where the point in time is measured by a unit called «generation».
A program can request the current «generation» index, and later ask btrfs which files have changed after that index, without needing to walk the file-system. This is currently implemented in the user-space program «btrfs filesystem find-new». While it’s called «find-new», it finds not only newly created files but also changed files.
Currently, it seems one needs to be super-user to use «find-new», because it lists changed files for complete subvolumes, but since generations are stored per-file, in the future there will likely be a user-space method for regular users.