performance and multiple replication problems

Hi,

I was just setting up my git-annex repository and started to sync all my stuff into it.

Background: I have chosen git-annex to sync all my stuff (pictures, mp3s, documents, etc.) between my PC, notebook and a home server.

My problems:
1) When I start the git-annex daemon, the "Performing startup scan" message stays up for hours.
2) git-annex synchronizes folders from the server that are already on my PC, and it does so every time I restart the daemon on the client.

My questions:
1) Is git-annex, when running a single repository, suitable for managing more than 100 GB and more than 50,000 files?
2) Do I have to wait until all tasks are completed (everything is committed) to get rid of the multiple downloads of the same folders/files?
3) What is the best layout for syncing between more than two devices: should I use a mesh or a star (with my server in the middle)?

Thank You in advance! Regards J

comment 1

It sounds like you are using git-annex to sync a lot of files. The startup scan is a basic check that there are no new, deleted, or modified files since the last time it ran. This requires a little work, like statting the files, but won't take long for reasonable numbers of files.
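To get a rough feel for how much work that is on a given machine, a shell sketch like the following counts the files in a tree and stats each one without reading its contents. It is only a proxy, assuming the scan cost is dominated by the stat calls; it is not what the assistant runs internally, and the function names count_files and stat_walk are made up for illustration.

```shell
# Rough proxy for the startup scan: walk a tree and stat every file.
# NOT git-annex's actual code path; it only approximates the stat cost.

# Print the number of regular files and symlinks under "$1", skipping .git.
count_files() {
    find "$1" -path '*/.git' -prune -o \( -type f -o -type l \) -print | wc -l | tr -d ' '
}

# Stat every regular file and symlink under "$1" without reading contents.
stat_walk() {
    find "$1" -path '*/.git' -prune -o \( -type f -o -type l \) -exec ls -ld {} + > /dev/null
}

# Example (the path is a placeholder; "time" on a function needs bash):
#   count_files ~/annex
#   time stat_walk ~/annex
```

If the stat walk itself takes a long time, the disk or filesystem is probably the bottleneck rather than anything specific to git-annex.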

The size of files doesn't affect the length of the scan. I have a repository with on the order of 50k files, and the startup scan takes only a few minutes. One thing that could matter is that my repository is in indirect mode. Direct mode is less efficient. You could try to switch your repository to indirect mode: git annex indirect (you can always switch back: git annex direct)

It would be possible to disable that scan, but at the expense of not being able to sync changes made while git-annex is not running. There's also a trick you can use: start the assistant in a subdirectory and it will only scan that subdirectory (it will also only notice new files in that subdirectory).

I don't quite understand what you mean by problem #2. If files that have already been sent were repeatedly being uploaded or downloaded, that would be a bug. Please file a bug report with full debug logs if that is the case.

Which topology is best? I think the best way is to start with the one you like, and if it doesn't work well, add more links between repositories. A star topology will certainly work ok. A mesh can work ok but can be hard to maintain.

Comment by joey — Tue Jun 11 15:06:58 2013

comment 2

I have witnessed that second problem as well, to the point where I've stopped autostarting (and hence using) git annex for the moment. I'll try to get some debugging data.

Comment by Frederik Vanrenterghem — Wed Jun 12 00:47:21 2013

the startup check is not a small issue

I would like to add that this startup check has probably been a blocker for my use case for a long time. I tried to use git-annex to synchronize a huge number of files, most of them never changing. My plan was to keep a few tens of GB of data that more or less never changes in an archive directory, and then from time to time add new data to the annex (in batches of a few hundred files, each of them not necessarily very large). Once this new data had been processed or otherwise become less immediately useful, it would be moved to the archive. Such a setup would have been very useful, because the amount of data is too large to be replicated everywhere, especially on a laptop.

After finding this post I finally understand that the seemingly never-ending "performing startup scan" that I observed is probably not due to the assistant somehow hanging, contrary to what I thought. It seems it is just normal operation. The problem is that this normal operation makes git-annex unusable for the use case I was considering, since it does not make much sense to have git-annex scan about 10⁶ files or links on every boot of a laptop. On my workstation this startup scan has now been running for close to one hour and has not finished yet; that is not acceptable at laptop boot.

Maybe an analysis of how well git-annex scales with the number of files should be part of the documentation, since "large files" is not the only issue when trying to sync different computers. One finds references to a "very large number of files" in connection with annex.queuesize, but "very large" has no clear meaning. One also finds a reference to "1 million files" being a bit of a git limitation in the comments of a bug report: https://git-annex.branchable.com/bugs/Stress_test/.

Orders of magnitude of the number of files that git-annex is supposed to be able to handle would be very useful.

Comment by maurizio — Tue Feb 25 11:37:15 2014

comment 4

@maurizio, that's a good motivating example.

So, I have made it so that running git config annex.startupscan false disables the startup scan (except for a minor tree walk that needs to be done to get inotify working).

Of course, if you set that, the assistant won't notice any changes that are made when it's not running. It would work well to set it if the assistant is started at boot and left running until shutdown.
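As a minimal sketch of applying that to one repository: annex.startupscan is the setting named above, while the helper names disable_startup_scan and show_startup_scan and the repository path are hypothetical. It assumes a git recent enough to support git -C.

```shell
# Turn off the assistant's startup scan for the repository at "$1".
# Only the setting name annex.startupscan comes from the comment above;
# the helper names are made up for illustration.
disable_startup_scan() {
    git -C "$1" config annex.startupscan false
}

# Print the current value (prints nothing if the setting was never written).
show_startup_scan() {
    git -C "$1" config --get annex.startupscan || true
}

# Example (the path is a placeholder):
#   disable_startup_scan ~/annex
#   show_startup_scan ~/annex
```

The setting is stored in that repository's own git config, so it has to be repeated in every clone where the scan should be skipped.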

My goal for git-annex is for it to scale to however many files git scales to. So I always try to make git the limiting factor. I feel that git scales fairly well into the 100k file range.

Comment by joeyh.name — Wed Mar 5 22:06:00 2014

comment 5

Something to keep in mind for some point in the future: btrfs supports an efficient method by which one can inquire exactly which files have changed since a specified point in time, where the point in time is measured by a unit called «generation».

A program can request the current «generation» index, and later ask btrfs which files have changed after that index, without needing to walk the file-system. This is currently implemented in the user-space program «btrfs subvolume find-new». While it’s called «find-new», it finds not only newly created files but also changed files.

Currently, it seems one needs to be super-user to use «find-new», because it lists changed files for complete subvolumes, but since generations are stored per-file, in the future there will likely be a user-space method for regular users.
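A rough sketch of how that could be scripted today, as far as I understand the tool: the subcommand is spelled btrfs subvolume find-new in the btrfs-progs I know, and it takes a subvolume path and a last generation. The output format the helpers parse (a trailing "transid marker was N" line, and the path starting at field 17 of each "inode" line) is an assumption on my part, and the helper names are made up.

```shell
# Sketch only: needs root and a btrfs subvolume; SUBVOL is a placeholder path.

# Pull N out of the trailing "transid marker was N" line (assumed format).
parse_generation() {
    sed -n 's/^transid marker was \([0-9][0-9]*\)$/\1/p'
}

# Keep only the path column of the "inode ..." lines (assumed to start at
# field 17) and drop duplicates.
changed_paths() {
    grep '^inode ' | cut -d' ' -f17- | sort -u
}

# Asking for changes since a very high generation should list no files but
# still print the marker line, which gives the current generation.
current_generation() {
    btrfs subvolume find-new "$1" 9999999 | parse_generation
}

# Example (run as root):
#   gen=$(current_generation "$SUBVOL")
#   ... later ...
#   btrfs subvolume find-new "$SUBVOL" "$gen" | changed_paths
```

A watcher built this way would only need to store one number between runs instead of walking the tree.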

Comment by zardoz — Tue Jul 8 12:26:58 2014

Last edited Wed Nov 27 22:47:37 2013