What steps will reproduce the problem?
mkdir annex_stress; cd annex_stress, then execute the following script:
#! /bin/sh
# Create a directory in which we dump all the files.
mkdir probes; cd probes
for i in `seq -w 1 25769`; do
    mkdir probe$i
    # printf is used here so the \n is expanded the same way under any /bin/sh
    printf "This is an important file, which is saved in the backup ('back') directory too.\n Content changes: %s\n" "$i" > probe$i/probe$i.txt
    echo "This is just an identical content file. Saved in each subdir." > probe$i/defaults.txt
    echo "This is a variable ($i) content file, which is not backed up in the 'back' directory." > probe$i/probe-nb$i.txt
    mkdir probe$i/back
    cp probe$i/probe$i.txt probe$i/back/probe$i.txt
done
It creates about 25,000 directories with 3 files in each, two of which are identical.
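(For reference, the totals follow from 25,769 probe directories, each with a 'back' subdirectory and 4 files in total, and can be double-checked after the script finishes:)
find probes -mindepth 1 -type d | wc -l    # 51538 directories (25769 probe dirs + 25769 back dirs)
find probes -type f | wc -l                # 103076 files (4 per probe directory)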
What is the expected output? What do you see instead?
I expect git-annex to be able to import the directory within 12 hours. Instead, it just crashes the GUI (after starting the webapp, it uses 100% CPU and does not finish after 28 hours).
What version of git-annex are you using? On what operating system?
version 2013.04.17
Please provide any additional information below.
I do hope git-annex can be fixed to handle a large number of files. This stress test models my own directory structure well enough: a relatively high number of files with relatively low disk space usage (my own directory tree is 750MB; this test creates 605MB).
Best, Laszlo
Is this related or unrelated to the bug you filed at 'Resource exhausted'?
I tried this test, and noticed that it was taking the assistant rather a long time to get to the 10 thousand file threshold where it makes a batch commit. A small change to a better data structure for its queue reduced that time from probably 10 minutes to 2.5.
I was unable to reproduce any problem with the webapp. Please provide lots of details to back up "it just crashes the GUI".
The main problem with this directory tree is that it has more directories than inotify can watch in the default configuration. So after it adds the first 8192 directories, it begins failing to watch any more, and prints a message about needing to increase the inotify limits for each additional directory. I don't think that 51 thousand directories is a particularly realistic amount for any real-world usage of git-annex. (It will also break file managers, Dropbox, etc., which all use inotify in the same way.)
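For reference, on most Linux systems the per-user inotify watch limit can be raised roughly like this (81920 is just the value discussed later in this thread; pick whatever exceeds your directory count):
# check the current limit (8192 by default on many systems)
cat /proc/sys/fs/inotify/max_user_watches
# raise it for the running system
sudo sysctl fs.inotify.max_user_watches=81920
# make it persistent across reboots
echo fs.inotify.max_user_watches=81920 | sudo tee -a /etc/sysctl.conf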
The other main time sink is that git-annex needs to run
git hash-object
once per file to stage its symlink. That is a lot of processes to run, and perhaps it could be sped up by using git fast-import.
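For a sense of the per-file cost, staging one annexed symlink by hand looks roughly like the sketch below (the key and paths are made up for the example), whereas git fast-import can feed many objects to a single long-running process:
# hash the symlink target as a blob -- one git process per file
target='.git/annex/objects/xx/yy/SHA256-s42--0123abcd/SHA256-s42--0123abcd'
sha=$(printf '%s' "$target" | git hash-object -w --stdin)
# record the symlink (mode 120000) in the index
git update-index --add --cacheinfo 120000 "$sha" probe00001/probe00001.txt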
Hi,
First of all, thank you for your time looking into my bug. I will try to research more from my side.
The 'Resource exhausted' bug report (which lost its title, so I could not click on it to add this test case as a comment) was tested on real data, a copy of my own working directory. This bug report is tested on the output of this small shell script.
Neither of them succeeded in importing, and I quickly assumed they were the exact same issue.
So I will test again, raising the inotify limit to 81920, and report back.
I would be perfectly fine if I could configure git-annex to sync those directories only once a month or once a week (i.e. check for updates once a week). So there is no need to watch them in real time; those are my archived work files.
Well, it is not 25,000 directories in a single folder, but rather something like this:
Each 'backX' contains a whole backup of the work up to that point. So the directory structure is a bit deeper, with no 25,000 subdirectories in a single directory, but the overall numbers are right.
If I could somehow mark this work_done directory to not sync in real time (or the work_done/2008, work_done/2009, work_done/2010, work_done/2011, work_done/2012 subdirectories in it), then my whole issue would vanish.
I only want to use git-annex to have a backup of this directory, so that in case of laptop theft or malfunction I still have a copy. I don't need live sync anywhere; I have directories which I know I will not touch for months.
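For my use case, even a weekly cron job instead of the live watcher would be enough; a rough sketch of what I mean (the path and schedule are just examples):
# crontab entry: every Sunday at 03:00, add any new files and sync
0 3 * * 0  cd /home/laszlo/work_done && git annex add . && git annex sync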
Best, Laszlo
I beg to differ: with Dropbox I have handled my Scrapbook(1) folder, which means 130 thousand files, between three computers for over 2 years now without problems.
Don't get me wrong, I'm not complaining; I'm only giving you a completely unrelated use case which also requires handling a high number of files. And in that case the 81920 limit would not help either.
(1): https://addons.mozilla.org/hu/firefox/addon/scrapbook/
You're confusing number of files (inotify doesn't care) with number of directories (inotify does care).
Dropbox is on record about being limited in the number of directories it can watch without adjusting the inotify limit. https://www.dropbox.com/help/145
I found a bug in the webapp thanks to this stress test. When inotify goes over the limit, it displays a message about how to fix it. But it displays that message over and over, once for each file. The result is a constantly updating, very large web page.
Unless you tell me differently, I'm going to assume that's the GUI crash you referred to, since it can make a web browser very slow.
I've fixed this problem. Now when it goes over the limit, the webapp will just display this:
I put in a further change to reduce the number of alerts shown in the webapp when bulk adding files. This probably quadrupled the speed or more, even when the webapp was not running, as updating an alert every time a file was added was a lot of unnecessary work.
After these changes, it adds the first 10 thousand files in 35 minutes on my five year old netbook. It should scale linearly (aside from git's own scalability issues with a lot of files, which I don't think are very bad under 1 million files), so adding all 100 thousand files should take 6 hours or so.
I'm interested to see what results you get, compared with before.
Hi,
I have just tried it out again with the latest (20130501) version.
It is really nice to see you have been working on it, and it has improved tremendously! The logging issue is solved (it even rotates the logs), and it finished importing without crashing!
Remaining polish items:
a) The import time is not as good as you wrote; it slows down over time. It is true that the first 10,000 files import in about an hour, but it finishes everything in 9 hours 20 minutes. (On a normal laptop, the last portion of 5,000 files took more than 2 hours.)
b) Every startup means re-checksumming everything, so the second start also took around 8-12 hours. (I don't know exactly, because it finished somewhere during the night, but it was longer than 8 hours.) I don't think re-checksumming is necessary at all: if the filename, size and date have not changed, why re-checksum (SHA) it? See the rough sketch after this list.
c) It is leaking files. At the second startup, it reported that it had successfully added: Added 2375 files 5 files probe25366.txt
I have not touched the directory. ls confirms the leak:
d) Without raising the inotify limit, it does not work at all. I think this could be solved by not watching each and every directory all the time. Every user will likely have a working directory plus some directories they don't intend to touch or modify at all. Some use cases: photo archiving, video archiving, archiving finished work, etc.
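To illustrate what I mean in b), a check along these lines would avoid most of the hashing (the cache location and format here are just made up for the example):
f=probe00001/probe00001.txt
new=$(stat -c '%s %Y' "$f")                       # current size and mtime
old=$(cat ".statcache/$f" 2>/dev/null)
if [ "$new" != "$old" ]; then
    sha256sum "$f"                                # only checksum when the stat info changed
    mkdir -p ".statcache/$(dirname "$f")"
    printf '%s\n' "$new" > ".statcache/$f"
fi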
All of the above results are with the stress test script. I would love to have confirmation from a third party.
Overall I'm impressed with the work you have done.
Best, Laszlo
I have been working the whole day zipping up (tar.gz) all the unused directories. Now my real data directory looks like this:
When I first started git-annex, it added 5492 files; the next time, it added the missing 3596 files. Then it stopped adding files. From the GUI everything looked fine even at the first start (it performed the startup scan), and even in the log files (daemon.log.x) there was nothing suspicious.
As you can see, this case is not a stress test at all; it is really a minimal test case: 1.1GB of disk space, 9088 files and a thousand directories. The real question is why git-annex missed 3596 files at the first startup (i.e. did not add all the files).
It would help tremendously if it displayed at startup how many files it found, and, while adding, how many are left to be added. Something like this:
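(An illustrative mock-up, using the counts from above:)
Startup scan: found 9088 files
Adding files: 5492 added, 3596 left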
So it is definitely a bug, and I am stuck on how to debug it further. Everything looks just fine.
Best, Laszlo
My estimate was indeed slightly optimistic. While I did not run the whole import, it did run slower for the later batches of files. As far as I can see, that slowdown is just because git gets slower as it has more files, so there is nothing I can do about it. git-annex itself is now scaling well, though.
Re checksumming on startup: There was a bug that caused the assistant to re-checksum all direct mode files on startup. This bug was fixed in version 4.20130417. If you're using that version and still see it re-checksumming files, please file a new bug report about it, as this is not intended behavior.
You seem to be saying that the assistant is failing to add some files, and then when stopped and restarted it finds and adds them. I don't quite know how that would happen. If you can provide a test case that I can use to reproduce that behavior, I will try to debug it.
Re-checksumming: it seems it is indeed fixed in the newest (2013.05.01) version downloaded from here: http://downloads.kitenet.net/git-annex/linux/
I tried to add the big stress test directory as a secondary repository into git-annex (along with my real data directory), but it seems some library does not match on my system, so curl is complaining:
I'm on ubuntu 10.04.
And the log file is starting to fill up, so maybe once a problem occurs, it should only be written to the log file once.
I will redo this stress test next week, without combining it with any other repository. Thank you very much for your response; I do appreciate you dealing with my complaints! :)
Best, Laszlo