What steps will reproduce the problem?
mkdir annex_stress; cd annex_stress, then execute the following script:
#! /bin/sh
# Create a directory in which we dump all the files.
mkdir probes; cd probes
for i in `seq -w 1 25769`; do
    mkdir probe$i
    # printf is used here so the \n is expanded the same way under any /bin/sh
    printf "This is an important file, which is saved in the backup ('back') directory too.\n Content changes: %s\n" "$i" > probe$i/probe$i.txt
    echo "This is just an identical content file. Saved in each subdir." > probe$i/defaults.txt
    echo "This is a variable ($i) content file, which is not backed up in the 'back' directory." > probe$i/probe-nb$i.txt
    mkdir probe$i/back
    cp probe$i/probe$i.txt probe$i/back/probe$i.txt
done
It creates about 25,000 directories with 3 files in each, two of which are identical.
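(For reference, the totals follow from 25,769 probe directories, each with a 'back' subdirectory and 4 files in total, and can be double-checked after the script finishes:)
find probes -mindepth 1 -type d | wc -l    # 51538 directories (25769 probe dirs + 25769 back dirs)
find probes -type f | wc -l                # 103076 files (4 per probe directory)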
What is the expected output? What do you see instead?
I expect git-annex to be able to import the directory within 12 hours. Instead, it just crashes the GUI (after starting the webapp, it uses 100% CPU and does not finish after 28 hours).
What version of git-annex are you using? On what operating system?
version 2013.04.17
Please provide any additional information below.
I do hope git-annex can be fixed to handle a large number of files. This stress test models my own directory structure well enough: a relatively high number of files with relatively low disk space usage (my own directory tree is 750MB; this test creates 605MB).
Best, Laszlo
Is this related or unrelated to the bug you filed at 'Resource exhausted'?
I tried this test, and noticed that it was taking the assistant rather a long time to get to the 10 thousand file threshold where it makes a batch commit. A small change to a better data structure for its queue reduced that time from probably 10 minutes to 2.5.
I was unable to reproduce any problem with the webapp. Please provide lots of details to back up "it just crashes the GUI".
The main problem with this directory tree is that it has more directories than inotify can watch in the default configuration. So after it adds the first 8192 directories, it begins failing to watch any more, and prints a message about needing to increase the inotify limits for each additional directory. I don't think that 51 thousand directories is a particularly realistic amount for any real-world usage of git-annex. (It will also break file managers, Dropbox, etc., which all use inotify in the same way.)
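For reference, on most Linux systems the per-user inotify watch limit can be raised roughly like this (81920 is just the value discussed later in this thread; pick whatever exceeds your directory count):
# check the current limit (8192 by default on many systems)
cat /proc/sys/fs/inotify/max_user_watches
# raise it for the running system
sudo sysctl fs.inotify.max_user_watches=81920
# make it persistent across reboots
echo fs.inotify.max_user_watches=81920 | sudo tee -a /etc/sysctl.conf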
The other main time sink is that git-annex needs to run
git hash-object
once per file to stage its symlink. That is a lot of processes to run, and perhaps it could be sped up by using git fast-import.
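For a sense of the per-file cost, staging one annexed symlink by hand looks roughly like the sketch below (the key and paths are made up for the example), whereas git fast-import can feed many objects to a single long-running process:
# hash the symlink target as a blob -- one git process per file
target='.git/annex/objects/xx/yy/SHA256-s42--0123abcd/SHA256-s42--0123abcd'
sha=$(printf '%s' "$target" | git hash-object -w --stdin)
# record the symlink (mode 120000) in the index
git update-index --add --cacheinfo 120000 "$sha" probe00001/probe00001.txt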
Hi,
First of all, thank you for your time looking into my bug. I will try to research more from my side.
The 'Resource exhausted' bug report (which lost its title, so I could not click on it to add this test case as a comment) was tested on real data, a copy of my own working directory. This bug report is tested on the output of this small shell script.
Neither of them succeeded in importing, and I quickly assumed they were the exact same issue.
So I will test again, raising the inotify limit to 81920, and report back.
I would be perfectly fine if I could configure git-annex to sync those directories only once a month or once a week (i.e. check for updates once a week). So there is no need to watch them in real time; those are my archived work files.
Well, it is not 25,000 directories in a single folder, but rather something like this:
Each 'backX' contains a whole backup of the work up to that point. So the directory structure is a bit deeper, with no 25,000 subdirectories in a single directory, but the overall numbers are right.
If I could somehow mark this work_done directory to not sync in real time (or the work_done/2008, work_done/2009, work_done/2010, work_done/2011, work_done/2012 subdirectories in it), then my whole issue would vanish.
I only want to use git-annex to have a backup of this directory, so that in case of laptop theft or malfunction I still have a copy. I don't need live sync anywhere; I have directories which I know I will not touch for months.
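For my use case, even a weekly cron job instead of the live watcher would be enough; a rough sketch of what I mean (the path and schedule are just examples):
# crontab entry: every Sunday at 03:00, add any new files and sync
0 3 * * 0  cd /home/laszlo/work_done && git annex add . && git annex sync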
Best, Laszlo
I beg to differ: with Dropbox I have handled my Scrapbook(1) folder, which means 130 thousand files, between three computers for over 2 years now without problems.
Don't get me wrong, I'm not complaining; I'm only giving you a completely unrelated use case which also requires handling a high number of files. And in that case the 81920 limit would not help either.
(1): https://addons.mozilla.org/hu/firefox/addon/scrapbook/
You're confusing number of files (inotify doesn't care) with number of directories (inotify does care).
Dropbox is on record about being limited in the number of directories it can watch without adjusting the inotify limit. https://www.dropbox.com/help/145
I found a bug in the webapp thanks to this stress test. When inotify goes over the limit, it displays a message about how to fix it. But it displays that message over and over, once for each file. The result is a constantly updating, very large web page.
Unless you tell me differently, I'm going to assume that's the GUI crash you referred to, since it can make a web browser very slow.
I've fixed this problem. Now when it goes over the limit, the webapp will just display this:
I put in a further change to reduce the number of alerts shown in the webapp when bulk adding files. This probably quadrupled the speed or more, even when the webapp was not running, as updating an alert every time a file was added was a lot of unnecessary work.
After these changes, it adds the first 10 thousand files in 35 minutes on my five year old netbook. It should scale linearly (aside from git's own scalability issues with a lot of files, which I don't think are very bad under 1 million files), so adding all 100 thousand files should take 6 hours or so.
I'm interested to see what results you get, compared with before.
Hi,
I have just tried it out again with the latest (20130501) version.
It is really nice to see you have been working on it, and it has improved tremendously! The logging issue is solved (it even rotates the logs), and it finished importing without crashing!
Remaining polish items:
a) The import time is not as good as you wrote; it slows down over time. It is true that the first 10,000 files import in about an hour, but it finishes everything in 9 hours 20 minutes. (On a normal laptop, the last portion of 5,000 files took more than 2 hours.)
b) Every startup means re-checksumming everything, so the second start also took around 8-12 hours. (I don't know exactly, because it finished somewhere during the night, but it was longer than 8 hours.) I don't think re-checksumming is necessary at all: if the filename, size and date have not changed, why re-checksum (SHA) it? See the rough sketch after this list.
c) It is leaking files. At the second startup, it reported that it had successfully added: Added 2375 files 5 files probe25366.txt
I have not touched the directory. ls confirms the leak:
d) Without raising the inotify limit, it does not work at all. I think this could be solved by not watching each and every directory all the time. Every user will likely have a working directory plus some directories they don't intend to touch or modify at all. Some use cases: photo archiving, video archiving, archiving finished work, etc.
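To illustrate what I mean in b), a check along these lines would avoid most of the hashing (the cache location and format here are just made up for the example):
f=probe00001/probe00001.txt
new=$(stat -c '%s %Y' "$f")                       # current size and mtime
old=$(cat ".statcache/$f" 2>/dev/null)
if [ "$new" != "$old" ]; then
    sha256sum "$f"                                # only checksum when the stat info changed
    mkdir -p ".statcache/$(dirname "$f")"
    printf '%s\n' "$new" > ".statcache/$f"
fi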
All of the above results are with the stress test script. I would love to have confirmation from a third party.
Overall I'm impressed with the work you have done.
Best, Laszlo
I have been working the whole day zipping up (tar.gz) all the unused directories. Now my real data directory looks like this:
When I first started git-annex, it added 5492 files; the next time, it added the missing 3596 files. Then it stopped adding files. From the GUI everything looked fine even at the first start (it performed the startup scan), and even in the log files (daemon.log.x) there was nothing suspicious.
As you can see, this case is not a stress test at all; it is really a minimal test case: 1.1GB of disk space, 9088 files and a thousand directories. The real question is why git-annex missed 3596 files at the first startup (i.e. did not add all the files).
It would help tremendously if it displayed at startup how many files it found, and, while adding, how many are left to be added. Something like this:
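(An illustrative mock-up, using the counts from above:)
Startup scan: found 9088 files
Adding files: 5492 added, 3596 left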
So it is definitely a bug, and I am stuck on how to debug it further. Everything looks just fine.
Best, Laszlo
My estimate was indeed slightly optimistic. While I did not run the whole import, it did run slower for the later batches of files. As far as I can see, that slowdown is just because git gets slower as it has more files, so there is nothing I can do about it. git-annex itself is now scaling well, though.
Re checksumming on startup: There was a bug that caused the assistant to re-checksum all direct mode files on startup. This bug was fixed in version 4.20130417. If you're using that version and still see it re-checksumming files, please file a new bug report about it, as this is not intended behavior.
You seem to be saying that the assistant is failing to add some files, and then when stopped and restarted it finds and adds them. I don't quite know how that would happen. If you can provide a test case that I can use to reproduce that behavior, I will try to debug it.
Re-checksumming: it seems it is indeed fixed in the newest (2013.05.01) version downloaded from here: http://downloads.kitenet.net/git-annex/linux/
I tried to add the big stress test directory as a secondary repository into git-annex (along with my real data directory), but it seems some library does not match on my system, so curl is complaining:
I'm on ubuntu 10.04.
And the log file is starting to fill up, so maybe once a problem occurs, it should only be written to the log file once.
I will redo this stress test next week, without combining it with any other repository. Thank you very much for your response; I do appreciate you dealing with my complaints! :)
Best, Laszlo