Recent comments posted to this site:

I found out that I need to add the following line to gitolite.rc on the server side:

'git-annex-shell ua'

The single repository works with gitolite as expected. However, the mirroring feature is not working for the slaves. When I do

git annex copy --to origin

The master server stores the annexed files correctly, but the files managed by annex are not synced to the slave mirrors at all.

Comment by git-annex Wed Sep 28 18:12:56 2016

The metadata you get out should always be encoded the same as the metadata you put in. Which encoding, or encodings, are used is up to you.

Are you seeing metadata queries returning a different sequence of bytes than the sequence of bytes that were originally stored? If not, I don't think this is a bug.

Comment by joey Mon Sep 26 20:50:26 2016

Instead of profiling git annex copy --to remote, I profiled git annex find --not --in web, which needs to do the same kind of location log lookup.

        total time  =       12.41 secs   (12413 ticks @ 1000 us, 1 processor)
        total alloc = 8,645,057,104 bytes  (excludes profiling overheads)

        COST CENTRE               MODULE                      %time %alloc

        adjustGitEnv              Git.Env                      21.4   37.0
        catchIO                   Utility.Exception            13.2    2.8
        spanList                  Data.List.Utils              12.6   17.9
        parsePOSIXTime            Logs.TimeStamp                6.1    5.0
        catObjectDetails.receive  Git.CatFile                   5.9    2.1
        startswith                Data.List.Utils               5.7    3.8
        md5                       Data.Hash.MD5                 5.1    7.9
        join                      Data.List.Utils               2.4    6.0
        readFileStrictAnyEncoding Utility.Misc                  2.2    0.5

The adjustGitEnv overhead is a surprise! It seems it is getting called once per file, and allocating a new copy of the environment each time. Call stack: withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv. Looks like simply making gitEnv be cached at startup would avoid most of the adjustGitEnv slowdown.
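
The idea would be roughly the following (just a sketch with made-up names, not the actual Git.Env code): read the environment once at startup and cache it, so that per-file adjustments only touch the cached list.

        import System.Environment (getEnvironment)

        -- Hypothetical sketch (names made up): read the environment once at
        -- startup and cache it, instead of copying it for every file.
        newtype CachedGitEnv = CachedGitEnv [(String, String)]

        cacheGitEnv :: IO CachedGitEnv
        cacheGitEnv = CachedGitEnv <$> getEnvironment

        -- Pointing git at a different index file then only adjusts the cached
        -- list, rather than re-reading the whole environment per file.
        withIndexEnv :: CachedGitEnv -> FilePath -> [(String, String)]
        withIndexEnv (CachedGitEnv env) indexfile =
            ("GIT_INDEX_FILE", indexfile) : filter ((/= "GIT_INDEX_FILE") . fst) env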

(The catchIO overhead is a false reading; the detailed profile shows that all its time and allocations are inherited. getAnnexLinkTarget is running catchIO in the expensive case, so readSymbolicLink is the actual expensive bit.)

The parsePOSIXTime comes from reading location logs. It's implemented using a generic Data.Time.Format.parseTime, which uses a format string "%s%Qs". A custom parser that splits into seconds and picoseconds and simply reads both numbers might be more efficient.
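
Such a parser could look something like this sketch (assuming timestamps of the form "<seconds>[.<fraction>]s", as the "%s%Qs" format produces; not the code git-annex actually uses):

        import Data.Ratio ((%))
        import Data.Time.Clock.POSIX (POSIXTime)
        import Text.Read (readMaybe)

        -- Sketch of a hand-rolled parser: split on the decimal point and read
        -- the two numbers directly. Assumes non-negative timestamps.
        parsePOSIXTime' :: String -> Maybe POSIXTime
        parsePOSIXTime' s0 = case break (== '.') (dropTrailingS s0) of
            (secs, "") -> fromIntegral <$> readInt secs
            (secs, '.':frac) -> do
                n <- readInt secs
                f <- readInt frac
                return (fromIntegral n + fromRational (f % (10 ^ length frac)))
            _ -> Nothing
          where
            readInt = readMaybe :: String -> Maybe Integer
            dropTrailingS s
                | not (null s) && last s == 's' = init s
                | otherwise = s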

catObjectDetails.receive is implemented using mostly String and could probably be sped up by being converted to use ByteString.
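
For example, parsing the `git cat-file --batch` header line ("<sha> <type> <size>") over ByteString would look roughly like this sketch (not the actual Git.CatFile code):

        import qualified Data.ByteString.Char8 as B8

        -- Sketch: parse a `git cat-file --batch` header line with ByteString
        -- rather than String.
        parseBatchHeader :: B8.ByteString -> Maybe (B8.ByteString, B8.ByteString, Integer)
        parseBatchHeader l = case B8.words l of
            [sha, objtype, size] -> case B8.readInteger size of
                Just (n, rest) | B8.null rest -> Just (sha, objtype, n)
                _ -> Nothing
            _ -> Nothing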

Comment by joey Mon Sep 26 19:59:43 2016

Built git-annex with profiling, using stack build --profile

(For reproducibility, running git-annex in a clone of the git-annex repo https://github.com/RichiH/conference_proceedings with rev 2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed objects.)

Profiling git-annex find +RTS -p:

        total time  =        3.53 secs   (3530 ticks @ 1000 us, 1 processor)
        total alloc = 3,772,700,720 bytes  (excludes profiling overheads)

        COST CENTRE            MODULE                  %time %alloc

        spanList               Data.List.Utils          32.6   37.7
        startswith             Data.List.Utils          14.3    8.1
        md5                    Data.Hash.MD5            12.4   18.2
        join                   Data.List.Utils           6.9   13.7
        catchIO                Utility.Exception         5.9    6.0
        catches                Control.Monad.Catch       5.0    2.8
        inAnnex'.checkindirect Annex.Content             4.6    1.8
        readish                Utility.PartialPrelude    3.0    1.4
        isAnnexLink            Annex.Link                2.6    4.0
        split                  Data.List.Utils           1.5    0.8
        keyPath                Annex.Locations           1.2    1.7

This is interesting!

Fully 40% of CPU time and allocations are in list (really String) processing, and the details of the profiling report show that spanList and startswith and join are all coming from calls to replace in keyFile and fileKey. Both functions nest several calls to replace, so perhaps that could be unwound into a single pass and/or a ByteString used to do it more efficiently.
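
The single-pass idea would look roughly like this (the escape table here is made up for illustration; the real rules are whatever keyFile/fileKey implement):

        -- Sketch: a single pass over the string instead of nested `replace` calls.
        -- The escape table below is illustrative, not git-annex's actual one.
        escapeKey :: String -> String
        escapeKey = concatMap esc
          where
            esc '/' = "%"
            esc '%' = "&s"
            esc '&' = "&a"
            esc c   = [c]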

12% of run time is spent calculating the md5 hashes for the hash directories under .git/annex/objects. Data.Hash.MD5 is from missingh, and it is probably quite an unoptimised version. Switching to the version from cryptonite would probably speed it up a lot.
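
With cryptonite it would be something like the following sketch (just the digest calculation; deriving the hash directory from the digest is left out):

        import Crypto.Hash (MD5 (MD5), hashWith)
        import qualified Data.ByteString.Char8 as B8

        -- Sketch: cryptonite's optimised MD5 in place of missingh's Data.Hash.MD5,
        -- here just producing a hex digest of a String.
        md5hex :: String -> String
        md5hex = show . hashWith MD5 . B8.pack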

Comment by joey Mon Sep 26 19:20:36 2016

As an aside, I just tried syncthing's arm version, and I get a "runtime: kernel page size (32768) is larger than runtime page size (4096)" error. Their arm64 version also won't run.

Perhaps it's a similar issue with git-annex? The page size seems to come up in a number of contexts when people try to get software running on the WD NAS.

Is there any way to adjust this page size in the linker?

Thanks

Comment by PaulK Sun Sep 25 03:08:40 2016

> And there is a complication with running [git annex copy --from --to] at the same time as eg git annex get of the same file. It would be surprising for get to succeed (because copy has already temporarily downloaded the file) and then have the file later get dropped.

A solution to this subproblem would transparently fall out of a facility for logically dropping files, which was briefly talked about a long time ago. Just mark the file as logically dropped. If the user git annex gets it while the copy-out is in progress, its status will change to "present", so copy will know not to physically delete it.

(Of course there are race conditions involved, but I presume/hope that they're no worse than git-annex already has to deal with.)
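
In other words, something like this sketch of the state logic (made-up names, not git-annex's actual data model):

        -- Sketch only: a temporary local copy made by `copy --from --to` starts
        -- as LogicallyDropped; a concurrent `git annex get` flips it to Present,
        -- and copy's cleanup step only removes content still LogicallyDropped.
        data LocalStatus = Absent | LogicallyDropped | Present
            deriving (Eq, Show)

        cleanupAfterCopy :: LocalStatus -> LocalStatus
        cleanupAfterCopy LogicallyDropped = Absent  -- safe to delete the temp copy
        cleanupAfterCopy status           = status  -- user fetched it meanwhile; keep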

Comment by erics Sun Sep 25 00:33:56 2016
Nope, no difference. Same "ELF load command alignment not page-aligned" errors as before.
Comment by PaulK Sat Sep 24 04:16:51 2016

THANK YOU for implementing this feature -- we will make use of it soon. But so that we don't have to do reverse estimation from "byte-progress" and "percent-progress", and don't have to get it from a key (which might not have it, e.g. in the case of the URL relaxed backend) -- could you just include in each record the "byte-target" (if known) or something like that? ;) thanks in advance!

Comment by EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] Sat Sep 24 02:43:21 2016

> (Let's not discuss the behavior of copy --to when the file is not locally present here; there is plenty of other discussion of that in eg http://bugs.debian.org/671179)

Agreed, it's kind of secondary.

> git-annex's special remote API does not allow remote-to-remote transfers without spooling it to a file on disk first.

yeah, i noticed that when writing my own special remote.

> And it's not possible to do using rsync on either end, AFAICS.

That is correct.

> It would be possible in some other cases but this would need to be implemented for each type of remote as a new API call.

... and would fail for most, so there's little benefit there.

how about a socket or FIFO of some sort? i know those break a lot of semantics (e.g. [ -f /tmp/fifo ] fails in bash) but they could be a solution...

> Modern systems tend to have quite a large disk cache, so it's quite possible that going via a temp file on disk is not going to use a lot of disk IO to write and read it when the read and write occur fairly close together.

true. there are also in-memory files that could be used, although I don't think this would work across different process spaces.

> The main benefit from streaming would probably be if it could run the download and the upload concurrently.

for me, the main benefit would be to deal with low disk space conditions, which is quite common on my machines: i often cram the disk almost to capacity with good stuff i want to listen to later... git-annex allows me to freely remove stuff when i need the space, but it often means i am close to 99% capacity on the media drives i use.

> But that would only be a benefit sometimes. With an asymmetric connection, saturating the uplink tends to swamp downloads. Also, if download is faster than upload, it would have to throttle downloads (which complicates the remote API much more), or buffer them to memory (which has its own complications).

that is true.

> Streaming the download to the upload would at best speed things up by a factor of 2. It would probably work nearly as well to upload the previously downloaded file while downloading the next file.

presented like that, it's true that the benefits of streaming are not good enough to justify the complexity - the only problem is large files and low local disk space... but maybe we can delegate that solution to the user: "free up at least enough space for one of those files you want to transfer".

> [... -J magic stuff ...]

> And there is a complication with running that at the same time as eg git annex get of the same file. It would be surprising for get to succeed (because copy has already temporarily downloaded the file) and then have the file later get dropped. So, it seems that copy --from --to would need to stash the content away in a temp file somewhere instead of storing it in the annex proper.

My thoughts exactly: actually copying the files to the local repo introduces all sorts of weird --numcopies nastiness and race conditions, it seems to me.

thanks for considering this!

Comment by anarcat Thu Sep 22 12:43:11 2016

Here's my use case (much simpler)

Three git repos:

desktop: normal checkout, source of almost all annexed files, commits, etc. The only place I run git annex commands. Not enough space to store all annexed files.

main_external: bare git repo, stores all annexed file contents, but no file tree. Usually connected. Purpose: primary backups.

old_external: like main_external, except connected only occasionally.

I periodically copy from desktop to main_external. That's all well and good.

The tricky part is when I plug in old_external and want to get everything on there. It's hard to get content onto old_external that is stored only on main_external. That's when I want to:

git annex copy --from=main_external --to=old_external --not --in old_external

Note that this would not copy obsolete data (i.e. data only referenced from old git commits) onto old_external. I like that.

To work around the lack of that feature, I try to keep copies on desktop until I've had a chance to copy them to both external drives. It's good for numcopies, but I don't like trying to keep track of it, and I wish I could choose to let there be just one copy of things on main_external for replaceable data.

Comment by JasonWoof Thu Sep 22 00:40:07 2016