I found out that I need to add the following lines to gitolite.rc
on the server side.
The signal repository works with gitolite as expected.
However, the mirroring feature is not working for the slaves.
When I do
git annex copy --to origin
the master server stores the annex file correctly, but the file
managed by annex is not syncing to the slave mirrors at all.
The metadata you get out should always be encoded the same as the metadata
you put in. The encoding, or encodings, used are up to you.
Are you seeing metadata queries returning a different sequence of bytes
than the sequence of bytes that were originally stored? If not, I don't
think this is a bug.
Instead of profiling git annex copy --to remote, I profiled git annex
find --not --in web, which needs to do the same kind of location log lookup.
total time = 12.41 secs (12413 ticks @ 1000 us, 1 processor)
total alloc = 8,645,057,104 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
adjustGitEnv Git.Env 21.4 37.0
catchIO Utility.Exception 13.2 2.8
spanList Data.List.Utils 12.6 17.9
parsePOSIXTime Logs.TimeStamp 6.1 5.0
catObjectDetails.receive Git.CatFile 5.9 2.1
startswith Data.List.Utils 5.7 3.8
md5 Data.Hash.MD5 5.1 7.9
join Data.List.Utils 2.4 6.0
readFileStrictAnyEncoding Utility.Misc 2.2 0.5
The adjustGitEnv overhead is a surprise! It seems it is getting called once
per file, and allocating a new copy of the environment each time. Call stack:
withIndex calls withIndexFile calls addGitEnv calls adjustGitEnv.
Looks like simply making gitEnv be cached at startup would avoid most of
the adjustGitEnv slowdown.
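A minimal sketch of that caching idea, with hypothetical helper names (not git-annex's actual internals): read the environment once at startup, and have adjustments start from the cached copy instead of re-reading and re-allocating it per file.

```haskell
import System.Environment (getEnvironment)
import System.IO.Unsafe (unsafePerformIO)

-- Read the process environment exactly once; subsequent adjustments
-- reuse this cached copy instead of calling getEnvironment per file.
{-# NOINLINE cachedGitEnv #-}
cachedGitEnv :: [(String, String)]
cachedGitEnv = unsafePerformIO getEnvironment

-- Prepend an override (e.g. GIT_INDEX_FILE) without re-reading the
-- environment; any existing binding for the variable is dropped.
addGitEnv :: String -> String -> [(String, String)] -> [(String, String)]
addGitEnv var val env = (var, val) : filter ((/= var) . fst) env
```

The `NOINLINE` pragma is what makes the `unsafePerformIO` cache safe to share: GHC will evaluate it at most once.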
(The catchIO overhead is a false reading; the detailed profile shows
that all its time and allocations are inherited. getAnnexLinkTarget
is running catchIO in the expensive case, so readSymbolicLink is
the actual expensive bit.)
The parsePOSIXTime comes from reading location logs. It's implemented
using a generic Data.Time.Format.parseTime, which uses a format string
"%s%Qs". A custom parser that splits into seconds and picoseconds
and simply reads both numbers might be more efficient.
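A hand-rolled parser along those lines might look like this. This is a hedged sketch, not git-annex's actual implementation: it splits on the decimal point and reads the two numbers directly, skipping the generic format-string machinery (the trailing "s" is assumed to be stripped by the caller).

```haskell
import Data.Time.Clock.POSIX (POSIXTime)
import Data.Ratio ((%))

-- Parse "<seconds>[.<fraction>]" directly, e.g. "1287290776.765535",
-- without going through Data.Time.Format.parseTime and "%s%Qs".
parsePOSIXTime :: String -> Maybe POSIXTime
parsePOSIXTime s = case break (== '.') s of
  (secs, "") -> fromIntegral <$> readMaybeInt secs
  (secs, _:frac) -> do
    n <- readMaybeInt secs
    f <- readMaybeInt frac
    let denom = 10 ^ length frac
    return (fromIntegral n + fromRational (f % denom))

-- Strict read of an integer, rejecting trailing garbage.
readMaybeInt :: String -> Maybe Integer
readMaybeInt str = case reads str of
  [(n, "")] -> Just n
  _ -> Nothing
```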
catObjectDetails.receive is implemented using mostly String and could
probably be sped up by being converted to use ByteString.
Built git-annex with profiling, using stack build --profile
(For reproducibility, running git-annex in a clone of the git-annex repo
https://github.com/RichiH/conference_proceedings with rev
2797a49023fc24aff6fcaec55421572e1eddcfa2 checked out. It has 9496 annexed files.)
Profiling git-annex find +RTS -p:
total time = 3.53 secs (3530 ticks @ 1000 us, 1 processor)
total alloc = 3,772,700,720 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
spanList Data.List.Utils 32.6 37.7
startswith Data.List.Utils 14.3 8.1
md5 Data.Hash.MD5 12.4 18.2
join Data.List.Utils 6.9 13.7
catchIO Utility.Exception 5.9 6.0
catches Control.Monad.Catch 5.0 2.8
inAnnex'.checkindirect Annex.Content 4.6 1.8
readish Utility.PartialPrelude 3.0 1.4
isAnnexLink Annex.Link 2.6 4.0
split Data.List.Utils 1.5 0.8
keyPath Annex.Locations 1.2 1.7
This is interesting!
Fully 40% of CPU time and allocations are in list (really String) processing,
and the details of the profiling report show that spanList and startswith
and join are all coming from calls to replace in keyFile and fileKey.
Both functions nest several calls to replace, so perhaps that could be unwound
into a single pass and/or a ByteString used to do it more efficiently.
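The single-pass unwinding suggested above could look something like this sketch. The escape table here is illustrative, not git-annex's real keyFile/fileKey mapping: the point is that each character is inspected once, instead of once per nested replace call.

```haskell
-- One traversal: map each special character to its escape sequence.
escapeFileName :: String -> String
escapeFileName = concatMap esc
  where
    esc '/' = "%"
    esc '%' = "&s"
    esc '&' = "&a"
    esc ':' = "&c"
    esc c   = [c]

-- Inverse, also in a single pass over the input.
unescapeFileName :: String -> String
unescapeFileName [] = []
unescapeFileName ('%':cs)     = '/' : unescapeFileName cs
unescapeFileName ('&':'s':cs) = '%' : unescapeFileName cs
unescapeFileName ('&':'a':cs) = '&' : unescapeFileName cs
unescapeFileName ('&':'c':cs) = ':' : unescapeFileName cs
unescapeFileName (c:cs)       = c   : unescapeFileName cs
```

The same shape ports directly to ByteString (e.g. via Data.ByteString.concatMap or a builder), which would also cut the allocation overhead the profile attributes to spanList and join.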
12% of run time is spent calculating the md5 hashes for the hash
directories for .git/annex/objects. Data.Hash.MD5 is from missingh, and
it is probably a quite unoptimised version. Switching to the version
in cryptonite would probably speed it up a lot.
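For comparison, a minimal sketch of computing the hex digest with cryptonite's optimised MD5 (assumes the cryptonite and bytestring packages; md5Hex is a hypothetical helper, not git-annex's actual code):

```haskell
import Crypto.Hash (hash, Digest, MD5)
import qualified Data.ByteString.Char8 as B8

-- Hex MD5 of a key name, as would be used to pick the hash directory.
-- show on a Digest renders it as lowercase hex.
md5Hex :: String -> String
md5Hex = show . (hash :: B8.ByteString -> Digest MD5) . B8.pack
```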
As an aside, I just tried syncthing's arm version, and I get a "runtime: kernel page size (32768) is larger than runtime page size (4096)" error. Their arm64 version also won't run.
Perhaps it's a similar issue with git-annex? The page size seems to come up in a number of contexts when people try to get software running on the WD NAS.
Is there any way to adjust this page size in the linker?
And there is a complication with running [git annex copy --from --to] at the same time as eg git annex get of the same file. It would be surprising for get to succeed (because copy has already temporarily downloaded the file) and then have the file later get dropped.
A solution to this subproblem would transparently fall out of a facility for logically dropping files, which was briefly talked about a long time ago. Just mark the file as logically dropped. If the user git annex gets it while the copy-out is in progress, its status will change to "present", so copy will know not to physically delete it.
(Of course there are race conditions involved, but I presume/hope that they're no worse than git-annex already has to deal with.)
THANK YOU for implementing this feature -- we will make use of it soon.
But so that we don't have to do reverse estimation from "byte-progress" and "percent-progress", and don't have to get it from a key (which might not have it, e.g. in the case of the URL relaxed backend) -- could you just include in each record the "byte-target" (if known), or something like that? Thanks in advance!
(Let's not discuss the behavior of copy --to when the file is not
locally present here; there is plenty of other discussion of that in
Agreed, it's kind of secondary.
git-annex's special remote API does not allow remote-to-remote
transfers without spooling it to a file on disk first.
yeah, i noticed that when writing my own special remote.
And it's not possible to do using rsync on either end, AFAICS.
That is correct.
It would be possible in some other cases but this would need to be
implemented for each type of remote as a new API call.
... and would fail for most, so there's little benefit there.
how about a socket or FIFO of some sort? i know those break a lot of
semantics (e.g. [ -f /tmp/fifo ] fails in bash) but they could be a
Modern systems tend to have quite a large disk cache, so it's quite
possible that going via a temp file on disk is not going to use a
lot of disk IO to write and read it when the read and write occur
fairly close together.
true. there are also in-memory files that could be used, although I
don't think this would work across different process spaces.
The main benefit from streaming would probably be if it could run
the download and the upload concurrently.
for me, the main benefit would be to deal with low disk space
conditions, which is quite common on my machines: i often cram the
disk almost to capacity with good stuff i want to listen to
later... git-annex allows me to freely remove stuff when i need the
space, but it often means i am close to 99% capacity on the media
drives i use.
But that would only be a benefit sometimes. With an asymmetric
connection, saturating the uplink tends to swamp downloads. Also,
if download is faster than upload, it would have to throttle
downloads (which complicates the remote API much more), or buffer
them to memory (which has its own complications).
that is true.
Streaming the download to the upload would at best speed things up
by a factor of 2. It would probably work nearly as well to upload
the previously downloaded file while downloading the next file.
presented like that, it's true that the benefits of streaming are not
good enough to justify the complexity - the only problem is large
files and low local disk space... but maybe we can delegate that
solution to the user: "free up at least enough space for one of those
files you want to transfer".
[... -J magic stuff ...]
And there is a complication with running that at the same time as eg
git annex get of the same file. It would be surprising for get to
succeed (because copy has already temporarily downloaded the file)
and then have the file later get dropped. So, it seems that copy
--from --to would need to stash the content away in a temp file
somewhere instead of storing it in the annex proper.
My thoughts exactly: actually copying the files to the local repo
introduces all sorts of weird --numcopies nastiness and race
conditions, it seems to me.
thanks for considering this!
Here's my use case (much simpler).
Three git repos:
desktop: normal checkout, source of almost all annexed files, commits, etc. The only place I run git annex commands. Not enough space to store all annexed files.
main_external: bare git repo, stores all annexed file contents, but no file tree. Usually connected. Purpose: primary backups.
old_external: like main_external, except connected only occasionally.
I periodically copy from desktop to main_external. That's all well and good.
The tricky part is when I plug in old_external and want to get everything on there. It's hard to get content onto old_external that is stored only on main_external. That's when I want to:
git annex copy --from=main_external --to=old_external --not --in old_external
Note that this would not copy obsolete data (i.e. data only referenced from old git commits) stored in old_external. I like that.
To work around the lack of that feature, I try to keep copies on desktop until I've had a chance to copy them to both external drives. It's good for numcopies, but I don't like having to keep track of it, and I wish I could choose to let there be just one copy of replaceable data on main_external.