Please describe the problem.

I have a directory special remote with exporttree=yes (encryption=none) on a USB hard drive. Both git annex sync --content and git annex export write at only about 400 KiB/s, so exporting a 9 GB DVD ISO takes a whole night.

The drive is not blazing fast, but:

  • sync; dd if=/dev/zero of=tempfile bs=1M count=10; sync gives something around 10MB/s (don't recall the exact number)
  • rsync (with --progress turned on) copies files at 2.35 MB/s
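
To put these rates in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes "9GB" means 9 * 10^9 bytes and that the MB/s figures are decimal megabytes; the helper names are made up for illustration. At 400 KiB/s the ISO needs roughly six hours, which matches the "whole night" impression.

```python
# Rough transfer-time estimate for the rates quoted above.
# Assumptions: "9GB" = 9 * 10**9 bytes, "MB/s" = 10**6 bytes per second.

ISO_SIZE = 9 * 10**9

RATES = {
    "git-annex export (400 KiB/s)": 400 * 1024,
    "rsync (2.35 MB/s)": 2.35 * 10**6,
    "dd (about 10 MB/s)": 10 * 10**6,
}


def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Time in seconds to move size_bytes at a constant rate."""
    return size_bytes / rate_bytes_per_s


def format_duration(seconds):
    """Render a duration as hours and whole minutes."""
    hours, rem = divmod(int(round(seconds)), 3600)
    return f"{hours}h {rem // 60:02d}m"


if __name__ == "__main__":
    for label, rate in RATES.items():
        print(f"{label}: {format_duration(transfer_seconds(ISO_SIZE, rate))}")
```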

mount for this drive shows:

/dev/sdc1 on /media/thk/thk-sg1 type ext4 (rw,nosuid,nodev,relatime,sync,stripe=8191,uhelper=udisks2)

I tried to mount the drive without sync but failed. Even with the udisks2 service stopped, I could not manually mount the drive without sync (or with async). It always ended up mounted with sync.

What steps will reproduce the problem?

TODO(thk): try other drive and other laptop once the current transfer finishes...

Update 2020-03-07:

  • Export to a different USB drive (both Seagate, same size, similar age) from the same machine with the exact same setup (but an NTFS filesystem) runs at ~80 MiB/s. So this is perfect. This time there is also no problem with a lost exporttree=yes config.

Update 2020-03-10:

  • Export from a different laptop to the same drive is blazingly fast, at >200 MB/s.

What version of git-annex are you using? On what operating system?

  • git-annex version: 8.20200227-gf56dfe791
  • Debian testing with Kernel 5.2.17

Please provide any additional information below.

I have now learned that there is no Linux kernel primitive for copying a file, and that doing it well is actually a high art:

http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c

I was surprised to see the implementation of meteredWrite in Utility/Metered.hs. I had hoped there would be a Haskell standard library function for efficient file copying. I wonder how rsync implements its progress meter, and whether the progress meter is the reason why rsync had a slower write speed than dd.

Maybe it would make sense to call out to the cp command and just issue a stat() every few seconds for the progress meter? This is what I do to monitor cp progress manually.
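
A minimal sketch of that idea in Python, only to illustrate the polling approach: the copy runs in a background thread and the destination is stat()ed at a fixed interval. It uses shutil.copyfile as a stand-in for calling out to cp, and the function name and report callback are made up for this example; it is not how git-annex does it.

```python
import os
import shutil
import threading


def copy_with_polled_progress(src, dst, interval=1.0, report=None):
    """Copy src to dst, polling the destination size for progress.

    report, if given, is called as report(bytes_done, bytes_total).
    Returns the total size in bytes.
    """
    total = os.stat(src).st_size
    errors = []

    def worker():
        # Stand-in for "call out to cp"; any external copier would do.
        try:
            shutil.copyfile(src, dst)
        except OSError as exc:
            errors.append(exc)

    thread = threading.Thread(target=worker)
    thread.start()

    while thread.is_alive():
        # Wake up every `interval` seconds, or as soon as the copy ends.
        thread.join(timeout=interval)
        if thread.is_alive() and report is not None:
            try:
                done = os.stat(dst).st_size
            except FileNotFoundError:
                done = 0  # destination not created yet
            report(done, total)

    if errors:
        raise errors[0]
    if report is not None:
        # One final report so the meter always ends at the real size.
        report(os.stat(dst).st_size, total)
    return total
```

The copy loop itself never touches the meter, so the progress display cannot slow the writes down. One caveat: on a mount without sync, the size reported by stat() can run ahead of what has actually reached the disk.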

I have no clue, but maybe these could help for fast file copying in Haskell?

Have you had any luck using git-annex before?

Well, I'm coming back to git-annex after several years. So far it is better than I remembered:

  • Tor support is great and solves the need for a central server
  • I hope that the sqlite integration will now make large collections of files manageable
  • Finally we have exporttree, yeah!

2020-03-07 update

Turns out, the problem is more complex. I wanted to be clever. When I set up the two synced annex repos I made the mistake of not specifying exporttree=yes at the beginning. But I wanted to re-use the initial name. So I tried hard to remove all evidence of the previous existence of a special remote with that name from git-annex.

I checked out the git-annex branch in a separate worktree (see man git worktree) in both repositories, deleted the lines for that remote from remote.log and pushed to the other repo (not git annex sync). I even made the changes in parallel in both repos before pushing in both directions so that the special merge does not bring the lines back. I actually was sure there was nothing left of the old remotes. Of course I also deleted them from .git/config.

Somehow, there is again a line in remote.log for that remote without exporttree=yes. So now, after the last git annex sync --content, I have a mixture of an exported tree and an exported annex object store in the same special remote dir.
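
A small Python sketch for spotting such a line. It assumes each remote.log line is a UUID followed by whitespace-separated key=value pairs and a trailing timestamp field, with no spaces inside values; the UUIDs and the remote name in the sample are made up. The text to check could come from git show git-annex:remote.log.

```python
def parse_remote_log(text):
    """Parse remote.log text into a list of (uuid, config dict) pairs.

    Assumed line format: UUID key=value key=value ... timestamp=N
    """
    remotes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        uuid, pairs = fields[0], fields[1:]
        config = dict(pair.split("=", 1) for pair in pairs if "=" in pair)
        remotes.append((uuid, config))
    return remotes


def missing_exporttree(text, name):
    """Return UUIDs of entries named `name` that lack exporttree=yes."""
    return [
        uuid
        for uuid, config in parse_remote_log(text)
        if config.get("name") == name and config.get("exporttree") != "yes"
    ]


# Hypothetical sample: two entries share a name, only one has exporttree=yes.
SAMPLE = """\
uuid-1 directory=/media/usb exporttree=yes name=usbdrive type=directory timestamp=1583500000s
uuid-2 directory=/media/usb name=usbdrive type=directory timestamp=1583400000s
"""

if __name__ == "__main__":
    print(missing_exporttree(SAMPLE, "usbdrive"))
```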

I also noticed that the repo that was so slow did not have the remote.$REMOTE.annex-tracking-branch config. But I could still run sync --content somehow. After adding this config, the last sync actually ran with 2 MB/s but it still wrote in object store format, not as an exported tree.

Some questions:

  • Is there any other place where git-annex stores information about remotes than remote.log?
  • The object store files in the remote were stored in the format AAA/BBB/$HASH, with three-character directory names, while in .git/annex/objects the folders have two characters. What are those characters? I believe the three-character format is for remotes that potentially do not distinguish letter case?
  • Is there a command to get the full path of a file in the object store (two or three letters) from the hash?
  • Maybe there is still a bug. Is there a possibility that git-annex could forget that a remote is configured with exporttree=yes? Especially if I export to the same directory on the same USB drive from two synced repos?

done