Hi,
Context
I use git-annex to work with my files on multiple computers, and also to collaborate with other people on my files. Since I don't want to share all my files with all the devices (especially not those of other people), I have several different annex-repos. Of course there is location tracking, and file content is not required on all devices. Still, for some devices I want a clear separation in the sense that there should not be any trace that the files from the other repos even exist. However, sometimes I want to change the "permissions" for a file or directory (meaning I have to move it to another repo).
Problem
Having multiple annex-repos is a reasonable workaround for the use case discussed above. One remaining issue is that moving files or folders from one repo to another is quite inefficient. This is especially true for the remotes.
Example:
I have two annex-repos (Repo-A and Repo-B) on my laptop. Both repos also exist on my server. All content is synced among the devices. On the laptop I move a large subdir from Repo-A to Repo-B. If I then sync both repos with the server, I have to re-upload all the files even though they are already there, just in a different repo.
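One way this plays out, as a rough sketch (the paths, the subdir name, and the remote name "server" are placeholders, and I assume locked/symlinked files):

    # on the laptop: move a large subdir from Repo-A to Repo-B
    cd ~/Repo-A
    git annex get big-subdir          # the content has to be present locally
    cp -rL big-subdir ~/Repo-B/       # -L dereferences the annex symlinks
    git rm -r big-subdir
    git commit -m "move big-subdir to Repo-B"

    cd ~/Repo-B
    git annex add big-subdir
    git commit -m "import big-subdir from Repo-A"

    git annex sync
    git annex copy --to server        # re-uploads everything, even though the server
                                      # already has identical content in its Repo-A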
Question
Is there an efficient way to move files from one repo to another that works in the example given above?
Possible approach
I could imagine something like device-wide location tracking. In this case Repo-B could check "Is this file available in another repo on the same device?" and get it locally. However, for my use case it is really important that this device-wide location tracking is not synchronized among all the remotes.
Example:
Again there are Repo-A and Repo-B on the laptop and the server. Another repo, Repo-C, is on my laptop and my phone. The phone knows nothing about Repo-A and Repo-B and is not even aware that the server exists, and it should stay that way. This means that the laptop must not tell the phone about the other remotes and the other repos either.
Hi Ilya_Shlyakhter, thanks for your answer.
I'm not sure about using different branches. What prevents other users from just checking out those branches? Also, the git-annex branch tracks the clear names of all remotes. Moreover, it's a scalability question: assume the server has thousands (millions?) of files while the phone has only a few. The phone would still be "bothered" with all the location tracking information of the server.
But one thing in the »local caching« tip caught my attention: "git config remote.cache.annex-speculate-present true"
The tip says: "The annex-speculate-present setting is the essential part. It makes git-annex know that the cache repository may contain the content of any annexed file. So, when getting a file, git-annex will try the cache repository first."
Together with "git config remote.cache.annex-pull false; git config remote.cache.annex-push false" this could be pretty close to my "local location tracking" idea.
It further says: "The cache repository will remain an empty git repository (except for the content of annexed files). This means that the same cache can be used with multiple different git-annex repositories, without intermingling their git data." Thus, there should be no »information leaks« between two repos that both use this cache.
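For reference, my understanding of the setup from that tip is roughly the following (the cache path, the remote name, and the cost value are assumptions on my side; the tip has the exact recipe):

    # create the shared local cache repository once
    git init --bare ~/.annex-cache
    git -C ~/.annex-cache annex init cache

    # in each local repo (Repo-A, Repo-B, ...) that should use it
    git remote add cache ~/.annex-cache
    git config remote.cache.annex-speculate-present true
    git config remote.cache.annex-pull false
    git config remote.cache.annex-push false
    git config remote.cache.annex-cost 10    # try the cache before any network remote
    git annex untrust cache                  # content in the cache may disappear at any time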
One thing I would have to work around is that I should not do "git annex copy --to server --not --in server". Or at least, before doing this I should always log into the server and do "git annex get" there first.
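In other words, something along these lines (host and path are placeholders):

    # let the server-side repo first grab whatever is already available to it
    # locally, then only transfer what is still missing
    ssh server 'git -C /path/to/Repo-B annex get .'
    git annex copy --to server --not --in server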
Thanks again for your suggestion. I will give this approach a try.
I discovered one potential problem with this approach. The "server" in my example typically has "--bare" repos created by gitolite. I'm not sure if I can convince such a repo to get files from the cache. But I will try and report my findings.
Is there a way to tell a remote repo (e.g. the server) to get files from another remote?
If you perform the same move between repos on both the laptop and the server, git-annex will generate the same keys for the files on both of them, and so won't need to transfer the data again.
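For illustration (the path is a placeholder): the default backends derive the key from the file's content, so adding the same file in either repo, on either machine, produces the same key:

    # the key depends only on the content (plus the extension for SHA256E),
    # not on the repo or the path
    git annex lookupkey big-subdir/some-file.iso
    # -> SHA256E-s<size>--<sha256 of the content>.iso   (identical on laptop and server)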
Joey, that's true. However, one of the things I like most about git-annex is that I don't have to repeat operations (like renaming/moving files) on each remote. Therefore, I would like to automate this behavior as well, even if the move operation spans two repos (provided both repos exist on the remote).
Probably such "actions" (get files from another local repo) could be checked-in as "todos" into the repo. Then the remote could try to perform it after a sync. However, I imagine that its hard to design this in a straight forward way and could get pretty messy. Therefore, I thought some kind of "local location tracking" could be the cleaner solution.
My preferred solution would be the following, but I guess it would require changes in git-annex: multiple local repos can share a single .git/annex/objects directory. Typically a repo would only know about its own files in there (and would not care about files put there by other repos at all, and especially not leak them into the location tracking). But if a file is checked into a repo ("sync", not "sync --content") and the content of this file is already present (e.g. because it also exists in another repo), the repo would detect that the content is there and update the location tracking for this file.
I presume the "local cache" approach with hardlinks is the next best thing to this idea, that not requires changes to git-annex. E.g. some scripting/hooks that performs a "get" operation after a sync..
Ilya_Shlyakhter, thanks for the link. Apparently "git annex get" also works in bare repos (but I still have to find out how to use this with gitolite; maybe I can write some custom hook or something...).
About the refspecs etc.: I think my git knowledge is not sufficient to fully grasp your idea. But there's one thing that makes me hesitate. I'm looking for a solution that is as little "special" as possible. If I have to remember what I was thinking when I designed all this, I'm pretty sure things will break eventually. Thus, in a good solution everything should work exactly as with regular git-annex. If I have to remember something to do things (e.g. move operations) more efficiently, that's okay, as long as nothing breaks when I just use my repo like a regular git-annex repo. I'm not sure if your approach would provide this.
Also, I suppose I would have to "design" one mega-repository for all my current (and probably future) use cases. A loose coupling of otherwise separate repos does not seem to be possible, as far as I understand the solution. This could especially become an issue for repos I want to share with others.
I have had a lot of success with some scripts I have written that externally assist git-annex in moving large files. Perhaps they can at some point be integrated into git-annex. I understand that this probably isn't a priority for Joey.
One thing that has made a huge difference for me is asynchronous verification of content. I have a turbo-copy script which simply copies keys from one repo to another externally, and then triggers an fsck in the target repo after it has finished copying everything. In many cases, this makes a huge improvement in transfer speed. I might have it start an fsck in the target as soon as each file is copied externally.
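The core of the idea, as a rough sketch rather than the actual script (two non-bare repos on the same machine; paths are placeholders, and a real script also has to deal with git-annex's read-only object directories):

    # copy the annexed objects directly, bypassing git-annex's own transfer code
    rsync -a --ignore-existing ~/Repo-A/.git/annex/objects/ ~/Repo-B/.git/annex/objects/
    # then let git-annex verify the copied content and update its location tracking
    git -C ~/Repo-B annex fsck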
I have also been experimenting with using make to externally manage git-annex pipelines; for certain backends and smaller files it makes more sense to run more simultaneous copies.
One thing that would be amazingly helpful is if there were a way for a backend to inform git-annex ahead of time what it intends to do. For me, when I encrypt and upload files between 40M and 400M to Google Drive, git-annex spends about 50% of its time encrypting and 50% uploading. It would substantially improve performance if git-annex could get to work on the next file while the current one is uploading. I don't know if it is possible for a backend to do this. I think it would have to falsely claim success to get git-annex to give it the next file.