Recent comments posted to this site:
What's the reason for not supporting importtree via webdav?
It would be nice to keep a tree in sync on my Nextcloud and sync it to my phone, etc.
To access the manifest and bundles, one needs the UUID of the special remote initially configured. Then one can run
git clone 'annex::<UUID>?type=directory&encryption=none&directory=/path/to/space%20sanitized%20directory'
This is a bit tedious, both because every setting has to be typed out (even those the remote helper does not show when pushing from the initial repo; in this case the directory, in other cases all the settings required to init the remote in the first place) and because any characters not allowed in a URL have to be percent-encoded. But it is doable.
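The escaping of the directory path can be scripted with percent-encoding. The following is a minimal sketch and not part of git-annex: urlencode is a hypothetical helper name, it assumes an ASCII path, and it leaves / unescaped to match the URL shown above.

```shell
# Percent-encode a string for use as a value in the annex:: URL.
# Letters, digits and . _ ~ / - are kept; every other byte becomes %XX.
# Assumption: the input is plain ASCII.
urlencode() (
    LC_ALL=C
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}         # everything after the first character
        c=${s%"$rest"}      # the first character itself
        s=$rest
        case $c in
            [A-Za-z0-9._~/-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
)

# Usage (with <UUID> replaced by the real UUID):
#   dir=$(urlencode '/path/to/space sanitized directory')
#   git clone "annex::<UUID>?type=directory&encryption=none&directory=$dir"
```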
The other option would be to clone manually by initializing a new empty repo, then adding the special remote the normal git-annex way. This doesn't work just yet because --uuid is not an allowed option for initremote. It would be nice if this were an option, simply to avoid the tedium of typing the URL as above (one could copy and paste the output of git --no-pager show git-annex:remote.log into initremote).
Despite the URL tedium, an exciting result of the current system is that any number of repos and file annexes can share one directory, like an entire organization (or repo group) in one folder. Datalad has a similar archetype (remote indexed archives, RIA), which offers (slightly) improved user friendliness by filing each repo UUID into meaningfully named folders (the unhashed first-three/remaining split is nice for actually being the UUID, but it still doesn't let me easily copy and paste the UUID for cloning). That said, I kind of like how git-annex's implementation encourages a single unified "annex" (rather than RIA's UUID/Annex, which gives each UUID a separate annex) and, of course, bundles over loose git files, especially for cloud special remotes, where uploading each and every loose file can be slow.
Looking forward to seeing how this feature develops!
Perhaps Joey can help me out here a bit with some background knowledge:
I've been seeing sporadic corruption with this setup:
- chunking
- encryption
- old helper program git-annex-remote-rclone
- rclone's pcloud backend
It seems that, with the pcloud backend, rclone keeps partial files under the name of the full file when a transfer is interrupted. (This is for rclone <= 1.67.0; 1.68.0 has changes for pcloud, which may fix this.) My theory of how the corruption might have happened:
- First interrupted run of git-annex uploads chunks A and a partial(!) chunk B
- Second run skips chunks A and B(!) and proceeds to upload the rest of the chunks (C and D)
- At the end we have uploaded A, C and D and a corrupted/partial chunk B
Joey: Is this a possible error scenario?
git annex assist directly after a git clone, wondering why I'm getting a million files shoved into my face, CTRL+C'ing it, being left with a weird unclean work tree for the download-aborted unlocked files, so I have to git restore . again, then configuring git annex wanted present before I continue.
Is there any way to set a default preferred content setting -- either used when a new clone is made or whenever a repo doesn't specify one?
I've got an annex that has a couple of servers with all the content, and several clients[1] -- which I create more often and more manually -- that just want the content I pick. Basically every time I set up another client, I run git annex sync --content, am surprised to see a bunch of get ... lines, go kill the sync, set group and preferred content to be manual/standard, and run the sync again. It'd be handy if I could set up the repo in advance to just configure that by default. (I guess I could make an alias that does something like git clone $server/$repo && cd $repo && git annex wanted . standard && git annex group . manual, but it'd be nice if I could just do the git clone I'm used to and it would all work.)
[1] AIUI, the "client" group means "get every file referenced in HEAD, unless it's in archive/, and skip older versions"? I guess that makes sense for like a software project with some media assets. I've mostly used git-annex for situations where most files aren't being actively worked with and clients only have a few of them, which is where it seems to really shine over GitLFS. I've always been vaguely surprised by how the client group works as a result. Any sense of how commonly people use it for different use cases? It is excellent for the sparse checkout case though.
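The alias idea above could look roughly like the following. This is only a sketch of the workaround, not a built-in feature: annex_clone_manual is a hypothetical name, and the default target directory is assumed to be the last component of the URL with any .git suffix removed.

```shell
# Hypothetical helper: clone a repository and immediately put the new clone
# in the "manual" group with "standard" preferred content, so that a later
# `git annex sync --content` does not start fetching every file.
# The body runs in a subshell so the caller's working directory is unchanged.
annex_clone_manual() (
    url=$1
    # Assumed default: last path component of the URL, minus ".git".
    dir=${2:-$(basename "$url" .git)}
    git clone "$url" "$dir" &&
        cd "$dir" &&
        git annex group . manual &&
        git annex wanted . standard
)

# Usage:
#   annex_clone_manual "$server/$repo"
```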
Thanks to Thowz for the above solution.
There are a couple of scaling issues with large numbers of files (100K+ files in my case), which make it run slowly and actually break the command line length limit ("Argument list too long").
Here's my modification for the first two commands:
# Enable write permissions on *directories* containing misfiled items
find -xtype l -printf "%l\0" \
|sed -z -r "s#.*(\.git/annex/objects)/(.)(.)/(.)(.)/([^/]*).*#\1/[\L\2\U\2][\L\3\U\3]/[\L\4\U\4][\L\5\U\5]/\E\6#" \
|sort -z \
|uniq -z \
|xargs -0 -ifoo bash -c "chmod u+w foo"
# Reinject the *files* into the annex (note different sed pattern)
find -xtype l -printf "%l\0" \
|sed -z -r "s#.*(\.git/annex/objects)/(.)(.)/(.)(.)/(.*)#\1/[\L\2\U\2][\L\3\U\3]/[\L\4\U\4][\L\5\U\5]/\E\6#" \
|sort -z \
|uniq -z \
|xargs -0 -ifoo bash -c "git-annex reinject --known foo"
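To see what the first sed expression produces, it can be run on a single symlink target. This sketch assumes GNU sed (for -z, \L, \U and \E); objdir_glob is a hypothetical wrapper name and the key in the usage comment is invented for illustration.

```shell
# Turn one symlink target into a shell glob for its object directory.
# Each hash-directory character is rewritten as [lowerUPPER], so the glob
# matches the directory whichever case it ended up with on disk.
objdir_glob() {
    printf '%s\0' "$1" |
        sed -z -r 's#.*(\.git/annex/objects)/(.)(.)/(.)(.)/([^/]*).*#\1/[\L\2\U\2][\L\3\U\3]/[\L\4\U\4][\L\5\U\5]/\E\6#' |
        tr '\0' '\n'
}

# Example with an invented key:
#   objdir_glob '../.git/annex/objects/Xq/7W/SHA256E-s10--abc/SHA256E-s10--abc'
# prints
#   .git/annex/objects/[xX][qQ]/[77][wW]/SHA256E-s10--abc
```

Because the leading part of the path is stripped, the resulting glob is relative to the top of the repository, which is why bash -c has to expand it from there.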
If you used bsdtar (or some other method that attempts to copy over Apple metadata resource forks) you'll see a ton of ._-prefixed files in your archive. If you're using this on Linux going forward, know you don't want any of that metadata, and want the directory cleanups below to actually succeed, you'll want to delete these files with something like this:
find .git/annex/objects -name ._\* -print0 |xargs -0 rm
You can then continue with his last two commands:
# Remove empty directories (rmdir will fail on the non-empty directories)
find .git/annex/objects -mindepth 3 -maxdepth 3 -type d -exec rmdir {} \;
find .git/annex/objects -mindepth 2 -maxdepth 2 -type d -exec rmdir {} \;
# And, if you want to be thorough, add this one...
find .git/annex/objects -mindepth 1 -maxdepth 1 -type d -exec rmdir {} \;
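The three rmdir passes could probably be collapsed into a single command. This is a sketch rather than the method from the comment above; it assumes a find that supports -empty and -delete (GNU and BSD find both do), and prune_empty_dirs is a hypothetical name. Because -delete implies depth-first traversal, a directory that becomes empty once its children are removed is deleted in the same pass.

```shell
# Remove every empty directory below the given root, deepest first.
# The root itself is kept because of -mindepth 1.
prune_empty_dirs() {
    find "$1" -mindepth 1 -type d -empty -delete
}

# Usage:
#   prune_empty_dirs .git/annex/objects
```

Unlike the depth-limited commands, this removes empty directories at any depth below the root, which should be equivalent for the usual two-level hash layout.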
Another place this came up is https://git-annex.branchable.com/design/passthrough_proxy/#index14h2 where a proxy to an encrypted special remote necessarily does encryption server side, but the user may not want the server to see their unencrypted files.
There I suggested "adding a special remote that does its own client-side encryption in front of the proxy". Such a layered special remote could also be used with a git remote. There would be some complexity cost, since you would have two remote names, one used for git and the other for git-annex.
Implementing object encryption in git remotes is certainly possible, but it would be a special case and the existing code for encrypting special remotes (particularly Remote.Helper.Special.specialRemote) would not be able to be reused.
There's also the problem that, if such a git repository is added as a regular remote, and the git-annex branch that indicates that it is encrypted has not yet been pulled, git-annex would not realize that it is supposed to be encrypted, so would send unencrypted objects to it. This seems like an easy situation to accidentally get into, e.g.:
git remote add foo http://example.com/
git annex move --to foo # oops unencrypted
Overall I prefer the idea of layering an encrypted special remote to complicating the git remote with encryption. Enabling that special remote could make git-annex treat the underlying remote as annex-ignore, to prevent accidentally sending unencrypted objects to it.
There could also be situations where you want to store some files unencrypted on a git hosting site to let them be accessible via its UI, but encrypt other files, and the layered special remote also allows for that kind of thing.
I'd like to set a few additional configurations so that all clones treat a special remote similarly.
Particularly I'd like to set the trustlevel and tracking-branch for an exporttree special remote so that any clone that enables this remote also has these configurations enabled. This is justified for a certain remote of mine because it exports to a version-controlled environment that I trust, so it would just be nice not to have to run
git config remote.name.annex-tracking-branch
and
git annex trust name semitrusted
for every clone.
Of course, are
git annex config --set remote.name.annex-trustlevel "semitrusted"
and
git annex config --set remote.name.annex-tracking-branch "main"
(called once) any easier than the above called multiple times? Maybe not, but it would be slightly less mental overhead not to have to do the above.
Offhand, can you imagine any caveats that would preclude adding these settings to the list of settings supported by this command? I agree that only some settings make sense for all clones to see, rather than anything one can set in git config, but of course that specification requires manually adding the config cases that do make sense. Maybe this is one of them.