Recent comments posted to this site:

Thank you Joey! I hope that release is coming sometime soon?
Comment by yarikoptic Thu Apr 25 21:11:56 2024

git config annex.uuid seems reliable enough to me - the uuid is stored in .git/config and doesn't exist if it's not annex-inited (assuming no malicious behavior).

If you're looking for the very cheapest, perhaps something like a file-exists check on .git/annex would work? I don't know if there are any edge cases with this one, though.
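For illustration, both checks in shell (a minimal sketch, assuming an ordinary non-bare layout and no malicious tampering):

    # reliable: annex.uuid only exists in .git/config once the repo is annex-inited
    git config annex.uuid >/dev/null && echo "annex-inited"

    # cheapest: a bare existence check on the annex directory
    test -d .git/annex && echo "probably annex-inited"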

Comment by aurtzy Thu Apr 25 17:35:49 2024

Of course, choosing a backend that does not include the extension is worth considering, unless something needs the object file to preserve the extension. For a .mkv file, I'd guess most video players don't care about the extension.
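For example, the extension-less SHA256 backend (rather than the default SHA256E, which embeds the extension in the key) can be selected per path in .gitattributes:

    # .gitattributes: keys for video files won't carry the extension
    *.mkv annex.backend=SHA256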

annex.maxextensionlength won't help here, but I think it makes sense to add an analogous annex.maxextensions, which would default to 2 (as it currently does, to handle .tar.gz) but could be set to 1.
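If that knob existed, setting it would look like any other annex option (annex.maxextensions is the proposal above, not an existing option at the time of writing):

    # hypothetical: count at most one trailing dot-part as the extension
    git config annex.maxextensions 1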

One might also reasonably argue that filename extensions are never just numbers, but then again, foo.1.pdf foo.2.pdf is a pretty common kind of pattern, although the extent to which the numbers in those names are extensions with any meaning to a program varies. Some archivers do split files into e.g. foo.1 foo.2 foo.3 and use the extensions to put them back together. Anyway, the same kind of problem could happen when not using only numbers.

Comment by joey Thu Apr 18 17:39:52 2024

Sorry for resurrecting this after 2 years; I somehow forgot this discussion was ongoing.

So, first of all, thank you so much for taking the time to write up a very cool server-side solution to the problem. Do I understand your proposal correctly: the server would always store a rewritten git-annex branch, as if it had been correctly written by the client, no matter what the clients do in their own git-annex branches?

And since all the merging in git-annex is line-based, this constant rewriting wouldn't confuse the clients when they git fetch --all + git annex merge? Wouldn't the merge commits in gitk git-annex be very hard to follow?

What I don't understand is this: if we do the rewrite on the central server side, then yes, the server's branch is good, but when the offending client does a git fetch + git annex merge, it will create a merge commit with 2 parents. Will we also straighten that out automatically and delete the "stupid" side on the next push? Doesn't this mean that debugging just becomes more confusing, and that this client will see longer and longer side branches in its graphical view of gitk git-annex?

Let me reflect back on your "comment 5", where you asked the very valid question of what to do in case of a difference of opinions. I think the correct solution is to implement the override feature (in .git/config, as you said) and let it happen unconditionally. If the only way for unwanted UUIDs to appear in my central repo is for someone to use this extra feature, I'm OK with that. I want to prevent accidents, and I certainly don't want to prevent expert power users from achieving their goals when needed, so a local override (even if the end result is pushed back) is 100% fine.
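For concreteness, such an override might be spelled something like this (the option name is purely hypothetical; nothing like it exists today):

    # hypothetical local escape hatch for expert users
    git config annex.allowUnwantedUuids true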

Now that I'm thinking about this as a "reasonable difference of opinion to have", an interesting "solution" comes to mind, one that of course opens up a very big discussion: why, in the design of git-annex, is there one and only one git-annex branch? Git has orphan branches, and it would be legitimate to say that different groups of people working in a repo have different views of the annex, e.g. they consider different repos (or special remotes) important or unimportant. I don't mention this question as a serious redesign proposal, but I'm sure you have had this idea sometime in the past, and if you have any insight or revelations, I'd be happy to read them.
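For reference, git itself happily hosts several independent histories side by side; a second "view of the annex" could in principle live on its own orphan branch (a thought experiment only — git-annex has no support for alternative branch names):

    git checkout --orphan git-annex-groupB    # a new branch with no parent history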

Comment by ErrGe Thu Apr 18 01:17:02 2024

"Instruction deposition" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.

Thanks for the pointer, very useful!

Regarding the points you raised:

Datalad's run feature has been around for some years, and we have seen usage in the wild with command lines that are small programs, with dozens, sometimes hundreds, of inputs. It is true that anything could simply be URL-encoded. However, especially with command patterns (always the same, except for a parameter change), that may be needlessly heavy. Maybe it would compress well (likely), but it still poses a maintenance issue. Say the compute instructions need an update (software API change): updating one shared instruction set is a simpler task than sifting through annex keys and rewriting URLs.

> I don't quite understand the necessity for "Worktree provisioning". If I understand that right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.

We need a worktree different from HEAD whenever HEAD has diverged from the original worktree used for setting up a compute instruction. Say a command needs two input files, but one has been moved to a different directory in the current HEAD. An implementation would then either have to say "no longer available" and force a maintenance update, or be able to provision the respective worktree. Without provisioning capability, we would need to replace the URL-encoded instructions (which would make the key uncomputable in earlier versions), or amend them with an additional instruction set (and now we would start to accumulate cruft, with changes in the git-annex branch having to account for (unrelated) changes in any other branch).
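A provisioning step could be as simple as materializing the recorded tree in a scratch worktree. A minimal sketch, assuming the instruction set stores the commit it was recorded against (RECORDED_COMMIT is a stand-in for that):

    # check out the tree the instruction was recorded against, without touching HEAD
    git worktree add --detach /tmp/compute "$RECORDED_COMMIT"
    # ...run the recorded command inside /tmp/compute...
    git worktree remove /tmp/compute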

Comment by mih Mon Apr 15 05:00:58 2024

I just want to mention that I've implemented/tried to implement something like this in https://github.com/matrss/datalad-getexec. It basically just records a command line invocation to execute, plus all required input files, as base64-encoded JSON in a URL with a custom scheme, which made it surprisingly simple to implement. I haven't touched it in a while and it was more of an experiment, but other than issues with dependencies on files in sub-datasets, it worked pretty well. The main motivation to build it was the mentioned use case of automatically converting between file formats. Of course it doesn't address all of your mentioned points; e.g. trust is something I haven't considered in my experiments at all. But it shows that the current special remote protocol is sufficient for a basic implementation of this.
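Such URLs have roughly this shape (an illustrative sketch only, not the exact datalad-getexec encoding):

    # pack an invocation and its inputs into a custom-scheme URL as base64 JSON
    payload=$(printf '%s' '{"cmd":["convert","in.png","out.jpg"],"inputs":["in.png"]}' | base64 | tr -d '\n')
    echo "getexec:v1-$payload"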

I like the proposed "request one key, receive many" extension to the special remote protocol and I think that could be useful in other "unusual" special remotes as well.

I don't quite understand the necessity for "Worktree provisioning". If I understand that right, I think it would just make things more complicated and unintuitive compared to always staying in HEAD.

"Instruction deposition" is essentially just adding a URL to a key in my implementation, which is pretty nice. Using the built-in relaxed option automatically gives the distinction between generating keys that have never existed and regenerating keys.

Comment by m.risse Sat Apr 13 20:30:56 2024

I just had the same problem: an addurl test failing while building on macOS with nix-shell -p git-annex. Coincidentally, I recorded a video of it. I don't know what's up with the security program it tries to call. I was using a macOS VM via Docker-OSX.
Comment by nobodyinperson Sat Apr 13 15:51:54 2024

I agree, after discussions at distribits it's clear there is use for this in datalad, and in git-annex generally.

Comment by joey Wed Apr 10 16:58:39 2024

I would just clone the repo to the new machine, do git annex init, then rsync the contents of .git/annex/objects, and then run git annex fsck --all to recheck every key it knows about.
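In commands, roughly (hostnames and paths are placeholders):

    git clone ssh://oldhost/path/repo repo
    cd repo && git annex init "new machine"
    rsync -a oldhost:/path/repo/.git/annex/objects/ .git/annex/objects/
    git annex fsck --all    # recheck every key git-annex knows about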

Alternatively, if you're concerned that some keys might not have been properly recorded: in your new repo, after .git/annex/objects has been transferred, you can create an ingestion directory with a flat layout of the copied keys:

    mkdir ingest && find .git/annex/objects -type f -exec mv {} ingest/ \; && git annex reinject --known ingest/*

Finally, if you just want to rebuild it from scratch, do a cp with the -cL options: on macOS, -c makes a reflink copy and -L follows the symlinks. Then delete the target's .git dir and re-create it.
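A sketch of that last approach (macOS cp; re-initializing is one way to "re-create" the .git dir):

    cp -RcL repo repo-rebuilt      # reflink copy that materializes the annexed content
    rm -rf repo-rebuilt/.git
    cd repo-rebuilt && git init && git annex init && git annex add .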

Comment by unqueued Wed Apr 10 12:46:53 2024

For my two cents, I have found git-annex to be a simple enough format that I have only needed basic helper scripts.

But many operations can be done with one or a few lines of code.

Git can do much of the heavy lifting for you in terms of looking stuff up from the git-annex branch, and I find the formats to be quite regular and easy to parse.
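For example, the git-annex branch can be read with plain git: uuid.log lists the known repositories, and each key's location log sits at a hashed path (the key and hash directories below are illustrative):

    git cat-file -p git-annex:uuid.log
    git cat-file -p 'git-annex:f87/4d5/SHA256E-s1048576--e3b0c442.mkv.log'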

I am thinking of bringing some of this together into a PHP library.

But maybe I should just post my pure git-annex bash/perl one-liners.

Comment by unqueued Wed Apr 10 12:26:49 2024