Recent comments posted to this site:

Thank you. The bash-tab-completion in Debian stable suggested it.
Comment by jnkl Thu Jan 30 19:59:09 2025
Thank you for your reply. And I see what you mean, although personally it still feels more like a bug than a feature. Trying to be a good internet citizen, I went through all the documentation I could find about it, and it still took me way too long to figure out something that simple. Maybe the manpage could reflect this bit of information in more detail - let's see if I can figure out how to contribute patches ...
Comment by luciusf Thu Jan 30 19:17:58 2025

Nothing. --fast happens to be parsed as a global option so it's accepted with every command, but it does not change the usual behavior of git-annex get at all.

Commands like git-annex copy that implement a different behavior for --fast have it documented in their individual man pages.
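
For example (the remote and file names here are just placeholders), the difference looks roughly like this:

    # --fast is accepted but changes nothing for get:
    git annex get --fast somefile
    # for copy, --fast skips contacting the remote to check whether it
    # already has the content, relying on location tracking instead:
    git annex copy --fast --to myremote somefile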

Comment by joey Thu Jan 30 18:56:04 2025
Thank you!
Comment by jnkl Thu Jan 30 18:52:14 2025
I just created rclone issue #8349 to track this.
Comment by dmcardle Thu Jan 30 13:56:30 2025

With too small a bloom filter, git-annex unused may think that some keys are used which really are not. And git-annex sync --content may operate on some keys that are not in the work tree.

The git-annex info command displays how much memory the configured bloom filter uses, which is why it's reporting 32 mebibytes. But the annex.bloomcapacity setting is the expected number of files in the work tree, by default 500000.

It would probably make sense for you to set it to 2000000 or so unless your system has an unusually small amount of RAM.
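
For example, to check the current bloom filter memory use and raise the capacity (2000000 being the value suggested above):

    # show how much memory the configured bloom filter uses:
    git annex info | grep bloom
    # raise the expected number of files in the work tree:
    git config annex.bloomcapacity 2000000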

Comment by joey Wed Jan 29 15:54:19 2025

Hi, author of rclone's "gitannex" command here. Sorry you're running into trouble with it!

Based on the text, that error is definitely coming from gitannex.go.

I believe my intent was to detect up front that the subsequent mkdir would fail and offer a more specific error message, rather than letting it fail.

I don't know anything about Backblaze B2, unfortunately. I suppose we could work around the issue by creating an empty file underneath the place we want the empty directory. Sounds plausible, right?

Would you mind trying to make an empty directory on your B2 remote to verify it fails? Something like rclone mkdir myremote:newdir.

Could you also try touching a file in a new directory, to verify it's possible in one go? Something like rclone touch --recursive myremote:newdir/newfile.txt.

Comment by dmcardle Wed Jan 29 14:44:37 2025

Some thoughts regarding your ideas:

  • Multiple output files could always be emulated by generating a single archive file and registering additional compute instructions that simply extract each output file from that archive (see the sketch after this list). I think there could be some convenience functionality on the CLI side to set that up, and the key of the archive file might not even need to correspond to an actual file in the tree.
  • For my use-cases (and I think DataLad at large) it is important to make this feature work across repository boundaries. E.g. I would like to use this feature to build a derived dataset from https://atris.fz-juelich.de/MeteoCloud/ERA5, where exactly this conversion from grib to netcdf happens in the compute step. I'd like to have the netcdf outputs as a separate dataset as some users might only be interested in the grib files, and it would scale better when there is more than just one kind of output that can be derived from an input by computation. git annex get doesn't work recursively across submodules/subdatasets though, and datalad get does not understand keys, just paths (at least so far).
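
As a rough sketch of the archive-based emulation from the first point, assuming the compute instructions boil down to shell commands (the file names and the "some-expensive-computation" step are made up, and how the instructions would be registered with git-annex is left open here):

    # the single real compute step produces the outputs and bundles them:
    some-expensive-computation input.grib    # produces out1.nc and out2.nc
    tar -cf outputs.tar out1.nc out2.nc
    # each per-file compute instruction then merely extracts from the archive:
    tar -xf outputs.tar out1.nc
    tar -xf outputs.tar out2.nc
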
Comment by matrss Wed Jan 29 10:13:59 2025

@m.risse in your example the "data.nc" file gets new content when retrieved from the special remote and the source file has changed.

True, that can happen, and the user was explicit in that they either don't care about it (non-checksum backend, URL in my PoC), or do care (checksum backend) and git-annex would fail the checksum verification.

But if you already have data.nc file present in a repository, it does not get updated immediately when you update the source "data.grib" file.

So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive, and it makes me wonder if using a special remote is really a good fit for what you're wanting to do.

This I haven't entirely thought through. I'd say if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally identical, even if not bit-by-bit. E.g. with netCDF, checksums can differ due to small details like chunking, but the data might be the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.

In your "cdo" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.

Again, two possible cases depending on whether the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output, everything is fine; if it produces different output, then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way "better" than the old one, and updating the repository to record this new key as that file).

Without a checksum backend the user would again have been explicit in that they don't care if the data changes for whatever reason, the key is essentially just a placeholder for the computation without a guarantee about its content.

Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A regular migration of the computed-files-so-far to a checksum backend could achieve the same.
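
As a concrete sketch of that migration step (the *.nc pattern and the backend choice are just examples), the computed files could later be moved to a checksum backend like this:

    # configure a checksum backend for the computed outputs:
    echo '*.nc annex.backend=SHA256E' >> .gitattributes
    # rewrite the existing keys (content must be present locally):
    git annex migrate *.nc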

Comment by matrss Wed Jan 29 09:56:12 2025
Indeed. However, the space savings would likely be marginal in a typical git-annex repo.
Comment by Atemu Tue Jan 28 21:57:42 2025