Recent comments posted to this site:
Nothing. --fast happens to be parsed as a global option, so it's accepted with every command, but it does not change the usual behavior of git-annex get at all.
Commands like git-annex copy that implement a different behavior for --fast have it documented in their individual man pages.
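For example (the file and remote names here are made up):

    git annex get --fast somefile                  # --fast is accepted, but get behaves as usual
    git annex copy --fast --to myremote somefile   # copy documents its own --fast behavior; see git-annex-copy(1)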
What will happen with too small a bloom filter is that git-annex unused may think that some keys are used which really are not, and git-annex sync --content may operate on some keys that are not in the work tree.
The git-annex info command displays how much memory the configured bloom filters use, which is why it's reporting 32 mebibytes. But the annex.bloomcapacity setting is the number of expected files in the work tree, by default 500000.
It would probably make sense for you to set it to 2000000 or so unless your system has an unusually small amount of RAM.
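For example, to raise the capacity to the value suggested above and then check the resulting memory use:

    git config annex.bloomcapacity 2000000
    git annex info | grep -i bloom    # shows how much memory the bloom filters now use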
Hi, author of rclone's "gitannex" command here. Sorry you're running into trouble with it!
Based on the text, that error is definitely coming from gitannex.go.
I believe my intent was to detect that the following mkdir would fail and offer up a more specific error message, rather than just letting the mkdir itself fail.
I don't know anything about Backblaze B2, unfortunately. I suppose we could work around the issue by creating an empty file underneath the place we want the empty directory. Sounds plausible, right?
Would you mind trying to make an empty directory on your B2 remote to verify it fails? Something like rclone mkdir myremote:newdir. And also try touching a file in a new directory to verify it's possible in one go? Something like rclone touch --recursive myremote:newdir/newfile.txt.
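If the first fails but the second works, the workaround I'm imagining would boil down to something like this (the placeholder filename is made up):

    rclone touch myremote:newdir/.keep    # create an empty file so the directory comes into existence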
Some thoughts regarding your ideas:
- Multiple output files could always be emulated by generating a single archive file and registering additional compute instructions that simply extract each output file from that archive (see the sketch after this list). I think there could be some convenience functionality on the CLI side to set that up, and the key of the archive file might not even need to correspond to an actual file in the tree.
- For my use-cases (and I think DataLad at large) it is important to make this feature work across repository boundaries. E.g. I would like to use this feature to build a derived dataset from https://atris.fz-juelich.de/MeteoCloud/ERA5, where exactly this conversion from grib to netcdf happens in the compute step. I'd like to have the netcdf outputs as a separate dataset as some users might only be interested in the grib files, and it would scale better when there is more than just one kind of output that can be derived from an input by computation.
git annex get doesn't work recursively across submodules/subdatasets though, and datalad get does not understand keys, just paths (at least so far).
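To make the archive idea in the first bullet concrete, here is a rough sketch of the extraction side (the script, archive, and output names are all made up, and the git-annex side of registering these steps as compute instructions is left out, since that interface is exactly what's under discussion):

    # one compute step emits a single archive holding all outputs
    ./grib-to-netcdf-all.sh data.grib outputs.tar
    # each registered output file is then just a trivial extraction from that archive
    tar -xf outputs.tar t2m.nc
    tar -xf outputs.tar precip.nc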
> @m.risse in your example the "data.nc" file gets new content when retrieved from the special remote and the source file has changed.
True, that can happen, and the user was explicit in that they either don't care about it (non-checksum backend, URL in my PoC), or do care (checksum backend) and git-annex would fail the checksum verification.
> But if you already have data.nc file present in a repository, it does not get updated immediately when you update the source "data.grib" file. So, a drop and re-get of a file changes the version of the file you have available. For that matter, if the old version has been stored on other remotes, a get may retrieve either an old or a new version. That is not intuitive and it makes me wonder if using a special remote is really a good fit for what you're wanting to do.
This I haven't entirely thought through. I'd say if the key uses a non-checksum backend, then it can only be assumed, and is the user's responsibility, that the resulting file is functionally, even if not bit-by-bit, identical. E.g. with netCDF, checksums can differ due to small details like chunking, but the data might be the same. With a checksum backend git-annex would just fail the next recompute, but the interactions with copies on other remotes could indeed get confusing.
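As a concrete illustration of the chunking point (the filenames are made up, and this assumes the netCDF utilities and cdo are available and that the file has a time dimension):

    nccopy -c time/100 data.nc data-rechunked.nc   # rewrite with different chunking, same data
    sha256sum data.nc data-rechunked.nc            # the checksums differ
    cdo diffn data.nc data-rechunked.nc            # should report that no records differ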
> In your "cdo" example, it's not clear to me if the new version of the software generates an identical file to the old, or if it has a bug fix that causes it to generate a significantly different output. If the two outputs are significantly different then treating them as the same git-annex key seems questionable to me.
Again, there are two possible cases depending on whether the key uses a checksum or a non-checksum backend. With a checksum: if the new version produces the same output, everything is fine; if the new version produces different output, then git-annex would indicate this discrepancy on the next recompute and the user has to decide how to handle it (probably by checking that the output of the new version is either functionally the same or in some way "better" than the old one, and updating the repository to record this new key as that file).
Without a checksum backend, the user would again have been explicit in that they don't care if the data changes for whatever reason; the key is essentially just a placeholder for the computation, without a guarantee about its content.
Something like VURL would be a compromise between the two: it would avoid the upfront cost of computing all files (which might be very expensive), but still instruct git-annex to error out if the checksum changes at some point after the first compute. A regular migration of the computed-files-so-far to a checksum backend could achieve the same.
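For the migration variant, something along these lines would pin the files computed so far to a checksum backend after the fact (the path and backend choice are just examples):

    git annex migrate --backend=SHA256E path/to/computed-files/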