Recent comments posted to this site:
Even if it only re-checks remote.name.annexUrl when git-annex is going to use the remote (rather than on every run of git-annex), that may still be too often to check.
But if it checks less often than that (say, once per day), there will of course be a window where it has not yet noticed a change, uses the cached remote.name.annexUrl, and potentially fails.
A balance might be: if connecting to the cached remote.name.annexUrl fails, re-check it at that point.
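To illustrate the fallback (a minimal sketch, not git-annex code; the helper actions for reading the cached url and re-probing the remote are hypothetical stand-ins):

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Exception (SomeException, try)

-- Use the cached annexUrl first; only when connecting to it fails,
-- re-check the remote and retry once with the freshly probed url.
connectWithRecheck
    :: IO String            -- read the cached remote.name.annexUrl
    -> IO String            -- re-probe the remote for its current annexUrl
    -> (String -> IO a)     -- connect using a given url
    -> IO a
connectWithRecheck getcached reprobe connect = do
    cached <- getcached
    r <- try (connect cached)
    case r of
        Right v -> return v
        Left (_ :: SomeException) -> do
            url <- reprobe
            connect url
```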
Yes, the trashbin remote could be private. I think we're in agreement that's the best way to go.
--accessedwithin relies on atime, and looks at objects in the local repository only, so it would not work to find objects in the trashbin remote.
I don't think there is anything in preferred content expressions that would meet your need here exactly. It would probably be possible to add an expression that matches objects that have been present in a given repository for a given amount of time. The presence logs do have a timestamp.
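As a rough illustration of that idea (placeholder types, not git-annex's real presence log parser):

```haskell
import Data.Time.Clock (NominalDiffTime, UTCTime, addUTCTime)

-- Placeholder for whatever the real presence log parser would produce.
data PresenceEntry = PresenceEntry
    { entryPresent :: Bool     -- latest recorded status for this repository
    , entryChanged :: UTCTime  -- timestamp of that log entry
    }

-- "Has been present in the given repository for at least this long."
presentForAtLeast :: NominalDiffTime -> UTCTime -> PresenceEntry -> Bool
presentForAtLeast duration now e =
    entryPresent e && addUTCTime duration (entryChanged e) <= now
```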
Of course, if you used a directory special remote you could use plain old find.
There are also some common setup stage tasks that pose problems but could all be fixed in one place:
- Encryption setup generates encryption keys. That is slow, and generating and then throwing away an encryption key is also the wrong thing to do. I think this could be dealt with by copying the encryption setup of the remote that is generating the ephemeral remote into it (see the sketch after this list).
- remote.name.annex-uuid is set in git config by gitConfigSpecialRemote. Either that could be disabled for ephemerals, or the uuid and name could also be inherited, which would make that a no-op.
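A very rough sketch of the inheritance idea (the config type and field names are illustrative, not git-annex's real ones):

```haskell
import qualified Data.Map.Strict as M

-- Illustrative stand-in for git-annex's remote config.
type RemoteConfig = M.Map String String

-- Copy the fields the setup stage would otherwise have to generate
-- (encryption setup, uuid, name) from the parent remote into the
-- ephemeral remote's config, keeping any values the ephemeral remote
-- already sets.
inheritFromParent :: RemoteConfig -> RemoteConfig -> RemoteConfig
inheritFromParent parent ephemeral =
    M.union ephemeral (M.filterWithKey (\k _ -> k `elem` inherited) parent)
  where
    inherited = ["cipher", "cipherkeys", "uuid", "name"]
```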
The major difficulty in implementing this seems to be the setup stage, which is the per-special-remote code that runs during initremote/enableremote. That code can write to disk, or perform expensive operations.
A few examples:
- S3's setup makes 1 http request to verify that the bucket exists (or about 4 http requests when it needs to create the bucket). It does additional work when bucket versioning is enabled.
- directory's setup modifies the git config file to set remote.name.directory. And if that were skipped, generating the directory special remote would fail, because it reads that git config.
My gut feeling is that it won't be practical to make it possible to ephemeralize every type of special remote. But it would not be too hard to make some subset of special remotes able to be used ephemerally.
It might be possible to maintain a cache of recently used ephemeral special remotes across runs of git-annex, and so avoid needing to re-run the setup stage.
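Something along these lines, perhaps (a sketch with placeholder types; a real version would need to persist the cache to disk to survive across runs):

```haskell
import qualified Data.Map.Strict as M
import Data.IORef

type RemoteConfig = M.Map String String   -- placeholder for the real type
type SetupResult  = RemoteConfig          -- the config as adjusted by setup

-- Cache keyed on the user-supplied config, so re-using the same ephemeral
-- remote parameters can skip the (possibly slow) setup stage.
newtype SetupCache = SetupCache (IORef (M.Map RemoteConfig SetupResult))

newSetupCache :: IO SetupCache
newSetupCache = SetupCache <$> newIORef M.empty

cachedSetup :: SetupCache -> RemoteConfig -> (RemoteConfig -> IO SetupResult) -> IO SetupResult
cachedSetup (SetupCache ref) cfg runsetup = do
    m <- readIORef ref
    case M.lookup cfg m of
        Just r -> return r
        Nothing -> do
            r <- runsetup cfg
            modifyIORef' ref (M.insert cfg r)
            return r
```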
This seems like a good design to me. It will need a protocol extension to indicate when a git-annex version supports it.
It occurred to me that when git-annex p2phttp is used and is proxying to a special remote that uses this feature, it would be possible to forward the redirect to the http client, so the server would not need to download the object itself.
A neat potential optimisation, although implementing it would cut across several things in a way I'm unsure how to do cleanly.
That did make me wonder, though, whether the redirect url would always be safe to share with the client without granting the client any abilities beyond a one-time download. And I think that's too big an assumption to make for this optimisation. Someone could choose to redirect to a url containing, eg, http basic auth credentials, which would be fine when using it all locally, but not in this proxy situation. So there would need to be an additional configuration setting to enable the proxy optimisation.
This is fixed in aws-0.25.1. I have made the git-annex stack build use that version. I also added a build warning when built with an older version, to hopefully encourage other builds to get updated.
Root caused to this bug: https://github.com/aristidb/aws/issues/296
Seems likely that git-annex import from an importtree=yes S3 remote on GCP is also broken, since it also uses getBucket.
git-annex uses getBucket to probe whether the bucket already exists, which lets it avoid dealing with the various ways that PUT of a bucket can fail. GCP also has some incompatibilities in how it responds to that; eg, in the above log it uses a custom "BucketNameUnavailable" rather than the S3 standard "BucketAlreadyExists".
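Roughly, the probe amounts to something like this (a sketch assuming the aws package's GetBucket interface; not the actual git-annex code, and details vary by aws version):

```haskell
import qualified Aws
import qualified Aws.S3 as S3
import Control.Exception (SomeException, try)
import Data.Functor (void)
import Data.Text (Text)

-- Probe whether a bucket already exists by issuing a GetBucket request
-- and treating any failure as "does not exist yet".
bucketExists :: Text -> IO Bool
bucketExists bucket = do
    cfg <- Aws.baseConfiguration
    let s3cfg = Aws.defServiceConfig :: S3.S3Configuration Aws.NormalQuery
    r <- try (void (Aws.simpleAws cfg s3cfg (S3.getBucket bucket)))
        :: IO (Either SomeException ())
    -- GCP's nonstandard error responses (eg "BucketNameUnavailable")
    -- are what trips up this kind of probe.
    return (either (const False) (const True) r)
```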
Implementing this will need changes to the haskell aws library, since it does not allow setting this header.
I opened an issue https://github.com/aristidb/aws/issues/294
Update: This will need git-annex to be built with aws-0.25. If an S3 special remote is configured with this header, and an older version of git-annex (or a git-annex built with an older version of aws) is used, it will simply not send the header along when storing an object.
So if your use case involves making newly uploaded objects private, you'll want to make sure you're always using a build of git-annex that supports it.
After fixing the other bug, I have successfully run the test for several hours without any problems.
Using DebugLocks, I found that the deadlock is in checkvalidity, the second time it calls putTMVar endv ().
That was added in 7bd616e169827568c4ca6bc6e4f8ae5bf796d2d8 ("a bugfix to serveGet, it hung at the end").
Looks like a race between checkvalidity and waitfinal, which both fill endv. waitfinal does not deadlock when endv is already full, but checkvalidity does.
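A minimal sketch of that shape (not the actual P2P code): both threads signal completion via the same TMVar, and only one of them does so safely.

```haskell
import Control.Concurrent.STM

-- tryPutTMVar is a no-op when the TMVar is already full, so it is safe
-- for either thread to call, regardless of who finishes first.
signalDone :: TMVar () -> IO ()
signalDone endv = do
    _ <- atomically (tryPutTMVar endv ())
    return ()

-- This is the deadlocking shape: if the other thread already filled endv,
-- a blocking putTMVar retries forever.
signalDoneBlocking :: TMVar () -> IO ()
signalDoneBlocking endv = atomically (putTMVar endv ())
```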