Recent comments posted to this site:
Probably this. In any case, it's better to upgrade before filing a bug on something like this.
git-annex (8.20211123) upstream; urgency=medium
  * Bugfix: When -J was enabled, getting files could leak an
    ever-growing number of git cat-file processes.
10.20251114-..... I will update/close the issue according to the result.
With the separate autoenabled remote for PRs, the UX could look like this:
> git-annex add myfile
add myfile ok
> git commit -m foo
> git push origin HEAD:refs/for/main -o topic="add myfile"
> git-annex push origin-PRs
copy myfile (to origin-PRs) ... ok
Or with a small git-annex improvement, even:
> git-annex assist -o topic="add myfile"
add myfile ok
copy myfile (to origin-PRs) ... ok
For this, origin-PRs would want all files not in origin, and origin would want all files not in origin-PRs. And origin-PRs would need to have a lower cost than origin so that it doesn't first try, and fail, to copy the file to origin.
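A minimal sketch of that configuration, assuming both remotes are git-annex repositories, using hypothetical group names "main" and "prs" so the preferred content expressions can refer to each side (150 is simply a cost below the usual default for network remotes):

git config remote.origin-PRs.annex-cost 150
git-annex group origin main
git-annex group origin-PRs prs
git-annex wanted origin-PRs 'not copies=main:1'
git-annex wanted origin 'not copies=prs:1'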
A per-user special remote that is assumed to contain the annexed files for all of the user's AGit-PRs. If git recognizes remote configs in the user's global git config then it could be possible to get away with configuring things once, but I am not sure of the behavior of git in that case.
I think git will do that (have not checked), but a special remote needs information to be written to the git-annex branch, not just git config, so there's no way to globally configure a special remote to be accessible in every git-annex repository.
Along similar lines, forgejo could set up an autoenabled remote
that contains annexed files for all AGit-PRs, and that wants any files
not in the main git repository. (This could be a special remote, or a
git-annex repository that just doesn't allow any ref pushes to it. The
latter might be easier to deal with since git-annex p2phttp could serve
it as just another git-annex repository.)
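As a sketch of the one-time setup, which clones would then autoenable, assuming the remote is named forgejo-prs and the main repository is in a hypothetical group named "main" (the remote type and url are placeholders; the real setup would depend on how forgejo serves it):

git-annex initremote forgejo-prs type=git location=https://forgejo.example/user/repo-prs.git autoenable=true
git-annex wanted forgejo-prs 'not copies=main:1'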
That would solve the second problem I discussed in the comment above, because when the user copies objects to that separate remote, it will not cause git-annex in the forgejo repository to update the main git-annex branch to list those objects.
When merging a PR, forgejo would move the objects over from that remote to the main git repository.
You would be left with a bit of a problem in deleting objects from that remote when a PR is rejected. Since the user may never have pushed their git-annex branch after sending an object to it, you would not know which PR that object belongs to. I suppose this could be handled by finding all objects that are in active PRs and, after some amount of time, deleting the ones that are not.
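One rough way to implement that cleanup, assuming open PR heads are visible as refs/pull/*/head and that forgejo can enumerate the keys actually stored in the PR remote out of band (the stored-keys file below stands in for such a listing):

# collect the keys referenced by any open PR head
for ref in $(git for-each-ref --format='%(refname)' 'refs/pull/*/head'); do
    git-annex find --branch="$ref" --format='${key}\n'
done | sort -u > keys-in-open-prs
# stored-keys: hypothetical listing of keys present in the PR remote
sort -u stored-keys > keys-in-remote
# keys stored in the remote but referenced by no open PR; candidates
# for deletion after a grace period
comm -23 keys-in-remote keys-in-open-prs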
Obviously annexed objects copied to the Forgejo-aneksajo instance via this path should only be available in the context of that PR in some way.
The fundamental issue seems to be that annexed objects always belong to the entire repository, and are not scoped to any branch.
Hmm.. git objects also don't really belong to any particular branch. git only fetches objects referenced by the branches you clone.
Similarly, git-annex can only ever get annex objects that are listed
in the git-annex branch. Even with --all, it will not know about objects
not listed there.
So, seems to me you may only need to keep the PR's git-annex branch separate from the main git-annex branch, so that the main git-annex branch does not list objects from the PR. I see two problems that would need to be solved to do that:
1. If git-annex is able to see the PR's git-annex branch as eg
   refs/foo/git-annex, it will auto-merge it into the main git-annex branch,
   and then --all will operate on objects from the PR as well. So the PR's
   git-annex branch would need to be named to avoid that. This could be just

   git push origin git-annex:refs/for/git-annex/topic-branch

   Maybe git-annex sync could be made to support that for its pushes?

2. When git-annex receives an object into the repository, the receiving side
   updates the git-annex branch to indicate it now has a copy of that object.
   So, you would need a way to make objects sent to a PR update the PR's
   git-annex branch, rather than the main git-annex branch. This could be
   something similar to git push -o topic in git-annex. Which would need to
   be a P2P protocol extension. Or maybe some trick with the repository UUID?
When the PR is merged, you would then also merge its git-annex branch.
If the PR is instead rejected, and you want to delete the objects
associated with it, you would first delete the PR's other branches, and
then run git-annex unused, arranging (how?) for it to see only the PR's
git-annex branch and not any other git-annex branches. That would find any
objects that were sent as part of the PR, that don't also happen to be used
in other branches (including other PRs).
I do wonder, if this were implemented, would the git-annex
workflow for the user be any better than if there were a per-PR
remote for them to use? If every git-annex command that pushes the
git-annex branch or sends objects to forgejo needs -o topic
to be given, then it might be a worse user experience.
Glacier is in the process of being deprecated; instead there is the S3 Deep Archive storage class. https://aws.amazon.com/blogs/aws/new-amazon-s3-storage-class-glacier-deep-archive/
While it is possible to configure an S3 special remote
with storageclass=DEEP_ARCHIVE, or configure a bucket with lifecycle rules
to move objects to deep archive, git-annex won't be able to retrieve objects
stored in deep archive.
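For reference, such a remote would be created along these lines (a sketch; the remote and bucket names are hypothetical):

git-annex initremote deeparchive type=S3 encryption=none bucket=my-annex-bucket storageclass=DEEP_ARCHIVE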
To support that, the S3 special remote would need to send a request to S3 to
restore an object from deep archive. Then later (on a subsequent git-annex run)
it can download the object from S3.
This is the API: https://docs.aws.amazon.com/AmazonS3/latest/API/API_RestoreObject.html
It includes a Tier tag which controls whether the restore is expedited. There would probably need to be a git config for that, since the user may want to get a file fast or pay less for a slower retrieval.
And there is a Days tag, which controls how long the object should be left accessible in S3. It would also make sense to have a git config for this.
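For illustration, this is what such a restore request looks like via the AWS CLI, with both tags set (the bucket, key, and values are arbitrary examples):

aws s3api restore-object \
    --bucket my-annex-bucket \
    --key SHA256E-s1048576--abc123 \
    --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'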
I have opened this issue, which is a prerequisite to implementing this: https://github.com/aristidb/aws/issues/297
I don't think storageclass=DEEP_ARCHIVE will currently work, since
git-annex is not able to request that the object be restored.
See https://git-annex.branchable.com/todo/wishlist__58___Restore_s3_files_moved_to_Glacier/ for a todo which would solve this.
please advise on how to mitigate (git reset --hard the s3-dandiarchive/master branch to its state prior to yesterday and reimport with a newer git-annex, or ... ?)
Simply resetting the remote tracking branch and re-importing won't necessarily cause an import to happen again, because git-annex tracks internally what has been imported from the remote. Running an import again when it has already imported those files won't re-download them, and it will regenerate the same remote tracking branch.
So running in a clone from a backup is a better way to re-run the import.
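Concretely, that could look something like this (the backup path is a placeholder; the remote name is taken from your setup):

git clone /path/to/backup retry-repo
cd retry-repo
git-annex enableremote s3-dandiarchive
git-annex import master --from s3-dandiarchive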
Are the 000675/draft/ files you show it importing the ones that are
access restricted?
And when you replicated the problem from the backup, were you using it in the configuration where it cannot access those?
I notice that the affected files all seem to be smallish text files (yaml, jsonld). Do you have annex.largefiles configured in this repository, and are all of the affected files non-annexed? If so, it would be worth retrying from the backup with the config changed so those files get annexed, to see if that avoids the problem.
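For example, to annex everything on the retry, including small text files, something like:

git config annex.largefiles anything

would prevent any file from being stored directly in git.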