Recent comments posted to this site:

Was this ever explored further? It would be very interesting to be able to use the metadata functionality on regular git files that are not in the annex.
Comment by joris Thu Jun 20 09:58:05 2024

I have found one way to graft in the S3 bucket. It involves running git-annex initremote cloud type=S3 , which unavoidably creates a new dummy bucket (bucket=dummy can be used to identify it), and then running git-annex enableremote cloud bucket=cloud- to utilise the original bucket without having to copy/move over all the files.

I did try it in one shot with git-annex initremote cloud type=S3 bucket=cloud- , but unfortunately it fails because the creation of the bucket step appears mandatory, and the S3 api errors out with an "already created bucket" type of error.
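
For anyone wanting to try the two-step workaround, here is a rough sketch of it as a shell helper. The function name graft_s3_remote, the DRY_RUN switch and the bucket names in the usage line are made up for illustration; only the two git-annex invocations come from the comment above. Any further initremote parameters (encryption and so on) are passed through untouched, since the comment does not say which ones were used. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: point a new S3 special remote at an already existing bucket.
# Hypothetical helper; only the two git-annex commands are from the comment.

# Print the command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: graft_s3_remote NAME DUMMY_BUCKET ORIGINAL_BUCKET [initremote params...]
graft_s3_remote() {
    name=$1 dummy=$2 orig=$3
    shift 3
    # Step 1: initremote insists on creating a bucket, so give it a throwaway one.
    run git-annex initremote "$name" type=S3 "bucket=$dummy" "$@" || return
    # Step 2: switch the remote over to the bucket that already holds the files.
    run git-annex enableremote "$name" "bucket=$orig"
}
```

Something like graft_s3_remote cloud dummy my-original-bucket prints the two commands for review; running it again with DRY_RUN=0 executes them.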

However, if there is general guidance somewhere for... I guess importing/exporting the special remote metadata (including stored encryption keys), that would be very much appreciated.

Sorry, I should just clarify. Trying to do this via sync from the old, non-tuned git-annex repo fails with:

git-annex: Remote repository is tuned in incompatible way; cannot be merged with local repository.

Which I understand for the wider branch data implications... but I don't know enough to understand why just the special remote data can't be merged in.

Comment by beryllium Sat Jun 15 07:37:06 2024

Naively, I put myself in a position where my rather large, untuned git-annex had to be recovered, because I did not appreciate the effect of case-insensitive filesystems.

Specifically, NTFS-3G is deadly in this case. Windows has advanced and, with WSL, added the ability to enable case-sensitivity on a folder, which is inherited by the folders under it, but NTFS-3G does not do this.

So beware if you try to work in an "interoperable" way. NTFS-3G will do mixed case, but will create child folders that are not case-sensitive.

To that end, I want to migrate this rather large git-annex to be tuned to annex.tune.objecthashlower. I already have a good strategy around this. I'll just create a completely new stream of git-annex'es originating from a newly formed one. I will also be able to create new type=directory special remotes for my "tape-out" existing git-annex. I will just use git annex fsck --fast --from $remote to rebuild the location data for it.
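
Roughly, the plan above might look like the sketch below. This is an illustration under several assumptions, not a tested migration procedure: the helper name new_tuned_repo, the DRY_RUN switch and all paths are placeholders, encryption=none is a guess for an unencrypted directory remote, and it assumes the annexed files are already present in the new repository's tree, since fsck only checks files it knows about. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: create a repository tuned with annex.tune.objecthashlower and
# rebuild location data for an existing directory special remote.
# Hypothetical helper; paths and the remote name are placeholders.

# Print the command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: new_tuned_repo REPO_DIR REMOTE_NAME REMOTE_DIR
new_tuned_repo() {
    repo=$1 remote=$2 dir=$3
    run git init "$repo" || return
    # Tuning has to be set when the repository is first initialized.
    run git -C "$repo" -c annex.tune.objecthashlower=true annex init || return
    # encryption=none is an assumption; adjust to match the existing remote.
    run git -C "$repo" annex initremote "$remote" type=directory "directory=$dir" encryption=none || return
    # Ask the remote which of the tracked files it holds and record that.
    run git -C "$repo" annex fsck --fast --from "$remote"
}
```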

I've also tested this with an S3 git-annex as a proof-of-concept. So in the new git-annex, I ran git-annex initremote cloud type=S3... to create a new bucket, copied over a file from the old bucket, and rebuilt the location data for that file.

But I really really would like to be able to avoid creating a new bucket. I am happy to lose the file presence/location data for the old bucket, but I'd like to graft back in, or initremote the cloud bucket with matching parameters. The same goes, I guess, for an encrypted special remote, i.e. importing the encryption keys, etc.

Are there "plumbing" commands that can do this? Or does it require knowing about the low-level storage of this metadata to achieve it, which seems to just send me back to the earlier comment of using a filter-branch... which I am hoping to avoid (because of all the potential pitfalls).

Comment by beryllium Sat Jun 15 00:57:26 2024

Looking at the behavior of git-annex get, the first one leaves the index in a diff state:

joey@darkstar:~/tmp/b2/x>git-annex get funky
get funky (from origin...)
(recording state in git...)
joey@darkstar:~/tmp/b2/x>git diff --cached
diff --git a/funky b/funky
index a8813f1..9488a18 100644
--- a/funky
+++ b/funky
@@ -1 +1 @@

To the second git-annex get, this is indistinguishable from a different unlocked file having been moved over top of funky. So the behavior of the second one is fine.

The problem is with the first git-annex get leaving the index in that state.

What's happening is, it doesn't restage the index, because the restage itself can't tell the difference between this state and an unlocked file having been moved over top of funky. In particular, git update-index --refresh --stdin when run after the first git-annex get, and fed "funky", leaves the index in diff state.

joey@darkstar:~/tmp/b2/x>touch funky
joey@darkstar:~/tmp/b2/x>echo funky | GIT_TRACE=1 git update-index --refresh --stdin
14:14:33.911458 git.c:465               trace: built-in: git update-index --refresh --stdin
14:14:33.911759 run-command.c:657       trace: run_command: 'git-annex filter-process'
14:14:33.917118 git.c:465               trace: built-in: git config --null --list
14:14:33.919641 git.c:465               trace: built-in: git show-ref git-annex
14:14:33.921390 git.c:465               trace: built-in: git show-ref --hash refs/heads/git-annex
14:14:33.925579 git.c:465               trace: built-in: git cat-file --batch
14:14:33.927011 run-command.c:50        trace: run_command: running exit handler for pid 1164525
joey@darkstar:~/tmp/b2/x>git status --short
M  funky

So git update-index is running git-annex filter-process, which is doing the same as git-annex smudge --clean funky in this case. And in Command.Smudge.clean, there is a parseLinkTargetOrPointerLazy' call which is intended to avoid storing a pointer file in the annex... The very thing that the assistant is somehow incorrectly doing. In this case though, that notices that funky's content looks like an annex pointer file, so it outputs that pointer. So git stages that pointer.

To avoid this, the first git-annex get would need to notice that the content it got looks like a pointer file. And it would need to communicate that through the git update-index somehow to git-annex filter-process. Then when that saw the same pointer file, it could output the original key, and this situation would be avoided. Also bear in mind that the git update-index can be interrupted and get restarted later and it would still need to remember that it was dealing with this case then. This seems... doable, but it will not be easy.

PS, Full script to synthesize a repository with this situation follows:

git init z
cd z
git-annex init
git commit --allow-empty -m created
cd ..
git clone z y
cd y
git-annex init
echo 'Thu Jun 13 12:30:17 JEST 2024' > foo
git-annex add foo
git commit -m added
git-annex move foo --to origin
git rm foo
git commit -m removed
echo '/annex/objects/SHA256E-s30--93c16dbf65b7b66e479bd484398c09c920338e4a1df1fe352b245078d04645f4' > funkyobj
git-annex setkey WORM--foo funkyobj
echo '/annex/objects/WORM--foo' > funky
git add funky
git commit -m add\ funky
git annex find --format='${key}\n' funky
git-annex get funky
cd ..
git clone y x
cd x
git remote add z ../z
git-annex get funky
git-annex get funky
Comment by joey Thu Jun 13 18:01:01 2024

git-annex add (and smudge) use isPointerFile to check if a file that is being added is an annex pointer file. And in that case they stage the pointer file, rather than injecting it into the annex.
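
To make the idea concrete, here is a rough shell approximation of such a check. It is not git-annex's isPointerFile, just a sketch that assumes a pointer file's first line starts with /annex/objects/ followed by a key, as in the examples in this thread. The function name looks_like_pointer is made up.

```shell
#!/bin/sh
# Rough approximation of a pointer-file check, NOT git-annex's isPointerFile.
# Assumption: the first line of a pointer file is /annex/objects/KEY.

# usage: looks_like_pointer FILE
# Exit status 0 if FILE is a regular file whose first line looks like a pointer.
looks_like_pointer() {
    [ -f "$1" ] || return 1
    # Only the first line is inspected; real pointer files are tiny.
    head -n 1 -- "$1" | grep -q '^/annex/objects/[^/]'
}
```

A caller could then decide between staging and ingesting, eg: if looks_like_pointer new; then git add new; fi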

The assistant also checks isPointerFile though. And in the simple case, it also commits a newly added pointer file correctly:

joey@darkstar:~/tmp/b2/a>git-annex assistant
joey@darkstar:~/tmp/b2/a>echo '/annex/objects/SHA256E-s30--93c16dbf65b7b66e479bd484398c09c920338e4a1df1fe352b245078d04645f4' > new
joey@darkstar:~/tmp/b2/a>git show|tail -n 1

So this makes me think of a race condition. What if the file is not a pointer file when the assistant checks isPointerFile, but then gets turned into one before the assistant ingests it?

In git-annex add, it first stats the file before checking if it's a pointer file, and later it checks if the file has changed while it was being added, which should avoid such races.
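
The stat-before, compare-after pattern can be sketched in shell like this. It only illustrates the general technique and is not how git-annex implements it; the function names are made up, and it assumes GNU coreutils stat (BSD/macOS stat takes different flags).

```shell
#!/bin/sh
# Sketch of a "did the file change while I was working on it?" guard.
# Assumes GNU coreutils stat (-c); not git-annex's actual implementation.

# Print inode, size and mtime of FILE as one comparable token.
# Note: mtime has one-second resolution here, so a same-size rewrite
# within the same second would go unnoticed.
file_signature() {
    stat -c '%i:%s:%Y' -- "$1"
}

# usage: guarded_ingest FILE COMMAND [ARGS...]
# Returns 0 only if COMMAND succeeded and FILE kept the same signature.
guarded_ingest() {
    file=$1
    shift
    before=$(file_signature "$file") || return 2
    "$@" || return 2    # the slow work, eg checking for a pointer and hashing
    after=$(file_signature "$file") || return 2
    if [ "$before" != "$after" ]; then
        echo "$file changed during ingest, discarding result" >&2
        return 1
    fi
}
```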

Looking at the assistant, I'm not at all confident it handles such a race.

It might even be another thread of the assistant that triggered the race. Could be that something caused the assistant to drop the file, then get it again, then drop it again. (Eg something wrong with configuration causing a non-stable state... like "not present" in preferred content).

I've tried running a get/drop/get/drop loop while the assistant is running, and have not seen this happen to a file yet. But the race window is probably small. An interesting thing I did notice is that sometimes when such a loop runs for a while, the file will be left as a pointer file after git-annex get.

Comment by joey Thu Jun 13 17:07:02 2024

First I wanted to see if I could get this to happen without the assistant.

joey@darkstar:~/tmp/y>echo '/annex/objects/SHA256E-s30--93c16dbf65b7b66e479bd484398c09c920338e4a1df1fe352b245078d04645f4' > new
joey@darkstar:~/tmp/y>git annex add new
add new ok
joey@darkstar:~/tmp/y>git annex find --format='${key}\n' new

joey@darkstar:~/tmp/y>git config annex.largefiles anything
joey@darkstar:~/tmp/y>echo '/annex/objects/SHA256E-s30--93c16dbf65b7b66e479bd484398c09c920338e4a1df1fe352b245078d04645f4' > new2
joey@darkstar:~/tmp/y>git add new2
joey@darkstar:~/tmp/y>git annex find --format='${key}\n' new2

So no, it must be only the assistant that can mess up and add an annexed link to the annex.

Secondly, here's a way to manually create a repository with this behavior w/o using the assistant.

joey@darkstar:~/tmp/y>git remote add z ../z
joey@darkstar:~/tmp/y>git-annex move --key SHA256E-s30--93c16dbf65b7b66e479bd484398c09c920338e4a1df1fe352b245078d04645f4 --to z
joey@darkstar:~/tmp/y>echo '/annex/objects/SHA256E-s30--93c16dbf65b7b66e479bd484398c09c920338e4a1df1fe352b245078d04645f4' > funkyobj
joey@darkstar:~/tmp/y>git-annex setkey WORM--foo funkyobj
setkey funkyobj ok
joey@darkstar:~/tmp/y>echo '/annex/objects/WORM--foo' > funky
joey@darkstar:~/tmp/y>git add funky
git-annex: git status will show funky to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run: git-annex restage
joey@darkstar:~/tmp/y>git commit -m add funky
joey@darkstar:~/tmp/y>git annex find --format='${key}\n' funky
joey@darkstar:~/tmp/y>cat funky
joey@darkstar:~/tmp/y>git-annex get funky

Nothing has gone wrong yet. funky is an unlocked file and it happens to have the content of an annex pointer file, but git-annex is not treating that content as an annex pointer file. If it were, the git-annex get funky above would get the SHA256 key from remote z.

But in a fresh clone, it's another story:

joey@darkstar:~/tmp>git clone y x
joey@darkstar:~/tmp>cd x
joey@darkstar:~/tmp/x>git remote add z ../z
joey@darkstar:~/tmp/x>cat funky
joey@darkstar:~/tmp/x>git-annex get funky
get funky (from origin...)
(recording state in git...)
joey@darkstar:~/tmp/x>git-annex get funky
get funky (from z...)
(recording state in git...)
joey@darkstar:~/tmp/x>cat funky
Thu Jun 13 12:30:17 JEST 2024

Which reproduces what you showed. I think this on its own is a bug, leaving aside whatever caused the assistant to generate this.

Comment by joey Thu Jun 13 16:31:57 2024

Interestingly, on the client git restore --staged PATH managed to recover the link so that it became "proper", while git-annex restage did nothing to fix the situation with the modified file:

[bids@rolando VIDS] > git merge --ff-only synced/master
Updating b4f3af57..263dad67
Updating files: 100% (871/871), done.
 .gitattributes                                                           |  1 +
create mode 100644 logs/2024-05-24T07:35-04:00.log
 create mode 100644 logs/2024-05-24T07:35-04:00.logpwd

git-annex: git status will show Videos/2024/03/2024. to be modified, since content availability has changed and git-annex was unable to update the index. This is only a cosmetic problem affecting git status; git add, git commit, etc won't be affected. To fix the git status display, you can run: git-annex restage
[bids@rolando VIDS] > 
[bids@rolando VIDS] > 
[bids@rolando VIDS] > 
[bids@rolando VIDS] > git-annex restage
restage  ok
[bids@rolando VIDS] > git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   Videos/2024/03/2024.

[bids@rolando VIDS] > git-annex restage 
restage  ok
[bids@rolando VIDS] > git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   Videos/2024/03/2024.

[bids@rolando VIDS] > git-annex restage  Videos/2024/03/2024.
git-annex: This command takes no parameters.
[bids@rolando VIDS] > git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   Videos/2024/03/2024.

[bids@rolando VIDS] > git restore --staged Videos/2024/03/2024.
[bids@rolando VIDS] > git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   Videos/2024/03/2024.

no changes added to commit (use "git add" and/or "git commit -a")
[bids@rolando VIDS] > git diff
diff --git a/Videos/2024/03/2024. b/Videos/2024/03/2024.
index 92b79020..fc930f54 100644
--- a/Videos/2024/03/2024.
+++ b/Videos/2024/03/2024.
@@ -1 +1 @@
diff --git a/Videos/2024/04/2024. b/Videos/2024/04/2024.
--- a/Videos/2024/04/2024.
+++ b/Videos/2024/04/2024.
@@ -1 +0,0 @@
[bids@rolando VIDS] > git log Videos/2024/03/2024.
commit ef5549f74dfea19c11bf963a7ec9789bce0d925d
Author: ReproStim User <>
Date:   Wed Apr 17 09:38:23 2024 -0400

    Move files under subfolders
[bids@rolando VIDS] > git --version
git version 2.39.2
[bids@rolando VIDS] > git annex version --raw
Comment by yarikoptic Tue Jun 11 17:36:51 2024

While I don't think this affects the ds002144 repository (because the repository with the missing tree is dead), here's what happens if the export.log's tree is missing, master has been reset to a previous tree, which was exported earlier, and in a clone we try to get a file that is present in both trees from the remote:

get foo (from d...) fatal: bad object f4815823941716de0f0fdf85e8aaba98d024d488

  unknown export location

Note that the "bad object" message only appears the first time the command is run. Afterwards it only says "unknown export location".

Even if the tree object later somehow gets pulled in, it will keep failing, because the exportdb at this point contains the tree sha and it won't try to update from it again.

To recover from this situation, the user can make a change to the tree (eg add a file), and export. It will complain one last time about the bad object, and then the export.log gets fixed to contain an available tree. However, any files that were in the missing tree that do not get overwritten by that export will remain in the remote, without git-annex knowing about them. If the remote has importtree=yes, importing from it is another way to recover.
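
The first recovery path might look roughly like the sketch below. The helper name reexport_after_change and the DRY_RUN switch are made up, the file is assumed to have been created by the user beforehand, and the branch defaults to master only because that is the branch mentioned above. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: change the tree and export again so export.log points at a tree
# that exists. Hypothetical helper, not an official recovery procedure.

# Print the command instead of running it, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# usage: reexport_after_change REMOTE NEW_FILE [BRANCH]
reexport_after_change() {
    remote=$1 file=$2 branch=${3:-master}
    # Any change to the tree will do; here a new file is added.
    run git annex add "$file" || return
    run git commit -m "change tree before re-export" || return
    # Expect one last "bad object" complaint during this export.
    run git annex export "$branch" --to "$remote"
}
```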

Comment by joey Mon Jun 10 14:36:37 2024

Note that at least in the case of ds002144, its git-annex branch does not contain grafts of the missing trees. The grafts only get created in the clone when dealing with a transition.

So, it seems that to recover from the problem, at least in the case of this repository, it will be sufficient for git-annex to avoid regrafting trees if the object is missing.
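
As an illustration of that kind of check (using git plumbing, not the actual git-annex code), a script could test for the tree object before attempting a graft. The function name tree_present is made up.

```shell
#!/bin/sh
# Sketch: check that a tree object is present before trying to graft it.
# Uses git plumbing only; not git-annex's implementation.

# usage: tree_present SHA [REPO_DIR]
# Exit status 0 if SHA resolves to a tree object that exists in the repository.
tree_present() {
    git -C "${2:-.}" cat-file -e "$1^{tree}" 2>/dev/null
}
```

Eg: tree_present f4815823941716de0f0fdf85e8aaba98d024d488 || echo "tree missing, skipping graft"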

I have done that, so I suppose this bug can be closed. I'd be more satisfied if I knew how this repository was produced though.

Comment by joey Fri Jun 7 20:25:27 2024

Fixed performTransitionsLocked to create the new git-annex branch atomically.

Found another way this could happen, interrupting git-annex export after it writes export.log but before it grafts the tree into the git-annex branch. Fixed that one too.

So hopefully this won't happen to any more repositories with these fixes. Still leaves the question of how to recover from the problem.

Comment by joey Fri Jun 7 17:59:43 2024