Recent comments posted to this site:

FWIW: Yes! ;-)
Comment by yarikoptic Tue Jun 15 15:38:16 2021

git-annex fsck will detect this problem, but the real problem here is not that some edit got lost, but that you corrupted a version control object file. It's similar to editing a file in .git/objects/. Fsck will, when it notices the problem, move the corrupted object file to `.git/annex/bad/`. So your edits are not lost, but they are in danger of being forgotten.

Note that, once the modified version of the file from repo B replaces the worktree file, git annex fsck of that file won't check the old version, so it will not detect the problem. git annex fsck --all still will detect it.

git-annex mostly prevents this kind of problem by clearing the file's write bit, and putting the file in a directory whose write bit is also cleared.

To get into that situation, you have to either be running as root, or be using a program that goes out of its way to change multiple permissions.

(One example of such a program is vim: `:w!` will temporarily override both write bits.)
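
For illustration, here is how clearing both write bits might look using System.Directory; a hypothetical sketch, not git-annex's real code, which handles more cases (such as core.sharedRepository):

    import System.Directory

    -- Clear the write bit on an annex object file and on its
    -- containing directory, so ordinary writes and deletions fail.
    protect :: FilePath -> FilePath -> IO ()
    protect dir file = do
        fp <- getPermissions file
        setPermissions file fp { writable = False }
        dp <- getPermissions dir
        setPermissions dir dp { writable = False }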

Comment by joey Tue Jun 15 13:50:17 2021

Oh, there's a much better solution: if the annex object file already exists when ingesting a new file, skip populating the other associated files, since they will already have been populated. moveAnnex has to check if the annex object file already exists anyway, so this will have zero overhead.

(Maybe that's what yarik was getting at in comment #30)
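
A minimal sketch of the idea, with illustrative names and a simplified object layout (not git-annex's actual moveAnnex):

    import System.Directory (doesFileExist, renameFile)

    type Key = String

    -- Simplified object path; the real layout uses hashed subdirectories.
    objectPath :: Key -> FilePath
    objectPath key = ".git/annex/objects/" ++ key

    ingest :: Key -> FilePath -> IO ()
    ingest key src = do
        present <- doesFileExist (objectPath key)
        if present
            -- The object was ingested before, so every associated
            -- file was populated then; skip the whole list for free,
            -- since this existence check was needed anyway.
            then return ()
            else do
                renameFile src (objectPath key)
                populateAssociatedFiles key

    -- Stub standing in for writing the content into each unlocked
    -- associated file of the key.
    populateAssociatedFiles :: Key -> IO ()
    populateAssociatedFiles _ = return ()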

Implemented that, and here's the results, re-running my prior benchmark:

    run 1: 0:03.14
    run 2: 0:03.24
    run 3: 0:03.35
    run 4: 0:03.45
    run 9: 0:03.65

That also shows that the actual overhead of diffing the index, as its size grows, is quite small.

Comment by joey Tue Jun 15 13:01:04 2021

Some thoughts leading to a workable plan:

It's easy to detect this edge case, because getAssociatedFiles will be returning a long list of files. So it could notice, say, 10 files in the list and switch to doing something other than the usual, without burdening the usual case with any extra work.

Git starts to get slow anyway in the 1 million to 10 million file range. So we can assume fewer than that many files are being added. And there needs to be a fairly large number of duplicates of a key for speed to become a problem when adding that key: around 1000 based on the benchmarks above, but 100 would be a safer threshold.

If it's adding 10 million files, there can be at most 10000 keys that have >= 1000 duplicates (10 million / 1000). It's no problem to remember 10000 keys; a key is less than 128 bytes long, so that would take 1250 kb, plus the overhead of the Map. Might as well remember 12 mb worth of keys, enough to catch keys with 100 duplicates.
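
As a rough sketch of that bookkeeping, with illustrative names rather than git-annex's internals:

    import qualified Data.Map.Strict as M

    type Key = String

    -- Threshold at which a key counts as heavily duplicated.
    threshold :: Int
    threshold = 100

    -- Count adds of each key; the Bool says the key has now crossed
    -- the threshold, so later adds of it can skip re-populating its
    -- associated files. 100000 remembered keys at under 128 bytes
    -- each is roughly the 12 mb estimated above.
    seen :: Key -> M.Map Key Int -> (Bool, M.Map Key Int)
    seen k m =
        let n = 1 + M.findWithDefault 0 k m
        in (n >= threshold, M.insert k n m)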

It would be even better to use a bloom filter, which could remember many more keys, and I thought I had a way, but the false positive case seems to go the wrong way around. If the bloom filter remembers keys that have already had their associated files populated, then a false positive would prevent doing that for a key for which it has not been done.
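
To make the false-positive direction concrete, here is a toy bloom filter sketch (illustrative only, not a production data structure):

    import qualified Data.IntSet as S

    type Key = String

    bits :: Int
    bits = 1000000

    -- Four salted toy hashes into the bit array (illustrative only).
    hashes :: Key -> [Int]
    hashes key =
        [ abs (foldl (\h c -> h * 31 + fromEnum c) salt key) `mod` bits
        | salt <- [1 .. 4] ]

    -- Record that a key's associated files have been populated.
    record :: Key -> S.IntSet -> S.IntSet
    record key f = foldr S.insert f (hashes key)

    -- True is supposed to mean "already populated", but a false
    -- positive also answers True for a key that was never recorded,
    -- wrongly skipping a populate that is actually needed.
    alreadyPopulated :: Key -> S.IntSet -> Bool
    alreadyPopulated key f = all (`S.member` f) (hashes key)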

It would make sense to do this not only in populateUnlockedFiles but also in Annex.Content.moveAnnex and Annex.Content.removeAnnex. Although removeAnnex would need a different bloom filter, since a file might be populated and then somehow removed in the same git-annex call.

Comment by joey Mon Jun 14 20:26:52 2021

git annex find, like all git-annex commands except for add, skips over non-annexed files.

What you can do is get a list of all annexed files:

    git annex find --include '*' | sort > annexed-files

And get a list of all files git knows:

    git -c core.quotepath=off ls-files | sort > all-files

And then find files that are in the second list but not the first:

    comm -1 -3 annexed-files all-files

Comment by joey Mon Jun 14 18:38:13 2021

I discussed that approach in comment #24.

Comment by joey Mon Jun 14 18:36:11 2021

It would be more beneficial to speed up that scanning (reconcileStaged), which should be doable by using the git cat-file --batch trick.

That got implemented.
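
For illustration, a minimal sketch of that trick, assuming a file named somefile and a working directory inside a git repository: one long-lived git cat-file --batch process answers many queries over a pipe, instead of one git fork per object.

    import Control.Monad (replicateM)
    import System.IO
    import System.Process

    main :: IO ()
    main = do
        (Just hin, Just hout, _, _) <- createProcess
            (proc "git" ["cat-file", "--batch"])
                { std_in = CreatePipe, std_out = CreatePipe }
        hSetBuffering hin LineBuffering
        -- ":path" asks for the staged (index) version of a file,
        -- which is what reconcileStaged needs to examine.
        hPutStrLn hin ":somefile"
        header <- hGetLine hout  -- "<sha> <type> <size>" or "... missing"
        case words header of
            [_, _, size] -> do
                content <- replicateM (read size) (hGetChar hout)
                putStr content
            _ -> putStrLn ("no such object: " ++ header)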

Comment by joey Mon Jun 14 18:33:43 2021

Maybe it's better to not tie this directly in to fsck. Another way would be:

    git annex untrust foo --after=100days

The first time this is run, it would record that the trust level will change to untrust after 100 days. The next time it's run, it would advance the timeout.

So, you could run whatever fsck or other checks make you still trust the repo, and then run this again.

Implementation would, I guess, need a separate future-trust.log in addition to trust.log. When loading trust levels, if there is a value in future-trust.log with a newer timestamp than the value in trust.log, and enough time has passed, use it instead of the value from trust.log. That way it avoids breaking older git-annex versions with changes to trust.log.
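
A rough sketch of that lookup logic, with illustrative types (the real logs have their own format):

    import Data.Time.Clock.POSIX (POSIXTime)

    data TrustLevel = Trusted | SemiTrusted | UnTrusted | Dead
        deriving Show

    -- A logged trust value and the time it was recorded.
    data Entry = Entry { level :: TrustLevel, stamp :: POSIXTime }

    -- Prefer the future-trust.log entry when it is newer than the
    -- trust.log entry and its delay (e.g. 100 days) has elapsed;
    -- otherwise fall back to trust.log, or the default level.
    effectiveTrust :: POSIXTime -> Maybe Entry -> Maybe (Entry, POSIXTime) -> TrustLevel
    effectiveTrust now cur fut = case fut of
        Just (f, delay)
            | maybe True (\c -> stamp f > stamp c) cur
            , now >= stamp f + delay -> level f
        _ -> maybe SemiTrusted level cur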

There's no need to change what's in trust.log, although it could be changed once the new level takes effect, which would also let older git-annex versions learn about the change in trust.

Comment by joey Mon Jun 14 18:20:06 2021

Tried to implement this, but ran into a problem adding FsckAll: if it only logs FsckAll and not also Fsck, then an old git-annex expire will see the FsckAll, not understand it, and treat it as no activity, so it expires. (I did fix git-annex now so that an unknown activity is not treated as no activity.)

And, the way recordActivity is implemented, it removes previous activities, and adds the current activity. So a FsckAll followed by a Fsck would remove the FsckAll activity.

That could be fixed so that both are logged, but old git-annex would probably not be able to parse the result. And if an old git-annex were then used to do a fsck, it would log Fsck and remove the previously added FsckAll.

So, it seems this will need to use some log other than activity.log to keep track of fsck --all.

Comment by joey Mon Jun 14 17:56:23 2021

> The same needs to also hold true for unlocked files, and so it has to check if foo is an unlocked pointer to K and populate the file with the content.

But that should only need to be done if K became present/known when it was not present before. If K is already known (e.g. it was just added for another file, or maybe was added in a previous "commit"), no such checks are needed, since those files can be expected to already be populated. Right?

Comment by yarikoptic Mon Jun 14 17:36:16 2021