Please describe the problem.
When getting a lot of objects from an external special remote of the exporttree kind,
sometimes the EXPORT
command is sent to a different process of the external remote
than the following TRANSFEREXPORT
command.
As a result, one process receives two EXPORT
s in a row, the first of which it ignores,
while some other process receives a TRANSFEREXPORT
without a prior EXPORT
,
reusing whatever filename was set in the previous transaction,
and either ovewriting the last accessed remote file with the wrong content,
or retrieving the content of the last accessed remote file instead of the one git-annex wanted.
Or I misunderstood the protocol - please tell me if that's the case.
What steps will reproduce the problem?
I don't have a minimal reproducing example.
I was working on adding tree export support to git-annex-remote-rclone, and its the testsuite on my branch can reproduce the problem around half of the time, but only on MacOS. It seems like it's a race condition and the interleaving that triggers it doesn't happen on Linux
Here's an example test run when it fails on github
and the test code - on github (I'd post it as an attachment but I'm getting an error that only admin can upload attachments).
What version of git-annex are you using? On what operating system?
git-annex version: 10.20230227 OS: macOS 12.6.3 21G419
Please provide any additional information below.
Full log is quite long so I put it in a github gist. I wanted to upload it as an attachment but apparently I'm not allowed.
Here are the interesting parts:
First, during the tree export, we see an example of correct upload of the test 9.dat
and test 10.dat
files:
2023-03-26T16:34:52.2551390Z [2023-03-26 16:34:52.085377] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[3] <-- EXPORT test 9.dat
2023-03-26T16:34:52.2653060Z [2023-03-26 16:34:52.086085] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[3] <-- TRANSFEREXPORT STORE SHA256E-s7--a30e9d26ce633b40da3ba8cd8806b9b349cb6c3de6816c6d64f85a571012a973.dat .git/annex/objects/zZ/kw/SHA256E-s7--a30e9d26ce633b40da3ba8cd8806b9b349cb6c3de6816c6d64f85a571012a973.dat/SHA256E-s7--a30e9d26ce633b40da3ba8cd8806b9b349cb6c3de6816c6d64f85a571012a973.dat
[...]
2023-03-26T16:34:49.9382650Z [2023-03-26 16:34:49.629684] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[3] <-- EXPORT test 10.dat
2023-03-26T16:34:49.9384770Z [2023-03-26 16:34:49.629761] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[3] <-- TRANSFEREXPORT STORE SHA256E-s8--d5624694cf1515bdf8c3a648ae35d64f4bdff934800a22a070c6d0baddc120b0.dat .git/annex/objects/6k/X3/SHA256E-s8--d5624694cf1515bdf8c3a648ae35d64f4bdff934800a22a070c6d0baddc120b0.dat/SHA256E-s8--d5624694cf1515bdf8c3a648ae35d64f4bdff934800a22a070c6d0baddc120b0.dat
so we know that key of test 9.dat
is a30e...973.dat
, and for test 10.dat
it's d56...0b0.dat
Then, during the get, we see the bug happen:
2023-03-26T16:34:55.4392510Z [2023-03-26 16:34:54.534833] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[5] --> TRANSFER-SUCCESS RETRIEVE SHA256E-s8--fbb0a327e1528cdb3214abb2ec3fb1dd97cc39dfc12751df102eca019f602e73.dat
2023-03-26T16:34:55.4394480Z [2023-03-26 16:34:54.53532] (Annex.Perms) freezing content .git/annex/objects/45/Gp/SHA256E-s8--91d9183c1a8f61526ed68c5357d52a719481ccbba5039f815cf8f71ae4edbf24.dat/SHA256E-s8--91d9183c1a8f61526ed68c5357d52a719481ccbba5039f815cf8f71ae4edbf24.dat
2023-03-26T16:34:55.4395900Z [2023-03-26 16:34:54.545524] (Annex.Branch) read b51/a3b/SHA256E-s7--a30e9d26ce633b40da3ba8cd8806b9b349cb6c3de6816c6d64f85a571012a973.dat.log
2023-03-26T16:34:55.4397280Z [2023-03-26 16:34:54.545836] (Annex.Perms) freezing content directory .git/annex/objects/45/Gp/SHA256E-s8--91d9183c1a8f61526ed68c5357d52a719481ccbba5039f815cf8f71ae4edbf24.dat
2023-03-26T16:34:55.4398630Z [2023-03-26 16:34:54.54609] (Annex.Branch) read 253/975/SHA256E-s8--91d9183c1a8f61526ed68c5357d52a719481ccbba5039f815cf8f71ae4edbf24.dat.log
2023-03-26T16:34:55.4399950Z [2023-03-26 16:34:54.5477] (Annex.Branch) set 253/975/SHA256E-s8--91d9183c1a8f61526ed68c5357d52a719481ccbba5039f815cf8f71ae4edbf24.dat.log
2023-03-26T16:34:55.4401900Z [2023-03-26 16:34:54.558828] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[2] --> TRANSFER-SUCCESS RETRIEVE SHA256E-s7--b163189df1a08d63ba85e1566799cf07ca2de3865ac97a85b2b2d0cbfbd9a2f3.dat
2023-03-26T16:34:55.4403360Z [2023-03-26 16:34:54.559391] (Annex.Branch) read c9a/21e/SHA256E-s8--128888bb8975fc51a7fd410322b088593e158b37a24973483da2ad17fb6d7ff5.dat.log
2023-03-26T16:34:55.4405510Z [2023-03-26 16:34:54.561866] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[2] <-- EXPORT test 9.dat
2023-03-26T16:34:55.4407810Z [2023-03-26 16:34:54.561942] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[2] <-- EXPORT test 10.dat
2023-03-26T16:34:55.4409480Z [2023-03-26 16:34:54.561946] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[5] <-- TRANSFEREXPORT RETRIEVE SHA256E-s7--a30e9d26ce633b40da3ba8cd8806b9b349cb6c3de6816c6d64f85a571012a973.dat .git/annex/tmp/SHA256E-s7--a30e9d26ce633b40da3ba8cd8806b9b349cb6c3de6816c6d64f85a571012a973.dat
2023-03-26T16:34:55.4411390Z [2023-03-26 16:34:54.562032] (Annex.ExternalAddonProcess) /Users/runner/work/git-annex-remote-rclone/git-annex-remote-rclone/git-annex-remote-rclone[2] <-- TRANSFEREXPORT RETRIEVE SHA256E-s8--d5624694cf1515bdf8c3a648ae35d64f4bdff934800a22a070c6d0baddc120b0.dat .git/annex/tmp/SHA256E-s8--d5624694cf1515bdf8c3a648ae35d64f4bdff934800a22a070c6d0baddc120b0.dat
As you can see in the last 4 lines:
- process 2 got
EXPORT test 9.dat
and then immediatelyEXPORT test 10.dat
- then process 5 got the
TRANSFEREXPORT
command for thetest 9.dat
file (a30..973.dat
)- process 5 did not, prior to this transfer, receive any
EXPORT
command since its last transfer of an unrelated file (seeTRANSFER-SUCCESS
in the first quoted line). - so it was still holding the old filename from that unrelated transfer
- process 5 did not, prior to this transfer, receive any
- then, process 2 receives a (correct)
TRANSFEREXPORT
for thetest 10.dat
file (d56...0b0.dat
).
It appears that the EXPORT test 9.dat
should've been sent to process 5 instead of process 2.
But since it wasn't, process 5 retrieved the wrong file into what git-annex expected to be test 9.dat
.
Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders)
I really wanted to use git annex for my media storage, it seems like the project I've been looking for for years.
But I first needed to be able to exporttree over FTP. (I have a cheap NAS is terribly slow at encryption (SSH, etc) so I'm using FTP to get reasonable speeds with it.) So I thought I'd implement it and then hit this bug.
Looking at the code, handleRequestExport first uses withExternalState to send EXPORT, and then it calls handleRequestKey to send the following command. That uses withExternalState a second time.
withExternalState operates on a pool of processes, so in a race (when using -J presumably, as your test case does), the two calls to it can use different externals. And so the bug.
Was easy to fix based on that analysis. I have not tried to reproduce it, but am confident that I fixed it in 02662f52920e84cd9464641ada84f6c3bbe3f86a
In your implementation, you're keeping EXPORT set after handling TRANSFEREXPORT and similar commands, and so if you receive a TRANSFEREXPORT not prefixed by an EXPORT, it can be bad. A natural way to write things, indeed
doc/special_remotes/external/example.sh
does the same.An alternative implementation would be to clear the EXPORT after handling a TRANSFEREXPORT, CHECKPRESENTEXPORT, REMOVEEXPORT, or RENAMEEXPORT. And have those error out if no EXPORT was received before them.
But, that does not fully avoid the data loss problem. Because this might happen:
So even ignoring the first EXPORT can result in the external special remote doing the wrong thing. To fully guard against it, you'd have to error out if multiple EXPORT were received before a request that uses one, rather than ignoring the first EXPORT.
So, fixing this in your remote and others would need significant defensive coding. Too much to make sense, IMHO. I think git-annex needs to chnge the protocol in some way instead, to make it easy for you to avoid speaking to a buggy git-annex.
What about simply increasing the protocol version number? If VERSION 2 is the same as VERSION 1, but only supported by the fixed git-annex, then an external can just be updated to send VERSION 2, and it does not need to worry about talking with a buggy version of git-annex.
git-annex could either continue to also support VERSION 1, or it could refuse to work with externals that don't use VERSION 2. The latter would force externals to get updated. But then those externals would have no way to work with old git-annex even if they wanted to. I think forcing an update is not called for. So git-annex will keep supporting both versions.
I've implemented support for
VERSION 2
. Recommend any external special remotes that support exporttree=yes use it to avoid talking with a buggy git-annex version.(See better message for external special remote protocol mismatch for related todo.)
Hmm but a remote only learns that it's being used with exporttree=yes after it has sent a
VERSION
?I'm afraid I can't bump the version in case of git-annex-remote-rclone, it's already widely used for non-exporttree scenarios and requiring a git-annex that supports
VERSION 2
for these existing usecases (which aren't affected by this bug) would be a regression...I'll ask git-annex-remote-rclone maintainers, but it seems to me that defensive coding around the
EXPORT
command is gonna be a better solution in this remote's case.What if I made git-annex send "EXTENSIONS PROTOCOLVERSION2"? Then you could reply to EXPORTSUPPORTED with EXPORTSUPPORTED-FAILURE when used by a buggy git-annex.
So basically I'd reply to it with
EXTENSIONS PROTOCOLVERSION2
, and then set a flag that supporting export is ok? AndEXTENSIONS
always comes beforeEXPORTSUPPORTED
? Sounds like a very ad-hoc use of the extension mechanism... I think it'd work but I'd feel bad if you made a release with such a feature and then it turned out it doesn't work... Do you have some sort of per-releases?You would not need to send back
EXTENSIONS PROTOCOLVERSION2
yourself, only check if git-annex sends it. A better name might be EXPORTBUGFIXED, it's only using the existing extensions mechanism as a flag to communicate that git-annex has the bug fixed.Some old versions of git-annex don't send EXTENSIONS at all (and those also don't have the bug fixed), but if git-annex does send it, it will always be before EXPORTSUPPORTED.