I knew that making git annex export handle renames efficiently would take a whole day somehow.
Indeed, thinking it over, it is a seriously super hairy thing. Renames can swap contents between two or more files, and so temp files are needed. It has to handle cleaning up temp files after interrupted exports, which may be resumed with the same or a different tree. It also has to recover from export conflicts, which could cause the wrong content to be renamed to a file.
I think I've thought through everything and found a way to deal with it all. Here's how it looks in operation swapping two files:
git annex export master --to dir
rename bar -> .git-annex-tmp-content-SHA256E-s30--472b01bf6234c98ce03d1386483ae578f6e58033974a1363da2606f9fa0e222a ok
rename foo -> .git-annex-tmp-content-SHA256E-s4--b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c ok
rename .git-annex-tmp-content-SHA256E-s4--b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c -> bar ok
rename .git-annex-tmp-content-SHA256E-s30--472b01bf6234c98ce03d1386483ae578f6e58033974a1363da2606f9fa0e222a -> foo ok
(recording state in git...)
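The two-phase pattern visible in that output can be sketched roughly as follows. This is only an illustration in Python, not git-annex's actual code (git-annex is written in Haskell). The names `plan_renames` and `apply_renames` are made up for the sketch, trees are modelled as plain dicts from filename to key, and it assumes both trees hold the same keys with each key at exactly one filename.

```python
TEMP_PREFIX = ".git-annex-tmp-content-"

def plan_renames(old_tree, new_tree):
    """Plan renames that turn old_tree into new_tree.

    Both trees are dicts mapping filename -> key (hypothetical model).
    Assumes each key appears at exactly one filename in each tree.
    """
    new_loc = {key: name for name, key in new_tree.items()}
    phase1, phase2 = [], []
    for name, key in sorted(old_tree.items()):
        dst = new_loc.get(key)
        if dst is None or dst == name:
            continue  # content is gone or did not move; not a rename
        tmp = TEMP_PREFIX + key
        phase1.append((name, tmp))  # first move content out of the way
        phase2.append((tmp, dst))   # then move it to its final name
    # All moves to temp names happen before any move to a final name,
    # so a swap never overwrites content that is still needed.
    return phase1 + phase2

def apply_renames(tree, ops):
    """Simulate the renames on a copy of tree, refusing to overwrite."""
    tree = dict(tree)
    for src, dst in ops:
        if dst in tree:
            raise ValueError(f"rename would overwrite {dst}")
        tree[dst] = tree.pop(src)
    return tree
```

Because the temp name is derived from the key, a leftover temp file from an interrupted run still identifies the content it holds, which is presumably what makes cleanup and resuming tractable.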
The export todo list is only getting longer, but the branch may be close to being merged.
Today's work was supported by the NSF-funded DataLad project.
Looking forward to the export support!
Just curious (and I apologize in advance for any nightmares induced) did you consider the case of four files, A–D, where (content-wise) A=B and C=D and the change is to swap A/C and B/D? That'd potentially be an issue, since four files want to share two temporary names. Unless of course it's all only done with pairwise swaps.
I did not consider such a case. However, it's closely related to exporting files with the same content being inefficient: there's a move operation but no copy operation. I might add a copy operation eventually, unsure.
If a copy operation is added, then that rename case can be handled more efficiently, by moving to the single temp file and copying. It might still involve the special remote doing more work than strictly necessary, though, depending on how it implements copy.
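To make the idea concrete, here is a rough Python sketch of the four-file case with a copy operation. It is purely hypothetical: no copy operation exists yet, the op names ("move", "copy", "remove") and the functions `plan_with_copy` and `apply_ops` are invented for illustration, and removing the extra old copies before reusing their names is an assumption of the sketch rather than anything decided.

```python
from collections import defaultdict

TEMP_PREFIX = ".git-annex-tmp-content-"

def names_by_key(tree):
    """Group filenames by the key (content) they hold."""
    by_key = defaultdict(set)
    for name, key in tree.items():
        by_key[key].add(name)
    return by_key

def plan_with_copy(old_tree, new_tree):
    """Plan ops using one temp file per key plus a hypothetical copy."""
    old_names, new_names = names_by_key(old_tree), names_by_key(new_tree)
    phase1, phase2 = [], []
    for key in sorted(old_names):
        leaving = sorted(old_names[key] - new_names.get(key, set()))
        arriving = sorted(new_names.get(key, set()) - old_names[key])
        if not leaving or not arriving:
            continue  # no rename needed for this key; out of scope here
        tmp = TEMP_PREFIX + key
        phase1.append(("move", leaving[0], tmp))
        # Assumption: other old copies are simply removed to free names.
        phase1.extend(("remove", name) for name in leaving[1:])
        phase2.append(("move", tmp, arriving[0]))
        phase2.extend(("copy", arriving[0], name) for name in arriving[1:])
    return phase1 + phase2

def apply_ops(tree, ops):
    """Simulate the ops on a copy of tree."""
    tree = dict(tree)
    for op in ops:
        if op[0] == "remove":
            del tree[op[1]]
        elif op[0] == "move":
            tree[op[2]] = tree.pop(op[1])
        elif op[0] == "copy":
            tree[op[2]] = tree[op[1]]
    return tree
```

For the A=B, C=D example this yields two moves to temp names, two removals, two moves to final names, and two copies. Whether the copies are cheap would depend on the special remote.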
Anyway, if the user is exporting copies of files, they're probably going to care more about that being somewhat more efficient than about renames of pairs of those copies being optimally efficient.
Handling it fully optimally, with only one temp file per key, would require analyzing the change and finding pairs of renames that swap filenames and handling each pair in turn. I suppose that is doable, it just needs a better data structure than I have now. I've added a note to my todo list and the design document, but no promises.
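The pair-finding approach could look something like the Python sketch below. Again this is an illustration under assumptions, not a design: `find_swap_pairs`, `swap_ops`, and `apply_renames` are hypothetical names, trees are dicts from filename to key, and the quadratic scan stands in for whatever better data structure a real implementation would use.

```python
TEMP_PREFIX = ".git-annex-tmp-content-"

def find_swap_pairs(old_tree, new_tree):
    """Find pairs of filenames whose contents are exchanged."""
    pairs, used = [], set()
    names = sorted(old_tree)
    for x in names:
        if x in used:
            continue
        for y in names:
            if y <= x or y in used:
                continue
            if (old_tree[x] != old_tree[y]
                    and new_tree.get(x) == old_tree[y]
                    and new_tree.get(y) == old_tree[x]):
                pairs.append((x, y))
                used.update((x, y))
                break
    return pairs

def swap_ops(pairs, old_tree):
    """Emit three renames per pair, so one temp file exists at a time."""
    ops = []
    for x, y in pairs:
        tmp = TEMP_PREFIX + old_tree[x]
        ops += [(x, tmp), (y, x), (tmp, y)]
    return ops

def apply_renames(tree, ops):
    """Simulate the renames on a copy of tree, refusing to overwrite."""
    tree = dict(tree)
    for src, dst in ops:
        if dst in tree:
            raise ValueError(f"rename would overwrite {dst}")
        tree[dst] = tree.pop(src)
    return tree
```

With the four-file example, A is paired with C and B with D, and each pair is finished before the next starts, so the two pairs never compete for the same temp name.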