more efficient resolution of trivial export conflicts

Export conflicts happen whenever two repos export different trees to the same special remote.

Some conflicts that seem trivial currently involve unnecessary unexporting and re-uploading in their resolution.

For example, if A exports a tree containing [foo], and B exports a tree containing [foo, bar], bar gets unexported when resolving the conflict.

Bear in mind that A could have first exported a tree containing [foo, bar] then B exported a tree containing [foo, bar], and then A exported [foo], so unexported bar.

So, if one exported tree is a subset of the other, it's not necessary to unexport files added by the other tree. It's sufficient to check that files are present in the export and upload any that are missing. --Joey

I proved this is not a safe optimisation, so wontfix --Joey

RSS Atom

sponsoring?

I am running into this problem right now. I exported a movie collection from laptop A and then a huge collection of photos from laptop B to an USB drive. Both exports ran many hours. Now when I want to export the next collection of files, I see:

git annex export master --to usb-sg2

Resolving export conflict..

unexport usb-sg2 path/to/some/picture/file.jpg ok

Do you already have a solution for this in mind? Could I speed up the implementation with sponsorship? This is a hard blocker now for me in my consolidation of our data with git-annex. I would rather not spend many nights to keep exporting and unexporting the same files over and over.

Maybe I could hack this as a temporary workaround by manually changing the exported trees in the exports.log file? The "conflicting" exports are in two separate sub-trees.

Comment by thk — Thu Mar 19 08:40:31 2020

Remove comment

comment 2

I took a look at this, the relevant code is here:

                    warning "Resolving export conflict.."
                    forM_ ts $ \oldtreesha -> do
                            -- Unexport both the srcsha and the dstsha,
                            -- because the wrong content may have
                            -- been renamed to the dstsha due to the
                            -- export conflict.
                            let unexportboth d =
                                    [ Git.DiffTree.srcsha d
                                    , Git.DiffTree.dstsha d
                                    ]
                            -- Don't rename to temp, because the
                            -- content is unknown; delete instead.
                            mapdiff
                                    (\diff -> commandAction $ startUnexport r db (Git.DiffTree.file diff) (unexportboth diff))
                                    oldtreesha new

So, it diffs from the tree in the export conflict to the new, resolved tree. Any file that differs -- eg, any file that was involved in the conflict -- gets removed from the export.

In the todo, I said:

For example, if A exports a tree containing [foo], and B exports a tree containing [foo, bar], bar gets unexported when resolving the conflict.

Let's be more clear about the content of the trees, and say A exports [(foo, 1)] and B exports [(foo, 1), (bar, 1)].

If the export is then resolved to [(foo, 1), (bar, 1)], we can see nothing needs to be done. But what it currently does is diff from [(foo, 1)] to the resolution and so unexports bar.

If B had instead exported [(foo, 1), (bar, 2)], then it would still need to diff from that the the resolution, and so would unexport bar, and so it should.

But.. but.. ugh. Consider an export conflict that started with [(bar, 1)] exported. A exported [(bar, 2)] and B exported [(baz, 1)] (by renaming bar to baz). So the export remote might contain [(baz, 2)] (A uploaded 2 to bar, and then B renamed bar to baz) or it might contain [(bar, 2), (baz, 1)]; we do not know ordering between A and B.

If the export conflict resolution is [(bar, 2), (baz 1)] then the tree exported by B is a subset, so it skips that one. And the tree exported by A is a subset, so ummm... it skips that one. And so nothing gets unexported. Then, it proceeds to try to upload any missing files to the export. If the export remote contains [(bar, 2), (baz, 1)] nothing is missing, nothing gets uploaded, all is well. But if the export remote contains [(baz, 2)], it will upload (bar, 2), resulting in [(bar, 2), (baz 2)]. That is not what it's supposed to contain.

So, no, this optimisation will not work!

The only way to make this optimisation work, I think, is to not do renaming when updating export remotes. But file renaming is more common than export conflicts; you can always adjust your workflow to avoid export conflicts, by pulling from the remote you tend to conflict with, before performing an export.

Comment by joey — Mon May 4 17:30:59 2020

Remove comment

comment 3

Hmm, but.. What if both A and B's trees are subsets of the resolved tree? Safe then?

A exports [(foo, 1)], while B exports [(bar, 2)] the resolved tree is [(foo, 1), (bar, 2)].

Well, what was in the export before? Suppose it was [(foo, 2)].. Then B would have renamed foo to bar, and A exported 1 to foo. Order is unknown, so the export has either of [(foo, 1), (bar, 2)] or [(bar, 1)]

Yeah, still not safe even when both trees are subsets.

An optimisation like this needs some way to detect if there's been a rename like B keeps doing in these examples. If there has not been any rename, the optimisation is safe.

export.log contains only the sha of the tree that has been exported by each repo to the export remote. It might contains some trees that were exported before, but when it gets compacted, that information is lost, and anyway there's no way to know if B exported some old tree before or after A exported its most recently exported tree. So, I don't think retrospective rename detection is possible.

Comment by joey — Mon May 4 18:36:30 2020

Remove comment

Add a comment