This suggestion comes from being surprised by the behaviour of "import --skip-duplicates", which copies files instead of moving them and leaves the source directory untouched (the description implies it will just leave duplicates alone).
Apologies for the brevity, I've already typed this out once..
"import" has several behaviours which can be controlled through some options, but they don't cover all wanted behaviours. This suggestion is for an alternative interface to control these behaviours, totally stolen from rsync :P
# create symlinks (s), inject content (i) and delete from source (d)
# duplicate (D) and new (N) files
git annex import --mode=Dsid,Nsid $src # (default behaviour)
git annex import --mode=Dsi,Nsi $src # --duplicate
git annex import --mode=Dd,Nsid $src # --deduplicate
git annex import --mode=Nsi $src # --skip-duplicates
git annex import --mode=Dd $src # --clean-duplicates
git annex import --mode=Did,Nsid $src # (import new, reinject duplicate.. really want this!)
git annex import --mode=Ns $src # (just creates symlinks for new)
git annex import --mode=Nsd $src # (invalid mode due to data loss)
git annex import --mode=Nid $src # (invalid or require --force)
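To make the proposed notation concrete, here is a minimal sketch of how such a mode string could be parsed and checked. It is only an illustration of the suggestion, not git-annex code: the names parse_mode and check_mode are hypothetical, and the safety rules are inferred from the two "invalid" examples above (treating any N mode that deletes without injecting as data loss is an assumption that generalises the Nsd case).

```python
from typing import Dict, FrozenSet

ACTIONS = frozenset("sid")  # s = create symlink, i = inject content, d = delete from source
CLASSES = frozenset("DN")   # D = duplicate files, N = new files


def parse_mode(mode: str) -> Dict[str, FrozenSet[str]]:
    """Parse a string such as 'Dsid,Nsid' into {'D': {...}, 'N': {...}}."""
    parsed: Dict[str, FrozenSet[str]] = {}
    for part in mode.split(","):
        if not part or part[0] not in CLASSES:
            raise ValueError(f"bad file class in {part!r}")
        cls, actions = part[0], part[1:]
        if cls in parsed:
            raise ValueError(f"file class {cls!r} given twice")
        if not actions or not set(actions) <= ACTIONS:
            raise ValueError(f"bad actions in {part!r}")
        parsed[cls] = frozenset(actions)
    return parsed


def check_mode(mode: str, force: bool = False) -> Dict[str, FrozenSet[str]]:
    """Parse a mode and reject the combinations marked invalid above."""
    parsed = parse_mode(mode)
    new = parsed.get("N", frozenset())
    # Assumption: deleting a new file without injecting it loses its content (e.g. Nsd).
    if "d" in new and "i" not in new:
        raise ValueError("N: deleting without injecting would lose data")
    # Assumption: inject + delete without a symlink (Nid) is only allowed when forced.
    if {"i", "d"} <= new and "s" not in new and not force:
        raise ValueError("N: inject and delete without a symlink needs force")
    return parsed
```

With these rules, check_mode("Did,Nsid") is accepted, check_mode("Nsd") is rejected, and check_mode("Nid") is rejected unless force=True is passed.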
Current thinking is in "remove legacy import directory interface". This old todo is redundant, so wontfix. --Joey
This TODO (and "reinject --known") would then be:
Bearing in mind that I would have to support all of the resulting combinatorial explosion, and that several combinations don't make sense, or are unsafe, or seem useless, I think I'd rather keep it limited to well-selected points from the space.
I've fixed the description of --skip-duplicates to match its behavior. I don't know if there's a good motivation for it not deleting the files it does import. I'd almost rather have thought that was a bug in the implementation, but the implementation explicitly copies rather than moves files for --skip-duplicates, so that does seem to have been done intentionally. In any case, --clean-duplicates can be run after it to delete dups, I suppose.

An implementation of --mode=Did,Nsid seemed worth adding at first, perhaps as --reinject-duplicates. But thinking about it some more, that would be the same as:
The first command moves all known files into the annex, which leaves only non-duplicate files for the second command to import.
The only time I can think of that this might not be suitable is if /path is getting new files added to it while the commands run... But in that case you can mkdir /path/toimport; mv /path/* /path/toimport and then run the 2 commands on /path/toimport/*
--mode=Did,Nsid would be quite a bit faster because it wouldn't hash the files twice, which is an advantage this suggestion has over any multiple command alternative.
If you want to keep it to certain points in space rather than deal with all combinations, you could whitelist which ones are acceptable and people can request more to be whitelisted as they discover use cases for those modes. The current commands would alias to the modes (which would also make their behaviour obvious if this alias is mentioned in the documentation).
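A rough sketch of what such a whitelist with aliases might look like follows. The table is taken from the examples at the top of this page; everything else (the names OPTION_ALIASES, ALLOWED_MODES, canonical and resolve_mode, and the idea of normalising the order of letters before comparing) is hypothetical and only meant to illustrate the proposal, not how git-annex is implemented.

```python
from typing import Dict, Optional

ORDER = "sid"  # canonical order of the action letters

# Existing options expressed as modes, as listed in the suggestion (None = default).
OPTION_ALIASES: Dict[Optional[str], str] = {
    None: "Dsid,Nsid",
    "--duplicate": "Dsi,Nsi",
    "--deduplicate": "Dd,Nsid",
    "--skip-duplicates": "Nsi",
    "--clean-duplicates": "Dd",
}

# Whitelist: the aliased modes plus any extra modes accepted on request.
ALLOWED_MODES = set(OPTION_ALIASES.values()) | {"Did,Nsid"}


def canonical(mode: str) -> str:
    """Normalise a mode so that 'Nsid,Dd' and 'Dd,Ndis' both become 'Dd,Nsid'."""
    parts: Dict[str, str] = {}
    for part in mode.split(","):
        cls, actions = part[:1], part[1:]
        if (cls not in ("D", "N") or cls in parts
                or not actions or not set(actions) <= set(ORDER)):
            raise ValueError(f"malformed mode component {part!r}")
        parts[cls] = "".join(a for a in ORDER if a in actions)
    return ",".join(cls + parts[cls] for cls in "DN" if cls in parts)


def resolve_mode(option: Optional[str] = None, mode: Optional[str] = None) -> str:
    """Return the whitelisted mode selected by an option or an explicit mode string."""
    if option is not None and mode is not None:
        raise ValueError("give either an option or a mode, not both")
    if mode is None:
        if option not in OPTION_ALIASES:
            raise ValueError(f"unknown option {option!r}")
        return OPTION_ALIASES[option]
    wanted = canonical(mode)
    if wanted not in ALLOWED_MODES:
        raise ValueError(f"mode {wanted!r} is not whitelisted")
    return wanted
```

Here resolve_mode(option="--skip-duplicates") gives "Nsi", resolve_mode(mode="Nsid,Did") gives "Did,Nsid", and a mode outside the whitelist such as "Ns" is refused until someone makes the case for adding it.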
Actually, import --deduplicate, --skip-duplicates, --clean-duplicates are implemented naively and do hash files twice. So it's the same efficiency..
But, I just finished a more complicated implementation that avoids the second hashing.
That does make the combined action worth adding, I suppose. Done so as --reinject-duplicates.
I feel that the problem with this idea is that the suggested actions "create symlinks (s), inject content (i) and delete from source (d)" are only an approximation of how import is implemented. If they perfectly matched the implementation, then import could treat them as a DSL and simply evaluate the expression to do its work. But it's not that simple. For one thing, --deduplicate and --clean-duplicates don't simply "delete from source" the duplicates; they first check that numcopies can be satisfied. The default import behavior doesn't "sid", in fact it moves from source to the work tree (thus implicitly deleting from source first), then injects, and then creates the symlink. Everything has dependencies and interrelationships, and the best way I've found to express that so far is as the Haskell code in Command/Import.hs.
Even exposing that interface and using the current implementation for particular canned expressions seems risky; exposing imperfect abstractions can shoot you in the foot later when something under the abstraction needs to change.
So I'd rather improve the documentation for git-annex import if it is unclear. Not opposed to finding a way to work these "Dsid,Nsid" summaries into the documentation.