I'm currently in the process of gutting old (some broken) git-annex repositories and cleaning out download directories from before I started using git-annex.
To do this, I am running git annex import --clean-duplicates $PATH
on the directories I want to clear out, but sometimes this takes an unnecessarily long time.
For example, git-annex will calculate the digest of a huge file (30GB+) in $TARGET even though the annex contains no file of that size.
It's a common shortcut to check file sizes first, eliminating definite non-matches very quickly. Can this be added to git-annex's import
in some way, or is this a no-go due to the constant-memory constraint?
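For illustration, the shortcut might look something like this sketch in Python (the helper names and the choice of SHA-256 are assumptions for the example, not git-annex internals):

```python
import hashlib
import os

def sha256_of(path):
    # The expensive step we want to skip for definite non-matches.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def possible_duplicates(candidates, known_sizes):
    """Yield only the paths worth hashing.

    candidates:  paths in the directory being imported
    known_sizes: set of file sizes already present in the annex

    A 30GB file whose size matches nothing in the annex is rejected
    with a single stat() call instead of a full read.
    """
    for path in candidates:
        if os.path.getsize(path) in known_sizes:
            yield path
```

The catch is that a plain set of sizes grows with the number of keys, which is where the constant-memory concern comes in.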
This could be done in constant space using a bloom filter of known file sizes. Files with non-matching sizes would sometimes hit the filter, but that's no problem: git-annex would then just do the work it does now.
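As a concrete illustration (not git-annex's actual code, which is Haskell), a fixed-memory bloom filter over file sizes could look like this; the 128 KiB bit array and the 4 hash probes are arbitrary example parameters:

```python
import hashlib

class SizeBloom:
    """Fixed-memory set-like structure for file sizes.

    No false negatives: if `size in bloom` is False, no known file
    has that size, so the expensive digest can be skipped safely.
    False positives just mean doing the work git-annex does today.
    """

    def __init__(self, m_bits=1 << 20, k=4):  # 128 KiB regardless of key count
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, size):
        # Derive k bit positions from one SHA-256 digest of the size.
        digest = hashlib.sha256(str(size).encode()).digest()
        for i in range(self.k):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m

    def add(self, size):
        for p in self._positions(size):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, size):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(size))
```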
However, to build such a filter, git-annex would need to scan all the keys it knows about. This would take approximately as long to run as
git annex unused
does. It might make sense to only build the filter once it runs into a fairly large file. Alternatively, a bloom filter of file sizes could be cached and updated on the fly as things change (but this gets pretty complex).

Could this be tested out with an additional flag
--with-size-bloom
on import? It would then build a bloom filter (and use a cached one with --fast) and do the usual import.
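To picture the flow, here is a rough sketch of what such a flag plus --fast caching might do, reusing the SizeBloom sketch above. The cache path is made up for illustration, and the example lists sizes via git annex find's ${bytesize} format field, which only covers files whose content is locally present; a real implementation would scan every known key:

```python
import os
import pickle
import subprocess

CACHE = ".git/annex/size-bloom.cache"  # invented path, illustration only

def build_size_bloom():
    # git-annex keys embed the object's size; `git annex find` can print
    # it with the ${bytesize} format field (check your git-annex version).
    out = subprocess.run(
        ["git", "annex", "find", "--format=${bytesize}\\n"],
        capture_output=True, text=True, check=True).stdout
    bloom = SizeBloom()  # from the sketch above
    for line in out.splitlines():
        if line.isdigit():
            bloom.add(int(line))
    return bloom

def load_size_bloom(fast=False):
    # With --fast, reuse the cached filter instead of rescanning all keys.
    if fast and os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    bloom = build_size_bloom()
    with open(CACHE, "wb") as f:
        pickle.dump(bloom, f)
    return bloom

def worth_hashing(bloom, path):
    # False means no known key has this size: skip the digest entirely.
    # True may be a false positive: just do the normal duplicate check.
    return os.path.getsize(path) in bloom
```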
So I could do this: implement this behaviour in Perl with Bloom::Filter and let you know how it performs, if that would be useful to you?
git-annex import --from remote
has recently been sped up a lot, and the plan is to remove the legacy import-directory interface in favor of it, or to reimplement the legacy interface on top of it.

Using
git-annex import --from remote --fast
when there's a huge file in the directory remote will hash that file, but only once. On subsequent runs it will recognise the file it has seen before. So all that's needed to emulate --clean-duplicates is a way to do this:
Which doesn't work currently, but see drop from export remote.