It would be really useful to have something like SQL's LIMIT clause among the matching options.
Here is a scenario where it makes sense, and which actually comes up for me quite a bit:
You have 2TB of files, where none of the files are larger than 1TB. You want to transfer the entire 2TB annex to another location (we'll call it target), but you can only do it with a single 1TB hard drive.
By being able to run `git annex copy . -t <1tb drive> --in=here --not --in=target --limit=1TB`, I could ensure that an arbitrary group of files, totalling no more than 1TB, is copied to my transport drive. After I move the files from the transport drive to the target repo, I can repeat the process with the next 1TB.
I don't care which specific files are copied; I just want some subset of them copied, because they are part of a batch transfer job.
I often transfer files via media that have transfer limits, but since I am eventually going to transfer all of them, it doesn't matter which ones are selected first.
Currently, I've been using tricks to select a subset of the files, such as matching a range of file sizes.
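For example, git-annex's existing matching options `--largerthan` and `--smallerthan` can carve out one size band at a time. The same idea can be simulated outside git-annex with plain `find` on some throwaway files (the sizes and the temp directory here are just for illustration):

```shell
#!/bin/sh
set -eu

# Make five files of 1 KiB .. 5 KiB in a scratch directory.
dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    dd if=/dev/zero of="$dir/f$i" bs=1024 count=$i 2>/dev/null
done

# Select only the files in the 2 KiB..4 KiB window, the same way
# `git annex find --largerthan=2kb --smallerthan=4kb` would narrow
# a query to one size band of annexed files.
matches=$(find "$dir" -type f -size +2047c -size -4097c \
    -exec basename {} \; | sort | xargs)
echo "$matches"

rm -r "$dir"
```

Repeating this with successive size windows is how the "range of file sizes" trick approximates a batch of roughly bounded total size, but it gives no control over the actual sum.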
Well, git-annex will avoid copying files to the drive once it's full, so you don't really need to tell it the size of the drive.
It seems to me that your command would work without any new --limit option.
Thanks for your reply Joey.
There are other times when I want to limit the size of a query, besides when the filesystem of the target device is full.
I would be able to minimize the number of round trips required for transporting two or more repos on one hard drive. It would be great to have more control over the amount of space used on the remote, by passing a space limit to copy or move.
Beyond that, what would be really cool is to have something like --limitfiles and --limitsize, to constrain a query by the total number of files or by the total size, especially for certain cloud-based services. That is, cap the query to a total number of files, regardless of size, alongside --greaterthan and --lessthan.
So you want this limit option to make git-annex stop processing files once the files it's processed sum up to a given size, or after some number of files are transferred. Similar to --time-limit then.
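The stopping rule described above can be sketched outside git-annex; this is only a simulation of the proposed semantics (the `--limit` option doesn't exist), selecting files until the running total of what was actually selected would pass a cap:

```shell
#!/bin/sh
set -eu

# Candidate files of known sizes: 1 KiB, 2 KiB, ... 5 KiB.
dir=$(mktemp -d)
for i in 1 2 3 4 5; do
    dd if=/dev/zero of="$dir/file$i" bs=1024 count=$i 2>/dev/null
done

cap=$((6 * 1024))   # stop once the selected files would exceed 6 KiB
total=0
selected=""
for f in "$dir"/file*; do
    size=$(wc -c < "$f")
    if [ $((total + size)) -gt "$cap" ]; then
        break       # limit reached: skip the rest of this batch
    fi
    total=$((total + size))
    selected="${selected:+$selected }$(basename "$f")"
done
echo "selected: $selected ($total bytes)"

rm -r "$dir"
```

The key point, as noted below, is that only files the command actually acts on count toward the total; files skipped for other reasons (already present on the drive, say) must not consume the limit.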
It seems this would need to only count a file once git-annex decides to do something with it. Otherwise, a
git annex copy
would count all the files that are already on the drive as it scans through them, and stop too soon.

So this seems to need something that sits between the seek stage and the perform stage, which can see the key being processed. There's not currently a way to do that; it would need changes to the Command types, or would need to be manually inserted into every command that should do this. Either way, it needs changes to every command in git-annex to implement.
Hmm, if it's limited to commands that transfer files, it could be hooked into Annex.Transfer instead; that does know the key being transferred. It could refuse to do the transfer and report a failure, and git-annex would then display a failure to transfer each file after the limit. (I'm not very comfortable making that interface throw an exception, and anyway callers of it may already catch transfer exceptions.)