I would like to be able to limit the amount of space git-annex will use on a remote.
Obviously this would not limit the total amount of space used on whatever filesystem the remote is on, but would only apply to space used within the remote for the particular annex in question.
Many of the cloud storage providers that git-annex supports through special remotes have free tiers. For instance, Google offers 15 GB of free storage. If I have an annex with more than 15 GB of data, I don't want to just add a Google Drive special remote and start copying everything over. But it would be great to take advantage of that storage by adding a special remote to the annex and telling git-annex to use a maximum of 13 GB (leaving myself a 2 GB buffer, in case things get added to the Google Drive through some mechanism outside of the annex). I could then add the remote to a preferred content group like `backup` and run `git annex copy --auto --to gdrive`, which would copy everything to Google Drive unless transferring the next file would push the remote over 13 GB.
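The per-transfer check being asked for could be as simple as the arithmetic below. This is only a sketch of the decision, not anything git-annex does today; the 13 GB limit is the hypothetical per-remote setting, and the other numbers are made up for illustration:

```shell
# Hypothetical per-transfer check: skip any transfer that would push the
# remote past the configured limit. All sizes are illustrative.
limit=$((13 * 1024 * 1024 * 1024))   # hypothetical per-remote cap: 13 GiB
used=$((12 * 1024 * 1024 * 1024))    # space git-annex believes is used on the remote
next=$((2 * 1024 * 1024 * 1024))     # size of the next file to transfer

if [ $((used + next)) -le "$limit" ]; then
    echo "transfer"
else
    echo "skip"
fi
```

Here the 2 GiB file would not fit under the 13 GiB cap, so the check prints "skip".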
Currently I can see the size of a remote using `git annex info gdrive`, so git-annex appears to have the needed information.
This is sort of like `annex.diskreserve`, but more useful for special remotes, where setting an amount of space to keep free is not relevant.
This poses difficulties when several different git-annex repositories are uploading to the same special remote. They would have to somehow maintain a count of the space used on the remote, but the repositories could be completely disconnected from one another, able only to access the special remote. So the count would have to be stored on the remote, and updated atomically. But not all remotes support atomic operations well enough to implement such a distributed counter.
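To make the counter problem concrete, here is a sketch of the kind of atomic update that would be needed, using a local temporary directory as a stand-in for the remote and `mkdir` (which is atomic on POSIX filesystems) as the lock. Most special remote protocols offer nothing equivalent to this, which is the difficulty described above:

```shell
# Sketch of a remote-side usage counter with atomic updates. A local
# directory stands in for the remote; mkdir acts as the lock because
# creating a directory either succeeds or fails atomically.
remote=$(mktemp -d)
echo 0 > "$remote/used-bytes"

add_usage() {
    # Add $1 bytes to the shared counter, holding the lock directory.
    until mkdir "$remote/lock" 2>/dev/null; do sleep 1; done
    used=$(cat "$remote/used-bytes")
    echo $((used + $1)) > "$remote/used-bytes"
    rmdir "$remote/lock"
}

add_usage 1048576    # one repository records a 1 MiB upload
add_usage 2097152    # another repository records a 2 MiB upload
cat "$remote/used-bytes"
```

Without a primitive like that lock on the remote itself, two disconnected repositories uploading concurrently can both read the old count and overwrite each other's update.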
If the storage service has an API call to get the space used, then the special remote can use that. Unfortunately, I think such a thing can only be implemented that way, on a per-special-remote basis.
Is it not possible to implement this per-annex?

For instance, take the walkthrough example where you create an annex at `~/annex` and then clone it at `/media/usb/annex`. Take that setup and add a special remote -- say, a Google Drive remote called `annex-gdrive`. Now both `~/annex` and `/media/usb/annex` know which files are stored on the `annex-gdrive` remote. So we go into `~/annex`, set the maximum usable space for `annex-gdrive`, and upload a couple of files. Then we can go into `/media/usb/annex`, sync, and it will be able to determine the space still available on `annex-gdrive`, not through any special API but just by looking at the size of the files it knows are stored on that special remote.

In the case that we create a brand new annex `~/foo` and add a Google Drive special remote to it, `foo-gdrive`, that happens to use the same Google account as the `annex-gdrive` remote from the previous annex, I would not expect that new annex to know about the storage used for `annex-gdrive`. That would need some sort of special API from the provider.

I just want the annex `~/annex` to know about the storage used at `annex-gdrive`, and the annex `~/foo` to know about the storage used at `foo-gdrive`.

Hope that makes sense!
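The "just look at the sizes it knows about" computation is a simple sum over location-tracking data. The byte sizes below are simulated, but in a real repository they could plausibly come from something like `git annex find --in annex-gdrive --format='${bytesize}\n'` (hedging: check your git-annex version's find format variables):

```shell
# Sum the sizes of files believed to be present on the remote.
# The three sizes here are simulated stand-ins for the output of:
#   git annex find --in annex-gdrive --format='${bytesize}\n'
printf '%s\n' 1073741824 536870912 268435456 |
    awk '{ total += $1 } END { print total }'
```

Subtracting that total from the configured limit would give the space still available, from each clone's point of view, with no provider API involved.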
git-annex repositories can all be clones of one another without being currently connected. They can stay disconnected for an arbitrary amount of time, up to forever.

For example, /media/usb/annex can be packed up, sent halfway around the world, and used with a different computer that has no way to communicate with its parent ~/annex on the original computer.

So, git-annex has to deal with split-brain and other distributed inconsistencies. And it does. However, keeping a count of how much space is used on a remote can't be made consistent in these kinds of situations, unless, as I said, the count is kept in sync using the remote in question, rather than using information in the git-annex repositories.
I dunno. I get your point. However, I can also see this being a really useful feature. I can imagine it being implemented simply, with a warning/note mentioning that disconnected repos might overflow the limit (and perhaps that a buffer should be left, i.e. limit = 15 GB, allocate 13 GB). It would be up to the user to figure out what is right for them.
I'm sure plenty of people, whether using the remote alone, in a small team, or with linked-up repos (github/gitlab/whatever), would appreciate this option. I know I came to this page looking to see if it was possible already.
Thanks too for the hard work! Amazing software!
Implementing a hard limit under concurrent use isn't feasible, but providing a best-effort target size would still be useful.
It would fully work in all non-concurrent use cases, which are the majority in my case.
This would also let me know which drive I need to plug in when there is new data to be backed up, because git-annex would know how much each repo is able to store and how much it currently holds.
This obviously isn't perfect either (there might be other data on that drive: other annex repos, or entirely untracked data, for example), but it would still be good enough for me to make a manual decision.
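That manual decision could even be scripted over the same per-remote numbers. A sketch with made-up drive names, capacities, and usage (none of these figures come from git-annex; they stand in for the limit and tracked size described above):

```shell
# Pick the first drive with room for a 5 GiB backlog. The drive names,
# limits, and used-byte counts are invented for illustration.
pending=$((5 * 1024 * 1024 * 1024))

# columns: name  limit-bytes  used-bytes
printf '%s\n' \
    'usb-a 107374182400 105226698752' \
    'usb-b 214748364800 53687091200' |
while read -r name limit used; do
    if [ $((limit - used)) -ge "$pending" ]; then
        echo "$name"   # usb-a has only ~2 GiB free, so usb-b is chosen
        break
    fi
done
```

As noted, untracked data on the drive would throw the numbers off, but as a hint for which drive to grab it would usually be right.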