Hi everyone,
I need some suggestions on how best to operate git-annex in my setup.
I need git-annex mainly for its ability to have directories of all my data on all my nodes, not for the data redundancy it can provide. I have one node that contains two filesystems that I want to merge into one git-annex repository. One filesystem (let's call it SAFE) is on top of a RAID1 between two 1TB HDs. The other (BIG) is on top of a 3TB HD. SAFE holds data I do not want to lose (like digital pictures). BIG holds data that I can lose.
I do not have enough disk space on other nodes to get rid of the RAID1.
This is how I mount my filesystems:
SAFE at ~/AllData/
BIG at ~/AllData/bigfiles/
The root of the git repository is at ~/AllData/. However, when I run `git annex add ~/AllData/bigfiles/file1`, it says: add bigfiles/file1 failed
I assume that is because file1 is on a different filesystem.
Do I have to create two repositories, one for each filesystem, or do you have any ideas on how to use git-annex best in this scenario? Having two repositories also has the disadvantage that I need two repositories on all other nodes, am I right?
Thanks for your suggestions
git-annex stores the contents of files inside `.git/annex/objects`. The `git annex add` is failing because it cannot `rename()` the file into that directory, because it is on a different filesystem. Even if it did a more expensive move of the file, it would not do what you want, because all the files would be moved to the `.git/annex/objects` directory, which is stored on your smaller drive.

The way git-annex is intended to be used with multiple drives is this:

- `git annex sync` to keep the git repositories in sync. (Or do it manually with `git pull`.)
- `git annex get $file` to get it.
- `git annex drop` or `git annex move` to flush files out to the other drive(s).

The walkthrough goes through an example of adding a removable USB drive this way, but you can do the same thing for non-removable drives.
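A minimal sketch of that workflow, assuming one repository per drive; the paths and remote names here are invented for illustration, not taken from the thread:

    # set up the first repository on the small (RAID) drive
    cd /mnt/safe/annex
    git init
    git annex init "SAFE"

    # clone it onto the big drive and link the two as remotes
    git clone /mnt/safe/annex /mnt/big/annex
    cd /mnt/big/annex
    git annex init "BIG"
    cd /mnt/safe/annex
    git remote add big /mnt/big/annex

    # day-to-day use
    git annex add somefile
    git annex sync                     # keep the git metadata in sync
    git annex move somefile --to big   # flush the content to the other drive
    git annex get somefile             # fetch it back when needed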
No -- you combine the two repositories, so any clone of either one contains all the files in both. Other nodes then only need one repository. However, for another node to be able to get files from both repositories on this node, it will need to have two git remotes configured, one for each repository.
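On another node, that could look roughly like this (the hostname, repository description, and paths are placeholders):

    git clone ssh://node1/mnt/safe/annex annex
    cd annex
    git annex init "laptop"
    git remote add big ssh://node1/mnt/big/annex
    git annex sync
    git annex get somefile   # fetched from whichever remote holds the content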
It works fine. After it is set up with the client as described, sync is automatic from the Assistant.
What I find cumbersome is that I need to manually call "git annex sync" on the remote (USB or local ssh) to view (generate) the link. Is there a way to avoid this extra step?
Thanks joey, I thought about the object storage just yesterday and realized that the drive where it resides needs to be the big one.
I think that I will do it as you described with one addition:
I set up the git-annex repository on my big drive. Let's call that BIG. Then I set up another on the smaller drive (or RAID) and set both to sync. Let's call that RAID.
BUT: Can I use numcopies to tell BIG to only sync files in certain directories to RAID? Or will syncing sync everything regardless?
Thanks for your help
Yes, `git annex sync` only syncs the metadata.

If you have numcopies set to two, and run `git annex copy --to RAID --auto`, it will only copy files that have fewer than 2 copies. (Same with `git annex copy --from BIG --auto` or `git annex get --auto`.)
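A small sketch of how that could look for the "only certain directories" case, assuming a git-annex new enough to support the `git annex numcopies` command and the annex.numcopies attribute in .gitattributes (the directory name is made up):

    # default: one copy is enough, so --auto will not copy to RAID
    git annex numcopies 1

    # but require two copies of everything under pictures/
    echo '* annex.numcopies=2' > pictures/.gitattributes
    git add pictures/.gitattributes

    # now this only copies the files that still need a second copy
    git annex copy --to RAID --auto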
I believe I understood the earlier discussion about managing git-annex across multiple filesystems, but I'm facing a more extreme case and am hoping for some useful advice.
We are exploring the use of git-annex to manage the large boundary conditions used within our weather model. For the most part, this seems to be an excellent fit. But one of our goals is to limit data duplication among end users, and here there is a problem. Even in the era of multi-petabyte filesystems, needless duplication of data is a concern.
In our computing facility there are O(10-20) filesystems from which users run our model. If each user creates their own clone from some master repo on filesystem A, many (but by no means all) files will have multiple copies across the other filesystems. Worse, if multiple users running on filesystem X each cloned from A, we'd have multiple copies even on X, as they would not know about each other's local clone. The current ad hoc system (i.e., without git-annex) uses symbolic links to filesystem A and avoids the copies, but is otherwise quite limited in terms of managing variant input files and running in other computing environments.
One halfway solution would be to have a "master" repo on each filesystem and then ensure that users' repos have all of them configured as remotes. We'd still have multiple copies across filesystems, but at least there would then be at most one copy of any given file on each filesystem.
The other possibility is to somehow enforce a policy that all input files are to be accessed on filesystem A. This is what we'll probably end up doing. We'd kludge our existing scripts that set up symbolic links to point at clones on filesystem A.
I would very much like there to be another option.
Thanks in advance.
Hmm, this is a rather old forum thread to be reactivating; maybe start a new thread about your specific use case, Thomas?
Your master repository idea seems like a good one. If the master repository is cloned with `git clone --shared`, then that clone will hard link files between it and the master repository (assuming a git-annex rather newer than the start date of this forum thread!), so multiple repositories will only have one copy of the file. Of course, since it uses hard links, master and clone need to be on the same drive; a rough sketch follows below.

There are probably ways to improve git-annex to handle this kind of use case better. Maybe `git clone --shared` across filesystems should use symlinks rather than hard links, or something like that. That might take some time to design and implement (it changes a core invariant of git-annex, that .git/annex/objects/ contains files, not symlinks).

Happy to discuss other options.
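The sketch of the --shared approach on a single filesystem (paths are invented, and whether `git annex get` hard links here depends on the git-annex version):

    # per-user working clone on the same filesystem as the master
    git clone --shared /fsA/master /fsA/alice/run
    cd /fsA/alice/run
    git annex init "alice-run"
    git annex get inputs/somefile   # content hard linked from the master rather than copied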