Hi everyone,
I need some suggestions on how best to operate git-annex in my setup.
I need git-annex mainly for its ability to have directories of all my data on all my nodes, not for the data redundancy it can provide. I have one node that contains two filesystems that I want to merge into one git-annex repository. One filesystem (let's call it SAFE) is on top of a RAID1 between two 1TB HDs. The other (BIG) is on top of a 3TB HD. SAFE holds data I do not want to lose (like digital pictures). BIG holds data that I can lose.
I do not have enough disk space on other nodes to get rid of the RAID1.
This is how I mount my filesystems:
SAFE at ~/AllData/
BIG at ~/AllData/bigfiles/
The root of the git repository is at ~/AllData/. However, when I run `git annex add ~/AllData/bigfiles/file1`, it says: add bigfiles/file1 failed
I assume that is because file1 is on a different filesystem.
Do I have to create two repositories, one for each filesystem, or do you have any ideas on how to use git-annex best in this scenario? Having two repositories also has the disadvantage that I need two repositories on all other nodes, am I right?
Thanks for your suggestions
git-annex stores the contents of files inside `.git/annex/objects`. The `git annex add` is failing because it cannot `rename()` the file into that directory, because it is on a different filesystem. Even if it did a more expensive move of the file, it would not do what you want, because all the files would be moved to the `.git/annex/objects` directory, which is stored on your smaller drive.

The way git-annex is intended to be used with multiple drives is this:

- `git annex sync` to keep the git repositories in sync. (Or do it manually with `git pull`.)
- `git annex get $file` to get it.
- `git annex drop` or `git annex move` to flush files out to the other drive(s).

The walkthrough goes through an example of adding a removable USB drive this way, but you can do the same thing for non-removable drives.
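A minimal sketch of that workflow, assuming one repository per drive; the paths and remote names here are invented for illustration, not taken from the thread:

    # set up the first repository on the small (RAID) drive
    cd /mnt/safe/annex
    git init
    git annex init "SAFE"

    # clone it onto the big drive and link the two as remotes
    git clone /mnt/safe/annex /mnt/big/annex
    cd /mnt/big/annex
    git annex init "BIG"
    cd /mnt/safe/annex
    git remote add big /mnt/big/annex

    # day-to-day use
    git annex add somefile
    git annex sync                     # keep the git metadata in sync
    git annex move somefile --to big   # flush the content to the other drive
    git annex get somefile             # fetch it back when needed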
No -- you combine the two repositories, so any clone of either one contains all the files in both. Other nodes then only need one repository. However, for another node to be able to get files from both repositories on this node, it will need to have two git remotes configured, one for each repository.
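On another node, that could look roughly like this (the hostname, repository description, and paths are placeholders):

    git clone ssh://node1/mnt/safe/annex annex
    cd annex
    git annex init "laptop"
    git remote add big ssh://node1/mnt/big/annex
    git annex sync
    git annex get somefile   # fetched from whichever remote holds the content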
It works fine. After it is set up with the client as described, sync is automatic from the Assistant.
What I find cumbersome is that I need to manually call "git annex sync" on the remote (USB or local ssh) to view (generate) the link. Is there a way to avoid this extra step?
Thanks joey, I thought about the object storage just yesterday and realized that the drive where it resides needs to be the big one.
I think that I will do it as you described with one addition:
I set up the git-annex repository on my big drive. Let's call that BIG. Then I set up another on the smaller drive (or RAID) and set both to sync. Let's call that RAID.
BUT: Can I use numcopies to tell BIG to only sync files in certain directories to RAID? Or will syncing sync everything regardless?
Thanks for your help
Yes, `git annex sync` only syncs the metadata.

If you have numcopies set to two, and run `git annex copy --to RAID --auto`, it will only copy files that have fewer than 2 copies. (Same with `git annex copy --from BIG --auto` or `git annex get --auto`.)
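A small sketch of how that could look for the "only certain directories" case, assuming a git-annex new enough to support the `git annex numcopies` command and the annex.numcopies attribute in .gitattributes (the directory name is made up):

    # default: one copy is enough, so --auto will not copy to RAID
    git annex numcopies 1

    # but require two copies of everything under pictures/
    echo '* annex.numcopies=2' > pictures/.gitattributes
    git add pictures/.gitattributes

    # now this only copies the files that still need a second copy
    git annex copy --to RAID --auto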
I believe I understood the earlier discussion about managing git-annex across multiple filesystems, but I'm facing a more extreme case and am hoping for some useful advice.
We are exploring the use of git-annex to manage the large boundary conditions used within our weather model. For the most part, this seems to be an excellent fit. But one of our goals is to limit data duplication among end users, and here there is a problem. Even in the era of multi-petabyte filesystems, needless duplication of data is a concern.
In our computing facility there are O(10-20) filesystems from which users run our model. If each user creates their own clone from some master repo on filesystem A, many (but by no means all) files will have multiple copies across the other filesystems. Worse, if multiple users running on filesystem X each cloned from A, we'd have multiple copies even on X, as they would not know about each other's local clone. The current ad hoc system (i.e., without git-annex) uses symbolic links to filesystem A and avoids the copies, but is otherwise quite limited in terms of managing variant input files and running in other computing environments.
One halfway solution would be to have a "master" repo on each filesystem and then ensure that users' repos have all of them configured as remotes. We'd still have multiple copies across filesystems, but at least there would then be at most one copy of any given file on each filesystem.
The other possibility is to somehow enforce a policy that all input files are to be accessed on filesystem A. This is what we'll probably end up doing. We'd kludge our existing scripts that set up symbolic links to point at clones on filesystem A.
I would very much like there to be another option.
Thanks in advance.
Hmm, this is a rather old forum thread to be reactivating; maybe start a new thread about your specific use case, Thomas?
Your master repository idea seems like a good one. If the master repository is cloned with `git clone --shared`, then that clone will hard link files between it and the master repository (assuming a git-annex rather newer than the start date of this forum thread!), so multiple repositories will only have one copy of the file. Of course, since it uses hard links, master and clone need to be on the same drive; a rough sketch follows below.

There are probably ways to improve git-annex to handle this kind of use case better. Maybe `git clone --shared` across filesystems should use symlinks rather than hard links, or something like that. That might take some time to design and implement (it changes a core invariant of git-annex, that .git/annex/objects/ contains files, not symlinks).

Happy to discuss other options.
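The sketch of the --shared approach on a single filesystem (paths are invented, and whether `git annex get` hard links here depends on the git-annex version):

    # per-user working clone on the same filesystem as the master
    git clone --shared /fsA/master /fsA/alice/run
    cd /fsA/alice/run
    git annex init "alice-run"
    git annex get inputs/somefile   # content hard linked from the master rather than copied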