I've been synchronizing my data for a long time, mainly using rsync or unison. I had two 3.5 GB datasets: set1 (USB drive, HFS+ partition) and set2 (HDD, ext4, Ubuntu 13.04 box), which differed only by 50 MB (new files on set1). I double-checked this with diff -r before doing anything.
I created a git annex repo in direct mode for set2 from the command line and then let the assistant scan it. After that I created the repo for set1 and added it to the assistant as well.
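Roughly, the setup commands were along these lines (paths and repository descriptions are just illustrative):

    # set2 on the ubuntu box
    cd /data/set2
    git init
    git annex init "set2 - ubuntu box"
    git annex direct          # switch the repository to direct mode
    git annex assistant       # let the assistant scan and add the files

    # set1 on the usb drive, created afterwards
    cd /media/usb/set1
    git init
    git annex init "set1 - usb drive"
    git annex direct
    git annex assistant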
I think this is where my mistake comes in: instead of keeping them apart, I told the assistant to sync with set2. Why do I think this was a mistake? Because set2 was indexed and set1 was not, and I'm seeing a lot of file moving and copying, which in my humble opinion should not happen. What I expected was for only the difference to be transferred from set1 to set2. What it actually seems to be doing is moving all the content out of set1 and copying it back from set2. I think it will finish correctly, but with a lot of unnecessary and risky operations.
I think I should have added both datasets independently, let them be scanned, and only then connected them to each other. So, now the questions:
- Is that the correct way to proceed?
- What if I have two identical files with different modification times? I hope they are not synced, right?
- Is it possible to achieve the behaviour of copying only the 50 MB?
Thanks in advance and keep up the good work. Best regards, Juan
EDIT: a couple more questions:
- After finishing, set2 ended up with a lot of symlinks, but only in one subfolder. To prevent this, should I set numcopies to 2?
- This data is composed of input datasets and simulation outputs. I need to change them often, though not as often as code, and only partially (chunks of ~50 MB). For me direct mode is the best fit (or plain git). However, I was wondering: is it possible to drop some files (even in direct mode) and use symlinks instead?
I did something similar for my videos: I created the repo on one machine and added the video files, then cloned it on the other machine and reinjected the files into the cloned repo.
http://joeyh.name/blog/entry/moving_my_email_archives_and_packages_to_git-annex/
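Roughly, from memory, the workflow was something like this (host names, paths, and the file name are just examples):

    # on machine A: create the repo and add the videos
    cd ~/videos
    git init
    git annex init "machine A"
    git annex add .
    git commit -m "add videos"

    # on machine B: clone it, then reinject the copies already on disk
    git clone ssh://machineA/~/videos ~/videos
    cd ~/videos
    git annex init "machine B"
    # reinject moves the local copy into the annex as the content of the
    # corresponding annexed file, avoiding a re-transfer over the network
    git annex reinject /mnt/localcopy/movie.avi movie.avi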
If you need to preserve mtimes, or differentiate between files with identical content but different mtimes, neither git nor git-annex is going to do what you want, since git doesn't care about preserving much file metadata.
As far as importing two sets of files on two computers goes, the best thing to do is import each set locally, and then let the two repositories sync up. Otherwise, when you're running the assistant, it will start downloading the first set you imported to the second computer before the second set has been added there, and do extra work. Once the duplicate files from the second set land in the second git repository, though, the assistant will avoid any additional redundant transfers.
(The assistant never moves files, if both repositories are configured to be in the default client repository group. It only copies.)
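In command-line terms, that approach looks roughly like this (host names and paths are illustrative; with the assistant running, the content transfers happen on their own instead of via sync --content):

    # computer 1: import set1 into its own repository
    cd /data/set1
    git init
    git annex init "set1"
    git annex add .
    git commit -m "import set1"

    # computer 2: import set2 into its own repository
    cd /data/set2
    git init
    git annex init "set2"
    git annex add .
    git commit -m "import set2"

    # only then connect the two and sync; content that is already
    # present on both sides is not transferred again
    git remote add set1 ssh://computer1/data/set1
    git annex sync --content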
I don't understand question #1. "set2 ended with a lot of symlinks but only in one subfolder" doesn't make sense to me, or rather I could interpret it to mean any of a dozen things (none of which seems likely).
You can git annex drop files in direct mode. However, if you're running the assistant, it will try to get them back. You can configure your repository to be in manual mode to prevent the assistant from doing that, or not use the assistant, or configure a preferred content expression to make the assistant do something more custom, like not trying to get files located in an "olddata" directory.

Thanks. It is very clear now. I think I got it running: I have two direct mode copies, on my Ubuntu box and on the USB drive, and one indirect mode copy on my ultrabook (small SSD). What I meant is that even in direct mode, after the sync ended, the contents of one folder in the set I indexed first ended up inside the .git dir, with the files replaced by symlinks. But that might have been a leftover from previous attempts. I think I got confused by the great amount of flexibility it provides. Thanks.
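(For completeness, a rough command-line sketch of the options described in that answer; the file name and the "olddata" directory are just examples:)

    # drop a file's content locally (works in direct mode too, as long
    # as enough copies exist elsewhere)
    git annex drop bigfile.dat

    # put this repository in manual mode so the assistant stops
    # getting dropped content back
    git annex group . manual
    git annex wanted . standard

    # or use a custom preferred content expression instead, e.g. don't
    # want anything under an "olddata" directory
    git annex wanted . "exclude=olddata/*"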