Import existing files

I have not used git-annex before but I wonder whether this is for me.

I have some big existing directories, e.g. family pictures (several hundred GB) in ~/Pictures. I already have multiple copies of this pictures directory on multiple media (other (remote) servers, hard drives, some DVDs, etc).

I wonder about the recommended workflow now. In the documentation, it is explained how to create a new Git Annex Repo, where I would copy over the data. But I don't want to copy over the data. I want to keep them in ~/Pictures, and also make use of other existing copies (I cannot even modify some of them anymore, such as my readonly DVDs).

I thought that Git-Annex would just help me keep track of multiple copies. How would I import such a directory?

I read briefly about Git Worktree, and I wonder whether that is supposed to be for this use case?

Or maybe this should be a bare repo?

Or should I create the new Git Annex Repo directly in ~/Pictures? I.e. I would do cd ~/Pictures; git init; git annex init? How would I now add the other copies of Pictures? How would I deal with readonly copies of Pictures like DVDs?

Also, I don't just want to store the pictures but also other stuff (e.g. ~/Music). I'm not sure if I should create separate repos for that, or whether it makes more sense to keep them all in one big repo?

I read "how it works" and "workflow", but they do not really answer my questions.

comment 1

Hi,
I recommend moving ~/Pictures inside a new directory, so that it's at ~/annex/pictures/Pictures, and then initializing the annex in ~/annex/pictures (e.g. cd ~/annex/pictures; git init; git annex init; git annex add .; git annex sync).
For copies that are writable (e.g. an external HDD), you should clone that annex onto the drive, beside the existing data. Then you can use git annex reinject --known ../existing-Pictures/*.jpg (unfortunately git annex reinject doesn't work recursively) to move the files inside the annex. If git annex reinject leaves any files in ../existing-Pictures/, these files were/are not part of the original ~/Pictures and (because they are unknown) are not copied into the annex.
You can abuse directory special remotes to track copies on read-only media. E.g. git annex initremote DVD-01 type=directory encryption=none importtree=yes directory=/mnt/DVD-26/. Then you can import the DVD to a dummy branch, without copying the content: git annex import --no-content --from=DVD-01 dummy-DVD-01. This will still update the location log of files in the annex that are also on the DVD, so git annex then knows that there are copies of those files on the DVD.
IMHO it's a good idea to create separate repos, as adding too many files to a single repo can slow down git-annex.
You might also want to run git annex config --set annex.dotfiles true before adding any files or else dotfiles will be added to git directly.

Comment by Lukey — Tue Dec 29 23:30:46 2020

comment 2

Hi, thanks for the answer.

What if I want to leave ~/Pictures as-is, without moving or changing it? I would prefer that. I just want to add its content to a Git Annex repo, and easily sync future changes to the repo as well (e.g. after I add more files, or rename or update some files).

Why git annex sync and not git commit? I always did only git commit so far.

Why git annex reinject and not git annex import or cp|mv & git annex add? Also, why would I not add files which were/are not part of the original ~/Pictures? The original ~/Pictures would not have contained all of the pictures, as they are somewhat distributed. So I want to add unknown files as well.

Why would I import the DVD to a dummy branch? I would want it all in my master/main branch, or not? (I also don't quite understand why I would want branches at all?) I also potentially want to git annex get such a file at some point.

What counts as "too many files" for a single repo? And why is that a problem? I am just adding a Google Takeout archive to Git Annex (via), and it will also contain many of the files of ~/Pictures (although not all; and sometimes, but not always, in lower quality, but often also in original quality), but also many other files. So it's already pretty mixed up. Or does it make sense to just share the Annex object storage (.git/annex/objects) between multiple repos? Or is that actually the intended use case for branches?

What dotfiles does annex.dotfiles include? Just all .*? Why would I not want to add dotfiles? I think I would want to just archive the whole directory as-is.

Also, after reading a bit further, and trying it out a bit, I don't quite understand:

Given some file path (e.g. Picture/BestPics2020/a.jpg), how can I find other paths of the same file? (E.g. I would also have the file stored under Picture/2020/01/a.jpg or so.) Is that with git annex list? I'm not sure this lists all paths. So far I only see a single path always.

I'm not really sure how to use git annex import properly, in case the file is already annexed under a different path. In any case, I also want to add the new path (new name).

Sorry for the many follow-up questions, but this is still all somewhat unclear to me.

Comment by AlbertZeyer — Fri Jan 1 22:30:34 2021

comment 3

You can of course just use ~/Pictures directly as a repository. So cd ~/Pictures; git init; git annex init.

git annex sync does a bit more than just git commit. For example, it also automatically commits the deletion of files.

Sorry, I thought the existing copies of your photos were just backups of your ~/Pictures. In that case I suggest you mv the files into the annex and then just git annex add them. For DVDs, import to a sub-directory of your master branch instead of a dummy branch, and without the --no-content option.

"Too many files" depends on your liking. The more files there are, the slower some operations get, like git annex sync. I suggest you set something like git annex config --set annex.largefiles 'largerthan=32kb'. This way, small files get added to git itself instead of git-annex, which speeds up git-annex operations if there are a lot of small files. Note that these small files will be in every clone of the repo and can't be dropped with git annex drop.

The various configuration options are documented in the main git-annex manpage, at the bottom. Without the annex.dotfiles option, dotfiles (any file starting with "." and anything inside directories starting with ".") will still be added, but to git itself with the disadvantages mentioned above.

You can get the key/hash for that file with git annex info <file>, and then search for other files with the same content with find . -lname '*<key>'.
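The lookup described above can be sketched as a small shell function. This is only an illustration under some assumptions: the file is a locked annexed file (i.e. a symlink into .git/annex/objects), the key is taken from the last component of the symlink target instead of being copied from git annex info, and the helper name same_content_paths is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: list every symlink under ROOT that points at the
# same annex key as FILE. Assumes FILE is a locked (symlinked) annexed file.
same_content_paths() {
    file=$1
    root=${2:-.}
    # For a locked annexed file, the last path component of the symlink
    # target is the key.
    key=$(basename "$(readlink "$file")")
    # Not a symlink (e.g. an unlocked file or a file stored in git itself).
    [ -n "$key" ] || return 1
    # -lname matches against the symlink target, as in the find command above.
    find "$root" -type l -lname "*/$key" | sort
}

# Usage (paths taken from the question above):
#   same_content_paths Picture/BestPics2020/a.jpg .
```

Unlocked files are not symlinks, so this sketch would not find them.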

You can just cp/mv the files in the annex and git annex add them. Note that for duplicate files in the annex, only one copy of the data/file content will be stored.

Comment by Lukey — Sat Jan 2 15:05:01 2021

comment 4

So find is the best solution? There is no reverse index? I made a separate forum entry for this question here, to discuss that a bit more separately.

Why exactly does git annex sync (or other ops) get slower on bigger repos? In principle it could be implemented in a way that does not get slower (basically by never iterating through all objects, which should always be avoidable by keeping an index for any operation that needs one).

Does it make sense to split up the repo, but share the Git Annex object files (shared .git/annex/objects)?

Comment by AlbertZeyer — Mon Jan 4 12:04:04 2021

comment 5

I think you're really overcomplicating things. Some really basic use of git-annex as described in the walkthrough will work fine in the situation you describe. I.e., initialize a git-annex repository in ~/Pictures. If you have some other servers or hard drives that also have pictures, initialize git-annex repositories on those as well. Connect all these repositories that hold pictures together, by adding git remotes pointing to the other pictures repositories.

Then when you git push (or git-annex sync), git-annex will automatically learn if some picture is stored in multiple of the repositories. You'll be able to run commands like git-annex find --copies 2 or git-annex drop to operate on that information. Similarly, if Picture/BestPics2020/a.jpg and Picture/2020/01/a.jpg were the same content, git-annex will notice that when you add them to the annex, and will automatically deduplicate.
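To get a feel for how much duplication a plain directory holds before adding it, identical files can be grouped by checksum. This is a rough sketch and not how git-annex itself is implemented: it uses sha256sum from GNU coreutils as a stand-in for a content-derived key, the helper name duplicate_groups is made up for this example, and it does not handle file names containing newlines or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: print the paths of all files under DIR whose content
# is identical to that of at least one other file. Duplicates end up on
# adjacent lines because the list is sorted by checksum first.
duplicate_groups() {
    find "${1:-.}" -type f -exec sha256sum {} + | sort | awk '
        # sha256sum prints 64 hex digits, two separator characters, then the path.
        { hash = substr($0, 1, 64); path = substr($0, 67) }
        hash == prev { if (!shown) print prevpath; print path; shown = 1 }
        hash != prev { shown = 0 }
        { prev = hash; prevpath = path }'
}

# Usage:
#   duplicate_groups ~/Pictures
```

Run it on a plain directory rather than inside an existing repository, since it would also walk into .git.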

If you have readonly DVDs or whatever, yes those can be handled in ways like Lukey describes, but why bother trying to deal with all those edge cases before you're using git-annex at all?

As far as too many files, git has issues with the index file becoming slower with more files, but you need huge numbers of files for this to be a significant problem -- think millions. git-annex commands that need to operate on all files necessarily take longer when there are more files, but git-annex always lets you only operate on a subset of files, such as the ones in the current directory, so this is not a significant scalability problem. Worrying about speed before something is slow is a kind of premature optimisation; git-annex has actually been optimised in cases where it was slow.

Comment by joey — Mon Jan 4 17:17:08 2021

comment 6

Thanks for the answer.

Maybe the forum post title here was chosen badly. It's not just about how to import existing files, but also/mainly I was trying to figure out whether Git Annex fits my needs (for a quite big archive of data). That's why I had all these questions. Also because this was not exactly clear to me after reading the docs.

What's still not exactly clear to me is whether it would be a better idea to keep the Annex repo separate from the checked out files. I don't like all the symlinks too much, and a couple of applications behave strangely (because they follow the symlinks). I would prefer a solution where the (maybe bare) repo is separate from the checked out tree.

That is why I asked about Git Worktree. But this is still not clear to me.

I also read about Git Annex Direct mode, which sounds like it is exactly that? But apparently this is not supported anymore? Why?

I also read about the Git Annex Assistant, which also sounds like this? But the docs are somewhat sparse, and it's not totally clear how this is done, or why the main Git Annex cannot do it while the Git Annex Assistant can. But discussions like this sound very relevant (that one describes many of the issues I have with symlinks). But I would not specifically want to do it all automatically (I think that's the purpose of the assistant) but do it explicitly (like adding files to the annex, i.e. using the commands git annex add etc).

I think this should be possible without having to watch live for changes (via inotify or so) (where it anyway would be easy to miss changes). E.g. git status seems to be very fast at such checks. I'm not exactly sure how it does it but I assume it does some fast checks for changed mtime or maybe other things. Some filesystems might also provide other means. E.g. if the file was copied with a reflink (cp --reflink) (which anyway makes sense to not store the data twice, and which is much more efficient), it could check whether the reflink has changed. Or otherwise using hardlinks and locking the files (readonly), and unlocking them would make them writeable (that's ok if unlocked files are less efficient to handle, as this would be a rare action).
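The mtime idea can be illustrated with a stamp file and find -newer. This is only a sketch of that idea, not how git or git-annex detect changes (git compares file stat data cached in its index), and the helper name changed_since and the stamp file location are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: list regular files under ROOT that were modified
# more recently than the STAMP file. STAMP should live outside ROOT.
changed_since() {
    root=$1
    stamp=$2
    find "$root" -type f -newer "$stamp" | sort
}

# Usage: after each explicit scan, refresh the stamp so that the next scan
# only reports files modified after this point.
#   changed_since ~/Pictures ~/.pictures-last-scan
#   touch ~/.pictures-last-scan
```

A pure mtime check misses edits that preserve the modification time, as well as renames and deletions, so it could only ever be a first, cheap filter.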

Comment by AlbertZeyer — Wed Jan 6 14:59:00 2021

Last edited Tue Mar 2 22:09:45 2021