I have not used git-annex before but I wonder whether this is for me.
I have some big existing directories, e.g. with family pictures (several 100 GB), e.g. in `~/Pictures`. I already have multiple copies of this pictures directory on multiple media (other (remote) servers, hard drives, some DVDs, etc.).
I wonder about the recommended workflow now. In the documentation, it is explained how to create a new Git Annex repo, where I would copy over the data. But I don't want to copy over the data. I want to keep it in `~/Pictures`, and also make use of other existing copies (I cannot even modify some of them anymore, such as my read-only DVDs).
I thought that Git-Annex would just help me keep track of multiple copies. How would I import such a directory?
I read briefly about Git Worktree, and I wonder whether that is supposed to be for this use case?
Or maybe this should be a bare repo?
Or should I create the new Git Annex repo directly in `~/Pictures`? I.e. I would do `cd ~/Pictures; git init; git annex init`? How would I now add the other copies of `Pictures`? How would I deal with read-only copies of `Pictures`, like DVDs?
Also, I don't just want to store the pictures but also other stuff (e.g. `~/Music`). I'm not sure if I should create separate repos for that, or whether it makes more sense to keep them all in one big repo?
I read the "how it works" and "workflow" pages, but they do not really answer my questions.
Hi,
I recommend moving `~/Pictures` inside a new directory, so that it's at `~/annex/pictures/Pictures`, then initializing the annex in `~/annex/pictures` (e.g. `cd ~/annex/pictures; git init; git annex init; git annex add .; git annex sync`).
For copies that are writable (e.g. an external HDD), you should clone that annex onto the drive, beside the existing data. Then you can use `git annex reinject --known ../existing-Pictures/*.jpg` (unfortunately `git annex reinject` doesn't work recursively) to move them into the annex. If `git annex reinject` leaves any files in `../existing-Pictures/`, these files were/are not part of the original `~/Pictures` and (because they are unknown) are not copied into the annex.
You can abuse directory special remotes to track copies on read-only media, e.g. `git annex initremote DVD-01 type=directory encryption=none importtree=yes directory=/mnt/DVD-26/`. Then you can import the DVD to a dummy branch, without copying the content: `git annex import --no-content --from=DVD-01 dummy-DVD-01`. This will still update the location log of files in the annex that are also on the DVD, so git annex now knows that there are copies of those files on the DVD.
IMHO it's a good idea to create separate repos, as adding too many files to a single repo can slow down git-annex.
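As a rough sketch of the read-only-media approach above, the following script uses a scratch directory as a stand-in for the mounted DVD. The temporary paths, the throwaway commit identity, and the repository description `laptop` are assumptions made for illustration; the `initremote` and `import` invocations are the ones quoted above. It needs a git-annex version that supports `import --no-content`.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway identity so commits work in a scratch environment (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.invalid"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.invalid"

work=$(mktemp -d)
mkdir -p "$work/dvd" "$work/pictures"
printf 'picture data\n' > "$work/dvd/a.jpg"     # file on the stand-in "DVD"
cp "$work/dvd/a.jpg" "$work/pictures/a.jpg"     # local copy of the same file

if command -v git-annex >/dev/null 2>&1; then
    cd "$work/pictures"
    git init -q
    git annex init "laptop"
    git annex add .
    git commit -q -m "add pictures"

    # Register the DVD directory as an importable directory special remote.
    git annex initremote DVD-01 type=directory encryption=none \
        importtree=yes directory="$work/dvd"

    # Import the tree without copying content; the imported tree ends up on
    # the remote tracking branch DVD-01/dummy-DVD-01.
    git annex import --no-content --from=DVD-01 dummy-DVD-01

    # Show which repositories and remotes are recorded as holding the file.
    git annex whereis a.jpg
else
    echo "git-annex not found; skipping the annex steps" >&2
fi
```

For a real disc, `directory=` would point at the mount point instead of the scratch directory.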
You might also want to run `git annex config --set annex.dotfiles true` before adding any files, or else dotfiles will be added to git directly.
Hi, thanks for the answer.
What if I wanted to leave `~/Pictures` as-is, and neither move it nor change it? I would prefer that. I just want to add its content to a Git Annex repo, and easily sync future changes to the repo as well (e.g. after I have added more files, or renamed or updated some files).
Why `git annex sync` and not `git commit`? I have only ever used `git commit` so far.
Why `git annex reinject` and not `git annex import` or `cp`/`mv` & `git annex add`? Also, why would I not add files which were/are not part of the original `~/Pictures`? The original `~/Pictures` would not have contained all of the pictures, as they are somewhat distributed. So I want to add unknown files as well.
Why would I import the DVD to a dummy branch? Wouldn't I want it all in my master/main branch? (I also don't quite understand why I would want branches at all.) I also potentially want to `git annex get` such a file at some point.
What counts as "too many files" for a single repo? And why is that a problem? I am just adding a Google Takeout archive to Git Annex (via), and it will also contain many of the files of `~/Pictures` (although not all; sometimes, but not always, in lower quality, but often also in original quality), as well as many other files. So it's already pretty mixed up. Or does it make sense to just share the Annex object storage (`.git/annex/objects`) between multiple repos? Or is that actually what you mean as the intended use case for branches?
Which dotfiles does `annex.dotfiles` cover? Just all `.*`? Why would I not want to add dotfiles? I think I would want to just archive the whole directory as-is.
Also, after reading a bit further, and trying it out a bit, I don't quite understand:
Given some file path (e.g. `Picture/BestPics2020/a.jpg`), how can I find other paths of the same file? (E.g. I would also have the file stored under `Picture/2020/01/a.jpg` or so.) Is that with `git annex list`? I'm not sure this lists all paths. So far I always see only a single path.
I'm not really sure how to use `git annex import` properly in case the file is already annexed under a different path. In any case, I also want to add the new path (new name).
Sorry for the many follow-up questions, but this is still all somewhat unclear to me.
You can of course just use `~/Pictures` directly as a repository. So `cd ~/Pictures; git init; git annex init`.
`git annex sync` does a little more than just `git commit`. For example, it also automatically commits deletions of files.
Sorry, I thought the existing copies of your photos were just backups of your `~/Pictures`. In that case I suggest you `mv` the files into the annex and then just `git annex add` them. For DVDs, import to a sub-directory of your master branch instead of a dummy branch, and without the `--no-content` option.
"Too many files" depends on your liking. The more files there are, the slower some operations get, like `git annex sync`. I suggest setting something like `git annex config --set annex.largefiles 'largerthan=32kb'`. This way, small files get added to git itself instead of git-annex, which speeds up git-annex operations if there are a lot of small files. Note that these small files will be in every clone of the repo and can't be dropped with `git annex drop`.
The various configuration options are documented in the main git-annex manpage, at the bottom. Without the `annex.dotfiles` option, dotfiles (any file starting with "." and anything inside directories starting with ".") will still be added, but to git itself, with the disadvantages mentioned above.
You can get the key/hash for that file with `git annex info <file>`, and then search for other files with the same content with `find . -lname '*<key>'`.
You can just `cp`/`mv` the files into the annex and `git annex add` them. Note that for duplicate files in the annex, only one copy of the data/file content will be stored.
Is that the best solution, with `find`? Is there no reverse index? I made a separate forum entry for this question here, to discuss it a bit more separately.
Why exactly does `git annex sync` (or other operations) get slower on bigger repos? In principle it could be implemented in a way that does not get slower (basically by always avoiding any need to iterate through all objects, which should always be possible with indices for any operations that need them).
Does it make sense to split up the repo, but share the Git Annex object files (a shared `.git/annex/objects`)?
I think you're really overcomplicating things. Some really basic use of git-annex as described in the walkthrough will work fine in the situation you describe. I.e., initialize a git-annex repository in `~/Pictures`. If you have some other servers or hard drives that also have pictures, initialize git-annex repositories on those as well. Connect these repositories that all hold pictures together by adding git remotes pointing to the other pictures repositories.
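The setup described here might look roughly like the following. This is a sketch under assumptions: two scratch directories stand in for `~/Pictures` and a copy on an external drive, `init_repo` is a hypothetical helper, and the repository and remote names are made up. It only fetches from the other repository rather than merging the two independently created histories, which is a separate decision.

```shell
#!/usr/bin/env bash
set -eu

# Throwaway identity so commits work in a scratch environment (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.invalid"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.invalid"

work=$(mktemp -d)
# Two pre-existing copies of the same pictures directory.
mkdir -p "$work/laptop/2020" "$work/drive/2020"
printf 'picture data\n' > "$work/laptop/2020/a.jpg"
cp "$work/laptop/2020/a.jpg" "$work/drive/2020/a.jpg"

# Hypothetical helper: turn an existing directory ($1) into a git-annex
# repository with description $2, adding whatever files are already there.
init_repo() {
    ( cd "$1" && git init -q && git annex init "$2" \
        && git annex add . && git commit -q -m "add existing pictures" )
}

if command -v git-annex >/dev/null 2>&1; then
    init_repo "$work/laptop" "laptop"
    init_repo "$work/drive" "drive"

    cd "$work/laptop"
    git remote add drive "$work/drive"
    # Fetching brings in the drive's git-annex branch, which carries its
    # location tracking information.
    git fetch -q drive

    # Ask where copies of a file are recorded, and which files are
    # recorded as having at least two copies.
    git annex whereis 2020/a.jpg
    git annex find --copies 2
else
    echo "git-annex not found; skipping the annex steps" >&2
fi
```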
Then when you `git push` (or `git-annex sync`), git-annex will automatically learn if some picture is stored in several of the repositories. You'll be able to run commands like `git-annex find --copies 2` or `git-annex drop` to operate on that information. Similarly, if `Picture/BestPics2020/a.jpg` and `Picture/2020/01/a.jpg` have the same content, git-annex will notice that when you add them to the annex, and will automatically deduplicate.
If you have read-only DVDs or whatever, yes, those can be handled in ways like Lukey describes, but why bother trying to deal with all those edge cases before you're using git-annex at all?
As far as too many files, git has issues with the index file becoming slower with more files, but you need huge numbers of files for this to be a significant problem -- think millions. git-annex commands that need to operate on all files necessarily take longer when there are more files, but git-annex always lets you only operate on a subset of files, such as the ones in the current directory, so this is not a significant scalability problem. Worrying about speed before something is slow is a kind of premature optimisation; git-annex has actually been optimised in cases where it was slow.
Thanks for the answer.
Maybe the forum post title here was chosen badly. It's not just about how to import existing files, but also/mainly I was trying to figure out whether Git Annex fits my needs (for a quite big archive of data). That's why I had all these questions. Also because this was not exactly clear to me after reading the docs.
What's still not exactly clear to me is whether it would be a better idea to keep the Annex repo separate from the checked-out files. I don't like all the symlinks too much, and a couple of applications behave strangely (because they follow the symlinks). I would prefer a solution where the (maybe bare) repo is separate from the checked-out tree.
That is why I asked about Git Worktree. But this is still not clear to me.
I also read about Git Annex Direct mode, which sounds like it is exactly that? But apparently this is not supported anymore? Why?
I also read about the Git Annex Assistant, which also sounds like this? But the docs are somewhat sparse, and it's not totally clear how this is done, and why the main Git Annex cannot do that while the Git Annex Assistant can. But discussions like this sound very relevant (that one describes many of the issues I have with symlinks). But I would not specifically want to do it all automatically (I think that's the purpose of the assistant) but rather explicitly (like adding files to the annex, i.e. using the commands `git annex add` etc.).
I think this should be possible without having to watch live for changes (via inotify or so), where it would be easy to miss changes anyway. E.g. `git status` seems to be very fast at such checks. I'm not exactly sure how it does it, but I assume it does some fast checks for changed mtime or maybe other things. Some filesystems might also provide other means. E.g. if the file was copied with a reflink (`cp --reflink`) (which makes sense anyway, to avoid storing the data twice, and which is much more efficient), it could check whether the reflink has changed. Or otherwise it could use hardlinks and lock the files (read-only), and unlocking them would make them writeable (it's OK if unlocked files are less efficient to handle, as this would be a rare action).
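The mtime-style check guessed at above can be sketched with plain shell tools. This only illustrates the idea and is not a description of how git or git-annex implement change detection; the file names and the `.last-scan` marker are made up, and `touch -d` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
cd "$work"
mkdir pics

# Two files last modified before the previous scan...
touch -d '2020-01-01 12:00:00' pics/old1.jpg pics/old2.jpg
# ...a marker file recording when that scan happened...
touch -d '2020-06-01 12:00:00' .last-scan
# ...and one file modified afterwards.
touch -d '2021-01-01 12:00:00' pics/new.jpg

# Only files with an mtime newer than the marker need to be examined again.
changed=$(find pics -type f -newer .last-scan | sort)
echo "$changed"    # prints: pics/new.jpg
```

A check like this misses modifications that preserve the mtime, so it is a fast first filter rather than a substitute for comparing content hashes.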