forum/Managing a large number of files archived on many pieces of read-only medium (E.G. DVDs)git-annexhttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/git-annexikiwiki2021-10-23T15:53:57ZHave you seen the dar utility?http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_1_25e65ee3949e7d918376298cf11585f2/John2013-11-27T22:47:37Z2012-10-20T19:03:37Z
<p>http://dar.linux.free.fr/doc/index.html</p>
<p>It would be nice to have this as another remote option for git-annex, since I too would like static (and possibly incrementally extended) remotes that span multiple DVDs.</p>
"free-form" special remote / dar utilityhttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_2_8a71ca048f9de29a198a6afb17d5315e/Steve2013-11-27T22:47:37Z2012-10-20T22:11:23Z
<p>dar looks familiar, I'm sure I have run across it in the past. However, it is not suitable in this case; see requirement #3 above that the DVD-Rs be usable without git or git-annex.</p>
<p>What would work would be some sort of special remote that allows free-form data. Imagine that you create the DVD-R with the files on it, then you mount it and add the mount directory as a free-form special remote. git-annex checksums all the files under the specified directory and stores the relative path to each file somewhere. Then, when you want to fetch a specific hash from the remote, it looks up the relative path, appends it to the base directory and transfers the file into the local .git/annex/objects/ store.</p>
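<p>A rough sketch of that index idea in plain shell, outside git-annex. The helper names <code>build_index</code> and <code>lookup_path</code> are hypothetical, using SHA-256 via <code>sha256sum</code> is an assumption about the backend, and file names containing newlines or backslashes are not handled.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not part of git-annex.

# build_index DIR INDEX
# Record one "<sha256>  <relative path>" line per file under DIR.
build_index() {
    local dir=$1 index=$2
    ( cd "$dir" && find . -type f -exec sha256sum {} + ) | sort > "$index"
}

# lookup_path INDEX HASH
# Print the relative path recorded for HASH, or nothing if it is unknown.
lookup_path() {
    local index=$1 hash=$2
    awk -v h="$hash" '
        # Appending "" forces a string comparison of the two hashes.
        ($1 "") == (h "") { sub(/^[^ ]+ +/, ""); print; exit }
    ' "$index"
}
```

<p>With the disc mounted, something like <code>build_index /mnt/cdrom ~/index/dvd13.sha256</code> would be run once, and a later fetch would join the mount point with whatever <code>lookup_path</code> prints.</p>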
Yes, I agree, such a special remote for free-form read-only media would be convenient.http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_3_e3d1d3a3d3d831432ec940a8ab6f31e9/imz [lj.rossia.org]2013-11-27T22:47:37Z2012-10-20T23:58:45Z
<p>I have already stored a lot of large files on DVDs. I did that for archiving, so I made sure there are several copies. But I want this to be more automated.</p>
<p>I take my disc (or one created by someone else, without any knowledge of Git), checksum its contents in git-annex, and in the projects where I'm using this content, I can check that the file is archived on at least N discs.</p>
<p>Also, I might enhance the content -- this would be reflected in a Git commit, so I also want to be able to check that the new version has been archived on several discs.</p>
<p>A special remote for such free-form read-only media would be very convenient.</p>
Some free-form remote ideashttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_4_26a33eae98b4faaf6baf6635e3d28a8f/Steve2013-11-27T22:47:37Z2012-10-21T02:07:40Z
<p>This is starting to get interesting. A free-form remote would definitely simplify my use case, and also solve the "future goal" of easily incorporating my already existing DVD-Rs.</p>
<p>I haven't really looked into the git-annex internals up to this point, but looking at the <a href="http://git-annex.branchable.com/special_remotes/hook/">hook</a> page there doesn't seem to be a hook for init which would be needed to populate git-annex's index of files in the remote. (git-annex seems to assume that new special remotes are empty)</p>
<p>Another problem is where to store the hash to path relation information. On a RW remote it would be stored in the remote, but here we need to keep it in the repo somehow. This could be in the git-annex branch, or possibly another branch created specifically for this purpose.</p>
<p>1) initremote needs to:</p>
<ul>
<li>hash the contents of all the remote's files</li>
<li>update git-annex's index of the remote's contents</li>
<li>store the paths to the hashes in the repo</li>
</ul>
<p>2) store and remove should just fail.</p>
<p>3) retrieve and check present seem straightforward.</p>
<p>The assistant blog mentions adding support for read only remotes but I don't know anything about it: <a href="http://git-annex.branchable.com/design/assistant/blog/day_65__transfer_polish/">day 65 transfer polish</a> (I'm still on 3.20120605)</p>
<p>Let me know if there is anything I haven't thought of yet.</p>
comment 5http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_5_49ac298d39c824b0e52a239961463e09/joeyh.name2013-11-27T22:47:37Z2012-10-21T05:36:36Z
<p>I encourage playing around with the hook special remote and see how far you can make it go.</p>
<p>I may be doing something vaguely like this for <a href="http://git-annex.branchable.com/design/assistant/desymlink/">desymlink</a>, although I'm pretty sure it would still have a git repository associated with the directory of regular files.</p>
<p>One option is to use the web special remote, with file:// urls. Assuming a given disc will always end up mounted somewhere stable, such as /media/dvd1, /media/dvd2, etc, you could then just <code>git annex addurl file:///media/dvd1/$file</code>. <code>git annex whereis</code> will show the url, which has enough info to work out the disk to mount.</p>
<p>The web special remote did not support file:// urls, but I've just fixed that. The only downside is that, while it will identify files duplicated across disks, and <code>whereis</code> will show multiple urls for such files, there's only one web special remote, and so it only counts as 1 copy. This could perhaps be improved; git-annex may eventually get support for remotes reporting how many copies of a file they contain.</p>
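<p>For a whole disc, the <code>addurl</code> step could be generated with a small loop. This is only a sketch: <code>print_addurl_commands</code> is a hypothetical helper that prints the commands for review instead of running them, it assumes an absolute mount path, GNU <code>sort -z</code> and bash, and it does no percent-encoding, so paths with spaces or other special characters would need extra care.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; prints commands, runs nothing.

# print_addurl_commands MOUNT
# Emit one "git annex addurl" line per regular file under MOUNT, keeping
# the file's path relative to the mount point as the name in the annex.
print_addurl_commands() {
    local mount=${1%/} path rel
    find "$mount" -type f -print0 | sort -z |
    while IFS= read -r -d '' path; do
        rel=${path#"$mount"/}
        # %q shell-quotes each argument so the output can be pasted safely.
        printf 'git annex addurl --file=%q %q\n' "$rel" "file://$path"
    done
}
```

<p>Running <code>print_addurl_commands /media/dvd1</code> inside the repository and piping the reviewed output to <code>sh</code> would then register every file on that disc.</p>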
It works!http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_6_55a4a3616ea59654da1c2f9902561e3b/openid [www.openid.albertlash.com]2013-11-27T22:47:37Z2012-10-24T22:00:31Z
<p>This works great! I first tried it with WORM, no-go. I can see why the SHA backends are so powerful, they appear to circumvent the commits which git usually uses for merging. When I first do the merge, it reports this:</p>
<p>warning: no common commits</p>
<p>Compared to how I've managed CD/DVD backups in the past, this is a quantum leap forward, and I don't find it convoluted in comparison. Yes, there is dar, but I prefer this method. In my case, it's the perfect solution for original files, which are generally treated as immutable and not accessed very often. They are usually large, too! I'm using them for digital pictures.</p>
web and hook special remoteshttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_7_92a2af3e0e328bb48bcc67a69187ee57/Steve2013-11-27T22:47:37Z2012-10-24T23:26:53Z
<p>Hi Joey,</p>
<p>Thanks for the advice. I had thought of the web special remote, but as you may have noticed from my example, I don't use automount, so my DVDs and CDs all get mounted in the same place (/mnt/cdrom). That means the web special remote won't work for me.</p>
<p>I'll try to play around with the hook special remote this weekend. I had a thought it might be interesting to have it search for the DVDs in some common places or even by parsing the mounted file systems, and allow an override or augmentation through git config.</p>
no need to mergehttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_8_f6e39e71882d55cdc061166aea3e2bd3/Steve2013-11-27T22:47:37Z2012-10-24T23:52:30Z
<p>Albert,</p>
<p>Thanks for the feedback! I'm glad that somebody else found the method I worked out useful. As I'm going to try and turn it into a proper special remote, let me know if there is any particular use case or feature you'd like me to address.</p>
<p>Note that in my testing, I found that you don't actually need to merge the DVD's branch into the local branch you are using for git annex to be able to find the files on it that are identical to files in your local branch.</p>
<p>I haven't played around with cloning the repo, but I will try that this weekend. I'm thinking it <em>might</em> be necessary to create local branches from the DVD remotes so that they'll get carried along when you clone the repo.</p>
<p>As far as the repos on the DVDs not having a shared ancestry with the main repo, that was a conscious choice that I made. I wanted to add as little extra data to the DVDs as possible since I usually fill them to the brim anyway. I didn't feel that it would be beneficial for the DVDs to know about the history of the main repo and other files that they don't contain. Furthermore, besides all the links and history, you'd be replicating all the files in the main repo that aren't annexed.</p>
<p>If you want to avoid the error, but still have a local branch for the DVD repos you should be able to do something like the following:</p>
<p><b>WARNING:</b> these commands are untested!</p>
<pre>
git checkout -b disc1 disc1/master
git checkout -b disc2 disc2/master
</pre>
<p>Working from the original example, you should then get local branches for the DVDs that don't have a common ancestor with your master local repo. I haven't actually tested that though. Testing will have to wait for this weekend.</p>
comment 9http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_9_6c45a6264d69e22800c329a0f8a2d470/joeyh.name2013-11-27T22:47:37Z2012-10-25T03:33:29Z
@Steve, it seems to me you could still use the web special remote, just pointing it at a URL that goes through a symlink to the mount point.
Web remote workshttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_10_a061d300b718ad943c940e122cc57220/Steve2013-11-27T22:47:37Z2012-11-25T22:30:46Z
<p>Thanks for the suggestion Joey, I found a way to make the web remotes work for adding the files from existing discs. I wound up adding a symlink farm to the repo with a link for each disk pointing at the mount point. This way when I try to retrieve a file, I see the URL which contains the name of the disc:</p>
<pre><code>$ git annex get bigfile.bin
get bigfile.bin (from web...)
curl: (37) Couldn't open file /var/tmp/repo/storage/dvd13/bigfile.bin
Unable to access these remotes: web
Try making some of these repositories available:
00000000-0000-0000-0000-000000000001 -- web
failed
git-annex: get: 1 failed
</code></pre>
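<p>The symlink farm itself can be kept up to date with a one-line helper. This is a sketch only: <code>link_disc</code> is a hypothetical name, and the <code>storage/dvd13</code> layout is inferred from the path in the error message above.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only.

# link_disc FARM NAME MOUNT
# Create (or repoint) FARM/NAME as a symlink to MOUNT, so file:// URLs
# can name the disc even though every disc shares one mount point.
link_disc() {
    local farm=$1 name=$2 mount=$3
    mkdir -p "$farm"
    # -n replaces an existing link instead of descending into its target.
    ln -sfn "$mount" "$farm/$name"
}
```

<p>For example, <code>link_disc /var/tmp/repo/storage dvd13 /mnt/cdrom</code> followed by <code>git annex addurl file:///var/tmp/repo/storage/dvd13/bigfile.bin</code> gives a URL that tells you which disc to insert when a later <code>get</code> fails.</p>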
<p>It took me a while as the version of git-annex in portage was rather old and I just didn't get around to updating git-annex on Gentoo for a while. If anybody wants to get git-annex 3.20121112 running under Gentoo, I detailed the process I used at <a href="http://git-annex.mysteryvortex.com">http://git-annex.mysteryvortex.com</a></p>
<p>Now I'll have to try out the assistant! (Though I didn't get the webapp to compile due to a shakespeare-js error)</p>
Bad ebuildshttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_11_76529080054407570611b4357ce4f3ed/Steve2013-11-27T22:47:37Z2012-12-01T02:45:37Z
Just an update for anybody that used the ebuilds I created from the link above. They did not create the git-annex-shell symlink which can cause git-annex to start ignoring your remotes. More details and fixed ebuilds are now on the page.
comment 12http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_12_9acf5ce41a023f3848a51891cceeb51b/arand2013-11-27T22:47:37Z2013-03-11T10:34:42Z
<p>Without having read this, I've reported a very similar wishlist item at:
<a href="http://git-annex.branchable.com/todo/wishlist:_recursive_directory_remote_setup__47__addurl">http://git-annex.branchable.com/todo/wishlist:_recursive_directory_remote_setup__47__addurl</a></p>
<p>combining a recursive addurl (in my case using --fast) script with the suggestions regarding symlinks here, it's somewhat workable:</p>
<pre><code>ln -s /media/cdrom /var/tmp/mycdrom123
~/utv/scripts/annex-importdir /var/tmp/mycdrom123
</code></pre>
<p>Ideally though, for optical media it would have a couple of more features (some already noted above):</p>
<ul>
<li>Ability to form a (reasonably) unique identifier from a disc, using the label and the date of creation
<ul>
<li>Ability for Annex to identify discs using this and ask for the correct disc if the file does not match (accommodating RW discs where label and date might change, or simply disc copies)
<ul>
<li>Example: <code>not the original disc... trying anyway... file hash mismatch... please enable the remote disc with "MYLABEL" and creation date "2001-01-01"</code></li>
</ul>
</li>
</ul>
</li>
<li>Option to checksum without importing the actual objects into the annex</li>
</ul>
Not a priority in itself, still feels like a missing piece.http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_13_0a343a8fdad66765371ca22581b35b84/stephane-gourichon-lpad2016-10-26T12:29:50Z2016-10-26T12:29:50Z
<blockquote><p>I have a large number of files that are accessed infrequently and stored off-line on DVD-Rs. I need to keep track of which files are on which disc so that when I want a file I can find it.
(...)
4) Easily incorporate the current DVD-Rs into the new system</p></blockquote>
<p>This last item would make <code>git-annex</code> suitable to catalog existing WORM media.</p>
<p>In the past I have used some programs but was never satisfied with their graphical-UI-first approach or closed format. For example: gtktalog, cdcat, cdcollect, where is it, virtual volume view, gnome catalog, basenji. Ref: https://alternativeto.net/software/cdcollect/?platform=linux .</p>
<p>I also used at some point a plain old <code>find|{stat;md5}|gzip > ~/catalogs/my_volume_id.gz</code> then <code>zgrep mystring ~/catalogs/*gz</code> which, at the end of the day, has an overall good cost/benefit ratio.</p>
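<p>Spelled out as something runnable, that plain-text catalogue might look like the sketch below. The function names are hypothetical, GNU <code>md5sum</code> and <code>zgrep</code> are assumed, and only the checksum and path are recorded (the <code>stat</code> part of the original pipeline is left out).</p>

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only.

# make_catalog MOUNT OUT
# Write a gzip-compressed list of "<md5>  <relative path>" lines for MOUNT.
make_catalog() {
    local mount=$1 out=$2
    ( cd "$mount" && find . -type f -exec md5sum {} + ) | sort -k2 | gzip > "$out"
}

# search_catalogs PATTERN DIR
# Search every catalogue in DIR without unpacking it first.
search_catalogs() {
    local pattern=$1 dir=$2
    zgrep -e "$pattern" "$dir"/*.gz
}
```

<p>Usage would be along the lines of <code>make_catalog /mnt/cdrom ~/catalogs/my_volume_id.gz</code> per disc, then <code>search_catalogs mystring ~/catalogs</code>.</p>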
<p>IMHO git-annex has a sane foundation and the potential to do better than those tools.</p>
<p>Technically this does look similar to a web special remote, but it needs to accommodate arbitrary mount points and keep count of copies.</p>
<p>To be honest, the DVD use case is not a priority for me at the moment, but it feels like a missing piece in an otherwise good puzzle. As if handling this case nicely would actually benefit other, more modern use cases.</p>
comment 14http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_14_b315c50d796d73bfc35acebee1836f40/username2021-09-28T22:56:16Z2021-09-28T22:56:16Z
<p>Inspired by the OP I devised a slightly modified method for cataloguing my optical media library.</p>
<p>The goal is to have a main git-annex repo tracking what's on HDD, while adding the optical media as one-time remotes for location tracking and easy future inspection of their contents without having to physically insert the disc in the drive and mount it.</p>
<p>The current contents of the HDD and the contents of the total set of the optical discs are not a 1:1 match; both have files not present in the other.</p>
<p>Current plan:</p>
<ol>
<li><p>copy already burnt discs to a directory on disk and create a git-annex repo with the same configuration as the main catalogue repo, using the disc label as description:
<code>
git annex init disc_label && git annex config --set annex.dotfiles true && git config annex.backend MD5 && git config annex.gitaddtoannex false
</code></p></li>
<li><p>add the optical disc repo as remote in the main repo:
<code>
git remote add CD1 /mnt/CD1/ && git annex sync --no-push --jobs=cpus && git annex sync --cleanup && git remote remove CD1
</code></p></li>
<li><p>create a branch for easily "mounting" each disc with git checkout without needing to
<code>git log --oneline | grep disc_label</code>
for finding the relevant commit hash:
<code>
git branch CD1 XXXXXXXXXX
</code></p></li>
<li><p>edit working tree as desired, then override git-annex sync's automatic commit:
<code>
git add . && git commit --quiet --amend --message="add CD1"
</code></p></li>
<li><p>repeat for other discs</p></li>
</ol>
<p>The main repo should end up looking like so, each optical disc "dangling" from the main branch that tracks the HDD contents:</p>
<pre><code>$ git log --graph --oneline
* XXXXXXXXXX (HEAD -> master) add DVD1
|\
| * XXXXXXXXXX (DVD1) DVD1
* XXXXXXXXXX add CD1
|\
| * XXXXXXXXXX (CD1) CD1
* XXXXXXXXXX init
</code></pre>
<p>and any file given as argument to <code>git annex whereis</code> should be easily traceable:</p>
<pre><code>$ git-annex whereis file
whereis file (4 copies)
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX -- HDD [here]
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX -- CD1
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX -- DVD1
XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX -- BD1
ok
</code></pre>
<p>Seems good enough for me but considering I can barely wrap my head around all that git and git-annex are capable of I'm posting here hoping for a sanity check that this solution does work as expected and/or suggestions for improvements before fully committing to it.</p>
comment 15http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_15_9ddd6278ea00c5594083347b5b9b8405/joey2021-10-11T16:40:49Z2021-10-11T16:07:38Z
<p>git-annex now has the ability to import a tree of files from a
directory special remote, which results in a remote tracking branch, the
same as you'd have after fetching a git remote.</p>
<pre><code>git-annex initremote dvd type=directory directory=/path/to/DVD encryption=none importtree=yes
git-annex import master --from dvd --no-content
</code></pre>
<p>The --no-content option avoids copying files to the local disk, although
their content still will have to be read to hash them. If you want to
copy the files from the disk at the same time, omit that option.</p>
<p>After that, you can use the dvd/master branch it created in whatever way
you desire. Also, if you want the disc's files to end up in a subdirectory,
that can be specified when you import, eg "master:dvd" will put the files
into a dvd/ subdirectory.</p>
<p>Using this with multiple discs would probably work best if there was a way
to mount each DVD to its own unique location.</p>
directory special remotehttp://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_16_c3f6cb58dd2328b7af8dd2657c2e2a1e/username2021-10-17T22:15:51Z2021-10-17T22:15:51Z
<p>Thanks for the suggestion.
I've looked into directory special remotes and using them for the optical media appears to undermine the intent of step 3 in my previous reply, of "mounting" each disc using <code>git checkout DISC_LABEL</code>, because the master branch contents are combined with the imported directory special remote contents.</p>
<p>The <code>git checkout</code> should leave the working directory with a 1:1 copy of the directory tree of the imported disc, except with all files replaced by broken annex symlinks.</p>
<p>But I'm considering the opposite now: using the directory special remote not for the optical discs but for the master branch of the repo instead, the one that tracks the local HDD tree:</p>
<pre><code>git-annex initremote HDD type=directory directory=/path/to/HDD encryption=none importtree=yes
</code></pre>
<p>The local dataset I want to use as the seed for the catalogue has multiple hardlinks so making a git-annex repository directly within it is out of the question as it would lead to duplicated data.</p>
<p>The initial plan to work around that was making a reflink copy of the directory tree, initialising the git-annex repo therein, and regularly update its master branch by replacing the git working directory with a brand new reflink clone and <code>git-annex add</code>'ing it.</p>
<p>If I understood git-annex right, this would imply a full re-read of the whole dataset because of the changed inode numbers of the new reflink clone, despite the contents, filenames, and mtimes of most files being 100% identical.</p>
<p>However, it seems that using a directory special remote would neatly circumvent that (at least until the current HDD dies and I'm forced to <code>mkfs</code> in the replacement) because git-annex would be smart enough to detect renames by looking at the stable inode and mtime of the moved files.</p>
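<p>To make the inode-and-mtime idea concrete, here is a small sketch of how unchanged-but-renamed files can be spotted without reading their contents. It illustrates the general technique only and is not how git-annex implements it; <code>snapshot</code> and <code>renamed</code> are hypothetical names, GNU <code>stat</code> is assumed, and hardlinked files (which share an inode) would need extra handling because only one path per inode is remembered.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration only; not git-annex code.

# snapshot DIR
# Print "<inode> <mtime> <size> <path>" for every regular file under DIR.
snapshot() {
    ( cd "$1" && find . -type f -exec stat -c '%i %Y %s %n' {} + ) | sort
}

# renamed OLD NEW
# Compare two snapshot files and list paths that changed while inode,
# mtime and size stayed the same, i.e. likely renames.
renamed() {
    awk '
        {
            key = $1 " " $2 " " $3
            path = $0
            sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)
        }
        FILENAME == ARGV[1] { old[key] = path; next }
        (key in old) && old[key] != path { print old[key] " -> " path }
    ' "$1" "$2"
}
```

<p>Taking a snapshot before and after reorganising the tree and feeding both files to <code>renamed</code> lists the moves, with no file content read at any point.</p>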
<p>The local dataset is around 250K files and 4TiB in real size, ballooning to over 8TiB if hardlinked files were counted as copies. The updates (using <code>git-annex import master --from HDD --no-content</code>) to the catalogue master branch would happen with a frequency somewhere between monthly and every 2 years.</p>
<p>2 questions:</p>
<ol>
<li><p>Am I correct in assuming that re-importing a special remote would only read the newly added files and correctly detect all renames and deletions without re-reading, no matter how much time passes between re-imports of the master branch?</p></li>
<li><p>Are there any downsides (scalability, memory use, etc) to using a directory special remote for this use case instead of a regular git-annex repository?</p></li>
</ol>
comment 17http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_17_455ab8b72c477769d68eb4348e911234/Lukey2021-10-18T15:36:06Z2021-10-18T15:36:06Z
<blockquote><p>Thanks for the suggestion. I've looked into directory special remotes and using them for the optical media appears to undermine the intent of step 3 in my previous reply, of "mounting" each disc using git checkout DISC_LABEL, because the master branch contents are combined with the imported directory special remote contents.</p></blockquote>
<p>Huh? Of course you can import to a separate branch.</p>
comment 18http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_18_679ad639cbba7f9a4d28131219d20178/username2021-10-18T19:54:29Z2021-10-18T19:54:29Z
<p>I wasn't clear enough, or maybe I did something wrong when testing the directory special remote.</p>
<p>When using standard git-annex repositories for everything as explained in comment 14 I get this:</p>
<pre><code>$ git log --graph --oneline
* 5555555555 (HEAD -> master) amended commit 2
|\
| * 4444444444 (DVD1) DVD1
* 3333333333 amended commit 1
|\
| * 2222222222 (CD1) CD1
* 1111111111 init
</code></pre>
<p>The master branch tracks only the contents of my local HDD (note that in step 4 I edit the working directory to my liking and amend the automatic <code>git-annex sync</code> commit), and the CD1 branch contains only what's in that disc and nothing else, such that <code>git checkout 2222222222</code> or <code>git checkout CD1</code> replaces everything in the working directory with the contents of CD1, similar to mounting the physical disc and browsing its filesystem.</p>
<p>Using the directory special remote, the contents of my master branch are combined with the contents of the special remote, so the same <code>git checkout CD1</code> command wouldn't replicate exactly what's on CD1, the master branch directory tree would still be present after the branch switch.</p>
comment 19http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_19_854531569edbb5c152c4d6b0d764a6c5/Lukey2021-10-19T17:17:47Z2021-10-19T17:17:47Z
Again, you can just run <code>git annex import --from=CD1 CD1</code>, which will import everything on the CD1 special remote to a branch named "CD1" which is completely independent from the master branch (it won't even share the same history).
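<p>Combined with the <code>initremote</code> line from comment 15, the per-disc routine could be generated like this. It is a sketch under assumptions: <code>print_import_commands</code> is a hypothetical helper that only prints the two commands for review, and adding <code>--no-content</code> to the import is carried over from comment 15 rather than from the command just above.</p>

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration only; prints commands, runs nothing.

# print_import_commands NAME MOUNT
# Emit the initremote and import commands for one disc, using NAME both
# as the special remote name and as the branch argument to import.
print_import_commands() {
    local name=$1 mount=$2
    printf 'git annex initremote %q type=directory directory=%q encryption=none importtree=yes\n' \
        "$name" "$mount"
    printf 'git annex import --from=%q %q --no-content\n' "$name" "$name"
}
```

<p>For instance, <code>print_import_commands CD1 /media/CD1</code> prints the pair of commands for the first disc, and the same call with another label and mount point covers the next one.</p>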
comment 20http://git-annex.branchable.com/forum/Managing_a_large_number_of_files_archived_on_many_pieces_of_read-only_medium___40__E.G._DVDs__41__/comment_20_a80b9ce3a78fb44c9d60e877146d9e59/username2021-10-23T15:53:57Z2021-10-23T15:53:57Z
<p>Aha, that's what I was missing!</p>
<p>I was working under the wrong assumption that the git-annex location tracking was tied in some way to the master branch and that alternate branches required shared history with master to have their contents tracked.</p>
<p>Wholly independent branches for each disc is exactly what I want, thanks Lukey and Joey for your suggestions.</p>