Unlocked mode without data also under .git/annex?

I'm working on setting things up for making archives on both hard drives and possibly optical media. I want them readable without needing to have Linux or git-annex, so I'm going to be using NTFS (for HDDs) and UDF (for optical). So I'm using adjust --unlocked everywhere. I don't want anything mucking with the source data; making hard links, tweaking timestamps, etc.

Additionally, timestamps are highly relevant for me to preserve.

Initially I thought I might be able to use the directory remote with importdirectory, but I ran into a lot of performance problems with my 150,000-file setup, as well as the bug at ?Files recorded with other file's checksums. So I'm trying to work without the special remote at all, which seems to be much more performant and, I suspect, less buggy.

So now I'm thinking of having a repo on each drive, and rsyncing the content to it. I can easily enough exclude .git. Then I git annex add, git annex sync, and I think I've got what I need.

The question is, why, with unlocked mode, do we have to have the files also stored in .git/annex? Of course, most (though not all) files will be hardlinked to them. But as we know, when we modify a file, it invalidates its checksum, so I'm not really seeing what this adds.

If we didn't have to do this, I could just add git-annex to the source repo itself, secure in the notion that if I only push out from there, it will never modify anything there at all. Then I could just pull elsewhere.

On the targets, I guess it doesn't hurt too much since it's hardlinked.

RSS Atom

comment 1

Suppose git-annex did behave that way. Now suppose that you ran:

git config annex.largefiles 'include=*'
git add largefile
git commit -m 'added a large file to git-annex (unlocked)'
git stash

Then git would have deleted the only copy of largefile, which was the one stored in the working tree. You would have lost data. The hard links, annoying as they are, avoid this problem.

But as we know, when we modify a file, it invalidates its checksum

Right, and if you're going to be running things that open and modify files, then it is not safe to set annex.thin. "echo foo > largefile" will modify the file and lose the original version.

The difference is that you have to run something that does usually modify the file to lose data (with annex.thin set). Running a git command that is normally entirely safe will not lose data.

So the user of annex.thin only needs to keep in mind that some things that would usually modify a file will lose the previous version of it, unless they've copied it to another remote. They don't have to live in fear of running a command that is usually safe and reversable and that causing data loss.

Comment by joey — Fri Sep 9 18:54:08 2022

Remove comment

comment 2

In my case, I really don't care about losing the old version of a file. I have ZFS snapshots and backups to take care of that. I do use thin, and that particular issue doesn't really bother me. I wounder if a "superthin" where there is no hardlinking into .git/annex at all would be possible? I'm aware that, yes, that could make previous versions unavailable and so forth.

Comment by jgoerzen — Sat Sep 10 00:32:57 2022

Remove comment

comment 3

If you're using ZFS, you should not need to set annex.thin at all; git-annex will use reflinks between annex object files and unlocked working tree files, the annex object file will not use any additional disk space.

Comment by joey — Tue Sep 13 19:10:48 2022

Remove comment

comment 4

ZFS doesn't support reflinks. XFS does.

Comment by Lukey — Wed Sep 14 15:09:20 2022

Remove comment

comment 5

Ah, oops.. I was thinking about BTRFS..

However, getting back to the original motivation of jgoerzen to request this, it seems to come down to making a hard link being seen as "mucking with the source data". That seems like a very weak reason to make such a very large change to git-annex, that would only be safe in a small and poorly defined set of circumstances.

And it would be a large change, because currently git-annex can broadly assume that any time a .git/annex/objects/ file exists, the content is present in the repository. Every place that makes that assumption would need to instead check if any of the known work tree files that use the object are populated with the content (or at least are not annex pointer files).

(jgoerzen also mentions timestamps, but git-annex preserves those when ingesting files. Of course timestamp data is not recorded in the git repository unless you use some other tool to do so.)

Comment by joey — Thu Sep 15 16:27:54 2022

Remove comment

comment 6

I understand what you're saying here, and at this point this is probably not super relevant due to the large change it would represent, but just thought I'd further clarify the use case....

I, in some cases, use hard links fairly intentionally. "This file is both a photo and a record of something; let me show it in both places." I don't want things hardlinked together that previously weren't, and don't want existing hardlinks broken. Now this is for my daily use; for long-term archiving, it isn't all that important. So I don't want hard links being adjusted on the source, but don't care so much about the destination (at least so long as broken hardlinks don't result in excessive increases in storage space requirements; but my main area where that would occur has 10s of millions of files, so won't be using git-annex for other reasons.)

Also in some cases, I have a read-only directory (whether from NFS mount or something else). It is easy enough to mount a .git onto it, via a bind-mount or anything else, but trying to modify the actual content of course wouldn't work.

Anyhow, thanks for the conversation!

Comment by jgoerzen — Mon Sep 19 14:49:02 2022

Remove comment

Add a comment