git-annex-cat

It would be useful to have a git-annex-cat command that outputs the contents of an annexed file without storing it in the annex. This can be faster than git-annex-get followed by cat, even if the file is already present. It avoids some failure modes of git-annex-get (such as running out of local space, or contending for locks). It supports the common use case of needing a file for a single operation, without having to remember to drop it afterwards. It could also be used to implement a web server or FUSE filesystem that serves git-annex repo files on demand.

If the file is not present locally, or remote.here.cost is higher than remote.someremote.cost for a remote someremote where the file is present, someremote would get a TRANSFER request whose FILE argument is a named pipe, and a cat of that named pipe would be started.
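A minimal sketch of the named-pipe handoff described above, assuming a well-behaved remote. The `retrieve` callable is a hypothetical stand-in for a remote's TRANSFER RETRIEVE handler and is not part of git-annex; the sketch only works if that handler opens the given path and writes to it sequentially.

```python
import os
import shutil
import tempfile
import threading


def cat_via_fifo(retrieve, out):
    """Copy an object to `out` through a named pipe.

    `retrieve(path)` is a hypothetical stand-in for a remote's retrieve
    handler. It must open `path` for writing and write sequentially;
    a handler that replaces the path or never opens it would leave the
    reader below blocked.
    """
    tmpdir = tempfile.mkdtemp(prefix="annex-cat-")
    fifo = os.path.join(tmpdir, "object")
    os.mkfifo(fifo)
    errors = []

    def run_retrieve():
        try:
            retrieve(fifo)
        except Exception as exc:  # surfaced to the caller after the copy
            errors.append(exc)

    writer = threading.Thread(target=run_retrieve)
    writer.start()
    try:
        # Opening for reading blocks until the writer opens the pipe.
        with open(fifo, "rb") as src:
            shutil.copyfileobj(src, out)
    finally:
        writer.join()
        shutil.rmtree(tmpdir)
    if errors:
        raise errors[0]
```

Passing `sys.stdout.buffer` as `out` would give the cat-like behaviour; nothing is written to the annex object store along the way.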

If the file is not annexed, then for uniformity git-annex-cat file would just call cat file.

Would be useful

A git annex cat would be useful for the very web server purpose you describe (WIP at https://gitlab.com/chrysn/annex-to-web, though I'm not sure it's going anywhere).

Unlike git annex inprogress, which I (will) use as a workaround, this could take a --skip argument that usually just seeks into the file. If the data is served from a remote that allows seeking access (e.g. IPFS), then that access could be prioritized and that part downloaded first. (Implementing this would require another tmp pool for sparse files, since they couldn't live alongside the git annex inprogress files, which are expected to grow to completion; but this would be an entry point for such a feature if it is ever added.)

Comment by chrysn — Tue Dec 17 09:08:08 2019

comment 2

git-annex's API for getting object content from remotes involves a destination file that is written to. That limits the efficiency of such a command. There would need to be a separate API for streaming, which some remotes will not have any hope of supporting.

Comment by joey — Wed Dec 18 17:49:47 2019

named pipes as destination files

"getting object content from remotes involves a destination file that is written to" -- what happens if git-annex makes a named pipe, and passes that as the destination file name to the remote?

Comment by Ilya_Shlyakhter — Wed Dec 18 18:41:57 2019

comment 4

@Ilya_Shlyakhter, I'd assume:

some remotes would write to the named pipe
some remotes would overwrite it with a file
some remotes would open it, try to seek around as they do non-sequential receives, and hang or something
some remotes would maybe open and write to it, but would no longer be able to resume interrupted transfers, since they would I guess see its size as 0
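The last two points can be checked with a small probe. This is an illustrative sketch, not git-annex code: it writes to a FIFO, then reports the size that stat gives for it and the error raised by an attempted seek. The size of 0 is what Linux reports; other platforms may differ.

```python
import os
import tempfile
import threading


def probe_fifo():
    """Return (reported_size, seek_error) for a FIFO that has been written to."""
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "object")
        os.mkfifo(fifo)

        # A reader must be attached, or opening for writing would block.
        def drain():
            with open(fifo, "rb") as src:
                src.read()

        reader = threading.Thread(target=drain)
        reader.start()

        seek_error = None
        with open(fifo, "wb", buffering=0) as dst:
            dst.write(b"partial data")
            # A size-based resume check sees nothing to resume from.
            reported_size = os.stat(fifo).st_size
            try:
                # Non-sequential writes are impossible: pipes cannot seek.
                os.lseek(dst.fileno(), 4, os.SEEK_SET)
            except OSError as exc:
                seek_error = exc
        reader.join()
        return reported_size, seek_error
```

Here the seek fails with an error rather than hanging; how an actual remote reacts to that error is up to the remote.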

Comment by joey — Wed Jan 1 18:44:37 2020

re: git-annex-cat

"There would need to be a separate API for streaming, which some remotes will not have any hope of supporting" -- there could be a default implementation using the current protocol (TRANSFER RETRIEVE to tempfile then cat and rm), which some remotes could override with a true streaming implementation.
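A rough sketch of that default-plus-override idea. The `Remote` class and its `retrieve_to_file` and `stream` methods are hypothetical names chosen for illustration; they are not git-annex's actual remote interface.

```python
import os
import shutil
import tempfile


class Remote:
    """Hypothetical remote interface, not git-annex's actual API."""

    def retrieve_to_file(self, key, dest):
        """Write the content of `key` to the file at `dest`."""
        raise NotImplementedError

    def stream(self, key, out):
        # Default: retrieve to a temp file, copy it out, then remove it.
        fd, tmp = tempfile.mkstemp(prefix="annex-cat-")
        os.close(fd)
        try:
            self.retrieve_to_file(key, tmp)
            with open(tmp, "rb") as src:
                shutil.copyfileobj(src, out)
        finally:
            os.unlink(tmp)


class DictRemote(Remote):
    """Toy file-based remote that relies on the default stream()."""

    def __init__(self, objects):
        self.objects = objects

    def retrieve_to_file(self, key, dest):
        with open(dest, "wb") as f:
            f.write(self.objects[key])


class StreamingDictRemote(DictRemote):
    """Toy remote that overrides stream() and skips the temp file."""

    def stream(self, key, out):
        out.write(self.objects[key])
```

Callers only ever use `stream`, so a remote that cannot stream still works, just with the extra copy through the temp file.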

"(1) some remotes would write to the named pipe; (2) some remotes would overwrite it with a file; (3) some remotes would open it, try to seek around as they do non-sequential receives, and hang or something; (4) some remotes would maybe open and write to it, but would no longer be able to resume interrupted transfers, since they would I guess see its size as 0" -- there could be a config flag to tell git-annex to assume that a given (legacy) remote does (1), at the user's own risk. Am I wrong to think (1) holds for most legacy remotes?

Comment by Ilya_Shlyakhter — Thu Jul 9 01:06:37 2020

comment 6

Joey, I recently came across this same use case. There are some intermediate files that I store safely in the cloud using git annex, and I want to fetch them.

Doing a git annex get and a drop seems like the wrong solution. Why am I unnecessarily adding risk when I know I don't care about whether the file currently exists in my repo? I then have to think about various cases, such as whether I already had the file in my repo or not, and be very careful. I can't just do a git annex get; cat; git annex drop.

I could use a pull-only clone of my git annex repo, but that comes with many issues and usage hassles, like reconfiguring everything. On top of this, I'd sometimes need to do a git annex drop --force in my clones, since they may not have access to everything that the main repo does, which is even more scary.

Your concerns here make sense to me. However, streaming vs downloading is just an optimization. I'm HAPPY to pay the performance cost, which is much better than the safety cost I'm currently facing with my hacky solutions to this problem. All we need (from my meager understanding of git annex internals) is to have the git annex cat command download the contents to a temporary file (in the literal /tmp directory) instead of the annex/objects directory, and then cat that at the end. That's pretty much all I (we?) am asking for.

I do know that you like to do things perfectly, and I'm sure there'll be lots of issues with the proposal here that you can see and the rest of us cannot. But that's true of solutions too. Really, really hoping you can figure out a solution for this. I'm happy to try and help with the code changes too, if that helps. I have never used Haskell before, but I am very happy to take on that challenge if we can settle on a design.

Comment by Doable8234 — Tue Jan 7 02:11:33 2025

Tags: needsthought

Last edited Wed Jun 17 01:18:32 2020