It would be useful to have a `git-annex-cat` command that outputs the contents of an annexed file without storing it in the annex. This can be faster than `git-annex-get` followed by `cat`, even if the file is already present. It avoids some failure modes of `git-annex-get` (like running out of local space, or contending for locks). It supports the common use case of needing a file for just one operation, without having to remember to drop it later. It could also be used to implement a web server or FUSE filesystem that serves git-annex repo files on demand.
If the file is not present, or `remote.here.cost` is higher than `remote.someremote.cost` where the file is present, `someremote` would get a `TRANSFER` request where the `FILE` argument is a named pipe, and a `cat` of that named pipe would be started.
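The named-pipe idea can be illustrated outside git-annex with a rough sketch. Everything here is an assumption made for illustration: `fake_remote_retrieve` is a hypothetical stand-in for a remote that writes sequentially to the path it is given, and `cat_via_fifo` is not an existing git-annex function. It requires a POSIX system, since it relies on `os.mkfifo`.

```python
import os
import shutil
import tempfile
import threading


def fake_remote_retrieve(key, dest_path):
    """Hypothetical remote: writes the object's content sequentially to dest_path."""
    with open(dest_path, "wb") as dest:
        for chunk in (b"content of ", key.encode(), b"\n"):
            dest.write(chunk)


def cat_via_fifo(key, retrieve, out):
    """Stream an object to `out` through a named pipe, without storing it on disk."""
    tmpdir = tempfile.mkdtemp()
    fifo = os.path.join(tmpdir, "transfer.fifo")
    os.mkfifo(fifo)
    try:
        # Opening a FIFO for writing blocks until a reader opens it,
        # so the writer has to run concurrently with the reader below.
        writer = threading.Thread(target=retrieve, args=(key, fifo))
        writer.start()
        with open(fifo, "rb") as src:
            shutil.copyfileobj(src, out)
        writer.join()
    finally:
        shutil.rmtree(tmpdir)
```

Calling `cat_via_fifo("SOME-KEY", fake_remote_retrieve, sys.stdout.buffer)` (after `import sys`) would print the content as it arrives. A real version would also have to cope with a remote that fails before ever opening the pipe, since the reader would otherwise block forever.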
If the file is not annexed, then for uniformity `git-annex-cat file` would just call `cat file`.
A `git annex cat` would be useful for the very web server purpose you describe (WIP at https://gitlab.com/chrysn/annex-to-web, though I'm not sure it's going anywhere).
Unlike `git annex inprogress`, which I (will) use as a workaround, this could take a `--skip` argument that usually just seeks into the file. If the data is served from a remote that allows seeking access (e.g. IPFS), then that access could be prioritized and that part downloaded first. (Implementing this would require another tmp pool for sparse files, as they couldn't go with the `git annex inprogress` files, since there is the expectation that those will grow to completion; but anyway, this would be an entry point for such a feature if it is ever added.)
git-annex's API for getting object content from remotes involves a destination file that is written to. That limits the efficiency of such a command. There would need to be a separate API for streaming, which some remotes will not have any hope of supporting.
@Ilya_Shlyakhter, I'd assume:
"There would need to be a separate API for streaming, which some remotes will not have any hope of supporting" -- there could be a default implementation using the current protocol (`TRANSFER RETRIEVE` to a tempfile, then `cat` and `rm`), which some remotes could override with a true streaming implementation.
"(1) some remotes would write to the named pipe; (2) some remotes would overwrite it with a file; (3) some remotes would open it, try to seek around as they do non-sequential receives, and hang or something; (4) some remotes would maybe open and write to it, but would no longer be able to resume interrupted transfers, since they would I guess see its size as 0" -- there could be a config flag to tell git-annex to assume that a given (legacy) remote does (1), at the user's own risk. Am I wrong to think (1) holds for most legacy remotes?
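The "default implementation that some remotes override" idea could look roughly like the sketch below. This is not git-annex's actual remote API: the `Remote`, `LegacyRemote` and `StreamingRemote` classes and their method names are hypothetical, and an in-memory dict stands in for real storage.

```python
import os
import shutil
import tempfile


class Remote:
    """Hypothetical remote interface (not git-annex's real API)."""

    def retrieve_to_file(self, key, dest_path):
        raise NotImplementedError

    def stream(self, key, out):
        """Default: retrieve to a temp file, copy it to `out`, then remove it."""
        fd, tmp = tempfile.mkstemp()
        os.close(fd)
        try:
            self.retrieve_to_file(key, tmp)
            with open(tmp, "rb") as src:
                shutil.copyfileobj(src, out)
        finally:
            os.remove(tmp)


class LegacyRemote(Remote):
    """Only knows how to write to a destination file; inherits the default stream()."""

    def __init__(self, objects):
        self.objects = objects  # assumed in-memory stand-in for remote storage

    def retrieve_to_file(self, key, dest_path):
        with open(dest_path, "wb") as dest:
            dest.write(self.objects[key])


class StreamingRemote(LegacyRemote):
    """Overrides stream() with a direct write, skipping the temp file."""

    def stream(self, key, out):
        out.write(self.objects[key])
```

With this shape, a caller only ever uses `remote.stream(key, out)`; whether the bytes pass through a temp file is a detail of each remote.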
Joey, I recently came across this same use case. There are some intermediate files that I store safely in the cloud using git annex, and I want to fetch them.
Doing a `git annex get` and a drop seems like the wrong solution. Why am I unnecessarily adding risk when I know I don't care whether the file currently exists in my repo? I then have to think about various cases, like whether I already had the file in my repo or not, and be very careful. I can't just do a `git annex get; cat; git annex drop`.
I could use a pull-only clone of my git annex repo, but that comes with many issues and usage hassles, like reconfiguring everything. On top of this, I'd sometimes need to do a `git annex drop --force` in my clones, since they may not have access to everything that the main repo does, which is even more scary.
Your concerns here make sense to me. However, streaming vs. downloading is just an optimization. I'm HAPPY to pay the performance cost, which is much better than the safety cost I'm currently facing with my hacky solutions to this problem. All we need (from my meager understanding of git annex internals) is to have the `git annex cat` command download the contents to a temporary file (in the literal `/tmp` directory) instead of the `annex/objects` directory, and then `cat` that at the end. That's pretty much all I (we?) am asking for.
I do know that you like to do things perfectly, and I'm sure there'll be lots of issues with the proposal here that you can see and the rest of us cannot. But that's true of solutions too. Really, really hoping you can figure out a solution for this. I'm happy to try and help with the code changes too, if that helps. I have never used Haskell before, but I'm very happy to take on that challenge if we can settle on a design.