It would be useful to have a git-annex-cat command that outputs the contents of an annexed file without storing it in the annex. This can be faster than git-annex-get followed by cat, even if the file is already present. It avoids some failure modes of git-annex-get (like running out of local space, or contending for locks). It supports a common use case of just needing a file for some operation, without needing to remember to drop it later. It could be used to implement a web server or FUSE filesystem that serves git-annex repo files on demand.
If the file is not present, or remote.here.cost is higher than remote.someremote.cost where the file is present, someremote would get a TRANSFER request where the FILE argument is a named pipe, and a cat of that named pipe would be started.
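The named-pipe idea can be tried out in isolation. The following is only a rough sketch, assuming a POSIX system: `fake_remote_retrieve` is a hypothetical stand-in for a remote that handles the request by writing sequentially to the path it is given, and `cat_via_fifo` plays the role of the cat on the pipe. Neither name exists in git-annex.

```python
import os
import tempfile
import threading


def fake_remote_retrieve(dest_path: str, payload: bytes) -> None:
    """Hypothetical remote: writes the object content to whatever path it is handed."""
    with open(dest_path, "wb") as dest:  # blocks until a reader opens the FIFO
        dest.write(payload)


def cat_via_fifo(payload: bytes) -> bytes:
    """Pass payload through a named pipe and return the bytes the reader received."""
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo_path = os.path.join(tmpdir, "annex-cat.fifo")
        os.mkfifo(fifo_path)  # POSIX only; the pipe never stores the content on disk

        # The "remote" runs concurrently, as a separate transfer process would.
        writer = threading.Thread(target=fake_remote_retrieve, args=(fifo_path, payload))
        writer.start()

        chunks = []
        with open(fifo_path, "rb") as fifo:
            while True:
                chunk = fifo.read(65536)
                if not chunk:  # writer closed its end: end of stream
                    break
                chunks.append(chunk)
        writer.join()
        return b"".join(chunks)
```

This only works because the stand-in remote writes sequentially. A remote that seeks in the destination, replaces it, or checks its size would behave differently.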
If the file is not annexed, for uniformity git-annex-cat file would just call cat file.
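As a rough illustration of the selection rule described above, the sketch below picks where the content would be read from. The `Location` type, the `pick_source` function and the `"plain-cat"` marker are made up for this example, and the costs are passed in directly rather than read from git config. It assumes that no remote is itself named "here".

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Location:
    cost: float        # e.g. the value of remote.<name>.cost
    has_content: bool  # whether this location holds the file's content


def pick_source(annexed: bool, here: Location, remotes: Dict[str, Location]) -> str:
    """Return "plain-cat", "here", or the name of the remote to stream from."""
    if not annexed:
        return "plain-cat"  # not an annexed file: behave like cat

    candidates = {name: loc for name, loc in remotes.items() if loc.has_content}
    if here.has_content:
        candidates["here"] = here
    if not candidates:
        raise LookupError("no known location has the content")

    # Lowest cost wins; on a tie the local copy is preferred, since a remote
    # is only used when the local cost is strictly higher.
    return min(candidates, key=lambda name: (candidates[name].cost, name != "here"))
```

For example, with a local copy at cost 300 and the content also on someremote at cost 200, the sketch selects someremote.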
A git annex cat would be useful for the very web server purpose you describe (WIP at https://gitlab.com/chrysn/annex-to-web, though I'm not sure it's going anywhere).
Unlike git annex inprogress, which I (will) use as a workaround, this could take a --skip argument that usually just seeks into the file. If the data is served from a remote that allows seeking access (e.g. IPFS), then that access could be prioritized and that part downloaded first. (Implementing this would require another tmp pool for sparse files, as they couldn't go with the git annex inprogress files, since those are expected to grow to completion; but anyway, this would be an entry point for such a feature if it is ever added.)
git-annex's API for getting object content from remotes involves a destination file that is written to. That limits the efficiency of such a command. There would need to be a separate API for streaming, which some remotes will not have any hope of supporting.
@Ilya_Shlyakhter, I'd assume:
"There would need to be a separate API for streaming, which some remotes will not have any hope of supporting" -- there could be a default implementation using the current protocol (TRANSFER RETRIEVE to a tempfile, then cat and rm), which some remotes could override with a true streaming implementation.
"(1) some remotes would write to the named pipe; (2) some remotes would overwrite it with a file; (3) some remotes would open it, try to seek around as they do non-sequential receives, and hang or something; (4) some remotes would maybe open and write to it, but would no longer be able to resume interrupted transfers, since they would I guess see its size as 0" -- there could be a config flag to tell git-annex to assume that a given (legacy) remote does (1), at the user's own risk. Am I wrong to think (1) holds for most legacy remotes?
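The "default implementation that some remotes override" suggestion could look roughly like the sketch below. It is not git-annex's actual (Haskell) remote interface: the `Remote`, `LegacyRemote` and `StreamingRemote` classes and their methods are hypothetical, and an in-memory dict stands in for the remote's storage.

```python
import os
import tempfile
from typing import Dict, Iterator


class Remote:
    """Hypothetical remote interface with a file-based retrieve and a streaming hook."""

    def retrieve_to_file(self, key: str, dest_path: str) -> None:
        raise NotImplementedError

    def stream(self, key: str) -> Iterator[bytes]:
        """Default: retrieve to a temp file, yield its content, then remove it."""
        fd, tmp_path = tempfile.mkstemp(prefix="annex-cat-")
        os.close(fd)
        try:
            self.retrieve_to_file(key, tmp_path)
            with open(tmp_path, "rb") as tmp:
                while chunk := tmp.read(65536):
                    yield chunk
        finally:
            os.remove(tmp_path)  # the "rm" step, also run if the consumer stops early


class LegacyRemote(Remote):
    """Only knows how to write a whole object to a destination file."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = objects

    def retrieve_to_file(self, key: str, dest_path: str) -> None:
        with open(dest_path, "wb") as dest:
            dest.write(self.objects[key])


class StreamingRemote(LegacyRemote):
    """Overrides the default with true streaming; no temp file is involved."""

    def stream(self, key: str) -> Iterator[bytes]:
        data = self.objects[key]
        for start in range(0, len(data), 65536):
            yield data[start:start + 65536]
```

A caller would use `stream()` in both cases and never needs to know which kind of remote it is talking to; only the cost differs (disk space and latency for the fallback).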