Recent changes to this wiki:
findcompute --inputs
Useful for eg, generating dependency graphs.
Useful for eg, generating dependency graphs.
diff --git a/Command/FindComputed.hs b/Command/FindComputed.hs index d9064b2390..83f9a3cddd 100644 --- a/Command/FindComputed.hs +++ b/Command/FindComputed.hs @@ -17,6 +17,10 @@ import Command.Find (showFormatted, formatVars) import Remote.Compute (isComputeRemote, getComputeState, ComputeState(..)) import qualified Remote import qualified Types.Remote as Remote +import Database.Keys +import Annex.CatFile + +import qualified Data.Map as M cmd :: Command cmd = withAnnexOptions [annexedMatchingOptions] $ noCommit $ noMessages $ @@ -28,6 +32,7 @@ data FindComputedOptions = FindComputedOptions { findThese :: CmdParams , formatOption :: Maybe Utility.Format.Format , keyOptions :: Maybe KeyOptions + , inputsOption :: Bool } optParser :: CmdParamsDesc -> Parser FindComputedOptions @@ -35,6 +40,10 @@ optParser desc = FindComputedOptions <$> cmdParams desc <*> optional parseFormatOption <*> optional parseBranchKeysOption + <*> switch + ( long "inputs" + <> help "display input files" + ) parseFormatOption :: Parser Utility.Format.Format parseFormatOption = @@ -69,22 +78,51 @@ start o isterminal computeremotes _ file key = do if null rcs then stop else startingCustomOutput key $ do - forM_ rcs $ \(r, c) -> do - let computation = unwords (computeParams c) - let unformatted = fromOsPath file - <> " (" <> encodeBS (Remote.name r) - <> ") -- " - <> encodeBS computation - let formatvars = - [ ("remote", Remote.name r) - , ("computation", computation) - ] ++ formatVars key (AssociatedFile (Just file)) - showFormatted isterminal (formatOption o) - unformatted formatvars + forM_ rcs display next $ return True where get r = fmap (r, ) <$> getComputeState (Remote.remoteStateHandle r) key + + showformatted = showFormatted isterminal (formatOption o) + + unformatted r computation = fromOsPath file + <> " (" <> encodeBS (Remote.name r) + <> ") -- " + <> encodeBS computation + + unformattedinputs (Right inputfile) = fromOsPath file + <> " " <> fromOsPath inputfile + unformattedinputs (Left inputkey) = fromOsPath file + <> " " <> serializeKey' inputkey + + display (r, c) = do + let computation = unwords (computeParams c) + let formatvars = + [ ("remote", Remote.name r) + , ("computation", computation) + ] ++ formatVars key (AssociatedFile (Just file)) + if inputsOption o + then forM_ (M.elems $ computeInputs c) $ \inputkey -> do + input <- maybe (Left inputkey) Right + <$> getassociated inputkey + showformatted (unformattedinputs input) $ + [ ("input", either serializeKey fromOsPath input) + , ("inputkey", serializeKey inputkey) + , ("inputfile", either (const "") fromOsPath input) + ] ++ formatvars + else showformatted (unformatted r computation) formatvars + + getassociated inputkey = + getAssociatedFiles inputkey + >>= mapM (fromRepo . fromTopFilePath) + >>= firstM (stillassociated inputkey) + + -- Some associated files that are in the keys database may no + -- longer correspond to files in the repository. + stillassociated k f = catKeyFile f >>= return . \case + Just k' | k' == k -> True + _ -> False startKeys :: FindComputedOptions -> IsTerminal -> [Remote] -> (SeekInput, Key, ActionItem) -> CommandStart startKeys o isterminal computeremotes (si, key, ActionItemBranchFilePath (BranchFilePath _ topf) _) = diff --git a/Command/WhereUsed.hs b/Command/WhereUsed.hs index bfe49d1a73..1a7e7033d8 100644 --- a/Command/WhereUsed.hs +++ b/Command/WhereUsed.hs @@ -70,9 +70,9 @@ start o (_, key, _) = startingCustomOutput key $ do where -- Some associated files that are in the keys database may no -- longer correspond to files in the repository. - stillassociated f = catKeyFile f >>= \case - Just k | k == key -> return True - _ -> return False + stillassociated f = catKeyFile f >>= return . \case + Just k | k == key -> True + _ -> False display :: Key -> StringContainingQuotedPath -> Annex () display key loc = do diff --git a/doc/git-annex-findcomputed.mdwn b/doc/git-annex-findcomputed.mdwn index 8e1bafe7d0..aa3ae07db1 100644 --- a/doc/git-annex-findcomputed.mdwn +++ b/doc/git-annex-findcomputed.mdwn @@ -32,18 +32,44 @@ For example: List computed files in the specified branch or treeish. +* `--inputs` + + Display each computed file followed by the input that is used to + produce it. The current location of the input file in the work tree is + displayed, but if the input file is not in the work tree, the key + is displayed instead. + + For example: + + foo.jpeg file.raw + bar.gz bar + + When multiple input files are needed to compute a file, outputs multiple + lines for that file: + + foo bar + foo baz + * `--format=value` Use custom output formatting. This option works the same as in [[git-annex-find]](1), with these additional variables available for use in it: - remote, computation + "${remote}", "${computation}" The default output format is the same as `--format='${file} (${remote}) -- ${computation}\\n'`, except when outputting to a terminal, control characters will be escaped. + When `--inputs` is used, there are additional variables "${inputfile}" + which is the input filename, "${inputkey}" which is the input key, + and "${input}" which is either the filename or the key. + The default output format for `--inputs` + is the same as `--format='${file} ${input}\\n'` + To separate the pair of files by nulls instead, use eg + `--format='${file}\\000${input}\\n' + * `--json` Output the list of files in JSON format.
checkPresent of compute remote checks inputs are available
If an input file has been lost from all repositories, it is no longer
possible to compute the output. This will avoid dropping content that
was computed in such a situation, as well as making git-annex fsck --from
the compute remote do its usual thing when content has gone missing.
This implementation avoids recursing forever if there is a cycle,
which should not be possible anyway.
Note the use of RemoteStateHandle as a constructor here suggests that
this may not handle sameas remotes right, since usually a
RemoteStateHandle is constructed using the sameas uuid for a sameas
remote. That assumes a compute remote can even have or be a sameas remote.
Which doesn't seem to make sense, so I have not thought through what might
happen here in detail.
If an input file has been lost from all repositories, it is no longer
possible to compute the output. This will avoid dropping content that
was computed in such a situation, as well as making git-annex fsck --from
the compute remote do its usual thing when content has gone missing.
This implementation avoids recursing forever if there is a cycle,
which should not be possible anyway.
Note the use of RemoteStateHandle as a constructor here suggests that
this may not handle sameas remotes right, since usually a
RemoteStateHandle is constructed using the sameas uuid for a sameas
remote. That assumes a compute remote can even have or be a sameas remote.
Which doesn't seem to make sense, so I have not thought through what might
happen here in detail.
diff --git a/Logs/Trust.hs b/Logs/Trust.hs index f2066ba29e..f7a705f7de 100644 --- a/Logs/Trust.hs +++ b/Logs/Trust.hs @@ -1,6 +1,6 @@ {- git-annex trust log - - - Copyright 2010-2022 Joey Hess <id@joeyh.name> + - Copyright 2010-2025 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. -} @@ -18,17 +18,15 @@ module Logs.Trust ( trustMapLoad, ) where -import qualified Data.Map as M -import Data.Default - import Annex.Common import Types.TrustLevel import qualified Annex import Logs import Remote.List -import qualified Types.Remote import Logs.Trust.Basic as X +import qualified Data.Map as M + {- Returns a list of UUIDs that the trustLog indicates have the - specified trust level. - Note that the list can be incomplete for SemiTrusted, since that's @@ -67,20 +65,4 @@ trustMap = maybe trustMapLoad return =<< Annex.getState Annex.trustmap {- Loads the map, updating the cache, -} trustMapLoad :: Annex TrustMap -trustMapLoad = do - forceoverrides <- Annex.getState Annex.forcetrust - l <- remoteList - let untrustoverrides = M.fromList $ - map (\r -> (Types.Remote.uuid r, UnTrusted)) - (filter Types.Remote.untrustworthy l) - logged <- trustMapRaw - let configured = M.fromList $ mapMaybe configuredtrust l - let m = M.unionWith min untrustoverrides $ - M.union forceoverrides $ - M.union configured logged - Annex.changeState $ \s -> s { Annex.trustmap = Just m } - return m - where - configuredtrust r = (\l -> Just (Types.Remote.uuid r, l)) - =<< readTrustLevel - =<< remoteAnnexTrustLevel (Types.Remote.gitconfig r) +trustMapLoad = trustMapLoad' =<< remoteList diff --git a/Logs/Trust/Basic.hs b/Logs/Trust/Basic.hs index 85e25ed20d..b05c072927 100644 --- a/Logs/Trust/Basic.hs +++ b/Logs/Trust/Basic.hs @@ -1,6 +1,6 @@ {- git-annex trust log, basics - - - Copyright 2010-2012 Joey Hess <id@joeyh.name> + - Copyright 2010-2025 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. -} @@ -9,16 +9,20 @@ module Logs.Trust.Basic ( module X, trustSet, trustMapRaw, + trustMapLoad', ) where import Annex.Common import Types.TrustLevel import qualified Annex.Branch import qualified Annex +import qualified Types.Remote import Logs import Logs.UUIDBased import Logs.Trust.Pure as X +import qualified Data.Map as M + {- Changes the trust level for a uuid in the trustLog. -} trustSet :: UUID -> TrustLevel -> Annex () trustSet uuid@(UUID _) level = do @@ -34,3 +38,21 @@ trustSet NoUUID _ = error "unknown UUID; cannot modify" - log file. -} trustMapRaw :: Annex TrustMap trustMapRaw = calcTrustMap <$> Annex.Branch.get trustLog + +trustMapLoad' :: [Remote] -> Annex TrustMap +trustMapLoad' l = do + forceoverrides <- Annex.getState Annex.forcetrust + let untrustoverrides = M.fromList $ + map (\r -> (Types.Remote.uuid r, UnTrusted)) + (filter Types.Remote.untrustworthy l) + logged <- trustMapRaw + let configured = M.fromList $ mapMaybe configuredtrust l + let m = M.unionWith min untrustoverrides $ + M.union forceoverrides $ + M.union configured logged + Annex.changeState $ \s -> s { Annex.trustmap = Just m } + return m + where + configuredtrust r = (\lvl -> Just (Types.Remote.uuid r, lvl)) + =<< readTrustLevel + =<< remoteAnnexTrustLevel (Types.Remote.gitconfig r) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 2ef7844808..792105a1b8 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -29,6 +29,8 @@ import Types.Remote import Types.ProposedAccepted import Types.MetaData import Types.Creds +import Types.TrustLevel +import Types.RemoteState import Config import Config.Cost import Remote.Helper.Special @@ -45,6 +47,8 @@ import qualified Annex.Transfer import Logs.MetaData import Logs.EquivilantKeys import Logs.Location +import Logs.Trust.Basic +import Logs.Remote import Messages.Progress import Utility.Metered import Utility.TimeStamp @@ -88,6 +92,11 @@ remote = RemoteType isComputeRemote :: Remote -> Bool isComputeRemote r = typename (remotetype r) == typename remote +isComputeRemote' :: RemoteConfig -> Bool +isComputeRemote' rc = case M.lookup typeField rc of + Nothing -> False + Just t -> fromProposedAccepted t == typename remote + gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote) gen r u rc gc rs = case getComputeProgram' rc of Left _err -> return Nothing @@ -788,11 +797,40 @@ avoidCycles outputkeys inputkey = filterM go rs' <- avoidCycles (inputkey:outputkeys) inputkey' rs return (rs' == rs) --- Make sure that the compute state exists. +-- Make sure that the compute state exists, and that the input keys are +-- still available (are not dead, and are stored in some repository). +-- +-- When an input key is itself stored in a compute remote, check that +-- its inputs are also still available. checkKey :: RemoteStateHandle -> Key -> Annex Bool checkKey rs k = do - states <- getComputeStatesUnsorted rs k - return (not (null states)) + deadset <- S.fromList . M.keys . M.filter (== DeadTrusted) + <$> (trustMapLoad' =<< Annex.getState Annex.remotes) + computeset <- S.fromList . M.keys . M.filter isComputeRemote' + <$> remoteConfigMap + availablecompute [] deadset computeset k rs + where + availablecompute inputkeys deadset computeset k' rs' + | k' `elem` inputkeys = return False -- avoid cycles + | otherwise = + anyM (hasinputs inputkeys deadset computeset . snd) + =<< getComputeStatesUnsorted rs' k' + + hasinputs inputkeys deadset computeset state = do + let ks = M.elems (computeInputs state) + ifM (anyM checkDead ks) + ( return False + , allM (available inputkeys deadset computeset) ks + ) + + available inputkeys deadset computeset k' = do + (repolocs, computelocs) <- + partition (flip S.notMember computeset) + . filter (flip S.notMember deadset) + <$> loggedLocations k' + if not (null repolocs) + then return True + else anyM (availablecompute (k':inputkeys) deadset computeset k' . RemoteStateHandle) computelocs -- Unsetting the compute state will prevent computing the key. dropKey :: RemoteStateHandle -> Maybe SafeDropProof -> Key -> Annex () diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 44ffe03b8d..4a2a23859e 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -42,15 +42,3 @@ compute special remote. --[[Joey]] Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. - -* Should checkPresent check that each input file is also present in some - (non-dead) repo? - - Currently it only checks if compute state is recorded. The problem (Diff truncated)
update
diff --git a/doc/git-annex-findcomputed.mdwn b/doc/git-annex-findcomputed.mdwn index a1c6cf1351..8e1bafe7d0 100644 --- a/doc/git-annex-findcomputed.mdwn +++ b/doc/git-annex-findcomputed.mdwn @@ -18,7 +18,7 @@ was provided to [[git-annex-addcomputed]](1). For example: # git-annex findcomputed - foo.png (imageconvert) -- convert file.raw file.jpeg passes=10 + foo.jpeg (imageconvert) -- convert file.raw file.jpeg passes=10 bar.gz (compressor) -- compress bar --level=9 # OPTIONS diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index c067051833..44ffe03b8d 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -15,9 +15,10 @@ compute special remote. --[[Joey]] * annex.diskreserve can also be violated if computing a file gets source files that are larger than the disk reserve. This could be checked. -* Maybe add a file matching option, eg: +* Maybe add a file matching options, eg: - git-annex find --inputof=remote:file + git-annex find --computeinputof=remote:file + git-annex find --computeoutputof=remote:file * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs
findcomputed: New command, displays information about computed files.
diff --git a/CHANGELOG b/CHANGELOG index 83df038ec3..51298af244 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -4,6 +4,7 @@ git-annex (10.20250116) UNRELEASED; urgency=medium * addcomputed: New command, adds a file that is generated by a compute special remote. * recompute: New command, recomputes computed files. + * findcomputed: New command, displays information about computed files. * Support help.autocorrect settings "prompt", "never", and "immediate". * Allow setting remote.foo.annex-tracking-branch to a branch name that contains "/", as long as it's not a remote tracking branch. diff --git a/CmdLine/GitAnnex.hs b/CmdLine/GitAnnex.hs index 8dc64f8b7b..5032278873 100644 --- a/CmdLine/GitAnnex.hs +++ b/CmdLine/GitAnnex.hs @@ -135,6 +135,7 @@ import qualified Command.MaxSize import qualified Command.Sim import qualified Command.AddComputed import qualified Command.Recompute +import qualified Command.FindComputed import qualified Command.Version import qualified Command.RemoteDaemon #ifdef WITH_ASSISTANT @@ -269,6 +270,7 @@ cmds testoptparser testrunner mkbenchmarkgenerator = map addGitAnnexCommonOption , Command.Sim.cmd , Command.AddComputed.cmd , Command.Recompute.cmd + , Command.FindComputed.cmd , Command.Version.cmd , Command.RemoteDaemon.cmd #ifdef WITH_ASSISTANT diff --git a/Command/FindComputed.hs b/Command/FindComputed.hs new file mode 100644 index 0000000000..d9064b2390 --- /dev/null +++ b/Command/FindComputed.hs @@ -0,0 +1,93 @@ +{- git-annex command + - + - Copyright 2025 Joey Hess <id@joeyh.name> + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE OverloadedStrings, TupleSections #-} + +module Command.FindComputed where + +import Command +import Git.FilePath +import qualified Utility.Format +import Utility.Terminal +import Command.Find (showFormatted, formatVars) +import Remote.Compute (isComputeRemote, getComputeState, ComputeState(..)) +import qualified Remote +import qualified Types.Remote as Remote + +cmd :: Command +cmd = withAnnexOptions [annexedMatchingOptions] $ noCommit $ noMessages $ + withAnnexOptions [jsonOptions] $ + command "findcomputed" SectionQuery "lists computed files" + paramPaths (seek <$$> optParser) + +data FindComputedOptions = FindComputedOptions + { findThese :: CmdParams + , formatOption :: Maybe Utility.Format.Format + , keyOptions :: Maybe KeyOptions + } + +optParser :: CmdParamsDesc -> Parser FindComputedOptions +optParser desc = FindComputedOptions + <$> cmdParams desc + <*> optional parseFormatOption + <*> optional parseBranchKeysOption + +parseFormatOption :: Parser Utility.Format.Format +parseFormatOption = + option (Utility.Format.gen <$> str) + ( long "format" <> metavar paramFormat + <> help "control format of output" + ) + +seek :: FindComputedOptions -> CommandSeek +seek o = do + unless (isJust (keyOptions o)) $ + checkNotBareRepo + isterminal <- liftIO $ checkIsTerminal stdout + computeremotes <- filter isComputeRemote <$> Remote.remoteList + let seeker = AnnexedFileSeeker + { startAction = const (start o isterminal computeremotes) + , checkContentPresent = Nothing + , usesLocationLog = True + } + withKeyOptions (keyOptions o) False seeker + (commandAction . startKeys o isterminal computeremotes) + (withFilesInGitAnnex ww seeker) + =<< workTreeItems ww (findThese o) + where + ww = WarnUnmatchLsFiles "findcomputed" + +start :: FindComputedOptions -> IsTerminal -> [Remote] -> SeekInput -> OsPath -> Key -> CommandStart +start o isterminal computeremotes _ file key = do + rs <- Remote.remotesWithUUID computeremotes + <$> Remote.keyLocations key + rcs <- catMaybes <$> forM rs get + if null rcs + then stop + else startingCustomOutput key $ do + forM_ rcs $ \(r, c) -> do + let computation = unwords (computeParams c) + let unformatted = fromOsPath file + <> " (" <> encodeBS (Remote.name r) + <> ") -- " + <> encodeBS computation + let formatvars = + [ ("remote", Remote.name r) + , ("computation", computation) + ] ++ formatVars key (AssociatedFile (Just file)) + showFormatted isterminal (formatOption o) + unformatted formatvars + next $ return True + where + get r = fmap (r, ) + <$> getComputeState (Remote.remoteStateHandle r) key + +startKeys :: FindComputedOptions -> IsTerminal -> [Remote] -> (SeekInput, Key, ActionItem) -> CommandStart +startKeys o isterminal computeremotes (si, key, ActionItemBranchFilePath (BranchFilePath _ topf) _) = + start o isterminal computeremotes si (getTopFilePath topf) key +startKeys _ _ _ _ = stop + diff --git a/Remote/List/Util.hs b/Remote/List/Util.hs index e022d23190..c251198067 100644 --- a/Remote/List/Util.hs +++ b/Remote/List/Util.hs @@ -55,8 +55,8 @@ remoteLocations' (IncludeIgnored ii) locations trusted rs = do -- remotes that match uuids that have the key allremotes <- if not ii - then filterM (not <$$> liftIO . getDynamicConfig . remoteAnnexIgnore . gitconfig) rs - else return rs + then filterM (not <$$> liftIO . getDynamicConfig . remoteAnnexIgnore . gitconfig) rs + else return rs let validremotes = remotesWithUUID allremotes locations return (sortBy (comparing cost) validremotes, validtrustedlocations) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 7b2ca0b86a..0b59650268 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -99,6 +99,8 @@ the parameters provided to `git-annex addcomputed`. [[git-annex-recompute]](1) +[[git-annex-findcomputed]](1) + [[git-annex-initremote]](1) # AUTHOR diff --git a/doc/git-annex-findcomputed.mdwn b/doc/git-annex-findcomputed.mdwn new file mode 100644 index 0000000000..a1c6cf1351 --- /dev/null +++ b/doc/git-annex-findcomputed.mdwn @@ -0,0 +1,75 @@ +# NAME + +git-annex findcomputed - lists computed files + +# SYNOPSIS + +git annex findcomputed `[path ...]` + +# DESCRIPTION + +Outputs a list of files in the specified path that can be computed by +enabled compute special remotes. With no path, lists files in the current +directory and its subdirectories. + +Along with the name of each computed file, this displays the input that +was provided to [[git-annex-addcomputed]](1). + +For example: + + # git-annex findcomputed + foo.png (imageconvert) -- convert file.raw file.jpeg passes=10 + bar.gz (compressor) -- compress bar --level=9 + +# OPTIONS + +* matching options + + The [[git-annex-matching-options]](1) + can be used to specify files to list. + +* `--branch=ref` + + List computed files in the specified branch or treeish. + +* `--format=value` (Diff truncated)
update
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 760e3e7ba5..36a17f461f 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -18,8 +18,15 @@ compute special remote. --[[Joey]] * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an url. Maybe a separate command? That would also allow querying for eg, - what files are inputs for another file. Or it could be exposed in the - Remote interface, and made into a file matching option. + what files are inputs for another file. + + Or it could be exposed in the + Remote interface, and made into a file matching option: + + git-annex find --inputof=foo + + But that would require running expensive find over the whole tree, + and wouldn't work if the input file is no longer in the tree. * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs @@ -43,3 +50,15 @@ compute special remote. --[[Joey]] Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. + +* Should checkPresent check that each input file is also present in some + (non-dead) repo? + + Currently it only checks if compute state is recorded. The problem + this additional checking would solve is if an input file gets lost, + then a computation cannot be run again. + + Should it be an active check against existing remotes, or a + passive check? An active check certainly makes sense if the input + file is itself present in a compute repo, either the same one or a + different one. Otherwise, a passive check seems enough.
--json for addcomputed and recompute
Not very useful, but it does work.
Not very useful, but it does work.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 2c389ef53a..4c1b68e111 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -36,7 +36,7 @@ import qualified Data.Map as M import Data.Time.Clock cmd :: Command -cmd = notBareRepo $ withAnnexOptions [backendOption] $ +cmd = notBareRepo $ withAnnexOptions [backendOption, jsonOptions] $ command "addcomputed" SectionCommon "add computed files to annex" (paramRepeating paramExpression) (seek <$$> optParser) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index df701fb852..5d9f93fde1 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -29,7 +29,7 @@ import qualified Data.Map as M import System.PosixCompat.Files (isSymbolicLink) cmd :: Command -cmd = notBareRepo $ +cmd = notBareRepo $ withAnnexOptions [jsonOptions] $ command "recompute" SectionCommon "recompute computed files" paramPaths (seek <$$> optParser) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index faff1d96b6..7b2ca0b86a 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -86,6 +86,11 @@ the parameters provided to `git-annex addcomputed`. Specifies which key-value backend to use. +* `--json` + + Enable JSON output. This is intended to be parsed by programs that use + git-annex. Each line of output is a JSON object. + * Also the [[git-annex-common-options]](1) can be used. # SEE ALSO diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index f10125827c..daf403471f 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -48,6 +48,11 @@ updated with the new content. The updated file is staged in git. This is the default when the compute remote indicates that it produces reproducible output. +* `--json` + + Enable JSON output. This is intended to be parsed by programs that use + git-annex. Each line of output is a JSON object. + * matching options The [[git-annex-matching-options]](1) can be used to control what
record fscked files in fsck db by default
Remember the files that are checked, so a later run with --more will
skip them, without needing to use --incremental.
Remember the files that are checked, so a later run with --more will
skip them, without needing to use --incremental.
diff --git a/CHANGELOG b/CHANGELOG index 8c944a4bfb..83df038ec3 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -10,6 +10,8 @@ git-annex (10.20250116) UNRELEASED; urgency=medium * Added OsPath build flag, which speeds up git-annex's operations on files. * git-lfs: Added an optional apiurl parameter. (This needs version 1.2.5 of the haskell git-lfs library to be used.) + * fsck: Remember the files that are checked, so a later run with --more + will skip them, without needing to use --incremental. -- Joey Hess <id@joeyh.name> Mon, 20 Jan 2025 10:24:51 -0400 diff --git a/Command/Fsck.hs b/Command/Fsck.hs index 4e66755c02..a6b6e54875 100644 --- a/Command/Fsck.hs +++ b/Command/Fsck.hs @@ -713,13 +713,12 @@ getStartTime u = do #endif data Incremental - = NonIncremental + = NonIncremental (Maybe FsckDb.FsckHandle) | ScheduleIncremental Duration UUID Incremental | StartIncremental FsckDb.FsckHandle | ContIncremental FsckDb.FsckHandle prepIncremental :: UUID -> Maybe IncrementalOpt -> Annex Incremental -prepIncremental _ Nothing = pure NonIncremental prepIncremental u (Just StartIncrementalO) = do recordStartTime u ifM (FsckDb.newPass u) @@ -734,6 +733,14 @@ prepIncremental u (Just (ScheduleIncrementalO delta)) = do Nothing -> StartIncrementalO Just _ -> MoreIncrementalO return (ScheduleIncremental delta u i) +prepIncremental u Nothing = + ifM (Annex.getRead Annex.fast) + -- Avoid recording fscked files in --fast mode, + -- since that can interfere with a non-fast incremental + -- fsck. + ( pure (NonIncremental Nothing) + , (NonIncremental . Just) <$> openFsckDb u + ) cleanupIncremental :: Incremental -> Annex () cleanupIncremental (ScheduleIncremental delta u i) = do @@ -757,6 +764,6 @@ openFsckDb u = do withFsckDb :: Incremental -> (FsckDb.FsckHandle -> Annex ()) -> Annex () withFsckDb (ContIncremental h) a = a h withFsckDb (StartIncremental h) a = a h -withFsckDb NonIncremental _ = noop +withFsckDb (NonIncremental mh) a = maybe noop a mh withFsckDb (ScheduleIncremental _ _ i) a = withFsckDb i a diff --git a/doc/git-annex-fsck.mdwn b/doc/git-annex-fsck.mdwn index 4083ba4bf1..89760119d8 100644 --- a/doc/git-annex-fsck.mdwn +++ b/doc/git-annex-fsck.mdwn @@ -37,17 +37,24 @@ better format. * `--incremental` - Start a new incremental fsck pass. An incremental fsck can be interrupted - at any time, with eg ctrl-c. + Start a new incremental fsck pass, clearing records of all files that + were checked in the previous incremental fsck pass. * `--more` - Resume the last incremental fsck pass, where it left off. + Skip files that were checked since the last incremental fsck pass + was started. + + Note that before `--incremental` is used to start an incremental fsck + pass, files that are checked are still recorded, and using this option + will skip checking those files again. Resuming may redundantly check some files that were checked before. Any files that fsck found problems with before will be re-checked on resume. Also, checkpoints are made every 1000 files or every 5 minutes - during a fsck, and it resumes from the last checkpoint. + during a fsck, and it resumes from the last checkpoint, so if an + incremental fsck is interrupted using eg ctrl-c, it will recheck files + that didn't get into the last checkpoint. * `--incremental-schedule=time` diff --git a/doc/todo/Incremental_fsck_by_default.mdwn b/doc/todo/Incremental_fsck_by_default.mdwn index f662549e63..169e02c6be 100644 --- a/doc/todo/Incremental_fsck_by_default.mdwn +++ b/doc/todo/Incremental_fsck_by_default.mdwn @@ -9,3 +9,6 @@ I actually don't see much reason to not make use of an incremental fsck either u On that note: There also does not appear to be a documented method to figure out whether a fsck was interrupted before. You could infer existence and date from the annex internal directory structure but seeing the progress requires manual sql. Perhaps there could be a `fsck --info` flag for showing both interrupted fsck progress and perhaps also the progress of the current fsck. + +> I've implemented the default recording to the fsck database. [[done]] +> --[[Joey]] diff --git a/doc/todo/Incremental_fsck_by_default/comment_1_5f35afc17e865899f72a62bff8ff30e9._comment b/doc/todo/Incremental_fsck_by_default/comment_1_5f35afc17e865899f72a62bff8ff30e9._comment new file mode 100644 index 0000000000..2cfeabf04c --- /dev/null +++ b/doc/todo/Incremental_fsck_by_default/comment_1_5f35afc17e865899f72a62bff8ff30e9._comment @@ -0,0 +1,35 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-17T18:34:20Z" + content=""" +I think it could make sense, when --incremental/--more are not passed, to +initialize a new fsck database if there is not already one, and +add each fscked key to the fsck database. + +That way, the user could run any combination of fscks, interrupted or not, +and then use --more to fsck only new files. When the user wants to start +a new fsck pass, they would use --incremental. + +It would need to avoid recording an incremental fsck pass start time, +to avoid interfering with --incremental-schedule. + +The only problem I see with this is, someone might have a long-term +incremental fsck they're running that is doing full checksumming. +If they then do a quick fsck --fast for other reasons, it would +record that every key has been fscked, and so lose their place. +So it seems --fast should disable this new behavior. (Also incremental +--fast fsck is not likely to be very useful anyway.) + +> I actually don't see much reason to not make use of an incremental fsck +> either unless it's *really* old + +That's a hard judgement call for a program to make... someone might think +10 minutes is really old, and someone else that a month is. + +As to figuring out whether a fsck was interrupted before, surely what +matters is you remembering that? All git-annex has is a timestamp when +the last fsck pass started, which is available in +`.git/annex/fsck/*/state`, and a list of the keys that were fscked, +which is not very useful as far as determining the progress of that fsck. +"""]]
decided to leave message as-is
"getting input <file> from <remote>" is talking about the original
input filename. I think that's ok.
"getting input <file> from <remote>" is talking about the original
input filename. I think that's ok.
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 1de0213bc8..760e3e7ba5 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -21,11 +21,6 @@ compute special remote. --[[Joey]] what files are inputs for another file. Or it could be exposed in the Remote interface, and made into a file matching option. -* "getting input from <file>" message uses the original filename, - but that file might have been renamed. Would be more clear to use - whatever file in the tree currently points to the key it's getting - (what if there is not one?) - * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs
decided addcomputed will not support annex.smallfiles
If it did, recompute would need to somehow support recomputing
non-annexed files.
And, annex.smallfiles is typically used for configuration files or
source code kind of things, where the user doesn't want it to be an
annexed file. Computed artifacts are not likely that kind of thing.
Also, git-annex importfeed is an example of something that does support
annex.addunlocked, but does not support annex.smallfiles.
If it did, recompute would need to somehow support recomputing
non-annexed files.
And, annex.smallfiles is typically used for configuration files or
source code kind of things, where the user doesn't want it to be an
annexed file. Computed artifacts are not likely that kind of thing.
Also, git-annex importfeed is an example of something that does support
annex.addunlocked, but does not support annex.smallfiles.
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index db31b873cf..1de0213bc8 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -48,9 +48,3 @@ compute special remote. --[[Joey]] Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. - -* Should addcomputed honor annex.smallfiles? That would seem to imply - that recompute should also support recomputing non-annexed files. - Otherwise, adding a file and then recomputing it would vary in - what the content of the file is, depending on annex.smallfiles setting. -
annex.addunlocked support for git-annex compute
And for git-annex recompute, add the file unlocked when the original is
unlocked.
And for git-annex recompute, add the file unlocked when the original is
unlocked.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 02d8826683..2c389ef53a 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -24,11 +24,13 @@ import Annex.UUID import Annex.GitShaKey import Types.KeySource import Types.Key +import Annex.FileMatcher import Messages.Progress import Logs.Location import Logs.EquivilantKeys import Utility.Metered import Backend.URL (fromUrl) +import Git.FilePath import qualified Data.Map as M import Data.Time.Clock @@ -73,20 +75,21 @@ seek o = startConcurrency commandStages (seek' o) seek' :: AddComputedOptions -> CommandSeek seek' o = do + addunlockedmatcher <- addUnlockedMatcher r <- getParsed (computeRemote o) unless (Remote.Compute.isComputeRemote r) $ giveup "That is not a compute remote." - commandAction $ start o r + commandAction $ start o r addunlockedmatcher -start :: AddComputedOptions -> Remote -> CommandStart -start o r = starting "addcomputed" ai si $ perform o r +start :: AddComputedOptions -> Remote -> AddUnlockedMatcher -> CommandStart +start o r = starting "addcomputed" ai si . perform o r where ai = ActionItemUUID (Remote.uuid r) (UnquotedString (Remote.name r)) si = SeekInput (computeParams o) -perform :: AddComputedOptions -> Remote -> CommandPerform -perform o r = do +perform :: AddComputedOptions -> Remote -> AddUnlockedMatcher -> CommandPerform +perform o r addunlockedmatcher = do program <- Remote.Compute.getComputeProgram r repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") @@ -102,8 +105,11 @@ perform o r = do (Remote.Compute.ImmutableState False) (getInputContent fast) Nothing - (addComputed (Just "adding") r (reproducible o) chooseBackend Just fast) + (go fast) next $ return True + where + go fast = addComputed (Just "adding") r (reproducible o) + chooseBackend Just fast (Right addunlockedmatcher) addComputed :: Maybe StringContainingQuotedPath @@ -112,11 +118,12 @@ addComputed -> (OsPath -> Annex Backend) -> (OsPath -> Maybe OsPath) -> Bool + -> Either Bool AddUnlockedMatcher -> Remote.Compute.ComputeProgramResult -> OsPath -> NominalDiffTime -> Annex () -addComputed maddaction r reproducibleconfig choosebackend destfile fast result tmpdir ts = do +addComputed maddaction r reproducibleconfig choosebackend destfile fast addunlockedmatcher result tmpdir ts = do when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." oks <- forM (M.keys outputs) $ \outputfile -> do @@ -163,19 +170,43 @@ addComputed maddaction r reproducibleconfig choosebackend destfile fast result t stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir </> outputfile - ld f = LockedDown ldc (ks f) - ks f = KeySource - { keyFilename = f - , contentLocation = outputfile' - , inodeCache = Nothing - } genkey f p = do backend <- choosebackend outputfile - fst <$> genKey (ks f) p backend - ingesthelper f p mk = - ingestwith $ do - k <- maybe (genkey f p) return mk - ingestAdd' p (Just (ld f)) (Just k) + let ks = KeySource + { keyFilename = f + , contentLocation = outputfile' + , inodeCache = Nothing + } + fst <$> genKey ks p backend + ingesthelper f p mk = ingestwith $ do + k <- maybe (genkey f p) return mk + topf <- inRepo $ toTopFilePath f + let fi = FileInfo + { contentFile = outputfile' + , matchFile = getTopFilePath topf + , matchKey = Just k + } + lockingfile <- case addunlockedmatcher of + Right addunlockedmatcher' -> + not <$> addUnlocked addunlockedmatcher' + (MatchingFile fi) + (not fast) + Left v -> pure v + let ldc = LockDownConfig + { lockingFile = lockingfile + , hardlinkFileTmpDir = Nothing + , checkWritePerms = True + } + liftIO $ createDirectoryIfMissing True $ + takeDirectory f + liftIO $ moveFile outputfile' f + let ks = KeySource + { keyFilename = f + , contentLocation = f + , inodeCache = Nothing + } + let ld = LockedDown ldc ks + ingestAdd' p (Just ld) (Just k) ingestwith a = a >>= \case Nothing -> giveup "ingestion failed" Just k -> do @@ -188,12 +219,6 @@ addComputed maddaction r reproducibleconfig choosebackend destfile fast result t =<< calcRepo (gitAnnexLocation k) return k - ldc = LockDownConfig - { lockingFile = True - , hardlinkFileTmpDir = Nothing - , checkWritePerms = True - } - isreproducible = case reproducibleconfig of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible result diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 82ed7ab37e..df701fb852 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -23,8 +23,10 @@ import Logs.Location import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) import Backend (maybeLookupBackendVariety, unknownBackendVarietyMessage, chooseBackend) import Types.Key +import qualified Utility.RawFilePath as R import qualified Data.Map as M +import System.PosixCompat.Files (isSymbolicLink) cmd :: Command cmd = notBareRepo $ @@ -126,19 +128,22 @@ perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.Compute perform o r file origkey origstate = do program <- Remote.Compute.getComputeProgram r reproducibleconfig <- getreproducibleconfig + originallocked <- liftIO $ isSymbolicLink + <$> R.getSymbolicLinkStatus (fromOsPath file) showOutput Remote.Compute.runComputeProgram program origstate (Remote.Compute.ImmutableState False) (getinputcontent program) Nothing - (go program reproducibleconfig) + (go program reproducibleconfig originallocked) next cleanup where - go program reproducibleconfig result tmpdir ts = do + go program reproducibleconfig originallocked result tmpdir ts = do checkbehaviorchange program (Remote.Compute.computeState result) addComputed Nothing r reproducibleconfig - choosebackend destfile False result tmpdir ts + choosebackend destfile False (Left originallocked) + result tmpdir ts checkbehaviorchange program state = do let check s w a b = forM_ (M.keys (w a)) $ \f -> diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index c6e5a64de6..db31b873cf 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -29,21 +29,6 @@ compute special remote. --[[Joey]] * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs -* addcomputed should honor annex.addunlocked. - - What about recompute? It seems it should either write the new version of - the file as an unlocked file when the old version was unlocked, or also - honor annex.addunlocked. - - Problem: Since recompute does not stage the file, it would have to write - the content to the working tree. And then the user would need to - git-annex add. But then, if the key was a VURL key, it would add it with - the default backend instead, and the file would no longer use a computed - key. (Diff truncated)
diff --git a/doc/todo/Incremental_fsck_by_default.mdwn b/doc/todo/Incremental_fsck_by_default.mdwn new file mode 100644 index 0000000000..f662549e63 --- /dev/null +++ b/doc/todo/Incremental_fsck_by_default.mdwn @@ -0,0 +1,11 @@ +Whenever I do an fsck, it's always annoyed me that you have to think of adding `--incremental` and then also think about whether an incremental fsck was started and interrupted before which would then require `--more` instead. + +Forgetting to add `--incremental` can leave you in a pickle when you later find out that you need to interrupt the fsck, losing all progress. + +I've found myself wondering whether there'd ever be a case where I'd not want an fsck to be resumeable. Could git-annex not just simply always store that information and leave it up to the next fsck execution to decide whether to use it or not? + +I actually don't see much reason to not make use of an incremental fsck either unless it's *really* old but I find this a lot more debatable than at least storing fsck state on each run. + +On that note: There also does not appear to be a documented method to figure out whether a fsck was interrupted before. You could infer existence and date from the annex internal directory structure but seeing the progress requires manual sql. + +Perhaps there could be a `fsck --info` flag for showing both interrupted fsck progress and perhaps also the progress of the current fsck.
diff --git a/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__.mdwn b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__.mdwn new file mode 100644 index 0000000000..29ceee21f5 --- /dev/null +++ b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__.mdwn @@ -0,0 +1,3 @@ +The man page is not too clear on this and I noticed that it's not possible to pass both flags at once. + +Does interrupting `fsck --more` lose the progress made since the initial incremental fsck?
diff --git a/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn b/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn new file mode 100644 index 0000000000..c73a886d46 --- /dev/null +++ b/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn @@ -0,0 +1,39 @@ +### Please describe the problem. + +When running an fsck, I just had git-annex tell me that required content was missing from a bunch of repos that comprise my cold storage which makes sense but it also listed dead repos in the listing. Those repos are still in the group and still have `groupwanted` as the required setting. + +Dead drives should never be considered requiring or wanting content, even if they're still configured as such. (Or holding content for that matter but I hope that part works.) + +### What steps will reproduce the problem? + +1. Have dead repos that require content +2. Have alive repos that require the same content (unsure if required) +3. `git annex fsck` + +### What version of git-annex are you using? On what operating system? + +``` +git-annex version: 10.20241202 +build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Servant Feeds Testsuite S3 WebDAV +dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: linux x86_64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 +local repository version: 10 +``` + +### Please provide any additional information below. + +[[!format sh """ +# If you can, paste a complete transcript of the problem occurring here. +# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log + + +# End of transcript or log. +"""]] + +### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) + +Best invention since sliced bread.
diff --git a/doc/users/msz.mdwn b/doc/users/msz.mdwn new file mode 100644 index 0000000000..1d4fd37e4b --- /dev/null +++ b/doc/users/msz.mdwn @@ -0,0 +1,4 @@ +Michał Szczepanik +[@doktorpanik@masto.ai](https://masto.ai/@doktorpanik) + +Postdoc @ [Psychoinformatics Group](https://psychoinformatics.de/), INM-7, Forschungszentrum Jülich
Added a comment
diff --git a/doc/todo/compute_special_remote/comment_23_11e2e14ab20856b005793e80e79d2382._comment b/doc/todo/compute_special_remote/comment_23_11e2e14ab20856b005793e80e79d2382._comment new file mode 100644 index 0000000000..513cce8436 --- /dev/null +++ b/doc/todo/compute_special_remote/comment_23_11e2e14ab20856b005793e80e79d2382._comment @@ -0,0 +1,12 @@ +[[!comment format=mdwn + username="msz" + avatar="http://cdn.libravatar.org/avatar/6e8b88e7c70d86f4cfd27d450958aed4" + subject="comment 23" + date="2025-03-12T19:44:23Z" + content=""" +@joey: + +> I do hope I'm not closing off the design space from such differences by dropping a compute special remote right into git-annex. But I also expect that having a standard and easy way for at least simple computations will lead to a lot of contributions as others use it. + +I think it's excellent to have something like this in git-annex. I didn't have the opportunity to try it out yet, but I am definitely looking forward to seeing how things can work in practice and comparing the implementations. +"""]]
add compute tip
diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 52d650068f..b0027c7419 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -26,6 +26,8 @@ program takes a dashed option, it can be provided after "--": # git-annex initremote myremote type=compute program=git-annex-compute-foo -- --level=9 +See [[tips/computing_annexed_files]] for more documentation. + ## compute programs To write programs used by the compute special remote, see the diff --git a/doc/tips/computing_annexed_files.mdwn b/doc/tips/computing_annexed_files.mdwn new file mode 100644 index 0000000000..8ca448d8cc --- /dev/null +++ b/doc/tips/computing_annexed_files.mdwn @@ -0,0 +1,233 @@ +Do you ever check in original versions of files to `git-annex`, but then +convert them in some way? Maybe you check in original photos from a camera, +but then change them to a more useful file format, or smaller resolution. +Or you clip a video file. Or you crunch some data to a result. + +If you check the computed file into `git-annex` too, and store it on +your remotes along with the original, that's a waste of disk space. +But it is so convenient to be able to `git-annex get` the computed file. + +The [[compute special remote|special_remotes/compute]] is the solution to +this. It "stores" the computed file by remembering how to compute it from +input files. When you `git-annex get` the computed file from it, it re-runs +the computation on the original input file to produced the computed file. + +[[!toc ]] + +## using the compute special remote + +There are many compute programs that each handle some type of computation, +and it's pretty easy to write your own compute program too. In this tip, +we'll use [[special_remotes/compute/git-annex-compute-imageconvert]], +which uses imagemagick to convert between image formats. + +To follow along, install that program in PATH (and remember to make it +executable!) and make sure you have +[imagemagick](https://www.imagemagick.org/) installed. + +First, initialize a compute remote: + + # git-annex initremote imageconvert type=compute program=git-annex-compute-imageconvert + +Now suppose you have a file `foo.jpeg`, and you want to add a computed +`foo.gif` to the git-annex repository. + + # git-annex addcomputed --to=imageconvert foo.jpeg foo.gif + +(The syntax of the `git-annex addcomputed` command will vary depending on the +program that a compute remote uses. Some may have multiple input files, or +multiple ouput files, or other options to control the computation. See +the documentation of each compute program for details.) + +Now you have `foo.gif` and can use it as usual, including copying it to +other remotes. But it's already "stored" in the imageconvert remote, +as a computation. So to free up space, you can drop it: + + # git-annex drop foo.gif + drop foo.gif ok + +By the way, you can also add a computed file to the repository +without bothering to compute it yet! Just use `--fast`: + + # git-annex addcomputed --fast --to=imageconvert bar.jpeg bar.gif + +Now suppose you're in another clone of this same repository, and you want +these gifs. + + # git-annex get foo.gif + get foo.gif (not available) + Maybe enable some of these special remotes (git annex enableremote ...): + 8332f7ad-d54e-435e-803b-138c1cfa7b71 -- imageconvert + failed + +With [[special_remotes/compute/git-annex-compute-imageconvert]] and +imagemagic installed, all you need to do is enable the special remote to +get the computed files from it: + + # git-annex enableremote imageconvert + # git-annex get foo.gif + get foo.gif (from imageconvert...) + (getting input foo.jpeg from origin...) + ok + +Notice that, when the input file is not present in the repository, getting +a file from a compute remote will first get the input file. + +That's the basics of using the compute special remote. + +## recomputation + +What happens if the input file `foo.gif` is changed to a new version? +Will getting `foo.jpeg` from the compute remote base it on the new version +too? No. `foo.gif` is stuck on the original version of the input file that +was used to compute it. + +But, it's easy to recompute the file with a new version of the input file. +Just `git-annex add` the new version of the input file, and then: + + # git-annex recompute foo.gif + recompute foo.gif (foo.jpeg changed) + ok + +You can use commands like `git diff` and `git status` to see the +change that made to `foo.gif`. + + # git status --short foo.gif + M foo.gif + +Now both the new and old versions of `foo.gif` are stored in the +imageconvert remote, and it can compute either as needed. + +## reproducibility + +You might be wondering, what happens if a computed file, such as `foo.gif` +isn't exactly the same identical file each time it's computed? For example, +what if there's a timestamp in there. + +The answer is that, by default, files computed by a compute special remote +are not required, or guaranteed to be bit-for-bit reproducible. One gif +converted from a jpeg is much like any other converted from the same jpeg. + +So git-annex usually treats all files computed in the same way from the +same input as interchangeable. (Unless the compute program indicates +that it produces reproducible files.) + +Sometimes though, it's important that a file be bit-for-bit reproducible. And +you can ask git-annex to enforce this for computed files. +There is a `--reproducible` option for this, which you can pass to +`git-annex addcomputed` or to `git-annex recompute`. + +Let's switch the computed `foo.gif` to a reproducible file: + + # git-annex recompute --original --reproducible foo.gif + recompute foo.gif + ok + +You can `git commit foo.gif` to store this change. + +But first, let's check if that computation actually *is* reproducible. +This is easy, just drop it and get it from the compute remote again: + + # git-annex drop foo.gif + drop foo.gif ok + # git-annex get foo.gif --from imageconvert + get foo.gif (from imageconvert...) + ok + +If it turned out that the computation was not reproducible, getting the +file would fail, like this: + + # git-annex get foo.gif --from imageconvert + get foo.gif (from imageconvert...) + Verification of content failed + +This is because a reproducible file uses a regular [[backend]], which +by default uses a hash to verify the content of the file. + +If it does turn out that a file that was expected to be reproducible isn't, +you can always convert it to an unreproducible file: + + # git-annex recompute --original --unreproducible foo.gif + recompute foo.gif + ok + +## writing your own compute programs + +There is a whole little protocol that compute programs use to +communicate with git-annex. It's all documented at +[[design/compute_special_remote_interface]]. + +But it's really easy to write simple ones, and you don't need to +dive into all the details to do it. Let's walk through the code +to [[special_remotes/compute/git-annex-compute-imageconvert]], +which at 14 lines, is about as simple as one can be. + + #!/bin/sh + +It's a shell script. + + set -e + +If it fails to read input from standard input, or if a command fails, it +will exit nonzero. + + if [ -z "$1" ] || [ -z "$2" ]; then + echo "Specify the input image file, followed by the output image file." >&2 + echo "Example: foo.jpg foo.gif" >&2 + exit 1 + fi + +It expects to be passed two parameters, which were "foo.jpeg" and "foo.gif" in +the examples above. And it outputs some usage to stderr otherwise. That is (Diff truncated)
recompute: stage new version of file in git
When writing doc/tips/computing_annexed_files.mdwn, I noticed
that a recompute --reproducible followed by a drop and a re-get did not
actually test if the file could be reproducible computed again.
Turns out that get and drop both operate on staged files. If there is an
unstaged modification in the work tree, that's ignored. Somewhat
surprisingly, other commands like info do operate on staged files. So
behavior is inconsistent, and fairly surprising really, when there are
unstaged modifications to files.
Probably this is rarely noticed because `git-annex add` is used to add a
new version of a file, and then it's staged. Or `git mv` is used to move
a file, rather than `mv` of a file over top of an existing file. So it's
uncommon to have an unstaged annexed file in a worktree.
It might be worth making things more consistent, but that's out of scope
for what I'm working on currently.
Also, I anticipate that supporting unlocked files with recompute will
require it to stage changes anyway.
So, make recompute stage the new version of the file.
I considered having recompute refuse to overwrite an existing staged
file. After all, whatever version was staged before will get lost when
the new version is staged over top of it. But, that's no different than
`git-annex addcomputed` being run with the name of an existing staged
file. Or `git-annex add` being run with a new file content when there is
an existing staged file. Or, for that matter, `git add` being ran with a
new content when there is an existing staged file.
When writing doc/tips/computing_annexed_files.mdwn, I noticed
that a recompute --reproducible followed by a drop and a re-get did not
actually test if the file could be reproducible computed again.
Turns out that get and drop both operate on staged files. If there is an
unstaged modification in the work tree, that's ignored. Somewhat
surprisingly, other commands like info do operate on staged files. So
behavior is inconsistent, and fairly surprising really, when there are
unstaged modifications to files.
Probably this is rarely noticed because `git-annex add` is used to add a
new version of a file, and then it's staged. Or `git mv` is used to move
a file, rather than `mv` of a file over top of an existing file. So it's
uncommon to have an unstaged annexed file in a worktree.
It might be worth making things more consistent, but that's out of scope
for what I'm working on currently.
Also, I anticipate that supporting unlocked files with recompute will
require it to stage changes anyway.
So, make recompute stage the new version of the file.
I considered having recompute refuse to overwrite an existing staged
file. After all, whatever version was staged before will get lost when
the new version is staged over top of it. But, that's no different than
`git-annex addcomputed` being run with the name of an existing staged
file. Or `git-annex add` being run with a new file content when there is
an existing staged file. Or, for that matter, `git add` being ran with a
new content when there is an existing staged file.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index dd6c310b06..02d8826683 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -102,12 +102,11 @@ perform o r = do (Remote.Compute.ImmutableState False) (getInputContent fast) Nothing - (addComputed (Just "adding") True r (reproducible o) chooseBackend Just fast) + (addComputed (Just "adding") r (reproducible o) chooseBackend Just fast) next $ return True addComputed :: Maybe StringContainingQuotedPath - -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Annex Backend) @@ -117,7 +116,7 @@ addComputed -> OsPath -> NominalDiffTime -> Annex () -addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fast result tmpdir ts = do +addComputed maddaction r reproducibleconfig choosebackend destfile fast result tmpdir ts = do when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." oks <- forM (M.keys outputs) $ \outputfile -> do @@ -148,9 +147,7 @@ addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fa | fast = do case destfile outputfile of Nothing -> noop - Just f - | stagefiles -> addSymlink f stateurlk Nothing - | otherwise -> makelink f stateurlk + Just f -> addSymlink f stateurlk Nothing return stateurlk | isreproducible = do sz <- liftIO $ getFileSize outputfile' @@ -175,16 +172,10 @@ addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fa genkey f p = do backend <- choosebackend outputfile fst <$> genKey (ks f) p backend - makelink f k = void $ makeLink f k Nothing - ingesthelper f p mk - | stagefiles = ingestwith $ do + ingesthelper f p mk = + ingestwith $ do k <- maybe (genkey f p) return mk ingestAdd' p (Just (ld f)) (Just k) - | otherwise = ingestwith $ do - k <- maybe (genkey f p) return mk - mk' <- fst <$> ingest p (Just (ld f)) (Just k) - maybe noop (makelink f) mk' - return mk' ingestwith a = a >>= \case Nothing -> giveup "ingestion failed" Just k -> do diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 17246d10e4..82ed7ab37e 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -137,7 +137,7 @@ perform o r file origkey origstate = do go program reproducibleconfig result tmpdir ts = do checkbehaviorchange program (Remote.Compute.computeState result) - addComputed Nothing False r reproducibleconfig + addComputed Nothing r reproducibleconfig choosebackend destfile False result tmpdir ts checkbehaviorchange program state = do diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index 498c85e26c..f10125827c 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -15,8 +15,7 @@ By default, this only recomputes files whose input files have changed. The new contents of the input files are used to re-run the computation. When the output of the computation is different, the computed file is -updated with the new content. The updated file is written to the worktree, -but is not staged, in order to avoid overwriting any staged changes. +updated with the new content. The updated file is staged in git. # OPTIONS diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 820b423199..c6e5a64de6 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,13 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* recompute should stage files in git. Otherwise, - `git-annex drop` after recompute --reproducible drops the staged - file, and `git-annex get` gets the staged file, and if it wasn't - actually reproducible, this is not apparent. - - This is blocking adding the tip. - * Support parallel get of input files. The design allows for this, but how much parallelism makes sense? Would it be possible to use the usual worker pool?
todo
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index f478c5d966..820b423199 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,7 +1,12 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* write a tip showing how to use this +* recompute should stage files in git. Otherwise, + `git-annex drop` after recompute --reproducible drops the staged + file, and `git-annex get` gets the staged file, and if it wasn't + actually reproducible, this is not apparent. + + This is blocking adding the tip. * Support parallel get of input files. The design allows for this, but how much parallelism makes sense? Would it be possible to use the @@ -10,7 +15,12 @@ compute special remote. --[[Joey]] * compute on input files in submodules * annex.diskreserve can be violated if getting a file computes it but also - some other output files, which get added to the annex. + some other output files, which get added to the annex. This can't be + avoided at addcomputed time, but when getting later from the compute + remote, it could check (but not when using VURL without size information) + +* annex.diskreserve can also be violated if computing a file gets source + files that are larger than the disk reserve. This could be checked. * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an
improve
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index f286a0b7cd..001b57a6d1 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -29,7 +29,7 @@ avoid exposing user input to the shell unprotected, or otherwise executing it. (Except when the program is explicitly running user input in some form of sandbox.) -## interface +## program parameters and environment Whatever values the user passes to `git-annex addcomputed` are passed to the program in `ARGV`, followed by any values that the user provided to
comment
diff --git a/doc/special_remotes/compute/comment_5_d3faa33c3876d6f4883cce19189b7928._comment b/doc/special_remotes/compute/comment_5_d3faa33c3876d6f4883cce19189b7928._comment new file mode 100644 index 0000000000..3a41e92045 --- /dev/null +++ b/doc/special_remotes/compute/comment_5_d3faa33c3876d6f4883cce19189b7928._comment @@ -0,0 +1,19 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: just thinking out loud""" + date="2025-03-11T16:42:46Z" + content=""" +> And there could be some generic "helper" (or a number of them) which would then provide desired CLI interfacing over arbitrary command + +Absolutely! + +You do need to use "--" before your own custom dashed options. + +And bear in mind that "field=value" parameters passed to initremote will +be passed on to the program. So you can have a generic helper +that is instantiated with a parameter like --command=, which then gets used +automatically when running addcompute: + + git-annex initremote foo type=compute program=git-annex-compute-generic-helper -- --command='convert {inputs} {outputs}' + git-annex addcomputed --to=foo -- -i foo.jpeg -o foo.gif +"""]]
buffer responses to compute programs in a TQueue
This avoids a potential problem where the program sends several INPUT
before reading responses, so flushing the respose to the pipe could
block. It's unlikely, but seemed worth making sure it can't happen.
This avoids a potential problem where the program sends several INPUT
before reading responses, so flushing the respose to the pipe could
block. It's unlikely, but seemed worth making sure it can't happen.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 0b27d135ba..2ef7844808 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -435,10 +435,12 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) showOutput starttime <- liftIO currentMonotonicTimestamp let startresult = ComputeProgramResult state False False False - result <- withmeterfile $ \meterfile -> bracket - (liftIO $ createProcess pr) - (liftIO . cleanupProcess) - (getinput tmpdir subdir startresult meterfile) + result <- withmeterfile $ \meterfile -> + bracket + (liftIO $ createProcess pr) + (liftIO . cleanupProcess) $ \p -> + withoutputv p $ + getinput tmpdir subdir startresult meterfile p endtime <- liftIO currentMonotonicTimestamp liftIO $ checkoutputs result subdir cont result subdir (calcduration starttime endtime) @@ -453,14 +455,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) , return tmpdir ) - getinput tmpdir subdir result meterfile p = + getinput tmpdir subdir result meterfile p outputv = liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> getinput tmpdir subdir result meterfile p + | null l -> getinput tmpdir subdir result meterfile p outputv | otherwise -> do fastDebug "Compute" ("< " ++ l) - result' <- parseoutput p tmpdir subdir result meterfile l - getinput tmpdir subdir result' meterfile p + result' <- parseoutput outputv tmpdir subdir result meterfile l + getinput tmpdir subdir result' meterfile p outputv Nothing -> do liftIO $ hClose (stdoutHandle p) liftIO $ hClose (stdinHandle p) @@ -468,19 +470,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) giveup $ program ++ " exited unsuccessfully" return result - sendresponse p s = do - fastDebug "Compute" ("> " ++ s) - liftIO $ hPutStrLn (stdinHandle p) s - liftIO $ hFlush (stdinHandle p) - - parseoutput p tmpdir subdir result meterfile l = case Proto.parseMessage l of - Just (ProcessInput f) -> handleinput f False p tmpdir subdir result - Just (ProcessInputRequired f) -> handleinput f True p tmpdir subdir result + parseoutput outputv tmpdir subdir result meterfile l = case Proto.parseMessage l of + Just (ProcessInput f) -> handleinput f False outputv tmpdir subdir result + Just (ProcessInputRequired f) -> handleinput f True outputv tmpdir subdir result Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" -- Modify filename so eg "-foo" becomes "./-foo" - sendresponse p $ toCommand' (File f) + sendresponse outputv $ toCommand' (File f) -- If the output file is in a subdirectory, make -- the directories so the compute program doesn't -- need to. @@ -508,7 +505,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Just ProcessSandbox -> do sandboxpath <- liftIO $ fromOsPath <$> relPathDirToFile subdir tmpdir - sendresponse p $ + sendresponse outputv $ if null sandboxpath then "." else sandboxpath @@ -516,7 +513,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" - handleinput f required p tmpdir subdir result = do + handleinput f required outputv tmpdir subdir result = do let f' = toOsPath f let knowninput = M.member f' (computeInputs (computeState result)) @@ -534,7 +531,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) mkrel $ pure obj Just (Left gitsha) -> mkrel $ populategitsha gitsha tmpdir - sendresponse p $ + sendresponse outputv $ maybe "" fromOsPath mp let result' = result { computeInputsUnavailable = @@ -630,6 +627,28 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Just sz -> progress $ BytesProcessed $ floor $ fromIntegral sz * percent / 100 + + withoutputv p a = do + outputv <- liftIO $ atomically newTQueue + let cleanup pid = do + atomically $ writeTQueue outputv Nothing + wait pid + bracket + (liftIO $ async $ sendoutput' p outputv) + (liftIO . cleanup) + (const $ a outputv) + + sendoutput' p outputv = + atomically (readTQueue outputv) >>= \case + Nothing -> return () + Just s -> do + liftIO $ hPutStrLn (stdinHandle p) s + liftIO $ hFlush (stdinHandle p) + sendoutput' p outputv + + sendresponse outputv s = do + fastDebug "Compute" ("> " ++ s) + liftIO $ atomically $ writeTQueue outputv (Just s) computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a computationBehaviorChangeError (ComputeProgram program) requestdesc p = diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index bba17b2300..f478c5d966 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,11 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* git-annex responds to each INPUT immediately, and flushes stdout. - This could cause problems if the program is sending several INPUT - first, before reading responses, as is documented it should do to allow - for parallel get of the input files. - * write a tip showing how to use this * Support parallel get of input files. The design allows for this,
close off newline injection attacks against compute special remote protocol
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 7ed6040ceb..0b27d135ba 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -56,6 +56,7 @@ import Utility.CopyFile import Types.Key import Backend import qualified Git +import qualified Utility.OsString as OS import qualified Utility.FileIO as F import qualified Utility.RawFilePath as R import qualified Utility.SimpleProtocol as Proto @@ -271,7 +272,9 @@ formatComputeState' mk st = renderQuery False $ concat parseComputeState :: Key -> B.ByteString -> Maybe ComputeState parseComputeState k b = let st = go emptycomputestate (parseQuery b) - in if st == emptycomputestate then Nothing else Just st + in if st == emptycomputestate || illegalComputeState st + then Nothing + else Just st where emptycomputestate = ComputeState { computeParams = mempty @@ -317,6 +320,20 @@ parseComputeState k b = _ -> Nothing in go c' rest +{- This is used to avoid ComputeStates that should never happen, + - but which could be injected into a repository by an attacker. -} +illegalComputeState :: ComputeState -> Bool +illegalComputeState st + -- The protocol is line-based, so filenames used in it cannot + -- contain newlines. + | any containsnewline (M.keys (computeInputs st)) = True + | any containsnewline (M.keys (computeOutputs st)) = True + -- Just in case. + | containsnewline (computeSubdir st) = True + | otherwise = False + where + containsnewline p = unsafeFromChar '\n' `OS.elem` p + {- A compute: url for a given output file of a computation. -} computeStateUrl :: Remote -> ComputeState -> OsPath -> URLString computeStateUrl r st p = diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index fab644f0e4..bba17b2300 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,10 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* prohibit using compute states where an input or output filename contains - a newline. The protocol doesn't allow this to happen usually, but an - attacker might try it in order to scramble the protocol. - * git-annex responds to each INPUT immediately, and flushes stdout. This could cause problems if the program is sending several INPUT first, before reading responses, as is documented it should do to allow
update
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index c13a4e6425..fab644f0e4 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,6 +1,10 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] +* prohibit using compute states where an input or output filename contains + a newline. The protocol doesn't allow this to happen usually, but an + attacker might try it in order to scramble the protocol. + * git-annex responds to each INPUT immediately, and flushes stdout. This could cause problems if the program is sending several INPUT first, before reading responses, as is documented it should do to allow @@ -12,12 +16,6 @@ compute special remote. --[[Joey]] but how much parallelism makes sense? Would it be possible to use the usual worker pool? -* Write some simple compute programs so we have something to start with. - - - convert between images eg jpeg to png - - run a command in a singularity container (that is one of the inputs) - - run a wasm binary (that is one of the inputs) - * compute on input files in submodules * annex.diskreserve can be violated if getting a file computes it but also
add INPUT-REQUIRED
Used by git-annex-compute-singularity to make addcomputed --fast work.
Also, simplified git-annex-compute-singularity; there is no need to hard
link the container into place. singularity does not care about the
extension of the container, so can just pass it the annex object file.
Used by git-annex-compute-singularity to make addcomputed --fast work.
Also, simplified git-annex-compute-singularity; there is no need to hard
link the container into place. singularity does not care about the
extension of the container, so can just pass it the annex object file.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 4774caae9b..50c0ee28f6 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -206,14 +206,14 @@ addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fa Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible result -getInputContent :: Bool -> OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath)) -getInputContent fast p = catKeyFile p >>= \case - Just inputkey -> getInputContent' fast inputkey filedesc +getInputContent :: Bool -> OsPath -> Bool -> Annex (Key, Maybe (Either Git.Sha OsPath)) +getInputContent fast p required = catKeyFile p >>= \case + Just inputkey -> getInputContent' fast inputkey required filedesc Nothing -> inRepo (Git.fileRef p) >>= \case Just fileref -> catObjectMetaData fileref >>= \case Just (sha, _, t) | t == Git.BlobObject -> - getInputContent' fast (gitShaKey sha) filedesc + getInputContent' fast (gitShaKey sha) required filedesc | otherwise -> badinput $ ", not a git " ++ decodeBS (Git.fmtObjectType t) Nothing -> notcheckedin @@ -223,9 +223,9 @@ getInputContent fast p = catKeyFile p >>= \case badinput s = giveup $ "The computation needs an input file " ++ s ++ ": " ++ fromOsPath p notcheckedin = badinput "that is not checked into the git repository" -getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe (Either Git.Sha OsPath)) -getInputContent' fast inputkey filedesc - | fast = return (inputkey, Nothing) +getInputContent' :: Bool -> Key -> Bool -> String -> Annex (Key, Maybe (Either Git.Sha OsPath)) +getInputContent' fast inputkey required filedesc + | fast && not required = return (inputkey, Nothing) | otherwise = case keyGitSha inputkey of Nothing -> ifM (inAnnex inputkey) ( do diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 6b21ce8ee7..b85f5d449d 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -152,14 +152,14 @@ perform o r file origkey origstate = do check "not outputting" Remote.Compute.computeOutputs origstate state - getinputcontent program p + getinputcontent program p required | originalOption o = case M.lookup p (Remote.Compute.computeInputs origstate) of - Just inputkey -> getInputContent' False inputkey + Just inputkey -> getInputContent' False inputkey required (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p - | otherwise = getInputContent False p + | otherwise = getInputContent False p required destfile outputfile | Just outputfile == origfile = Just file diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 7d21ddccdb..3adec4bc5b 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -201,17 +201,19 @@ programField = Accepted "program" data ProcessCommand = ProcessInput FilePath | ProcessOutput FilePath + | ProcessProgress PercentFloat | ProcessReproducible | ProcessSandbox - | ProcessProgress PercentFloat + | ProcessInputRequired FilePath deriving (Show, Eq) instance Proto.Receivable ProcessCommand where parseCommand "INPUT" = Proto.parse1 ProcessInput parseCommand "OUTPUT" = Proto.parse1 ProcessOutput + parseCommand "PROGRESS" = Proto.parse1 ProcessProgress parseCommand "REPRODUCIBLE" = Proto.parse0 ProcessReproducible parseCommand "SANDBOX" = Proto.parse0 ProcessSandbox - parseCommand "PROGRESS" = Proto.parse1 ProcessProgress + parseCommand "INPUT-REQUIRED" = Proto.parse1 ProcessInputRequired parseCommand _ = Proto.parseFail newtype PercentFloat = PercentFloat Float @@ -392,9 +394,10 @@ runComputeProgram :: ComputeProgram -> ComputeState -> ImmutableState - -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) - -- ^ get input file's content, or Nothing the input file's - -- content is not available + -> (OsPath -> Bool -> Annex (Key, Maybe (Either Git.Sha OsPath))) + -- ^ Get input file's content, or Nothing the input file's + -- content is not available. True is passed when the input content + -- is required even for addcomputed --fast. -> Maybe (Key, MeterUpdate) -- ^ update meter for this key -> (ComputeProgramResult -> OsPath -> NominalDiffTime -> Annex v) @@ -454,37 +457,8 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hFlush (stdinHandle p) parseoutput p tmpdir subdir result meterfile l = case Proto.parseMessage l of - Just (ProcessInput f) -> do - let f' = toOsPath f - let knowninput = M.member f' - (computeInputs (computeState result)) - checksafefile tmpdir subdir f' "input" - checkimmutable knowninput "inputting" f' $ do - (k, inputcontent) <- getinputcontent f' - let mkrel a = Just <$> - (a >>= liftIO . relPathDirToFile subdir) - mp <- case inputcontent of - Nothing -> pure Nothing - Just (Right obj) - | computeSandbox result -> - mkrel $ populatesandbox obj tmpdir - | otherwise -> - mkrel $ pure obj - Just (Left gitsha) -> - mkrel $ populategitsha gitsha tmpdir - sendresponse p $ - maybe "" fromOsPath mp - let result' = result - { computeInputsUnavailable = - isNothing mp || computeInputsUnavailable result - } - return $ if immutablestate - then result' - else modresultstate result' $ \s -> s - { computeInputs = - M.insert f' k - (computeInputs s) - } + Just (ProcessInput f) -> handleinput f False p tmpdir subdir result + Just (ProcessInputRequired f) -> handleinput f True p tmpdir subdir result Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" @@ -525,6 +499,38 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" + handleinput f required p tmpdir subdir result = do + let f' = toOsPath f + let knowninput = M.member f' + (computeInputs (computeState result)) + checksafefile tmpdir subdir f' "input" + checkimmutable knowninput "inputting" f' $ do + (k, inputcontent) <- getinputcontent f' required + let mkrel a = Just <$> + (a >>= liftIO . relPathDirToFile subdir) + mp <- case inputcontent of + Nothing -> pure Nothing + Just (Right obj) + | computeSandbox result -> + mkrel $ populatesandbox obj tmpdir + | otherwise -> + mkrel $ pure obj + Just (Left gitsha) -> + mkrel $ populategitsha gitsha tmpdir + sendresponse p $ + maybe "" fromOsPath mp + let result' = result + { computeInputsUnavailable = + isNothing mp || computeInputsUnavailable result + } + return $ if immutablestate + then result' + else modresultstate result' $ \s -> s + { computeInputs = + M.insert f' k + (computeInputs s) + } + modresultstate result f = result { computeState = f (computeState result) } @@ -630,7 +636,7 @@ computeKey rs (ComputeProgram program) k _af dest meterupdate vc = (Just (k, p)) (postcompute keyfile) - getinputcontent state f = + getinputcontent state f _required = case M.lookup f (computeInputs state) of Just inputkey -> case keyGitSha inputkey of Nothing -> diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 56bb90f14f..f286a0b7cd 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -73,12 +73,13 @@ If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, the program should exit. -When `git-annex addcomputed --fast` is being used to add a computation -to the git-annex repository without actually performing it, the -response to eaach `INPUT` will be an empty line rather than the path to -an input file. In that case, the program should proceed with the rest of -its output to stdout (eg `OUTPUT` and `REPRODUCIBLE`), but should not -perform any computation. (Diff truncated)
Added a comment: just thinking out loud
diff --git a/doc/special_remotes/compute/comment_4_6e02f138330b13adcfa8fbbce494205e._comment b/doc/special_remotes/compute/comment_4_6e02f138330b13adcfa8fbbce494205e._comment new file mode 100644 index 0000000000..d2b1145a4d --- /dev/null +++ b/doc/special_remotes/compute/comment_4_6e02f138330b13adcfa8fbbce494205e._comment @@ -0,0 +1,26 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="just thinking out loud" + date="2025-03-11T15:15:15Z" + content=""" +> it was more flexible to have a more freeform command line, which the compute program parses + +agree. And there could be some generic \"helper\" (or a number of them) which would then provide desired CLI interfacing over arbitrary command, smth like (mimicing [datalad-run](https://docs.datalad.org/en/stable/generated/man/datalad-run.html) interface here): + +``` +git-annex addcomputed --to=runcmd -i foo.jpeg -o foo.gif +``` + +as long as we can pass options like that or after `--`, e.g. + +``` +git-annex addcomputed --to=runcmd -- -i foo.jpeg -o foo.gif -- convert {inputs} {outputs}` +``` + +which would then +- ensure no stdout from `convert` +- follow the *compute special remote interface* to let git-annex know what inputs/outputs were + + +"""]]
reorg and expand security section
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 52b676c04e..56bb90f14f 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -12,23 +12,50 @@ a command like one of these: git-annex addcomputed --to=myremote -- compress in out --level=9 git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz +## security + +Security is very important here, because a user who enables a compute +special remote and runs `git pull` followed by `git-annex get` is running +the compute program with inputs under the control of anyone who has +commit access to the repository. + +The contents of input files should be assumed to be untrusted, and so +should the filenames of input and output files, as well as everything +else passed to the program in `ARGV` and the environment. + +The program should make sure that whatever user input is passed +to it can result in only safe and expected behavior. The program should +avoid exposing user input to the shell unprotected, or otherwise executing +it. (Except when the program is explicitly running user input in some form +of sandbox.) + +## interface + Whatever values the user passes to `git-annex addcomputed` are passed to the program in `ARGV`, followed by any values that the user provided to `git-annex initremote`. -For security, the program should avoid exposing user input to the shell -unprotected, or otherwise executing it. And when running a command, make -sure that whatever user input is passed to it can result in only safe and -expected behavior. - To simplify the program's option parsing, any value that the user provides that is in the form "foo=bar" will also result in an environment variable being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. The program is run in a temporary directory, which will be cleaned up after -it exits. Note that it may be run in a subdirectory of a temporary -directory. This is done when `git-annex addcomputed` was run in a subdirectory -of the git repository. +it exits. It may be run in a subdirectory of the temporary directory. This +is done when `git-annex addcomputed` was run in a subdirectory of the git +repository. + +Anything that the program outputs to stderr will be displayed to the user. +This stderr should be used for error messages, and possibly computation +output, but not for progress displays. + +If the program exits nonzero, nothing it computed will be stored in the +git-annex repository. + +## input files + +Before doing any computation, the program needs to communicate with +git-annex about what input files it needs, and what output files it will +generate. The content of any file in the repository can be an input to the computation. The program requests an input by writing a line to stdout: @@ -48,25 +75,26 @@ the program should exit. When `git-annex addcomputed --fast` is being used to add a computation to the git-annex repository without actually performing it, the -response to each "INPUT" will be an empty line rather than the path to +response to eaach `INPUT` will be an empty line rather than the path to an input file. In that case, the program should proceed with the rest of -its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not +its output to stdout (eg `OUTPUT` and `REPRODUCIBLE`), but should not perform any computation. +## output files + For each output file that it will compute, the program should write a -line to stdout: +line to stdout, indicating the name of the file that will be added to the +git-annex repository by `git-annex compute`. OUTPUT file.jpeg -Then it can read a line from stdin. This will be a sanitized version of the -output filename. It's important to use that sanitized version to avoid path -traversal attacks, as well as problems like filenames that look like -dashed options. If there is a path traversal attack, the program's stdin will -be closed without a path being written to it. - -The filename of the output file is both the filename in the program's -temporary directory that it should write to, and also the filename that will -be added to the git-annex repository by `git-annex compute`. +Then it should read a line from stdin, which is the path, in the program's +temporary directory, where it should write the output file. Often this will +be the same filename, but it also may be a sanitized version. It's +important to use that sanitized version to avoid path traversal attacks, as +well as problems like filenames that look like dashed options. +If there is a path traversal attack, the program's stdin will be closed +without a path being written to it. The program must write a regular file to the output file. Symlinks or other special files will not be accepted as output files. @@ -78,30 +106,34 @@ to somewhere else and renaming it at the end. But, if the program seeks around and writes out of order, it should write to a file somewhere else and rename it at the end. -The program can also output lines to stdout to indicate its current -progress: +## other messages - PROGRESS 50% +As well as `INPUT` and `OUTPUT` described above, there are some other +messages that the program can output. All of these are optional. -The program can optionally also output a "REPRODUCIBLE" line. That -indicates that the results of its computations are expected to be -bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if -the `--reproducible` option is set. +* `PROGRESS 50%` + + To indicate its current progress while performing the computation, + the program can output lines like this. This is not needed if the program + streams output to an output file. -The program can also output a "SANDBOX" line, and then read a line from -stdin that will be the path to the directory it should sandbox to (which -corresponds to the top of the git repository, so may be above its working -directory). Any "INPUT" lines that come after "SANDBOX" will have input -files be provided via paths that are inside the sandbox directory. Usually -that is done by making hard links, but it will fall back to copying annexed -files if the filesystem does not support hard links. +* `REPRODUCIBLE` + + This indicates that the results of the computation are expected to be + bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if + the `--reproducible` option is set. -Anything that the program outputs to stderr will be displayed to the user. -This stderr should be used for error messages, and possibly computation -output, but not for progress displays. +* `SANDBOX` -If the program exits nonzero, nothing it computed will be stored in the -git-annex repository. + After outputting this line, the program can read a line from stdin + that will be the path to the directory it should sandbox to (which + corresponds to the top of the git repository, so may be above its working + directory). Any `INPUT` lines that come after `SANDBOX` will have input + files be provided via paths that are inside the sandbox directory. Usually + that is done by making hard links, but it will fall back to copying annexed + files if the filesystem does not support hard links. + +## example An example `git-annex-compute-foo` shell script follows:
Added a comment
diff --git a/doc/special_remotes/compute/comment_3_d563e79fa8cb539bdf26a281824ad2ea._comment b/doc/special_remotes/compute/comment_3_d563e79fa8cb539bdf26a281824ad2ea._comment new file mode 100644 index 0000000000..985d62ebc6 --- /dev/null +++ b/doc/special_remotes/compute/comment_3_d563e79fa8cb539bdf26a281824ad2ea._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="comment 3" + date="2025-03-11T15:09:20Z" + content=""" +Thank you for the clarification -- I have missed that there is an \"entire\" [compute special remote interface](https://git-annex.branchable.com/design/compute_special_remote_interface/). **Cool!** +"""]]
expand
diff --git a/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn b/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn index 7613667cdc..48c9f1e052 100644 --- a/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn +++ b/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn @@ -68,3 +68,13 @@ documentation for details about these options. * `--no-compat` * `--fakeroot` + +For example, passing the --fakeroot option: + + git-annex addcomputed --to=singularity -- --fakeroot debian.sif foo bar -- baz -- sh -c 'cat foo bar > baz' + +Since singularity happens to also accept `--fakeroot=1` and +`--no-compat=1`, it's also possible to set these options by +default in initremote: + + git-annex initremote foo type=compute program=git-annex-compute-singularity passthrough=imageconvert.sif -- --fakeroot=1
response
diff --git a/doc/special_remotes/compute/comment_2_53493ef08c5cef81c6b6ae64afc47c07._comment b/doc/special_remotes/compute/comment_2_53493ef08c5cef81c6b6ae64afc47c07._comment new file mode 100644 index 0000000000..ef6def3f4e --- /dev/null +++ b/doc/special_remotes/compute/comment_2_53493ef08c5cef81c6b6ae64afc47c07._comment @@ -0,0 +1,13 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: Any way to annotate what are input files?""" + date="2025-03-10T20:42:26Z" + content=""" +git-annex does know what both the input and the output files are. +It learns this by running the compute program and seeing what INPUT and OUTPUT +lines it emits. + +I considered having some `--input=` option, but decided that it was more +flexible to have a more freeform command line, which the compute program +parses. +"""]]
added git-annex-compute-singularity
And implemented SANDBOX, which it needs.
And implemented SANDBOX, which it needs.
diff --git a/COPYRIGHT b/COPYRIGHT index 54a250abae..3ca3debd09 100644 --- a/COPYRIGHT +++ b/COPYRIGHT @@ -14,7 +14,7 @@ Files: doc/special_remotes/external/* Copyright: © 2013 Joey Hess <id@joeyh.name> License: GPL-3+ -Files: doc/special_remotes/compute/git-annex-compute-imageconvert doc/special_remotes/compute/git-annex-compute-wasmedge +Files: doc/special_remotes/compute/git-annex-compute-imageconvert doc/special_remotes/compute/git-annex-compute-wasmedge doc/special_remotes/compute/git-annex-compute-singularity Copyright: © 2025 Joey Hess <id@joeyh.name> License: GPL-3+ diff --git a/Remote/Compute.hs b/Remote/Compute.hs index be8429435c..7d21ddccdb 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -52,6 +52,7 @@ import Utility.Env import Utility.Tmp.Dir import Utility.Url import Utility.MonotonicClock +import Utility.CopyFile import Types.Key import Backend import qualified Git @@ -201,6 +202,7 @@ data ProcessCommand = ProcessInput FilePath | ProcessOutput FilePath | ProcessReproducible + | ProcessSandbox | ProcessProgress PercentFloat deriving (Show, Eq) @@ -208,6 +210,7 @@ instance Proto.Receivable ProcessCommand where parseCommand "INPUT" = Proto.parse1 ProcessInput parseCommand "OUTPUT" = Proto.parse1 ProcessOutput parseCommand "REPRODUCIBLE" = Proto.parse0 ProcessReproducible + parseCommand "SANDBOX" = Proto.parse0 ProcessSandbox parseCommand "PROGRESS" = Proto.parse1 ProcessProgress parseCommand _ = Proto.parseFail @@ -382,6 +385,7 @@ data ComputeProgramResult = ComputeProgramResult { computeState :: ComputeState , computeInputsUnavailable :: Bool , computeReproducible :: Bool + , computeSandbox :: Bool } runComputeProgram @@ -410,7 +414,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) } showOutput starttime <- liftIO currentMonotonicTimestamp - let startresult = ComputeProgramResult state False False + let startresult = ComputeProgramResult state False False False result <- withmeterfile $ \meterfile -> bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) @@ -457,13 +461,17 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) checksafefile tmpdir subdir f' "input" checkimmutable knowninput "inputting" f' $ do (k, inputcontent) <- getinputcontent f' + let mkrel a = Just <$> + (a >>= liftIO . relPathDirToFile subdir) mp <- case inputcontent of Nothing -> pure Nothing - Just (Right f'') -> liftIO $ - Just <$> relPathDirToFile subdir f'' - Just (Left gitsha) -> - Just <$> (liftIO . relPathDirToFile subdir - =<< populategitsha gitsha tmpdir) + Just (Right obj) + | computeSandbox result -> + mkrel $ populatesandbox obj tmpdir + | otherwise -> + mkrel $ pure obj + Just (Left gitsha) -> + mkrel $ populategitsha gitsha tmpdir sendresponse p $ maybe "" fromOsPath mp let result' = result @@ -506,6 +514,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) return result Just ProcessReproducible -> return $ result { computeReproducible = True } + Just ProcessSandbox -> do + sandboxpath <- liftIO $ fromOsPath <$> + relPathDirToFile subdir tmpdir + sendresponse p $ + if null sandboxpath + then "." + else sandboxpath + return $ result { computeSandbox = True } Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" @@ -546,12 +562,23 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) -- to the program as a parameter, which could parse it as a dashed -- option or other special parameter. populategitsha gitsha tmpdir = do - let f = tmpdir </> literalOsPath ".git" </> literalOsPath "objects" + let f = tmpdir </> literalOsPath ".git" + </> literalOsPath "objects" </> toOsPath (Git.fromRef' gitsha) liftIO $ createDirectoryIfMissing True $ takeDirectory f liftIO . F.writeFile f =<< catObject gitsha return f + populatesandbox annexobj tmpdir = do + let f = tmpdir </> literalOsPath ".git" + </> literalOsPath "annex" + </> literalOsPath "objects" + </> takeFileName annexobj + liftIO $ createDirectoryIfMissing True $ takeDirectory f + liftIO $ unlessM (createLinkOrCopy annexobj f) $ + giveup "Unable to populate compute sandbox directory" + return f + withmeterfile a = case meterkey of Nothing -> a (const noop) Just (_, progress) -> do diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 0ab7c45df4..52b676c04e 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -88,6 +88,14 @@ indicates that the results of its computations are expected to be bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if the `--reproducible` option is set. +The program can also output a "SANDBOX" line, and then read a line from +stdin that will be the path to the directory it should sandbox to (which +corresponds to the top of the git repository, so may be above its working +directory). Any "INPUT" lines that come after "SANDBOX" will have input +files be provided via paths that are inside the sandbox directory. Usually +that is done by making hard links, but it will fall back to copying annexed +files if the filesystem does not support hard links. + Anything that the program outputs to stderr will be displayed to the user. This stderr should be used for error messages, and possibly computation output, but not for progress displays. diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 33b1253978..52d650068f 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -39,6 +39,13 @@ List it here with an example! `git-annex addcomputed --to=imageconvert foo.jpeg foo.gif` +* [[compute/git-annex-compute-singularity]] + Uses [Singularity](https://sylabs.io/) to run a container, which is + checked into the git-annex repository, to compute other files in the + repository. Amoung other things, this can run other compute programs + inside a singularity container. + [[Examples here|compute/git-annex-compute-singularity-examples]] + * [[compute/git-annex-compute-wasmedge]] Uses [WasmEdge](https://WasmEdge.org/) to run WASM programs that are checked into the git-annex repository, to compute other files in the diff --git a/doc/special_remotes/compute/git-annex-compute-singularity b/doc/special_remotes/compute/git-annex-compute-singularity new file mode 100755 index 0000000000..d296e0162d --- /dev/null +++ b/doc/special_remotes/compute/git-annex-compute-singularity @@ -0,0 +1,94 @@ +#!/bin/bash +# git-annex compute remote program that runs singularity containers +# from the git-annex repository. +# +# Copyright 2025 Joey Hess; licenced under the GNU GPL version 3 or higher. +set -e + +if [ -z "$1" ]; then + echo "Usage: container [singularity options] [inputs] -- [outputs] -- [command params]" >&2 + exit 1 +fi + +nocompat_opt="" +fakeroot_opt="" +container="" +binddir="`pwd`" +rundir="`pwd`" + +run_singularity () { + # Network access is disabled (with --net --network=none), to + # prevent an untrusted singularity image from phoning home and/or + # attacking the local network. + # + # --oci is used to get process namespacing + singularity run --net --network=none --oci \ + --bind="$binddir" --pwd="$rundir" \ + $nocompat_opt $fakeroot_opt \ + "$container" "$@" +} + +# Avoid any security problems with harmful terminal escape sequences. +strip_escape () { + sed 's/[\x1B]//g' +} + +if [ -z "$ANNEX_COMPUTE_passthrough" ]; then (Diff truncated)
document output files must be regular files
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index e6fad0f2b1..0ab7c45df4 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -65,8 +65,11 @@ dashed options. If there is a path traversal attack, the program's stdin will be closed without a path being written to it. The filename of the output file is both the filename in the program's -temporary directory, and also the filename that will be added to the -git-annex repository by `git-annex compute`. +temporary directory that it should write to, and also the filename that will +be added to the git-annex repository by `git-annex compute`. + +The program must write a regular file to the output file. Symlinks +or other special files will not be accepted as output files. If git-annex sees that an output file is growing, it will use its file size when displaying progress to the user. So if possible, the program should
make usage an error
diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge b/doc/special_remotes/compute/git-annex-compute-wasmedge index b93adb9370..51d7f3d40d 100755 --- a/doc/special_remotes/compute/git-annex-compute-wasmedge +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge @@ -6,8 +6,9 @@ set -e if [ -z "$1" ]; then - echo "Usage: file.wasm [inputs] -- [outputs] -- [options]" - echo "Example: concat.wasm foo bar -- baz --" + echo "Usage: file.wasm [inputs] -- [outputs] -- [options]" >&2 + echo "Example: concat.wasm foo bar -- baz --" >&2 + exit 1 fi stage=1
Added a comment
diff --git a/doc/forum/Is_there_a_way_to_have_assistant_add_files_locked__63__/comment_10_4f9090f9d4b2152c250cd64e9c7afeb4._comment b/doc/forum/Is_there_a_way_to_have_assistant_add_files_locked__63__/comment_10_4f9090f9d4b2152c250cd64e9c7afeb4._comment new file mode 100644 index 0000000000..2e6d34b204 --- /dev/null +++ b/doc/forum/Is_there_a_way_to_have_assistant_add_files_locked__63__/comment_10_4f9090f9d4b2152c250cd64e9c7afeb4._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="comment 10" + date="2025-03-09T01:02:54Z" + content=""" +given the passed time, indeed it might be not as pressing of an issue, but indeed might be nice to have for ReproNim and beyond ;) +"""]]
Added a comment: Any way to annotate what are input files?
diff --git a/doc/special_remotes/compute/comment_1_db995549136250c8dd6ea459cc7f0080._comment b/doc/special_remotes/compute/comment_1_db995549136250c8dd6ea459cc7f0080._comment new file mode 100644 index 0000000000..752cb157e0 --- /dev/null +++ b/doc/special_remotes/compute/comment_1_db995549136250c8dd6ea459cc7f0080._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="Any way to annotate what are input files?" + date="2025-03-08T14:51:20Z" + content=""" +I don't see an option to specify which annexed files are input files, so annex could get them for comparing to happen to produce output file. That's what we do in datalad run, and it is very handy since allows to not worry about figuring out what to get first +"""]]
symlink, don't hardlink
hardlink can cause problems with unlocked files
hardlink can cause problems with unlocked files
diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge b/doc/special_remotes/compute/git-annex-compute-wasmedge index a734418125..b93adb9370 100755 --- a/doc/special_remotes/compute/git-annex-compute-wasmedge +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge @@ -23,7 +23,7 @@ while [ -n "$1" ]; do if [ -n "$input" ]; then p="./$1" mkdir -p "$(dirname "$p")" - ln $(realpath "$input") "$p" + ln -s $(realpath "$input") "$p" if [ -z "$wasm" ]; then wasm="$p" fi
disconnect stdio for wasm binaries
diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge b/doc/special_remotes/compute/git-annex-compute-wasmedge index c20f7824a8..a734418125 100755 --- a/doc/special_remotes/compute/git-annex-compute-wasmedge +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge @@ -43,5 +43,10 @@ if [ -n "$wasm" ]; then # Use --force-interpreter to avoid wasmedge running AOT native # instructions, which is insecure if the WASM binary comes from # an untrusted source. - wasmedge --dir "/:`pwd`" --force-interpreter -- "$wasm" "$@" >&2 + # + # Avoid displaying output from the WASM binary, since it is + # untrusted, and could contain harmful terminal escape sequences, + # for example. + wasmedge --dir "/:`pwd`" --force-interpreter -- "$wasm" "$@" \ + >/dev/null 2>/dev/null </dev/null fi
use pwd and quote it
Seems more portable and safe
Seems more portable and safe
diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge b/doc/special_remotes/compute/git-annex-compute-wasmedge index 0c361c6230..c20f7824a8 100755 --- a/doc/special_remotes/compute/git-annex-compute-wasmedge +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge @@ -43,5 +43,5 @@ if [ -n "$wasm" ]; then # Use --force-interpreter to avoid wasmedge running AOT native # instructions, which is insecure if the WASM binary comes from # an untrusted source. - wasmedge --dir /:$PWD --force-interpreter -- "$wasm" "$@" >&2 + wasmedge --dir "/:`pwd`" --force-interpreter -- "$wasm" "$@" >&2 fi
case
diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 6c554cd761..33b1253978 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -40,6 +40,6 @@ List it here with an example! `git-annex addcomputed --to=imageconvert foo.jpeg foo.gif` * [[compute/git-annex-compute-wasmedge]] - Uses [wasmedge](https://WasmEdge.org/) to run WASM programs that are + Uses [WasmEdge](https://WasmEdge.org/) to run WASM programs that are checked into the git-annex repository, to compute other files in the repository. [[Examples here|compute/git-annex-compute-wasmedge-examples]]
layout
diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index a6828b9791..6c554cd761 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -42,6 +42,4 @@ List it here with an example! * [[compute/git-annex-compute-wasmedge]] Uses [wasmedge](https://WasmEdge.org/) to run WASM programs that are checked into the git-annex repository, to compute other files in the - repository. - - [[examples|compute/git-annex-compute-wasmedge-examples]] + repository. [[Examples here|compute/git-annex-compute-wasmedge-examples]]
layout
diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index c85b1a9625..a6828b9791 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -34,12 +34,12 @@ To write programs used by the compute special remote, see the Have you written a generally useful (and secure) compute program? List it here with an example! -* [[compute/git-annex-compute-imageconvert]] -- +* [[compute/git-annex-compute-imageconvert]] Uses imagemagick to convert between image formats `git-annex addcomputed --to=imageconvert foo.jpeg foo.gif` -* [[compute/git-annex-compute-wasmedge]] +* [[compute/git-annex-compute-wasmedge]] Uses [wasmedge](https://WasmEdge.org/) to run WASM programs that are checked into the git-annex repository, to compute other files in the repository.
add git-annex-compute-wasmedge
diff --git a/COPYRIGHT b/COPYRIGHT index 7dfe659c6a..54a250abae 100644 --- a/COPYRIGHT +++ b/COPYRIGHT @@ -14,7 +14,7 @@ Files: doc/special_remotes/external/* Copyright: © 2013 Joey Hess <id@joeyh.name> License: GPL-3+ -Files: doc/special_remotes/compute/git-annex-compute-imageconvert +Files: doc/special_remotes/compute/git-annex-compute-imageconvert doc/special_remotes/compute/git-annex-compute-wasmedge Copyright: © 2025 Joey Hess <id@joeyh.name> License: GPL-3+ diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 36bda5a62b..c85b1a9625 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -38,3 +38,10 @@ List it here with an example! Uses imagemagick to convert between image formats `git-annex addcomputed --to=imageconvert foo.jpeg foo.gif` + +* [[compute/git-annex-compute-wasmedge]] + Uses [wasmedge](https://WasmEdge.org/) to run WASM programs that are + checked into the git-annex repository, to compute other files in the + repository. + + [[examples|compute/git-annex-compute-wasmedge-examples]] diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge b/doc/special_remotes/compute/git-annex-compute-wasmedge new file mode 100755 index 0000000000..0c361c6230 --- /dev/null +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge @@ -0,0 +1,47 @@ +#!/bin/sh +# git-annex compute remote program that uses wasmedge to run WASM binaries +# from the git-annex repository. +# +# Copyright 2025 Joey Hess; licenced under the GNU GPL version 3 or higher. +set -e + +if [ -z "$1" ]; then + echo "Usage: file.wasm [inputs] -- [outputs] -- [options]" + echo "Example: concat.wasm foo bar -- baz --" +fi + +stage=1 +wasm="" +while [ -n "$1" ]; do + if [ "$1" = "--" ]; then + stage=$((stage+1)) + shift 1 + else + if [ "$stage" = 1 ]; then + echo "INPUT $1" + read input + if [ -n "$input" ]; then + p="./$1" + mkdir -p "$(dirname "$p")" + ln $(realpath "$input") "$p" + if [ -z "$wasm" ]; then + wasm="$p" + fi + fi + shift 1 + elif [ "$stage" = 2 ]; then + echo "OUTPUT $1" + read output + shift 1 + else + break + fi + fi +done + +if [ -n "$wasm" ]; then + # Use --force-interpreter to avoid wasmedge running AOT native + # instructions, which is insecure if the WASM binary comes from + # an untrusted source. + wasmedge --dir /:$PWD --force-interpreter -- "$wasm" "$@" >&2 +fi diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge-examples.mdwn b/doc/special_remotes/compute/git-annex-compute-wasmedge-examples.mdwn new file mode 100644 index 0000000000..c483ed5614 --- /dev/null +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge-examples.mdwn @@ -0,0 +1,52 @@ +[[git-annex-compute-wasmedge]] uses [WasmEdge](https://wasmedge.org/) +to run WASM programs, that are checked into the git-annex repository, +to [[compute]] other files in the repository. + +The WASM programs are limited to sandboxed file IO, and cannot access the +network. + +The first parameter is the WASM file to run. It is followed by any other +input files that it should have access to. Then by "--", and a list of all +output files that the program computes. Finally, there can be another "--" +that is followed by any ARGV to pass to the WASM program. + +An example is: + + git-annex initremote wasmedge type=compute program=git-annex-compute-wasmedge + git-annex addcomputed --to=wasmedge -- concat.wasm foo bar -- baz -- baz + +To use that, you will need to write a concat.wasm program that combines +together files foo and bar, and writes the result to a file named baz. + +---- + +Here's another example, using an existing WASM build of python, from +this article <https://wasmlabs.dev/articles/python-wasm32-wasi/>. + +Download it and add it to your git-annex repository: + + wget https://github.com/vmware-labs/webassembly-language-runtimes/releases/download/python%2F3.11.1%2B20230127-c8036b4/python-aio-3.11.1.zip + unzip python-aio-3.11.1.zip + rm python-aio-3.11.1.zip bin/python-3.11.1-wasmedge.wasm + + git-annex add bin/python-3.11.1.wasm usr/local/lib/python3.11* + +Notice that the wasm binary needs a few other files that constitute the +python runtime. Those files have to be provided as inputs when using it. + +Here's how to use it to compute a file: + + git-annex initremote wasmedge type=compute program=git-annex-compute-wasmedge + git-annex addcomputed --fast --to wasmedge -- bin/python-3.11.1.wasm \ + usr/local/lib/python311.zip usr/local/lib/python3.11/lib-dynload/.empty \ + usr/local/lib/python3.11/os.py \ + -- foo -- -c 'with open("foo", "w") as f: f.write("hello wasm world\n")' + +Of course, you can replace the python code with something more interesting. +Add additional input files, read them, do whatever. + +While this makes a nice easy example, python built this way is quite slow, and +it would be hard to use other python libraries with it. +It would probably be better to use +<https://wasmer.io/posts/py2wasm-a-python-to-wasm-compiler> +to convert a python program into a WASM binary.
redirect command stdout to stderr
Otherwise it will be interpreted as compute program protocol
Otherwise it will be interpreted as compute program protocol
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 93da38147b..e6fad0f2b1 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -110,5 +110,5 @@ An example `git-annex-compute-foo` shell script follows: echo REPRODUCIBLE if [ -n "$input" ]; then - frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$output" + frobnicate --passes="$ANNEX_COMPUTE_passes" -i "$input" -o "$output" >&2 fi diff --git a/doc/special_remotes/compute/git-annex-compute-imageconvert b/doc/special_remotes/compute/git-annex-compute-imageconvert index 3ecd4c0790..bdcb665ea1 100755 --- a/doc/special_remotes/compute/git-annex-compute-imageconvert +++ b/doc/special_remotes/compute/git-annex-compute-imageconvert @@ -17,5 +17,5 @@ echo "OUTPUT $2" read output if [ -n "$input" ]; then - convert "$input" "$output" + convert "$input" "$output" >&2 fi
make OUTPUT subdirs
Simplifies compute programs.
Simplifies compute programs.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index d4bcd3359a..c41c1b91dc 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -476,6 +476,11 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hPutStrLn (stdinHandle p) $ toCommand' (File f) liftIO $ hFlush (stdinHandle p) + -- If the output file is in a subdirectory, make + -- the directories so the compute program doesn't + -- need to. + liftIO $ createDirectoryIfMissing True $ + takeDirectory (subdir </> f') knownoutput <- case M.lookup f' (computeOutputs $ computeState result) of Nothing -> return False Just mk -> do diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 5a55e2fe2c..93da38147b 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -110,6 +110,5 @@ An example `git-annex-compute-foo` shell script follows: echo REPRODUCIBLE if [ -n "$input" ]; then - mkdir -p "$(dirname "$output")" frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$output" fi diff --git a/doc/special_remotes/compute/git-annex-compute-imageconvert b/doc/special_remotes/compute/git-annex-compute-imageconvert index 16eb14da07..3ecd4c0790 100755 --- a/doc/special_remotes/compute/git-annex-compute-imageconvert +++ b/doc/special_remotes/compute/git-annex-compute-imageconvert @@ -17,6 +17,5 @@ echo "OUTPUT $2" read output if [ -n "$input" ]; then - mkdir -p "$(dirname "$output")" convert "$input" "$output" fi
compute: add response to OUTPUT
This allows rejecting output filenames that are outside the repository,
and also handles converting eg "-foo" to "./-foo" to prevent a command
that it's passed to interpreting the output filename as a dashed option.
This allows rejecting output filenames that are outside the repository,
and also handles converting eg "-foo" to "./-foo" to prevent a command
that it's passed to interpreting the output filename as a dashed option.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index b6ec907bda..d4bcd3359a 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -472,6 +472,10 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" + -- Modify filename so eg "-foo" becomes "./-foo" + liftIO $ hPutStrLn (stdinHandle p) $ + toCommand' (File f) + liftIO $ hFlush (stdinHandle p) knownoutput <- case M.lookup f' (computeOutputs $ computeState result) of Nothing -> return False Just mk -> do diff --git a/Utility/SafeCommand.hs b/Utility/SafeCommand.hs index 6f9419cd8c..aa96bda0ea 100644 --- a/Utility/SafeCommand.hs +++ b/Utility/SafeCommand.hs @@ -10,6 +10,7 @@ module Utility.SafeCommand ( CommandParam(..), toCommand, + toCommand', boolSystem, boolSystem', boolSystemEnv, diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 2c97946a81..5a55e2fe2c 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -16,15 +16,15 @@ Whatever values the user passes to `git-annex addcomputed` are passed to the program in `ARGV`, followed by any values that the user provided to `git-annex initremote`. -To simplify the program's option parsing, any value that the user provides -that is in the form "foo=bar" will also result in an environment variable -being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. - For security, the program should avoid exposing user input to the shell unprotected, or otherwise executing it. And when running a command, make sure that whatever user input is passed to it can result in only safe and expected behavior. +To simplify the program's option parsing, any value that the user provides +that is in the form "foo=bar" will also result in an environment variable +being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. + The program is run in a temporary directory, which will be cleaned up after it exits. Note that it may be run in a subdirectory of a temporary directory. This is done when `git-annex addcomputed` was run in a subdirectory @@ -58,6 +58,12 @@ line to stdout: OUTPUT file.jpeg +Then it can read a line from stdin. This will be a sanitized version of the +output filename. It's important to use that sanitized version to avoid path +traversal attacks, as well as problems like filenames that look like +dashed options. If there is a path traversal attack, the program's stdin will +be closed without a path being written to it. + The filename of the output file is both the filename in the program's temporary directory, and also the filename that will be added to the git-annex repository by `git-annex compute`. @@ -100,10 +106,9 @@ An example `git-annex-compute-foo` shell script follows: echo "INPUT $2" read input echo "OUTPUT $3" - # Prefixing with ./ makes sure that the output is treated as a - # filename, rather than a dashed option. - output="./$3" + read output echo REPRODUCIBLE + if [ -n "$input" ]; then mkdir -p "$(dirname "$output")" frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$output" diff --git a/doc/special_remotes/compute/git-annex-compute-imageconvert b/doc/special_remotes/compute/git-annex-compute-imageconvert index 8edc6b8f24..16eb14da07 100755 --- a/doc/special_remotes/compute/git-annex-compute-imageconvert +++ b/doc/special_remotes/compute/git-annex-compute-imageconvert @@ -1,5 +1,5 @@ #!/bin/sh -# git-annex compute special remote program that uses imagemagic's convert +# git-annex compute remote program that uses imagemagic's convert # to convert one type of image format into another. Eg, jpeg to gif. # # Copyright 2025 Joey Hess; licenced under the GNU GPL version 3 or higher. @@ -14,10 +14,7 @@ fi echo "INPUT $1" read input echo "OUTPUT $2" - -# Prefixing with ./ makes sure that the output is treated as a -# filename, rather than a dashed option. -output="./$2" +read output if [ -n "$input" ]; then mkdir -p "$(dirname "$output")"
remove todo I just added
If a compute program does this, it has a security hole. Not git-annex.
If a compute program does this, it has a security hole. Not git-annex.
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index b3e7b4723e..bb522398a4 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,11 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* **important** output of the program can contain escape sequences, - which could be a security problem. Eg, it could rebind a key to run a - command. So it seems the output needs to be passed through something to - sanitize it. - * write a tip showing how to use this * Write some simple compute programs so we have something to start with.
todo
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index bb522398a4..b3e7b4723e 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,6 +1,11 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] +* **important** output of the program can contain escape sequences, + which could be a security problem. Eg, it could rebind a key to run a + command. So it seems the output needs to be passed through something to + sanitize it. + * write a tip showing how to use this * Write some simple compute programs so we have something to start with.
diff --git a/doc/forum/filter-process__58___git-annex_command_not_found.mdwn b/doc/forum/filter-process__58___git-annex_command_not_found.mdwn new file mode 100644 index 0000000000..44579c4dfb --- /dev/null +++ b/doc/forum/filter-process__58___git-annex_command_not_found.mdwn @@ -0,0 +1,82 @@ +So this is mystifying, because this happens for only a single repo out of many that sync between two machines with the git-annex assistant. All other repos sync successfully. + +``` +Updating fb9c1a7..e5b6add +git-annex filter-process: git-annex: command not found +fatal: the remote end hung up unexpectedly + + realpath: does not exist (No such file or directory) +``` + +Outbound config: + +``` +[core] + repositoryformatversion = 0 + filemode = true + bare = false + logallrefupdates = true + ignorecase = true + precomposeunicode = true +[remote "mini"] + url = me@mini:Self/repos/trading-obsidian + fetch = +refs/heads/*:refs/remotes/mini/* + annex-uuid = xxxx +[branch "master"] + remote = mini + merge = refs/heads/master +[annex] + uuid = xxxx + version = 10 +[filter "annex"] + smudge = git-annex smudge -- %f + clean = git-annex smudge --clean -- %f + process = git-annex filter-process +``` + + +I can ssh in and run commands: + +``` +ssh -A mini 'git-annex version' +git-annex version: 10.20241202 +build flags: Assistant Webapp Pairing FsEvents TorrentParser MagicMime Servant Benchmark Feeds Testsuite S3 WebDAV +dependency versions: aws-0.24.3 bloomfilter-2.0.1.2 crypton-1.0.1 DAV-1.3.4 feed-1.3.2.1 ghc-9.8.2 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.16 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: darwin x86_64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 + +ssh -A mini 'which git-annex-shell' +/Users/me/bin/git-annex-shell + +``` + +I didn't intentionally configure this repo any differently than any other repo. + +For example, the assistant doesn't emit this error for another repo on the same source machine, although I have no idea what a `realpath` is and why this happens in every repo, but this error doesn't seem to be causing the problem. + +``` +Updating 4963189..4055a92 +Fast-forward + file.md | 2 ++ + 1 file changed, 2 insertions(+) + + realpath: does not exist (No such file or directory) +ssh: connect to host mini port 22: Operation timed out + + realpath: does not exist (No such file or directory) +ssh: connect to host mini port 22: Operation timed out + + realpath: does not exist (No such file or directory) +ssh: connect to host mini port 22: Operation timed out + + realpath: does not exist (No such file or directory) +ssh: connect to host mini port 22: Operation timed out +ssh: connect to host mini port 22: Operation timed out + + realpath: does not exist (No such file or directory) +``` + +Any suggestions are appreciated, thanks!
initial report on slow thaw
diff --git a/doc/bugs/thawing_directory_-_takes_long_+_logs_twice.mdwn b/doc/bugs/thawing_directory_-_takes_long_+_logs_twice.mdwn new file mode 100644 index 0000000000..b4e963446d --- /dev/null +++ b/doc/bugs/thawing_directory_-_takes_long_+_logs_twice.mdwn @@ -0,0 +1,135 @@ +### Please describe the problem. + +I have spotted that `git annex drop` was at time quite slow. I thought it was testing the remote(s) but ended up being not that: e.g. + +`git annex drop --debug sub-440958/sub-440958_ses-20190213T115547_behavior+ecephys+image+ogen.nwb` + +has + +``` +[2025-03-06 17:17:27.142370501] (Utility.Process) process [3058825] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"] +[2025-03-06 17:17:27.147619069] (Annex.Perms) thawing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +[2025-03-06 17:17:27.147915521] (Annex.Perms) freezing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +[2025-03-06 17:17:27.196116156] (Utility.Url) Request { + host = "dandiarchive.s3.amazonaws.com" + port = 443 + secure = True + requestHeaders = [("Accept-Encoding",""),("User-Agent","git-annex/10.20250115+git160-g93fb1ba536-1~ndall+1")] + path = "/blobs/227/91d/22791d80-26dc-4495-b4c0-651fe10e3298" + queryString = "?versionId=mnU_QOZwfaLCY7kSeUX_lcNGHUjjbeU2" + method = "HEAD" + proxy = Nothing + rawBody = False + redirectCount = 10 + responseTimeout = ResponseTimeoutDefault + requestVersion = HTTP/1.1 + proxySecureMode = ProxySecureWithConnect +} + +[2025-03-06 17:17:27.388964083] (Command.Drop) Dropping from here proof: Just (SafeDropProof (NumCopies 1) (MinCopies 1) [RecentlyVerifiedCopy UUID "00000000-0000-0000-0000-000000000001"] Nothing (Just (ContentRemovalLock (MkKey {keyData = Key {keyName = "40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb", keyVariety = SHA2Key (HashSize 256) (HasExt True), keySize = Just 458941350881, keyMtime = Nothing, keyChunkSize = Nothing, keyChunkNum = Nothing}, keySerialization = "SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb"})))) +[2025-03-06 17:17:27.389404507] (Annex.Perms) thawing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +[2025-03-06 17:19:16.06748192] (Annex.Perms) thawing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +[2025-03-06 17:19:16.068644724] (Annex.Branch) read a22/51b/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb.log +[2025-03-06 17:19:16.070244398] (Annex.Branch) set a22/51b/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb.log + +``` + +so +- it has actually 3 (not two) log lines on thawing and 1st one comes even before testing remote URL. +- it takes almost two minutes between 2nd and 3rd thawing lines (without any output inbetween) + +repository is quite light in number of files (not size though) + +``` +annexed files in working tree: 174 +size of annexed files in working tree: 65.7 terabytes +``` + +and could be cloned from https://github.com/dandisets/000363 . + +Interestingly, looking at other not yet dropped files there -- those folders are not even write-protected! (actual key files are): + +``` +dandi@drogon:/mnt/backup/dandi/dandisets/000363$ git annex find --in here sub-440* | while read f; do ccd=$(readlink -f "$f"|xargs dirname); ls -ld "$ccd"; done +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:21 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/v7/W1/SHA256E-s446084260431--0c66c1f84686ab5838353fe42630c35d5b9c255e3ebf92a82d1d1adfa8a7fedf.nwb +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:24 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/G7/zQ/SHA256E-s440859400651--7364c08d63b04a92b8cbe82a506733ff4e7ba36e78060ca1aa06b32111622a78.nwb +drwxr-sr-x 1 dandi dandi 176 Mar 3 23:25 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/6J/wk/SHA256E-s159195870--66243f8b55567671400b14da186518293a76d406f0b1c979cc9d97ecb254d9ae.nwb +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:25 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/6K/V6/SHA256E-s480007274147--da636994efc2d9c53f4f2261d04dd0f98f56d6626088bf77aa393a0090d8fb63.nwb +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:25 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/jP/mJ/SHA256E-s296110873457--68d9d40d3d356183b3d2a433fabffc5ed19d04f2057e2121d7b7c9c5372ebe44.nwb +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:21 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/51/pP/SHA256E-s372184652945--5bdbb5e06e77c714d0c2460a961bf95753623618e1e5c83073fb7f0a3cb8f0b8.nwb +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:25 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/02/mz/SHA256E-s359721690048--7e90fae4aa4808f2c4d85361dd57a8e188bf349b9c88993e74beb6f54719ae8e.nwb +drwxr-sr-x 1 dandi dandi 182 Mar 3 23:25 /mnt/backup/dandi/dandisets/000363/.git/annex/objects/QM/j4/SHA256E-s274399228635--8833b0c1024b4130e9ce389a76c0241cd0cf3c4a7ded74d23b2fd151aaffe311.nwb +``` + +and repo is not 'shared' AFAIK + +``` +dandi@drogon:/mnt/backup/dandi/dandisets/000363$ grep share .git/config +dandi@drogon:/mnt/backup/dandi/dandisets/000363$ + +``` + + +### What version of git-annex are you using? On what operating system? + +detected with earlier version but was investigating further with "bleeding edge" + +``` +dandi@drogon:/mnt/backup/dandi/dandisets/000363$ git annex version +git-annex version: 10.20250115+git160-g93fb1ba536-1~ndall+1 +build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Servant Benchmark Feeds Testsuite S3 WebDAV +dependency versions: aws-0.22.1 bloomfilter-2.0.1.0 cryptonite-0.29 DAV-1.3.4 feed-1.3.2.1 ghc-9.0.2 http-client-0.7.13.1 persistent-sqlite-2.13.1.0 torrent-10000.1.1 uuid-1.3.15 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: linux x86_64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 +local repository version: 10 + +``` + +I do not think custom threeze/thaw configured here + +``` +dandi@drogon:/mnt/backup/dandi/dandisets/000363$ git config -l | grep annex +alias.annexdiff=!GIT_EXTERNAL_DIFF='git-annex diffdriver -- git-annex-diff-wrapper --' git diff +annex.stalldetection=1KB/120s +annex.autoupgraderepository=false +annex.uuid=1ce3e7b7-edf2-4907-ad29-610e9a46315f +annex.version=10 +filter.annex.smudge=git-annex smudge -- %f +filter.annex.clean=git-annex smudge --clean -- %f +filter.annex.process=git-annex filter-process +remote.dandi-dandisets-dropbox.annex-externaltype=rclone +remote.dandi-dandisets-dropbox.annex-uuid=727f466f-60c3-4778-90b2-b2332856c2f8 +remote.dandi-dandisets-dropbox.annex-cost=200.0 +remote.dandi-dandisets-dropbox.annex-availability=GloballyAvailable +remote.github.annex-ignore=true +remote.dandiapi.annex-web=true +remote.dandiapi.annex-uuid=00000000-0000-0000-0000-000000000001 +remote.dandiapi.annex-config-uuid=f80ff5ab-efbc-4b0b-aaf7-bfbed92e4c29 + +``` + +file system is BTRFS and quite "busy" but I would not expect 2 minutes on chmod call there... unclear why 3 log lines as well + +Seems to be chronic but not all the time -- seems one run had it go fast albeit with 3 log lines as well + +``` +dandi@drogon:/mnt/backup/dandi/dandisets/000363$ grep 'thawing content directory' .duct/logs/2025.03.* +.duct/logs/2025.03.04T09.19.20-178850_stderr:[2025-03-04 09:19:20.914767806] (Annex.Perms) thawing content directory .git/annex/objects/zg/F6/SHA256E-s424900914587--0f859aca43eaf7fb3a12de068b667ffd1b836d8edcd7574547c1e8782cd1f04e.nwb +.duct/logs/2025.03.04T09.19.20-178850_stderr:[2025-03-04 09:19:21.187113505] (Annex.Perms) thawing content directory .git/annex/objects/zg/F6/SHA256E-s424900914587--0f859aca43eaf7fb3a12de068b667ffd1b836d8edcd7574547c1e8782cd1f04e.nwb +.duct/logs/2025.03.04T09.19.20-178850_stderr:[2025-03-04 09:22:28.320213649] (Annex.Perms) thawing content directory .git/annex/objects/zg/F6/SHA256E-s424900914587--0f859aca43eaf7fb3a12de068b667ffd1b836d8edcd7574547c1e8782cd1f04e.nwb +.duct/logs/2025.03.06T17.08.02-3042067_stderr:[2025-03-06 17:08:03.059666572] (Annex.Perms) thawing content directory .git/annex/objects/M2/5j/SHA256E-s296808899160--b2c4f356904a6a034ef8630a34508950958cfda4bf7969662f1a1ee3ad682cfc.nwb +.duct/logs/2025.03.06T17.08.02-3042067_stderr:[2025-03-06 17:08:03.356505174] (Annex.Perms) thawing content directory .git/annex/objects/M2/5j/SHA256E-s296808899160--b2c4f356904a6a034ef8630a34508950958cfda4bf7969662f1a1ee3ad682cfc.nwb +.duct/logs/2025.03.06T17.08.02-3042067_stderr:[2025-03-06 17:09:25.985686417] (Annex.Perms) thawing content directory .git/annex/objects/M2/5j/SHA256E-s296808899160--b2c4f356904a6a034ef8630a34508950958cfda4bf7969662f1a1ee3ad682cfc.nwb +.duct/logs/2025.03.06T17.17.16-3058713_stderr:[2025-03-06 17:17:16.437911131] (Annex.Perms) thawing content directory .git/annex/objects/Zx/gm/SHA256E-s190766322--419f317867cae557e0cf72460d981dafed54eb5a2136fecf25e7650db3bec944.nwb +.duct/logs/2025.03.06T17.17.16-3058713_stderr:[2025-03-06 17:17:16.693636236] (Annex.Perms) thawing content directory .git/annex/objects/Zx/gm/SHA256E-s190766322--419f317867cae557e0cf72460d981dafed54eb5a2136fecf25e7650db3bec944.nwb +.duct/logs/2025.03.06T17.17.16-3058713_stderr:[2025-03-06 17:17:16.804165428] (Annex.Perms) thawing content directory .git/annex/objects/Zx/gm/SHA256E-s190766322--419f317867cae557e0cf72460d981dafed54eb5a2136fecf25e7650db3bec944.nwb +.duct/logs/2025.03.06T17.17.26-3058787_stderr:[2025-03-06 17:17:27.147619069] (Annex.Perms) thawing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +.duct/logs/2025.03.06T17.17.26-3058787_stderr:[2025-03-06 17:17:27.389404507] (Annex.Perms) thawing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +.duct/logs/2025.03.06T17.17.26-3058787_stderr:[2025-03-06 17:19:16.06748192] (Annex.Perms) thawing content directory .git/annex/objects/Q2/6g/SHA256E-s458941350881--40f2a2dd11d2f5f5572b32f654d0c92557f2a390fe1434e70caecd44fd3225c9.nwb +``` + +[[!meta author=yoh]] +[[!tag projects/dandi]]
improve
diff --git a/doc/special_remotes/compute/git-annex-compute-imageconvert b/doc/special_remotes/compute/git-annex-compute-imageconvert index fb106e55d7..8edc6b8f24 100755 --- a/doc/special_remotes/compute/git-annex-compute-imageconvert +++ b/doc/special_remotes/compute/git-annex-compute-imageconvert @@ -15,10 +15,11 @@ echo "INPUT $1" read input echo "OUTPUT $2" +# Prefixing with ./ makes sure that the output is treated as a +# filename, rather than a dashed option. +output="./$2" + if [ -n "$input" ]; then - # Prefixing the filenames with "./" makes sure that they are processed - # as files, even if they look like dashed options. - mkdir -p "$(dirname "./$2")" - ln -s "$input" "./$1" - convert "./$1" "./$2" + mkdir -p "$(dirname "$output")" + convert "$input" "$output" fi
add git-annex-compute-imageconvert
diff --git a/COPYRIGHT b/COPYRIGHT index 71f1d59aaa..7dfe659c6a 100644 --- a/COPYRIGHT +++ b/COPYRIGHT @@ -14,6 +14,10 @@ Files: doc/special_remotes/external/* Copyright: © 2013 Joey Hess <id@joeyh.name> License: GPL-3+ +Files: doc/special_remotes/compute/git-annex-compute-imageconvert +Copyright: © 2025 Joey Hess <id@joeyh.name> +License: GPL-3+ + Files: doc/design/external_backend_protocol/git-annex-backend-XFOO Copyright: © 2020 Joey Hess <id@joeyh.name> License: GPL-3+ diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 264cec825a..36bda5a62b 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -32,6 +32,9 @@ To write programs used by the compute special remote, see the [[design/compute_special_remote_interface]]. Have you written a generally useful (and secure) compute program? -List it here! +List it here with an example! -* ... +* [[compute/git-annex-compute-imageconvert]] -- + Uses imagemagick to convert between image formats + + `git-annex addcomputed --to=imageconvert foo.jpeg foo.gif` diff --git a/doc/special_remotes/compute/git-annex-compute-imageconvert b/doc/special_remotes/compute/git-annex-compute-imageconvert new file mode 100755 index 0000000000..fb106e55d7 --- /dev/null +++ b/doc/special_remotes/compute/git-annex-compute-imageconvert @@ -0,0 +1,24 @@ +#!/bin/sh +# git-annex compute special remote program that uses imagemagic's convert +# to convert one type of image format into another. Eg, jpeg to gif. +# +# Copyright 2025 Joey Hess; licenced under the GNU GPL version 3 or higher. +set -e + +if [ -z "$1" ] || [ -z "$2" ]; then + echo "Specify the input image file, followed by the output image file." >&2 + echo "Example: foo.jpg foo.gif" >&2 + exit 1 +fi + +echo "INPUT $1" +read input +echo "OUTPUT $2" + +if [ -n "$input" ]; then + # Prefixing the filenames with "./" makes sure that they are processed + # as files, even if they look like dashed options. + mkdir -p "$(dirname "./$2")" + ln -s "$input" "./$1" + convert "./$1" "./$2" +fi
prefix output with ./ in example
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 869883f918..2c97946a81 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -100,8 +100,11 @@ An example `git-annex-compute-foo` shell script follows: echo "INPUT $2" read input echo "OUTPUT $3" + # Prefixing with ./ makes sure that the output is treated as a + # filename, rather than a dashed option. + output="./$3" echo REPRODUCIBLE if [ -n "$input" ]; then - mkdir -p "$(dirname "$3")" - frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + mkdir -p "$(dirname "$output")" + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$output" fi
no longer a draft
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 0dfd93e314..869883f918 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -1,5 +1,3 @@ -**draft** - The [[special_remotes/compute]] special remote uses this interface to run compute programs.
preparing to merge compute
diff --git a/CHANGELOG b/CHANGELOG index 475277f8f4..8c944a4bfb 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,5 +1,9 @@ git-annex (10.20250116) UNRELEASED; urgency=medium + * Added the compute special remote. + * addcomputed: New command, adds a file that is generated by a compute + special remote. + * recompute: New command, recomputes computed files. * Support help.autocorrect settings "prompt", "never", and "immediate". * Allow setting remote.foo.annex-tracking-branch to a branch name that contains "/", as long as it's not a remote tracking branch. diff --git a/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment new file mode 100644 index 0000000000..1416d77bde --- /dev/null +++ b/doc/todo/compute_special_remote/comment_21_2546562f7a00e082cd0500debc904cf3._comment @@ -0,0 +1,22 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: DataLad exploration of the compute on demand space""" + date="2025-03-06T17:39:04Z" + content=""" +Thanks for explaining the design points of datalad-remake. Some +different design choices than I have made, but mostly they strike me as +implementing what is easier/possible from outside git-annex. + +Eg, storing the compute inputs under `.datalad` in the branch is fine -- +and might even be useful if you want to make a branch that changes +something in there -- but of course in the git-annex implementation it +stores the equvilant thing in the git-annex branch. + +I do hope I'm not closing off the design space from such differences +by dropping a compute special remote right into git-annex. But I also +expect that having a standard and easy way for at least simple +computations will lead to a lot of contributions as others use it. + +Your fMRI case seems like one that my compute remote could handle well +and easily. +"""]] diff --git a/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment new file mode 100644 index 0000000000..bfacbdf57d --- /dev/null +++ b/doc/todo/compute_special_remote/comment_22_d1561153a3916411ed8caa92fa53893c._comment @@ -0,0 +1,69 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 22""" + date="2025-03-06T17:54:50Z" + content=""" +I've merged the compute special remote now. +See [[special_remotes/compute]], [[git-annex-addcomputed]] +and [[git-annex-recompute]]. + +I have opened [[todo/compute_special_remote_remaining_todos]] with +some various ways that I want to improve it further. Including, notably, +computing on inputs from submodules, which is not currently supported at +all. + +---- + +Here I'll go down mih's original and quite useful design criteria and see +how the compute special remote applies to them: + +### Generate annex keys (that have never existed) + +`git-annex addcomputed --fast` + +### Re-generate annex keys + +`git-annex addcomputed` optionally with the --reproducible option, +followed by a later `git-annex get` + +Another thing that fits under this heading is when one of the original +input files has gotten modified, and you want to compute a new version of +the output file from it, using the same method as was used to compute it +before. That's `git-annex recompute $output_file` + +### Worktree provisioning? + +This is the main thing I didn't implement. Given that git-annex is working +with large files and needs to support various filesystems and OS's that +lack hardlinks and softlinks, it's hard to do this inexpensively. + +Also, it turned out to make sense for the compute program to request +the input files it needs, since this lets git-annex learn what the input +files are, so it can make them available when regenerating a computed file +later. And so the protocol just has git-annex respond with the path to +the content of the file. + +### Request one key, receive many + +This is supported. (So is using multiple inputs to produce one (or more) +outputs.) + +### Instruction deposition + +`git-annex addcomputed` + +### Storage redundancy tests + +It did make sense to have it automatically `git-annex get` the inputs. +Well, I think it makes sense in most cases, this may become a tunable +setting of the compute special remote. + +### Trust + +Handled by requiring the user install a `git-annex-compute-foo` command +in PATH, and provide the name of the command to `initremote`. + +And for later `enableremote` or `autoenable=true`, it will only +allow programs that are listed in the annex.security.allowed-compute-programs +git config. +"""]] diff --git a/TODO-compute b/doc/todo/compute_special_remote_remaining_todos.mdwn similarity index 82% rename from TODO-compute rename to doc/todo/compute_special_remote_remaining_todos.mdwn index 7749ad1be3..bb522398a4 100644 --- a/TODO-compute +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,3 +1,19 @@ +This is the remainder of my todo list while I was building the +compute special remote. --[[Joey]] + +* write a tip showing how to use this + +* Write some simple compute programs so we have something to start with. + + - convert between images eg jpeg to png + - run a command in a singularity container (that is one of the inputs) + - run a wasm binary (that is one of the inputs) + +* compute on input files in submodules + +* annex.diskreserve can be violated if getting a file computes it but also + some other output files, which get added to the annex. + * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an url. Maybe a separate command? That would also allow querying for eg, @@ -27,8 +43,6 @@ So it, seems that, for this to be done, recompute would need to stage the pointer file. -* compute on files in submodules - * recompute could ingest keys for other files than the one being recomputed, and remember them. Then recomputing those files could just use those keys, without re-running a computation. (Better than --others
Added a comment: Special use case for Scientific application
diff --git a/doc/tips/using_the_web_as_a_special_remote/comment_16_3d15f477fd0cbe336f24f5083a62369e._comment b/doc/tips/using_the_web_as_a_special_remote/comment_16_3d15f477fd0cbe336f24f5083a62369e._comment new file mode 100644 index 0000000000..5e04e87b78 --- /dev/null +++ b/doc/tips/using_the_web_as_a_special_remote/comment_16_3d15f477fd0cbe336f24f5083a62369e._comment @@ -0,0 +1,23 @@ +[[!comment format=mdwn + username="jerome.charousset@86fd8ed1bf55902989d7e70a11c38cb3a444b72d" + nickname="jerome.charousset" + avatar="http://cdn.libravatar.org/avatar/4c5d71789f9469db26c261284be0f41c" + subject="Special use case for Scientific application" + date="2025-03-06T17:02:22Z" + content=""" +Hello, +I'm trying to use of Git-annex for some scientific applications, with a specific use case in mind: + +1. One large file is published on Zenodo.org, with a specific and permanent URL for each version (past & future). + +2. I annexed the current version with something like : `git annex addurl --file=my-input-file <the-url-of-the-current-version-on-zenodo> ; git commit ... ; git annex sync` + +3. When a newer version is published on Zenodo, I would like to \"upgrade\" the annexed file to the newer version. +I did some experiment and it seems to work by doing this : `git rm my-input-file ; git annex addurl --file=my-input-file <the-url-of-the-newer-version-on-zenodo> ; git commit ... ; git annex sync ` + +Does that make sense ? is there any risk ? is there a easier/more direct way to achieve what I want ? + +Thanks in advance for your support ! + + +"""]]
update
diff --git a/TODO-compute b/TODO-compute index 5b212695c6..8d26a0777d 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,7 +1,8 @@ * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an url. Maybe a separate command? That would also allow querying for eg, - what files are inputs for another file. + what files are inputs for another file. Or it could be exposed in the + Remote interface, and made into a file matching option. * "getting input from <file>" message uses the original filename, but that file might have been renamed. Would be more clear to use @@ -16,6 +17,15 @@ * Perhaps recompute should write a new version of a file as an unlocked file when the file is currently unlocked? + Problem: Since recompute does not stage the file, it would have to write + the content to the working tree. And then the user would need to + git-annex add. But then, if the key was a VURL key, it would add it with + the default backend instead, and the file would no longer use a computed + key. + + So it, seems that, for this to be done, recompute would need to stage the + pointer file. + * compute on files in submodules * recompute could ingest keys for other files than the one being @@ -42,4 +52,3 @@ that recompute should also support recomputing non-annexed files. Otherwise, adding a file and then recomputing it would vary in what the content of the file is, depending on annex.smallfiles setting. - diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index 0f8dd56901..498c85e26c 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -29,9 +29,9 @@ but is not staged, in order to avoid overwriting any staged changes. Only recompute files that were computed by this compute remote. When this option is not used, all computed files are recomputed using - whatever compute remote was originally used to add them. In cases where + whatever compute remote was originally used to add them. (In cases where a file can be computed by multiple remotes, the one with the lowest - configured cost will be used. + configured cost is used.) * `--unreproducible`, `-u`
Added a comment
diff --git a/doc/forum/Can_I_reset_the_get__47__drop_information_and_resync__63__/comment_1_593a9fd60bab1a4a346476c89e0932c6._comment b/doc/forum/Can_I_reset_the_get__47__drop_information_and_resync__63__/comment_1_593a9fd60bab1a4a346476c89e0932c6._comment new file mode 100644 index 0000000000..766e305ef8 --- /dev/null +++ b/doc/forum/Can_I_reset_the_get__47__drop_information_and_resync__63__/comment_1_593a9fd60bab1a4a346476c89e0932c6._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="matrss" + avatar="http://cdn.libravatar.org/avatar/cd1c0b3be1af288012e49197918395f0" + subject="comment 1" + date="2025-03-05T15:40:44Z" + content=""" +If I understand you correctly you mean that you somehow got git-annex' location tracking out-of-sync with the reality of where files are? I.e. `git annex list` shows some files as present in a repository which aren't actually there? + +If yes then `git annex fsck` (<https://git-annex.branchable.com/git-annex-fsck/>) should find and correct these issues (or tell you if they are not fixable). +"""]]
Added a comment
diff --git a/doc/todo/use_copy__95__file__95__range_for_get_and_copy/comment_2_85483ca3d6d913bc23f23e5faef278d4._comment b/doc/todo/use_copy__95__file__95__range_for_get_and_copy/comment_2_85483ca3d6d913bc23f23e5faef278d4._comment new file mode 100644 index 0000000000..3e38807ee8 --- /dev/null +++ b/doc/todo/use_copy__95__file__95__range_for_get_and_copy/comment_2_85483ca3d6d913bc23f23e5faef278d4._comment @@ -0,0 +1,9 @@ +[[!comment format=mdwn + username="bpoldrack" + avatar="http://cdn.libravatar.org/avatar/8e929a6e60f16f50e7463f88aab26f40" + subject="comment 2" + date="2025-03-05T14:23:56Z" + content=""" +The problem is indeed specific to the combination of NFS and ZFS. +Not sure how to properly probe for that. But since this is relevant only for copies across ZFS over NFS4.2 mounts, a config option would be OK, I think. It's not that something isn't working w/o it, it's just unnecessarily slow. +"""]]
Tag copy_file_range todo with projects/INM7 (came from our cluster)
diff --git a/doc/todo/use_copy__95__file__95__range_for_get_and_copy.mdwn b/doc/todo/use_copy__95__file__95__range_for_get_and_copy.mdwn index 46dcf4d65d..72bb84fb96 100644 --- a/doc/todo/use_copy__95__file__95__range_for_get_and_copy.mdwn +++ b/doc/todo/use_copy__95__file__95__range_for_get_and_copy.mdwn @@ -14,3 +14,4 @@ Just to be clear: It's specifically ZFS via NFS on linux that's the issue here. P.S.: Didn't want to call this a bug, mostly b/c the "real bug" isn't in annex exactly (see link above), so putting it here. [[!meta author=ben]] +[[!tag projects/INM7]]
Added a comment: DataLad exploration of the compute on demand space
diff --git a/doc/todo/compute_special_remote/comment_20_38b36d6d763cd0823dea7016d0ad9153._comment b/doc/todo/compute_special_remote/comment_20_38b36d6d763cd0823dea7016d0ad9153._comment new file mode 100644 index 0000000000..20382bd509 --- /dev/null +++ b/doc/todo/compute_special_remote/comment_20_38b36d6d763cd0823dea7016d0ad9153._comment @@ -0,0 +1,25 @@ +[[!comment format=mdwn + username="msz" + avatar="http://cdn.libravatar.org/avatar/6e8b88e7c70d86f4cfd27d450958aed4" + subject="DataLad exploration of the compute on demand space" + date="2025-03-05T13:31:41Z" + content=""" +Thank you for the interesting discussion @matrss & @joey, you raise a lot of great points. I will need some time to take that all in, so I don't have any direct comments about the open questions. Meanwhile, I am excited to see the git-annex implementation taking shape. + +> Of course, git-annex having its own compute special remote would not preclude other external special remotes that compute. And for that matter, a single external special remote could implement an extension interface. + +With that in mind, I would like to share that my colleagues from the DataLad team (Psychoinformatics group) have experimented in parallel with a [datalad-remake](https://github.com/datalad/datalad-remake) special remote (not released yet, but FMPOV already functional). I am not the best person to explain the design decisions (I was more of a user-tester) but these are the key elements (note: I don't think any of these are final): + +- Essential parameters for recompute are associated with the files using `datalad-remake://` URLs (note to self: learn more about VURL). +- You can associate different sets of parameters (different URLs, leading to different computations, e.g. containerized vs native environment) by using a \"label\" parameter; preferred label is chosen via config. +- The association can be prospective, a'la `git annex addurl --relaxed`. +- The actual compute instructions and data dependencies are stored in TOML files under `.datalad`, in the same branch as the file to be recomputed. +- The trust is addressed by requiring the TOML instructions to be added in a gpg-signed git commit. A user-scope config declares the key ids trusted for that purpose. +- The (re)-computation is done in a git worktree provisioned for the purpose (which means using the past state, not HEAD). +- Files listed as data dependencies are retrieved with git annex get. +- This also works if some dependencies are in subdatasets (submodules). + +I am looking forward to exploring convergence -- or specialization -- with the git-annex implementation! + +The model use case for me personally is fMRI preprocessing, which involves computing and applying spatial transforms on 4D (3D, repeated over time) images. The initial computation is time-consuming, and it produces a large target file (transformed image) as well as several small ancillary files (mostly transformation matrices). Applying these ancillary files to the raw image (which would typically be stored in a subdataset of the dataset which holds the results) is cheap, and can be reproducible in a byte-exact fashion. So I would run the initial computation normally, and then create \"shortcut\" recompute instructions (data dependencies are pretty straightforward in this case), attach them to the target file, and drop its contents. In practice, that would mean trading a few hundred MB for a minute or two of recomputing. I know this is domain-specific, but FTR, I made a demo dataset with a short write-up: [ds005479-remake-demo](https://hub.datalad.org/mslw/ds005479-remake-demo). +"""]]
Added a comment
diff --git a/doc/todo/Read-only_support_for_webdav/comment_2_968e09a53e4d583c67dea328d00930b4._comment b/doc/todo/Read-only_support_for_webdav/comment_2_968e09a53e4d583c67dea328d00930b4._comment new file mode 100644 index 0000000000..ca1a0408ce --- /dev/null +++ b/doc/todo/Read-only_support_for_webdav/comment_2_968e09a53e4d583c67dea328d00930b4._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="msz" + avatar="http://cdn.libravatar.org/avatar/6e8b88e7c70d86f4cfd27d450958aed4" + subject="comment 2" + date="2025-03-05T11:27:38Z" + content=""" +Thank you! I had an opportunity to try it out only recently, and it worked great! + +With that, I can use a regular (read/write) WebDAV special remote to store data on a Nextcloud instance; then, when I want to publish (without granting write access), I can `initremote --sameas`, and enable the second remote using a public read-only WebDAV URL (which is not the share link generated by Nextcloud, but can be easily derived from it). The public share can be password-protected. And, indeed, git-remote-annex works seamlessly on top. Very little needs to be done in terms of remote configuration to make it work. This is cool. +"""]]
filled out bug description
diff --git a/doc/bugs/IPv6_link-local_address_as_remote.mdwn b/doc/bugs/IPv6_link-local_address_as_remote.mdwn new file mode 100644 index 0000000000..d2c43ba221 --- /dev/null +++ b/doc/bugs/IPv6_link-local_address_as_remote.mdwn @@ -0,0 +1,33 @@ +### Please describe the problem. + +Unable to use IPv6 link-local address as ssh remote. + +### What steps will reproduce the problem? + +``` +git clone [fe80::aaaa:aaaa:aaaa:aaaa%enp4s0f2]:/path/to/repo.git repo +cd repo +git annex pull + + Remote origin cannot currently be accessed. +``` + +### What version of git-annex are you using? On what operating system? + +10.20241202-1~bpo12+1 on Debian backports amd64 + +### Please provide any additional information below. + +There is no problem with global IPv6 addresses, so it is likely that the percent suffix "%ethernetdevice" is causing the problem. Note that git (`git clone` in the example above) works fine with the link-local IPv6 address. + +[[!format sh """ +# If you can, paste a complete transcript of the problem occurring here. +# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log + + +# End of transcript or log. +"""]] + +### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) + +Yes, I have successfully used git-annex with a local remote (same computer), ssh over IPv4, and ssh to a globally visible IPv6 address.
safer git sha object filename
Rather than use the filename provided by INPUT, which could come from user
input, and so could be something that looks like a dashed parameter,
use a .git/object/<sha> filename.
This avoids user input passing through INPUT and back out, with the file
path then passed to a command, which could do something unexpected with
a dashed parameter, or other special parameter.
Added a note in the design about being careful of passing user input to
commands. They still have to be careful of that in general, just not in
this case.
Rather than use the filename provided by INPUT, which could come from user
input, and so could be something that looks like a dashed parameter,
use a .git/object/<sha> filename.
This avoids user input passing through INPUT and back out, with the file
path then passed to a command, which could do something unexpected with
a dashed parameter, or other special parameter.
Added a note in the design about being careful of passing user input to
commands. They still have to be careful of that in general, just not in
this case.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 53f08c6cf9..58e0ef6e8b 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -425,10 +425,9 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> pure Nothing Just (Right f'') -> liftIO $ Just <$> relPathDirToFile subdir f'' - Just (Left gitsha) -> do - liftIO . F.writeFile (subdir </> f') - =<< catObject gitsha - return (Just f') + Just (Left gitsha) -> + Just <$> (liftIO . relPathDirToFile subdir + =<< populategitsha gitsha tmpdir) liftIO $ hPutStrLn (stdinHandle p) $ maybe "" fromOsPath mp liftIO $ hFlush (stdinHandle p) @@ -479,6 +478,17 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = fromIntegral (endtime - starttime) :: NominalDiffTime + -- Writes to a .git/objects/ file in the tmpdir, rather than + -- using the input filename, to avoid exposing the input filename + -- to the program as a parameter, which could parse it as a dashed + -- option or other special parameter. + populategitsha gitsha tmpdir = do + let f = tmpdir </> ".git" </> "objects" + </> toOsPath (Git.fromRef' gitsha) + liftIO $ createDirectoryIfMissing True $ takeDirectory f + liftIO . F.writeFile f =<< catObject gitsha + return f + computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a computationBehaviorChangeError (ComputeProgram program) requestdesc p = giveup $ program ++ " is not behaving the same way it used to, now " ++ requestdesc ++ ": " ++ fromOsPath p diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 8b62a601fa..0dfd93e314 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -23,7 +23,9 @@ that is in the form "foo=bar" will also result in an environment variable being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. For security, the program should avoid exposing user input to the shell -unprotected, or otherwise executing it. +unprotected, or otherwise executing it. And when running a command, make +sure that whatever user input is passed to it can result in only safe and +expected behavior. The program is run in a temporary directory, which will be cleaned up after it exits. Note that it may be run in a subdirectory of a temporary
rename config to annex.security.allowed-compute-programs
And require for enable as well as autoenable.
It seemed asking for trouble for `git-annex enable foo` to use whatever
compute program is stored in the git config, without verifying that the
user wants that program to be used.
Note that it would be good to allow `git-annex enable foo program=...`
to be used without the program being in the git config. Not implemented yet
though.
And require for enable as well as autoenable.
It seemed asking for trouble for `git-annex enable foo` to use whatever
compute program is stored in the git config, without verifying that the
user wants that program to be used.
Note that it would be good to allow `git-annex enable foo program=...`
to be used without the program being in the git config. Not implemented yet
though.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index d43e745e95..2903f926b2 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -121,21 +121,19 @@ gen r u rc gc rs = case getComputeProgram' rc of setupInstance :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) setupInstance ss mu _ c _ = do ComputeProgram program <- either giveup return $ getComputeProgram' c + allowedprograms <- maybe [] words . annexAllowedComputePrograms + <$> Annex.getGitConfig case ss of - AutoEnable _ -> do - l <- maybe [] words - . annexAutoEnableComputePrograms - <$> Annex.getGitConfig - unless (program `elem` l) $ do - let remotename = fromMaybe "(unknown)" (lookupName c) - giveup $ unwords - [ "Not auto-enabling compute special remote" - , remotename - , "because its compute program" - , program - , " is not listed in annex.security.autoenable-compute-programs" - ] - _ -> noop + Init -> noop + _ -> unless (program `elem` allowedprograms) $ do + let remotename = fromMaybe "(unknown)" (lookupName c) + giveup $ unwords + [ "Not enabling compute special remote" + , remotename + , "because its compute program" + , program + , "is not listed in annex.security-allowed-compute-programs" + ] unlessM (liftIO $ inSearchPath program) $ giveup $ "Cannot find " ++ program ++ " in PATH" u <- maybe (liftIO genUUID) return mu diff --git a/TODO-compute b/TODO-compute index 547730914e..3d02d9cc00 100644 --- a/TODO-compute +++ b/TODO-compute @@ -1,3 +1,6 @@ +* allow git-annex enableremote with program= explicitly specified, + without checking annex.security.allowed-compute-programs + * need progress bars for computations and implement PROGRESS message * get input files for a computation (so `git-annex get .` gets every file, diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 6ea4503d1a..eeae1a0c7e 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -146,7 +146,7 @@ data GitConfig = GitConfig , annexAllowedUrlSchemes :: S.Set Scheme , annexAllowedIPAddresses :: String , annexAllowUnverifiedDownloads :: Bool - , annexAutoEnableComputePrograms :: Maybe String + , annexAllowedComputePrograms :: Maybe String , annexMaxExtensionLength :: Maybe Int , annexMaxExtensions :: Maybe Int , annexJobs :: Concurrency @@ -262,8 +262,8 @@ extractGitConfig configsource r = GitConfig getmaybe (annexConfig "security.allowed-http-addresses") -- old name , annexAllowUnverifiedDownloads = (== Just "ACKTHPPT") $ getmaybe (annexConfig "security.allow-unverified-downloads") - , annexAutoEnableComputePrograms = - getmaybe (annexConfig "security.autoenable-compute-programs") + , annexAllowedComputePrograms = + getmaybe (annexConfig "security.allowed-compute-programs") , annexMaxExtensionLength = getmayberead (annexConfig "maxextensionlength") , annexMaxExtensions = getmayberead (annexConfig "maxextensions") , annexJobs = fromMaybe NonConcurrent $ diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 2146104456..5a39aa3bfa 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -2201,12 +2201,12 @@ Remotes are configured using these settings in `.git/config`. Per-remote configuration of annex.security.allow-unverified-downloads. -* `annex.security.autoenable-compute-programs` +* `annex.security.allowed-compute-programs` This is a space separated list of compute programs eg "git-annex-compute-foo git-annex-compute-bar". Listing a compute program here allows compute special remotes that use that program to be - autoenabled. + enabled by `git-annex enableremote` or autoenabled. # CONFIGURATION OF ASSISTANT diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 811640e2f6..264cec825a 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -15,11 +15,10 @@ program to use to compute the contents of annexed files. It must start with "git-annex-compute-". The program needs to be installed somewhere in the `PATH`. -The `autoenable` parameter can be set to "true" like with other special -remotes to make git-annex automatically enable this special remote when -run in a new clone of the repository. However, for security, autoenabling -is only done when the git config `annex.security.autoenable-compute-programs` -includes the name of the compute program. +Any program can be passed to `git-annex initremote`. However, when enabling +a compute special remote later with `git-annex enableremote` or due to +"autoenable=true", the program must be listed in the git config +`annex.security.allowed-compute-programs`. All other "field=value" parameters passed to `initremote` will be passed to the program when running [[git-annex-addcomputed]]. Note that when the
autoenable security for compute special remote
Added annex.security.autoenable-compute-programs and only allow
autoenabling special remotes that use compute programs on that list.
The reason this is needed is a user might have some compute programs
that are less safe to use than others. They might want to use an unsafe
one only with one repository, where they are the only committer or other
committers are trusted. They might be ok with others being used by any
repository, and if so they can add them to the list.
Another reason would be a user who has installed a compute program by
accident. Eg, it might be included with git-annex at some point, or
pulled in by some dependency. That user doesn't necessarily want that
compute program to be used in an autoenabled special remote.
Added annex.security.autoenable-compute-programs and only allow
autoenabling special remotes that use compute programs on that list.
The reason this is needed is a user might have some compute programs
that are less safe to use than others. They might want to use an unsafe
one only with one repository, where they are the only committer or other
committers are trusted. They might be ok with others being used by any
repository, and if so they can add them to the list.
Another reason would be a user who has installed a compute program by
accident. Eg, it might be included with git-annex at some point, or
pulled in by some dependency. That user doesn't necessarily want that
compute program to be used in an autoenabled special remote.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 564ecbda70..d43e745e95 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -23,6 +23,7 @@ module Remote.Compute ( ) where import Annex.Common +import qualified Annex import Types.Remote import Types.ProposedAccepted import Types.MetaData @@ -118,8 +119,23 @@ gen r u rc gc rs = case getComputeProgram' rc of } setupInstance :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) -setupInstance _ mu _ c _ = do +setupInstance ss mu _ c _ = do ComputeProgram program <- either giveup return $ getComputeProgram' c + case ss of + AutoEnable _ -> do + l <- maybe [] words + . annexAutoEnableComputePrograms + <$> Annex.getGitConfig + unless (program `elem` l) $ do + let remotename = fromMaybe "(unknown)" (lookupName c) + giveup $ unwords + [ "Not auto-enabling compute special remote" + , remotename + , "because its compute program" + , program + , " is not listed in annex.security.autoenable-compute-programs" + ] + _ -> noop unlessM (liftIO $ inSearchPath program) $ giveup $ "Cannot find " ++ program ++ " in PATH" u <- maybe (liftIO genUUID) return mu diff --git a/TODO-compute b/TODO-compute index b3f67016a7..547730914e 100644 --- a/TODO-compute +++ b/TODO-compute @@ -3,8 +3,6 @@ * get input files for a computation (so `git-annex get .` gets every file, even when input files in a directory are processed after computed files) -* autoinit security - * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 255778387f..6ea4503d1a 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -146,6 +146,7 @@ data GitConfig = GitConfig , annexAllowedUrlSchemes :: S.Set Scheme , annexAllowedIPAddresses :: String , annexAllowUnverifiedDownloads :: Bool + , annexAutoEnableComputePrograms :: Maybe String , annexMaxExtensionLength :: Maybe Int , annexMaxExtensions :: Maybe Int , annexJobs :: Concurrency @@ -261,6 +262,8 @@ extractGitConfig configsource r = GitConfig getmaybe (annexConfig "security.allowed-http-addresses") -- old name , annexAllowUnverifiedDownloads = (== Just "ACKTHPPT") $ getmaybe (annexConfig "security.allow-unverified-downloads") + , annexAutoEnableComputePrograms = + getmaybe (annexConfig "security.autoenable-compute-programs") , annexMaxExtensionLength = getmayberead (annexConfig "maxextensionlength") , annexMaxExtensions = getmayberead (annexConfig "maxextensions") , annexJobs = fromMaybe NonConcurrent $ diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index cd53a04aa1..8b62a601fa 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -26,13 +26,12 @@ For security, the program should avoid exposing user input to the shell unprotected, or otherwise executing it. The program is run in a temporary directory, which will be cleaned up after -it exits. Note that it may be run in a subdirectory of its temporary +it exits. Note that it may be run in a subdirectory of a temporary directory. This is done when `git-annex addcomputed` was run in a subdirectory of the git repository. -The content of any annexed file in the repository can be an input -to the computation. The program requests an input by writing a line to -stdout: +The content of any file in the repository can be an input to the +computation. The program requests an input by writing a line to stdout: INPUT file.raw diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index daed2be98a..2146104456 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -2201,6 +2201,13 @@ Remotes are configured using these settings in `.git/config`. Per-remote configuration of annex.security.allow-unverified-downloads. +* `annex.security.autoenable-compute-programs` + + This is a space separated list of compute programs eg + "git-annex-compute-foo git-annex-compute-bar". Listing a compute + program here allows compute special remotes that use that program to be + autoenabled. + # CONFIGURATION OF ASSISTANT * `annex.delayadd` diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index c3f4186008..811640e2f6 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -15,6 +15,12 @@ program to use to compute the contents of annexed files. It must start with "git-annex-compute-". The program needs to be installed somewhere in the `PATH`. +The `autoenable` parameter can be set to "true" like with other special +remotes to make git-annex automatically enable this special remote when +run in a new clone of the repository. However, for security, autoenabling +is only done when the git config `annex.security.autoenable-compute-programs` +includes the name of the compute program. + All other "field=value" parameters passed to `initremote` will be passed to the program when running [[git-annex-addcomputed]]. Note that when the program takes a dashed option, it can be provided after "--":
recompute: display one of the changed files
diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 81fe35cbff..2eda098867 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -79,18 +79,22 @@ start' o r si file key = Remote.Compute.getComputeState (Remote.remoteStateHandle r) key >>= \case Nothing -> stop - Just state -> - stopUnless (shouldrecompute state) $ - starting "recompute" ai si $ - perform o r file key state + Just state -> shouldrecompute state >>= \case + Nothing -> stop + Just mreason -> starting "recompute" ai si $ do + maybe noop showNote mreason + perform o r file key state where ai = mkActionItem (key, file) shouldrecompute state - | originalOption o = return True - | otherwise = - anyM (inputchanged state) $ - M.toList (Remote.Compute.computeInputs state) + | originalOption o = return (Just Nothing) + | otherwise = firstM (inputchanged state) + (M.toList (Remote.Compute.computeInputs state)) + >>= return . \case + Nothing -> Nothing + Just (inputfile, _) -> Just $ Just $ + QuotedPath inputfile <> " changed" inputchanged state (inputfile, inputkey) = do -- Note that the paths from the remote state are not to be @@ -109,11 +113,13 @@ start' o r si file key = Just (sha, _, _) -> sha /= inputgitsha Nothing -> inputfilemissing Nothing -> return inputfilemissing - - -- When an input file is missing, go ahead and recompute. This way, - -- the user will see the computation fail, with an error message that - -- explains the problem. - inputfilemissing = True + where + -- When an input file is missing, go ahead and recompute. + -- This way, the user will see the computation fail, + -- with an error message that explains the problem. + -- Or, if the input file is only optionally used by the + -- computation, it might succeed. + inputfilemissing = True perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform perform o r file origkey origstate = do diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index fb895aa75c..0f8dd56901 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -15,7 +15,8 @@ By default, this only recomputes files whose input files have changed. The new contents of the input files are used to re-run the computation. When the output of the computation is different, the computed file is -updated with the new content. +updated with the new content. The updated file is written to the worktree, +but is not staged, in order to avoid overwriting any staged changes. # OPTIONS
support git files as input to computations
Using GIT keys, like are used when exporting git files to special
remotes. Except here the GIT key refers to a file checked into the git
repo.
Note that, since the compute remote uses catObject to get the content,
a symlink that is checked into git does not get followed. This is important
for security, because following a symlink and adding the content to the
repo as an annex object would allow exfiltrating content from outside
the repository.
Instead, the behavior with a symlink is to run the computation on the
symlink target. This may turn out to be confusing, and it might be worth
addcomputed checking if the file in git is a symlink and erroring out.
Or it could follow symlinks as long as the destination is a file in the
repisitory.
Using GIT keys, like are used when exporting git files to special
remotes. Except here the GIT key refers to a file checked into the git
repo.
Note that, since the compute remote uses catObject to get the content,
a symlink that is checked into git does not get followed. This is important
for security, because following a symlink and adding the content to the
repo as an annex object would allow exfiltrating content from outside
the repository.
Instead, the behavior with a symlink is to run the computation on the
symlink target. This may turn out to be confusing, and it might be worth
addcomputed checking if the file in git is a symlink and erroring out.
Or it could follow symlinks as long as the destination is a file in the
repisitory.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 857e495ad0..b0127b10ba 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -11,6 +11,8 @@ module Command.AddComputed where import Command import qualified Git +import qualified Git.Types as Git +import qualified Git.Ref as Git import qualified Annex import qualified Remote.Compute import qualified Types.Remote as Remote @@ -18,6 +20,7 @@ import Backend import Annex.CatFile import Annex.Content.Presence import Annex.Ingest +import Annex.GitShaKey import Types.KeySource import Types.Key import Messages.Progress @@ -192,20 +195,31 @@ addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fas Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state -getInputContent :: Bool -> OsPath -> Annex (Key, Maybe OsPath) +getInputContent :: Bool -> OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath)) getInputContent fast p = catKeyFile p >>= \case - Just inputkey -> getInputContent' fast inputkey (fromOsPath p) - Nothing -> ifM (liftIO $ doesFileExist p) - ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p - , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p - ) - -getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe OsPath) -getInputContent' fast inputkey filedesc = do - obj <- calcRepo (gitAnnexLocation inputkey) - if fast - then return (inputkey, Nothing) - else ifM (inAnnex inputkey) - ( return (inputkey, Just obj) - , giveup $ "The computation needs the content of a file which is not present: " ++ filedesc + Just inputkey -> getInputContent' fast inputkey filedesc + Nothing -> inRepo (Git.fileRef p) >>= \case + Just fileref -> catObjectMetaData fileref >>= \case + Just (sha, _, t) + | t == Git.BlobObject -> + getInputContent' fast (gitShaKey sha) filedesc + | otherwise -> + badinput $ ", not a git " ++ decodeBS (Git.fmtObjectType t) + Nothing -> notcheckedin + Nothing -> notcheckedin + where + filedesc = fromOsPath p + badinput s = giveup $ "The computation needs an input file " ++ s ++ ": " ++ fromOsPath p + notcheckedin = badinput "that is not checked into the git repository" + +getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe (Either Git.Sha OsPath)) +getInputContent' fast inputkey filedesc + | fast = return (inputkey, Nothing) + | otherwise = case keyGitSha inputkey of + Nothing -> ifM (inAnnex inputkey) + ( do + obj <- calcRepo (gitAnnexLocation inputkey) + return (inputkey, Just (Right obj)) + , giveup $ "The computation needs the content of an annexed file which is not present: " ++ filedesc ) + Just sha -> return (inputkey, Just (Left sha)) diff --git a/Git/Types.hs b/Git/Types.hs index a32d07d4f7..1ad145452b 100644 --- a/Git/Types.hs +++ b/Git/Types.hs @@ -145,7 +145,7 @@ newtype RefDate = RefDate String {- Types of objects that can be stored in git. -} data ObjectType = BlobObject | CommitObject | TreeObject - deriving (Show) + deriving (Show, Eq) readObjectType :: S.ByteString -> Maybe ObjectType readObjectType "blob" = Just BlobObject diff --git a/Remote/Compute.hs b/Remote/Compute.hs index eaef6d44fb..564ecbda70 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -35,6 +35,8 @@ import Annex.SpecialRemote.Config import Annex.UUID import Annex.Content import Annex.Tmp +import Annex.GitShaKey +import Annex.CatFile import Logs.MetaData import Logs.EquivilantKeys import Utility.Metered @@ -43,10 +45,11 @@ import Utility.Env import Utility.Tmp.Dir import Utility.Url import Utility.MonotonicClock -import qualified Git -import qualified Utility.SimpleProtocol as Proto import Types.Key import Backend +import qualified Git +import qualified Utility.FileIO as F +import qualified Utility.SimpleProtocol as Proto import Network.HTTP.Types.URI import Data.Time.Clock @@ -341,7 +344,7 @@ runComputeProgram :: ComputeProgram -> ComputeState -> ImmutableState - -> (OsPath -> Annex (Key, Maybe OsPath)) + -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) -> (ComputeState -> OsPath -> NominalDiffTime -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = @@ -395,12 +398,17 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) let knowninput = M.member f' (computeInputs state') checksafefile tmpdir subdir f' "input" checkimmutable knowninput "inputting" f' $ do - (k, mp) <- getinputcontent f' - mp' <- liftIO $ maybe (pure Nothing) - (Just <$$> relPathDirToFile subdir) - mp + (k, inputcontent) <- getinputcontent f' + mp <- case inputcontent of + Nothing -> pure Nothing + Just (Right f'') -> liftIO $ + Just <$> relPathDirToFile subdir f'' + Just (Left gitsha) -> do + liftIO . F.writeFile (subdir </> f') + =<< catObject gitsha + return (Just f') liftIO $ hPutStrLn (stdinHandle p) $ - maybe "" fromOsPath mp' + maybe "" fromOsPath mp liftIO $ hFlush (stdinHandle p) return $ if immutablestate then state @@ -467,10 +475,13 @@ computeKey rs (ComputeProgram program) k _af dest p vc = getinputcontent state f = case M.lookup (fromOsPath f) (computeInputs state) of - Just inputkey -> do - obj <- calcRepo (gitAnnexLocation inputkey) - -- XXX get input object when not present - return (inputkey, Just obj) + Just inputkey -> case keyGitSha inputkey of + Nothing -> do + obj <- calcRepo (gitAnnexLocation inputkey) + -- XXX get input object when not present + return (inputkey, Just (Right obj)) + Just gitsha -> + return (inputkey, Just (Left gitsha)) Nothing -> error "internal" computeskey state = diff --git a/TODO-compute b/TODO-compute index dfa629ab8b..b3f67016a7 100644 --- a/TODO-compute +++ b/TODO-compute @@ -5,13 +5,13 @@ * autoinit security -* Support non-annexed files as inputs to computations. - * addcomputed should honor annex.addunlocked. * Perhaps recompute should write a new version of a file as an unlocked file when the file is currently unlocked? +* compute on files in submodules + * recompute could ingest keys for other files than the one being recomputed, and remember them. Then recomputing those files could just use those keys, without re-running a computation. (Better than --others diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 3301381c66..faff1d96b6 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -8,8 +8,8 @@ git annex addcomputed `--to=remote -- ...` # DESCRIPTION -Adds files to the annex that are computed from input files, -using a compute special remote. +Adds files to the annex that are computed from input files in the +repository, using a compute special remote. Once a file has been added to a compute remote, commands like `git-annex get` will use it to compute the content of the file.
Added a comment: Permission fix
diff --git a/doc/tips/unlocked_files/comment_18_c1e4af3235acf9cca4730c8dabf66dce._comment b/doc/tips/unlocked_files/comment_18_c1e4af3235acf9cca4730c8dabf66dce._comment new file mode 100644 index 0000000000..7e0c11e9e3 --- /dev/null +++ b/doc/tips/unlocked_files/comment_18_c1e4af3235acf9cca4730c8dabf66dce._comment @@ -0,0 +1,45 @@ +[[!comment format=mdwn + username="czard" + avatar="http://cdn.libravatar.org/avatar/52ff2343057d711a2000506267bf91d8" + subject="Permission fix" + date="2025-03-03T12:08:28Z" + content=""" +Hi, + +Lots of gratitude for your work on git annex. + +I have an annex repo with a default setting to unlock files. When I run `git annex add myfile`, I notice a change related to permission is added to my file in the working tree, which I need to further `git add` in order to get to a clean state. See below. + +Is that expected? I'm wondering if it wouldn't make more sense / be a better experience if `git annex add myfile` would seamlessly handle that permission change and add it to git for unlocked files, so I don't have to run both `git annex add` and `git add` to get to a clean state? + +Thanks. + +``` +$ git status +On branch main +Untracked files: + (use \"git add <file>...\" to include in what will be committed) + 05 Tapestry.mp3 + +nothing added to commit but untracked files present (use \"git add\" to track) + +$ git annex add . +add 05 Tapestry.mp3 +ok +(recording state in git...) + +$ git status +On branch main +Changes to be committed: + (use \"git restore --staged <file>...\" to unstage) + new file: 05 Tapestry.mp3 + +Changes not staged for commit: + (use \"git add <file>...\" to update what will be committed) + (use \"git restore <file>...\" to discard changes in working directory) + modified: 05 Tapestry.mp3 + +$ git diff +05 Tapestry.mp3 changed file mode from 100644 to 100755 +``` +"""]]
many recompute improvements
I've lost track of them all, but it includes:
* Using the same key backend as was used in the original computation.
* Fixing bug that prevented updating the source file key in the compute
state
* Handling --reproducible and --unreproducible.
* recompute --original of a file using VURL, when the result is
different, but the key remains the same, makes the object file
be updated with the new content
* Detecting some other ways the program behavior can change, just for
completeness.
* Also adds --backend to addcomputed.
I've lost track of them all, but it includes:
* Using the same key backend as was used in the original computation.
* Fixing bug that prevented updating the source file key in the compute
state
* Handling --reproducible and --unreproducible.
* recompute --original of a file using VURL, when the result is
different, but the key remains the same, makes the object file
be updated with the new content
* Detecting some other ways the program behavior can change, just for
completeness.
* Also adds --backend to addcomputed.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 071015e014..20eacf954f 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -28,7 +28,7 @@ import qualified Data.Map as M import Data.Time.Clock cmd :: Command -cmd = notBareRepo $ +cmd = notBareRepo $ withAnnexOptions [backendOption] $ command "addcomputed" SectionCommon "add computed files to annex" (paramRepeating paramExpression) (seek <$$> optParser) @@ -96,11 +96,22 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (addComputed "adding" True r (reproducible o) Just fast) + (addComputed "adding" True r (reproducible o) chooseBackend Just fast) next $ return True -addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Maybe OsPath) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () -addComputed addaction stagefiles r reproducibleconfig destfile fast state tmpdir ts = do +addComputed + :: StringContainingQuotedPath + -> Bool + -> Remote + -> Maybe Reproducible + -> (OsPath -> Annex Backend) + -> (OsPath -> Maybe OsPath) + -> Bool + -> Remote.Compute.ComputeState + -> OsPath + -> NominalDiffTime + -> Annex () +addComputed addaction stagefiles r reproducibleconfig choosebackend destfile fast state tmpdir ts = do let outputs = Remote.Compute.computeOutputs state when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." @@ -146,22 +157,24 @@ addComputed addaction stagefiles r reproducibleconfig destfile fast state tmpdir , contentLocation = outputfile' , inodeCache = Nothing } - ingestwith a = a >>= \case - Nothing -> giveup "ingestion failed" - Just k -> do - logStatus NoLiveUpdate k InfoPresent - return k genkey f p = do - backend <- chooseBackend outputfile + backend <- choosebackend outputfile fst <$> genKey (ks f) p backend makelink f k = void $ makeLink f k Nothing ingesthelper f p mk - | stagefiles = ingestwith $ - ingestAdd' p (Just (ld f)) mk + | stagefiles = ingestwith $ do + k <- maybe (genkey f p) return mk + ingestAdd' p (Just (ld f)) (Just k) | otherwise = ingestwith $ do - mk' <- fst <$> ingest p (Just (ld f)) mk + k <- maybe (genkey f p) return mk + mk' <- fst <$> ingest p (Just (ld f)) (Just k) maybe noop (makelink f) mk' return mk' + ingestwith a = a >>= \case + Nothing -> giveup "ingestion failed" + Just k -> do + logStatus NoLiveUpdate k InfoPresent + return k ldc = LockDownConfig { lockingFile = True diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 2193216d29..4a3c8355ad 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -14,10 +14,13 @@ import qualified Annex import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote +import Annex.Content import Annex.CatFile import Git.FilePath import Logs.Location import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) +import Backend (maybeLookupBackendVariety, unknownBackendVarietyMessage) +import Types.Key import qualified Data.Map as M @@ -62,7 +65,7 @@ seek' o = do start :: RecomputeOptions -> Maybe Remote -> SeekInput -> OsPath -> Key -> CommandStart start o (Just computeremote) si file key = - stopUnless (notElem (Remote.uuid computeremote) <$> loggedLocations key) $ + stopUnless (elem (Remote.uuid computeremote) <$> loggedLocations key) $ start' o computeremote si file key start o Nothing si file key = do rs <- catMaybes <$> (mapM Remote.byUUID =<< loggedLocations key) @@ -103,31 +106,73 @@ start' o r si file key = -- explains the problem. Nothing -> True --- TODO When reproducible is not set, preserve the --- reproducible/unreproducible of the input key. perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform -perform o r file key origstate = do +perform o r file origkey origstate = do program <- Remote.Compute.getComputeProgram r - fast <- Annex.getRead Annex.fast + reproducibleconfig <- getreproducibleconfig showOutput Remote.Compute.runComputeProgram program origstate - (Remote.Compute.ImmutableState True) - (getinputcontent program fast) - (addComputed "processing" False r (reproducible o) destfile fast) + (Remote.Compute.ImmutableState False) + (getinputcontent program) + (go program reproducibleconfig) next $ return True where - getinputcontent program fast p + go program reproducibleconfig state tmpdir ts = do + checkbehaviorchange program state + addComputed "processing" False r reproducibleconfig + choosebackend destfile state tmpdir ts + + checkbehaviorchange program state = do + let check s w a b = forM_ (M.keys (w a)) $ \f -> + unless (M.member f (w b)) $ + Remote.Compute.computationBehaviorChangeError program s f + + check "not using input file" + Remote.Compute.computeInputs origstate state + check "outputting" + Remote.Compute.computeOutputs state origstate + check "not outputting" + Remote.Compute.computeOutputs origstate state + + getinputcontent program p | originalOption o = case M.lookup p (Remote.Compute.computeInputs origstate) of - Just inputkey -> getInputContent' fast inputkey + Just inputkey -> getInputContent' False inputkey (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p - | otherwise = getInputContent fast p + | otherwise = getInputContent False p destfile outputfile | Just outputfile == origfile = Just file | otherwise = Nothing - origfile = headMaybe $ M.keys $ M.filter (== Just key) + origfile = headMaybe $ M.keys $ M.filter (== Just origkey) (Remote.Compute.computeOutputs origstate) + + origbackendvariety = fromKey keyVariety origkey + + getreproducibleconfig = case reproducible o of + Just (Reproducible True) -> return (Just (Reproducible True)) + -- A VURL key is used when the computation was + -- unreproducible. So recomputing should too, but that + -- will result in the same VURL key. Since moveAnnex + -- will prefer the current annex object to a new one, + -- delete the annex object first, so that if recomputing + -- generates a new version of the file, it replaces + -- the old version. + v -> case origbackendvariety of + VURLKey -> do + lockContentForRemoval origkey noop removeAnnex + -- in case computation fails or is interupted + logStatus NoLiveUpdate origkey InfoMissing + return (Just (Reproducible False)) + _ -> return v + + choosebackend _outputfile + -- Use the same backend as was used to compute it before, + -- so if the computed file is the same, there will be + -- no change. + | otherwise = maybeLookupBackendVariety origbackendvariety >>= \case + Just b -> return b + Nothing -> giveup $ unknownBackendVarietyMessage origbackendvariety diff --git a/Remote/Compute.hs b/Remote/Compute.hs index e3ec2a8fdd..a8a3cdd32e 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -399,8 +399,8 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hPutStrLn (stdinHandle p) $ maybe "" fromOsPath mp' liftIO $ hFlush (stdinHandle p) - return $ if knowninput - then state' + return $ if immutablestate + then state else state' { computeInputs = M.insert f' k (Diff truncated)
Added a comment
diff --git a/doc/forum/b2_rclone_initremote_failed__58___empty_folders/comment_4_f08244202629f6e26209d2ab876f1b4e._comment b/doc/forum/b2_rclone_initremote_failed__58___empty_folders/comment_4_f08244202629f6e26209d2ab876f1b4e._comment new file mode 100644 index 0000000000..f11e58dfb4 --- /dev/null +++ b/doc/forum/b2_rclone_initremote_failed__58___empty_folders/comment_4_f08244202629f6e26209d2ab876f1b4e._comment @@ -0,0 +1,12 @@ +[[!comment format=mdwn + username="dmcardle" + avatar="http://cdn.libravatar.org/avatar/b79468a0d03ec3ec7acbae547c4fa994" + subject="comment 4" + date="2025-02-27T19:02:14Z" + content=""" +Hi datamanager, sorry for the delay! + +If you're up for it, could you try out a beta build? The [v1.70.0-beta.8618.431386085](https://beta.rclone.org/v1.70.0-beta.8618.431386085/) build includes [PR #7987](https://github.com/rclone/rclone/pull/7987), which removed the offending mkdir operation. It also enabled us to run some unit tests against various backends. There's reason to believe this might work for you because the tests passed against the b2 backend (see results for [2025-02-27-010118](https://pub.rclone.org/integration-tests/2025-02-27-010118/)). + + +"""]]
fix recompute of renamed files
When a computed file has been renamed, a recompute needs to write to the
new filename.
I decided to remove --others because it's not clear what it should do in
the face of renames. Should it update only other files that have not
been renamed? Or update files that use the old key to the new key
anywhere in the tree? Or write the other files to the cwd, ignoring
renames? Since --others is just a way to save on compute time, adding
this complexity at this point seems like a bad idea. May revisit later.
Added temporary TODO-compute file
When a computed file has been renamed, a recompute needs to write to the
new filename.
I decided to remove --others because it's not clear what it should do in
the face of renames. Should it update only other files that have not
been renamed? Or update files that use the old key to the new key
anywhere in the tree? Or write the other files to the cwd, ignoring
renames? Since --others is just a way to save on compute time, adding
this complexity at this point seems like a bad idea. May revisit later.
Added temporary TODO-compute file
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index b2b55fb605..071015e014 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -96,11 +96,11 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (addComputed "adding" True r (reproducible o) (const True) fast) + (addComputed "adding" True r (reproducible o) Just fast) next $ return True -addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Bool) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () -addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir ts = do +addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Maybe OsPath) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () +addComputed addaction stagefiles r reproducibleconfig destfile fast state tmpdir ts = do let outputs = Remote.Compute.computeOutputs state when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." @@ -120,29 +120,29 @@ addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir where addfile outputfile | fast = do - when (wantfile outputfile) $ - if stagefiles - then addSymlink outputfile stateurlk Nothing - else makelink stateurlk + case destfile outputfile of + Nothing -> noop + Just f + | stagefiles -> addSymlink f stateurlk Nothing + | otherwise -> makelink f stateurlk return stateurlk | isreproducible = do sz <- liftIO $ getFileSize outputfile' metered Nothing sz Nothing $ \_ p -> - if wantfile outputfile - then ingesthelper p Nothing - else genkey p - | otherwise = - if wantfile outputfile - then ingesthelper nullMeterUpdate - (Just stateurlk) - else return stateurlk + case destfile outputfile of + Just f -> ingesthelper f p Nothing + Nothing -> genkey outputfile p + | otherwise = case destfile outputfile of + Just f -> ingesthelper f nullMeterUpdate + (Just stateurlk) + Nothing -> return stateurlk where stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir </> outputfile - ld = LockedDown ldc ks - ks = KeySource - { keyFilename = outputfile + ld f = LockedDown ldc (ks f) + ks f = KeySource + { keyFilename = f , contentLocation = outputfile' , inodeCache = Nothing } @@ -151,16 +151,16 @@ addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir Just k -> do logStatus NoLiveUpdate k InfoPresent return k - genkey p = do + genkey f p = do backend <- chooseBackend outputfile - fst <$> genKey ks p backend - makelink k = void $ makeLink outputfile k Nothing - ingesthelper p mk + fst <$> genKey (ks f) p backend + makelink f k = void $ makeLink f k Nothing + ingesthelper f p mk | stagefiles = ingestwith $ - ingestAdd' p (Just ld) mk + ingestAdd' p (Just (ld f)) mk | otherwise = ingestwith $ do - mk' <- fst <$> ingest p (Just ld) mk - maybe noop makelink mk' + mk' <- fst <$> ingest p (Just (ld f)) mk + maybe noop (makelink f) mk' return mk' ldc = LockDownConfig diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 44d89f6a33..2193216d29 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -29,7 +29,6 @@ cmd = notBareRepo $ data RecomputeOptions = RecomputeOptions { recomputeThese :: CmdParams , originalOption :: Bool - , othersOption :: Bool , reproducible :: Maybe Reproducible , computeRemote :: Maybe (DeferredParse Remote) } @@ -41,10 +40,6 @@ optParser desc = RecomputeOptions ( long "original" <> help "recompute using original content of input files" ) - <*> switch - ( long "others" - <> help "stage other files that are recomputed in passing" - ) <*> parseReproducible <*> optional (mkParseRemoteOption <$> parseRemoteOption) @@ -111,25 +106,28 @@ start' o r si file key = -- TODO When reproducible is not set, preserve the -- reproducible/unreproducible of the input key. perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform -perform o r file key oldstate = do +perform o r file key origstate = do program <- Remote.Compute.getComputeProgram r fast <- Annex.getRead Annex.fast showOutput - Remote.Compute.runComputeProgram program oldstate + Remote.Compute.runComputeProgram program origstate (Remote.Compute.ImmutableState True) (getinputcontent program fast) - (addComputed "processing" False r (reproducible o) wantfile fast) + (addComputed "processing" False r (reproducible o) destfile fast) next $ return True where getinputcontent program fast p | originalOption o = - case M.lookup p (Remote.Compute.computeInputs oldstate) of + case M.lookup p (Remote.Compute.computeInputs origstate) of Just inputkey -> getInputContent' fast inputkey (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p | otherwise = getInputContent fast p - wantfile outputfile - | othersOption o = True - | otherwise = outputfile == file + destfile outputfile + | Just outputfile == origfile = Just file + | otherwise = Nothing + + origfile = headMaybe $ M.keys $ M.filter (== Just key) + (Remote.Compute.computeOutputs origstate) diff --git a/TODO-compute b/TODO-compute new file mode 100644 index 0000000000..31d8aaa7b2 --- /dev/null +++ b/TODO-compute @@ -0,0 +1,36 @@ +* recompute could ingest keys for other files than the one being + recomputed, and remember them. Then recomputing those files could just + use those keys, without re-running a computation. (Better than --others + which got removed.) + +* `git-annex recompute foo bar baz`, when foo depends on bar which depends + on baz, and when baz has changed, will not recompute foo, because bar has + not changed. It then recomputes bar. So running the command again is + needed to recompute foo. + + What it could do is, after it recomputes bar, notice that it already + considered foo, and revisit foo, and recompute it then. It could either + use a bloom filter to remember the files it considered but did not + compute, or it could just notice that the command line includes foo + (or includes a directory that contains foo), and then foo is not + modified. + + Or it could build a DAG and traverse it, but building a DAG of a large + directory tree has its own problems. + +* recompute should use the same key backend for a file that it used before + (except when --reproducible/--unreproducible is passed). + +* Check recompute's handling of --reproducible and --unreproducible. + +* addcomputed should honor annex.addunlocked. + +* Perhaps recompute should write a new version of a file as an unlocked + file when the file is currently unlocked? + +* Support non-annexed files as inputs to computations. + +* Should addcomputed honor annex.smallfiles? That would seem to imply + that recompute should also support recomputing non-annexed files. + Otherwise, adding a file and then recomputing it would vary in + what the content of the file is, depending on annex.smallfiles setting. diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index b5176285e7..b65488bae8 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -23,11 +23,6 @@ updated with the new content. Use the original content of input files. -* `--others` - - When recomputing one file also generates new versions of other files, (Diff truncated)
recompute closer to working properly
Proper behavior without --others implemented.
And eliminated most of the code duplication through refactoring.
Also, changed it to not stage recomputed files. This way, git diff will
show files that have differences.
Proper behavior without --others implemented.
And eliminated most of the code duplication through refactoring.
Also, changed it to not stage recomputed files. This way, git diff will
show files that have differences.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index f27932405e..b2b55fb605 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -14,6 +14,7 @@ import qualified Git import qualified Annex import qualified Remote.Compute import qualified Types.Remote as Remote +import Backend import Annex.CatFile import Annex.Content.Presence import Annex.Ingest @@ -24,6 +25,7 @@ import Utility.Metered import Backend.URL (fromUrl) import qualified Data.Map as M +import Data.Time.Clock cmd :: Command cmd = notBareRepo $ @@ -94,73 +96,97 @@ perform o r = do Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getInputContent fast) - (go fast) + (addComputed "adding" True r (reproducible o) (const True) fast) next $ return True + +addComputed :: StringContainingQuotedPath -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Bool) -> Bool -> Remote.Compute.ComputeState -> OsPath -> NominalDiffTime -> Annex () +addComputed addaction stagefiles r reproducibleconfig wantfile fast state tmpdir ts = do + let outputs = Remote.Compute.computeOutputs state + when (M.null outputs) $ + giveup "The computation succeeded, but it did not generate any files." + oks <- forM (M.keys outputs) $ \outputfile -> do + showAction $ addaction <> " " <> QuotedPath outputfile + k <- catchNonAsync (addfile outputfile) + (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) + return (outputfile, Just k) + let state' = state + { Remote.Compute.computeOutputs = M.fromList oks + } + forM_ (mapMaybe snd oks) $ \k -> do + Remote.Compute.setComputeState + (Remote.remoteStateHandle r) + k ts state' + logChange NoLiveUpdate k (Remote.uuid r) InfoPresent where - go fast state tmpdir ts = do - let outputs = Remote.Compute.computeOutputs state - when (M.null outputs) $ - giveup "The computation succeeded, but it did not generate any files." - oks <- forM (M.keys outputs) $ \outputfile -> do - showAction $ "adding " <> QuotedPath outputfile - k <- catchNonAsync (addfile fast state tmpdir outputfile) - (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) - return (outputfile, Just k) - let state' = state - { Remote.Compute.computeOutputs = M.fromList oks - } - forM_ (mapMaybe snd oks) $ \k -> do - Remote.Compute.setComputeState - (Remote.remoteStateHandle r) - k ts state' - logChange NoLiveUpdate k (Remote.uuid r) InfoPresent - - addfile fast state tmpdir outputfile + addfile outputfile | fast = do - addSymlink outputfile stateurlk Nothing + when (wantfile outputfile) $ + if stagefiles + then addSymlink outputfile stateurlk Nothing + else makelink stateurlk return stateurlk - | isreproducible state = do + | isreproducible = do sz <- liftIO $ getFileSize outputfile' metered Nothing sz Nothing $ \_ p -> - ingestwith $ ingestAdd p (Just ld) - | otherwise = ingestwith $ - ingestAdd' nullMeterUpdate (Just ld) (Just stateurlk) + if wantfile outputfile + then ingesthelper p Nothing + else genkey p + | otherwise = + if wantfile outputfile + then ingesthelper nullMeterUpdate + (Just stateurlk) + else return stateurlk where stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir </> outputfile - ld = LockedDown ldc $ KeySource - { keyFilename = outputfile - , contentLocation = outputfile' - , inodeCache = Nothing - } + ld = LockedDown ldc ks + ks = KeySource + { keyFilename = outputfile + , contentLocation = outputfile' + , inodeCache = Nothing + } ingestwith a = a >>= \case - Nothing -> giveup "key generation failed" + Nothing -> giveup "ingestion failed" Just k -> do logStatus NoLiveUpdate k InfoPresent return k - + genkey p = do + backend <- chooseBackend outputfile + fst <$> genKey ks p backend + makelink k = void $ makeLink outputfile k Nothing + ingesthelper p mk + | stagefiles = ingestwith $ + ingestAdd' p (Just ld) mk + | otherwise = ingestwith $ do + mk' <- fst <$> ingest p (Just ld) mk + maybe noop makelink mk' + return mk' + ldc = LockDownConfig { lockingFile = True , hardlinkFileTmpDir = Nothing , checkWritePerms = True } - isreproducible state = case reproducible o of + isreproducible = case reproducibleconfig of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state getInputContent :: Bool -> OsPath -> Annex (Key, Maybe OsPath) getInputContent fast p = catKeyFile p >>= \case - Just inputkey -> do - obj <- calcRepo (gitAnnexLocation inputkey) - if fast - then return (inputkey, Nothing) - else ifM (inAnnex inputkey) - ( return (inputkey, Just obj) - , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p - ) + Just inputkey -> getInputContent' fast inputkey (fromOsPath p) Nothing -> ifM (liftIO $ doesFileExist p) ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p ) + +getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe OsPath) +getInputContent' fast inputkey filedesc = do + obj <- calcRepo (gitAnnexLocation inputkey) + if fast + then return (inputkey, Nothing) + else ifM (inAnnex inputkey) + ( return (inputkey, Just obj) + , giveup $ "The computation needs the content of a file which is not present: " ++ filedesc + ) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 42a313ee75..a5a82b7028 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -15,14 +15,9 @@ import qualified Remote.Compute import qualified Remote import qualified Types.Remote as Remote import Annex.CatFile -import Annex.Ingest import Git.FilePath -import Types.KeySource -import Messages.Progress import Logs.Location -import Utility.Metered -import Backend.URL (fromUrl) -import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent) +import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) import qualified Data.Map as M @@ -111,81 +106,28 @@ start' o r si file key = -- recompute. This way, the user will see the -- computation fail, with an error message that -- explains the problem. - -- XXX check that this works well Nothing -> True perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.ComputeState -> CommandPerform perform o r file key oldstate = do program <- Remote.Compute.getComputeProgram r - let recomputestate = oldstate - { Remote.Compute.computeInputs = mempty - , Remote.Compute.computeOutputs = mempty - } fast <- Annex.getRead Annex.fast showOutput - Remote.Compute.runComputeProgram program recomputestate - (Remote.Compute.ImmutableState False) + Remote.Compute.runComputeProgram program oldstate + (Remote.Compute.ImmutableState True) (getinputcontent program fast) (Diff truncated)
started git-annex recompute
The perform action of this still needs work to do the right thing.
In particular, it currently behaves as if --others was always set.
And, it duplicates a lot of code from addcomputed.
The perform action of this still needs work to do the right thing.
In particular, it currently behaves as if --others was always set.
And, it duplicates a lot of code from addcomputed.
diff --git a/CmdLine/GitAnnex.hs b/CmdLine/GitAnnex.hs index 71d9f2e51f..8dc64f8b7b 100644 --- a/CmdLine/GitAnnex.hs +++ b/CmdLine/GitAnnex.hs @@ -134,6 +134,7 @@ import qualified Command.UpdateProxy import qualified Command.MaxSize import qualified Command.Sim import qualified Command.AddComputed +import qualified Command.Recompute import qualified Command.Version import qualified Command.RemoteDaemon #ifdef WITH_ASSISTANT @@ -267,6 +268,7 @@ cmds testoptparser testrunner mkbenchmarkgenerator = map addGitAnnexCommonOption , Command.MaxSize.cmd , Command.Sim.cmd , Command.AddComputed.cmd + , Command.Recompute.cmd , Command.Version.cmd , Command.RemoteDaemon.cmd #ifdef WITH_ASSISTANT diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index fad9c1dc30..9ff13f1f70 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -17,7 +17,6 @@ import qualified Types.Remote as Remote import Annex.CatFile import Annex.Content.Presence import Annex.Ingest -import Types.RemoteConfig import Types.KeySource import Messages.Progress import Logs.Location @@ -68,23 +67,20 @@ seek o = startConcurrency commandStages (seek' o) seek' :: AddComputedOptions -> CommandSeek seek' o = do r <- getParsed (computeRemote o) - unless (Remote.typename (Remote.remotetype r) == Remote.typename Remote.Compute.remote) $ + unless (Remote.Compute.isComputeRemote r) $ giveup "That is not a compute remote." - let rc = unparsedRemoteConfig (Remote.config r) - case Remote.Compute.getComputeProgram rc of - Left err -> giveup $ - "Problem with the configuration of the compute remote: " ++ err - Right program -> commandAction $ start o r program + commandAction $ start o r -start :: AddComputedOptions -> Remote -> Remote.Compute.ComputeProgram -> CommandStart -start o r program = starting "addcomputed" ai si $ perform o r program +start :: AddComputedOptions -> Remote -> CommandStart +start o r = starting "addcomputed" ai si $ perform o r where ai = ActionItemUUID (Remote.uuid r) (UnquotedString (Remote.name r)) si = SeekInput (computeParams o) -perform :: AddComputedOptions -> Remote -> Remote.Compute.ComputeProgram -> CommandPerform -perform o r program = do +perform :: AddComputedOptions -> Remote -> CommandPerform +perform o r = do + program <- Remote.Compute.getComputeProgram r repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") let state = Remote.Compute.ComputeState @@ -100,24 +96,10 @@ perform o r program = do showOutput Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) - (getinputcontent fast) + (getInputContent fast) (go starttime fast) next $ return True where - getinputcontent fast p = catKeyFile p >>= \case - Just inputkey -> do - obj <- calcRepo (gitAnnexLocation inputkey) - if fast - then return (inputkey, Nothing) - else ifM (inAnnex inputkey) - ( return (inputkey, Just obj) - , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p - ) - Nothing -> ifM (liftIO $ doesFileExist p) - ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p - , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p - ) - go starttime fast state tmpdir = do endtime <- liftIO currentMonotonicTimestamp let ts = calcduration starttime endtime @@ -175,3 +157,18 @@ perform o r program = do isreproducible state = case reproducible o of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible state + +getInputContent :: Bool -> OsPath -> Annex (Key, Maybe OsPath) +getInputContent fast p = catKeyFile p >>= \case + Just inputkey -> do + obj <- calcRepo (gitAnnexLocation inputkey) + if fast + then return (inputkey, Nothing) + else ifM (inAnnex inputkey) + ( return (inputkey, Just obj) + , giveup $ "The computation needs the content of a file which is not present: " ++ fromOsPath p + ) + Nothing -> ifM (liftIO $ doesFileExist p) + ( giveup $ "The computation needs an input file that is not an annexed file: " ++ fromOsPath p + , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p + ) diff --git a/Command/Recompute.hs b/Command/Recompute.hs new file mode 100644 index 0000000000..95f8f3e16f --- /dev/null +++ b/Command/Recompute.hs @@ -0,0 +1,202 @@ +{- git-annex command + - + - Copyright 2025 Joey Hess <id@joeyh.name> + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE OverloadedStrings #-} + +module Command.Recompute where + +import Command +import qualified Git +import qualified Annex +import qualified Remote.Compute +import qualified Remote +import qualified Types.Remote as Remote +import Annex.CatFile +import Annex.Content.Presence +import Annex.Ingest +import Git.FilePath +import Types.RemoteConfig +import Types.KeySource +import Messages.Progress +import Logs.Location +import Utility.Metered +import Utility.MonotonicClock +import Backend.URL (fromUrl) +import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent) + +import qualified Data.Map as M +import Data.Time.Clock + +cmd :: Command +cmd = notBareRepo $ + command "recompute" SectionCommon "recompute computed files" + paramPaths (seek <$$> optParser) + +data RecomputeOptions = RecomputeOptions + { recomputeThese :: CmdParams + , originalOption :: Bool + , othersOption :: Bool + , reproducible :: Maybe Reproducible + , computeRemote :: Maybe (DeferredParse Remote) + } + +optParser :: CmdParamsDesc -> Parser RecomputeOptions +optParser desc = RecomputeOptions + <$> cmdParams desc + <*> switch + ( long "original" + <> help "recompute using original content of input files" + ) + <*> switch + ( long "others" + <> help "stage other files that are recomputed in passing" + ) + <*> parseReproducible + <*> optional (mkParseRemoteOption <$> parseRemoteOption) + +seek :: RecomputeOptions -> CommandSeek +seek o = startConcurrency commandStages (seek' o) + +seek' :: RecomputeOptions -> CommandSeek +seek' o = do + computeremote <- maybe (pure Nothing) (Just <$$> getParsed) + (computeRemote o) + let seeker = AnnexedFileSeeker + { startAction = const $ start o computeremote + , checkContentPresent = Nothing + , usesLocationLog = True + } + withFilesInGitAnnex ww seeker + =<< workTreeItems ww (recomputeThese o) + where + ww = WarnUnmatchLsFiles "recompute" + +start :: RecomputeOptions -> Maybe Remote -> SeekInput -> OsPath -> Key -> CommandStart +start o (Just computeremote) si file key = + stopUnless (notElem (Remote.uuid computeremote) <$> loggedLocations key) $ + start' o computeremote si file key +start o Nothing si file key = do + rs <- catMaybes <$> (mapM Remote.byUUID =<< loggedLocations key) + case sortOn Remote.cost $ filter Remote.Compute.isComputeRemote rs of + [] -> stop + (r:_) -> start' o r si file key (Diff truncated)
addcomputed inherits extra initremote parameters
This is limited because the remote config is a field/value map. So order
is not preserved, and when 2 parameters have the same field name, only
the last one will be passed.
This is limited because the remote config is a field/value map. So order
is not preserved, and when 2 parameters have the same field name, only
the last one will be passed.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 8af8e7c900..8b983a738c 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -88,8 +88,8 @@ perform o r program = do repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") let state = Remote.Compute.ComputeState - -- TODO add inherited initremote parameters - { Remote.Compute.computeParams = computeParams o + { Remote.Compute.computeParams = computeParams o ++ + Remote.Compute.defaultComputeParams r , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir diff --git a/Remote/Compute.hs b/Remote/Compute.hs index b6ba1dbf2e..09ab45687a 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -17,6 +17,7 @@ module Remote.Compute ( getComputeProgram, runComputeProgram, ImmutableState(..), + defaultComputeParams, ) where import Annex.Common @@ -127,6 +128,11 @@ computeConfigParser _ = return $ RemoteConfigParser ) } +defaultComputeParams :: Remote -> [String] +defaultComputeParams = map mk . M.toList . getRemoteConfigPassedThrough . config + where + mk (f, v) = fromProposedAccepted f ++ '=' : v + newtype ComputeProgram = ComputeProgram String deriving (Show) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 1da287f0cf..245d4a04b0 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -33,7 +33,9 @@ Some examples of how this might look: git-annex addcomputed --to=y -- compress foo --level=9 git-annex addcomputed --to=z -- clip foo 2:01-3:00 combine with bar to baz - +Note that parameters that were passed to `git-annex initremote` +when setting up the compute special remote will be appended to the end of +the parameters provided to `git-annex addcomputed`. # OPTIONS @@ -88,6 +90,8 @@ Some examples of how this might look: [[git-annex-recompute]](1) +[[git-annex-initremote]](1) + # AUTHOR Joey Hess <id@joeyh.name> diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index b840f5fcbe..c3f4186008 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -2,6 +2,10 @@ While other remotes store the contents of annexed files somewhere, this special remote uses a program to compute the contents of annexed files. +To add a file to a compute special remote, use the [[git-annex-addcomputed]] +command. Once a file has been added to a compute special remote, commands +like `git-annex get` will use it to compute the content of the file. + To enable an instance of this special remote: # git-annex initremote myremote type=compute program=git-annex-compute-foo @@ -11,16 +15,11 @@ program to use to compute the contents of annexed files. It must start with "git-annex-compute-". The program needs to be installed somewhere in the `PATH`. -To add a file to a compute special remote, use the [[git-annex-addcomputed]] -command. Once a file has been added to a compute special remote, commands -like `git-annex get` will use it to compute the content of the file. - -You can provide other parameters to `initremote`, in order to provide -default configuration values to use when adding files with -[[git-annex-addcomputed]]. To see a list of all the configuration values -supported by a given program, pass `--whatelse` to `initremote`: +All other "field=value" parameters passed to `initremote` will be passed +to the program when running [[git-annex-addcomputed]]. Note that when the +program takes a dashed option, it can be provided after "--": - # git-annex initremote myremote type=compute program=git-annex-compute-foo --whatelse + # git-annex initremote myremote type=compute program=git-annex-compute-foo -- --level=9 ## compute programs
wording
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index f8fa92ed46..cd53a04aa1 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -27,8 +27,8 @@ unprotected, or otherwise executing it. The program is run in a temporary directory, which will be cleaned up after it exits. Note that it may be run in a subdirectory of its temporary -directory. Eg, when `git-annex addcomputed` is run in a `foo/bar/` -subdirectory of the git repository. +directory. This is done when `git-annex addcomputed` was run in a subdirectory +of the git repository. The content of any annexed file in the repository can be an input to the computation. The program requests an input by writing a line to @@ -40,8 +40,8 @@ Then it can read a line from stdin, which will be the path to the content (eg a `.git/annex/objects/` path). If the program needs multiple input files, it should output multiple -`INPUT` lines at once, and then read multiple paths from stdin. This -allows retrival of the inputs to potentially run in parallel. +`INPUT` lines first, and then read multiple paths from stdin. This +allows retrieval of the inputs to potentially run in parallel. If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails,
pdate demo program
needed a mkdir
needed a mkdir
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 34b7da7e77..f8fa92ed46 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -92,16 +92,17 @@ An example `git-annex-compute-foo` shell script follows: #!/bin/sh set -e if [ "$1" != "convert" ]; then - echo "Usage: convert input output [passes=n]" >&2 - exit 1 + echo "Usage: convert input output [passes=n]" >&2 + exit 1 fi if [ -z "$ANNEX_COMPUTE_passes" ]; then - ANNEX_COMPUTE_passes=1 + ANNEX_COMPUTE_passes=1 fi echo "INPUT $2" read input echo "OUTPUT $3" echo REPRODUCIBLE if [ -n "$input" ]; then - frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + mkdir -p "$(dirname "$3")" + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" fi
use compute program REPRODUCIBLE by default
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 4edce492e2..0d614a1fc0 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -37,14 +37,14 @@ cmd = notBareRepo $ data AddComputedOptions = AddComputedOptions { computeParams :: CmdParams , computeRemote :: DeferredParse Remote - , reproducible :: Reproducible + , reproducible :: Maybe Reproducible } optParser :: CmdParamsDesc -> Parser AddComputedOptions optParser desc = AddComputedOptions <$> cmdParams desc <*> (mkParseRemoteOption <$> parseToOption) - <*> (fromMaybe (Reproducible False) <$> parseReproducible) + <*> parseReproducible newtype Reproducible = Reproducible { isReproducible :: Bool } @@ -92,7 +92,7 @@ perform o r program = do , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir - , Remote.Compute.computeReproducible = isreproducible + , Remote.Compute.computeReproducible = False } fast <- Annex.getRead Annex.fast starttime <- liftIO currentMonotonicTimestamp @@ -140,7 +140,7 @@ perform o r program = do | fast = do addSymlink outputfile stateurlk Nothing return stateurlk - | isreproducible = do + | isreproducible state = do sz <- liftIO $ getFileSize outputfile' metered Nothing sz Nothing $ \_ p -> ingestwith $ ingestAdd p (Just ld) @@ -170,4 +170,6 @@ perform o r program = do calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = fromIntegral (endtime - starttime) :: NominalDiffTime - isreproducible = isReproducible (reproducible o) + isreproducible state = case reproducible o of + Just v -> isReproducible v + Nothing -> Remote.Compute.computeReproducible state diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 9f096770b7..1da287f0cf 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -53,8 +53,11 @@ Some examples of how this might look: * `--fast` - Adds computed files to the repository, without generating their content - yet. + Adds computed files to the repository, without doing any work yet to + compute their content. + + This implies `--unreproducible`, because even if the compute remote + produces reproducible output, it's not available. * `--unreproducible`, `-u` @@ -70,7 +73,7 @@ Some examples of how this might look: Indicate that the computation is expected to be fully reproducible. This is the default when the compute remote indicates that it produces - reproducible output. + reproducible output (except when using `--fast`). If a computation turns out not to be fully reproducible, then getting the file from the compute remote will later fail with a checksum
addcomputed --fast and --unreproducible working
For these, use VURL and URL keys, with an "annex-compute:" URI prefix.
These URL keys will look something like this:
URL--annex-compute&cbar4,63pconvert,3-f4d3d72cf3f16ac9c3e9a8012bde4462
Generally it's too long so most of it gets md5summed. It's a little
ugly, but it's what fell out of the existing URL key generation
machinery. I did consider special casing to eg
"URL--annex-compute&c4d3d72cf3f16ac9c3e9a8012bde4462". But it seems at
least possibly useful that the name of the file that was computed is
visible and perhaps one or two words of the git-annex compute command
parameters.
Note that two different output files from the same computation will get
the same URL key. And these keys should remain stable.
For these, use VURL and URL keys, with an "annex-compute:" URI prefix.
These URL keys will look something like this:
URL--annex-compute&cbar4,63pconvert,3-f4d3d72cf3f16ac9c3e9a8012bde4462
Generally it's too long so most of it gets md5summed. It's a little
ugly, but it's what fell out of the existing URL key generation
machinery. I did consider special casing to eg
"URL--annex-compute&c4d3d72cf3f16ac9c3e9a8012bde4462". But it seems at
least possibly useful that the name of the file that was computed is
visible and perhaps one or two words of the git-annex compute command
parameters.
Note that two different output files from the same computation will get
the same URL key. And these keys should remain stable.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index d80fb168da..01a334bf9e 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -20,8 +20,9 @@ import Annex.Ingest import Types.RemoteConfig import Types.KeySource import Messages.Progress -import Utility.MonotonicClock import Logs.Location +import Utility.MonotonicClock +import Backend.URL (fromUrl) import qualified Data.Map as M import Data.Time.Clock @@ -42,19 +43,19 @@ optParser :: CmdParamsDesc -> Parser AddComputedOptions optParser desc = AddComputedOptions <$> cmdParams desc <*> (mkParseRemoteOption <$> parseToOption) - <*> (fromMaybe Unreproducible <$> parseReproducible) + <*> (fromMaybe (Reproducible False) <$> parseReproducible) -data Reproducible = Reproducible | Unreproducible +newtype Reproducible = Reproducible { isReproducible :: Bool } parseReproducible :: Parser (Maybe Reproducible) parseReproducible = r <|> unr where - r = flag Nothing (Just Reproducible) + r = flag Nothing (Just (Reproducible True)) ( long "reproducible" <> short 'r' <> help "computation is fully reproducible" ) - unr = flag Nothing (Just Unreproducible) + unr = flag Nothing (Just (Reproducible False)) ( long "unreproducible" <> short 'u' <> help "computation is not fully reproducible" @@ -90,17 +91,14 @@ perform o r program = do , Remote.Compute.computeInputs = mempty , Remote.Compute.computeOutputs = mempty , Remote.Compute.computeSubdir = subdir - , Remote.Compute.computeReproducible = - case reproducible o of - Reproducible -> True - Unreproducible -> False + , Remote.Compute.computeReproducible = isreproducible } fast <- Annex.getRead Annex.fast starttime <- liftIO currentMonotonicTimestamp Remote.Compute.runComputeProgram program state (Remote.Compute.ImmutableState False) (getinputcontent fast) - (go starttime) + (go starttime fast) next $ return True where getinputcontent fast p = catKeyFile p >>= \case @@ -117,7 +115,7 @@ perform o r program = do , giveup $ "The computation needs an input file which does not exist: " ++ fromOsPath p ) - go starttime state tmpdir = do + go starttime fast state tmpdir = do endtime <- liftIO currentMonotonicTimestamp let ts = calcduration starttime endtime let outputs = Remote.Compute.computeOutputs state @@ -125,7 +123,7 @@ perform o r program = do giveup "The computation succeeded, but it did not generate any files." oks <- forM (M.keys outputs) $ \outputfile -> do showAction $ "adding " <> QuotedPath outputfile - k <- catchNonAsync (addfile tmpdir outputfile) + k <- catchNonAsync (addfile fast state tmpdir outputfile) (\err -> giveup $ "Failed to ingest output file " ++ fromOsPath outputfile ++ ": " ++ show err) return (outputfile, Just k) let state' = state @@ -137,24 +135,32 @@ perform o r program = do k ts state' logChange NoLiveUpdate k (Remote.uuid r) InfoPresent - addfile tmpdir outputfile = do - let outputfile' = tmpdir </> outputfile - let ld = LockedDown ldc $ KeySource - { keyFilename = outputfile - , contentLocation = outputfile' - , inodeCache = Nothing - } - sz <- liftIO $ getFileSize outputfile' - metered Nothing sz Nothing $ \_ p -> - ingestAdd p (Just ld) >>= \case - Nothing -> giveup "key generation failed" - Just k -> return k + addfile fast state tmpdir outputfile + | fast || not isreproducible = do + let stateurl = Remote.Compute.computeStateUrl state outputfile + let k = fromUrl stateurl Nothing isreproducible + addSymlink outputfile k Nothing + return k + | otherwise = do + let outputfile' = tmpdir </> outputfile + let ld = LockedDown ldc $ KeySource + { keyFilename = outputfile + , contentLocation = outputfile' + , inodeCache = Nothing + } + sz <- liftIO $ getFileSize outputfile' + metered Nothing sz Nothing $ \_ p -> + ingestAdd p (Just ld) >>= \case + Nothing -> giveup "key generation failed" + Just k -> return k ldc = LockDownConfig { lockingFile = True , hardlinkFileTmpDir = Nothing , checkWritePerms = True } - + calcduration (MonotonicTimestamp starttime) (MonotonicTimestamp endtime) = fromIntegral (endtime - starttime) :: NominalDiffTime + + isreproducible = isReproducible (reproducible o) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index cb2bd1f479..1157ac581d 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -12,6 +12,7 @@ module Remote.Compute ( ComputeState(..), setComputeState, getComputeStates, + computeStateUrl, ComputeProgram, getComputeProgram, runComputeProgram, @@ -36,6 +37,7 @@ import Utility.Metered import Utility.TimeStamp import Utility.Env import Utility.Tmp.Dir +import Utility.Url import qualified Git import qualified Utility.SimpleProtocol as Proto @@ -190,7 +192,10 @@ data ComputeState = ComputeState - and computeOutputs are sorted in ascending order for stability. -} formatComputeState :: Key -> ComputeState -> B.ByteString -formatComputeState k st = renderQuery False $ concat +formatComputeState k = formatComputeState' (Just k) + +formatComputeState' :: Maybe Key -> ComputeState -> B.ByteString +formatComputeState' mk st = renderQuery False $ concat [ map formatparam (computeParams st) , map formatinput (M.toAscList (computeInputs st)) , mapMaybe formatoutput (M.toAscList (computeOutputs st)) @@ -202,7 +207,7 @@ formatComputeState k st = renderQuery False $ concat ("i" <> fromOsPath file, Just (serializeKey' key)) formatoutput (file, (Just key)) = Just $ ("o" <> fromOsPath file, - if key == k + if Just key == mk then Nothing else Just (serializeKey' key) ) @@ -251,6 +256,17 @@ parseComputeState k b = _ -> Nothing in go c' rest +{- A compute: url for a given output file of a computation. -} +computeStateUrl :: ComputeState -> OsPath -> URLString +computeStateUrl st p = + "annex-compute:" ++ fromOsPath p ++ "?" + ++ decodeBS (formatComputeState' Nothing st') + where + -- Omit computeOutputs, so this gives the same result whether + -- it's called on a ComputeState with the computeOutputs + -- Keys populated or not. + st' = st { computeOutputs = mempty } + {- The per remote metadata is used to store ComputeState. This allows - recording multiple ComputeStates that generate the same key. - diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index bca6e1144d..9f096770b7 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -54,7 +54,7 @@ Some examples of how this might look: * `--fast` Adds computed files to the repository, without generating their content - yet. + yet. * `--unreproducible`, `-u`
diff --git a/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn b/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn index 65139d8e9a..f099c4e7c1 100644 --- a/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn +++ b/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn @@ -25,6 +25,7 @@ instead of the expected: ### What version of git-annex are you using? On what operating system? git-annex version: 10.20230926-g4ac2758ba589562e427a66437b9fdcd5172357e1 + OS: Arch Linux ### Please provide any additional information below.
create bug report: creating can't pass spaces in youtube-dl-options
diff --git a/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn b/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn new file mode 100644 index 0000000000..65139d8e9a --- /dev/null +++ b/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options.mdwn @@ -0,0 +1,95 @@ +### Please describe the problem. + +When setting arguments for yt-dlp though `annex.youtube-dl-options`, +there doesn't seem to be a way to pass arguments that contain spaces. + +For example, I'd like to pass these arguments to yt-dlp: `--ppa "Merger+ffmpeg: -metadata:s:v:0 handler_name= -bitexact"` + +### What steps will reproduce the problem? + +[[!format sh """ +$ git config 'annex.youtube-dl-options' '--ppa "Merger+ffmpeg: -metadata:s:v:0 handler_name= -bitexact"' +$ git -c 'annex.security.allowed-ip-addresses=all' annex addurl --debug --no-raw 'https://www.youtube.com/watch?v=...' +""" +]] + +Observe that git annex tries to run yt-dlp with the following arguments: + +`["--ppa","\"Merger+ffmpeg:","-metadata:s:v:0","handler_name=","-bitexact\"",...]` + +instead of the expected: + +`["--ppa","Merger+ffmpeg: -metadata:s:v:0 handler_name= -bitexact",...]` + + +### What version of git-annex are you using? On what operating system? + +git-annex version: 10.20230926-g4ac2758ba589562e427a66437b9fdcd5172357e1 +OS: Arch Linux + +### Please provide any additional information below. + +[[!format sh """ +# If you can, paste a complete transcript of the problem occurring here. +# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log +$ git -c 'annex.security.allowed-ip-addresses=all' annex addurl --debug --no-raw 'https://www.youtube.com/watch?v=SUBjhViTLkc' +[2025-02-25 19:59:58.414615162] (Utility.Process) process [1753238] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","show-ref","git-annex"] +[2025-02-25 19:59:58.415898944] (Utility.Process) process [1753238] done ExitSuccess +[2025-02-25 19:59:58.416303507] (Utility.Process) process [1753239] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","show-ref","--hash","refs/heads/git-annex"] +[2025-02-25 19:59:58.417687901] (Utility.Process) process [1753239] done ExitSuccess +[2025-02-25 19:59:58.418868643] (Utility.Process) process [1753240] chat: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch"] +addurl https://www.youtube.com/watch?v=SUBjhViTLkc [2025-02-25 19:59:58.440089439] (Utility.Url) Request { + host = "www.youtube.com" + port = 443 + secure = True + requestHeaders = [("Accept-Encoding",""),("User-Agent","git-annex/10.20230926-g4ac2758ba589562e427a66437b9fdcd5172357e1")] + path = "/watch" + queryString = "?v=SUBjhViTLkc" + method = "HEAD" + proxy = Nothing + rawBody = False + redirectCount = 10 + responseTimeout = ResponseTimeoutDefault + requestVersion = HTTP/1.1 + proxySecureMode = ProxySecureWithConnect +} + +[2025-02-25 19:59:59.355890186] (Utility.Process) process [1753242] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","symbolic-ref","-q","HEAD"] +[2025-02-25 19:59:59.357185374] (Utility.Process) process [1753242] done ExitSuccess +[2025-02-25 19:59:59.357432029] (Utility.Process) process [1753243] read: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","show-ref","refs/heads/master"] +[2025-02-25 19:59:59.358927257] (Utility.Process) process [1753243] done ExitSuccess +[2025-02-25 19:59:59.359545264] (Utility.Process) process [1753244] chat: git ["--git-dir=../../.git","--work-tree=../..","--literal-pathspecs","-c","annex.debug=true","check-attr","-z","--stdin","annex.backend","annex.largefiles","annex.numcopies","annex.mincopies","--"] + +[2025-02-25 19:59:59.362120984] (Utility.Url) Request { + host = "www.youtube.com" + port = 443 + secure = True + requestHeaders = [("Accept-Encoding","identity"),("User-Agent","git-annex/10.20230926-g4ac2758ba589562e427a66437b9fdcd5172357e1")] + path = "/watch" + queryString = "?v=SUBjhViTLkc" + method = "GET" + proxy = Nothing + rawBody = False + redirectCount = 10 + responseTimeout = ResponseTimeoutDefault + requestVersion = HTTP/1.1 + proxySecureMode = ProxySecureWithConnect +} + +[2025-02-25 20:00:00.311641851] (Utility.Process) process [1753245] read: yt-dlp ["--ppa","\"Merger+ffmpeg:","-metadata:s:v:0","handler_name=","-bitexact\"","https://www.youtube.com/watch?v=SUBjhViTLkc","--get-filename","--no-warnings","--no-playlist"] +[2025-02-25 20:00:00.594102346] (Utility.Process) process [1753245] done ExitFailure 2 + + Unable to use youtube-dl or a special remote and --no-raw was specified: no media in url +failed +[2025-02-25 20:00:00.594927375] (Utility.Process) process [1753240] done ExitSuccess +[2025-02-25 20:00:00.59528691] (Utility.Process) process [1753244] done ExitSuccess +addurl: 1 failed +# End of transcript or log. +"""]] + +### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) + +So far I've been using it to store .deb packages in my ansible repo. + +Recently, I've been trying to figure out a workflow for managing my media collection with it. +Other than a few rough edges around non-raw `addurl`, it's been great!
handle comutations in subdirs of the git repository
Eg, a computation might be run in "foo/" and refer to "../bar" as an
input or output.
So, the subdir is part of the computation state.
Also, prevent input or output of files that are outside the git
repository. Of course, the program can access any file on disk if it
wants to; this is just a guard against mistakes. And it may also be
useful if the program comunicates with something less trusted than it,
eg a container image, so input/output files communicated by that are not
the source of security problems.
Eg, a computation might be run in "foo/" and refer to "../bar" as an
input or output.
So, the subdir is part of the computation state.
Also, prevent input or output of files that are outside the git
repository. Of course, the program can access any file on disk if it
wants to; this is just a guard against mistakes. And it may also be
useful if the program comunicates with something less trusted than it,
eg a container image, so input/output files communicated by that are not
the source of security problems.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 08b7f385d4..06017e6365 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -12,8 +12,10 @@ module Remote.Compute ( ComputeState(..), setComputeState, getComputeStates, + ComputeProgram, getComputeProgram, runComputeProgram, + ImmutableState(..), ) where import Annex.Common @@ -33,6 +35,7 @@ import Logs.MetaData import Utility.Metered import Utility.TimeStamp import Utility.Env +import Utility.Tmp.Dir import qualified Git import qualified Utility.SimpleProtocol as Proto @@ -166,8 +169,9 @@ instance Proto.Serializable PercentFloat where data ComputeState = ComputeState { computeParams :: [String] - , computeInputs :: M.Map FilePath Key - , computeOutputs :: M.Map FilePath (Maybe Key) + , computeInputs :: M.Map OsPath Key + , computeOutputs :: M.Map OsPath (Maybe Key) + , computeSubdir :: OsPath , computeReproducible :: Bool } deriving (Show, Eq) @@ -175,12 +179,12 @@ data ComputeState = ComputeState {- Formats a ComputeState as an URL query string. - - Prefixes computeParams with 'p', computeInputs with 'i', - - and computeOutput with 'o'. + - and computeOutput with 'o'. Uses "d" for computeSubdir. - - When the passed Key is an output, rather than duplicate it - in the query string, that output has no value. - - - Example: "psomefile&pdestfile&pbaz&isomefile=WORM--foo&odestfile=" + - Example: "psomefile&pdestfile&pbaz&isomefile=WORM--foo&odestfile=&d=subdir" - - The computeParams are in the order they were given. The computeInputs - and computeOutputs are sorted in ascending order for stability. @@ -190,13 +194,14 @@ formatComputeState k st = renderQuery False $ concat [ map formatparam (computeParams st) , map formatinput (M.toAscList (computeInputs st)) , mapMaybe formatoutput (M.toAscList (computeOutputs st)) + , [("d", Just (fromOsPath (computeSubdir st)))] ] where formatparam p = ("p" <> encodeBS p, Nothing) formatinput (file, key) = - ("i" <> toRawFilePath file, Just (serializeKey' key)) + ("i" <> fromOsPath file, Just (serializeKey' key)) formatoutput (file, (Just key)) = Just $ - ("o" <> toRawFilePath file, + ("o" <> fromOsPath file, if key == k then Nothing else Just (serializeKey' key) @@ -208,7 +213,7 @@ parseComputeState k b = let st = go emptycomputestate (parseQuery b) in if st == emptycomputestate then Nothing else Just st where - emptycomputestate = ComputeState mempty mempty mempty False + emptycomputestate = ComputeState mempty mempty mempty "." False go :: ComputeState -> [QueryItem] -> ComputeState go c [] = c { computeParams = reverse (computeParams c) } go c ((f, v):rest) = @@ -220,7 +225,7 @@ parseComputeState k b = key <- deserializeKey' =<< v Just $ c { computeInputs = - M.insert i key + M.insert (toOsPath i) key (computeInputs c) } ('o':o) -> case v of @@ -228,14 +233,21 @@ parseComputeState k b = key <- deserializeKey' kv Just $ c { computeOutputs = - M.insert o (Just key) + M.insert (toOsPath o) + (Just key) (computeOutputs c) } Nothing -> Just $ c { computeOutputs = - M.insert o (Just k) + M.insert (toOsPath o) + (Just k) (computeOutputs c) } + ('d':[]) -> do + subdir <- v + Just $ c + { computeSubdir = toOsPath subdir + } _ -> Nothing in go c' rest @@ -288,14 +300,14 @@ runComputeProgram -> (ComputeState -> OsPath -> Annex v) -> Annex v runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) getinputcontent cont = - withOtherTmp $ \tmpdir -> - go tmpdir - `finally` liftIO (removeDirectoryRecursive tmpdir) + withOtherTmp $ \othertmpdir -> + withTmpDirIn othertmpdir "compute" go where go tmpdir = do environ <- computeProgramEnvironment state + subdir <- liftIO $ getsubdir tmpdir let pr = (proc program (computeParams state)) - { cwd = Just (fromOsPath tmpdir) + { cwd = Just (fromOsPath subdir) , std_in = CreatePipe , std_out = CreatePipe , env = Just environ @@ -303,16 +315,26 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) state' <- bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) - (getinput state tmpdir) - cont state' tmpdir + (getinput state tmpdir subdir) + cont state' subdir + + getsubdir tmpdir = do + let subdir = tmpdir </> computeSubdir state + ifM (dirContains <$> absPath tmpdir <*> absPath subdir) + ( do + createDirectoryIfMissing True subdir + return subdir + -- Ignore unsafe value in state. + , return tmpdir + ) - getinput state' tmpdir p = + getinput state' tmpdir subdir p = liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> getinput state' tmpdir p + | null l -> getinput state' tmpdir subdir p | otherwise -> do - state'' <- parseoutput p state' l - getinput state'' tmpdir p + state'' <- parseoutput p tmpdir subdir state' l + getinput state'' tmpdir subdir p Nothing -> do liftIO $ hClose (stdoutHandle p) liftIO $ hClose (stdinHandle p) @@ -320,28 +342,36 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) giveup $ program ++ " exited unsuccessfully" return state' - parseoutput p state' l = case Proto.parseMessage l of - Just (ProcessInput f) -> - let knowninput = M.member f (computeInputs state') - in checkimmutable knowninput l $ do - (k, mp) <- getinputcontent (toOsPath f) + parseoutput p tmpdir subdir state' l = case Proto.parseMessage l of + Just (ProcessInput f) -> do + let f' = toOsPath f + let knowninput = M.member f' (computeInputs state') + checksafefile tmpdir subdir f' "input" + checkimmutable knowninput l $ do + (k, mp) <- getinputcontent f' + mp' <- liftIO $ maybe (pure Nothing) + (Just <$$> relPathDirToFile subdir) + mp liftIO $ hPutStrLn (stdinHandle p) $ - maybe "" fromOsPath mp + maybe "" fromOsPath mp' + liftIO $ hFlush (stdinHandle p) return $ if knowninput then state' else state' { computeInputs = - M.insert f k + M.insert f' k (computeInputs state') } - Just (ProcessOutput f) -> - let knownoutput = M.member f (computeOutputs state') - in checkimmutable knownoutput l $ + Just (ProcessOutput f) -> do + let f' = toOsPath f + checksafefile tmpdir subdir f' "output" + let knownoutput = M.member f' (computeOutputs state') + checkimmutable knownoutput l $ (Diff truncated)
updated interface
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 8b1a732e7a..5c771c17ad 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -8,62 +8,76 @@ When an compute special remote is initremoted, a program is specified: git-annex initremote myremote type=compute program=git-annex-compute-foo The user adds an annexed file that is computed by the program by running -a command like this: +a command like one of these: - git-annex addcomputed --to myremote \ - --input raw=file.raw --value passes=10 \ - --output photo=file.jpeg + git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10 + git-annex addcomputed --to=myremote -- compress in out --level=9 + git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz -That command and later `git-annex get` of a computed file both -run the program the same way. +Whatever values the user passes to `git-annex addcomputed` are passed to +the program in `ARGV`, followed by any values that the user provided to +`git-annex initremote`. -The program is passed inputs to the computation via environment variables, -which are all prefixed with `"ANNEX_COMPUTE_"`. +To simplify the program's option parsing, any value that the user provides +that is in the form "foo=bar" will also result in an environment variable +being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. -In the example above, the program will be passed this environment: +For security, the program should avoid exposing user input to the shell +unprotected, or otherwise executing it. - ANNEX_COMPUTE_INPUT_raw=/path/.git/annex/objects/.. - ANNEX_COMPUTE_VALUE_passes=10 +The program is run in a temporary directory, which will be cleaned up after +it exits. -Default values that are provided to `git-annex initremote` will also be set -in the environment. Eg `git-annex initremote myremote type=compute -program=foo passes=9` will set `ANNEX_COMPUTE_VALUE_passes=9` by default. +The content of any annexed file in the repository can be an input +to the computation. The program requests an input by writing a line to +stdout: -For security, the program should avoid exposing values from `ANNEX_COMPUTE_*` -variables to the shell unprotected, or otherwise executing them. + INPUT file.raw -The program will also inherit other environment variables -that were set when git-annex was run, like PATH. (`ANNEX_COMPUTE_*` -environment variables are not inherited.) +Then it can read a line from stdin, which will be the path to the content +(eg a `.git/annex/objects/` path). -The program is run in a temporary directory, which will be cleaned up after -it exits. It writes the files that it computes to that directory. +If the program needs multiple input files, it should output multiple +`INPUT` lines at once, and then read multiple paths from stdin. This +allows retrival of the inputs to potentially run in parallel. -Before starting the main computation, the program must output a list of the -files that it will compute, in the form "COMPUTING Id filename". -Here "Id" is a short identifier for a particular file, which the -user specifies when running `git-annex addcomputed`. +If an input file is not available, the program's stdin will be closed +without a path being written to it. So when reading from stdin fails, +the program should exit. -In the example above, the program is expected to output something like: +When `git-annex addcomputed --fast` is being used to add a computation +to the git-annex repository without actually performing it, the +response to each "INPUT" will be an empty line rather than the path to +an input file. In that case, the program should proceed with the rest of +its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not +perform any computation. - COMPUTING photo out.jpeg - COMPUTING sidecar otherfile +For each output file that it will compute, the program should write a +line to stdout: -If possible, the program should write the content of the file it is -computing directly to the file listed in COMPUTING, rather than writing to -somewhere else and renaming it at the end. Except, when the program writes -the file it computes out of order, it should write to a file somewhere else -and rename it at the end. + OUTPUT file.jpeg -If git-annex sees that the file corresponding to the key it requested be -computed is growing, it will use its file size when displaying progress to -the user. +The filename of the output file is both the filename in the program's +temporary directory, and also the filename that will be added to the +git-annex repository by `git-annex compute`. + +If git-annex sees that an output file is growing, it will use its file size +when displaying progress to the user. So if possible, the program should +write the content to the file it is computing directly, rather than writing +to somewhere else and renaming it at the end. But, if the program seeks +around and writes out of order, it should write to a file somewhere else +and rename it at the end. The program can also output lines to stdout to indicate its current progress: PROGRESS 50% +The program can optionally also output a "REPRODUCIBLE" line. That +indicates that the results of its computations are expected to be +bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if +the `--reproducible` option is set. + Anything that the program outputs to stderr will be displayed to the user. This stderr should be used for error messages, and possibly computation output, but not for progress displays. @@ -71,42 +85,21 @@ output, but not for progress displays. If the program exits nonzero, nothing it computed will be stored in the git-annex repository. -When run with the "interface" parameter, the program must describe its -interface. This is a list of the inputs and outputs that it -supports. This allows `git-annex addcomputed` and `git-annex initremote` to -list inputs and outputs, and also lets them reject invalid inputs and -outputs. - -The output is lines, in the form: - - INPUT[?] Id Description - VALUE[?] Id Description - OUTPUT Id Description - -Use "INPUT" when a file is an input to the computation, -and "VALUE" for all other input values. Use "INPUT?" and "VALUE?" -for optional inputs and values. - -The interface can also optionally include a "REPRODUCIBLE" line. -That indicates that the results of its computations are -expected to be bit-for-bit reproducible. -That makes `git-annex addcomputed` behave as if the `--reproducible` -option is set. - An example `git-annex-compute-foo` shell script follows: #!/bin/sh set -e - if [ "$1" = interface ]; then - echo "INPUT raw A photo in RAW format" - echo "VALUE? passes Number of passes" - echo "OUTPUT photo Computed JPEG" - echo "REPRODUCIBLE" - exit 0 + if [ "$1" != "convert" ]; then + echo "Usage: convert input output [passes=n]" >&2 + exit 1 + fi + if [ -z "$ANNEX_COMPUTE_passes" ]; + ANNEX_COMPUTE_passes=1 fi - if [ -z "$ANNEX_COMPUTE_VALUE_passes" ]; then - ANNEX_COMPUTE_VALUE_passes=1 + echo "INPUT "$2" + read input + echo "OUTPUT $3" + echo REPRODUCIBLE + if [ -n "$input" ]; then + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" fi - echo "COMPUTING photo out.jpeg" - frobnicate --passes="$ANNEX_COMPUTE_VALUE_passes" \ - <"$ANNEX_COMPUTE_INPUT_raw" >out.jpeg
update for new interface
diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 395d6246e7..487bb70ff1 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -4,31 +4,34 @@ git-annex addcomputed - adds computed files to the repository # SYNOPSIS -git annex addcomputed `--to=remote [--input name=file ...] [--value name=value ...] [--output name=file ...]` - -git annex addcomputed `--describe=remote` +git annex addcomputed `--to=remote -- ...` # DESCRIPTION -Adds files to the annex that are computed from input files and values, +Adds files to the annex that are computed from input files, using a compute special remote. -For example, this adds a file `foo.jpeg` to the repository. It is computed -by the "photoconv" compute remote, based on an input file, `foo.raw`. A -configurable "passes" value is set to 10 when computing the file. +Once a file has been added to a compute remote, commands +like `git-annex get` will use it to compute the content of the file. + +The syntax of this command after the `--` is up to the program that +the compute special remote is set up to run to perform the comuptation. + +To see the program's usage, you can run: - git-annex addcomputed --to photoconv \ - --input raw=foo.raw --output photo=foo.jpeg \ - --value passes=10 + git-annex addcomputed --to=foo + +Generally you will provide an input file (or files), and often also an +output filename, and additional parameters to control the computation. There can be more than one input file that are combined to compute an output file. And multiple output files can be computed at the same time. -The output files are added to the repository as annexed files. -Once a file has been added to a compute remote, commands -like `git-annex get` will use it to compute the content of the file. -It is also possible to use commands like `git-annex drop` on the file, -with the compute remote being counted as one copy of it. +Some examples of how this might look: + + git-annex addcomputed --to=x -- convert file.raw file.jpeg passes=10 + git-annex addcomputed --to=y -- compress foo --level=9 + git-annex addcomputed --to=z -- clip foo 2:01-3:00 combine with bar to baz # OPTIONS @@ -46,27 +49,6 @@ with the compute remote being counted as one copy of it. of the programs that are available, see <https://git-annex.branchable.com/special_remotes/compute/> -* `--input name=file` - - Provide a file as input to the computation, with the specified input name. - - The input file can be an annexed file, or a file stored in git. - -* `--output name=file` - - Add the output of the computation to the repository as an annexed file, - with the specified filename. - -* `--value name=value` - - Provide a value to the computation, with the specified name. - -* `--describe=remote` - - Describe all inputs, outputs, and values supported by a compute remote. - - For a machine-readable list, use this with `--json`. - * `--fast` Adds computed files to the repository, without generating their content
support addcomputed --fast
This complicates the interface but it's still simpler to understand than
the old interface.
This complicates the interface but it's still simpler to understand than
the old interface.
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 33c33ad2ad..5c771c17ad 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -14,8 +14,8 @@ a command like one of these: git-annex addcomputed --to=myremote -- compress in out --level=9 git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz -Whatever values the user passes to `git-annex addcomputed` are passed on to -the program, followed by any values that the user provided to +Whatever values the user passes to `git-annex addcomputed` are passed to +the program in `ARGV`, followed by any values that the user provided to `git-annex initremote`. To simplify the program's option parsing, any value that the user provides @@ -45,8 +45,15 @@ If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, the program should exit. -The program computes one or more output files. For each output file that it -will compute, the program should write a line to stdout: +When `git-annex addcomputed --fast` is being used to add a computation +to the git-annex repository without actually performing it, the +response to each "INPUT" will be an empty line rather than the path to +an input file. In that case, the program should proceed with the rest of +its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not +perform any computation. + +For each output file that it will compute, the program should write a +line to stdout: OUTPUT file.jpeg @@ -93,4 +100,6 @@ An example `git-annex-compute-foo` shell script follows: read input echo "OUTPUT $3" echo REPRODUCIBLE - frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + if [ -n "$input" ]; then + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3" + fi
new compute program interface
This is much more flexible, and also simpler to understand.
This is much more flexible, and also simpler to understand.
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index cc03b4861f..33c33ad2ad 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -8,62 +8,69 @@ When an compute special remote is initremoted, a program is specified: git-annex initremote myremote type=compute program=git-annex-compute-foo The user adds an annexed file that is computed by the program by running -a command like this: +a command like one of these: - git-annex addcomputed --to myremote \ - --input raw=file.raw --value passes=10 \ - --output photo=file.jpeg + git-annex addcomputed --to=myremote -- convert file.raw file.jpeg passes=10 + git-annex addcomputed --to=myremote -- compress in out --level=9 + git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz -That command and later `git-annex get` of a computed file both -run the program the same way. +Whatever values the user passes to `git-annex addcomputed` are passed on to +the program, followed by any values that the user provided to +`git-annex initremote`. -The program is passed inputs to the computation via environment variables, -which are all prefixed with `"ANNEX_COMPUTE_"`. +To simplify the program's option parsing, any value that the user provides +that is in the form "foo=bar" will also result in an environment variable +being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. -In the example above, the program will be passed this environment: +For security, the program should avoid exposing user input to the shell +unprotected, or otherwise executing it. - ANNEX_COMPUTE_INPUT_raw=/path/.git/annex/objects/.. - ANNEX_COMPUTE_VALUE_passes=10 +The program is run in a temporary directory, which will be cleaned up after +it exits. -Default values that are provided to `git-annex initremote` will also be set -in the environment. Eg `git-annex initremote myremote type=compute -program=foo passes=9` will set `ANNEX_COMPUTE_VALUE_passes=9` by default. +The content of any annexed file in the repository can be an input +to the computation. The program requests an input by writing a line to +stdout: -For security, the program should avoid exposing values from `ANNEX_COMPUTE_*` -variables to the shell unprotected, or otherwise executing them. + INPUT file.raw -The program will also inherit other environment variables -that were set when git-annex was run, like PATH. (`ANNEX_COMPUTE_*` -environment variables are not inherited.) +Then it can read a line from stdin, which will be the path to the content +(eg a `.git/annex/objects/` path). -The program is run in a temporary directory, which will be cleaned up after -it exits. It writes the files that it computes to that directory. +If the program needs multiple input files, it should output multiple +`INPUT` lines at once, and then read multiple paths from stdin. This +allows retrival of the inputs to potentially run in parallel. -Before starting the main computation, the program must output a list of the -files that it will compute, in the form "COMPUTING Id filename". -Here "Id" is a short identifier for a particular file, which the -user specifies when running `git-annex addcomputed`. +If an input file is not available, the program's stdin will be closed +without a path being written to it. So when reading from stdin fails, +the program should exit. -In the example above, the program is expected to output something like: +The program computes one or more output files. For each output file that it +will compute, the program should write a line to stdout: - COMPUTING photo out.jpeg - COMPUTING sidecar otherfile + OUTPUT file.jpeg -If possible, the program should write the content of the file it is -computing directly to the file listed in COMPUTING, rather than writing to -somewhere else and renaming it at the end. Except, when the program writes -the file it computes out of order, it should write to a file somewhere else -and rename it at the end. +The filename of the output file is both the filename in the program's +temporary directory, and also the filename that will be added to the +git-annex repository by `git-annex compute`. -If git-annex sees that the file corresponding to the key it requested be -computed is growing, it will use its file size when displaying progress to -the user. +If git-annex sees that an output file is growing, it will use its file size +when displaying progress to the user. So if possible, the program should +write the content to the file it is computing directly, rather than writing +to somewhere else and renaming it at the end. But, if the program seeks +around and writes out of order, it should write to a file somewhere else +and rename it at the end. The program can also output lines to stdout to indicate its current progress: PROGRESS 50% +The program can optionally also output a "REPRODUCIBLE" line. That +indicates that the results of its computations are expected to be +bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if +the `--reproducible` option is set. + Anything that the program outputs to stderr will be displayed to the user. This stderr should be used for error messages, and possibly computation output, but not for progress displays. @@ -71,42 +78,19 @@ output, but not for progress displays. If the program exits nonzero, nothing it computed will be stored in the git-annex repository. -When run with the "interface" parameter, the program must describe its -interface. This is a list of the inputs and outputs that it -supports. This allows `git-annex addcomputed` and `git-annex initremote` to -list inputs and outputs, and also lets them reject invalid inputs and -outputs. - -The output is lines, in the form: - - INPUT[?] Name Description - VALUE[?] Name Description - OUTPUT Id Description - -Use "INPUT" when a file is an input to the computation, -and "VALUE" for all other input values. Use "INPUT?" and "VALUE?" -for optional inputs and values. - -The interface can also optionally include a "REPRODUCIBLE" line. -That indicates that the results of its computations are -expected to be bit-for-bit reproducible. -That makes `git-annex addcomputed` behave as if the `--reproducible` -option is set. - An example `git-annex-compute-foo` shell script follows: #!/bin/sh set -e - if [ "$1" = interface ]; then - echo "INPUT raw A photo in RAW format" - echo "VALUE? passes Number of passes" - echo "OUTPUT photo Computed JPEG" - echo "REPRODUCIBLE" - exit 0 + if [ "$1" != "convert" ]; then + echo "Usage: convert input output [passes=n]" >&2 + exit 1 fi - if [ -z "$ANNEX_COMPUTE_VALUE_passes" ]; then - ANNEX_COMPUTE_VALUE_passes=1 + if [ -z "$ANNEX_COMPUTE_passes" ]; + ANNEX_COMPUTE_passes=1 fi - echo "COMPUTING photo out.jpeg" - frobnicate --passes="$ANNEX_COMPUTE_VALUE_passes" \ - <"$ANNEX_COMPUTE_INPUT_raw" >out.jpeg + echo "INPUT "$2" + read input + echo "OUTPUT $3" + echo REPRODUCIBLE + frobnicate --passes="$ANNEX_COMPUTE_passes" <"$input" >"$3"
diff --git a/doc/bugs/httpalso_windows_URL_errors.mdwn b/doc/bugs/httpalso_windows_URL_errors.mdwn new file mode 100644 index 0000000000..15747fbad0 --- /dev/null +++ b/doc/bugs/httpalso_windows_URL_errors.mdwn @@ -0,0 +1,37 @@ +### Please describe the problem. + +We use `httpalso` for public read-only access to an ssh writable remote storing many datasets (this uses datalad ria-store, but that's a detail here). +That works flawlessly on linux, but a windows user recently reported that they couldn't download the data. +As I had some linux "deploy" tests on that dataset (mainly `datalad install` and `fsck --fast` from special remotes), I managed after wrestling with windows actions setup to run an equivalent on windows and I am able to reproduce the bug. +As I don't have a windows VM that was the simplest and more long-term way to ensure that works. + +It seems `httpalso` on Windows malforms the path in the URLs, see the troubleshooting actions job here https://github.com/courtois-neuromod/algonauts_2025.competitors/actions/runs/13502500268/job/37724172008 + +The path in the URL is `/users/cneuromod/ria-conp/5d2/0a1ae-3571-4de8-94c8-8ddb416cd3b0/annex/objects/xw%5CXV%5C/MD5E-s105090--c62dbfef65f486c626c10f778ebbe4f2.nii.gz/MD5E-s105090--c62dbfef65f486c626c10f778ebbe4f2.nii.gz` where `xw%5CXV%5C` should be something like `xw/XV` so some conversion of the windows backslash to posix might not be working. + +(`httpalso` also fallbacks to an URL assuming all objects are flat in `annex/objects`, but that obviously fails as this is not stored that way in the remote.) + +### What steps will reproduce the problem? + +`git-annex get` a file from a `httpalso` special remote on a windows machine. +The dataset https://github.com/courtois-neuromod/algonauts_2025.competitors can be used for troubleshooting as this is fully public (git+annex). + +### What version of git-annex are you using? On what operating system? + +`10.20250115` on windows + + +### Please provide any additional information below. + +[[!format sh """ +# If you can, paste a complete transcript of the problem occurring here. +# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log + + +# End of transcript or log. +"""]] + +### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) + + +Yes, our whole data management relies on git-annex and datalad! Thanks for that amazing tool!
diff --git a/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn b/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn index e759cfe821..56b7ddad56 100644 --- a/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn +++ b/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn @@ -1,4 +1,4 @@ -I have some repositories in group backup and some in client. I was surprised that I had to add the backup repositories to group archive as well for the client groupwanted expression work as intended. +I have some repositories in group backup and some in client. I was surprised that I had to add the backup repositories to group archive as well for the client groupwanted expression work to as intended. So shouldn't the client expression read
diff --git a/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn b/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn index 08add16e4f..e759cfe821 100644 --- a/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn +++ b/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn @@ -1,4 +1,4 @@ -I have some repositories in group backup and some in client. I was surprised that I had to add the backup repositiries to group archive as well for the client groupwanted expression work as intended. +I have some repositories in group backup and some in client. I was surprised that I had to add the backup repositories to group archive as well for the client groupwanted expression work as intended. So shouldn't the client expression read
diff --git a/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn b/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn new file mode 100644 index 0000000000..08add16e4f --- /dev/null +++ b/doc/forum/Should_the_client_group_expression_include_backup__63__.mdwn @@ -0,0 +1,11 @@ +I have some repositories in group backup and some in client. I was surprised that I had to add the backup repositiries to group archive as well for the client groupwanted expression work as intended. + +So shouldn't the client expression read + + (include=* and ((exclude=*/archive/* and exclude=archive/*) or (not (copies=archive:1 or copies=smallarchive:1 or copies=backup:1)))) or approxlackingcopies=1 + +instead of just + + (include=* and ((exclude=*/archive/* and exclude=archive/*) or (not (copies=archive:1 or copies=smallarchive:1)))) or approxlackingcopies=1 + +?
diff --git a/doc/forum/Can_I_reset_the_get__47__drop_information_and_resync__63__.mdwn b/doc/forum/Can_I_reset_the_get__47__drop_information_and_resync__63__.mdwn new file mode 100644 index 0000000000..be3b670f08 --- /dev/null +++ b/doc/forum/Can_I_reset_the_get__47__drop_information_and_resync__63__.mdwn @@ -0,0 +1 @@ +Some wanted/required files are missing in some repositories and don't get synced. I think I did not drop these files but this doesn't matter right now. Can I reset the information which files got dropped or getted (hope it's clear what mean) and just tell annex to (re-)distribute the files according groupwanted/wanted/required?
diff --git a/doc/forum/Does___34__required__34___use_the_groupwanted_expressions__63__.mdwn b/doc/forum/Does___34__required__34___use_the_groupwanted_expressions__63__.mdwn new file mode 100644 index 0000000000..565dba87a0 --- /dev/null +++ b/doc/forum/Does___34__required__34___use_the_groupwanted_expressions__63__.mdwn @@ -0,0 +1 @@ +As the title says: Can I set "required" to "standard" to use the groupwanted expressions?
distribits 2025
diff --git a/doc/news/Distribits_meeting_2025.mdwn b/doc/news/Distribits_meeting_2025.mdwn new file mode 100644 index 0000000000..4bf321cad0 --- /dev/null +++ b/doc/news/Distribits_meeting_2025.mdwn @@ -0,0 +1,12 @@ +We are happy to announce that the call for participation for Distribits +2025 is now online at <https://distribits.live/> + +This meeting is organized by the folks behind git-annex and DataLad. We aim +to bring together enthusiasts of tools and workflows in the domain of +distributed data. + +We are looking forward to October 2025 and to meeting people from everywhere, +online and in Düsseldorf. + +Registration is open now and until May 1st. Please consider submitting a +talk if you are doing something interesting or unusual with git-annex.
diff --git a/doc/forum/Move_part_of_one_repository_into_other.mdwn b/doc/forum/Move_part_of_one_repository_into_other.mdwn index 7727e1b949..94adad499e 100644 --- a/doc/forum/Move_part_of_one_repository_into_other.mdwn +++ b/doc/forum/Move_part_of_one_repository_into_other.mdwn @@ -1,7 +1,7 @@ I have two sets of git annex repositories: A: Contains a tree of some files and then a *huge* chunk of files -B: Contains already has a ton of files +B: Already contains a ton of files Both repository sets have their own set of repos on different machines that are connected to another.
diff --git a/doc/forum/Move_part_of_one_repository_into_other.mdwn b/doc/forum/Move_part_of_one_repository_into_other.mdwn index 036b76a5ef..7727e1b949 100644 --- a/doc/forum/Move_part_of_one_repository_into_other.mdwn +++ b/doc/forum/Move_part_of_one_repository_into_other.mdwn @@ -1,6 +1,6 @@ I have two sets of git annex repositories: -A: Contains a tree of some files and then a *huge* chunk of files +A: Contains a tree of some files and then a *huge* chunk of files B: Contains already has a ton of files Both repository sets have their own set of repos on different machines that are connected to another.