Recent changes to this wiki:
https://github.com/named-data-mobile/ndn-photo-app/issues/186
diff --git a/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn b/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn index 4dfde78906..6401148e0f 100644 --- a/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn +++ b/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn @@ -2,4 +2,4 @@ Average users could get so much value out of a simple and intuitive p2p file-syn Currently, `git-annex` depends on Tor and [Magic Wormhole](https://github.com/magic-wormhole/magic-wormhole) to share/collaborate/sync with others. There is [Hypercore](https://docs.pears.com/building-blocks/hypercore) by [Holepunch](https://holepunch.to/), but [Named Data Networking (NDN)](https://named-data.net/project/archoverview/) ([video](https://youtu.be/qbAB0iN1-zQ)) offers a more robust internet backbone. -Even better, there was an experiment for "[Distributed Git over Named Data Networking](https://github.com/JonnyKong/GitSync/issues/2)". Could NDN be used as the backbone for a built-in method for connections to be made for `git-annex` and set the foundation—pave the way—for the perfect file synchronization/sharing app? +Even better, there was an experiment for "[Distributed Git over Named Data Networking](https://github.com/JonnyKong/GitSync)" and [npChat](https://github.com/named-data-mobile/ndn-photo-app/issues/186) both exist. Could NDN be used as the backbone for a built-in method for connections to be made for `git-annex` and set the foundation—pave the way—for the perfect file synchronization/sharing app?
https://github.com/JonnyKong/GitSync/issues/2
diff --git a/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn b/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn new file mode 100644 index 0000000000..4dfde78906 --- /dev/null +++ b/doc/todo/Abstract__44__LibreVault__44__ResilioSync_alternative_on_NDN.mdwn @@ -0,0 +1,5 @@ +Average users could get so much value out of a simple and intuitive p2p file-syncing service that is as polished as [ResilioSync](https://www.resilio.com/sync/), provides revision control on arbitrary files like [Abstract (version control for designers)](https://www.goabstract.com/), but open source like LibreVault, SyncThing and SparkleShare. This would be priceless for science, design, all sorts of collaborative workspaces, and especially for backups and data redundancy. + +Currently, `git-annex` depends on Tor and [Magic Wormhole](https://github.com/magic-wormhole/magic-wormhole) to share/collaborate/sync with others. There is [Hypercore](https://docs.pears.com/building-blocks/hypercore) by [Holepunch](https://holepunch.to/), but [Named Data Networking (NDN)](https://named-data.net/project/archoverview/) ([video](https://youtu.be/qbAB0iN1-zQ)) offers a more robust internet backbone. + +Even better, there was an experiment for "[Distributed Git over Named Data Networking](https://github.com/JonnyKong/GitSync/issues/2)". Could NDN be used as the backbone for a built-in method for connections to be made for `git-annex` and set the foundation—pave the way—for the perfect file synchronization/sharing app?
comment
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_8423d36a81a91aacce9e1906baee5626._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_6_8423d36a81a91aacce9e1906baee5626._comment similarity index 92% rename from doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_8423d36a81a91aacce9e1906baee5626._comment rename to doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_6_8423d36a81a91aacce9e1906baee5626._comment index 8472398e18..0ef8fc920d 100644 --- a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_8423d36a81a91aacce9e1906baee5626._comment +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_6_8423d36a81a91aacce9e1906baee5626._comment @@ -1,6 +1,6 @@ [[!comment format=mdwn username="joey" - subject="""comment 5""" + subject="""comment 6""" date="2025-04-11T17:20:08Z" content=""" Implemented this as the mask special remote. diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_7_57aac89ca247fb312fd4874d2f51a198._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_7_57aac89ca247fb312fd4874d2f51a198._comment new file mode 100644 index 0000000000..6790e393a7 --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_7_57aac89ca247fb312fd4874d2f51a198._comment @@ -0,0 +1,9 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 7""" + date="2025-04-11T17:23:59Z" + content=""" +> Could such a special remote use the same \"transport\" as the underlying remote (thinking of p2p http in particular), which would mean the same authentication & the same set of permissions server side? + +Yes, the underlying remote is used as-is, whatever it is. +"""]]
done
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn index f637a5addb..ff25801601 100644 --- a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn @@ -14,3 +14,5 @@ The advantage of having the annexed files but not the git repo encrypted is that Thanks in advance for considering! -- MSz [[!tag projects/INM7]] + +> [[done]], by implementing the mask special remote --[[Joey]] diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_8423d36a81a91aacce9e1906baee5626._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_8423d36a81a91aacce9e1906baee5626._comment new file mode 100644 index 0000000000..8472398e18 --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_8423d36a81a91aacce9e1906baee5626._comment @@ -0,0 +1,12 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 5""" + date="2025-04-11T17:20:08Z" + content=""" +Implemented this as the mask special remote. + +For example, to make a remote that stores annexed +files encrypted in the origin remote: + + git annex initremote encryptedorigin type=mask remote=origin encryption=hybrid pubkey=id@joeyh.name +"""]]
fixes for enabling and autoenabling mask special remotes
diff --git a/Remote/Mask.hs b/Remote/Mask.hs index b4b981fbe9..c17c370fda 100644 --- a/Remote/Mask.hs +++ b/Remote/Mask.hs @@ -104,8 +104,13 @@ maskSetup setupstage mu _ c gc = do (M.lookup remoteField c) setupremote =<< findnamed maskremotename _ -> case M.lookup remoteField c of - Just (Proposed maskremotename) -> - setupremote =<< findnamed maskremotename + -- enableremote with remote= overrides the remote + -- name that was used with initremote. + Just (Proposed maskremotename) -> do + r <- findnamed maskremotename + unless (uuid r == maskremoteuuid) $ + giveup $ "Remote \"" ++ maskremotename ++ "\" does not have the expected uuid (" ++ fromUUID maskremoteuuid ++ ")" + setupremote r _ -> enableremote remotelist where setupremote r = do @@ -117,18 +122,20 @@ maskSetup setupstage mu _ c gc = do u <- maybe (liftIO genUUID) return mu gitConfigSpecialRemote u c'' [ ("mask", name r) ] return (c'', u) + + maskremoteuuid = fromMaybe NoUUID $ + toUUID . fromProposedAccepted + <$> M.lookup remoteUUIDField c enableremote remotelist = do - let maskremoteuuid = fromMaybe NoUUID $ - toUUID . fromProposedAccepted - <$> M.lookup remoteUUIDField c case filter (\r -> uuid r == maskremoteuuid) remotelist of (r:_) -> setupremote r [] -> case setupstage of Enable _ -> missingMaskedRemote maskremoteuuid -- When autoenabling, the masked remote may - -- get autoenabled later. + -- get autoenabled later, or need to be + -- manually enabled. 
_ -> do (c', _) <- encryptionSetup c gc u <- maybe (liftIO genUUID) return mu @@ -170,14 +177,17 @@ findMaskedRemote c gc myuuid = case remoteAnnexMask gc of Just "true" -> case getmaskedremoteuuid of Just maskremoteuuid -> - selectremote maskremoteuuid - (\r -> uuid r == maskremoteuuid) + selectremote maskremoteuuid $ \r -> + uuid r == maskremoteuuid Nothing -> missingMaskedRemote NoUUID Just maskremotename -> - selectremote NoUUID (\r -> name r == maskremotename) + selectremote (fromMaybe NoUUID getmaskedremoteuuid) $ \r -> + name r == maskremotename + && Just (uuid r) == getmaskedremoteuuid Nothing -> missingMaskedRemote NoUUID where - getmaskedremoteuuid = toUUID . fromProposedAccepted <$> M.lookup remoteField c + getmaskedremoteuuid = toUUID . fromProposedAccepted + <$> M.lookup remoteUUIDField c selectremote u f = do remotelist <- Annex.getState Annex.remotes case filter f remotelist of diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index 25fb8ec3b2..c3b1ea0745 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -2053,7 +2053,9 @@ Remotes are configured using these settings in `.git/config`. * `remote.<name>.annex-mask` - Used by mask special remotes. + Used by mask special remotes, this is set to the name of the remote + that is masked. If this is set to "true", then any remote with the + right UUID will be used. Normally this is automatically set up by `git annex initremote`. * `remote.<name>.annex-externaltype`
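The enabling logic changed in the diff above can be sketched in isolation. This is a minimal, pure illustration and not the code in `Remote/Mask.hs`: the `Remote` record, its field names, `selectMasked`, and the "not available" message are hypothetical stand-ins, and only the two rules are taken from the diff (a remote named with `remote=` must also have the recorded UUID; without a name, any remote with that UUID is used).

```haskell
import Data.List (find)

-- Hypothetical, simplified stand-in for git-annex's Remote type.
data Remote = Remote
    { remoteName :: String
    , remoteUUID :: String
    } deriving (Show, Eq)

-- Choose the remote that sits under the mask when enabling.
-- With an explicit name, the named remote must also carry the expected UUID.
-- Without a name, fall back to any remote that has the expected UUID.
selectMasked :: Maybe String -> String -> [Remote] -> Either String Remote
selectMasked (Just wantedName) expectedUUID remotes =
    case find (\r -> remoteName r == wantedName) remotes of
        Nothing -> Left ("There is no remote named \"" ++ wantedName ++ "\"")
        Just r
            | remoteUUID r == expectedUUID -> Right r
            | otherwise -> Left
                ("Remote \"" ++ wantedName
                    ++ "\" does not have the expected uuid ("
                    ++ expectedUUID ++ ")")
selectMasked Nothing expectedUUID remotes =
    -- Assumed wording; the real code reports a missing masked remote differently.
    maybe (Left "masked remote is not available") Right
        (find (\r -> remoteUUID r == expectedUUID) remotes)

main :: IO ()
main = do
    let remotes = [Remote "origin" "uuid-a", Remote "backup" "uuid-b"]
    print (selectMasked (Just "origin") "uuid-a" remotes) -- name and UUID agree
    print (selectMasked (Just "backup") "uuid-a" remotes) -- name exists, wrong UUID
    print (selectMasked Nothing "uuid-b" remotes)         -- lookup by UUID only
```

Returning `Either String Remote` keeps the sketch testable without the `Annex` monad; the real code aborts with `giveup` instead.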
diff --git a/doc/forum/path-specific_configuration_of_autocommit.mdwn b/doc/forum/path-specific_configuration_of_autocommit.mdwn new file mode 100644 index 0000000000..b1215fbf2d --- /dev/null +++ b/doc/forum/path-specific_configuration_of_autocommit.mdwn @@ -0,0 +1,3 @@ +Is there any way to make the assistant (presumably also sync, etc) autocommit only in particular directories? + +I usually want to manage things myself, but autocommit would make it possible to use annex alongside some other automated tools if I moved their data directories under annex.
Added a comment
diff --git a/doc/forum/Authentication_for_URL_downloads/comment_2_2bb64609b66a24d1c071342cb75ac8cd._comment b/doc/forum/Authentication_for_URL_downloads/comment_2_2bb64609b66a24d1c071342cb75ac8cd._comment new file mode 100644 index 0000000000..5c37542861 --- /dev/null +++ b/doc/forum/Authentication_for_URL_downloads/comment_2_2bb64609b66a24d1c071342cb75ac8cd._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="tomdhunt" + avatar="http://cdn.libravatar.org/avatar/02694633d0fb05bb89f025cf779218a3" + subject="comment 2" + date="2025-04-10T20:25:30Z" + content=""" +Just passing the `--cookie` option to `curl` doesn't do it; what I need is to automatically fetch the current cookies from the browser and use those every time, i.e. to run custom code at download time. The old `web-download-command` option would have worked fine; alternatively, a setting to run a script to generate the curl options would also work. +"""]]
mask remotes, partial implementation
Everything implemented except for passing through to the masked remote.
Which should be trivial.
diff --git a/Remote/Helper/Encryptable.hs b/Remote/Helper/Encryptable.hs index 9f4bd7fcb1..33eb5b3837 100644 --- a/Remote/Helper/Encryptable.hs +++ b/Remote/Helper/Encryptable.hs @@ -8,7 +8,7 @@ {-# LANGUAGE FlexibleContexts, ScopedTypeVariables, PackageImports #-} module Remote.Helper.Encryptable ( - EncryptionIsSetup, + EncryptionIsSetup(..), encryptionSetup, noEncryptionUsed, encryptionAlreadySetup, diff --git a/Remote/List.hs b/Remote/List.hs index 9d39ddd81d..80a9781f10 100644 --- a/Remote/List.hs +++ b/Remote/List.hs @@ -41,6 +41,7 @@ import qualified Remote.Rclone import qualified Remote.Hook import qualified Remote.External import qualified Remote.Compute +import qualified Remote.Mask remoteTypes :: [RemoteType] remoteTypes = map adjustExportImportRemoteType @@ -65,6 +66,7 @@ remoteTypes = map adjustExportImportRemoteType , Remote.Hook.remote , Remote.External.remote , Remote.Compute.remote + , Remote.Mask.remote ] {- Builds a list of all Remotes. diff --git a/Remote/Mask.hs b/Remote/Mask.hs new file mode 100644 index 0000000000..04ebd2553e --- /dev/null +++ b/Remote/Mask.hs @@ -0,0 +1,205 @@ +{- Mask another remote with added encryption + - + - Copyright 2025 Joey Hess <id@joeyh.name> + - + - Licensed under the GNU AGPL version 3 or higher. 
+ -} + +{-# LANGUAGE RankNTypes #-} + +module Remote.Mask (remote) where + +import Annex.Common +import Types.Remote +import Types.Creds +import Types.Crypto +import qualified Git +import qualified Annex +import Remote.Helper.Special +import Remote.Helper.ExportImport +import Config +import Config.Cost +import Annex.UUID +import Types.ProposedAccepted +import Annex.SpecialRemote.Config +import Logs.UUID +import qualified Remote.Git + +import qualified Data.Map as M + +remote :: RemoteType +remote = specialRemoteType $ RemoteType + { typename = "mask" + , enumerate = const (findSpecialRemotes "mask") + , generate = gen + , configParser = mkRemoteConfigParser + [ optionalStringParser remoteField + (FieldDesc "remote to mask") + ] + , setup = maskSetup + , exportSupported = exportIsSupported + , importSupported = importIsSupported + , thirdPartyPopulated = False + } + +gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote) +gen r u rc gc rs = do + maskedremote <- getMaskedRemote rc gc + let inherited d f = case maskedremote of + Right mr -> f mr + Left _ -> d + c <- parsedRemoteConfig remote rc + cst <- remoteCost gc c $ encryptedRemoteCostAdj + + inherited semiExpensiveRemoteCost cost + let this = Remote + { uuid = u + , cost = cst + , name = Git.repoDescribe r + , storeKey = storeKeyDummy + , retrieveKeyFile = retrieveKeyFileDummy + , retrieveKeyFileInOrder = pure True + , retrieveKeyFileCheap = Nothing + , retrievalSecurityPolicy = inherited RetrievalVerifiableKeysSecure retrievalSecurityPolicy + , removeKey = removeKeyDummy + , lockContent = Nothing + , checkPresent = checkPresentDummy + , checkPresentCheap = inherited False checkPresentCheap + , exportActions = exportUnsupported + , importActions = importUnsupported + , whereisKey = Nothing + , remoteFsck = Nothing + , repairRepo = Nothing + , config = c + , getRepo = return r + , gitconfig = gc + , localpath = Nothing + , remotetype = remote + , availability = 
inherited (pure Unavailable) availability + , readonly = inherited False readonly + , appendonly = inherited False appendonly + , untrustworthy = inherited False untrustworthy + , mkUnavailable = return Nothing + , getInfo = inherited (pure []) getInfo + , claimUrl = Nothing + , checkUrl = Nothing + , remoteStateHandle = rs + } + return $ Just $ specialRemote c + (store maskedremote) + (retrieve maskedremote) + (remove maskedremote) + (checkKey maskedremote) + this + +maskSetup :: SetupStage -> Maybe UUID -> Maybe CredPair -> RemoteConfig -> RemoteGitConfig -> Annex (RemoteConfig, UUID) +maskSetup setupstage mu _ c gc = do + remotelist <- Annex.getState Annex.remotes + let findnamed maskremotename = + case filter (\r -> name r == maskremotename) remotelist of + (r:_) -> return r + [] -> giveup $ "There is no remote named \"" ++ maskremotename ++ "\"" + case setupstage of + Init -> do + maskremotename <- maybe + (giveup "Specify remote=") + (pure . fromProposedAccepted) + (M.lookup remoteField c) + setupremote =<< findnamed maskremotename + _ -> case M.lookup remoteField c of + Just (Proposed maskremotename) -> + setupremote =<< findnamed maskremotename + _ -> enableremote remotelist + where + setupremote r = do + let c' = M.insert remoteUUIDField + (Proposed (fromUUID (uuid r) :: String)) c + (c'', encsetup) <- encryptionSetup c' gc + verifyencryptionok encsetup r + + u <- maybe (liftIO genUUID) return mu + gitConfigSpecialRemote u c'' [ ("mask", name r) ] + return (c'', u) + + enableremote remotelist = do + let maskremoteuuid = fromMaybe NoUUID $ + toUUID . fromProposedAccepted + <$> M.lookup remoteUUIDField c + case filter (\r -> uuid r == maskremoteuuid) remotelist of + (r:_) -> setupremote r + [] -> case setupstage of + Enable _ -> + missingMaskedRemote maskremoteuuid + -- When autoenabling, the masked remote may + -- get autoenabled later. 
+ _ -> do + (c', _) <- encryptionSetup c gc + u <- maybe (liftIO genUUID) return mu + gitConfigSpecialRemote u c' [ ("mask", "true") ] + return (c', u) + + verifyencryptionok NoEncryption _ = + giveup "Must use encryption with a mask special remote." + verifyencryptionok EncryptionIsSetup r + | remotetype r == Remote.Git.remote = + verifyencryptionokgit + | otherwise = noop + + verifyencryptionokgit = case parseEncryptionMethod c of + Right SharedEncryption -> + giveup "It's not secure to use encryption=shared with a git remote." + _ -> noop + +getMaskedRemote :: RemoteConfig -> RemoteGitConfig -> Annex (Either UUID Remote) +getMaskedRemote c gc = case remoteAnnexMask gc of + -- This remote was autoenabled, so use any remote with the + -- uuid of the masked remote, so that it can also be autoenabled. + Just "true" -> + case getmaskedremoteuuid of + Just maskremoteuuid -> + selectremote (\r -> uuid r == maskremoteuuid) + maskremoteuuid + Nothing -> return (Left NoUUID) (Diff truncated)
Added a comment
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_e3c67afcb36ba91a31de740b3dd02ba3._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_e3c67afcb36ba91a31de740b3dd02ba3._comment new file mode 100644 index 0000000000..7e7383dbea --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_5_e3c67afcb36ba91a31de740b3dd02ba3._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="msz" + avatar="http://cdn.libravatar.org/avatar/6e8b88e7c70d86f4cfd27d450958aed4" + subject="comment 5" + date="2025-04-10T16:56:50Z" + content=""" +Thank you, this is very informative! + +Could such a special remote use the same \"transport\" as the underlying remote (thinking of p2p http in particular), which would mean the same authentication & the same set of permissions server side? +"""]]
page for mask remotes
documentation only so far
diff --git a/doc/special_remotes/mask.mdwn b/doc/special_remotes/mask.mdwn new file mode 100644 index 0000000000..88b911a83b --- /dev/null +++ b/doc/special_remotes/mask.mdwn @@ -0,0 +1,31 @@ +This adds a layer of encryption to another remote. Files are stored on the +underlying remote, but get encrypted first by the mask. + +For example, a git repository is usually not encrypted (although see +[[gcrypt]]). If you want to store some annexed files encrypted +in the git remote "foo", you can set up a mask remote: + + git annex initremote foo-encrypted type=mask remote=foo encryption=hybrid keyid=... + +When someone else clones that git repository, they will be able to access +any annexed files that were sent directly to foo, which are stored unencrypted. +But any files that were sent to foo-encrypted will only be accessible to +people with the configured gpg keys. + +## configuration + +* `remote` - The name of the remote to use under the mask, which is where + files are stored. This must be provided when running `initremote`. + + When later running `enableremote`, any enabled remote with the same uuid + will be used, even if it has a different name than the name given here. This + parameter can also be provided when running `enableremote` to specify + explicitly which remote to use under the mask. + +* `encryption` - Encryption *must* be enabled for a mask. + One of "hybrid", "shared", or "pubkey". See [[encryption]]. + +* `keyid` - Specifies the gpg key to use for [[encryption]]. + +* `chunk` - Enables [[chunking]] when storing large files. + `chunk=1MiB` is a good starting point for chunking.
update
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment index 4cc63688a7..9de9ff171e 100644 --- a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment @@ -41,4 +41,12 @@ A few gotchas I can see: are set up all storing to the same underlying remote. I think this is minor, because there would be 2 actual copies, just copies that happen to be on the same drive. +* `encryption=shared` will not add any security if the underlying remote + is a git repository, because pushing the git-annex branch there will make + the encryption key available to anyone who can access the git repository. + Instead will need to use `encryption=pubkey`. + (Since this is a bit non-obvious, it should probably reject attempts + to do that.) + +I have some early work (documentation) in the `maskremote` branch. """]]
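The rejection rule proposed in the comment above can be sketched as a small pure function. This is only an illustration under stated assumptions: the `Encryption` and `Underlying` types and the name `checkMask` are hypothetical, while the two error strings are the ones that appear in the `Remote/Mask.hs` diff in this feed.

```haskell
-- Hypothetical constructors, loosely named after the encryption= settings.
data Encryption = NoEncryption | Shared | Hybrid | PubKey | SharedPubKey
    deriving (Show, Eq)

-- Hypothetical: whether the remote under the mask is a git remote.
data Underlying = GitRemote | OtherRemote
    deriving (Show, Eq)

-- A mask must use encryption, and encryption=shared is refused on top of
-- a git remote, because pushing the git-annex branch there would make the
-- shared key available to anyone who can access the repository.
checkMask :: Encryption -> Underlying -> Either String ()
checkMask NoEncryption _ =
    Left "Must use encryption with a mask special remote."
checkMask Shared GitRemote =
    Left "It's not secure to use encryption=shared with a git remote."
checkMask _ _ = Right ()

main :: IO ()
main = do
    print (checkMask NoEncryption OtherRemote) -- rejected: no encryption
    print (checkMask Shared GitRemote)         -- rejected: key would be exposed
    print (checkMask Shared OtherRemote)       -- accepted
    print (checkMask PubKey GitRemote)         -- accepted
```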
sort special remote type list
diff --git a/doc/special_remotes.mdwn b/doc/special_remotes.mdwn index 0c4ff0131f..93c2ac297d 100644 --- a/doc/special_remotes.mdwn +++ b/doc/special_remotes.mdwn @@ -10,23 +10,23 @@ the content of files. * [[adb]] (for Android devices) * [[Amazon_Glacier|glacier]] * [[bittorrent]] +* [[borg]] * [[bup]] * [[compute]] * [[ddar]] * [[directory]] * [[gcrypt]] (encrypted git repositories!) * [[git-lfs]] +* [[git]] * [[hook]] +* [[httpalso]] +* [[rclone]] * [[rsync]] * [[S3]] (Amazon S3, and other compatible services) * [[tahoe]] * [[tor]] * [[web]] * [[webdav]] -* [[git]] -* [[httpalso]] -* [[borg]] -* [[rclone]] The above special remotes are built into git-annex, and can be used to tie git-annex into many cloud services.
comment
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment new file mode 100644 index 0000000000..4cc63688a7 --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_4_d19a6c42a6c4b0be270e1a1fe167631d._comment @@ -0,0 +1,44 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 4""" + date="2025-04-09T15:17:25Z" + content=""" +What I was talking about is still hypothetical. But I think it would be +fairly easy to implement. + +This would be a regular special remote, so it supports encryption=yes and +related settings as usual. When a file is stored to this special remote, it +would take the object (which would be encrypted if it were so configured), +and store it on the remote it is layered on top of. Retrieval would get +the object from the layered remote. And so on. + +That could probably be implemented outside git-annex as an external special +remote. It might be better to build it into git-annex, to allow +for better streaming of files through it. + +When used on top of a regular git remote, it would result in the remote +having `.git/annex/objects/` containing some encrypted keys. (It could also +contain un-encrypted keys stored in it as usual.) + +The proxy would not be needed to use it. A proxy is just another case +where a layered special remote could be useful, when the user wants +client-side encryption. + +A few gotchas I can see: + +* Running `git-annex unused against the repository storing those + encrypted keys would see them as unused. +* If the special remote did not use encryption, it would be possible + to get into situations where drop violates numcopies. Eg, a drop could + verify that the key being dropped from the special remote is present + in the remote it's layered on top of and so count it as a copy. 
+ But then dropping from the special remote would remove it from the + other remote. Probably the solution is for the special remote to require + encryption. +* If a file is stored on both this special remote and on the underlying remote, + that would count as 2 copies. But losing a single repository risks losing + both copies at once. Same problem if multiple of these special remotes + are set up all storing to the same underlying remote. I think this is + minor, because there would be 2 actual copies, just copies that happen to + be on the same drive. +"""]]
comment and close
diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn b/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn index dd7abb816a..fd669d3585 100644 --- a/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn @@ -12,3 +12,5 @@ Is this feasible? [[!meta author=cjmarkie]] [[!tag projects/openneuro]] + +> [[closing|done]] since the git config exists already. --[[Joey]] diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_5_88e846b1ba4b957584680d0d6c70220d._comment b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_5_88e846b1ba4b957584680d0d6c70220d._comment new file mode 100644 index 0000000000..14e84d8a0e --- /dev/null +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_5_88e846b1ba4b957584680d0d6c70220d._comment @@ -0,0 +1,9 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 5""" + date="2025-04-09T15:11:14Z" + content=""" +It's documented on the main git-annex man page with all other git configs. +The git-annex-config man page is for configs that can be stored in the +git-annex branch, which this one cannot be. +"""]]
Spoke with mih and he said this could be useful in the project context
diff --git a/doc/todo/generic_p2p_socket_transport.mdwn b/doc/todo/generic_p2p_socket_transport.mdwn index 6137b73a0b..d86bb9ed97 100644 --- a/doc/todo/generic_p2p_socket_transport.mdwn +++ b/doc/todo/generic_p2p_socket_transport.mdwn @@ -12,3 +12,5 @@ My understanding is that the current tor p2p support is essentially a special ca This should also make it possible to build e.g. a `git annex enable-yggstack` and `yggstack-annex::<pubkey>.pk.ygg` remote in terms of enable-p2p-socket and `p2p-annex::`, even outside of git-annex itself. What do you think? + +[[!tag projects/INM7]]
Added a comment
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_3_7ee9ffdf44d5917f5d6aba324e291cde._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_3_7ee9ffdf44d5917f5d6aba324e291cde._comment new file mode 100644 index 0000000000..1b1cc9fb5a --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_3_7ee9ffdf44d5917f5d6aba324e291cde._comment @@ -0,0 +1,22 @@ +[[!comment format=mdwn + username="msz" + avatar="http://cdn.libravatar.org/avatar/6e8b88e7c70d86f4cfd27d450958aed4" + subject="comment 3" + date="2025-04-07T17:06:15Z" + content=""" +I've had the opportunity to revisit this old question of mine, and I'd like to ask some further questions. + +I understand the preference for git+annex remotes to not support encryption. However, I'm not sure how to understand layering the special remote (in particular, in front of a proxy). + +Is this possible with the existing git-annex tooling, or hypothetical? As far as I understand, a proxy 1) has to be a git repo thus cannot have encryption enabled, and 2) can push to an encrypted special remote (which must be a different type than git). It cannot pull encrypted annex keys from one special remote and put them unmodified into another (especially not into an annex-supporting git remote), right? + +The (modified) Forgejo instances we use support git-annex, i.e. git remotes which do not ignore the annex and accept content pushes (I call that git+annex). AFAIK the internal layout of such a Forgejo repository is not different from a bare repository ([DataLad blog: forgejo-anexajo - behind the curtain](https://blog.datalad.org/posts/forgejo-aneksajo/#behind-the-curtain)). The goal would be to have the annex objects sent encrypted to a Forgejo instance, *inside* or *alongside* the git repository. It seems that we would need a \"layer\" sitting on top of a normal git remote - but I don't see what that layer could look like. 
+ +The best proxy set-up I came up with was a like this, with the encrypted remote behind the proxy (I'm using bare repository as the push target - not sure if Forgejo could be bent to our will like that): + +``` +local repository ----> (bare repository on a server) --proxy--> (directory special remote on the same server) + \"origin\" \"storage\" / proxied as \"origin-storage\" +``` +It worked, also with encryption, but the setup has limitations. First, encryption happens server-side. Second, only sharedpubkey encryption does not require private keys to be on the server -- in which case pushing to the proxied \"origin-storage\" works, but getting (necessarily) requires enabling \"storage\" locally. +"""]]
Added a comment: you're cool
diff --git a/doc/users/joey/comment_3_60af4cdc7429bdf3929072853099998b._comment b/doc/users/joey/comment_3_60af4cdc7429bdf3929072853099998b._comment new file mode 100644 index 0000000000..388cf1a14a --- /dev/null +++ b/doc/users/joey/comment_3_60af4cdc7429bdf3929072853099998b._comment @@ -0,0 +1,18 @@ +[[!comment format=mdwn + username="yannick" + avatar="http://cdn.libravatar.org/avatar/d60b72e0322e0661f7942823700d3374" + subject="you're cool" + date="2025-04-07T11:13:43Z" + content=""" +Hey Joey, + +Just wanted to leave you a message to say I think you're cool. I'm Gen Z and just recently I am diving into the world of Open Source and its culture. I just have a lot of fascination for people like you who do great, but really magnificent thing and stay so cool about it, not in the spotlights or whatever. Believe it or not, but I found out about `git annex` from ChatGPT when I asked how I can organize multiple hard drives that I have gather over the years with a lot of duplicate files on them. I want to go to film school you see and I want to organize all the footage I have so I can do cool things with it. + +Anyway it is always a breeze to come from something like ChatGPT to a corner of the internet that is so real. In a way it feels quite intimate to leave a message here on your user page of some thing that you created to explain something else you created! It gives me nostalgia to a time I never experienced. + +Looking forward to learn more about you and the things you do and to figure whether git annex is a fit for me. If it is then I will make sure to tell everybody I know about it because it all looks so cool to me already. + + +Warm regards, +Yannick +"""]]
Added a comment
diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_4_4ae2b9582fa0c122d1a1ef7c347ebd84._comment b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_4_4ae2b9582fa0c122d1a1ef7c347ebd84._comment new file mode 100644 index 0000000000..54b4607c80 --- /dev/null +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_4_4ae2b9582fa0c122d1a1ef7c347ebd84._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="cjmarkie" + avatar="http://cdn.libravatar.org/avatar/cfcc5894b1cec05bf76d200073dd8977" + subject="comment 4" + date="2025-04-04T01:14:55Z" + content=""" +Ah, thanks! Honestly, I'm okay with this being hard and scary-looking to do, as long as I can find it. I don't see `annex.verify` in `man git-annex-config`. Do those docs need refreshing? +"""]]
Support git remotes that use a IPV6 link-local address with a zone ID
Fixed 3 problems, and it seems to work now for both forms:
ssh://[fe80::7697:xxx:xxxx:xxxx%wlp3s0]/foo
fe80::7697:xxx:xxxx:xxxx%wlp3s0:foo
diff --git a/Annex/Ssh.hs b/Annex/Ssh.hs index 07519d0390..a0e4ff9319 100644 --- a/Annex/Ssh.hs +++ b/Annex/Ssh.hs @@ -351,10 +351,15 @@ hostport2socket host (Just port) = hostport2socket' $ fromSshHost host ++ "!" ++ show port hostport2socket' :: String -> OsPath hostport2socket' s - | length s > lengthofmd5s = toOsPath $ show $ md5 $ encodeBL s - | otherwise = toOsPath s + | length s' > lengthofmd5s = toOsPath $ show $ md5 $ encodeBL s' + | otherwise = toOsPath s' where lengthofmd5s = 32 + -- ssh parses the socket filename as a ControlPath, so it can + -- contain eg "%h". We don't want that here, and it's possible + -- for a hostname to itself contain a '%', eg a IPV6 link-local + -- address with a zone ID. + s' = filter (/= '%') s socket2lock :: OsPath -> OsPath socket2lock socket = socket <> lockExt diff --git a/CHANGELOG b/CHANGELOG index aa6dc06ea8..d172c1b897 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -11,6 +11,7 @@ git-annex (10.20250321) UNRELEASED; urgency=medium version of the annex.web-options config. * migrate: Fix --remove-size to work when a file is not present. Fixes reversion introduced in version 10.20231129. + * Support git remotes that use a IPV6 link-local address with a zone ID. 
-- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/Git/Remote.hs b/Git/Remote.hs index eb4d78e88d..b0a31314ce 100644 --- a/Git/Remote.hs +++ b/Git/Remote.hs @@ -103,9 +103,10 @@ parseRemoteLocation s knownurl repo = go urlstyle v = isURI (escapeURIString isUnescapedInURI v) -- git remotes can be written scp style -- [user@]host:dir -- but foo::bar is a git-remote-helper location instead + -- (although '::' can also be part of an IPV6 address) scpstyle v = ":" `isInfixOf` v && not ("//" `isInfixOf` v) - && not ("::" `isInfixOf` v) + && not ("::" `isInfixOf` (takeWhile (/= '[') v)) scptourl v = "ssh://" ++ host ++ slash dir where (host, dir) diff --git a/Git/Url.hs b/Git/Url.hs index ad0e61b648..af13f58391 100644 --- a/Git/Url.hs +++ b/Git/Url.hs @@ -36,9 +36,14 @@ uriRegName' a = fixup $ uriRegName a len = length rest - 1 fixup x = x -{- Hostname of an URL repo. -} +{- Hostname of an URL repo. + - + - An IPV6 link-local address in an url can include a + - scope, eg "%wlan0". The "%" is necessarily URI-encoded + - as "%25" in the URI. So the hostname gets URI-decoded here. + -} host :: Repo -> Maybe String -host = authpart uriRegName' +host = authpart (unEscapeString . uriRegName') {- Port of an URL repo, if it has a nonstandard one. -} port :: Repo -> Maybe Integer @@ -53,7 +58,7 @@ port r = hostuser :: Repo -> Maybe String hostuser r = (++) <$> authpart uriUserInfo r - <*> authpart uriRegName' r + <*> host r {- The full authority portion an URL repo. (ie, "user@host:port") -} authority :: Repo -> Maybe String diff --git a/doc/bugs/IPv6_link-local_address_as_remote.mdwn b/doc/bugs/IPv6_link-local_address_as_remote.mdwn index d2c43ba221..d56686d73f 100644 --- a/doc/bugs/IPv6_link-local_address_as_remote.mdwn +++ b/doc/bugs/IPv6_link-local_address_as_remote.mdwn @@ -31,3 +31,5 @@ There is no problem with global IPv6 addresses, so it is likely that the percent ### Have you had any luck using git-annex before? 
(Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) Yes, I have successfully used git-annex with a local remote (same computer), ssh over IPv4, and ssh to a globally visible IPv6 address. + +> [[fixed|done]] --[[Joey]]
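The two string-level checks in this change can be tried in isolation. The following is a minimal sketch, not the git-annex code itself: `isScpStyle` copies the three conditions of the patched `scpstyle`, and `stripPercent` copies the `filter` used for `s'` in `hostport2socket'`. Both function names and the address `fe80::1%eth0` are assumptions made for illustration.

```haskell
import Data.List (isInfixOf)

-- Same three conditions as the patched scpstyle in Git/Remote.hs:
-- the location contains ':', contains no "//", and has no "::"
-- before the first '[' (so a bracketed IPv6 address is not mistaken
-- for a git-remote-helper location such as foo::bar).
-- The name isScpStyle is hypothetical.
isScpStyle :: String -> Bool
isScpStyle v = ":" `isInfixOf` v
    && not ("//" `isInfixOf` v)
    && not ("::" `isInfixOf` takeWhile (/= '[') v)

-- Same filter as s' in hostport2socket': drop every '%' so that ssh
-- cannot interpret part of the socket filename as a ControlPath token.
-- The name stripPercent is hypothetical.
stripPercent :: String -> String
stripPercent = filter (/= '%')

-- In GHCi:
--   isScpStyle "host:dir"                  -- True
--   isScpStyle "foo::bar"                  -- False
--   isScpStyle "[fe80::1%eth0]:foo"        -- True
--   isScpStyle "ssh://[fe80::1%eth0]/foo"  -- False
--   stripPercent "fe80::1%eth0!22"         -- "fe80::1eth0!22"
```

The sketch leaves out the md5 fallback that `hostport2socket'` applies to names longer than 32 characters, and the URI decoding added to `host` in Git/Url.hs.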
comment
diff --git a/doc/tips/using_the_web_as_a_special_remote/comment_17_50ec9369b389f6dc78284b2a087b9821._comment b/doc/tips/using_the_web_as_a_special_remote/comment_17_50ec9369b389f6dc78284b2a087b9821._comment new file mode 100644 index 0000000000..773957c822 --- /dev/null +++ b/doc/tips/using_the_web_as_a_special_remote/comment_17_50ec9369b389f6dc78284b2a087b9821._comment @@ -0,0 +1,17 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: Special use case for Scientific application""" + date="2025-04-01T14:48:34Z" + content=""" +Sure it will work fine to have different versions of annexed files. +git-annex will know the url it can use to get whichever version is checked +out in the current git branch. + +As for an easier way, it's possible to use [[git-annex-import]] +with certain special remotes, which imports a tree of files from them, and +re-running it imports whatever files are new or changed. This needs a +special remote that supports it, and it would perhaps be possible to write +such a special remote for Zenodo. Dunno if it would be worth the work to +implement, but it may be worth seeing if Datalad could support that, if +you use Datalad. +"""]]
migrate: Fix --remove-size to work when a file is not present
5f74a45861357be2a3233ddbbdbe7f7b0cf1814e added this bug
diff --git a/CHANGELOG b/CHANGELOG index 44a0305bd3..aa6dc06ea8 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -9,6 +9,8 @@ git-annex (10.20250321) UNRELEASED; urgency=medium * httpalso: Windows url fix. * Added remote.name.annex-web-options config, which is a per-remote version of the annex.web-options config. + * migrate: Fix --remove-size to work when a file is not present. + Fixes reversion introduced in version 10.20231129. -- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/Command/Migrate.hs b/Command/Migrate.hs index a2dab7ab00..ffaabf7da6 100644 --- a/Command/Migrate.hs +++ b/Command/Migrate.hs @@ -90,7 +90,7 @@ start o ksha si file key = do newbackend <- chooseBackend file if (newbackend /= oldbackend || upgradableKey oldbackend || forced) && exists then go False oldbackend newbackend - else if cantweaksize newbackend oldbackend && exists + else if cantweaksize newbackend oldbackend exists then go True oldbackend newbackend else stop where @@ -101,10 +101,10 @@ start o ksha si file key = do starting "migrate" (mkActionItem (key, file)) si $ perform onlytweaksize o file key keyrec oldbackend newbackend - cantweaksize newbackend oldbackend + cantweaksize newbackend oldbackend exists | removeSize o = isJust (fromKey keySize key) | newbackend /= oldbackend = False - | isNothing (fromKey keySize key) = True + | isNothing (fromKey keySize key) && exists = True | otherwise = False upgradableKey oldbackend = maybe False (\a -> a key) (canUpgradeKey oldbackend) diff --git a/doc/bugs/migrate_--remove-size_does_nothing.mdwn b/doc/bugs/migrate_--remove-size_does_nothing.mdwn index 37d8582e34..8e8861dc55 100644 --- a/doc/bugs/migrate_--remove-size_does_nothing.mdwn +++ b/doc/bugs/migrate_--remove-size_does_nothing.mdwn @@ -92,4 +92,4 @@ git-annex version: 10.20250115 - +> [[fixed|done]] --[[Joey]] diff --git a/doc/bugs/migrate_--remove-size_does_nothing/comment_1_57e0c4a6b4183a9e08894a7c25b2efda._comment 
b/doc/bugs/migrate_--remove-size_does_nothing/comment_1_57e0c4a6b4183a9e08894a7c25b2efda._comment new file mode 100644 index 0000000000..5e62a0d4e8 --- /dev/null +++ b/doc/bugs/migrate_--remove-size_does_nothing/comment_1_57e0c4a6b4183a9e08894a7c25b2efda._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-04-01T14:39:24Z" + content=""" +I diagnose a bug introduced in +[[!commit 86dbe9a825b9c615c63e0cfc5e4a737a249f8989]] +that makes it only be able to remove the size if the object file is locally +present. Fixed. +"""]]
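The guard order in the patched `cantweaksize` is what lets `--remove-size` work without the object file. A minimal sketch of that decision as a pure function follows, assuming a flattened input record: `MigrateInput`, its field names, and `canTweakSize` are hypothetical stand-ins for the options, key, and backends that git-annex passes around, and the size 955950 is taken from the key in the bug report.

```haskell
import Data.Maybe (isJust, isNothing)

-- Hypothetical flattened inputs; in git-annex these come from the
-- command-line options, the key, and the old and new backends.
data MigrateInput = MigrateInput
    { removeSizeFlag :: Bool          -- was --remove-size given?
    , keySizeField   :: Maybe Integer -- size recorded in the key, if any
    , backendChanged :: Bool          -- newbackend /= oldbackend
    , contentPresent :: Bool          -- is the object file locally present?
    }

-- Same guard order as the patched cantweaksize: with --remove-size the
-- result depends only on whether the key has a size, so presence of
-- the content no longer matters; adding a missing size still needs
-- the content to be present.
canTweakSize :: MigrateInput -> Bool
canTweakSize i
    | removeSizeFlag i = isJust (keySizeField i)
    | backendChanged i = False
    | isNothing (keySizeField i) && contentPresent i = True
    | otherwise = False

-- In GHCi:
--   canTweakSize (MigrateInput True (Just 955950) False False)  -- True
--   canTweakSize (MigrateInput False Nothing False False)       -- False
--   canTweakSize (MigrateInput False Nothing False True)        -- True
```

The first result corresponds to the case from the bug report: a sized URL key whose content is not present can now have its size removed.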
comment
diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_3_92d978725f80a8f29986567f8fea5187._comment b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_3_92d978725f80a8f29986567f8fea5187._comment new file mode 100644 index 0000000000..a58ae228c1 --- /dev/null +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_3_92d978725f80a8f29986567f8fea5187._comment @@ -0,0 +1,26 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 3""" + date="2025-04-01T14:25:54Z" + content=""" +What you can recommend, which works already, is: + + git -c annex.verify=false annex get + +As to adding this to --fast, I think some would be surprised if --fast +allowed bad data to get into the repository. And commands like +`git-annex copy --to` that do support --fast already use it to avoid round +trip checks. It would not do to make --fast for those commands also avoid +verification. And `git-annex copy` is very close to `git-annex get`, to +the point that `git-annex get --from` is the same as `git-annex copy +--from`. + +So, I think it's better to keep this a separate option, and the -c option I +gave above works well enough I suppose. + +With that said, you're the second person asking about this in an HPC +context this week. I suspect maybe you and @mih were working on the same +problem in asking about this? Anyway, since you both seemed to have +difficulty finding the way to do this, maybe it would be worth making a +dedicated option like `--no-verify`. +"""]]
Added remote.name.annex-web-options config
Which is a per-remote version of the annex.web-options config.
Had to plumb RemoteGitConfig through to getUrlOptions. In cases where a
special remote does not use curl, there was no need to do that and I used
Nothing instead.
In the case of the addurl and importfeed commands, it seemed best to say
that running these commands is not using the web special remote per se,
so the config is not used for those commands.
diff --git a/Annex/Url.hs b/Annex/Url.hs index 795b4b7b97..1cc742f522 100644 --- a/Annex/Url.hs +++ b/Annex/Url.hs @@ -56,8 +56,8 @@ getUserAgent :: Annex U.UserAgent getUserAgent = Annex.getRead $ fromMaybe defaultUserAgent . Annex.useragent -getUrlOptions :: Annex U.UrlOptions -getUrlOptions = Annex.getState Annex.urloptions >>= \case +getUrlOptions :: Maybe RemoteGitConfig -> Annex U.UrlOptions +getUrlOptions mgc = Annex.getState Annex.urloptions >>= \case Just uo -> return uo Nothing -> do uo <- mk @@ -81,10 +81,15 @@ getUrlOptions = Annex.getState Annex.urloptions >>= \case >>= \case Just output -> pure (lines output) Nothing -> annexHttpHeaders <$> Annex.getGitConfig + + getweboptions = case mgc of + Just gc | not (null (remoteAnnexWebOptions gc)) -> + pure (remoteAnnexWebOptions gc) + _ -> annexWebOptions <$> Annex.getGitConfig checkallowedaddr = words . annexAllowedIPAddresses <$> Annex.getGitConfig >>= \case ["all"] -> do - curlopts <- map Param . annexWebOptions <$> Annex.getGitConfig + curlopts <- map Param <$> getweboptions allowedurlschemes <- annexAllowedUrlSchemes <$> Annex.getGitConfig let urldownloader = if null curlopts && not (any (`S.notMember` U.conduitUrlSchemes) allowedurlschemes) then U.DownloadWithConduit $ @@ -148,8 +153,8 @@ ipAddressesUnlimited :: Annex Bool ipAddressesUnlimited = ("all" == ) . annexAllowedIPAddresses <$> Annex.getGitConfig -withUrlOptions :: (U.UrlOptions -> Annex a) -> Annex a -withUrlOptions a = a =<< getUrlOptions +withUrlOptions :: Maybe RemoteGitConfig -> (U.UrlOptions -> Annex a) -> Annex a +withUrlOptions mgc a = a =<< getUrlOptions mgc -- When downloading an url, if authentication is needed, uses -- git-credential to prompt for username and password. @@ -157,10 +162,10 @@ withUrlOptions a = a =<< getUrlOptions -- Note that, when the downloader is curl, it will not use git-credential. -- If the user wants to, they can configure curl to use a netrc file that -- handles authentication. 
-withUrlOptionsPromptingCreds :: (U.UrlOptions -> Annex a) -> Annex a -withUrlOptionsPromptingCreds a = do +withUrlOptionsPromptingCreds :: Maybe RemoteGitConfig -> (U.UrlOptions -> Annex a) -> Annex a +withUrlOptionsPromptingCreds mgc a = do g <- Annex.gitRepo - uo <- getUrlOptions + uo <- getUrlOptions mgc prompter <- mkPrompter cc <- Annex.getRead Annex.gitcredentialcache a $ uo diff --git a/Annex/YoutubeDl.hs b/Annex/YoutubeDl.hs index 60245eec9d..722823b60b 100644 --- a/Annex/YoutubeDl.hs +++ b/Annex/YoutubeDl.hs @@ -74,7 +74,7 @@ youtubeDlNotAllowedMessage = unwords -- <https://github.com/rg3/youtube-dl/issues/14864>) youtubeDl :: URLString -> OsPath -> MeterUpdate -> Annex (Either String (Maybe OsPath)) youtubeDl url workdir p = ifM ipAddressesUnlimited - ( withUrlOptions $ youtubeDl' url workdir p + ( withUrlOptions Nothing $ youtubeDl' url workdir p , return $ Left youtubeDlNotAllowedMessage ) @@ -194,7 +194,7 @@ youtubeDlTo key url dest p = do -- without it. So, this first downloads part of the content and checks -- if it's a html page; only then is youtube-dl used. htmlOnly :: URLString -> a -> Annex a -> Annex a -htmlOnly url fallback a = withUrlOptions $ \uo -> +htmlOnly url fallback a = withUrlOptions Nothing $ \uo -> liftIO (downloadPartial url uo htmlPrefixLength) >>= \case Just bs | isHtmlBs bs -> a _ -> return fallback @@ -202,7 +202,7 @@ htmlOnly url fallback a = withUrlOptions $ \uo -> -- Check if youtube-dl supports downloading content from an url. youtubeDlSupported :: URLString -> Annex Bool youtubeDlSupported url = either (const False) id - <$> withUrlOptions (youtubeDlCheck' url) + <$> withUrlOptions Nothing (youtubeDlCheck' url) -- Check if youtube-dl can find media in an url. -- @@ -211,7 +211,7 @@ youtubeDlSupported url = either (const False) id -- download won't succeed. 
youtubeDlCheck :: URLString -> Annex (Either String Bool) youtubeDlCheck url = ifM youtubeDlAllowed - ( withUrlOptions $ youtubeDlCheck' url + ( withUrlOptions Nothing $ youtubeDlCheck' url , return $ Left youtubeDlNotAllowedMessage ) @@ -227,7 +227,7 @@ youtubeDlCheck' url uo -- -- (This is not always identical to the filename it uses when downloading.) youtubeDlFileName :: URLString -> Annex (Either String OsPath) -youtubeDlFileName url = withUrlOptions go +youtubeDlFileName url = withUrlOptions Nothing go where go uo | supportedScheme uo url = flip catchIO (pure . Left . show) $ @@ -238,7 +238,7 @@ youtubeDlFileName url = withUrlOptions go -- Does not check if the url contains htmlOnly; use when that's already -- been verified. youtubeDlFileNameHtmlOnly :: URLString -> Annex (Either String OsPath) -youtubeDlFileNameHtmlOnly = withUrlOptions . youtubeDlFileNameHtmlOnly' +youtubeDlFileNameHtmlOnly = withUrlOptions Nothing . youtubeDlFileNameHtmlOnly' youtubeDlFileNameHtmlOnly' :: URLString -> UrlOptions -> Annex (Either String OsPath) youtubeDlFileNameHtmlOnly' url uo diff --git a/Assistant/Upgrade.hs b/Assistant/Upgrade.hs index 9f82e4fdc6..ca6d5b3ada 100644 --- a/Assistant/Upgrade.hs +++ b/Assistant/Upgrade.hs @@ -324,7 +324,7 @@ usingDistribution = isJust <$> getEnv "GIT_ANNEX_STANDLONE_ENV" downloadDistributionInfo :: Assistant (Maybe GitAnnexDistribution) downloadDistributionInfo = do - uo <- liftAnnex Url.getUrlOptions + uo <- liftAnnex $ Url.getUrlOptions Nothing gpgcmd <- liftAnnex $ gpgCmd <$> Annex.getGitConfig liftIO $ withTmpDir (literalOsPath "git-annex.tmp") $ \tmpdir -> do let infof = tmpdir </> literalOsPath "info" diff --git a/Assistant/WebApp/Configurators/IA.hs b/Assistant/WebApp/Configurators/IA.hs index 1b2d05e6e2..3818ad7fbb 100644 --- a/Assistant/WebApp/Configurators/IA.hs +++ b/Assistant/WebApp/Configurators/IA.hs @@ -179,7 +179,7 @@ escapeHeader = escapeURIString (\c -> isUnescapedInURI c && c /= ' ') getRepoInfo :: RemoteConfig -> Widget 
getRepoInfo c = do - uo <- liftAnnex Url.getUrlOptions + uo <- liftAnnex $ Url.getUrlOptions Nothing urlexists <- liftAnnex $ catchDefaultIO False $ Url.exists url uo [whamlet| <a href="#{url}"> diff --git a/CHANGELOG b/CHANGELOG index b2ac6a0a44..44a0305bd3 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -7,6 +7,8 @@ git-annex (10.20250321) UNRELEASED; urgency=medium * fsck: Avoid complaining about required content of dead repositories. * drop: Avoid redundant object directory thawing. * httpalso: Windows url fix. + * Added remote.name.annex-web-options config, which is a per-remote + version of the annex.web-options config. -- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/CmdLine/GitRemoteAnnex.hs b/CmdLine/GitRemoteAnnex.hs index beacd137a3..d83be209ce 100644 --- a/CmdLine/GitRemoteAnnex.hs +++ b/CmdLine/GitRemoteAnnex.hs @@ -496,7 +496,7 @@ parseSpecialRemoteUrl url remotename = case parseURI url of resolveSpecialRemoteWebUrl :: String -> Annex (Maybe String) resolveSpecialRemoteWebUrl url | "http://" `isPrefixOf` lcurl || "https://" `isPrefixOf` lcurl = - Url.withUrlOptionsPromptingCreds $ \uo -> + Url.withUrlOptionsPromptingCreds Nothing $ \uo -> withTmpFile (literalOsPath "git-remote-annex") $ \tmp h -> do liftIO $ hClose h Url.download' nullMeterUpdate Nothing url tmp uo >>= \case diff --git a/Command/AddUrl.hs b/Command/AddUrl.hs index d81628e6b8..ac825fc409 100644 --- a/Command/AddUrl.hs +++ b/Command/AddUrl.hs @@ -251,7 +251,7 @@ startWeb addunlockedmatcher o si urlstring = go $ fromMaybe bad $ parseURIPortab go url = startingAddUrl si urlstring o $ if relaxedOption (downloadOptions o) then go' url Url.assumeUrlExists - else Url.withUrlOptions (Url.getUrlInfo urlstring) >>= \case + else Url.withUrlOptions Nothing (Url.getUrlInfo urlstring) >>= \case Right urlinfo -> go' url urlinfo Left err -> do warning (UnquotedString err) @@ -352,7 +352,8 @@ downloadWeb addunlockedmatcher o url urlinfo file = go =<< downloadWith' downloader urlkey 
webUUID url file where urlkey = addSizeUrlKey urlinfo $ Backend.URL.fromUrl url Nothing (verifiableOption o) - downloader f p = Url.withUrlOptions $ downloadUrl False urlkey p Nothing [url] f + downloader f p = Url.withUrlOptions Nothing $ + downloadUrl False urlkey p Nothing [url] f go Nothing = return Nothing go (Just (tmp, backend)) = ifM (useYoutubeDl o <&&> liftIO (isHtmlFile tmp)) ( tryyoutubedl tmp backend diff --git a/Command/ImportFeed.hs b/Command/ImportFeed.hs index df1537fb65..613c9dd0f8 100644 --- a/Command/ImportFeed.hs +++ b/Command/ImportFeed.hs @@ -268,7 +268,7 @@ findDownloads u f = catMaybes $ map mk (feedItems f) downloadFeed :: URLString -> FilePath -> Annex Bool downloadFeed url f | Url.parseURIRelaxed url == Nothing = giveup "invalid feed url" - | otherwise = Url.withUrlOptions $ + | otherwise = Url.withUrlOptions Nothing $ (Diff truncated)
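The selection rule that `getweboptions` adds in Annex/Url.hs can be sketched without the `Annex` monad: use the per-remote options when a remote config is supplied and its option list is non-empty, otherwise fall back to the global `annex.web-options`. In the sketch below, `RemoteCfg`, `GlobalCfg`, and `chooseWebOptions` are hypothetical stand-ins for `RemoteGitConfig`, `GitConfig`, and the monadic code, and the curl options in the examples are only illustrative values.

```haskell
-- Hypothetical stand-in for git-annex's RemoteGitConfig.
newtype RemoteCfg = RemoteCfg { remoteWebOptions :: [String] }

-- Hypothetical stand-in for git-annex's GitConfig.
newtype GlobalCfg = GlobalCfg { globalWebOptions :: [String] }

-- Per-remote options win only when a remote config was passed in and
-- it actually sets some options; callers that pass Nothing (as addurl
-- and importfeed do in this commit) always get the global options.
chooseWebOptions :: Maybe RemoteCfg -> GlobalCfg -> [String]
chooseWebOptions (Just rc) _
    | not (null (remoteWebOptions rc)) = remoteWebOptions rc
chooseWebOptions _ g = globalWebOptions g

-- In GHCi:
--   chooseWebOptions (Just (RemoteCfg ["--netrc"])) (GlobalCfg ["--limit-rate=1M"])
--     -- ["--netrc"]
--   chooseWebOptions (Just (RemoteCfg [])) (GlobalCfg ["--limit-rate=1M"])
--     -- ["--limit-rate=1M"]
--   chooseWebOptions Nothing (GlobalCfg ["--limit-rate=1M"])
--     -- ["--limit-rate=1M"]
```

Under this rule, an empty per-remote setting behaves the same as an unset one.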
response
diff --git a/doc/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch/comment_1_57b1cdf70619440f7db450a48cf3f558._comment b/doc/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch/comment_1_57b1cdf70619440f7db450a48cf3f558._comment new file mode 100644 index 0000000000..542e45e6b3 --- /dev/null +++ b/doc/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch/comment_1_57b1cdf70619440f7db450a48cf3f558._comment @@ -0,0 +1,21 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-04-01T13:07:03Z" + content=""" +Adding a large file to git just because the git-annex branch is currently +checked out seems like it would be a large footbomb. That is generally +harder to recover from than adding a file to the annex and then realizing +it needs to be added to git instead. + +Since git generally allows switching branches with new files +staged. It would be entirely reasonable to check out the git-annex branch +after adding a new annexed file but before committing it. +And checking out the git-annex branch, `git-annex add` of a large file +without committing it, then switching back to the main branch and committing +there is also possible if someone wants to do that for some reason. + +Since manual commits to the git-annex branch need extra steps anyway +(eg removing .git/annex/index or committing using it instead of the usual +index file), I don't see much point in refining it. +"""]]
fix name of config
diff --git a/doc/special_remotes/web.mdwn b/doc/special_remotes/web.mdwn index 134d547065..08ace320e5 100644 --- a/doc/special_remotes/web.mdwn +++ b/doc/special_remotes/web.mdwn @@ -39,4 +39,4 @@ can't be downloaded from "web" (or some other remote) will it fall back to downloading from slowweb. git-annex initremote --sameas=web slowweb type=web urlinclude='*//slowhost.com/*' - git config remote.slowweb.cost 300 + git config remote.slowweb.annex-cost 300
diff --git a/doc/forum/Authentication_for_URL_downloads.mdwn b/doc/forum/Authentication_for_URL_downloads.mdwn new file mode 100644 index 0000000000..18775e014a --- /dev/null +++ b/doc/forum/Authentication_for_URL_downloads.mdwn @@ -0,0 +1,5 @@ +A while ago, I added a bunch of files from archive.org to my repository, using `git annex addurl --fast`. This worked fine. Unfortunately, since then the relevant archive has been marked access-restricted, meaning you need to log in to archive.org to download it. + +archive.org uses the now-standard Web authentication method of going to a login page and setting an authentication cookie. This is, of course, hard to automate. However, I can just log in from the browser and then export those cookies; I use [curlfire](https://github.com/talwrii/curlfire) for this. + +The problem is there isn't apparently any way to tell git-annex to do this. The old `annex.web-download-command` is apparently defunct; the new `web-options` doesn't let you change what program to use, only pass options to curl. What's the best way to handle this?
Added a comment
diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_2_94f9971fc4ed845e3dc84c841eb0ea3b._comment b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_2_94f9971fc4ed845e3dc84c841eb0ea3b._comment new file mode 100644 index 0000000000..b987ed9893 --- /dev/null +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_2_94f9971fc4ed845e3dc84c841eb0ea3b._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="cjmarkie" + avatar="http://cdn.libravatar.org/avatar/cfcc5894b1cec05bf76d200073dd8977" + subject="comment 2" + date="2025-03-28T13:49:12Z" + content=""" +:facepalm: `Command/Get.hs` +"""]]
Added a comment
diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_1_7096443a8d28a0d47416f4ff6a419552._comment b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_1_7096443a8d28a0d47416f4ff6a419552._comment new file mode 100644 index 0000000000..b9a6ef5891 --- /dev/null +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__/comment_1_7096443a8d28a0d47416f4ff6a419552._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="cjmarkie" + avatar="http://cdn.libravatar.org/avatar/cfcc5894b1cec05bf76d200073dd8977" + subject="comment 1" + date="2025-03-28T13:44:13Z" + content=""" +By the way, I'm happy to try my hand at contributing if I can get a couple pointers to where to start work. It's been a while since I wrote Haskell, so following execution paths is rusty. I've found the `Annex.fast` setting, but the `get` command implementation is harder to grep for. +"""]]
migrate --remove-size does not work
diff --git a/doc/bugs/migrate_--remove-size_does_nothing.mdwn b/doc/bugs/migrate_--remove-size_does_nothing.mdwn new file mode 100644 index 0000000000..37d8582e34 --- /dev/null +++ b/doc/bugs/migrate_--remove-size_does_nothing.mdwn @@ -0,0 +1,95 @@ +### Please describe the problem. + +<details> +<summary>I have a file with a key which was missing yt: prefix, so I added it manually</summary> + +``` +$> git show git-annex +commit 318da631abbd7562de52dd4e51fdcc6df01c3622 (git-annex) +Author: Yaroslav Halchenko <debian@onerussian.com> +Date: Fri Mar 28 07:27:33 2025 -0400 + + [DATALAD RUNCMD] Manually trying to fixup one URL so it uses yt-dlp + + refs: + - https://git-annex.branchable.com/bugs/tries_to_download_a_.mkv_video_without_yt-dlp/ + - https://git-annex.branchable.com/todo/yt-dlp__58___parse__47__handle___40__error__41_____34__Video_unavailable__34__/ + + note that key also does not have yt: prefix as those which would be going through yt: + but I think this is irrelevant here for our purposes. Just something to remember to not rely on + + Note: it was not actually "datalad run" committed due to + https://git-annex.branchable.com/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch/?updated + + I just reused the record etc + + === Do not change lines below === + { + "chain": [], + "cmd": "sed -i -e 's,^[0-9]*s\\(.*\\)https:,1743156234s\\1yt:https:,g' 'e9f/464/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE.log.web'", + "exit": 0, + "extra_inputs": [], + "inputs": [], + "outputs": [], + "pwd": "." 
+ } + ^^^ Do not change lines above ^^^ + +diff --git a/e9f/464/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE.log.web b/e9f/464/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE.log.web +index 331362218..e63cffe5d 100644 +--- a/e9f/464/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE.log.web ++++ b/e9f/464/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE.log.web +@@ -1 +1 @@ +-1742983318s 1 https://www.youtube.com/watch?v=hWwtFQLntbE ++1743156234s 1 yt:https://www.youtube.com/watch?v=hWwtFQLntbE + +``` +</details> + +but then `git annex get` would still fail, since verification fails, likely due to the key also having size in it. So I had to migrate to URL backend without size, as the bible teaches us: + +```shell +$> git annex help migrate +... + One use of this option is to convert URL keys that were added by git-annex addurl --fast to ones that would have been added if that command was run with the --relaxed option. Eg: + + git-annex migrate --remove-size --backend=URL somefile +``` + +but no luck -- nothing is done for that file/key: + + +```shell +$> git status +On branch master +Your branch is ahead of 'origin/master' by 1 commit. 
+ (use "git push" to publish your local commits) + +nothing to commit, working tree clean + +$> ls -ld Чат_рулетка/2025-03-26-_Что_у_россиянок_в_головах____чат_рулетка.mkv +lrwxrwxrwx 1 yoh datalad 151 Mar 26 06:01 Чат_рулетка/2025-03-26-_Что_у_россиянок_в_головах____чат_рулетка.mkv -> ../.git/annex/objects/p9/Km/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE + +$> git annex --debug migrate --remove-size --backend=URL Чат_рулетка/2025-03-26-_Что_у_россиянок_в_головах____чат_рулетка.mkv +[2025-03-28 08:09:40.732628384] (Utility.Process) process [1262316] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","ls-files","--stage","-z","--error-unmatch","--","\1063\1072\1090_\1088\1091\1083\1077\1090\1082\1072/2025-03-26-_\1063\1090\1086_\1091_\1088\1086\1089\1089\1080\1103\1085\1086\1082_\1074_\1075\1086\1083\1086\1074\1072\1093____\1095\1072\1090_\1088\1091\1083\1077\1090\1082\1072.mkv"] +[2025-03-28 08:09:40.733804218] (Utility.Process) process [1262317] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)","--buffer"] +[2025-03-28 08:09:40.73435116] (Utility.Process) process [1262318] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"] +[2025-03-28 08:09:40.738110034] (Utility.Process) process [1262318] done ExitSuccess +[2025-03-28 08:09:40.738197257] (Utility.Process) process [1262317] done ExitSuccess +[2025-03-28 08:09:40.738250358] (Utility.Process) process [1262316] done ExitSuccess +[2025-03-28 08:09:40.738775092] (Utility.Process) process [1262319] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","mktree","--missing","--batch","-z"] +[2025-03-28 08:09:40.739138127] (Utility.Process) process 
[1262320] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","mktree","--missing","--batch","-z"] +[2025-03-28 08:09:40.741935559] (Utility.Process) process [1262320] done ExitSuccess +[2025-03-28 08:09:40.74216456] (Utility.Process) process [1262319] done ExitSuccess + +$> ls -ld Чат_рулетка/2025-03-26-_Что_у_россиянок_в_головах____чат_рулетка.mkv +lrwxrwxrwx 1 yoh datalad 151 Mar 26 06:01 Чат_рулетка/2025-03-26-_Что_у_россиянок_в_головах____чат_рулетка.mkv -> ../.git/annex/objects/p9/Km/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE/URL-s955950--https&c%%www.youtube.com%watch,63v,61hWwtFQLntbE + +$> git annex version | head -n 1 +git-annex version: 10.20250115 +``` + + + + +
issue about annexing while under git-annex branch
diff --git a/doc/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch.mdwn b/doc/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch.mdwn new file mode 100644 index 0000000000..22237e24c9 --- /dev/null +++ b/doc/bugs/git-annex_should_not___34__annex__34___in_git-annex_branch.mdwn @@ -0,0 +1,54 @@ +### Please describe the problem. + +Using `git annex add` while in `git-annex` branch would make git-annex add to annex not to git. + +There might be cases where direct manipulation within git-annex branch is desired. (in my case was to workaround [missing yt: prefix](https://git-annex.branchable.com/bugs/tries_to_download_a_.mkv_video_without_yt-dlp/)). Then I wanted to use generic tools (`sed` + `datalad run`) but was surprised that we got content added to annex while within git-annex branch. I do not see when/why that could potentially be useful (but I might be wrong?!) + +Sure could be argued to be "operator error" but it is more of question of assumptions and automations -- should all tools around git-annex guard for that? + +I think `annex add` should avoid annexing, and just do `git add` while under its dedicated `git-annex` branch -- after all it would only be git-annex which would know how special this (or any other) branch for it. + +### What steps will reproduce the problem? + +``` +❯ git clone -b git-annex https://github.com/OpenNeuroDatasets/ds000003; builtin cd ds000003; echo 123 > 123; git annex add 123; git commit -m 123 123; git show + +Cloning into 'ds000003'... +remote: Enumerating objects: 1121, done. +remote: Counting objects: 100% (29/29), done. +remote: Compressing objects: 100% (19/19), done. +remote: Total 1121 (delta 18), reused 10 (delta 10), pack-reused 1092 (from 1) +Receiving objects: 100% (1121/1121), 92.35 KiB | 331.00 KiB/s, done. +Resolving deltas: 100% (223/223), done. 
+ + Remote origin not usable by git-annex; setting annex-ignore + + https://github.com/OpenNeuroDatasets/ds000003/config download failed: Not Found +add 123 +ok +(recording state in git...) +[git-annex cbbce8c] 123 + 1 file changed, 1 insertion(+) + create mode 120000 123 +commit cbbce8c1d594f9e675eb6784111ee2e7926bc6ec (HEAD -> git-annex) +Author: Yaroslav Halchenko <debian@onerussian.com> +Date: Fri Mar 28 07:21:14 2025 -0400 + + 123 + +diff --git a/123 b/123 +new file mode 120000 +index 0000000..0f13084 +--- /dev/null ++++ b/123 +@@ -0,0 +1 @@ ++.git/annex/objects/G6/qW/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b/SHA256E-s4--181210f8f9c779c26da1d9b2075bde0127302ee0e3fca38c9a83f5b1dd8e5d3b +``` + +### What version of git-annex are you using? On what operating system? + +``` +❯ git annex version +git-annex version: 10.20250115 +... +```
Added a comment
diff --git a/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__/comment_2_68d84a711472a57110cc4aba270aaf1d._comment b/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__/comment_2_68d84a711472a57110cc4aba270aaf1d._comment new file mode 100644 index 0000000000..19bfc2a02b --- /dev/null +++ b/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__/comment_2_68d84a711472a57110cc4aba270aaf1d._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="jnkl" + avatar="http://cdn.libravatar.org/avatar/2ab576f3bf2e0d96b1ee935bb7f33dbe" + subject="comment 2" + date="2025-03-28T10:27:25Z" + content=""" +Thanks for this solution. I am still looking for a general solution. +"""]]
removed
diff --git a/doc/bugs/httpalso_windows_URL_errors/comment_4_3eee9c6c2ab1d0ae62c56b53f0d4118c._comment b/doc/bugs/httpalso_windows_URL_errors/comment_4_3eee9c6c2ab1d0ae62c56b53f0d4118c._comment deleted file mode 100644 index 121d6ff7e0..0000000000 --- a/doc/bugs/httpalso_windows_URL_errors/comment_4_3eee9c6c2ab1d0ae62c56b53f0d4118c._comment +++ /dev/null @@ -1,9 +0,0 @@ -[[!comment format=mdwn - username="Basile.Pinsard" - avatar="http://cdn.libravatar.org/avatar/87e1f73acf277ad0337b90fc0253c62e" - subject="thanks" - date="2025-03-27T18:39:58Z" - content=""" -Thanks for the fix, I will try that when I get the time and when that release gets built on https://github.com/datalad/git-annex/ so that I can use my existing action. - -"""]]
Added a comment: Ignore .crdownload?
diff --git a/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__/comment_1_c250ce993e125dc51a77b76db11a7bec._comment b/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__/comment_1_c250ce993e125dc51a77b76db11a7bec._comment new file mode 100644 index 0000000000..1096a2466c --- /dev/null +++ b/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__/comment_1_c250ce993e125dc51a77b76db11a7bec._comment @@ -0,0 +1,12 @@ +[[!comment format=mdwn + username="cjmarkie" + avatar="http://cdn.libravatar.org/avatar/cfcc5894b1cec05bf76d200073dd8977" + subject="Ignore .crdownload?" + date="2025-03-27T20:18:03Z" + content=""" +I don't use the assistant, but you should be able to add the following line to your `.gitignore` file: + +```gitignore +*.crdownload +``` +"""]]
Propose skipping checksums in git-annex get --fast
diff --git a/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn b/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn new file mode 100644 index 0000000000..dd7abb816a --- /dev/null +++ b/doc/todo/Disable_checksum_with___96__get_--fast__96__.mdwn @@ -0,0 +1,14 @@ +In ["--fast" option for git annex get?](https://git-annex.branchable.com/forum/__34__--fast__34___option_for_git_annex_get__63__/) it was indicated that `git annex get --fast` doesn't have any effect. + +In an HPC context, users are frequently expected to use login (or dedicated data transfer) nodes for data transfer, and can get their sessions killed for excessive CPU use. For OpenNeuro, the high bandwidth between many HPC centers and S3 means that checksums can become the bottleneck in data transfer. I would like to be able to recommend something like: + +```console +git annex get -f s3-PUBLIC --fast --all +srun git annex fsck +``` + +Is this feasible? + + +[[!meta author=cjmarkie]] +[[!tag projects/openneuro]]
removed
diff --git a/doc/bugs/httpalso_windows_URL_errors/comment_3_e0dcb1a7a4cdb1b7e443d590415b4eaf._comment b/doc/bugs/httpalso_windows_URL_errors/comment_3_e0dcb1a7a4cdb1b7e443d590415b4eaf._comment deleted file mode 100644 index 79c3dc1c68..0000000000 --- a/doc/bugs/httpalso_windows_URL_errors/comment_3_e0dcb1a7a4cdb1b7e443d590415b4eaf._comment +++ /dev/null @@ -1,9 +0,0 @@ -[[!comment format=mdwn - username="Basile.Pinsard" - avatar="http://cdn.libravatar.org/avatar/87e1f73acf277ad0337b90fc0253c62e" - subject="comment 3" - date="2025-03-27T18:39:44Z" - content=""" -Thanks for the fix, I will try that when I get the time and when that release gets built on https://github.com/datalad/git-annex/ so that I can use my existing action. - -"""]]
Added a comment: thanks
diff --git a/doc/bugs/httpalso_windows_URL_errors/comment_4_3eee9c6c2ab1d0ae62c56b53f0d4118c._comment b/doc/bugs/httpalso_windows_URL_errors/comment_4_3eee9c6c2ab1d0ae62c56b53f0d4118c._comment new file mode 100644 index 0000000000..121d6ff7e0 --- /dev/null +++ b/doc/bugs/httpalso_windows_URL_errors/comment_4_3eee9c6c2ab1d0ae62c56b53f0d4118c._comment @@ -0,0 +1,9 @@ +[[!comment format=mdwn + username="Basile.Pinsard" + avatar="http://cdn.libravatar.org/avatar/87e1f73acf277ad0337b90fc0253c62e" + subject="thanks" + date="2025-03-27T18:39:58Z" + content=""" +Thanks for the fix, I will try that when I get the time and when that release gets built on https://github.com/datalad/git-annex/ so that I can use my existing action. + +"""]]
Added a comment
diff --git a/doc/bugs/httpalso_windows_URL_errors/comment_3_e0dcb1a7a4cdb1b7e443d590415b4eaf._comment b/doc/bugs/httpalso_windows_URL_errors/comment_3_e0dcb1a7a4cdb1b7e443d590415b4eaf._comment new file mode 100644 index 0000000000..79c3dc1c68 --- /dev/null +++ b/doc/bugs/httpalso_windows_URL_errors/comment_3_e0dcb1a7a4cdb1b7e443d590415b4eaf._comment @@ -0,0 +1,9 @@ +[[!comment format=mdwn + username="Basile.Pinsard" + avatar="http://cdn.libravatar.org/avatar/87e1f73acf277ad0337b90fc0253c62e" + subject="comment 3" + date="2025-03-27T18:39:44Z" + content=""" +Thanks for the fix, I will try that when I get the time and when that release gets built on https://github.com/datalad/git-annex/ so that I can use my existing action. + +"""]]
Added a comment
diff --git a/doc/bugs/httpalso_windows_URL_errors/comment_2_fc2a435ec4336393bf93db778e6a6ae7._comment b/doc/bugs/httpalso_windows_URL_errors/comment_2_fc2a435ec4336393bf93db778e6a6ae7._comment new file mode 100644 index 0000000000..3c05437a77 --- /dev/null +++ b/doc/bugs/httpalso_windows_URL_errors/comment_2_fc2a435ec4336393bf93db778e6a6ae7._comment @@ -0,0 +1,9 @@ +[[!comment format=mdwn + username="Basile.Pinsard" + avatar="http://cdn.libravatar.org/avatar/87e1f73acf277ad0337b90fc0253c62e" + subject="comment 2" + date="2025-03-27T18:39:31Z" + content=""" +Thanks for the fix, I will try that when I get the time and when that release gets built on https://github.com/datalad/git-annex/ so that I can use my existing action. + +"""]]
diff --git a/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__.mdwn b/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__.mdwn new file mode 100644 index 0000000000..e9e94c1636 --- /dev/null +++ b/doc/forum/tell_assistant_to_wait_5_mins_before_commiting__63__.mdwn @@ -0,0 +1,3 @@ +Git annex assistant commited a downloaded file still named XXXXX.crdownload which was created by chrome before saving the file with a correct name. + +Is it possible to tell assistant to only auto-commit files which were not changed for 5 minutes?
Added a comment
diff --git a/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__/comment_2_7d30da60993a2a486a1434d8563086cf._comment b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__/comment_2_7d30da60993a2a486a1434d8563086cf._comment new file mode 100644 index 0000000000..f27f3be4e4 --- /dev/null +++ b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__/comment_2_7d30da60993a2a486a1434d8563086cf._comment @@ -0,0 +1,13 @@ +[[!comment format=mdwn + username="Atemu" + avatar="http://cdn.libravatar.org/avatar/6ac9c136a74bb8760c66f422d3d6dc32" + subject="comment 2" + date="2025-03-26T17:04:23Z" + content=""" +Please tell me if I'm getting this right: + +- `--incremental` starts a *new* incremental fsck, regardless of whether there was a previously interrupted fsck or not +- `--more` also starts a new incremental fsck but only if there wasn't a previously interrupted incremental fsck in which case it resumes that to the end + +So if I do `git-annex fsck --more` on a repo without previously interrupted incremental fsck, it is effectively the same as `--incremental`? +"""]]
httpalso: Windows url fix
diff --git a/CHANGELOG b/CHANGELOG index 3202d1afe5..b2ac6a0a44 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -6,6 +6,7 @@ git-annex (10.20250321) UNRELEASED; urgency=medium * Fix build without the assistant. * fsck: Avoid complaining about required content of dead repositories. * drop: Avoid redundant object directory thawing. + * httpalso: Windows url fix. -- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/Remote/HttpAlso.hs b/Remote/HttpAlso.hs index de0d9e4c09..d6ccf15c13 100644 --- a/Remote/HttpAlso.hs +++ b/Remote/HttpAlso.hs @@ -1,6 +1,6 @@ {- HttpAlso remote (readonly). - - - Copyright 2020-2023 Joey Hess <id@joeyh.name> + - Copyright 2020-2025 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. -} @@ -24,6 +24,7 @@ import Utility.Metered import Annex.Verify import qualified Annex.Url as Url import Annex.SpecialRemote.Config +import Git.FilePath import Data.Either import qualified Data.Map as M @@ -228,5 +229,10 @@ supportedLayouts baseurl = ] ] where - mkurl k hasher = baseurl P.</> fromOsPath (hasher k) P.</> kf k + mkurl k hasher = baseurl + -- On windows, the hasher uses `\` path separators, + -- but for an url, it needs to use '/'. + -- So, use toInternalGitPath. + P.</> fromOsPath (toInternalGitPath (hasher k)) + P.</> kf k kf k = fromOsPath (keyFile k) diff --git a/doc/bugs/httpalso_windows_URL_errors.mdwn b/doc/bugs/httpalso_windows_URL_errors.mdwn index 90b84f1ed0..f847500672 100644 --- a/doc/bugs/httpalso_windows_URL_errors.mdwn +++ b/doc/bugs/httpalso_windows_URL_errors.mdwn @@ -37,3 +37,5 @@ The dataset https://github.com/courtois-neuromod/algonauts_2025.competitors can Yes, our whole data management relies on git-annex and datalad! Thanks for that amazing tool! 
[[!tag projects/datalad]] + +> [[fixed|done]] --[[Joey]] diff --git a/doc/bugs/httpalso_windows_URL_errors/comment_1_f44b285c7d3b04db73747369793d2ca2._comment b/doc/bugs/httpalso_windows_URL_errors/comment_1_f44b285c7d3b04db73747369793d2ca2._comment new file mode 100644 index 0000000000..335daf778d --- /dev/null +++ b/doc/bugs/httpalso_windows_URL_errors/comment_1_f44b285c7d3b04db73747369793d2ca2._comment @@ -0,0 +1,18 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-26T15:36:35Z" + content=""" +> `xw%5CXV%5C` should be something like `xw/XV` so some conversion of +> the windows backslash to posix might not be working + +That was a good analysis, thanks! + +I see that Remote.HttpAlso.supportedLayouts uses hashDirLower and +hashDirMixed. Which are implemented using OS-native path separators. +So, on Windows, that does come out with the slash the wrong way around. +I don't think that the actual url-encoding of it is problimatic. + +I've put in a workaround. I have not tested on windows, so please re-open +this bug report if you upgrade and find it still somehow doesn't work. +"""]]
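The separator problem described in the comment above can be illustrated with a minimal sketch. It assumes only what the bug report states: the hashed directory comes out with `\` separators on Windows and the url needs `/`. The name `toUrlSeparators` is hypothetical and is not git-annex's `toInternalGitPath`; it only shows the idea of the conversion.

```haskell
-- Hypothetical illustration only; not the code used in Remote/HttpAlso.hs.

-- | Replace Windows-style backslash separators with the forward
-- slashes that a url path needs. Other characters are left alone.
toUrlSeparators :: String -> String
toUrlSeparators = map slash
  where
    slash '\\' = '/'
    slash c = c

main :: IO ()
main = do
  -- What an OS-native hasher might produce on Windows (assumed value,
  -- modelled on the `xw%5CXV%5C` seen in the bug report).
  let hashDir = "xw\\XV\\"
  putStrLn (toUrlSeparators hashDir)
```

On a path that already uses `/`, the function is the identity, so applying it unconditionally would be harmless on other platforms.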
tag as datalad
this is a datalad user
diff --git a/doc/bugs/httpalso_windows_URL_errors.mdwn b/doc/bugs/httpalso_windows_URL_errors.mdwn index 15747fbad0..90b84f1ed0 100644 --- a/doc/bugs/httpalso_windows_URL_errors.mdwn +++ b/doc/bugs/httpalso_windows_URL_errors.mdwn @@ -35,3 +35,5 @@ The dataset https://github.com/courtois-neuromod/algonauts_2025.competitors can Yes, our whole data management relies on git-annex and datalad! Thanks for that amazing tool! + +[[!tag projects/datalad]]
Added a comment
diff --git a/doc/bugs/thawing_directory_-_takes_long_+_logs_twice/comment_2_996f15e5b3c28ae7a28bbe66e8f7bc03._comment b/doc/bugs/thawing_directory_-_takes_long_+_logs_twice/comment_2_996f15e5b3c28ae7a28bbe66e8f7bc03._comment new file mode 100644 index 0000000000..a695fe8984 --- /dev/null +++ b/doc/bugs/thawing_directory_-_takes_long_+_logs_twice/comment_2_996f15e5b3c28ae7a28bbe66e8f7bc03._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="comment 2" + date="2025-03-26T15:31:45Z" + content=""" +> Do you have annex.freezecontent-command / annex.thawcontent-command configured for this repo? + +I do not think so as I did not see anything of that kind in output of `git config -l | grep annex` I included. Or it could have missed that somehow? +"""]]
drop: Avoid redundant object directory thawing.
Sponsored-by: Dartmouth College's DANDI project
diff --git a/Annex/Content.hs b/Annex/Content.hs index f01432669e..15d58e2e26 100644 --- a/Annex/Content.hs +++ b/Annex/Content.hs @@ -368,8 +368,16 @@ lockContentUsing contentlocker key fallback a = withContentLockFile key $ \mlock cleanuplockfile lockfile #endif - cleanuplockfile lockfile = void $ tryNonAsync $ do - thawContentDir lockfile + cleanuplockfile lockfile = + -- Often the content directory will be thawed already, + -- so avoid re-thawing, unless cleanup fails. + tryNonAsync (cleanuplockfile' lockfile) >>= \case + Right () -> return () + Left _ -> void $ tryNonAsync $ do + thawContentDir lockfile + cleanuplockfile' lockfile + + cleanuplockfile' lockfile = do liftIO $ removeWhenExistsWith removeFile lockfile cleanObjectDirs lockfile diff --git a/CHANGELOG b/CHANGELOG index 137ba0d99a..3202d1afe5 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -5,6 +5,7 @@ git-annex (10.20250321) UNRELEASED; urgency=medium the git-annex branch. * Fix build without the assistant. * fsck: Avoid complaining about required content of dead repositories. + * drop: Avoid redundant object directory thawing. -- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/doc/bugs/thawing_directory_-_takes_long_+_logs_twice/comment_1_b31f5a233587b86983188b600ce9cf5e._comment b/doc/bugs/thawing_directory_-_takes_long_+_logs_twice/comment_1_b31f5a233587b86983188b600ce9cf5e._comment new file mode 100644 index 0000000000..f31960117f --- /dev/null +++ b/doc/bugs/thawing_directory_-_takes_long_+_logs_twice/comment_1_b31f5a233587b86983188b600ce9cf5e._comment @@ -0,0 +1,37 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-26T14:45:38Z" + content=""" +Do you have `annex.freezecontent-command` / `annex.thawcontent-command` +configured for this repo? Asking because I know you use it somewhere, +and some expensive command might explain why thawing takes so long. 
+ +As to the content directory perms, `drwxr-sr-x` is weird and the really +weird part of it is the `s`. I don't see how git-annex would ever +set that bit itself. + +Here's what's going on with the repeated thawing: + +* Before it can drop the content, it has to take the drop lock, so thaws + the directory in order to write the lock file. +* After taking the drop lock, it re-freezes. Of course this is not + necessary, but it avoids a complicated error unwind if it is later unable + to drop. +* Then it thaws in order to delete the object file. +* The final thaw is to delete the drop lock file. While the thaw is in + this situation unncessary, since it's left thawed after deleting the + object file, if the drop had failed it would get to this point with the + directory frozen, but would still want to delete the drop lock file, + and so would need to thaw it. + +I have now optimized away that final thaw. I don't think it makes sense to +optimise away the first thaw/freeze cycle, because it would complicate +error unwinding and there is anyway no indication it's slow. + +I don't know though, that the 2 minutes between the 2nd and 3rd thawing +lines are caused by the actual chmod calls. Seems unlikely, unless you do +have a thawcontent-command. It seems more likely that deleting the object +file and whatever else happens at that point is what is taking 2 minutes. +I suggest strace. +"""]]
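The optimisation of the final thaw can be sketched as "try the cleanup first, and only thaw and retry when it fails". The following is a simplified model under stated assumptions: `cleanupAvoidingThaw`, `countThaws`, and the `IORef`-based frozen flag are hypothetical stand-ins, and plain `try` is used where git-annex uses its own `tryNonAsync`.

```haskell
import Control.Exception (SomeException, try)
import Control.Monad (void, when)
import Data.IORef (modifyIORef, newIORef, readIORef, writeIORef)

-- | Run the cleanup action. Only if it throws, thaw and run it once more.
-- Unlike tryNonAsync, this sketch's `try` also catches async exceptions.
cleanupAvoidingThaw :: IO () -> IO () -> IO ()
cleanupAvoidingThaw thaw cleanup = do
  r <- try cleanup :: IO (Either SomeException ())
  case r of
    Right () -> return ()
    Left _ -> void (try (thaw >> cleanup) :: IO (Either SomeException ()))

-- | Simulate two cleanups: one starting frozen, one already thawed.
-- Returns how many times the thaw action ran.
countThaws :: IO Int
countThaws = do
  frozen <- newIORef True
  thaws <- newIORef (0 :: Int)
  let thaw = writeIORef frozen False >> modifyIORef thaws (+ 1)
      cleanup = do
        f <- readIORef frozen
        -- Stand-in for removing the lock file in a frozen directory.
        when f (ioError (userError "directory is frozen"))
  cleanupAvoidingThaw thaw cleanup -- frozen: cleanup fails, thaw, retry
  cleanupAvoidingThaw thaw cleanup -- already thawed: no thaw needed
  readIORef thaws

main :: IO ()
main = countThaws >>= print
```

In this model the thaw runs once across both cleanups, which is the saving described above: the common, already-thawed case skips the thaw entirely.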
typo
diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index dd1a862db4..03204045a6 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -1464,7 +1464,7 @@ repository, using [[git-annex-config]]. See its man page for a list.) from being modified or deleted. Freezecontent is run after git-annex has removed (or attempted to remove) the write bit, and can be used to prevent writing in some other way. - Tawcontent should undo its effect, and is run before + Thawcontent should undo its effect, and is run before git-annex restores the write bit. In the command line, %path is replaced with the file or directory to
comment
diff --git a/doc/forum/filter-process__58___git-annex_command_not_found/comment_1_298626aa8691f85bc489c5d3c5bd56ef._comment b/doc/forum/filter-process__58___git-annex_command_not_found/comment_1_298626aa8691f85bc489c5d3c5bd56ef._comment new file mode 100644 index 0000000000..e49418e628 --- /dev/null +++ b/doc/forum/filter-process__58___git-annex_command_not_found/comment_1_298626aa8691f85bc489c5d3c5bd56ef._comment @@ -0,0 +1,32 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-26T14:32:46Z" + content=""" +This looks like the problem is happening in the remote repository, +and not in the local repisotry. (Which I think is the one whose config you +showed as the "outbound config"). + +It looks like a `git push` is being run to that remote. And then it seems +like git on that remote is trying to run `git-annex filter-process`. + +That would not usually happen because pushing to a remote does not +update that remote's working tree. Which is what git must be doing when it +runs that command. + +So, I think you likely have `receive.denyCurrentBranch` set to +`updateInstead` in the remote repository. That will make git update the +working tree. And having that configuration for the one repository would +explain why you only see the problem for that one. + +As to why `git-annex` is not in PATH when git tries to run it there, that +probably comes down to how you have git-annex installed on that system. You +showed that an interactive shell does have `git-annex` in PATH when you ssh +in. But sometimes the shell configuration differs between interactive and +noninteractive shells. See [[tips/get_git-annex-shell_into_PATH]] (which is +about git-annex-shell, but also applies to git-annex). + +BTW `realpath` is a unix command, and not one that git-annex uses. This is +probably something in a configuration file like ~/.bashrc that is run when +you ssh in to that host. +"""]]
fsck: Avoid complaining about required content of dead repositories
requiredContentMap does not exclude dead repos. Usually this is not a
problem because it is used when we are operating on a repository, and in
that case, the repository is not dead (or if it is, the required content
configurations should still be used). But in the case of fsck, this caused an
old required content config for a dead repository to be warned about in a
situation where it is not a problem.
diff --git a/CHANGELOG b/CHANGELOG index eea3df362a..137ba0d99a 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -4,6 +4,7 @@ git-annex (10.20250321) UNRELEASED; urgency=medium configured as annex-cluster-node, warn and avoid writing bad data to the git-annex branch. * Fix build without the assistant. + * fsck: Avoid complaining about required content of dead repositories. -- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/Command/Fsck.hs b/Command/Fsck.hs index a6b6e54875..50c21149b3 100644 --- a/Command/Fsck.hs +++ b/Command/Fsck.hs @@ -375,11 +375,13 @@ verifyRequiredContent key ai@(ActionItemAssociatedFile afile _) = case afile of -- Can't be checked if there's no associated file. AssociatedFile Nothing -> return True AssociatedFile (Just _) -> do - requiredlocs <- S.fromList . M.keys <$> requiredContentMap - if S.null requiredlocs + requiredlocs <- filterM notdead =<< (M.keys <$> requiredContentMap) + if null requiredlocs then return True - else go requiredlocs + else go (S.fromList requiredlocs) where + notdead u = (/=) DeadTrusted <$> lookupTrust u + go requiredlocs = do presentlocs <- S.fromList <$> loggedLocations key missinglocs <- filterM diff --git a/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn b/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn index c73a886d46..62578ad8e9 100644 --- a/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn +++ b/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn @@ -37,3 +37,5 @@ local repository version: 10 ### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) Best invention since sliced bread. + +> Good catch! [[fixed|done]] --[[Joey]]
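The effect of the `notdead` filter can be shown with a small, pure model. Everything here is a simplified assumption for illustration: the `TrustLevel` type, the `String` UUIDs, the default of `SemiTrusted` for unknown repositories, and the maps standing in for `requiredContentMap` and `lookupTrust` are not git-annex's real definitions.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

-- Simplified stand-ins for git-annex's types (assumed for this sketch).
data TrustLevel = Trusted | SemiTrusted | UnTrusted | DeadTrusted
  deriving (Eq, Show)

type UUID = String

-- | Repositories with a required content setting, minus the dead ones.
-- Repositories missing from the trust map are assumed SemiTrusted here.
liveRequiredLocs :: M.Map UUID TrustLevel -> M.Map UUID String -> S.Set UUID
liveRequiredLocs trust required =
  S.fromList (filter notDead (M.keys required))
  where
    notDead u = M.findWithDefault SemiTrusted u trust /= DeadTrusted

main :: IO ()
main = do
  let trust = M.fromList [("repo-a", Trusted), ("repo-b", DeadTrusted)]
      required = M.fromList [("repo-a", "include=*"), ("repo-b", "include=*")]
  -- Only repo-a remains, so fsck would have nothing to say about repo-b.
  print (S.toList (liveRequiredLocs trust required))
```

When every repository with a required content setting is dead, the result is empty, matching the `if null requiredlocs then return True` branch in the diff.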
comment
diff --git a/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__/comment_1_8137d887bd81e6841d449e0b68348395._comment b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__/comment_1_8137d887bd81e6841d449e0b68348395._comment new file mode 100644 index 0000000000..ffab739480 --- /dev/null +++ b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__/comment_1_8137d887bd81e6841d449e0b68348395._comment @@ -0,0 +1,13 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-26T14:18:02Z" + content=""" +No, interrupting will never lose previous incremental progress. In fact, if +you interrupt `--more` and run the same command again, it will skip all +files that were recorded as checked already by the first `--more`. + +I suspect you were reading a previous version of the man page, which +discussed being interrupted under `--incremental`. That wording was +recently adjusted. +"""]]
Fix build without the assistant.
diff --git a/CHANGELOG b/CHANGELOG index 624d23dfd6..eea3df362a 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -3,6 +3,7 @@ git-annex (10.20250321) UNRELEASED; urgency=medium * updatecluster, updateproxy: When a remote that has no annex-uuid is configured as annex-cluster-node, warn and avoid writing bad data to the git-annex branch. + * Fix build without the assistant. -- Joey Hess <id@joeyh.name> Fri, 21 Mar 2025 12:27:11 -0400 diff --git a/Utility/DirWatcher.hs b/Utility/DirWatcher.hs index d7573d7475..f6ec79e6aa 100644 --- a/Utility/DirWatcher.hs +++ b/Utility/DirWatcher.hs @@ -139,7 +139,7 @@ watchDir dir prune scanevents hooks runstartup = runstartup $ Win32Notify.watchDir dir prune scanevents hooks #else type DirWatcherHandle = () -watchDir :: FilePath -> Pruner -> Bool -> WatchHooks -> (IO () -> IO ()) -> IO DirWatcherHandle +watchDir :: OsPath -> Pruner -> Bool -> WatchHooks -> (IO () -> IO ()) -> IO DirWatcherHandle watchDir = error "watchDir not defined" #endif #endif diff --git a/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn b/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn index 6b3837aed5..641a71ca54 100644 --- a/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn +++ b/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn @@ -60,3 +60,5 @@ of the setup in case they turn out to be relevant: - `--ghc-options="-j$n +RTS -A256m -RTS -split-sections -optc-Os -optl=-pthread"` Full setup is at <https://git.kyleam.com/static-annex/>. + +> [[fixed|done]] --[[Joey]]
interlink 2 related bugs
diff --git a/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options/comment_1_d86aa82f503b0b314da614ba52237187._comment b/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options/comment_1_d86aa82f503b0b314da614ba52237187._comment new file mode 100644 index 0000000000..ecb6b4fc22 --- /dev/null +++ b/doc/bugs/can__39__t_pass_spaces_in_youtube-dl-options/comment_1_d86aa82f503b0b314da614ba52237187._comment @@ -0,0 +1,11 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-26T14:11:15Z" + content=""" +It breaks the list of options into words, which is why that can't work. + +Other git configs for options also have this limitation. See also +eg +<https://git-annex.branchable.com/bugs/cannot___40__or_how__63____41___to_pass_socket_path_with_a_space_in_its_path_via_annex-ssh-options/> +"""]] diff --git a/doc/bugs/cannot___40__or_how__63____41___to_pass_socket_path_with_a_space_in_its_path_via_annex-ssh-options/comment_3_6349910b58d22b4d7061834f6190ee3a._comment b/doc/bugs/cannot___40__or_how__63____41___to_pass_socket_path_with_a_space_in_its_path_via_annex-ssh-options/comment_3_6349910b58d22b4d7061834f6190ee3a._comment new file mode 100644 index 0000000000..a6160969c8 --- /dev/null +++ b/doc/bugs/cannot___40__or_how__63____41___to_pass_socket_path_with_a_space_in_its_path_via_annex-ssh-options/comment_3_6349910b58d22b4d7061834f6190ee3a._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 3""" + date="2025-03-26T14:14:11Z" + content=""" +Similar problem reported about a different setting: +<https://git-annex.branchable.com/bugs/can__39__t_pass_spaces_in_youtube-dl-options/> +"""]]
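Why a value containing a space cannot survive can be seen in a tiny sketch. It assumes the splitting behaves like Prelude's `words`, which is an assumption made for illustration; the comment above only says the option list is broken into words. The name `splitOptions` and the socket path are hypothetical.

```haskell
-- | Split a configured option string on whitespace (assumed behaviour).
splitOptions :: String -> [String]
splitOptions = words

main :: IO ()
main =
  -- The single intended argument "ControlPath=/tmp/my dir/sock" is cut
  -- in two at the space, so the command receives three arguments.
  mapM_ putStrLn (splitOptions "-o ControlPath=/tmp/my dir/sock")
```

Quoting inside the config value would not help under this model, since the quote characters would simply end up inside the individual words.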
Added a comment
diff --git a/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_3_7e9abea365fd75776bdc6ea30ffb32ac._comment b/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_3_7e9abea365fd75776bdc6ea30ffb32ac._comment new file mode 100644 index 0000000000..157ceedf2e --- /dev/null +++ b/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_3_7e9abea365fd75776bdc6ea30ffb32ac._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="jnkl" + avatar="http://cdn.libravatar.org/avatar/2ab576f3bf2e0d96b1ee935bb7f33dbe" + subject="comment 3" + date="2025-03-25T07:44:00Z" + content=""" +I think you are right. Only unlocked files. +"""]]
10.20250320 build error when assistant is disabled
diff --git a/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn b/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn new file mode 100644 index 0000000000..6b3837aed5 --- /dev/null +++ b/doc/bugs/10.20250320_build_error_when_assistant_is_disabled.mdwn @@ -0,0 +1,62 @@ +When building git-annex 10.20250320 under `-f-assistant`, I hit into +the following error: + +``` +[363 of 647] Compiling Annex.ChangedRefs ( Annex/ChangedRefs.hs, /git-annex/dist-newstyle/build/x86_64-linux/ghc-9.8.4/git-annex-10.20250320/build/git-annex/git-annex-tmp/Annex/ChangedRefs.o, /git-annex/dist-newstyle/build/x86_64-linux/ghc-9.8.4/git-annex-10.20250320/build/git-annex/git-annex-tmp/Annex/ChangedRefs.dyn_o ) + +Annex/ChangedRefs.hs:96:48: error: [GHC-83865] + • Couldn't match type ‘OS.ByteString’ with ‘[Char]’ + Expected: FilePath + Actual: System.Posix.ByteString.FilePath.RawFilePath + • In the first argument of ‘watchDir’, namely ‘refdir’ + In the second argument of ‘($)’, namely + ‘watchDir refdir (const False) True hooks id’ + In a stmt of a 'do' block: + h <- liftIO $ watchDir refdir (const False) True hooks id + | +96 | h <- liftIO $ watchDir refdir + | ^^^^^^ +``` + +If I drop the `-f-assistant`, the build completes successfully. + +Alternatively, I was able to keep `-f-assistant` by un-indenting this +block from git-annex.cabal one level: + +``` + if os(linux) + Build-Depends: hinotify (>= 0.3.10) + CPP-Options: -DWITH_INOTIFY + Other-Modules: Utility.DirWatcher.INotify + else + if os(darwin) + Build-Depends: hfsevents + CPP-Options: -DWITH_FSEVENTS + Other-Modules: + Utility.DirWatcher.FSEvents + else + if os(windows) + Build-Depends: Win32-notify + CPP-Options: -DWITH_WIN32NOTIFY + Other-Modules: Utility.DirWatcher.Win32Notify + else + if (! os(solaris) && ! os(gnu) && ! 
os(linux)) + CPP-Options: -DWITH_KQUEUE + C-Sources: Utility/libkqueue.c + Includes: Utility/libkqueue.h + Other-Modules: Utility.DirWatcher.Kqueue +``` + +I suspect that this is a general issue that anyone compiling without +the assistant feature enabled would encounter. Here are other details +of the setup in case they turn out to be relevant: + + * Alpine Linux v3.21 image [*] with GHC 9.8.4 + (<https://gitlab.com/benz0li/ghc-musl>) + + * `cabal build` called with these flags: + + - `--enable-executable-static` + - `--ghc-options="-j$n +RTS -A256m -RTS -split-sections -optc-Os -optl=-pthread"` + + Full setup is at <https://git.kyleam.com/static-annex/>.
Added a comment: only for unlocked files?
diff --git a/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_2_352d8910c4f31c52731adb8b25066c41._comment b/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_2_352d8910c4f31c52731adb8b25066c41._comment new file mode 100644 index 0000000000..d0bb5aa4d5 --- /dev/null +++ b/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_2_352d8910c4f31c52731adb8b25066c41._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="nobodyinperson" + avatar="http://cdn.libravatar.org/avatar/736a41cd4988ede057bae805d000f4f5" + subject="only for unlocked files?" + date="2025-03-24T18:37:31Z" + content=""" +I also see this very frequently in our shared family documents folder. Several people having the assistant running. I have a feeling it only happens for unlocked files, suggesting git clean/smudge filter nonsense is causing this. Looks like your files are also unlocked. Do you see any of this with locked files? +"""]]
removed
diff --git a/doc/special_remotes/adb/comment_9_b1ae4cec6fa01ada5bc88b92228f848b._comment b/doc/special_remotes/adb/comment_9_b1ae4cec6fa01ada5bc88b92228f848b._comment deleted file mode 100644 index 63c1e1b173..0000000000 --- a/doc/special_remotes/adb/comment_9_b1ae4cec6fa01ada5bc88b92228f848b._comment +++ /dev/null @@ -1,11 +0,0 @@ -[[!comment format=mdwn - username="octvs@17a99a7aaeb0c0e0a2375e14807b138740ba34e9" - nickname="octvs" - avatar="http://cdn.libravatar.org/avatar/e31212348b6392eb67daddf78fadfb1b" - subject="failing with `Operation not petmitted`" - date="2025-03-24T17:56:46Z" - content=""" -I also found the answer on my own. - -The trick was hidden under my redaction. The filenames had special characters which are problematic under Android. Removing them solved the issue. -"""]]
Added a comment: failing with `Operation not petmitted`
diff --git a/doc/special_remotes/adb/comment_9_b1ae4cec6fa01ada5bc88b92228f848b._comment b/doc/special_remotes/adb/comment_9_b1ae4cec6fa01ada5bc88b92228f848b._comment new file mode 100644 index 0000000000..63c1e1b173 --- /dev/null +++ b/doc/special_remotes/adb/comment_9_b1ae4cec6fa01ada5bc88b92228f848b._comment @@ -0,0 +1,11 @@ +[[!comment format=mdwn + username="octvs@17a99a7aaeb0c0e0a2375e14807b138740ba34e9" + nickname="octvs" + avatar="http://cdn.libravatar.org/avatar/e31212348b6392eb67daddf78fadfb1b" + subject="failing with `Operation not petmitted`" + date="2025-03-24T17:56:46Z" + content=""" +I also found the answer on my own. + +The trick was hidden under my redaction. The filenames had special characters which are problematic under Android. Removing them solved the issue. +"""]]
Added a comment: failing with `Operation not petmitted`
diff --git a/doc/special_remotes/adb/comment_8_6082e8893acd99088f56ce06eda8b010._comment b/doc/special_remotes/adb/comment_8_6082e8893acd99088f56ce06eda8b010._comment new file mode 100644 index 0000000000..696d418b2d --- /dev/null +++ b/doc/special_remotes/adb/comment_8_6082e8893acd99088f56ce06eda8b010._comment @@ -0,0 +1,11 @@ +[[!comment format=mdwn + username="octvs@17a99a7aaeb0c0e0a2375e14807b138740ba34e9" + nickname="octvs" + avatar="http://cdn.libravatar.org/avatar/e31212348b6392eb67daddf78fadfb1b" + subject="failing with `Operation not petmitted`" + date="2025-03-24T17:56:39Z" + content=""" +I also found the answer on my own. + +The trick was hidden under my redaction. The filenames had special characters which are problematic under Android. Removing them solved the issue. +"""]]
Added a comment
diff --git a/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_1_d794ccefb4e288b3cfd036d47931009f._comment b/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_1_d794ccefb4e288b3cfd036d47931009f._comment new file mode 100644 index 0000000000..f1781da9ae --- /dev/null +++ b/doc/forum/git_assistant_creates_random_antagonistic_commits/comment_1_d794ccefb4e288b3cfd036d47931009f._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="jnkl" + avatar="http://cdn.libravatar.org/avatar/2ab576f3bf2e0d96b1ee935bb7f33dbe" + subject="comment 1" + date="2025-03-23T11:20:31Z" + content=""" +I looked at the timestamps and these commits could be the result of me doing something (I worked at these times), I don't know what I did to these files though. At least they didn't change (checksum din't change?). +"""]]
diff --git a/doc/forum/git_assistant_creates_random_antagonistic_commits.mdwn b/doc/forum/git_assistant_creates_random_antagonistic_commits.mdwn new file mode 100644 index 0000000000..04b050d8bd --- /dev/null +++ b/doc/forum/git_assistant_creates_random_antagonistic_commits.mdwn @@ -0,0 +1,23 @@ +I have a fleet of some headless systems and some desktop machines. In the git tree there are somethimes commits like these. Time between the two commits is always more or less 10 seconds. + + git diff c69f4ffee8d288174abe34db03e572c04bc8ba8b 5aceafda84e2d737a97f1e6b8f647adc6b787789 + diff --git a/Dokumente/Gesamtrechnung.pdf b/Dokumente/Gesamtrechnung.pdf + deleted file mode 100644 + index ace4b282de7..00000000000 + --- a/Dokumente/Gesamtrechnung.pdf + +++ /dev/null + @@ -1 +0,0 @@ + -/annex/objects/SHA256E-s22050--e1551117820028dc66e8c4d2e834857664535a4297ce9649a5cf40f7b9fafc66.pdf + + git diff 5aceafda84e2d737a97f1e6b8f647adc6b787789 d3c6388f50e06a5dfae94cd2431c6b3827cee05a + diff --git a/Dokumente/Gesamtrechnung.pdf b/Dokumente/Gesamtrechnung.pdf + new file mode 100644 + index 00000000000..ace4b282de7 + --- /dev/null + +++ b/Dokumente/Gesamtrechnung.pdf + @@ -0,0 +1 @@ + +/annex/objects/SHA256E-s22050--e1551117820028dc66e8c4d2e834857664535a4297ce9649a5cf40f7b9fafc66.pdf + +Random files (one at a time) on random machines at random times are deleted and instantly recommitted. At least I think it happens randomly. + +What happens here?
todo
diff --git a/doc/todo/cluster_preferred_content_parallel_evaluation_issue_with_archive_group.mdwn b/doc/todo/cluster_preferred_content_parallel_evaluation_issue_with_archive_group.mdwn new file mode 100644 index 0000000000..4d29f2e5ba --- /dev/null +++ b/doc/todo/cluster_preferred_content_parallel_evaluation_issue_with_archive_group.mdwn @@ -0,0 +1,24 @@ +I tried a cluster where each node was in the archive group. Sending to the +cluster caused the file the end up on multiple nodes, though the preferred +content should have allowed it to be stored on only one. + +This is because a cluster checks preferred content for each node and sends +to all nodes that want it. Which works fine when using balanced preferred +content expressions, but for archive, they all want it until 1 has it. + +So to support archive better, after finding a node that wants the content, +when checking the second node it would need to check its preferred content +under the assumption that the first node already contains the content. And +so on. Currently this is not supported when checking preferred content, but +something similar is done when dropping, with a set of repos to assume +don't contain the content any longer. + +(Oddly, in my case, it seemed to always end up on 2 nodes out of 4, I don't +know why it didn't also get sent to the other 2.) + +(Not considering this a bug, because cluster was designed to be used with +balanced preferred content, which will probably work better in many ways. +Still it would be good to support this, especially for when existing +archive repositories get put in a cluster.) + +--[[Joey]]
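The difference between the current per-node check and the sequential check proposed in the todo item above can be sketched in a small pure model. This is only an illustration, not git-annex code: `Node`, `Wants`, `archiveWants`, `selectParallel` and `selectSequential` are hypothetical names, and `archiveWants` is a rough approximation of the archive group's "wants it until one node has it" behavior.

```haskell
import qualified Data.Set as S

-- Hypothetical stand-in for a cluster node; not a git-annex type.
type Node = String

-- A preferred content check that is told which nodes are assumed
-- to already contain the content.
type Wants = S.Set Node -> Node -> Bool

-- Rough approximation of the archive group: a node wants the content
-- only while no other archive node is assumed to have it.
archiveWants :: Wants
archiveWants have node = S.null (S.delete node have)

-- Behavior described in the todo item: every node is checked
-- independently, with nothing assumed to be present.
selectParallel :: Wants -> [Node] -> [Node]
selectParallel wants = filter (wants S.empty)

-- Proposed behavior: check nodes one at a time, assuming that the
-- nodes selected so far already contain the content.
selectSequential :: Wants -> [Node] -> [Node]
selectSequential wants = reverse . snd . foldl go (S.empty, [])
  where
    go (have, chosen) node
      | wants have node = (S.insert node have, node : chosen)
      | otherwise = (have, chosen)
```

In this model, `selectParallel archiveWants` selects every node of a four-node cluster, while `selectSequential archiveWants` selects only the first. It does not explain the "2 nodes out of 4" observation mentioned in the todo item.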
add news item for git-annex 10.20250320
diff --git a/doc/news/version_10.20240927.mdwn b/doc/news/version_10.20240927.mdwn deleted file mode 100644 index 7440ded826..0000000000 --- a/doc/news/version_10.20240927.mdwn +++ /dev/null @@ -1,12 +0,0 @@ -git-annex 10.20240927 released with [[!toggle text="these changes"]] -[[!toggleable text=""" * Detect when a preferred content expression contains "not present", - which would lead to repeatedly getting and then dropping files, - and make it never match. This also applies to - "not balanced" and "not sizebalanced". - * Fix --explain display of onlyingroup preferred content expression. - * Allow maxsize to be set to 0 to stop checking maxsize for a repository. - * Fix bug that prevented anything being stored in an empty - repository whose preferred content expression uses sizebalanced. - * sim: New command, can be used to simulate networks of repositories - and see how preferred content and other configuration makes file - content flow through it."""]] \ No newline at end of file diff --git a/doc/news/version_10.20250320.mdwn b/doc/news/version_10.20250320.mdwn new file mode 100644 index 0000000000..17bb165883 --- /dev/null +++ b/doc/news/version_10.20250320.mdwn @@ -0,0 +1,14 @@ +git-annex 10.20250320 released with [[!toggle text="these changes"]] +[[!toggleable text=""" * Added the compute special remote. + * addcomputed: New command, adds a file that is generated by a compute + special remote. + * recompute: New command, recomputes computed files. + * findcomputed: New command, displays information about computed files. + * Support help.autocorrect settings "prompt", "never", and "immediate". + * Allow setting remote.foo.annex-tracking-branch to a branch name + that contains "/", as long as it's not a remote tracking branch. + * Added OsPath build flag, which speeds up git-annex's operations on files. + * git-lfs: Added an optional apiurl parameter. + (This needs version 1.2.5 of the haskell git-lfs library to be used.) 
+ * fsck: Remember the files that are checked, so a later run with --more + will skip them, without needing to use --incremental."""]] \ No newline at end of file
findcomputed --inputs
Useful for eg, generating dependency graphs.
diff --git a/Command/FindComputed.hs b/Command/FindComputed.hs index d9064b2390..83f9a3cddd 100644 --- a/Command/FindComputed.hs +++ b/Command/FindComputed.hs @@ -17,6 +17,10 @@ import Command.Find (showFormatted, formatVars) import Remote.Compute (isComputeRemote, getComputeState, ComputeState(..)) import qualified Remote import qualified Types.Remote as Remote +import Database.Keys +import Annex.CatFile + +import qualified Data.Map as M cmd :: Command cmd = withAnnexOptions [annexedMatchingOptions] $ noCommit $ noMessages $ @@ -28,6 +32,7 @@ data FindComputedOptions = FindComputedOptions { findThese :: CmdParams , formatOption :: Maybe Utility.Format.Format , keyOptions :: Maybe KeyOptions + , inputsOption :: Bool } optParser :: CmdParamsDesc -> Parser FindComputedOptions @@ -35,6 +40,10 @@ optParser desc = FindComputedOptions <$> cmdParams desc <*> optional parseFormatOption <*> optional parseBranchKeysOption + <*> switch + ( long "inputs" + <> help "display input files" + ) parseFormatOption :: Parser Utility.Format.Format parseFormatOption = @@ -69,22 +78,51 @@ start o isterminal computeremotes _ file key = do if null rcs then stop else startingCustomOutput key $ do - forM_ rcs $ \(r, c) -> do - let computation = unwords (computeParams c) - let unformatted = fromOsPath file - <> " (" <> encodeBS (Remote.name r) - <> ") -- " - <> encodeBS computation - let formatvars = - [ ("remote", Remote.name r) - , ("computation", computation) - ] ++ formatVars key (AssociatedFile (Just file)) - showFormatted isterminal (formatOption o) - unformatted formatvars + forM_ rcs display next $ return True where get r = fmap (r, ) <$> getComputeState (Remote.remoteStateHandle r) key + + showformatted = showFormatted isterminal (formatOption o) + + unformatted r computation = fromOsPath file + <> " (" <> encodeBS (Remote.name r) + <> ") -- " + <> encodeBS computation + + unformattedinputs (Right inputfile) = fromOsPath file + <> " " <> fromOsPath inputfile + unformattedinputs (Left 
inputkey) = fromOsPath file + <> " " <> serializeKey' inputkey + + display (r, c) = do + let computation = unwords (computeParams c) + let formatvars = + [ ("remote", Remote.name r) + , ("computation", computation) + ] ++ formatVars key (AssociatedFile (Just file)) + if inputsOption o + then forM_ (M.elems $ computeInputs c) $ \inputkey -> do + input <- maybe (Left inputkey) Right + <$> getassociated inputkey + showformatted (unformattedinputs input) $ + [ ("input", either serializeKey fromOsPath input) + , ("inputkey", serializeKey inputkey) + , ("inputfile", either (const "") fromOsPath input) + ] ++ formatvars + else showformatted (unformatted r computation) formatvars + + getassociated inputkey = + getAssociatedFiles inputkey + >>= mapM (fromRepo . fromTopFilePath) + >>= firstM (stillassociated inputkey) + + -- Some associated files that are in the keys database may no + -- longer correspond to files in the repository. + stillassociated k f = catKeyFile f >>= return . \case + Just k' | k' == k -> True + _ -> False startKeys :: FindComputedOptions -> IsTerminal -> [Remote] -> (SeekInput, Key, ActionItem) -> CommandStart startKeys o isterminal computeremotes (si, key, ActionItemBranchFilePath (BranchFilePath _ topf) _) = diff --git a/Command/WhereUsed.hs b/Command/WhereUsed.hs index bfe49d1a73..1a7e7033d8 100644 --- a/Command/WhereUsed.hs +++ b/Command/WhereUsed.hs @@ -70,9 +70,9 @@ start o (_, key, _) = startingCustomOutput key $ do where -- Some associated files that are in the keys database may no -- longer correspond to files in the repository. - stillassociated f = catKeyFile f >>= \case - Just k | k == key -> return True - _ -> return False + stillassociated f = catKeyFile f >>= return . 
\case + Just k | k == key -> True + _ -> False display :: Key -> StringContainingQuotedPath -> Annex () display key loc = do diff --git a/doc/git-annex-findcomputed.mdwn b/doc/git-annex-findcomputed.mdwn index 8e1bafe7d0..aa3ae07db1 100644 --- a/doc/git-annex-findcomputed.mdwn +++ b/doc/git-annex-findcomputed.mdwn @@ -32,18 +32,44 @@ For example: List computed files in the specified branch or treeish. +* `--inputs` + + Display each computed file followed by the input that is used to + produce it. The current location of the input file in the work tree is + displayed, but if the input file is not in the work tree, the key + is displayed instead. + + For example: + + foo.jpeg file.raw + bar.gz bar + + When multiple input files are needed to compute a file, outputs multiple + lines for that file: + + foo bar + foo baz + * `--format=value` Use custom output formatting. This option works the same as in [[git-annex-find]](1), with these additional variables available for use in it: - remote, computation + "${remote}", "${computation}" The default output format is the same as `--format='${file} (${remote}) -- ${computation}\\n'`, except when outputting to a terminal, control characters will be escaped. + When `--inputs` is used, there are additional variables "${inputfile}" + which is the input filename, "${inputkey}" which is the input key, + and "${input}" which is either the filename or the key. + The default output format for `--inputs` + is the same as `--format='${file} ${input}\\n'` + To separate the pair of files by nulls instead, use eg + `--format='${file}\\000${input}\\n' + * `--json` Output the list of files in JSON format.
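As a sketch of the dependency graph use case mentioned in the commit message, the following turns `findcomputed --inputs` output into a Graphviz digraph. It assumes the output was produced with the null-separated `--format` described in the documentation above, so that each line has the form `file<NUL>input`. `parsePair` and `toDot` are hypothetical helper names, and using `show` for quoting is a simplification that is only adequate for plain ASCII filenames.

```haskell
-- Parse one output line of the assumed form "file\0input".
-- Lines without a NUL separator are skipped.
parsePair :: String -> Maybe (String, String)
parsePair l = case break (== '\0') l of
    (file, _ : input) -> Just (file, input)
    _ -> Nothing

-- Render the pairs as a Graphviz digraph, with an edge from each
-- input to the file computed from it.
toDot :: String -> String
toDot out = unlines $
    ["digraph computed {"]
    ++ [ "  " ++ show input ++ " -> " ++ show file ++ ";"
       | Just (file, input) <- map parsePair (lines out)
       ]
    ++ ["}"]
```

A file with several inputs simply yields several edges pointing at the same node, matching the multiple-lines-per-file output described above.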
checkPresent of compute remote checks inputs are available
If an input file has been lost from all repositories, it is no longer
possible to compute the output. This will avoid dropping content that
was computed in such a situation, as well as making git-annex fsck --from
the compute remote do its usual thing when content has gone missing.
This implementation avoids recursing forever if there is a cycle,
which should not be possible anyway.
Note the use of RemoteStateHandle as a constructor here suggests that
this may not handle sameas remotes right, since usually a
RemoteStateHandle is constructed using the sameas uuid for a sameas
remote. That assumes a compute remote can even have or be a sameas remote.
Which doesn't seem to make sense, so I have not thought through what might
happen here in detail.
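The recursive availability check with cycle avoidance described above can be modeled in pure code roughly as follows. This is a simplified sketch, not the implementation in Remote/Compute.hs: `Model`, `computations`, `storedKeys`, `computable` and `available` are hypothetical names, and dead keys, trust levels and per-remote state are all left out.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Key = String

-- Hypothetical, simplified model of the state that is consulted.
data Model = Model
    { computations :: M.Map Key [[Key]]
      -- ^ for each computed key, the input keys of each recorded computation
    , storedKeys :: S.Set Key
      -- ^ keys present in some non-dead, non-compute repository
    }

-- A key can be computed when at least one recorded computation has
-- all of its inputs available. The list of keys already being
-- examined guards against recursing forever on a cycle.
computable :: Model -> [Key] -> Key -> Bool
computable m seen k
    | k `elem` seen = False -- avoid cycles
    | otherwise = any (all (available m (k : seen)))
        (M.findWithDefault [] k (computations m))

-- An input is available when it is stored in a regular repository,
-- or when it can itself be computed from available inputs.
available :: Model -> [Key] -> Key -> Bool
available m seen k = k `S.member` storedKeys m || computable m seen k
```

With this model, a chain such as out ← mid ← raw is computable only while raw is stored somewhere, and two keys that list each other as inputs are reported as not computable rather than looping.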
diff --git a/Logs/Trust.hs b/Logs/Trust.hs index f2066ba29e..f7a705f7de 100644 --- a/Logs/Trust.hs +++ b/Logs/Trust.hs @@ -1,6 +1,6 @@ {- git-annex trust log - - - Copyright 2010-2022 Joey Hess <id@joeyh.name> + - Copyright 2010-2025 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. -} @@ -18,17 +18,15 @@ module Logs.Trust ( trustMapLoad, ) where -import qualified Data.Map as M -import Data.Default - import Annex.Common import Types.TrustLevel import qualified Annex import Logs import Remote.List -import qualified Types.Remote import Logs.Trust.Basic as X +import qualified Data.Map as M + {- Returns a list of UUIDs that the trustLog indicates have the - specified trust level. - Note that the list can be incomplete for SemiTrusted, since that's @@ -67,20 +65,4 @@ trustMap = maybe trustMapLoad return =<< Annex.getState Annex.trustmap {- Loads the map, updating the cache, -} trustMapLoad :: Annex TrustMap -trustMapLoad = do - forceoverrides <- Annex.getState Annex.forcetrust - l <- remoteList - let untrustoverrides = M.fromList $ - map (\r -> (Types.Remote.uuid r, UnTrusted)) - (filter Types.Remote.untrustworthy l) - logged <- trustMapRaw - let configured = M.fromList $ mapMaybe configuredtrust l - let m = M.unionWith min untrustoverrides $ - M.union forceoverrides $ - M.union configured logged - Annex.changeState $ \s -> s { Annex.trustmap = Just m } - return m - where - configuredtrust r = (\l -> Just (Types.Remote.uuid r, l)) - =<< readTrustLevel - =<< remoteAnnexTrustLevel (Types.Remote.gitconfig r) +trustMapLoad = trustMapLoad' =<< remoteList diff --git a/Logs/Trust/Basic.hs b/Logs/Trust/Basic.hs index 85e25ed20d..b05c072927 100644 --- a/Logs/Trust/Basic.hs +++ b/Logs/Trust/Basic.hs @@ -1,6 +1,6 @@ {- git-annex trust log, basics - - - Copyright 2010-2012 Joey Hess <id@joeyh.name> + - Copyright 2010-2025 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. 
-} @@ -9,16 +9,20 @@ module Logs.Trust.Basic ( module X, trustSet, trustMapRaw, + trustMapLoad', ) where import Annex.Common import Types.TrustLevel import qualified Annex.Branch import qualified Annex +import qualified Types.Remote import Logs import Logs.UUIDBased import Logs.Trust.Pure as X +import qualified Data.Map as M + {- Changes the trust level for a uuid in the trustLog. -} trustSet :: UUID -> TrustLevel -> Annex () trustSet uuid@(UUID _) level = do @@ -34,3 +38,21 @@ trustSet NoUUID _ = error "unknown UUID; cannot modify" - log file. -} trustMapRaw :: Annex TrustMap trustMapRaw = calcTrustMap <$> Annex.Branch.get trustLog + +trustMapLoad' :: [Remote] -> Annex TrustMap +trustMapLoad' l = do + forceoverrides <- Annex.getState Annex.forcetrust + let untrustoverrides = M.fromList $ + map (\r -> (Types.Remote.uuid r, UnTrusted)) + (filter Types.Remote.untrustworthy l) + logged <- trustMapRaw + let configured = M.fromList $ mapMaybe configuredtrust l + let m = M.unionWith min untrustoverrides $ + M.union forceoverrides $ + M.union configured logged + Annex.changeState $ \s -> s { Annex.trustmap = Just m } + return m + where + configuredtrust r = (\lvl -> Just (Types.Remote.uuid r, lvl)) + =<< readTrustLevel + =<< remoteAnnexTrustLevel (Types.Remote.gitconfig r) diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 2ef7844808..792105a1b8 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -29,6 +29,8 @@ import Types.Remote import Types.ProposedAccepted import Types.MetaData import Types.Creds +import Types.TrustLevel +import Types.RemoteState import Config import Config.Cost import Remote.Helper.Special @@ -45,6 +47,8 @@ import qualified Annex.Transfer import Logs.MetaData import Logs.EquivilantKeys import Logs.Location +import Logs.Trust.Basic +import Logs.Remote import Messages.Progress import Utility.Metered import Utility.TimeStamp @@ -88,6 +92,11 @@ remote = RemoteType isComputeRemote :: Remote -> Bool isComputeRemote r = typename (remotetype r) 
== typename remote +isComputeRemote' :: RemoteConfig -> Bool +isComputeRemote' rc = case M.lookup typeField rc of + Nothing -> False + Just t -> fromProposedAccepted t == typename remote + gen :: Git.Repo -> UUID -> RemoteConfig -> RemoteGitConfig -> RemoteStateHandle -> Annex (Maybe Remote) gen r u rc gc rs = case getComputeProgram' rc of Left _err -> return Nothing @@ -788,11 +797,40 @@ avoidCycles outputkeys inputkey = filterM go rs' <- avoidCycles (inputkey:outputkeys) inputkey' rs return (rs' == rs) --- Make sure that the compute state exists. +-- Make sure that the compute state exists, and that the input keys are +-- still available (are not dead, and are stored in some repository). +-- +-- When an input key is itself stored in a compute remote, check that +-- its inputs are also still available. checkKey :: RemoteStateHandle -> Key -> Annex Bool checkKey rs k = do - states <- getComputeStatesUnsorted rs k - return (not (null states)) + deadset <- S.fromList . M.keys . M.filter (== DeadTrusted) + <$> (trustMapLoad' =<< Annex.getState Annex.remotes) + computeset <- S.fromList . M.keys . M.filter isComputeRemote' + <$> remoteConfigMap + availablecompute [] deadset computeset k rs + where + availablecompute inputkeys deadset computeset k' rs' + | k' `elem` inputkeys = return False -- avoid cycles + | otherwise = + anyM (hasinputs inputkeys deadset computeset . snd) + =<< getComputeStatesUnsorted rs' k' + + hasinputs inputkeys deadset computeset state = do + let ks = M.elems (computeInputs state) + ifM (anyM checkDead ks) + ( return False + , allM (available inputkeys deadset computeset) ks + ) + + available inputkeys deadset computeset k' = do + (repolocs, computelocs) <- + partition (flip S.notMember computeset) + . filter (flip S.notMember deadset) + <$> loggedLocations k' + if not (null repolocs) + then return True + else anyM (availablecompute (k':inputkeys) deadset computeset k' . 
RemoteStateHandle) computelocs -- Unsetting the compute state will prevent computing the key. dropKey :: RemoteStateHandle -> Maybe SafeDropProof -> Key -> Annex () diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 44ffe03b8d..4a2a23859e 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -42,15 +42,3 @@ compute special remote. --[[Joey]] Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. - -* Should checkPresent check that each input file is also present in some - (non-dead) repo? - - Currently it only checks if compute state is recorded. The problem (Diff truncated)
update
diff --git a/doc/git-annex-findcomputed.mdwn b/doc/git-annex-findcomputed.mdwn index a1c6cf1351..8e1bafe7d0 100644 --- a/doc/git-annex-findcomputed.mdwn +++ b/doc/git-annex-findcomputed.mdwn @@ -18,7 +18,7 @@ was provided to [[git-annex-addcomputed]](1). For example: # git-annex findcomputed - foo.png (imageconvert) -- convert file.raw file.jpeg passes=10 + foo.jpeg (imageconvert) -- convert file.raw file.jpeg passes=10 bar.gz (compressor) -- compress bar --level=9 # OPTIONS diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index c067051833..44ffe03b8d 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -15,9 +15,10 @@ compute special remote. --[[Joey]] * annex.diskreserve can also be violated if computing a file gets source files that are larger than the disk reserve. This could be checked. -* Maybe add a file matching option, eg: +* Maybe add a file matching options, eg: - git-annex find --inputof=remote:file + git-annex find --computeinputof=remote:file + git-annex find --computeoutputof=remote:file * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs
findcomputed: New command, displays information about computed files.
diff --git a/CHANGELOG b/CHANGELOG index 83df038ec3..51298af244 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -4,6 +4,7 @@ git-annex (10.20250116) UNRELEASED; urgency=medium * addcomputed: New command, adds a file that is generated by a compute special remote. * recompute: New command, recomputes computed files. + * findcomputed: New command, displays information about computed files. * Support help.autocorrect settings "prompt", "never", and "immediate". * Allow setting remote.foo.annex-tracking-branch to a branch name that contains "/", as long as it's not a remote tracking branch. diff --git a/CmdLine/GitAnnex.hs b/CmdLine/GitAnnex.hs index 8dc64f8b7b..5032278873 100644 --- a/CmdLine/GitAnnex.hs +++ b/CmdLine/GitAnnex.hs @@ -135,6 +135,7 @@ import qualified Command.MaxSize import qualified Command.Sim import qualified Command.AddComputed import qualified Command.Recompute +import qualified Command.FindComputed import qualified Command.Version import qualified Command.RemoteDaemon #ifdef WITH_ASSISTANT @@ -269,6 +270,7 @@ cmds testoptparser testrunner mkbenchmarkgenerator = map addGitAnnexCommonOption , Command.Sim.cmd , Command.AddComputed.cmd , Command.Recompute.cmd + , Command.FindComputed.cmd , Command.Version.cmd , Command.RemoteDaemon.cmd #ifdef WITH_ASSISTANT diff --git a/Command/FindComputed.hs b/Command/FindComputed.hs new file mode 100644 index 0000000000..d9064b2390 --- /dev/null +++ b/Command/FindComputed.hs @@ -0,0 +1,93 @@ +{- git-annex command + - + - Copyright 2025 Joey Hess <id@joeyh.name> + - + - Licensed under the GNU AGPL version 3 or higher. 
+ -} + +{-# LANGUAGE OverloadedStrings, TupleSections #-} + +module Command.FindComputed where + +import Command +import Git.FilePath +import qualified Utility.Format +import Utility.Terminal +import Command.Find (showFormatted, formatVars) +import Remote.Compute (isComputeRemote, getComputeState, ComputeState(..)) +import qualified Remote +import qualified Types.Remote as Remote + +cmd :: Command +cmd = withAnnexOptions [annexedMatchingOptions] $ noCommit $ noMessages $ + withAnnexOptions [jsonOptions] $ + command "findcomputed" SectionQuery "lists computed files" + paramPaths (seek <$$> optParser) + +data FindComputedOptions = FindComputedOptions + { findThese :: CmdParams + , formatOption :: Maybe Utility.Format.Format + , keyOptions :: Maybe KeyOptions + } + +optParser :: CmdParamsDesc -> Parser FindComputedOptions +optParser desc = FindComputedOptions + <$> cmdParams desc + <*> optional parseFormatOption + <*> optional parseBranchKeysOption + +parseFormatOption :: Parser Utility.Format.Format +parseFormatOption = + option (Utility.Format.gen <$> str) + ( long "format" <> metavar paramFormat + <> help "control format of output" + ) + +seek :: FindComputedOptions -> CommandSeek +seek o = do + unless (isJust (keyOptions o)) $ + checkNotBareRepo + isterminal <- liftIO $ checkIsTerminal stdout + computeremotes <- filter isComputeRemote <$> Remote.remoteList + let seeker = AnnexedFileSeeker + { startAction = const (start o isterminal computeremotes) + , checkContentPresent = Nothing + , usesLocationLog = True + } + withKeyOptions (keyOptions o) False seeker + (commandAction . 
startKeys o isterminal computeremotes) + (withFilesInGitAnnex ww seeker) + =<< workTreeItems ww (findThese o) + where + ww = WarnUnmatchLsFiles "findcomputed" + +start :: FindComputedOptions -> IsTerminal -> [Remote] -> SeekInput -> OsPath -> Key -> CommandStart +start o isterminal computeremotes _ file key = do + rs <- Remote.remotesWithUUID computeremotes + <$> Remote.keyLocations key + rcs <- catMaybes <$> forM rs get + if null rcs + then stop + else startingCustomOutput key $ do + forM_ rcs $ \(r, c) -> do + let computation = unwords (computeParams c) + let unformatted = fromOsPath file + <> " (" <> encodeBS (Remote.name r) + <> ") -- " + <> encodeBS computation + let formatvars = + [ ("remote", Remote.name r) + , ("computation", computation) + ] ++ formatVars key (AssociatedFile (Just file)) + showFormatted isterminal (formatOption o) + unformatted formatvars + next $ return True + where + get r = fmap (r, ) + <$> getComputeState (Remote.remoteStateHandle r) key + +startKeys :: FindComputedOptions -> IsTerminal -> [Remote] -> (SeekInput, Key, ActionItem) -> CommandStart +startKeys o isterminal computeremotes (si, key, ActionItemBranchFilePath (BranchFilePath _ topf) _) = + start o isterminal computeremotes si (getTopFilePath topf) key +startKeys _ _ _ _ = stop + diff --git a/Remote/List/Util.hs b/Remote/List/Util.hs index e022d23190..c251198067 100644 --- a/Remote/List/Util.hs +++ b/Remote/List/Util.hs @@ -55,8 +55,8 @@ remoteLocations' (IncludeIgnored ii) locations trusted rs = do -- remotes that match uuids that have the key allremotes <- if not ii - then filterM (not <$$> liftIO . getDynamicConfig . remoteAnnexIgnore . gitconfig) rs - else return rs + then filterM (not <$$> liftIO . getDynamicConfig . remoteAnnexIgnore . 
gitconfig) rs + else return rs let validremotes = remotesWithUUID allremotes locations return (sortBy (comparing cost) validremotes, validtrustedlocations) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index 7b2ca0b86a..0b59650268 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -99,6 +99,8 @@ the parameters provided to `git-annex addcomputed`. [[git-annex-recompute]](1) +[[git-annex-findcomputed]](1) + [[git-annex-initremote]](1) # AUTHOR diff --git a/doc/git-annex-findcomputed.mdwn b/doc/git-annex-findcomputed.mdwn new file mode 100644 index 0000000000..a1c6cf1351 --- /dev/null +++ b/doc/git-annex-findcomputed.mdwn @@ -0,0 +1,75 @@ +# NAME + +git-annex findcomputed - lists computed files + +# SYNOPSIS + +git annex findcomputed `[path ...]` + +# DESCRIPTION + +Outputs a list of files in the specified path that can be computed by +enabled compute special remotes. With no path, lists files in the current +directory and its subdirectories. + +Along with the name of each computed file, this displays the input that +was provided to [[git-annex-addcomputed]](1). + +For example: + + # git-annex findcomputed + foo.png (imageconvert) -- convert file.raw file.jpeg passes=10 + bar.gz (compressor) -- compress bar --level=9 + +# OPTIONS + +* matching options + + The [[git-annex-matching-options]](1) + can be used to specify files to list. + +* `--branch=ref` + + List computed files in the specified branch or treeish. + +* `--format=value` (Diff truncated)
update
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 760e3e7ba5..36a17f461f 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -18,8 +18,15 @@ compute special remote. --[[Joey]] * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an url. Maybe a separate command? That would also allow querying for eg, - what files are inputs for another file. Or it could be exposed in the - Remote interface, and made into a file matching option. + what files are inputs for another file. + + Or it could be exposed in the + Remote interface, and made into a file matching option: + + git-annex find --inputof=foo + + But that would require running expensive find over the whole tree, + and wouldn't work if the input file is no longer in the tree. * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs @@ -43,3 +50,15 @@ compute special remote. --[[Joey]] Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. + +* Should checkPresent check that each input file is also present in some + (non-dead) repo? + + Currently it only checks if compute state is recorded. The problem + this additional checking would solve is if an input file gets lost, + then a computation cannot be run again. + + Should it be an active check against existing remotes, or a + passive check? An active check certainly makes sense if the input + file is itself present in a compute repo, either the same one or a + different one. Otherwise, a passive check seems enough.
--json for addcomputed and recompute
Not very useful, but it does work.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 2c389ef53a..4c1b68e111 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -36,7 +36,7 @@ import qualified Data.Map as M import Data.Time.Clock cmd :: Command -cmd = notBareRepo $ withAnnexOptions [backendOption] $ +cmd = notBareRepo $ withAnnexOptions [backendOption, jsonOptions] $ command "addcomputed" SectionCommon "add computed files to annex" (paramRepeating paramExpression) (seek <$$> optParser) diff --git a/Command/Recompute.hs b/Command/Recompute.hs index df701fb852..5d9f93fde1 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -29,7 +29,7 @@ import qualified Data.Map as M import System.PosixCompat.Files (isSymbolicLink) cmd :: Command -cmd = notBareRepo $ +cmd = notBareRepo $ withAnnexOptions [jsonOptions] $ command "recompute" SectionCommon "recompute computed files" paramPaths (seek <$$> optParser) diff --git a/doc/git-annex-addcomputed.mdwn b/doc/git-annex-addcomputed.mdwn index faff1d96b6..7b2ca0b86a 100644 --- a/doc/git-annex-addcomputed.mdwn +++ b/doc/git-annex-addcomputed.mdwn @@ -86,6 +86,11 @@ the parameters provided to `git-annex addcomputed`. Specifies which key-value backend to use. +* `--json` + + Enable JSON output. This is intended to be parsed by programs that use + git-annex. Each line of output is a JSON object. + * Also the [[git-annex-common-options]](1) can be used. # SEE ALSO diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index f10125827c..daf403471f 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -48,6 +48,11 @@ updated with the new content. The updated file is staged in git. This is the default when the compute remote indicates that it produces reproducible output. +* `--json` + + Enable JSON output. This is intended to be parsed by programs that use + git-annex. Each line of output is a JSON object. 
+ * matching options The [[git-annex-matching-options]](1) can be used to control what
record fscked files in fsck db by default
Remember the files that are checked, so a later run with --more will
skip them, without needing to use --incremental.
diff --git a/CHANGELOG b/CHANGELOG index 8c944a4bfb..83df038ec3 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -10,6 +10,8 @@ git-annex (10.20250116) UNRELEASED; urgency=medium * Added OsPath build flag, which speeds up git-annex's operations on files. * git-lfs: Added an optional apiurl parameter. (This needs version 1.2.5 of the haskell git-lfs library to be used.) + * fsck: Remember the files that are checked, so a later run with --more + will skip them, without needing to use --incremental. -- Joey Hess <id@joeyh.name> Mon, 20 Jan 2025 10:24:51 -0400 diff --git a/Command/Fsck.hs b/Command/Fsck.hs index 4e66755c02..a6b6e54875 100644 --- a/Command/Fsck.hs +++ b/Command/Fsck.hs @@ -713,13 +713,12 @@ getStartTime u = do #endif data Incremental - = NonIncremental + = NonIncremental (Maybe FsckDb.FsckHandle) | ScheduleIncremental Duration UUID Incremental | StartIncremental FsckDb.FsckHandle | ContIncremental FsckDb.FsckHandle prepIncremental :: UUID -> Maybe IncrementalOpt -> Annex Incremental -prepIncremental _ Nothing = pure NonIncremental prepIncremental u (Just StartIncrementalO) = do recordStartTime u ifM (FsckDb.newPass u) @@ -734,6 +733,14 @@ prepIncremental u (Just (ScheduleIncrementalO delta)) = do Nothing -> StartIncrementalO Just _ -> MoreIncrementalO return (ScheduleIncremental delta u i) +prepIncremental u Nothing = + ifM (Annex.getRead Annex.fast) + -- Avoid recording fscked files in --fast mode, + -- since that can interfere with a non-fast incremental + -- fsck. + ( pure (NonIncremental Nothing) + , (NonIncremental . 
Just) <$> openFsckDb u + ) cleanupIncremental :: Incremental -> Annex () cleanupIncremental (ScheduleIncremental delta u i) = do @@ -757,6 +764,6 @@ openFsckDb u = do withFsckDb :: Incremental -> (FsckDb.FsckHandle -> Annex ()) -> Annex () withFsckDb (ContIncremental h) a = a h withFsckDb (StartIncremental h) a = a h -withFsckDb NonIncremental _ = noop +withFsckDb (NonIncremental mh) a = maybe noop a mh withFsckDb (ScheduleIncremental _ _ i) a = withFsckDb i a diff --git a/doc/git-annex-fsck.mdwn b/doc/git-annex-fsck.mdwn index 4083ba4bf1..89760119d8 100644 --- a/doc/git-annex-fsck.mdwn +++ b/doc/git-annex-fsck.mdwn @@ -37,17 +37,24 @@ better format. * `--incremental` - Start a new incremental fsck pass. An incremental fsck can be interrupted - at any time, with eg ctrl-c. + Start a new incremental fsck pass, clearing records of all files that + were checked in the previous incremental fsck pass. * `--more` - Resume the last incremental fsck pass, where it left off. + Skip files that were checked since the last incremental fsck pass + was started. + + Note that before `--incremental` is used to start an incremental fsck + pass, files that are checked are still recorded, and using this option + will skip checking those files again. Resuming may redundantly check some files that were checked before. Any files that fsck found problems with before will be re-checked on resume. Also, checkpoints are made every 1000 files or every 5 minutes - during a fsck, and it resumes from the last checkpoint. + during a fsck, and it resumes from the last checkpoint, so if an + incremental fsck is interrupted using eg ctrl-c, it will recheck files + that didn't get into the last checkpoint. 
* `--incremental-schedule=time` diff --git a/doc/todo/Incremental_fsck_by_default.mdwn b/doc/todo/Incremental_fsck_by_default.mdwn index f662549e63..169e02c6be 100644 --- a/doc/todo/Incremental_fsck_by_default.mdwn +++ b/doc/todo/Incremental_fsck_by_default.mdwn @@ -9,3 +9,6 @@ I actually don't see much reason to not make use of an incremental fsck either u On that note: There also does not appear to be a documented method to figure out whether a fsck was interrupted before. You could infer existence and date from the annex internal directory structure but seeing the progress requires manual sql. Perhaps there could be a `fsck --info` flag for showing both interrupted fsck progress and perhaps also the progress of the current fsck. + +> I've implemented the default recording to the fsck database. [[done]] +> --[[Joey]] diff --git a/doc/todo/Incremental_fsck_by_default/comment_1_5f35afc17e865899f72a62bff8ff30e9._comment b/doc/todo/Incremental_fsck_by_default/comment_1_5f35afc17e865899f72a62bff8ff30e9._comment new file mode 100644 index 0000000000..2cfeabf04c --- /dev/null +++ b/doc/todo/Incremental_fsck_by_default/comment_1_5f35afc17e865899f72a62bff8ff30e9._comment @@ -0,0 +1,35 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2025-03-17T18:34:20Z" + content=""" +I think it could make sense, when --incremental/--more are not passed, to +initialize a new fsck database if there is not already one, and +add each fscked key to the fsck database. + +That way, the user could run any combination of fscks, interrupted or not, +and then use --more to fsck only new files. When the user wants to start +a new fsck pass, they would use --incremental. + +It would need to avoid recording an incremental fsck pass start time, +to avoid interfering with --incremental-schedule. + +The only problem I see with this is, someone might have a long-term +incremental fsck they're running that is doing full checksumming. 
+If they then do a quick fsck --fast for other reasons, it would +record that every key has been fscked, and so lose their place. +So it seems --fast should disable this new behavior. (Also incremental +--fast fsck is not likely to be very useful anyway.) + +> I actually don't see much reason to not make use of an incremental fsck +> either unless it's *really* old + +That's a hard judgement call for a program to make... someone might think +10 minutes is really old, and someone else that a month is. + +As to figuring out whether a fsck was interrupted before, surely what +matters is you remembering that? All git-annex has is a timestamp when +the last fsck pass started, which is available in +`.git/annex/fsck/*/state`, and a list of the keys that were fscked, +which is not very useful as far as determining the progress of that fsck. +"""]]
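The `Command/Fsck.hs` change above can be read as a small model: `NonIncremental` now optionally carries a database handle, and `withFsckDb` runs its action only when a handle is present. The following is a minimal sketch of that shape, not git-annex's code. `FsckHandle` is a stand-in newtype, `prepDefault` and the `"fsck.db"` name are hypothetical, and `ScheduleIncremental` is left out for brevity.

```haskell
module Main where

-- Stand-in for FsckDb.FsckHandle (assumption: the real type is opaque).
newtype FsckHandle = FsckHandle String
  deriving (Show, Eq)

-- Simplified version of the Incremental type after the change;
-- ScheduleIncremental is omitted here.
data Incremental
  = NonIncremental (Maybe FsckHandle)
  | StartIncremental FsckHandle
  | ContIncremental FsckHandle
  deriving (Show, Eq)

-- Hypothetical helper mirroring the new "no option given" case of
-- prepIncremental: in fast mode nothing is recorded, otherwise a
-- handle is opened so checked files are remembered.
prepDefault :: Bool -> Incremental
prepDefault fast
  | fast = NonIncremental Nothing
  | otherwise = NonIncremental (Just (FsckHandle "fsck.db"))

-- Same structure as withFsckDb in the diff: do nothing without a handle.
withFsckDb :: Applicative f => Incremental -> (FsckHandle -> f ()) -> f ()
withFsckDb (ContIncremental h) a = a h
withFsckDb (StartIncremental h) a = a h
withFsckDb (NonIncremental mh) a = maybe (pure ()) a mh

main :: IO ()
main = do
  -- A plain fsck records to the database.
  withFsckDb (prepDefault False) $ \(FsckHandle db) ->
    putStrLn ("recording to " ++ db)
  -- A --fast fsck has no handle, so the action is skipped.
  withFsckDb (prepDefault True) $ \_ ->
    putStrLn "recording in fast mode"
```

Wrapping the handle in `Maybe` keeps every caller of `withFsckDb` unchanged, which seems to be why the diff only touches the constructor and the one pattern match.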
decided to leave message as-is
"getting input <file> from <remote>" is talking about the original
input filename. I think that's ok.
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 1de0213bc8..760e3e7ba5 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -21,11 +21,6 @@ compute special remote. --[[Joey]] what files are inputs for another file. Or it could be exposed in the Remote interface, and made into a file matching option. -* "getting input from <file>" message uses the original filename, - but that file might have been renamed. Would be more clear to use - whatever file in the tree currently points to the key it's getting - (what if there is not one?) - * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs
decided addcomputed will not support annex.smallfiles
If it did, recompute would need to somehow support recomputing
non-annexed files.
And, annex.smallfiles is typically used for configuration files or
source code kind of things, where the user doesn't want it to be an
annexed file. Computed artifacts are not likely that kind of thing.
Also, git-annex importfeed is an example of something that does support
annex.addunlocked, but does not support annex.smallfiles.
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index db31b873cf..1de0213bc8 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -48,9 +48,3 @@ compute special remote. --[[Joey]] Or it could build a DAG and traverse it, but building a DAG of a large directory tree has its own problems. - -* Should addcomputed honor annex.smallfiles? That would seem to imply - that recompute should also support recomputing non-annexed files. - Otherwise, adding a file and then recomputing it would vary in - what the content of the file is, depending on annex.smallfiles setting. -
annex.addunlocked support for git-annex compute
And for git-annex recompute, add the file unlocked when the original is
unlocked.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 02d8826683..2c389ef53a 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -24,11 +24,13 @@ import Annex.UUID import Annex.GitShaKey import Types.KeySource import Types.Key +import Annex.FileMatcher import Messages.Progress import Logs.Location import Logs.EquivilantKeys import Utility.Metered import Backend.URL (fromUrl) +import Git.FilePath import qualified Data.Map as M import Data.Time.Clock @@ -73,20 +75,21 @@ seek o = startConcurrency commandStages (seek' o) seek' :: AddComputedOptions -> CommandSeek seek' o = do + addunlockedmatcher <- addUnlockedMatcher r <- getParsed (computeRemote o) unless (Remote.Compute.isComputeRemote r) $ giveup "That is not a compute remote." - commandAction $ start o r + commandAction $ start o r addunlockedmatcher -start :: AddComputedOptions -> Remote -> CommandStart -start o r = starting "addcomputed" ai si $ perform o r +start :: AddComputedOptions -> Remote -> AddUnlockedMatcher -> CommandStart +start o r = starting "addcomputed" ai si . 
perform o r where ai = ActionItemUUID (Remote.uuid r) (UnquotedString (Remote.name r)) si = SeekInput (computeParams o) -perform :: AddComputedOptions -> Remote -> CommandPerform -perform o r = do +perform :: AddComputedOptions -> Remote -> AddUnlockedMatcher -> CommandPerform +perform o r addunlockedmatcher = do program <- Remote.Compute.getComputeProgram r repopath <- fromRepo Git.repoPath subdir <- liftIO $ relPathDirToFile repopath (literalOsPath ".") @@ -102,8 +105,11 @@ perform o r = do (Remote.Compute.ImmutableState False) (getInputContent fast) Nothing - (addComputed (Just "adding") r (reproducible o) chooseBackend Just fast) + (go fast) next $ return True + where + go fast = addComputed (Just "adding") r (reproducible o) + chooseBackend Just fast (Right addunlockedmatcher) addComputed :: Maybe StringContainingQuotedPath @@ -112,11 +118,12 @@ addComputed -> (OsPath -> Annex Backend) -> (OsPath -> Maybe OsPath) -> Bool + -> Either Bool AddUnlockedMatcher -> Remote.Compute.ComputeProgramResult -> OsPath -> NominalDiffTime -> Annex () -addComputed maddaction r reproducibleconfig choosebackend destfile fast result tmpdir ts = do +addComputed maddaction r reproducibleconfig choosebackend destfile fast addunlockedmatcher result tmpdir ts = do when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." 
oks <- forM (M.keys outputs) $ \outputfile -> do @@ -163,19 +170,43 @@ addComputed maddaction r reproducibleconfig choosebackend destfile fast result t stateurl = Remote.Compute.computeStateUrl r state outputfile stateurlk = fromUrl stateurl Nothing True outputfile' = tmpdir </> outputfile - ld f = LockedDown ldc (ks f) - ks f = KeySource - { keyFilename = f - , contentLocation = outputfile' - , inodeCache = Nothing - } genkey f p = do backend <- choosebackend outputfile - fst <$> genKey (ks f) p backend - ingesthelper f p mk = - ingestwith $ do - k <- maybe (genkey f p) return mk - ingestAdd' p (Just (ld f)) (Just k) + let ks = KeySource + { keyFilename = f + , contentLocation = outputfile' + , inodeCache = Nothing + } + fst <$> genKey ks p backend + ingesthelper f p mk = ingestwith $ do + k <- maybe (genkey f p) return mk + topf <- inRepo $ toTopFilePath f + let fi = FileInfo + { contentFile = outputfile' + , matchFile = getTopFilePath topf + , matchKey = Just k + } + lockingfile <- case addunlockedmatcher of + Right addunlockedmatcher' -> + not <$> addUnlocked addunlockedmatcher' + (MatchingFile fi) + (not fast) + Left v -> pure v + let ldc = LockDownConfig + { lockingFile = lockingfile + , hardlinkFileTmpDir = Nothing + , checkWritePerms = True + } + liftIO $ createDirectoryIfMissing True $ + takeDirectory f + liftIO $ moveFile outputfile' f + let ks = KeySource + { keyFilename = f + , contentLocation = f + , inodeCache = Nothing + } + let ld = LockedDown ldc ks + ingestAdd' p (Just ld) (Just k) ingestwith a = a >>= \case Nothing -> giveup "ingestion failed" Just k -> do @@ -188,12 +219,6 @@ addComputed maddaction r reproducibleconfig choosebackend destfile fast result t =<< calcRepo (gitAnnexLocation k) return k - ldc = LockDownConfig - { lockingFile = True - , hardlinkFileTmpDir = Nothing - , checkWritePerms = True - } - isreproducible = case reproducibleconfig of Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible result diff --git 
a/Command/Recompute.hs b/Command/Recompute.hs index 82ed7ab37e..df701fb852 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -23,8 +23,10 @@ import Logs.Location import Command.AddComputed (Reproducible(..), parseReproducible, getInputContent, getInputContent', addComputed) import Backend (maybeLookupBackendVariety, unknownBackendVarietyMessage, chooseBackend) import Types.Key +import qualified Utility.RawFilePath as R import qualified Data.Map as M +import System.PosixCompat.Files (isSymbolicLink) cmd :: Command cmd = notBareRepo $ @@ -126,19 +128,22 @@ perform :: RecomputeOptions -> Remote -> OsPath -> Key -> Remote.Compute.Compute perform o r file origkey origstate = do program <- Remote.Compute.getComputeProgram r reproducibleconfig <- getreproducibleconfig + originallocked <- liftIO $ isSymbolicLink + <$> R.getSymbolicLinkStatus (fromOsPath file) showOutput Remote.Compute.runComputeProgram program origstate (Remote.Compute.ImmutableState False) (getinputcontent program) Nothing - (go program reproducibleconfig) + (go program reproducibleconfig originallocked) next cleanup where - go program reproducibleconfig result tmpdir ts = do + go program reproducibleconfig originallocked result tmpdir ts = do checkbehaviorchange program (Remote.Compute.computeState result) addComputed Nothing r reproducibleconfig - choosebackend destfile False result tmpdir ts + choosebackend destfile False (Left originallocked) + result tmpdir ts checkbehaviorchange program state = do let check s w a b = forM_ (M.keys (w a)) $ \f -> diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index c6e5a64de6..db31b873cf 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -29,21 +29,6 @@ compute special remote. 
--[[Joey]] * allow git-annex enableremote with program= explicitly specified, without checking annex.security.allowed-compute-programs -* addcomputed should honor annex.addunlocked. - - What about recompute? It seems it should either write the new version of - the file as an unlocked file when the old version was unlocked, or also - honor annex.addunlocked. - - Problem: Since recompute does not stage the file, it would have to write - the content to the working tree. And then the user would need to - git-annex add. But then, if the key was a VURL key, it would add it with - the default backend instead, and the file would no longer use a computed - key. (Diff truncated)
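The new `Either Bool AddUnlockedMatcher` parameter of `addComputed` encodes two ways of deciding whether a file is added locked: `recompute` passes a fixed answer taken from the original file, and `addcomputed` passes a matcher for `annex.addunlocked`. The sketch below illustrates only that decision. `Matcher`, `LockChoice` and this pure `lockingFile` function are hypothetical simplifications; the real matcher runs in the `Annex` monad and inspects more than the file name.

```haskell
module Main where

import Data.List (isSuffixOf)

-- Hypothetical stand-in for AddUnlockedMatcher: a pure predicate that
-- says whether a file should be added unlocked.
newtype Matcher = Matcher (FilePath -> Bool)

-- Left: a fixed decision (recompute follows the original file's state).
-- Right: consult the matcher (addcomputed follows annex.addunlocked).
type LockChoice = Either Bool Matcher

-- True means the file is added locked.
lockingFile :: LockChoice -> FilePath -> Bool
lockingFile (Left originallocked) _ = originallocked
lockingFile (Right (Matcher wantsUnlocked)) f = not (wantsUnlocked f)

main :: IO ()
main = do
  -- Assumed configuration for illustration: unlock only gif files.
  let unlockGifs = Matcher (".gif" `isSuffixOf`)
  print (lockingFile (Right unlockGifs) "foo.gif")
  print (lockingFile (Right unlockGifs) "foo.dat")
  print (lockingFile (Left True) "foo.gif")
```

Using `Either` rather than two separate arguments makes it impossible to pass both a fixed decision and a matcher at once.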
diff --git a/doc/todo/Incremental_fsck_by_default.mdwn b/doc/todo/Incremental_fsck_by_default.mdwn new file mode 100644 index 0000000000..f662549e63 --- /dev/null +++ b/doc/todo/Incremental_fsck_by_default.mdwn @@ -0,0 +1,11 @@ +Whenever I do an fsck, it's always annoyed me that you have to think of adding `--incremental` and then also think about whether an incremental fsck was started and interrupted before which would then require `--more` instead. + +Forgetting to add `--incremental` can leave you in a pickle when you later find out that you need to interrupt the fsck, losing all progress. + +I've found myself wondering whether there'd ever be a case where I'd not want an fsck to be resumeable. Could git-annex not just simply always store that information and leave it up to the next fsck execution to decide whether to use it or not? + +I actually don't see much reason to not make use of an incremental fsck either unless it's *really* old but I find this a lot more debatable than at least storing fsck state on each run. + +On that note: There also does not appear to be a documented method to figure out whether a fsck was interrupted before. You could infer existence and date from the annex internal directory structure but seeing the progress requires manual sql. + +Perhaps there could be a `fsck --info` flag for showing both interrupted fsck progress and perhaps also the progress of the current fsck.
diff --git a/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__.mdwn b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__.mdwn new file mode 100644 index 0000000000..29ceee21f5 --- /dev/null +++ b/doc/forum/Does___96__fsck_--more__96___imply___96__--incremental__96____63__.mdwn @@ -0,0 +1,3 @@ +The man page is not too clear on this and I noticed that it's not possible to pass both flags at once. + +Does interrupting `fsck --more` lose the progress made since the initial incremental fsck?
diff --git a/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn b/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn new file mode 100644 index 0000000000..c73a886d46 --- /dev/null +++ b/doc/bugs/fsck_complains_about_requires_of_dead_repos.mdwn @@ -0,0 +1,39 @@ +### Please describe the problem. + +When running an fsck, I just had git-annex tell me that required content was missing from a bunch of repos that comprise my cold storage which makes sense but it also listed dead repos in the listing. Those repos are still in the group and still have `groupwanted` as the required setting. + +Dead drives should never be considered requiring or wanting content, even if they're still configured as such. (Or holding content for that matter but I hope that part works.) + +### What steps will reproduce the problem? + +1. Have dead repos that require content +2. Have alive repos that require the same content (unsure if required) +3. `git annex fsck` + +### What version of git-annex are you using? On what operating system? 
+ +``` +git-annex version: 10.20241202 +build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Servant Feeds Testsuite S3 WebDAV +dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: linux x86_64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 +local repository version: 10 +``` + +### Please provide any additional information below. + +[[!format sh """ +# If you can, paste a complete transcript of the problem occurring here. +# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log + + +# End of transcript or log. +"""]] + +### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) + +Best invention since sliced bread.
diff --git a/doc/users/msz.mdwn b/doc/users/msz.mdwn new file mode 100644 index 0000000000..1d4fd37e4b --- /dev/null +++ b/doc/users/msz.mdwn @@ -0,0 +1,4 @@ +Michał Szczepanik +[@doktorpanik@masto.ai](https://masto.ai/@doktorpanik) + +Postdoc @ [Psychoinformatics Group](https://psychoinformatics.de/), INM-7, Forschungszentrum Jülich
Added a comment
diff --git a/doc/todo/compute_special_remote/comment_23_11e2e14ab20856b005793e80e79d2382._comment b/doc/todo/compute_special_remote/comment_23_11e2e14ab20856b005793e80e79d2382._comment new file mode 100644 index 0000000000..513cce8436 --- /dev/null +++ b/doc/todo/compute_special_remote/comment_23_11e2e14ab20856b005793e80e79d2382._comment @@ -0,0 +1,12 @@ +[[!comment format=mdwn + username="msz" + avatar="http://cdn.libravatar.org/avatar/6e8b88e7c70d86f4cfd27d450958aed4" + subject="comment 23" + date="2025-03-12T19:44:23Z" + content=""" +@joey: + +> I do hope I'm not closing off the design space from such differences by dropping a compute special remote right into git-annex. But I also expect that having a standard and easy way for at least simple computations will lead to a lot of contributions as others use it. + +I think it's excellent to have something like this in git-annex. I didn't have the opportunity to try it out yet, but I am definitely looking forward to seeing how things can work in practice and comparing the implementations. +"""]]
add compute tip
diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 52d650068f..b0027c7419 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -26,6 +26,8 @@ program takes a dashed option, it can be provided after "--": # git-annex initremote myremote type=compute program=git-annex-compute-foo -- --level=9 +See [[tips/computing_annexed_files]] for more documentation. + ## compute programs To write programs used by the compute special remote, see the diff --git a/doc/tips/computing_annexed_files.mdwn b/doc/tips/computing_annexed_files.mdwn new file mode 100644 index 0000000000..8ca448d8cc --- /dev/null +++ b/doc/tips/computing_annexed_files.mdwn @@ -0,0 +1,233 @@ +Do you ever check in original versions of files to `git-annex`, but then +convert them in some way? Maybe you check in original photos from a camera, +but then change them to a more useful file format, or smaller resolution. +Or you clip a video file. Or you crunch some data to a result. + +If you check the computed file into `git-annex` too, and store it on +your remotes along with the original, that's a waste of disk space. +But it is so convenient to be able to `git-annex get` the computed file. + +The [[compute special remote|special_remotes/compute]] is the solution to +this. It "stores" the computed file by remembering how to compute it from +input files. When you `git-annex get` the computed file from it, it re-runs +the computation on the original input file to produce the computed file. + +[[!toc ]] + +## using the compute special remote + +There are many compute programs that each handle some type of computation, +and it's pretty easy to write your own compute program too. In this tip, +we'll use [[special_remotes/compute/git-annex-compute-imageconvert]], +which uses imagemagick to convert between image formats. + +To follow along, install that program in PATH (and remember to make it +executable!) 
and make sure you have +[imagemagick](https://www.imagemagick.org/) installed. + +First, initialize a compute remote: + + # git-annex initremote imageconvert type=compute program=git-annex-compute-imageconvert + +Now suppose you have a file `foo.jpeg`, and you want to add a computed +`foo.gif` to the git-annex repository. + + # git-annex addcomputed --to=imageconvert foo.jpeg foo.gif + +(The syntax of the `git-annex addcomputed` command will vary depending on the +program that a compute remote uses. Some may have multiple input files, or +multiple output files, or other options to control the computation. See +the documentation of each compute program for details.) + +Now you have `foo.gif` and can use it as usual, including copying it to +other remotes. But it's already "stored" in the imageconvert remote, +as a computation. So to free up space, you can drop it: + + # git-annex drop foo.gif + drop foo.gif ok + +By the way, you can also add a computed file to the repository +without bothering to compute it yet! Just use `--fast`: + + # git-annex addcomputed --fast --to=imageconvert bar.jpeg bar.gif + +Now suppose you're in another clone of this same repository, and you want +these gifs. + + # git-annex get foo.gif + get foo.gif (not available) + Maybe enable some of these special remotes (git annex enableremote ...): + 8332f7ad-d54e-435e-803b-138c1cfa7b71 -- imageconvert + failed + +With [[special_remotes/compute/git-annex-compute-imageconvert]] and +imagemagick installed, all you need to do is enable the special remote to +get the computed files from it: + + # git-annex enableremote imageconvert + # git-annex get foo.gif + get foo.gif (from imageconvert...) + (getting input foo.jpeg from origin...) + ok + +Notice that, when the input file is not present in the repository, getting +a file from a compute remote will first get the input file. + +That's the basics of using the compute special remote. 
+ +## recomputation + +What happens if the input file `foo.jpeg` is changed to a new version? +Will getting `foo.gif` from the compute remote base it on the new version +too? No. `foo.gif` is stuck on the original version of the input file that +was used to compute it. + +But, it's easy to recompute the file with a new version of the input file. +Just `git-annex add` the new version of the input file, and then: + + # git-annex recompute foo.gif + recompute foo.gif (foo.jpeg changed) + ok + +You can use commands like `git diff` and `git status` to see the +change that was made to `foo.gif`. + + # git status --short foo.gif + M foo.gif + +Now both the new and old versions of `foo.gif` are stored in the +imageconvert remote, and it can compute either as needed. + +## reproducibility + +You might be wondering, what happens if a computed file, such as `foo.gif` +isn't exactly the same identical file each time it's computed? For example, +what if there's a timestamp in there. + +The answer is that, by default, files computed by a compute special remote +are not required, or guaranteed to be bit-for-bit reproducible. One gif +converted from a jpeg is much like any other converted from the same jpeg. + +So git-annex usually treats all files computed in the same way from the +same input as interchangeable. (Unless the compute program indicates +that it produces reproducible files.) + +Sometimes though, it's important that a file be bit-for-bit reproducible. And +you can ask git-annex to enforce this for computed files. +There is a `--reproducible` option for this, which you can pass to +`git-annex addcomputed` or to `git-annex recompute`. + +Let's switch the computed `foo.gif` to a reproducible file: + + # git-annex recompute --original --reproducible foo.gif + recompute foo.gif + ok + +You can `git commit foo.gif` to store this change. + +But first, let's check if that computation actually *is* reproducible. 
+This is easy, just drop it and get it from the compute remote again: + + # git-annex drop foo.gif + drop foo.gif ok + # git-annex get foo.gif --from imageconvert + get foo.gif (from imageconvert...) + ok + +If it turned out that the computation was not reproducible, getting the +file would fail, like this: + + # git-annex get foo.gif --from imageconvert + get foo.gif (from imageconvert...) + Verification of content failed + +This is because a reproducible file uses a regular [[backend]], which +by default uses a hash to verify the content of the file. + +If it does turn out that a file that was expected to be reproducible isn't, +you can always convert it to an unreproducible file: + + # git-annex recompute --original --unreproducible foo.gif + recompute foo.gif + ok + +## writing your own compute programs + +There is a whole little protocol that compute programs use to +communicate with git-annex. It's all documented at +[[design/compute_special_remote_interface]]. + +But it's really easy to write simple ones, and you don't need to +dive into all the details to do it. Let's walk through the code +to [[special_remotes/compute/git-annex-compute-imageconvert]], +which at 14 lines, is about as simple as one can be. + + #!/bin/sh + +It's a shell script. + + set -e + +If it fails to read input from standard input, or if a command fails, it +will exit nonzero. + + if [ -z "$1" ] || [ -z "$2" ]; then + echo "Specify the input image file, followed by the output image file." >&2 + echo "Example: foo.jpg foo.gif" >&2 + exit 1 + fi + +It expects to be passed two parameters, which were "foo.jpeg" and "foo.gif" in +the examples above. And it outputs some usage to stderr otherwise. That is (Diff truncated)
recompute: stage new version of file in git
When writing doc/tips/computing_annexed_files.mdwn, I noticed
that a recompute --reproducible followed by a drop and a re-get did not
actually test if the file could be reproducibly computed again.
Turns out that get and drop both operate on staged files. If there is an
unstaged modification in the work tree, that's ignored. Somewhat
surprisingly, other commands like info do operate on staged files. So
behavior is inconsistent, and fairly surprising really, when there are
unstaged modifications to files.
Probably this is rarely noticed because `git-annex add` is used to add a
new version of a file, and then it's staged. Or `git mv` is used to move
a file, rather than `mv` of a file over top of an existing file. So it's
uncommon to have an unstaged annexed file in a worktree.
It might be worth making things more consistent, but that's out of scope
for what I'm working on currently.
Also, I anticipate that supporting unlocked files with recompute will
require it to stage changes anyway.
So, make recompute stage the new version of the file.
I considered having recompute refuse to overwrite an existing staged
file. After all, whatever version was staged before will get lost when
the new version is staged over top of it. But, that's no different than
`git-annex addcomputed` being run with the name of an existing staged
file. Or `git-annex add` being run with a new file content when there is
an existing staged file. Or, for that matter, `git add` being run with a
new content when there is an existing staged file.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index dd6c310b06..02d8826683 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -102,12 +102,11 @@ perform o r = do (Remote.Compute.ImmutableState False) (getInputContent fast) Nothing - (addComputed (Just "adding") True r (reproducible o) chooseBackend Just fast) + (addComputed (Just "adding") r (reproducible o) chooseBackend Just fast) next $ return True addComputed :: Maybe StringContainingQuotedPath - -> Bool -> Remote -> Maybe Reproducible -> (OsPath -> Annex Backend) @@ -117,7 +116,7 @@ addComputed -> OsPath -> NominalDiffTime -> Annex () -addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fast result tmpdir ts = do +addComputed maddaction r reproducibleconfig choosebackend destfile fast result tmpdir ts = do when (M.null outputs) $ giveup "The computation succeeded, but it did not generate any files." oks <- forM (M.keys outputs) $ \outputfile -> do @@ -148,9 +147,7 @@ addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fa | fast = do case destfile outputfile of Nothing -> noop - Just f - | stagefiles -> addSymlink f stateurlk Nothing - | otherwise -> makelink f stateurlk + Just f -> addSymlink f stateurlk Nothing return stateurlk | isreproducible = do sz <- liftIO $ getFileSize outputfile' @@ -175,16 +172,10 @@ addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fa genkey f p = do backend <- choosebackend outputfile fst <$> genKey (ks f) p backend - makelink f k = void $ makeLink f k Nothing - ingesthelper f p mk - | stagefiles = ingestwith $ do + ingesthelper f p mk = + ingestwith $ do k <- maybe (genkey f p) return mk ingestAdd' p (Just (ld f)) (Just k) - | otherwise = ingestwith $ do - k <- maybe (genkey f p) return mk - mk' <- fst <$> ingest p (Just (ld f)) (Just k) - maybe noop (makelink f) mk' - return mk' ingestwith a = a >>= \case Nothing -> giveup "ingestion failed" Just k -> do diff --git 
a/Command/Recompute.hs b/Command/Recompute.hs index 17246d10e4..82ed7ab37e 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -137,7 +137,7 @@ perform o r file origkey origstate = do go program reproducibleconfig result tmpdir ts = do checkbehaviorchange program (Remote.Compute.computeState result) - addComputed Nothing False r reproducibleconfig + addComputed Nothing r reproducibleconfig choosebackend destfile False result tmpdir ts checkbehaviorchange program state = do diff --git a/doc/git-annex-recompute.mdwn b/doc/git-annex-recompute.mdwn index 498c85e26c..f10125827c 100644 --- a/doc/git-annex-recompute.mdwn +++ b/doc/git-annex-recompute.mdwn @@ -15,8 +15,7 @@ By default, this only recomputes files whose input files have changed. The new contents of the input files are used to re-run the computation. When the output of the computation is different, the computed file is -updated with the new content. The updated file is written to the worktree, -but is not staged, in order to avoid overwriting any staged changes. +updated with the new content. The updated file is staged in git. # OPTIONS diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index 820b423199..c6e5a64de6 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,13 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* recompute should stage files in git. Otherwise, - `git-annex drop` after recompute --reproducible drops the staged - file, and `git-annex get` gets the staged file, and if it wasn't - actually reproducible, this is not apparent. - - This is blocking adding the tip. - * Support parallel get of input files. The design allows for this, but how much parallelism makes sense? Would it be possible to use the usual worker pool?
todo
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index f478c5d966..820b423199 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,7 +1,12 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* write a tip showing how to use this +* recompute should stage files in git. Otherwise, + `git-annex drop` after recompute --reproducible drops the staged + file, and `git-annex get` gets the staged file, and if it wasn't + actually reproducible, this is not apparent. + + This is blocking adding the tip. * Support parallel get of input files. The design allows for this, but how much parallelism makes sense? Would it be possible to use the @@ -10,7 +15,12 @@ compute special remote. --[[Joey]] * compute on input files in submodules * annex.diskreserve can be violated if getting a file computes it but also - some other output files, which get added to the annex. + some other output files, which get added to the annex. This can't be + avoided at addcomputed time, but when getting later from the compute + remote, it could check (but not when using VURL without size information) + +* annex.diskreserve can also be violated if computing a file gets source + files that are larger than the disk reserve. This could be checked. * would be nice to have a way to see what computations are used by a compute remote for a file. Put it in `whereis` output? But it's not an
improve
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index f286a0b7cd..001b57a6d1 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -29,7 +29,7 @@ avoid exposing user input to the shell unprotected, or otherwise executing it. (Except when the program is explicitly running user input in some form of sandbox.) -## interface +## program parameters and environment Whatever values the user passes to `git-annex addcomputed` are passed to the program in `ARGV`, followed by any values that the user provided to
comment
diff --git a/doc/special_remotes/compute/comment_5_d3faa33c3876d6f4883cce19189b7928._comment b/doc/special_remotes/compute/comment_5_d3faa33c3876d6f4883cce19189b7928._comment new file mode 100644 index 0000000000..3a41e92045 --- /dev/null +++ b/doc/special_remotes/compute/comment_5_d3faa33c3876d6f4883cce19189b7928._comment @@ -0,0 +1,19 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: just thinking out loud""" + date="2025-03-11T16:42:46Z" + content=""" +> And there could be some generic "helper" (or a number of them) which would then provide desired CLI interfacing over arbitrary command + +Absolutely! + +You do need to use "--" before your own custom dashed options. + +And bear in mind that "field=value" parameters passed to initremote will +be passed on to the program. So you can have a generic helper +that is instantiated with a parameter like --command=, which then gets used +automatically when running addcompute: + + git-annex initremote foo type=compute program=git-annex-compute-generic-helper -- --command='convert {inputs} {outputs}' + git-annex addcomputed --to=foo -- -i foo.jpeg -o foo.gif +"""]]
buffer responses to compute programs in a TQueue
This avoids a potential problem where the program sends several INPUT
before reading responses, so flushing the response to the pipe could
block. It's unlikely, but seemed worth making sure it can't happen.
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 0b27d135ba..2ef7844808 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -435,10 +435,12 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) showOutput starttime <- liftIO currentMonotonicTimestamp let startresult = ComputeProgramResult state False False False - result <- withmeterfile $ \meterfile -> bracket - (liftIO $ createProcess pr) - (liftIO . cleanupProcess) - (getinput tmpdir subdir startresult meterfile) + result <- withmeterfile $ \meterfile -> + bracket + (liftIO $ createProcess pr) + (liftIO . cleanupProcess) $ \p -> + withoutputv p $ + getinput tmpdir subdir startresult meterfile p endtime <- liftIO currentMonotonicTimestamp liftIO $ checkoutputs result subdir cont result subdir (calcduration starttime endtime) @@ -453,14 +455,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) , return tmpdir ) - getinput tmpdir subdir result meterfile p = + getinput tmpdir subdir result meterfile p outputv = liftIO (hGetLineUntilExitOrEOF (processHandle p) (stdoutHandle p)) >>= \case Just l - | null l -> getinput tmpdir subdir result meterfile p + | null l -> getinput tmpdir subdir result meterfile p outputv | otherwise -> do fastDebug "Compute" ("< " ++ l) - result' <- parseoutput p tmpdir subdir result meterfile l - getinput tmpdir subdir result' meterfile p + result' <- parseoutput outputv tmpdir subdir result meterfile l + getinput tmpdir subdir result' meterfile p outputv Nothing -> do liftIO $ hClose (stdoutHandle p) liftIO $ hClose (stdinHandle p) @@ -468,19 +470,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) giveup $ program ++ " exited unsuccessfully" return result - sendresponse p s = do - fastDebug "Compute" ("> " ++ s) - liftIO $ hPutStrLn (stdinHandle p) s - liftIO $ hFlush (stdinHandle p) - - parseoutput p tmpdir subdir result meterfile l = case Proto.parseMessage l of - Just 
(ProcessInput f) -> handleinput f False p tmpdir subdir result - Just (ProcessInputRequired f) -> handleinput f True p tmpdir subdir result + parseoutput outputv tmpdir subdir result meterfile l = case Proto.parseMessage l of + Just (ProcessInput f) -> handleinput f False outputv tmpdir subdir result + Just (ProcessInputRequired f) -> handleinput f True outputv tmpdir subdir result Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" -- Modify filename so eg "-foo" becomes "./-foo" - sendresponse p $ toCommand' (File f) + sendresponse outputv $ toCommand' (File f) -- If the output file is in a subdirectory, make -- the directories so the compute program doesn't -- need to. @@ -508,7 +505,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Just ProcessSandbox -> do sandboxpath <- liftIO $ fromOsPath <$> relPathDirToFile subdir tmpdir - sendresponse p $ + sendresponse outputv $ if null sandboxpath then "." else sandboxpath @@ -516,7 +513,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" - handleinput f required p tmpdir subdir result = do + handleinput f required outputv tmpdir subdir result = do let f' = toOsPath f let knowninput = M.member f' (computeInputs (computeState result)) @@ -534,7 +531,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) mkrel $ pure obj Just (Left gitsha) -> mkrel $ populategitsha gitsha tmpdir - sendresponse p $ + sendresponse outputv $ maybe "" fromOsPath mp let result' = result { computeInputsUnavailable = @@ -630,6 +627,28 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Just sz -> progress $ BytesProcessed $ floor $ fromIntegral sz * percent / 100 + + withoutputv p a = do + outputv <- liftIO $ atomically newTQueue + let cleanup pid = do + atomically $ writeTQueue outputv Nothing + wait 
pid + bracket + (liftIO $ async $ sendoutput' p outputv) + (liftIO . cleanup) + (const $ a outputv) + + sendoutput' p outputv = + atomically (readTQueue outputv) >>= \case + Nothing -> return () + Just s -> do + liftIO $ hPutStrLn (stdinHandle p) s + liftIO $ hFlush (stdinHandle p) + sendoutput' p outputv + + sendresponse outputv s = do + fastDebug "Compute" ("> " ++ s) + liftIO $ atomically $ writeTQueue outputv (Just s) computationBehaviorChangeError :: ComputeProgram -> String -> OsPath -> Annex a computationBehaviorChangeError (ComputeProgram program) requestdesc p = diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index bba17b2300..f478c5d966 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,11 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] -* git-annex responds to each INPUT immediately, and flushes stdout. - This could cause problems if the program is sending several INPUT - first, before reading responses, as is documented it should do to allow - for parallel get of the input files. - * write a tip showing how to use this * Support parallel get of input files. The design allows for this,
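Seen from the compute program's side, the case this commit guards against looks roughly like the following. This is a minimal sketch, not code from git-annex: `request_inputs` is a hypothetical helper that writes every `INPUT` line before reading any response, which is the pattern the design documents to allow for parallel gets of input files.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): request every input
# first, then read one response line per request.
request_inputs() {
    for f in "$@"; do
        printf 'INPUT %s\n' "$f"
    done
    for f in "$@"; do
        # read fails if stdin was closed, which the interface uses to
        # signal an unavailable input; the program should then exit.
        IFS= read -r path || return 1
        # An empty response is what addcomputed --fast sends for INPUT.
        printf '%s => %s\n' "$f" "$path" >&2
    done
}
```

With responses queued and written by a separate thread on the git-annex side, a program written this way should not be able to stall git-annex by leaving responses unread.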
close off newline injection attacks against compute special remote protocol
diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 7ed6040ceb..0b27d135ba 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -56,6 +56,7 @@ import Utility.CopyFile import Types.Key import Backend import qualified Git +import qualified Utility.OsString as OS import qualified Utility.FileIO as F import qualified Utility.RawFilePath as R import qualified Utility.SimpleProtocol as Proto @@ -271,7 +272,9 @@ formatComputeState' mk st = renderQuery False $ concat parseComputeState :: Key -> B.ByteString -> Maybe ComputeState parseComputeState k b = let st = go emptycomputestate (parseQuery b) - in if st == emptycomputestate then Nothing else Just st + in if st == emptycomputestate || illegalComputeState st + then Nothing + else Just st where emptycomputestate = ComputeState { computeParams = mempty @@ -317,6 +320,20 @@ parseComputeState k b = _ -> Nothing in go c' rest +{- This is used to avoid ComputeStates that should never happen, + - but which could be injected into a repository by an attacker. -} +illegalComputeState :: ComputeState -> Bool +illegalComputeState st + -- The protocol is line-based, so filenames used in it cannot + -- contain newlines. + | any containsnewline (M.keys (computeInputs st)) = True + | any containsnewline (M.keys (computeOutputs st)) = True + -- Just in case. + | containsnewline (computeSubdir st) = True + | otherwise = False + where + containsnewline p = unsafeFromChar '\n' `OS.elem` p + {- A compute: url for a given output file of a computation. -} computeStateUrl :: Remote -> ComputeState -> OsPath -> URLString computeStateUrl r st p = diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index fab644f0e4..bba17b2300 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,10 +1,6 @@ This is the remainder of my todo list while I was building the compute special remote. 
--[[Joey]] -* prohibit using compute states where an input or output filename contains - a newline. The protocol doesn't allow this to happen usually, but an - attacker might try it in order to scramble the protocol. - * git-annex responds to each INPUT immediately, and flushes stdout. This could cause problems if the program is sending several INPUT first, before reading responses, as is documented it should do to allow
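The check above lives in git-annex itself (`illegalComputeState`). A compute program that wants the same defence in depth for filenames it handles could do something similar. The sketch below is only an illustration under that assumption; `has_newline` is a hypothetical name, not part of any git-annex interface.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): succeed when the
# argument contains a newline, which a line-based protocol cannot carry.
has_newline() {
    case "$1" in
        *"
"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A program might call `has_newline "$name" && exit 1` before using a filename in a protocol line.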
update
diff --git a/doc/todo/compute_special_remote_remaining_todos.mdwn b/doc/todo/compute_special_remote_remaining_todos.mdwn index c13a4e6425..fab644f0e4 100644 --- a/doc/todo/compute_special_remote_remaining_todos.mdwn +++ b/doc/todo/compute_special_remote_remaining_todos.mdwn @@ -1,6 +1,10 @@ This is the remainder of my todo list while I was building the compute special remote. --[[Joey]] +* prohibit using compute states where an input or output filename contains + a newline. The protocol doesn't allow this to happen usually, but an + attacker might try it in order to scramble the protocol. + * git-annex responds to each INPUT immediately, and flushes stdout. This could cause problems if the program is sending several INPUT first, before reading responses, as is documented it should do to allow @@ -12,12 +16,6 @@ compute special remote. --[[Joey]] but how much parallelism makes sense? Would it be possible to use the usual worker pool? -* Write some simple compute programs so we have something to start with. - - - convert between images eg jpeg to png - - run a command in a singularity container (that is one of the inputs) - - run a wasm binary (that is one of the inputs) - * compute on input files in submodules * annex.diskreserve can be violated if getting a file computes it but also
add INPUT-REQUIRED
Used by git-annex-compute-singularity to make addcomputed --fast work.
Also, simplified git-annex-compute-singularity; there is no need to hard
link the container into place. singularity does not care about the
extension of the container, so can just pass it the annex object file.
diff --git a/Command/AddComputed.hs b/Command/AddComputed.hs index 4774caae9b..50c0ee28f6 100644 --- a/Command/AddComputed.hs +++ b/Command/AddComputed.hs @@ -206,14 +206,14 @@ addComputed maddaction stagefiles r reproducibleconfig choosebackend destfile fa Just v -> isReproducible v Nothing -> Remote.Compute.computeReproducible result -getInputContent :: Bool -> OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath)) -getInputContent fast p = catKeyFile p >>= \case - Just inputkey -> getInputContent' fast inputkey filedesc +getInputContent :: Bool -> OsPath -> Bool -> Annex (Key, Maybe (Either Git.Sha OsPath)) +getInputContent fast p required = catKeyFile p >>= \case + Just inputkey -> getInputContent' fast inputkey required filedesc Nothing -> inRepo (Git.fileRef p) >>= \case Just fileref -> catObjectMetaData fileref >>= \case Just (sha, _, t) | t == Git.BlobObject -> - getInputContent' fast (gitShaKey sha) filedesc + getInputContent' fast (gitShaKey sha) required filedesc | otherwise -> badinput $ ", not a git " ++ decodeBS (Git.fmtObjectType t) Nothing -> notcheckedin @@ -223,9 +223,9 @@ getInputContent fast p = catKeyFile p >>= \case badinput s = giveup $ "The computation needs an input file " ++ s ++ ": " ++ fromOsPath p notcheckedin = badinput "that is not checked into the git repository" -getInputContent' :: Bool -> Key -> String -> Annex (Key, Maybe (Either Git.Sha OsPath)) -getInputContent' fast inputkey filedesc - | fast = return (inputkey, Nothing) +getInputContent' :: Bool -> Key -> Bool -> String -> Annex (Key, Maybe (Either Git.Sha OsPath)) +getInputContent' fast inputkey required filedesc + | fast && not required = return (inputkey, Nothing) | otherwise = case keyGitSha inputkey of Nothing -> ifM (inAnnex inputkey) ( do diff --git a/Command/Recompute.hs b/Command/Recompute.hs index 6b21ce8ee7..b85f5d449d 100644 --- a/Command/Recompute.hs +++ b/Command/Recompute.hs @@ -152,14 +152,14 @@ perform o r file origkey origstate = do check "not outputting" 
Remote.Compute.computeOutputs origstate state - getinputcontent program p + getinputcontent program p required | originalOption o = case M.lookup p (Remote.Compute.computeInputs origstate) of - Just inputkey -> getInputContent' False inputkey + Just inputkey -> getInputContent' False inputkey required (fromOsPath p ++ "(key " ++ serializeKey inputkey ++ ")") Nothing -> Remote.Compute.computationBehaviorChangeError program "requesting a new input file" p - | otherwise = getInputContent False p + | otherwise = getInputContent False p required destfile outputfile | Just outputfile == origfile = Just file diff --git a/Remote/Compute.hs b/Remote/Compute.hs index 7d21ddccdb..3adec4bc5b 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -201,17 +201,19 @@ programField = Accepted "program" data ProcessCommand = ProcessInput FilePath | ProcessOutput FilePath + | ProcessProgress PercentFloat | ProcessReproducible | ProcessSandbox - | ProcessProgress PercentFloat + | ProcessInputRequired FilePath deriving (Show, Eq) instance Proto.Receivable ProcessCommand where parseCommand "INPUT" = Proto.parse1 ProcessInput parseCommand "OUTPUT" = Proto.parse1 ProcessOutput + parseCommand "PROGRESS" = Proto.parse1 ProcessProgress parseCommand "REPRODUCIBLE" = Proto.parse0 ProcessReproducible parseCommand "SANDBOX" = Proto.parse0 ProcessSandbox - parseCommand "PROGRESS" = Proto.parse1 ProcessProgress + parseCommand "INPUT-REQUIRED" = Proto.parse1 ProcessInputRequired parseCommand _ = Proto.parseFail newtype PercentFloat = PercentFloat Float @@ -392,9 +394,10 @@ runComputeProgram :: ComputeProgram -> ComputeState -> ImmutableState - -> (OsPath -> Annex (Key, Maybe (Either Git.Sha OsPath))) - -- ^ get input file's content, or Nothing the input file's - -- content is not available + -> (OsPath -> Bool -> Annex (Key, Maybe (Either Git.Sha OsPath))) + -- ^ Get input file's content, or Nothing the input file's + -- content is not available. 
True is passed when the input content + -- is required even for addcomputed --fast. -> Maybe (Key, MeterUpdate) -- ^ update meter for this key -> (ComputeProgramResult -> OsPath -> NominalDiffTime -> Annex v) @@ -454,37 +457,8 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) liftIO $ hFlush (stdinHandle p) parseoutput p tmpdir subdir result meterfile l = case Proto.parseMessage l of - Just (ProcessInput f) -> do - let f' = toOsPath f - let knowninput = M.member f' - (computeInputs (computeState result)) - checksafefile tmpdir subdir f' "input" - checkimmutable knowninput "inputting" f' $ do - (k, inputcontent) <- getinputcontent f' - let mkrel a = Just <$> - (a >>= liftIO . relPathDirToFile subdir) - mp <- case inputcontent of - Nothing -> pure Nothing - Just (Right obj) - | computeSandbox result -> - mkrel $ populatesandbox obj tmpdir - | otherwise -> - mkrel $ pure obj - Just (Left gitsha) -> - mkrel $ populategitsha gitsha tmpdir - sendresponse p $ - maybe "" fromOsPath mp - let result' = result - { computeInputsUnavailable = - isNothing mp || computeInputsUnavailable result - } - return $ if immutablestate - then result' - else modresultstate result' $ \s -> s - { computeInputs = - M.insert f' k - (computeInputs s) - } + Just (ProcessInput f) -> handleinput f False p tmpdir subdir result + Just (ProcessInputRequired f) -> handleinput f True p tmpdir subdir result Just (ProcessOutput f) -> do let f' = toOsPath f checksafefile tmpdir subdir f' "output" @@ -525,6 +499,38 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" + handleinput f required p tmpdir subdir result = do + let f' = toOsPath f + let knowninput = M.member f' + (computeInputs (computeState result)) + checksafefile tmpdir subdir f' "input" + checkimmutable knowninput "inputting" f' $ do + (k, inputcontent) <- getinputcontent f' required + let mkrel a = 
Just <$> + (a >>= liftIO . relPathDirToFile subdir) + mp <- case inputcontent of + Nothing -> pure Nothing + Just (Right obj) + | computeSandbox result -> + mkrel $ populatesandbox obj tmpdir + | otherwise -> + mkrel $ pure obj + Just (Left gitsha) -> + mkrel $ populategitsha gitsha tmpdir + sendresponse p $ + maybe "" fromOsPath mp + let result' = result + { computeInputsUnavailable = + isNothing mp || computeInputsUnavailable result + } + return $ if immutablestate + then result' + else modresultstate result' $ \s -> s + { computeInputs = + M.insert f' k + (computeInputs s) + } + modresultstate result f = result { computeState = f (computeState result) } @@ -630,7 +636,7 @@ computeKey rs (ComputeProgram program) k _af dest meterupdate vc = (Just (k, p)) (postcompute keyfile) - getinputcontent state f = + getinputcontent state f _required = case M.lookup f (computeInputs state) of Just inputkey -> case keyGitSha inputkey of Nothing -> diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 56bb90f14f..f286a0b7cd 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -73,12 +73,13 @@ If an input file is not available, the program's stdin will be closed without a path being written to it. So when reading from stdin fails, the program should exit. -When `git-annex addcomputed --fast` is being used to add a computation -to the git-annex repository without actually performing it, the -response to eaach `INPUT` will be an empty line rather than the path to -an input file. In that case, the program should proceed with the rest of -its output to stdout (eg `OUTPUT` and `REPRODUCIBLE`), but should not -perform any computation. (Diff truncated)
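How a compute program might use the new message can be sketched as follows. This is an illustration only: `request_required_input` and `REQUIRED_PATH` are hypothetical names, and treating an empty response as a failure is an assumption of the sketch rather than documented behaviour.

```shell
#!/bin/sh
# Hypothetical helper (not part of git-annex): ask for an input whose
# content is needed even under addcomputed --fast, such as a container.
request_required_input() {
    printf 'INPUT-REQUIRED %s\n' "$1"
    # read fails if git-annex closed stdin instead of answering.
    IFS= read -r REQUIRED_PATH || return 1
    # Assumption: an empty path means the content cannot be used.
    [ -n "$REQUIRED_PATH" ] || return 1
}
```

Other inputs would still be requested with plain `INPUT`, so that `--fast` can skip fetching their content.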
Added a comment: just thinking out loud
diff --git a/doc/special_remotes/compute/comment_4_6e02f138330b13adcfa8fbbce494205e._comment b/doc/special_remotes/compute/comment_4_6e02f138330b13adcfa8fbbce494205e._comment new file mode 100644 index 0000000000..d2b1145a4d --- /dev/null +++ b/doc/special_remotes/compute/comment_4_6e02f138330b13adcfa8fbbce494205e._comment @@ -0,0 +1,26 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="just thinking out loud" + date="2025-03-11T15:15:15Z" + content=""" +> it was more flexible to have a more freeform command line, which the compute program parses + +agree. And there could be some generic \"helper\" (or a number of them) which would then provide desired CLI interfacing over arbitrary command, smth like (mimicing [datalad-run](https://docs.datalad.org/en/stable/generated/man/datalad-run.html) interface here): + +``` +git-annex addcomputed --to=runcmd -i foo.jpeg -o foo.gif +``` + +as long as we can pass options like that or after `--`, e.g. + +``` +git-annex addcomputed --to=runcmd -- -i foo.jpeg -o foo.gif -- convert {inputs} {outputs}` +``` + +which would then +- ensure no stdout from `convert` +- follow the *compute special remote interface* to let git-annex know what inputs/outputs were + + +"""]]
reorg and expand security section
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 52b676c04e..56bb90f14f 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -12,23 +12,50 @@ a command like one of these: git-annex addcomputed --to=myremote -- compress in out --level=9 git-annex addcomputed --to=myremote -- clip foo 2:01-3:00 combine with bar to baz +## security + +Security is very important here, because a user who enables a compute +special remote and runs `git pull` followed by `git-annex get` is running +the compute program with inputs under the control of anyone who has +commit access to the repository. + +The contents of input files should be assumed to be untrusted, and so +should the filenames of input and output files, as well as everything +else passed to the program in `ARGV` and the environment. + +The program should make sure that whatever user input is passed +to it can result in only safe and expected behavior. The program should +avoid exposing user input to the shell unprotected, or otherwise executing +it. (Except when the program is explicitly running user input in some form +of sandbox.) + +## interface + Whatever values the user passes to `git-annex addcomputed` are passed to the program in `ARGV`, followed by any values that the user provided to `git-annex initremote`. -For security, the program should avoid exposing user input to the shell -unprotected, or otherwise executing it. And when running a command, make -sure that whatever user input is passed to it can result in only safe and -expected behavior. - To simplify the program's option parsing, any value that the user provides that is in the form "foo=bar" will also result in an environment variable being set, eg `ANNEX_COMPUTE_passes=10` or `ANNEX_COMPUTE_--level=9`. The program is run in a temporary directory, which will be cleaned up after -it exits. 
Note that it may be run in a subdirectory of a temporary -directory. This is done when `git-annex addcomputed` was run in a subdirectory -of the git repository. +it exits. It may be run in a subdirectory of the temporary directory. This +is done when `git-annex addcomputed` was run in a subdirectory of the git +repository. + +Anything that the program outputs to stderr will be displayed to the user. +This stderr should be used for error messages, and possibly computation +output, but not for progress displays. + +If the program exits nonzero, nothing it computed will be stored in the +git-annex repository. + +## input files + +Before doing any computation, the program needs to communicate with +git-annex about what input files it needs, and what output files it will +generate. The content of any file in the repository can be an input to the computation. The program requests an input by writing a line to stdout: @@ -48,25 +75,26 @@ the program should exit. When `git-annex addcomputed --fast` is being used to add a computation to the git-annex repository without actually performing it, the -response to each "INPUT" will be an empty line rather than the path to +response to eaach `INPUT` will be an empty line rather than the path to an input file. In that case, the program should proceed with the rest of -its output to stdout (eg "OUTPUT" and "REPRODUCIBLE"), but should not +its output to stdout (eg `OUTPUT` and `REPRODUCIBLE`), but should not perform any computation. +## output files + For each output file that it will compute, the program should write a -line to stdout: +line to stdout, indicating the name of the file that will be added to the +git-annex repository by `git-annex compute`. OUTPUT file.jpeg -Then it can read a line from stdin. This will be a sanitized version of the -output filename. It's important to use that sanitized version to avoid path -traversal attacks, as well as problems like filenames that look like -dashed options. 
If there is a path traversal attack, the program's stdin will -be closed without a path being written to it. - -The filename of the output file is both the filename in the program's -temporary directory that it should write to, and also the filename that will -be added to the git-annex repository by `git-annex compute`. +Then it should read a line from stdin, which is the path, in the program's +temporary directory, where it should write the output file. Often this will +be the same filename, but it also may be a sanitized version. It's +important to use that sanitized version to avoid path traversal attacks, as +well as problems like filenames that look like dashed options. +If there is a path traversal attack, the program's stdin will be closed +without a path being written to it. The program must write a regular file to the output file. Symlinks or other special files will not be accepted as output files. @@ -78,30 +106,34 @@ to somewhere else and renaming it at the end. But, if the program seeks around and writes out of order, it should write to a file somewhere else and rename it at the end. -The program can also output lines to stdout to indicate its current -progress: +## other messages - PROGRESS 50% +As well as `INPUT` and `OUTPUT` described above, there are some other +messages that the program can output. All of these are optional. -The program can optionally also output a "REPRODUCIBLE" line. That -indicates that the results of its computations are expected to be -bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if -the `--reproducible` option is set. +* `PROGRESS 50%` + + To indicate its current progress while performing the computation, + the program can output lines like this. This is not needed if the program + streams output to an output file. 
-The program can also output a "SANDBOX" line, and then read a line from -stdin that will be the path to the directory it should sandbox to (which -corresponds to the top of the git repository, so may be above its working -directory). Any "INPUT" lines that come after "SANDBOX" will have input -files be provided via paths that are inside the sandbox directory. Usually -that is done by making hard links, but it will fall back to copying annexed -files if the filesystem does not support hard links. +* `REPRODUCIBLE` + + This indicates that the results of the computation are expected to be + bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if + the `--reproducible` option is set. -Anything that the program outputs to stderr will be displayed to the user. -This stderr should be used for error messages, and possibly computation -output, but not for progress displays. +* `SANDBOX` -If the program exits nonzero, nothing it computed will be stored in the -git-annex repository. + After outputting this line, the program can read a line from stdin + that will be the path to the directory it should sandbox to (which + corresponds to the top of the git repository, so may be above its working + directory). Any `INPUT` lines that come after `SANDBOX` will have input + files be provided via paths that are inside the sandbox directory. Usually + that is done by making hard links, but it will fall back to copying annexed + files if the filesystem does not support hard links. + +## example An example `git-annex-compute-foo` shell script follows:
Added a comment
diff --git a/doc/special_remotes/compute/comment_3_d563e79fa8cb539bdf26a281824ad2ea._comment b/doc/special_remotes/compute/comment_3_d563e79fa8cb539bdf26a281824ad2ea._comment new file mode 100644 index 0000000000..985d62ebc6 --- /dev/null +++ b/doc/special_remotes/compute/comment_3_d563e79fa8cb539bdf26a281824ad2ea._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="comment 3" + date="2025-03-11T15:09:20Z" + content=""" +Thank you for the clarification -- I have missed that there is an \"entire\" [compute special remote interface](https://git-annex.branchable.com/design/compute_special_remote_interface/). **Cool!** +"""]]
expand
diff --git a/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn b/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn index 7613667cdc..48c9f1e052 100644 --- a/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn +++ b/doc/special_remotes/compute/git-annex-compute-singularity-examples.mdwn @@ -68,3 +68,13 @@ documentation for details about these options. * `--no-compat` * `--fakeroot` + +For example, passing the --fakeroot option: + + git-annex addcomputed --to=singularity -- --fakeroot debian.sif foo bar -- baz -- sh -c 'cat foo bar > baz' + +Since singularity happens to also accept `--fakeroot=1` and +`--no-compat=1`, it's also possible to set these options by +default in initremote: + + git-annex initremote foo type=compute program=git-annex-compute-singularity passthrough=imageconvert.sif -- --fakeroot=1
response
diff --git a/doc/special_remotes/compute/comment_2_53493ef08c5cef81c6b6ae64afc47c07._comment b/doc/special_remotes/compute/comment_2_53493ef08c5cef81c6b6ae64afc47c07._comment new file mode 100644 index 0000000000..ef6def3f4e --- /dev/null +++ b/doc/special_remotes/compute/comment_2_53493ef08c5cef81c6b6ae64afc47c07._comment @@ -0,0 +1,13 @@ +[[!comment format=mdwn + username="joey" + subject="""Re: Any way to annotate what are input files?""" + date="2025-03-10T20:42:26Z" + content=""" +git-annex does know what both the input and the output files are. +It learns this by running the compute program and seeing what INPUT and OUTPUT +lines it emits. + +I considered having some `--input=` option, but decided that it was more +flexible to have a more freeform command line, which the compute program +parses. +"""]]
added git-annex-compute-singularity
And implemented SANDBOX, which it needs.
diff --git a/COPYRIGHT b/COPYRIGHT index 54a250abae..3ca3debd09 100644 --- a/COPYRIGHT +++ b/COPYRIGHT @@ -14,7 +14,7 @@ Files: doc/special_remotes/external/* Copyright: © 2013 Joey Hess <id@joeyh.name> License: GPL-3+ -Files: doc/special_remotes/compute/git-annex-compute-imageconvert doc/special_remotes/compute/git-annex-compute-wasmedge +Files: doc/special_remotes/compute/git-annex-compute-imageconvert doc/special_remotes/compute/git-annex-compute-wasmedge doc/special_remotes/compute/git-annex-compute-singularity Copyright: © 2025 Joey Hess <id@joeyh.name> License: GPL-3+ diff --git a/Remote/Compute.hs b/Remote/Compute.hs index be8429435c..7d21ddccdb 100644 --- a/Remote/Compute.hs +++ b/Remote/Compute.hs @@ -52,6 +52,7 @@ import Utility.Env import Utility.Tmp.Dir import Utility.Url import Utility.MonotonicClock +import Utility.CopyFile import Types.Key import Backend import qualified Git @@ -201,6 +202,7 @@ data ProcessCommand = ProcessInput FilePath | ProcessOutput FilePath | ProcessReproducible + | ProcessSandbox | ProcessProgress PercentFloat deriving (Show, Eq) @@ -208,6 +210,7 @@ instance Proto.Receivable ProcessCommand where parseCommand "INPUT" = Proto.parse1 ProcessInput parseCommand "OUTPUT" = Proto.parse1 ProcessOutput parseCommand "REPRODUCIBLE" = Proto.parse0 ProcessReproducible + parseCommand "SANDBOX" = Proto.parse0 ProcessSandbox parseCommand "PROGRESS" = Proto.parse1 ProcessProgress parseCommand _ = Proto.parseFail @@ -382,6 +385,7 @@ data ComputeProgramResult = ComputeProgramResult { computeState :: ComputeState , computeInputsUnavailable :: Bool , computeReproducible :: Bool + , computeSandbox :: Bool } runComputeProgram @@ -410,7 +414,7 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) } showOutput starttime <- liftIO currentMonotonicTimestamp - let startresult = ComputeProgramResult state False False + let startresult = ComputeProgramResult state False False False result <- withmeterfile $ \meterfile -> 
bracket (liftIO $ createProcess pr) (liftIO . cleanupProcess) @@ -457,13 +461,17 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) checksafefile tmpdir subdir f' "input" checkimmutable knowninput "inputting" f' $ do (k, inputcontent) <- getinputcontent f' + let mkrel a = Just <$> + (a >>= liftIO . relPathDirToFile subdir) mp <- case inputcontent of Nothing -> pure Nothing - Just (Right f'') -> liftIO $ - Just <$> relPathDirToFile subdir f'' - Just (Left gitsha) -> - Just <$> (liftIO . relPathDirToFile subdir - =<< populategitsha gitsha tmpdir) + Just (Right obj) + | computeSandbox result -> + mkrel $ populatesandbox obj tmpdir + | otherwise -> + mkrel $ pure obj + Just (Left gitsha) -> + mkrel $ populategitsha gitsha tmpdir sendresponse p $ maybe "" fromOsPath mp let result' = result @@ -506,6 +514,14 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) return result Just ProcessReproducible -> return $ result { computeReproducible = True } + Just ProcessSandbox -> do + sandboxpath <- liftIO $ fromOsPath <$> + relPathDirToFile subdir tmpdir + sendresponse p $ + if null sandboxpath + then "." + else sandboxpath + return $ result { computeSandbox = True } Nothing -> giveup $ program ++ " output an unparseable line: \"" ++ l ++ "\"" @@ -546,12 +562,23 @@ runComputeProgram (ComputeProgram program) state (ImmutableState immutablestate) -- to the program as a parameter, which could parse it as a dashed -- option or other special parameter. populategitsha gitsha tmpdir = do - let f = tmpdir </> literalOsPath ".git" </> literalOsPath "objects" + let f = tmpdir </> literalOsPath ".git" + </> literalOsPath "objects" </> toOsPath (Git.fromRef' gitsha) liftIO $ createDirectoryIfMissing True $ takeDirectory f liftIO . 
F.writeFile f =<< catObject gitsha return f + populatesandbox annexobj tmpdir = do + let f = tmpdir </> literalOsPath ".git" + </> literalOsPath "annex" + </> literalOsPath "objects" + </> takeFileName annexobj + liftIO $ createDirectoryIfMissing True $ takeDirectory f + liftIO $ unlessM (createLinkOrCopy annexobj f) $ + giveup "Unable to populate compute sandbox directory" + return f + withmeterfile a = case meterkey of Nothing -> a (const noop) Just (_, progress) -> do diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index 0ab7c45df4..52b676c04e 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -88,6 +88,14 @@ indicates that the results of its computations are expected to be bit-for-bit reproducible. That makes `git-annex addcomputed` behave as if the `--reproducible` option is set. +The program can also output a "SANDBOX" line, and then read a line from +stdin that will be the path to the directory it should sandbox to (which +corresponds to the top of the git repository, so may be above its working +directory). Any "INPUT" lines that come after "SANDBOX" will have input +files be provided via paths that are inside the sandbox directory. Usually +that is done by making hard links, but it will fall back to copying annexed +files if the filesystem does not support hard links. + Anything that the program outputs to stderr will be displayed to the user. This stderr should be used for error messages, and possibly computation output, but not for progress displays. diff --git a/doc/special_remotes/compute.mdwn b/doc/special_remotes/compute.mdwn index 33b1253978..52d650068f 100644 --- a/doc/special_remotes/compute.mdwn +++ b/doc/special_remotes/compute.mdwn @@ -39,6 +39,13 @@ List it here with an example! 
`git-annex addcomputed --to=imageconvert foo.jpeg foo.gif` +* [[compute/git-annex-compute-singularity]] + Uses [Singularity](https://sylabs.io/) to run a container, which is + checked into the git-annex repository, to compute other files in the + repository. Among other things, this can run other compute programs + inside a singularity container. + [[Examples here|compute/git-annex-compute-singularity-examples]] + * [[compute/git-annex-compute-wasmedge]] Uses [WasmEdge](https://WasmEdge.org/) to run WASM programs that are checked into the git-annex repository, to compute other files in the diff --git a/doc/special_remotes/compute/git-annex-compute-singularity b/doc/special_remotes/compute/git-annex-compute-singularity new file mode 100755 index 0000000000..d296e0162d --- /dev/null +++ b/doc/special_remotes/compute/git-annex-compute-singularity @@ -0,0 +1,94 @@ +#!/bin/bash +# git-annex compute remote program that runs singularity containers +# from the git-annex repository. +# +# Copyright 2025 Joey Hess; licenced under the GNU GPL version 3 or higher. +set -e + +if [ -z "$1" ]; then + echo "Usage: container [singularity options] [inputs] -- [outputs] -- [command params]" >&2 + exit 1 +fi + +nocompat_opt="" +fakeroot_opt="" +container="" +binddir="`pwd`" +rundir="`pwd`" + +run_singularity () { + # Network access is disabled (with --net --network=none), to + # prevent an untrusted singularity image from phoning home and/or + # attacking the local network. + # + # --oci is used to get process namespacing + singularity run --net --network=none --oci \ + --bind="$binddir" --pwd="$rundir" \ + $nocompat_opt $fakeroot_opt \ + "$container" "$@" +} + +# Avoid any security problems with harmful terminal escape sequences. +strip_escape () { + sed 's/[\x1B]//g' +} + +if [ -z "$ANNEX_COMPUTE_passthrough" ]; then (Diff truncated)
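The SANDBOX exchange described in this commit's design-doc change could look roughly like the following from the program's side. This is a sketch under stated assumptions, not the truncated script above. The helper names `request_sandbox` and `request_input` and the variables `sandboxdir` and `inputpath` are hypothetical. The reply handling follows the `Remote/Compute.hs` change, where the sandbox reply is a relative path (or `.`) and an unavailable input appears to produce an empty reply.

```shell
# Hypothetical helper: emit SANDBOX, then read the reply, which is the
# path of the directory to sandbox to. It corresponds to the top of the
# git repository, so it may be "." or a path above the working directory.
request_sandbox () {
	echo "SANDBOX"
	read -r sandboxdir
}

# Hypothetical helper: request an input file and keep the replied path
# in $inputpath. Once SANDBOX has been sent, that path is expected to
# lie inside the sandbox directory. An empty reply is taken to mean the
# content is not available.
request_input () {
	echo "INPUT $1"
	read -r inputpath
}
```

A program would call `request_sandbox` before any `request_input`, since only INPUT lines that come after SANDBOX get paths inside the sandbox directory.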
document output files must be regular files
diff --git a/doc/design/compute_special_remote_interface.mdwn b/doc/design/compute_special_remote_interface.mdwn index e6fad0f2b1..0ab7c45df4 100644 --- a/doc/design/compute_special_remote_interface.mdwn +++ b/doc/design/compute_special_remote_interface.mdwn @@ -65,8 +65,11 @@ dashed options. If there is a path traversal attack, the program's stdin will be closed without a path being written to it. The filename of the output file is both the filename in the program's -temporary directory, and also the filename that will be added to the -git-annex repository by `git-annex compute`. +temporary directory that it should write to, and also the filename that will +be added to the git-annex repository by `git-annex compute`. + +The program must write a regular file to the output file. Symlinks +or other special files will not be accepted as output files. If git-annex sees that an output file is growing, it will use its file size when displaying progress to the user. So if possible, the program should
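A compute program could guard against the regular-file rule itself before exiting. The check below is only an illustrative sketch: `is_regular_output` is a hypothetical helper, and it says nothing about how git-annex performs its own validation.

```shell
# Hypothetical pre-exit check: succeed only if the output path is a
# regular file. "-f" alone follows symlinks, so "-L" is tested as well
# to reject a symlink that points at a regular file.
is_regular_output () {
	[ -f "$1" ] && [ ! -L "$1" ]
}
```

A program might run this on each output file and print an error to stderr if it fails, rather than leaving the rejection to git-annex.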
make usage an error
diff --git a/doc/special_remotes/compute/git-annex-compute-wasmedge b/doc/special_remotes/compute/git-annex-compute-wasmedge index b93adb9370..51d7f3d40d 100755 --- a/doc/special_remotes/compute/git-annex-compute-wasmedge +++ b/doc/special_remotes/compute/git-annex-compute-wasmedge @@ -6,8 +6,9 @@ set -e if [ -z "$1" ]; then - echo "Usage: file.wasm [inputs] -- [outputs] -- [options]" - echo "Example: concat.wasm foo bar -- baz --" + echo "Usage: file.wasm [inputs] -- [outputs] -- [options]" >&2 + echo "Example: concat.wasm foo bar -- baz --" >&2 + exit 1 fi stage=1