Recent changes to this wiki:

removed
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_35_a277d90983bebba10ed1ae3a51fbafd2._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_35_a277d90983bebba10ed1ae3a51fbafd2._comment
deleted file mode 100644
index e38214c62..000000000
--- a/doc/bugs/significant_performance_regression_impacting_datal/comment_35_a277d90983bebba10ed1ae3a51fbafd2._comment
+++ /dev/null
@@ -1,8 +0,0 @@
-[[!comment format=mdwn
- username="yarikoptic"
- avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
- subject="comment 35"
- date="2021-06-15T15:38:28Z"
- content="""
-FWIW: Yes! ;-)
-"""]]

drop, move, mirror: when two files have the same content, honor the max numcopies and requiredcopies
Eg, before with a .gitattributes like:
*.2 annex.numcopies=2
*.1 annex.numcopies=1
And foo.1 and foo.2 having the same content and key, git-annex drop foo.1 foo.2
would succeed, leaving just 1 copy, despite foo.2 needing 2 copies.
It dropped foo.1 first and then skipped foo.2 since its content was gone.
Now that the keys database includes locked files, this longstanding wart
can be fixed.
Sponsored-by: Noam Kremen on Patreon
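A minimal sketch (standalone, with simplified stand-in types) of the rule
this commit implements: when several files share a key, the effective
numcopies and mincopies for a drop is the per-file maximum, falling back
to the repository-wide defaults when the key has no associated files.

    -- Simplified stand-ins for git-annex's NumCopies and MinCopies types.
    type NumCopies = Int
    type MinCopies = Int

    -- The safest setting for a key is the maximum over all files sharing
    -- it; with no associated files, use the global defaults.
    safestCopies :: (NumCopies, MinCopies) -> [(NumCopies, MinCopies)] -> (NumCopies, MinCopies)
    safestCopies defaults [] = defaults
    safestCopies _ perfile = (maximum (map fst perfile), maximum (map snd perfile))

    -- With *.2 at annex.numcopies=2 and *.1 at annex.numcopies=1 as in the
    -- example above, foo.1 and foo.2 sharing one key yields numcopies 2,
    -- so dropping either file must leave 2 copies.
    main :: IO ()
    main = print (safestCopies (1, 1) [(1, 1), (2, 1)])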
diff --git a/Annex/Drop.hs b/Annex/Drop.hs
index dc6b7b64e..79d6d63ff 100644
--- a/Annex/Drop.hs
+++ b/Annex/Drop.hs
@@ -60,7 +60,7 @@ handleDropsFrom locs rs reason fromhere key afile si preverified runner = do
   where
 	getcopies fs = do
 		(untrusted, have) <- trustPartition UnTrusted locs
-		(numcopies, mincopies) <- getSafestNumMinCopies' key fs
+		(numcopies, mincopies) <- getSafestNumMinCopies' afile key fs
 		return (length have, numcopies, mincopies, S.fromList untrusted)
 
 	{- Check that we have enough copies still to drop the content.
diff --git a/Annex/NumCopies.hs b/Annex/NumCopies.hs
index bbdd826e8..a91202893 100644
--- a/Annex/NumCopies.hs
+++ b/Annex/NumCopies.hs
@@ -11,7 +11,6 @@ module Annex.NumCopies (
 	module Types.NumCopies,
 	module Logs.NumCopies,
 	getFileNumMinCopies,
-	getAssociatedFileNumMinCopies,
 	getSafestNumMinCopies,
 	getSafestNumMinCopies',
 	getGlobalFileNumCopies,
@@ -123,33 +122,21 @@ getFileNumMinCopies f = do
 					<$> fallbacknum
 					<*> fallbackmin
 
-{- NumCopies and MinCopies value for an associated file, or the default
- - when there is no associated file.
- -
- - This does not include other associated files using the same key.
- -}
-getAssociatedFileNumMinCopies :: AssociatedFile -> Annex (NumCopies, MinCopies)
-getAssociatedFileNumMinCopies (AssociatedFile (Just file)) =
-	getFileNumMinCopies file
-getAssociatedFileNumMinCopies (AssociatedFile Nothing) = (,)
-	<$> getNumCopies
-	<*> getMinCopies
-
 {- Gets the highest NumCopies and MinCopies value for all files
  - associated with a key. Provide any known associated file;
  - the rest are looked up from the database.
  -
- - Using this when dropping avoids dropping one file that
- - has a smaller value violating the value set for another file
- - that uses the same content.
+ - Using this when dropping, rather than getFileNumMinCopies,
+ - avoids dropping one file that has a smaller value violating
+ - the value set for another file that uses the same content.
  -}
 getSafestNumMinCopies :: AssociatedFile -> Key -> Annex (NumCopies, MinCopies)
 getSafestNumMinCopies afile k =
 	Database.Keys.getAssociatedFilesIncluding afile k
-		>>= getSafestNumMinCopies' k
+		>>= getSafestNumMinCopies' afile k
 
-getSafestNumMinCopies' :: Key -> [RawFilePath] -> Annex (NumCopies, MinCopies)
-getSafestNumMinCopies' k fs = do
+getSafestNumMinCopies' :: AssociatedFile -> Key -> [RawFilePath] -> Annex (NumCopies, MinCopies)
+getSafestNumMinCopies' afile k fs = do
 	l <- mapM getFileNumMinCopies fs
 	let l' = zip l fs
 	(,)
@@ -158,9 +145,14 @@ getSafestNumMinCopies' k fs = do
   where
 	-- Some associated files in the keys database may no longer
 	-- correspond to files in the repository.
-	stillassociated f = catKeyFile f >>= \case
-		Just k' | k' == k -> return True
-		_ -> return False
+	-- (But the AssociatedFile passed to this is known to be
+	-- an associated file, which may not be in the keys database
+	-- yet, so checking it is skipped.)
+	stillassociated f
+		| AssociatedFile (Just f) == afile = return True
+		| otherwise = catKeyFile f >>= \case
+			Just k' | k' == k -> return True
+			_ -> return False
 	
 	-- Avoid calling stillassociated on every file; just make sure
 	-- that the one with the highest value is still associated.
diff --git a/CHANGELOG b/CHANGELOG
index 3cf6b4251..9e6b2358a 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -4,7 +4,7 @@ git-annex (8.20210429) UNRELEASED; urgency=medium
   * When two files have the same content, and a required content expression
     matches one but not the other, dropping the latter file will fail as it
     would also remove the content of the required file.
-  * drop, move, import: When two files have the same content, and
+  * drop, move, mirror: When two files have the same content, and
     different numcopies or requiredcopies values, use the higher value.
   * drop --auto: When two files have the same content, and a preferred content
     expression matches one but not the other, do not drop the content.
diff --git a/Command/Drop.hs b/Command/Drop.hs
index f30b6f4c0..890b9e004 100644
--- a/Command/Drop.hs
+++ b/Command/Drop.hs
@@ -227,7 +227,7 @@ checkRequiredContent (PreferredContentChecked False) u k afile =
  - copies on other semitrusted repositories. -}
 checkDropAuto :: Bool -> Maybe Remote -> AssociatedFile -> Key -> (NumCopies -> MinCopies -> CommandStart) -> CommandStart
 checkDropAuto automode mremote afile key a =
-	go =<< getAssociatedFileNumMinCopies afile
+	go =<< getSafestNumMinCopies afile key
   where
 	go (numcopies, mincopies)
 		| automode = do
diff --git a/Command/Mirror.hs b/Command/Mirror.hs
index 4fe8c31f4..2e31efa65 100644
--- a/Command/Mirror.hs
+++ b/Command/Mirror.hs
@@ -68,7 +68,7 @@ startKey o afile (si, key, ai) = case fromToOptions o of
 	ToRemote r -> checkFailedTransferDirection ai Upload $ ifM (inAnnex key)
 		( Command.Move.toStart Command.Move.RemoveNever afile key ai si =<< getParsed r
 		, do
-			(numcopies, mincopies) <- getAssociatedFileNumMinCopies afile
+			(numcopies, mincopies) <- getSafestNumMinCopies afile key
 			Command.Drop.startRemote pcc afile ai si numcopies mincopies key =<< getParsed r
 		)
 	FromRemote r -> checkFailedTransferDirection ai Download $ do
@@ -81,7 +81,7 @@ startKey o afile (si, key, ai) = case fromToOptions o of
 				)
 			Right False -> ifM (inAnnex key)
 				( do
-					(numcopies, mincopies) <- getAssociatedFileNumMinCopies afile
+					(numcopies, mincopies) <- getSafestNumMinCopies afile key
 					Command.Drop.startLocal pcc afile ai si numcopies mincopies key []
 				, stop
 				)
diff --git a/Command/Move.hs b/Command/Move.hs
index 6d2cc50c3..ab6cd0aeb 100644
--- a/Command/Move.hs
+++ b/Command/Move.hs
@@ -166,7 +166,7 @@ toPerform dest removewhen key afile fastcheck isthere = do
 			willDropMakeItWorse srcuuid destuuid deststartedwithcopy key afile >>= \case
 				DropAllowed -> drophere setpresentremote contentlock "moved"
 				DropCheckNumCopies -> do
-					(numcopies, mincopies) <- getAssociatedFileNumMinCopies afile
+					(numcopies, mincopies) <- getSafestNumMinCopies afile key
 					(tocheck, verified) <- verifiableCopies key [srcuuid]
 					verifyEnoughCopiesToDrop "" key (Just contentlock)
 						 numcopies mincopies [srcuuid] verified
@@ -245,7 +245,7 @@ fromPerform src removewhen key afile = do
 		willDropMakeItWorse srcuuid destuuid deststartedwithcopy key afile >>= \case
 			DropAllowed -> dropremote "moved"
 			DropCheckNumCopies -> do
-				(numcopies, mincopies) <- getAssociatedFileNumMinCopies afile
+				(numcopies, mincopies) <- getSafestNumMinCopies afile key
 				(tocheck, verified) <- verifiableCopies key [Remote.uuid src]
 				verifyEnoughCopiesToDrop "" key Nothing numcopies mincopies [Remote.uuid src] verified
 					tocheck (dropremote . showproof) faileddropremote
diff --git a/doc/todo/numcopies_check_other_files_using_same_key.mdwn b/doc/todo/numcopies_check_other_files_using_same_key.mdwn
index 2d8782a0b..64d8070a4 100644
--- a/doc/todo/numcopies_check_other_files_using_same_key.mdwn
+++ b/doc/todo/numcopies_check_other_files_using_same_key.mdwn
@@ -21,4 +21,6 @@ do say that it bypasses checking .gitattributes numcopies.
 > files. With the recent change to also track
 > associated files for locked files, they also handle it for those.
 > 
-> But, git-annex drop/move/import don't yet.
+> But, git-annex drop/move/mirror don't yet.
+> 
+> > [[fixed|done]] (did not change --all behavior) --[[Joey]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_35_a277d90983bebba10ed1ae3a51fbafd2._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_35_a277d90983bebba10ed1ae3a51fbafd2._comment
new file mode 100644
index 000000000..e38214c62
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_35_a277d90983bebba10ed1ae3a51fbafd2._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 35"
+ date="2021-06-15T15:38:28Z"
+ content="""
+FWIW: Yes! ;-)
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_34_86d07025980e0b50d1a52f73c049944a._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_34_86d07025980e0b50d1a52f73c049944a._comment
new file mode 100644
index 000000000..8752b4c5c
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_34_86d07025980e0b50d1a52f73c049944a._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 34"
+ date="2021-06-15T15:38:16Z"
+ content="""
+FWIW: Yes! ;-)
+"""]]

verify associated files when checking numcopies
Most of this is just refactoring. But, handleDropsFrom
did not verify that associated files from the keys db were still
accurate, and has now been fixed to do so.
A minor improvement to this would be to avoid calling catKeyFile
twice on the same file, when getting the numcopies and mincopies value,
in the common case where the same file has the highest value for both.
But, it avoids checking every associated file, so it will scale well to
lots of dups already.
Sponsored-by: Kevin Mueller on Patreon
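A minimal sketch (hypothetical names; the association check is passed in,
standing for the catKeyFile lookup) of the scaling trick described above:
only files tied for the current maximum value are verified, and only if
all of those prove stale does it retry with the rest, so with lots of
dups most associated files are never checked at all.

    import Data.List (partition)

    -- Short-circuiting "any" over a monadic predicate.
    anyM :: Monad m => (a -> m Bool) -> [a] -> m Bool
    anyM _ [] = return False
    anyM p (x:xs) = p x >>= \ok -> if ok then return True else anyM p xs

    -- Highest value among (value, file) pairs whose file still passes the
    -- (possibly expensive) association check; the fallback is used when
    -- none does. Only files carrying the current maximum are checked.
    findMax :: Monad m => (f -> m Bool) -> m Int -> [(Int, f)] -> m Int
    findMax _ fallback [] = fallback
    findMax stillAssociated fallback l = do
        let n = maximum (map fst l)
        let (maxls, rest) = partition ((== n) . fst) l
        ok <- anyM stillAssociated (map snd maxls)
        if ok then return n else findMax stillAssociated fallback rest

    main :: IO ()
    main = findMax (\f -> return (f == "foo.2")) (return 1)
        [(2, "foo.2"), (1, "foo.1")] >>= print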
diff --git a/Annex/Drop.hs b/Annex/Drop.hs
index da7dad2d1..dc6b7b64e 100644
--- a/Annex/Drop.hs
+++ b/Annex/Drop.hs
@@ -52,11 +52,7 @@ type Reason = String
 handleDropsFrom :: [UUID] -> [Remote] -> Reason -> Bool -> Key -> AssociatedFile -> SeekInput -> [VerifiedCopy] -> (CommandStart -> CommandCleanup) -> Annex ()
 handleDropsFrom locs rs reason fromhere key afile si preverified runner = do
 	g <- Annex.gitRepo
-	l <- map (`fromTopFilePath` g)
-		<$> Database.Keys.getAssociatedFiles key
-	let fs = case afile of
-		AssociatedFile (Just f) -> f : filter (/= f) l
-		AssociatedFile Nothing -> l
+	fs <- Database.Keys.getAssociatedFilesIncluding afile key
 	n <- getcopies fs
 	void $ if fromhere && checkcopies n Nothing
 		then go fs rs n >>= dropl fs
@@ -64,11 +60,7 @@ handleDropsFrom locs rs reason fromhere key afile si preverified runner = do
   where
 	getcopies fs = do
 		(untrusted, have) <- trustPartition UnTrusted locs
-		(numcopies, mincopies) <- if null fs
-			then (,) <$> getNumCopies <*> getMinCopies
-			else do
-				l <- mapM getFileNumMinCopies fs
-				return (maximum $ map fst l, maximum $ map snd l)
+		(numcopies, mincopies) <- getSafestNumMinCopies' key fs
 		return (length have, numcopies, mincopies, S.fromList untrusted)
 
 	{- Check that we have enough copies still to drop the content.
diff --git a/Annex/NumCopies.hs b/Annex/NumCopies.hs
index 0fc2d191a..bbdd826e8 100644
--- a/Annex/NumCopies.hs
+++ b/Annex/NumCopies.hs
@@ -12,6 +12,8 @@ module Annex.NumCopies (
 	module Logs.NumCopies,
 	getFileNumMinCopies,
 	getAssociatedFileNumMinCopies,
+	getSafestNumMinCopies,
+	getSafestNumMinCopies',
 	getGlobalFileNumCopies,
 	getNumCopies,
 	getMinCopies,
@@ -34,6 +36,8 @@ import qualified Remote
 import qualified Types.Remote as Remote
 import Annex.Content
 import Annex.UUID
+import Annex.CatFile
+import qualified Database.Keys
 
 import Control.Exception
 import qualified Control.Monad.Catch as M
@@ -119,6 +123,11 @@ getFileNumMinCopies f = do
 					<$> fallbacknum
 					<*> fallbackmin
 
+{- NumCopies and MinCopies value for an associated file, or the default
+ - when there is no associated file.
+ -
+ - This does not include other associated files using the same key.
+ -}
 getAssociatedFileNumMinCopies :: AssociatedFile -> Annex (NumCopies, MinCopies)
 getAssociatedFileNumMinCopies (AssociatedFile (Just file)) =
 	getFileNumMinCopies file
@@ -126,6 +135,44 @@ getAssociatedFileNumMinCopies (AssociatedFile Nothing) = (,)
 	<$> getNumCopies
 	<*> getMinCopies
 
+{- Gets the highest NumCopies and MinCopies value for all files
+ - associated with a key. Provide any known associated file;
+ - the rest are looked up from the database.
+ -
+ - Using this when dropping avoids dropping one file that
+ - has a smaller value violating the value set for another file
+ - that uses the same content.
+ -}
+getSafestNumMinCopies :: AssociatedFile -> Key -> Annex (NumCopies, MinCopies)
+getSafestNumMinCopies afile k =
+	Database.Keys.getAssociatedFilesIncluding afile k
+		>>= getSafestNumMinCopies' k
+
+getSafestNumMinCopies' :: Key -> [RawFilePath] -> Annex (NumCopies, MinCopies)
+getSafestNumMinCopies' k fs = do
+	l <- mapM getFileNumMinCopies fs
+	let l' = zip l fs
+	(,)
+		<$> findmax fst l' getNumCopies
+		<*> findmax snd l' getMinCopies
+  where
+	-- Some associated files in the keys database may no longer
+	-- correspond to files in the repository.
+	stillassociated f = catKeyFile f >>= \case
+		Just k' | k' == k -> return True
+		_ -> return False
+	
+	-- Avoid calling stillassociated on every file; just make sure
+	-- that the one with the highest value is still associated.
+	findmax _ [] fallback = fallback
+	findmax getv l fallback = do
+		let n = maximum (map (getv . fst) l)
+		let (maxls, l') = partition (\(x, _) -> getv x == n) l
+		ifM (anyM stillassociated (map snd maxls))
+			( return n
+			, findmax getv l' fallback
+			)
+
 {- This is the globally visible numcopies value for a file. So it does
  - not include local configuration in the git config or command line
  - options. -}
diff --git a/CHANGELOG b/CHANGELOG
index 849503469..3cf6b4251 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -4,6 +4,8 @@ git-annex (8.20210429) UNRELEASED; urgency=medium
   * When two files have the same content, and a required content expression
     matches one but not the other, dropping the latter file will fail as it
     would also remove the content of the required file.
+  * drop, move, import: When two files have the same content, and
+    different numcopies or requiredcopies values, use the higher value.
   * drop --auto: When two files have the same content, and a preferred content
     expression matches one but not the other, do not drop the content.
   * sync --content, assistant: When two unlocked files have the same
diff --git a/Database/Keys.hs b/Database/Keys.hs
index 6cbfecda1..f79716c4e 100644
--- a/Database/Keys.hs
+++ b/Database/Keys.hs
@@ -14,6 +14,7 @@ module Database.Keys (
 	closeDb,
 	addAssociatedFile,
 	getAssociatedFiles,
+	getAssociatedFilesIncluding,
 	getAssociatedKey,
 	removeAssociatedFile,
 	storeInodeCaches,
@@ -155,6 +156,15 @@ addAssociatedFile k f = runWriterIO $ SQL.addAssociatedFile k f
 getAssociatedFiles :: Key -> Annex [TopFilePath]
 getAssociatedFiles = runReaderIO . SQL.getAssociatedFiles
 
+{- Include a known associated file along with any recorded in the database. -}
+getAssociatedFilesIncluding :: AssociatedFile -> Key -> Annex [RawFilePath]
+getAssociatedFilesIncluding afile k = do
+	g <- Annex.gitRepo
+	l <- map (`fromTopFilePath` g) <$> getAssociatedFiles k
+	return $ case afile of
+		AssociatedFile (Just f) -> f : filter (/= f) l
+		AssociatedFile Nothing -> l
+
 {- Gets any keys that are on record as having a particular associated file.
  - (Should be one or none but the database doesn't enforce that.) -}
 getAssociatedKey :: TopFilePath -> Annex [Key]
diff --git a/doc/todo/numcopies_check_other_files_using_same_key.mdwn b/doc/todo/numcopies_check_other_files_using_same_key.mdwn
index 19b365b96..2d8782a0b 100644
--- a/doc/todo/numcopies_check_other_files_using_same_key.mdwn
+++ b/doc/todo/numcopies_check_other_files_using_same_key.mdwn
@@ -15,3 +15,10 @@ differently than in a non-bare repo. (Also if this is done, the preferred
 content checking should also behave the same way.) The docs for --all 
 do say that it bypasses checking .gitattributes numcopies.
 --[[Joey]]
+
+> Note that the assistant and git-annex sync already check numcopies
+> for all known associated files, so already handled this for unlocked
+> files. With the recent change to also track
+> associated files for locked files, they also handle it for those.
+> 
+> But, git-annex drop/move/import don't yet.

close
diff --git a/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn
index b5584f13a..f9aac29eb 100644
--- a/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn
+++ b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn
@@ -29,3 +29,6 @@ Actually -- it made me think: is that `scanning` branch specific? then what woul
 
 [[!meta author=yoh]]
 [[!tag projects/datalad]]
+
+> I feel this is unnecessary complexity and I've optimised the scans quite a
+> lot in the meantime, so [[wontfix|done]] --[[Joey]]

response
diff --git a/doc/forum/Data_loss_if_symlink_target_edited/comment_1_599b96a2299f3c998f0317f4321eeff4._comment b/doc/forum/Data_loss_if_symlink_target_edited/comment_1_599b96a2299f3c998f0317f4321eeff4._comment
new file mode 100644
index 000000000..836d50ec8
--- /dev/null
+++ b/doc/forum/Data_loss_if_symlink_target_edited/comment_1_599b96a2299f3c998f0317f4321eeff4._comment
@@ -0,0 +1,26 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-15T13:50:17Z"
+ content="""
+`git-annex fsck` will detect this problem, but the real problem here is not
+that some edit got lost, but that you corrupted a version control object
+file. Similar to editing a file in `.git/objects/`. Fsck will, when it
+notices the problem, move the corrupted object file to `.git/annex/bad/`.
+So your edits are not lost, but are in danger of being forgotten.
+
+Note that, once the modified version of the file from repo B replaces
+the worktree file, `git annex fsck` of that file won't check the old
+version, so will not detect the problem. `git annex fsck --all` still will
+detect it.
+
+git-annex mostly prevents this kind of problem by clearing the file's
+write bit, and putting it in a directory that also has its write bit
+not set.
+
+You have to either be running as root, or using a program that goes
+out of its way to change multiple permissions, to get into that situation.
+
+(One example of a program that does so is vim. `:w!` will
+temporarily override both write bits.)
+"""]]

fix exponential blowup when adding lots of identical files
This was an old problem when the files were being added unlocked,
so the changelog mentions that being fixed. However, recently it's also
affected locked files.
The fix for locked files is kind of stupidly simple. moveAnnex already
handles populating unlocked files, and only does it when the object file
was not already present. So remove the redundant populateUnlockedFiles
call. (That call was added all the way back in
cfaac52b88e157dd4e71626fe68af37015b9c9bd, and has always been
unnecessary.)
Sponsored-by: Dartmouth College's Datalad project
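A minimal sketch (simplified, hypothetical types) of the shape of the fix:
population of other unlocked files is keyed off whether placing the object
in the annex actually did anything, so adding the Nth duplicate of a key
no longer revisits the N-1 files added before it (roughly N^2/2 populate
calls for N duplicates, down to N).

    -- Simplified stand-in for git-annex's LinkAnnexResult.
    data LinkResult = LinkOk | LinkNoop

    -- When the object file was already present, every earlier duplicate
    -- was populated at the time it was added, so there is nothing to do.
    populateOthers :: LinkResult -> IO () -> IO ()
    populateOthers LinkNoop _ = return ()
    populateOthers LinkOk populate = populate

    main :: IO ()
    main = do
        populateOthers LinkOk (putStrLn "populating other unlocked files")
        -- this second action is never run: the object already existed
        populateOthers LinkNoop (putStrLn "never printed")

The actual change is visible in the diff below: populateUnlockedFiles
becomes a no-op on LinkAnnexNoop, and the redundant call after moveAnnex
is dropped.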
diff --git a/Annex/Ingest.hs b/Annex/Ingest.hs
index b8d98f108..ed2c6d5ac 100644
--- a/Annex/Ingest.hs
+++ b/Annex/Ingest.hs
@@ -178,9 +178,7 @@ ingest' preferredbackend meterupdate (Just (LockedDown cfg source)) mk restage =
 
 	golocked key mcache s =
 		tryNonAsync (moveAnnex key naf (contentLocation source)) >>= \case
-			Right True -> do
-				populateUnlockedFiles key source restage
-				success key mcache s		
+			Right True -> success key mcache s		
 			Right False -> giveup "failed to add content to annex"
 			Left e -> restoreFile (keyFilename source) key e
 
@@ -198,8 +196,8 @@ ingest' preferredbackend meterupdate (Just (LockedDown cfg source)) mk restage =
 		cleanOldKeys (keyFilename source) key
 		linkToAnnex key (keyFilename source) (Just cache) >>= \case
 			LinkAnnexFailed -> failure "failed to link to annex"
-			_ -> do
-				finishIngestUnlocked' key source restage
+			lar -> do
+				finishIngestUnlocked' key source restage (Just lar)
 				success key (Just cache) s
 	gounlocked _ _ _ = failure "failed statting file"
 
@@ -215,25 +213,30 @@ ingest' preferredbackend meterupdate (Just (LockedDown cfg source)) mk restage =
 finishIngestUnlocked :: Key -> KeySource -> Annex ()
 finishIngestUnlocked key source = do
 	cleanCruft source
-	finishIngestUnlocked' key source (Restage True)
+	finishIngestUnlocked' key source (Restage True) Nothing
 
-finishIngestUnlocked' :: Key -> KeySource -> Restage -> Annex ()
-finishIngestUnlocked' key source restage = do
+finishIngestUnlocked' :: Key -> KeySource -> Restage -> Maybe LinkAnnexResult -> Annex ()
+finishIngestUnlocked' key source restage lar = do
 	Database.Keys.addAssociatedFile key
 		=<< inRepo (toTopFilePath (keyFilename source))
-	populateUnlockedFiles key source restage
-
-{- Copy to any unlocked files using the same key. -}
-populateUnlockedFiles :: Key -> KeySource -> Restage -> Annex ()
-populateUnlockedFiles key source restage = 
-	whenM (annexSupportUnlocked <$> Annex.getGitConfig) $ do
-		obj <- calcRepo (gitAnnexLocation key)
-		g <- Annex.gitRepo
-		ingestedf <- flip fromTopFilePath g
-			<$> inRepo (toTopFilePath (keyFilename source))
-		afs <- map (`fromTopFilePath` g) <$> Database.Keys.getAssociatedFiles key
-		forM_ (filter (/= ingestedf) afs) $
-			populatePointerFile restage key obj
+	populateUnlockedFiles key source restage lar
+
+{- Copy to any other unlocked files using the same key.
+ -
+ - When linkToAnnex did not have to do anything, the object file
+ - was already present, and so other unlocked files are already populated,
+ - and nothing needs to be done here.
+ -}
+populateUnlockedFiles :: Key -> KeySource -> Restage -> Maybe LinkAnnexResult -> Annex ()
+populateUnlockedFiles _ _ _ (Just LinkAnnexNoop) = return ()
+populateUnlockedFiles key source restage lar = do
+	obj <- calcRepo (gitAnnexLocation key)
+	g <- Annex.gitRepo
+	ingestedf <- flip fromTopFilePath g
+		<$> inRepo (toTopFilePath (keyFilename source))
+	afs <- map (`fromTopFilePath` g) <$> Database.Keys.getAssociatedFiles key
+	forM_ (filter (/= ingestedf) afs) $
+		populatePointerFile restage key obj
 
 cleanCruft :: KeySource -> Annex ()
 cleanCruft source = when (contentLocation source /= keyFilename source) $
diff --git a/CHANGELOG b/CHANGELOG
index 0363d22b7..849503469 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -30,6 +30,8 @@ git-annex (8.20210429) UNRELEASED; urgency=medium
   * Added annex.adviceNoSshCaching config.
   * Added --size-limit option.
   * Future proof activity log parsing.
+  * Fix an exponential slowdown when large numbers of duplicate files are
+    being added in unlocked form.
 
  -- Joey Hess <id@joeyh.name>  Mon, 03 May 2021 10:33:10 -0400
 
diff --git a/doc/bugs/significant_performance_regression_impacting_datal.mdwn b/doc/bugs/significant_performance_regression_impacting_datal.mdwn
index aed3e4644..a0ea3823a 100644
--- a/doc/bugs/significant_performance_regression_impacting_datal.mdwn
+++ b/doc/bugs/significant_performance_regression_impacting_datal.mdwn
@@ -15,3 +15,5 @@ Currently 8.20210428+git282-gd39dfed2a and first got slow with
 8.20210428+git228-g13a6bfff4 and was ok with 8.20210428+git202-g9a5981a15
 
 [[!meta title="performance edge case when adding large numbers of identical files"]]
+
+> [[fixed|done]] --[[Joey]]
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment
index deeeec00b..32eb08e11 100644
--- a/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment
@@ -7,4 +7,17 @@ Oh, there's a much better solution: If the annex object file already exists
 when ingesting a new file, skip populating other associated files. They
 will have already been populated. moveAnnex has to check if the annex object
 file already exists anyway, so this will have zero overhead.
+
+(Maybe that's what yarik was getting at in comment #30)
+
+Implemented that, and here's the results, re-running my prior benchmark:
+
+run 1: 0:03.14
+run 2: 0:03.24
+run 3: 0:03.35
+run 4: 0:03.45
+run 9: 0:03.65
+
+That also shows that the actual overhead of diffing the index,
+as its size grows, is quite small.
 """]]

remove supportUnlocked check that is not worth its overhead
moveAnnex only gets to that check if the object file was not present
before. So in the case where dup files are being added repeatedly,
it will only run the first time, and so there's no significant speedup
from doing it; all it avoids is a single sqlite lookup. Since MVar
accesses do have overhead, it's better to optimise for the common case,
where unlocked files are supported.
removeAnnex is less clear cut, but I think mostly is skipped running on
keys when the object has already been dropped, so similar reasoning
applies.
diff --git a/Annex/Content.hs b/Annex/Content.hs
index 07daa14cc..12af39618 100644
--- a/Annex/Content.hs
+++ b/Annex/Content.hs
@@ -340,13 +340,12 @@ moveAnnex key af src = ifM (checkSecureHashes' key)
 			liftIO $ moveFile
 				(fromRawFilePath src)
 				(fromRawFilePath dest)
-			whenM (annexSupportUnlocked <$> Annex.getGitConfig) $ do
-				g <- Annex.gitRepo 
-				fs <- map (`fromTopFilePath` g)
-					<$> Database.Keys.getAssociatedFiles key
-				unless (null fs) $ do
-					ics <- mapM (populatePointerFile (Restage True) key dest) fs
-					Database.Keys.storeInodeCaches' key [dest] (catMaybes ics)
+			g <- Annex.gitRepo 
+			fs <- map (`fromTopFilePath` g)
+				<$> Database.Keys.getAssociatedFiles key
+			unless (null fs) $ do
+				ics <- mapM (populatePointerFile (Restage True) key dest) fs
+				Database.Keys.storeInodeCaches' key [dest] (catMaybes ics)
 		)
 	alreadyhave = liftIO $ R.removeLink src
 
@@ -503,11 +502,10 @@ removeAnnex (ContentRemovalLock key) = withObjectLoc key $ \file ->
 	cleanObjectLoc key $ do
 		secureErase file
 		liftIO $ removeWhenExistsWith R.removeLink file
-		whenM (annexSupportUnlocked <$> Annex.getGitConfig) $ do
-			g <- Annex.gitRepo 
-			mapM_ (\f -> void $ tryIO $ resetpointer $ fromTopFilePath f g)
-				=<< Database.Keys.getAssociatedFiles key
-			Database.Keys.removeInodeCaches key
+		g <- Annex.gitRepo 
+		mapM_ (\f -> void $ tryIO $ resetpointer $ fromTopFilePath f g)
+			=<< Database.Keys.getAssociatedFiles key
+		Database.Keys.removeInodeCaches key
   where
 	-- Check associated pointer file for modifications, and reset if
 	-- it's unmodified.
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment
new file mode 100644
index 000000000..deeeec00b
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_33_6e5121e066998a303cf68ebc53e9fc15._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 33"""
+ date="2021-06-15T13:01:04Z"
+ content="""
+Oh, there's a much better solution: If the annex object file already exists
+when ingesting a new file, skip populating other associated files. They
+will have already been populated. moveAnnex has to check if the annex object
+file already exists anyway, so this will have zero overhead.
+"""]]

diff --git a/doc/forum/Data_loss_if_symlink_target_edited.mdwn b/doc/forum/Data_loss_if_symlink_target_edited.mdwn
new file mode 100644
index 000000000..9ea1ad3d6
--- /dev/null
+++ b/doc/forum/Data_loss_if_symlink_target_edited.mdwn
@@ -0,0 +1,7 @@
+I am syncing two repos A and B. Will the following operations cause data loss:
+1. File 1 is locked in repo A, and its symlink target is edited without unlocking file 1.
+2. File 1 is unlocked and edited in repo B, followed by a git-annex add operation.
+3. Repo A and B are synced with git-annex sync --content
+Will the editing in step 1 be lost? If so, can you please do a git fsck on the file to be overwritten to at least give a warning.
+
+I am using emacs su-mode; only recently did I find that su-mode lets me edit the symlink target without unlocking.

bloom doesn't work, but this should I hope
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
index 8b0e6b5f3..61f0dd11c 100644
--- a/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
@@ -10,19 +10,23 @@ returning a long list of files. So it could detect say 10 files in the list
 and start doing something other than the usual, without bothering the usual
 case with any extra work.
 
-A bloom filter could be used to keep track of keys that have already had
-their associated files populated, and be used to skip the work the next
-time that same key is added. In the false positive case, it would check the
-associated files as it does now, so no harm done.
+Git starts to get slow anyway in the 1 million to 10 million file range. So
+we can assume less than that many files are being added. And there need to
+be a fairly large number of duplicates of a key for speed to become a problem
+when adding that key. Around 1000 based on above benchmarks, but 100 would
+be safer.
 
-Putting these together, a bloom filter with a large enough capacity could
-be set up when it detects the problem, and used to skip the redundant work.
-This would change the checking overhead from `O(N^2)` to `O(N^F)` where F is
-the false positive rate of the bloom filter. And the false positive rate of
-the usual git-annex bloom filter is small: 1/1000000 when half a million
-files are in it. Since 1-10 million files is where git gets too slow to be
-usable, the false positive rate should remain low up until the point other
-performance becomes a problem.
+If it's adding 10 million files, there can be at most 10000 keys
+that have `>=` 1000 duplicates (10 million / 1000).
+No problem to remember 10000 keys; a key is less than 128 bytes long, so
+that would take 1250 kb, plus the overhead of the Map. Might as well
+remember 12 mb worth of keys, to catch 100 duplicates.
+
+It would be even better to use a bloom filter, which could remember many
+more, and I thought I had a way, but the false positive case seems the
+wrong way around. If the bloom filter remembers keys that have already had
+their associated files populated, then a false positive would prevent doing
+that for a key that it's not been done for.
 
 It would make sense to do this not only in populateUnlockedFiles but in
 Annex.Content.moveAnnex and Annex.Content.removeAnnex. Although removeAnnex

plan
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
new file mode 100644
index 000000000..8b0e6b5f3
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_32_0ab5e774dad1e6e74a5c72b95e40d659._comment
@@ -0,0 +1,31 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 32"""
+ date="2021-06-14T20:26:52Z"
+ content="""
+Some thoughts leading to a workable plan:
+
+It's easy to detect this edge case because getAssociatedFiles will be
+returning a long list of files. So it could detect say 10 files in the list
+and start doing something other than the usual, without bothering the usual
+case with any extra work.
+
+A bloom filter could be used to keep track of keys that have already had
+their associated files populated, and be used to skip the work the next
+time that same key is added. In the false positive case, it would check the
+associated files as it does now, so no harm done.
+
+Putting these together, a bloom filter with a large enough capacity could
+be set up when it detects the problem, and used to skip the redundant work.
+This would change the checking overhead from `O(N^2)` to `O(N^F)` where F is
+the false positive rate of the bloom filter. And the false positive rate of
+the usual git-annex bloom filter is small: 1/1000000 when half a million
+files are in it. Since 1-10 million files is where git gets too slow to be
+usable, the false positive rate should remain low up until the point other
+performance becomes a problem.
+
+It would make sense to do this not only in populateUnlockedFiles but in
+Annex.Content.moveAnnex and Annex.Content.removeAnnex. Although removeAnnex
+would need a different bloom filter, since a file might have been populated
+and then somehow get removed in the same git-annex call.
+"""]]

comment
diff --git a/doc/forum/List_if_files_are_stored_in_annex_or_in_git/comment_1_9206087f7320982ef26ef401d39a394f._comment b/doc/forum/List_if_files_are_stored_in_annex_or_in_git/comment_1_9206087f7320982ef26ef401d39a394f._comment
new file mode 100644
index 000000000..e38ee78ef
--- /dev/null
+++ b/doc/forum/List_if_files_are_stored_in_annex_or_in_git/comment_1_9206087f7320982ef26ef401d39a394f._comment
@@ -0,0 +1,20 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-14T18:38:13Z"
+ content="""
+`git annex find`, like all git-annex commands except for `add`, 
+skips over non-annexed files.
+
+What you can do is get a list of all annexed files:
+
+	git annex find --include '*' | sort > annexed-files
+
+And get a list of all files git knows:
+
+	git -c core.quotepath=off ls-files | sort > all-files
+
+And then find files that are in the second list but not the first:
+
+	comm -1 -3 annexed-files all-files
+"""]]

going round and round, boredly
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_31_97fa9b7729805704a6a22cf18082e203._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_31_97fa9b7729805704a6a22cf18082e203._comment
new file mode 100644
index 000000000..6065817da
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_31_97fa9b7729805704a6a22cf18082e203._comment
@@ -0,0 +1,7 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""Re: comment 30"""
+ date="2021-06-14T18:36:11Z"
+ content="""
+I discussed that approach in comment #24.
+"""]]

comment
diff --git a/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_3_fe771be800a6b26157c3075e632ee1b7._comment b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_3_fe771be800a6b26157c3075e632ee1b7._comment
new file mode 100644
index 000000000..e92ccb75f
--- /dev/null
+++ b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_3_fe771be800a6b26157c3075e632ee1b7._comment
@@ -0,0 +1,9 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2021-06-14T18:33:43Z"
+ content="""
+> It would be more beneficial to speed up that scanning (reconcileStaged), which should be doable by using the git cat-file --batch trick.
+
+That got implemented.
+"""]]

comment
diff --git a/doc/todo/trust_based_on_time_since_last_fsck/comment_4_29729772f7600dfb459e3be9cf2c43ea._comment b/doc/todo/trust_based_on_time_since_last_fsck/comment_4_29729772f7600dfb459e3be9cf2c43ea._comment
new file mode 100644
index 000000000..b2795a163
--- /dev/null
+++ b/doc/todo/trust_based_on_time_since_last_fsck/comment_4_29729772f7600dfb459e3be9cf2c43ea._comment
@@ -0,0 +1,26 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 4"""
+ date="2021-06-14T18:20:06Z"
+ content="""
+Maybe it's better to not tie this directly in to fsck. Another way
+would be:
+
+	git annex untrust foo --after=100days
+
+The first time this is run, it would record that the trust level will
+change to untrust after 100 days. The next time it's run, it would advance
+the timeout.
+
+So, you could do whatever fsck or other checks make you still trust the
+repo, and then run this again.
+
+Implementation would I guess need a separate future-trust.log in addition
+to trust.log, and when loading trust levels, if there was a value in
+future-trust.log that has a newer timestamp than the value in trust.log,
+and enough time has passed, use it instead of the value from trust.log.
+That way it avoids breaking older git-annex with changes to trust.log.
+
+No need to change what's in trust.log, although it could, which would also
+let older git-annex versions learn about the change to trust.
+"""]]

Future proof activity log parsing
When the log has an activity that is not known, eg added by a future
version of git-annex, it used to be treated as no activity at all,
which would make git-annex expire think it should expire the repository,
despite it having some kind of recent activity.
Hopefully there will be no reason to add a new activity until enough
time has passed that this commit is in use everywhere.
Sponsored-by: Jake Vosloo on Patreon
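A minimal sketch (standalone, simplified to String) of the future-proofing
pattern the diff below applies: an unrecognized value parses to a catch-all
constructor instead of failing, still counts as some activity, and
serializes back out verbatim.

    -- Known activities plus a catch-all preserving anything a future
    -- version of git-annex may write to the log.
    data Activity = Fsck | UnknownActivity String
        deriving (Eq, Show)

    parseActivity :: String -> Activity
    parseActivity "Fsck" = Fsck
    parseActivity s = UnknownActivity s

    buildActivity :: Activity -> String
    buildActivity Fsck = "Fsck"
    buildActivity (UnknownActivity s) = s

    -- A hypothetical future "FsckAll" round-trips unchanged, rather than
    -- being mistaken for an absence of activity.
    main :: IO ()
    main = putStrLn (buildActivity (parseActivity "FsckAll"))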
diff --git a/CHANGELOG b/CHANGELOG
index 47d1185ef..0363d22b7 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -29,6 +29,7 @@ git-annex (8.20210429) UNRELEASED; urgency=medium
     that creates the git-annex branch.
   * Added annex.adviceNoSshCaching config.
   * Added --size-limit option.
+  * Future proof activity log parsing.
 
  -- Joey Hess <id@joeyh.name>  Mon, 03 May 2021 10:33:10 -0400
 
diff --git a/Command/Expire.hs b/Command/Expire.hs
index da454ef3d..f7ef0a205 100644
--- a/Command/Expire.hs
+++ b/Command/Expire.hs
@@ -111,6 +111,6 @@ parseExpire ps = do
 parseActivity :: MonadFail m => String -> m Activity
 parseActivity s = case readish s of
 	Nothing -> Fail.fail $ "Unknown activity. Choose from: " ++ 
-		unwords (map show [minBound..maxBound :: Activity])
+		unwords (map show allActivities)
 	Just v -> return v
 
diff --git a/Logs/Activity.hs b/Logs/Activity.hs
index 62598ca50..d527d2669 100644
--- a/Logs/Activity.hs
+++ b/Logs/Activity.hs
@@ -1,6 +1,6 @@
 {- git-annex activity log
  -
- - Copyright 2015-2019 Joey Hess <id@joeyh.name>
+ - Copyright 2015-2021 Joey Hess <id@joeyh.name>
  -
  - Licensed under the GNU AGPL version 3 or higher.
  -}
@@ -8,6 +8,7 @@
 module Logs.Activity (
 	Log,
 	Activity(..),
+	allActivities,
 	recordActivity,
 	lastActivities,
 ) where
@@ -23,30 +24,38 @@ import Data.ByteString.Builder
 
 data Activity 
 	= Fsck
-	deriving (Eq, Read, Show, Enum, Bounded)
+	-- Allow for unknown activities to be added later.
+	| UnknownActivity S.ByteString
+	deriving (Eq, Read, Show)
 
+allActivities :: [Activity]
+allActivities = [Fsck]
+
+-- Record an activity. This takes the place of previously recorded activity
+-- for the UUID.
 recordActivity :: Activity -> UUID -> Annex ()
 recordActivity act uuid = do
 	c <- currentVectorClock
 	Annex.Branch.change (Annex.Branch.RegardingUUID [uuid]) activityLog $
 		buildLogOld buildActivity
-			. changeLog c uuid (Right act)
+			. changeLog c uuid act
 			. parseLogOld parseActivity
 
+-- Most recent activity for each UUID.
 lastActivities :: Maybe Activity -> Annex (Log Activity)
 lastActivities wantact = parseLogOld (onlywanted =<< parseActivity)
 	<$> Annex.Branch.get activityLog
   where
-	onlywanted (Right a) | wanted a = pure a
-	onlywanted _ = fail "unwanted activity"
+	onlywanted a 
+		| wanted a = pure a
+		| otherwise = fail "unwanted activity"
 	wanted a = maybe True (a ==) wantact
 
-buildActivity :: Either S.ByteString Activity -> Builder
-buildActivity (Right a) = byteString $ encodeBS $ show a
-buildActivity (Left b) = byteString b
+buildActivity :: Activity -> Builder
+buildActivity (UnknownActivity b) = byteString b
+buildActivity a = byteString $ encodeBS $ show a
 
--- Allow for unknown activities to be added later by preserving them.
-parseActivity :: A.Parser (Either S.ByteString Activity)
+parseActivity :: A.Parser Activity
 parseActivity = go <$> A.takeByteString
   where
-	go b = maybe (Left b) Right $ readish $ decodeBS b
+	go b = fromMaybe (UnknownActivity b) (readish $ decodeBS b)
diff --git a/doc/todo/trust_based_on_time_since_last_fsck/comment_1_3805e8dd9e6dd986c097c6f1b78ab244._comment b/doc/todo/trust_based_on_time_since_last_fsck/comment_1_3805e8dd9e6dd986c097c6f1b78ab244._comment
new file mode 100644
index 000000000..84e1bbf7e
--- /dev/null
+++ b/doc/todo/trust_based_on_time_since_last_fsck/comment_1_3805e8dd9e6dd986c097c6f1b78ab244._comment
@@ -0,0 +1,32 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-14T17:14:44Z"
+ content="""
+You can query for repositories that have not been fscked
+for some amount of time:
+
+	git annex expire 10d --no-act --activity=Fsck
+
+From there, it's a simple script to set the unfscked ones to untrusted, or
+whatever.
+
+	| grep '^expire' | awk '{print $2}' | xargs git-annex untrust
+
+I suppose `git-annex expire` could have an option added, like `--untrust`
+to specify *how* to expire, rather than the default of marking the repo
+dead.
+
+I suppose you'd want a way to also go the other way, to stop untrusting a
+repo once it's been fscked.. There is not currently a way to do that.
+
+Note that a fsck that is interrupted does not count as a fsck activity,
+and it's not keeping track of what files were fscked. That would bloat the
+git-annex branch. On the other hand, if you `git annex fsck onefile`
+that counts as a fsck activity, even though other files in the repo didn't get
+fscked. So you would have to limit the ways you use fsck to ones that
+generate the activity you want, perhaps to `git annex fsck --all`. 
+
+Perhaps fsck should also have a way to control whether it records an
+activity or not..
+"""]]
diff --git a/doc/todo/trust_based_on_time_since_last_fsck/comment_2_ec1b87b389dc06440df04c9a719e0cbc._comment b/doc/todo/trust_based_on_time_since_last_fsck/comment_2_ec1b87b389dc06440df04c9a719e0cbc._comment
new file mode 100644
index 000000000..d410d2f45
--- /dev/null
+++ b/doc/todo/trust_based_on_time_since_last_fsck/comment_2_ec1b87b389dc06440df04c9a719e0cbc._comment
@@ -0,0 +1,13 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2021-06-14T17:29:29Z"
+ content="""
+What if `git annex fsck --all` recorded an additional activity, eg FsckAll.
+Then there could be a command, or a config that untrusts repos that do not
+have a FsckAll activity that happened recently enough.
+
+A git config would be simplest, eg:
+
+	git config annex.untrustLastFscked 10d
+"""]]
diff --git a/doc/todo/trust_based_on_time_since_last_fsck/comment_3_23f37b9d8b877b829e34e6c8ea6b40c4._comment b/doc/todo/trust_based_on_time_since_last_fsck/comment_3_23f37b9d8b877b829e34e6c8ea6b40c4._comment
new file mode 100644
index 000000000..6f55fbda2
--- /dev/null
+++ b/doc/todo/trust_based_on_time_since_last_fsck/comment_3_23f37b9d8b877b829e34e6c8ea6b40c4._comment
@@ -0,0 +1,22 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2021-06-14T17:56:23Z"
+ content="""
+Tried to implement this, but ran into a problem adding FsckAll:
+If it only logs FsckAll and not also Fsck, then old git-annex expire
+will see the FsckAll and not understand it, and treats it as no activity,
+so expires. (I did fix git-annex now so an unknown activity is not treated
+as no activity.)
+
+And, the way recordActivity is implemented, it
+removes previous activities, and adds the current activity. So a FsckAll
+followed by a Fsck would remove the FsckAll activity.
+
+That could be fixed, and both be logged, but old git-annex would probably
+not be able to parse the result. And if old git-annex is then used to do a
+fsck, it would log Fsck and remove the previously added FsckAll.
+
+So, it seems this will need to use some log other than activity.log
+to keep track of fsck --all.
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_30_386c0cfe688effb1543ffd01a54717e0._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_30_386c0cfe688effb1543ffd01a54717e0._comment
new file mode 100644
index 000000000..6ab1c624a
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_30_386c0cfe688effb1543ffd01a54717e0._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 30"
+ date="2021-06-14T17:36:16Z"
+ content="""
+> The same needs to also hold true for unlocked files, and so it has to check if foo is an unlocked pointer to K and populate the file with the content.
+
+but that should only need to be done iff K became present/known when it was not present before. If K is already known (e.g. was just added for another file, or maybe was added in a previous \"commit\") no such checks are needed since those files could be expected to already be populated. Right?
+"""]]

diff --git a/doc/forum/List_if_files_are_stored_in_annex_or_in_git.mdwn b/doc/forum/List_if_files_are_stored_in_annex_or_in_git.mdwn
new file mode 100644
index 000000000..68eca8631
--- /dev/null
+++ b/doc/forum/List_if_files_are_stored_in_annex_or_in_git.mdwn
@@ -0,0 +1,7 @@
+I've decided to get my head wrapped around setting up `annex.largefiles` and stop manually deciding to `git add` or `git annex add` files as I go. I fumbled a bit, unsure if I had configured things correctly, as it appeared that `git add` was still adding my large files into git history. I forgot that `git add` is configured to add annexed files unlocked, and so the symlink I was expecting to see wasn't there. `git annex list` and `git annex find` helped me to see which files were staged to be committed into annex storage.
+
+What I would like to be able to do is to more easily list files which are not present in annex storage and are tracked into git storage. I have had a play with `git annex find` and the matching options however I have been unable to display a list of files that I have added as small files. Is there a way to achieve this?
+
+What I think would be ideal is for `git annex list` to show this information or another command which can print a tree with files in either of the two storage modes.
+
+Thanks for any help!

comment
diff --git a/doc/bugs/__34__failed_to_send_content_to_remote__34__/comment_1_c36b1968c69e9794c723ff29d1738759._comment b/doc/bugs/__34__failed_to_send_content_to_remote__34__/comment_1_c36b1968c69e9794c723ff29d1738759._comment
new file mode 100644
index 000000000..678278552
--- /dev/null
+++ b/doc/bugs/__34__failed_to_send_content_to_remote__34__/comment_1_c36b1968c69e9794c723ff29d1738759._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-14T17:06:19Z"
+ content="""
+From the --debug you provided, the important part is that rsync is
+succeeding to copy the file, but then git-annex for some reason thinks it
+failed, or has some other problem. Eg, if the annexed object was corrupt,
+it could copy it with rsync and then fail to verify the copy.
+
+Your git-annex version is a year out of date, so you should upgrade. The
+code that is failing has definitely been changed a lot in the meantime,
+though not that I can remember to fix a bug like this.
+"""]]

improve docs based on forum feedback
diff --git a/doc/tips/using_borg_for_efficient_storage_of_old_annexed_files.mdwn b/doc/tips/using_borg_for_efficient_storage_of_old_annexed_files.mdwn
index c50238087..27be72e2f 100644
--- a/doc/tips/using_borg_for_efficient_storage_of_old_annexed_files.mdwn
+++ b/doc/tips/using_borg_for_efficient_storage_of_old_annexed_files.mdwn
@@ -52,3 +52,9 @@ repository, freeing up disk space.
 
 You can continue running `borg create` and `git-annex sync` to store
 changed files in borg and let git-annex know what's stored there.
+
+It's possible to access the same borg repository from another clone of the
+git-annex repository too. Just run `git annex enableremote borg` in that
+clone to set it up. This uses the same `borgrepo` value that was passed
+to initremote, but you can override it, if, for example, you want to access
+the borg repository over ssh from this new clone.

comment
diff --git a/doc/forum/import_and_export_treeish_for_rsync_and_webdav/comment_1_6a8ee189ff2fee0697b67a405f22c5a4._comment b/doc/forum/import_and_export_treeish_for_rsync_and_webdav/comment_1_6a8ee189ff2fee0697b67a405f22c5a4._comment
new file mode 100644
index 000000000..8d7692a13
--- /dev/null
+++ b/doc/forum/import_and_export_treeish_for_rsync_and_webdav/comment_1_6a8ee189ff2fee0697b67a405f22c5a4._comment
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-14T16:45:21Z"
+ content="""
+exporttree is already supported by both. importtree is unlikely to be
+supported. It's difficult to support this without opening up a very real
+possibility of data loss. If you're importing and exporting to the same
+remote, what happens when there's a conflict? Eg, whatever else is writing
+files to the remote writes to a file, but you locally modify the same file,
+and export a tree, without importing the new version first. That
+would overwrite the modified file on the remote, losing data.
+
+All the special remotes that support exporttree+importtree are able to
+avoid losing data in this scenario. (At least as well as git is able to,
+there are similar races when git updates a working tree at the same time
+you modify a file in the working tree, but the timing of it makes it very
+unlikely, so much so that no one seems to worry about it).
+See also [[todo/import_tree_from_rsync_special_remote]]
+
+I have not felt comfortable giving users a loaded gun in this case. So only
+exporttree is supported, because the user doesn't expect something
+else to be writing to files on that remote, or if they do, they don't
+have any reason to expect git-annex to deal with it well.
+
+Anyway, it seems to me it should be possible to install git-annex on
+your pinephone and connect the repos using ssh.
+"""]]

fix windows build
diff --git a/Annex/Init.hs b/Annex/Init.hs
index a552046a3..0da934889 100644
--- a/Annex/Init.hs
+++ b/Annex/Init.hs
@@ -50,8 +50,8 @@ import Upgrade
 import Annex.Tmp
 import Utility.UserInfo
 import qualified Utility.RawFilePath as R
-#ifndef mingw32_HOST_OS
 import Utility.ThreadScheduler
+#ifndef mingw32_HOST_OS
 import Annex.Perms
 import Utility.FileMode
 import System.Posix.User
diff --git a/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn b/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn
index 8c0e36293..afd21ee70 100644
--- a/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn
+++ b/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn
@@ -2,3 +2,5 @@ As of commit a706708d1, trying to build git-annex on Windows fails because the i
 
 [[!meta author=jwodder]]
 [[!tag projects/datalad]]
+
+> Thank you for reporting and for the patch. [[applied|done]] --[[Joey]]

comment
diff --git a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_21_8f5dd9ba761e636ce5413da4e596296d._comment b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_21_8f5dd9ba761e636ce5413da4e596296d._comment
new file mode 100644
index 000000000..ee83200fc
--- /dev/null
+++ b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_21_8f5dd9ba761e636ce5413da4e596296d._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 21"""
+ date="2021-06-14T16:41:15Z"
+ content="""
+Note that the bug report the previous comment links to is not actually
+about the overhead of this scan.
+"""]]

avoid sometimes expensive operations when annex.supportunlocked = false
This will mostly just avoid a DB lookup, so things get marginally
faster. But in cases where there are many files using the same key, it
can be a more significant speedup.
Added overhead is one MVar lookup per call, which should be small
enough, since this happens after transferring or ingesting a file,
which is always a lot more work than that. It would be nice, though,
to move getGitConfig to AnnexRead, which there is an open todo about.
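A minimal sketch (hypothetical names) of the gating added in the diff
below: one cheap in-memory config read short-circuits the keys database
query and per-file pointer population entirely when unlocked files are
not supported.

    import Control.Monad (unless, when)

    -- The config read is a single MVar/in-memory lookup; the database
    -- query and population it guards can be far more expensive when many
    -- files share a key.
    withUnlockedSupport :: Monad m => m Bool -> m [FilePath] -> (FilePath -> m ()) -> m ()
    withUnlockedSupport supportUnlocked getAssociated populate = do
        supported <- supportUnlocked
        when supported $ do
            fs <- getAssociated
            unless (null fs) $ mapM_ populate fs

    -- With support disabled, the associated-files lookup never runs.
    main :: IO ()
    main = withUnlockedSupport (return False) (return ["a", "b"]) putStrLn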
diff --git a/Annex/Content.hs b/Annex/Content.hs
index 12af39618..07daa14cc 100644
--- a/Annex/Content.hs
+++ b/Annex/Content.hs
@@ -340,12 +340,13 @@ moveAnnex key af src = ifM (checkSecureHashes' key)
 			liftIO $ moveFile
 				(fromRawFilePath src)
 				(fromRawFilePath dest)
-			g <- Annex.gitRepo 
-			fs <- map (`fromTopFilePath` g)
-				<$> Database.Keys.getAssociatedFiles key
-			unless (null fs) $ do
-				ics <- mapM (populatePointerFile (Restage True) key dest) fs
-				Database.Keys.storeInodeCaches' key [dest] (catMaybes ics)
+			whenM (annexSupportUnlocked <$> Annex.getGitConfig) $ do
+				g <- Annex.gitRepo 
+				fs <- map (`fromTopFilePath` g)
+					<$> Database.Keys.getAssociatedFiles key
+				unless (null fs) $ do
+					ics <- mapM (populatePointerFile (Restage True) key dest) fs
+					Database.Keys.storeInodeCaches' key [dest] (catMaybes ics)
 		)
 	alreadyhave = liftIO $ R.removeLink src
 
@@ -502,10 +503,11 @@ removeAnnex (ContentRemovalLock key) = withObjectLoc key $ \file ->
 	cleanObjectLoc key $ do
 		secureErase file
 		liftIO $ removeWhenExistsWith R.removeLink file
-		g <- Annex.gitRepo 
-		mapM_ (\f -> void $ tryIO $ resetpointer $ fromTopFilePath f g)
-			=<< Database.Keys.getAssociatedFiles key
-		Database.Keys.removeInodeCaches key
+		whenM (annexSupportUnlocked <$> Annex.getGitConfig) $ do
+			g <- Annex.gitRepo 
+			mapM_ (\f -> void $ tryIO $ resetpointer $ fromTopFilePath f g)
+				=<< Database.Keys.getAssociatedFiles key
+			Database.Keys.removeInodeCaches key
   where
 	-- Check associated pointer file for modifications, and reset if
 	-- it's unmodified.
diff --git a/Annex/Ingest.hs b/Annex/Ingest.hs
index 7c0d6f449..b8d98f108 100644
--- a/Annex/Ingest.hs
+++ b/Annex/Ingest.hs
@@ -225,14 +225,15 @@ finishIngestUnlocked' key source restage = do
 
 {- Copy to any unlocked files using the same key. -}
 populateUnlockedFiles :: Key -> KeySource -> Restage -> Annex ()
-populateUnlockedFiles key source restage = do
-	obj <- calcRepo (gitAnnexLocation key)
-	g <- Annex.gitRepo
-	ingestedf <- flip fromTopFilePath g
-		<$> inRepo (toTopFilePath (keyFilename source))
-	afs <- map (`fromTopFilePath` g) <$> Database.Keys.getAssociatedFiles key
-	forM_ (filter (/= ingestedf) afs) $
-		populatePointerFile restage key obj
+populateUnlockedFiles key source restage = 
+	whenM (annexSupportUnlocked <$> Annex.getGitConfig) $ do
+		obj <- calcRepo (gitAnnexLocation key)
+		g <- Annex.gitRepo
+		ingestedf <- flip fromTopFilePath g
+			<$> inRepo (toTopFilePath (keyFilename source))
+		afs <- map (`fromTopFilePath` g) <$> Database.Keys.getAssociatedFiles key
+		forM_ (filter (/= ingestedf) afs) $
+			populatePointerFile restage key obj
 
 cleanCruft :: KeySource -> Annex ()
 cleanCruft source = when (contentLocation source /= keyFilename source) $
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_28_0d6a37f823cd9cb3ed1e6e90066ebd2c._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_28_0d6a37f823cd9cb3ed1e6e90066ebd2c._comment
new file mode 100644
index 000000000..f430ed767
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_28_0d6a37f823cd9cb3ed1e6e90066ebd2c._comment
@@ -0,0 +1,9 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 28"""
+ date="2021-06-14T16:31:26Z"
+ content="""
+@Ilya sure can be skipped when annex.supportunlocked=false.
+I've implemented that. (And also for several other cases that have similar
+behavior, like dropping a key.)
+"""]]
diff --git a/doc/todo/move_readonly_values_to_AnnexRead.mdwn b/doc/todo/move_readonly_values_to_AnnexRead.mdwn
index b14627df7..1a2906aab 100644
--- a/doc/todo/move_readonly_values_to_AnnexRead.mdwn
+++ b/doc/todo/move_readonly_values_to_AnnexRead.mdwn
@@ -5,8 +5,8 @@ moved to AnnexRead for a performance win and also to make clean how it's
 used. --[[Joey]]
 
 The easy things have been moved now, but some things like Annex.force and
-Annex.fast would be good to move. Moving those would involve running
-argument processing outside the Annex monad. The main reason argument
-processing runs in the Annex monad is to set those values, but there may be
-other reasons too, so this will be a large set of changes that need to all
-happen together. --[[Joey]]
+Annex.fast and Annex.getGitConfig would be good to move. Moving those would
+involve running argument processing outside the Annex monad. The main
+reason argument processing runs in the Annex monad is to set those values,
+but there may be other reasons too, so this will be a large set of changes
+that need to all happen together. --[[Joey]]

response
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_29_0f3c8949ae362b43ac7db02eb4e11890._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_29_0f3c8949ae362b43ac7db02eb4e11890._comment
new file mode 100644
index 000000000..3ab4eb9e6
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_29_0f3c8949ae362b43ac7db02eb4e11890._comment
@@ -0,0 +1,13 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 29"""
+ date="2021-06-14T16:32:38Z"
+ content="""
+@yarik no, it needs to check all files. Consider what happens when `foo` is
+an annexed link to key K, which is not present, and you copy 
+`.git/annex/objects/../K` from some other repo and run `git annex add K`
+(or reinject, or get, it's all the same) -- `foo` now points to the content
+of K. The same needs to also hold true for unlocked files, and so it has to
+check if `foo` is an unlocked pointer to K and populate the file with the
+content. Repeat for all the other 9999 files..
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_27_db91e10d0bea246686ba2241b348e67d._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_27_db91e10d0bea246686ba2241b348e67d._comment
new file mode 100644
index 000000000..e1c5cfe39
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_27_db91e10d0bea246686ba2241b348e67d._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 27"
+ date="2021-06-14T16:23:41Z"
+ content="""
+FWIW, without having a clear idea of the code logic/data structures, it smells a bit like somewhere it could be an `any` rather than the current `all` operation/check (in the case of `add` at least), if all that is needed is to test whether there is already a path with a given key -- assuming the DB doesn't require a full rewrite of all `path` entries when a new path is added for the key.
+"""]]

Added a comment: git-annex-add slowdown
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_26_cb64b1798af6ad507ce4a4196e9c5b1d._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_26_cb64b1798af6ad507ce4a4196e9c5b1d._comment
new file mode 100644
index 000000000..c209d0628
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_26_cb64b1798af6ad507ce4a4196e9c5b1d._comment
@@ -0,0 +1,16 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="git-annex-add slowdown"
+ date="2021-06-14T16:00:44Z"
+ content="""
+> git-annex add looking to see what other annexed files use the same content, so that it can populate any unlocked files that didn't have the content present before
+
+Could this be skipped if `annex.supportunlocked=false`?
+
+>The file contents all being the same is the crucial thing
+
+One not-quite-edge case where that happens is when empty files are used as placeholders for outputs of failed steps throughout a large workflow.
+
+
+"""]]

check symlink before reading file
This is faster because, when multiple files are in the same directory, the
directory's metadata gets cached, making the symlink check cheap.
diff --git a/Annex/Link.hs b/Annex/Link.hs
index 50f2717fc..c305b39bd 100644
--- a/Annex/Link.hs
+++ b/Annex/Link.hs
@@ -301,7 +301,7 @@ unpaddedMaxPointerSz = 8192
  - symlink does. Avoids a false positive in those cases.
  - -}
 isPointerFile :: RawFilePath -> IO (Maybe Key)
-isPointerFile f = catchDefaultIO Nothing $ do
+isPointerFile f = catchDefaultIO Nothing $
 #if defined(mingw32_HOST_OS)
 	checkcontentfollowssymlinks -- no symlinks supported on windows
 #else
@@ -311,13 +311,10 @@ isPointerFile f = catchDefaultIO Nothing $ do
 		closeFd
 		(\fd -> readhandle =<< fdToHandle fd)
 #else
-	pointercontent <- checkcontentfollowssymlinks
-	if isJust pointercontent
-		then ifM (isSymbolicLink <$> R.getSymbolicLinkStatus f)
-			( return Nothing
-			, return pointercontent
-			)
-		else return Nothing
+	ifM (isSymbolicLink <$> R.getSymbolicLinkStatus f)
+		( return Nothing
+		, checkcontentfollowssymlinks
+		)
 #endif
 #endif
   where
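A standalone sketch of the reordered check above; the content-reading step is
abstracted into a caller-supplied action, since git-annex's real pointer
parsing is not reproduced here:

```haskell
import Control.Exception (SomeException, catch)
import System.Posix.Files (getSymbolicLinkStatus, isSymbolicLink)

-- The checkContent action stands in for opening the file and parsing
-- it as a pointer (a placeholder, not git-annex's real parser).
isPointerFileSketch :: FilePath -> IO (Maybe String) -> IO (Maybe String)
isPointerFileSketch f checkContent = go `catch` onerr
  where
    -- lstat first: it is cheap (and cached when many files in the same
    -- directory are checked), so the file is only opened and read when
    -- the path turns out not to be a symlink.
    go = do
        islink <- isSymbolicLink <$> getSymbolicLinkStatus f
        if islink
            then return Nothing
            else checkContent
    onerr :: SomeException -> IO (Maybe String)
    onerr _ = return Nothing
```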
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_25_5525b58ec94084a926c71f749d1f4233._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_25_5525b58ec94084a926c71f749d1f4233._comment
new file mode 100644
index 000000000..ca226f11a
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_25_5525b58ec94084a926c71f749d1f4233._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 25"""
+ date="2021-06-14T15:52:37Z"
+ content="""
+Found an optimisation that sped it up 50%, but this edge case is still 
+`O(N^2)`, so *shrug*.
+"""]]

retitle
diff --git a/doc/bugs/significant_performance_regression_impacting_datal.mdwn b/doc/bugs/significant_performance_regression_impacting_datal.mdwn
index 87cd4b1e8..aed3e4644 100644
--- a/doc/bugs/significant_performance_regression_impacting_datal.mdwn
+++ b/doc/bugs/significant_performance_regression_impacting_datal.mdwn
@@ -14,3 +14,4 @@ The first red is ok, just a fluke but then they all fail due to change in output
 Currently 8.20210428+git282-gd39dfed2a and first got slow with 
 8.20210428+git228-g13a6bfff4 and was ok with 8.20210428+git202-g9a5981a15
 
+[[!meta title="performance edge case when adding large numbers of identical files"]]

reproduced
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_23_8693e7e9c800f25cbd274b6781d834d6._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_23_8693e7e9c800f25cbd274b6781d834d6._comment
new file mode 100644
index 000000000..af4a33a6e
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_23_8693e7e9c800f25cbd274b6781d834d6._comment
@@ -0,0 +1,31 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 23"""
+ date="2021-06-14T14:09:07Z"
+ content="""
+The file contents all being the same is the crucial thing. On linux,
+adding 1000 dup files at a time (all in same directory), I get:
+
+run 1: 0:08  
+run 2: 0:42  
+run 3: 1:14  
+run 4: 1:46
+
+After run 4, adding 1000 files with all different content takes 
+0:11, so not appreciably slowed down; it only affects adding dups,
+and only when there are a *lot* of them.
+
+This feels like quite an edge case, and also not
+really a new problem, since unlocked files would have already
+had the same problem before recent changes.
+
+I thought this might be an inefficiency in sqlite's index, similar to how
+hash tables can scale poorly when a lot of things end up in the same
+bucket. But disabling the index did not improve performance.
+
+Aha -- the slowdown is caused by `git-annex add` looking to see what other
+annexed files use the same content, so that it can populate any unlocked
+files that didn't have the content present before. With all these locked
+files now recorded in the db, it has to check each file in turn, and
+there's the `O(N^2)`.
+"""]]
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_24_6d11f6aa4b1a435bdf6d165eb8e6db8a._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_24_6d11f6aa4b1a435bdf6d165eb8e6db8a._comment
new file mode 100644
index 000000000..ff018e693
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_24_6d11f6aa4b1a435bdf6d165eb8e6db8a._comment
@@ -0,0 +1,17 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 24"""
+ date="2021-06-14T15:36:30Z"
+ content="""
+If the database recorded when files were unlocked or not, that could be
+avoided, but tracking that would add a lot of complexity for what is just
+an edge case. And probably slow things down generally by some amount due to
+the db being larger.
+
+It seems almost cheating, but it could remember the last few keys it's added,
+and avoid trying to populate unlocked files when adding those keys again.
+This would slow down the usual case by some tiny amount (eg an IORef access) 
+but avoid `O(N^2)` in this edge case. Though it wouldn't fix all edge cases,
+eg when the files it's adding rotate through X different contents, and X is
+larger than the number of keys it remembers.
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_22_192f2055345f1156a06d93b4c25e6722._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_22_192f2055345f1156a06d93b4c25e6722._comment
new file mode 100644
index 000000000..2fc5dc4b9
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_22_192f2055345f1156a06d93b4c25e6722._comment
@@ -0,0 +1,16 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 22"
+ date="2021-06-09T22:28:06Z"
+ content="""
+FWIW the complete timing was
+```
+       21.65 real         8.47 user        10.89 sys
+      139.96 real        61.51 user        78.18 sys
+      253.47 real       112.01 user       142.78 sys
+      370.43 real       161.94 user       211.15 sys
+      481.03 real       210.78 user       274.31 sys
+```
+so it tracks pretty closely with what I observed with 4 splits (`30 284 528 720`).
+"""]]

Added a comment: more "mystery resolved" -- identical (empty) keys
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_21_0119751108d8d9fe6848269594bec273._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_21_0119751108d8d9fe6848269594bec273._comment
new file mode 100644
index 000000000..d3173a56e
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_21_0119751108d8d9fe6848269594bec273._comment
@@ -0,0 +1,104 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="more &quot;mystery resolved&quot; -- identical (empty) keys"
+ date="2021-06-09T21:00:34Z"
+ content="""
+thank you Joey for timing it up! May be relevant: in our [test scenario](https://github.com/datalad/datalad/blob/master/datalad/support/tests/test_annexrepo.py#L2354) we have 100 directories with 100 files in each one of those and each directory or file name is 100 chars long (not sure if this is relevant either). So doing 5 subsequent adds (on OSX) should result in ~20k paths specified on the command line for each invocation.
+
+<details>
+<summary>so I did this little replication script which would do similar drill (just not long filenames for this one)</summary> 
+
+```shell
+#!/bin/bash
+
+export PS4='> '
+set -eu
+cd \"$(mktemp -d ${TMPDIR:-/tmp}/ann-XXXXXXX)\"
+
+echo \"Populating the tree\"
+for d in {0..99}; do
+    mkdir $d
+    for f in {0..99}; do
+        echo \"$d$f\" >> $d/$f
+    done
+done
+
+git init
+git annex init
+/usr/bin/time git annex add --json {,1}?/* >/dev/null
+/usr/bin/time git annex add --json {2,3}?/* >/dev/null
+/usr/bin/time git annex add --json {4,5}?/* >/dev/null
+/usr/bin/time git annex add --json {6,7}?/* >/dev/null
+/usr/bin/time git annex add --json {8,9}?/* >/dev/null
+
+```
+</details>
+
+<details>
+<summary>with \"older\" 8.20210429-g9a5981a15 on OSX - stable 30 sec per batch (matches what I observed from running our tests)</summary> 
+
+```shell
+       29.83 real         9.52 user        17.43 sys
+       30.49 real        10.02 user        17.34 sys
+       30.67 real        10.37 user        17.36 sys
+       31.00 real        10.57 user        17.39 sys
+       30.78 real        10.77 user        17.23 sys
+```
+</details>
+
+<details>
+<summary>and the newer 8.20210429-g57b567ac8 -- I got the same-ish nice timing without significant growth! -- damn it</summary> 
+
+```shell
+       31.26 real        10.08 user        18.14 sys
+       31.97 real        10.99 user        18.69 sys
+       31.77 real        11.23 user        18.24 sys
+       32.08 real        11.26 user        18.06 sys
+       32.53 real        11.45 user        18.27 sys
+```
+</details>
+
+so I looked into our test generation again and realized -- we are not populating unique files.  They are all empty!
+
+<details>
+<summary>and now confirming with this slightly adjusted script which just touches them:</summary> 
+
+```shell
+#!/bin/bash
+
+export PS4='> '
+set -eu
+cd \"$(mktemp -d ${TMPDIR:-/tmp}/ann-XXXXXXX)\"
+
+echo \"Populating the tree\"
+for d in {0..99}; do
+    mkdir $d
+    for f in {0..99}; do
+        touch $d/$f
+    done
+done
+
+git init
+git annex init
+/usr/bin/time git annex add --json {,1}?/* >/dev/null
+/usr/bin/time git annex add --json {2,3}?/* >/dev/null
+/usr/bin/time git annex add --json {4,5}?/* >/dev/null
+/usr/bin/time git annex add --json {6,7}?/* >/dev/null
+/usr/bin/time git annex add --json {8,9}?/* >/dev/null
+
+```
+</details>
+
+the case I had observed
+
+```
+(base) yoh@dataladmac2 ~ % bash macos-slow-annex-add-empty.sh
+...
+       21.65 real         8.47 user        10.89 sys
+      139.96 real        61.51 user        78.18 sys
+... waiting for the rest -- too eager to post as is for now
+```
+
+so -- sorry I had not spotted that peculiarity right from the beginning!
+"""]]

comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_17_045fd8ebe1f9441e42c9c6fbf5c06563._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_17_045fd8ebe1f9441e42c9c6fbf5c06563._comment
new file mode 100644
index 000000000..d8b1baa3e
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_17_045fd8ebe1f9441e42c9c6fbf5c06563._comment
@@ -0,0 +1,93 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 17"""
+ date="2021-06-09T18:53:14Z"
+ content="""
+Trying to repro on linux, batches of 1000 files each time:
+
+	joey@darkstar:~/tmp/bench6>/usr/bin/time git-annex add 1??? --quiet
+	7.17user 5.02system 0:11.33elapsed 107%CPU (0avgtext+0avgdata 69480maxresident)k
+	6880inputs+40344outputs (4major+404989minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 2??? --quiet
+	7.75user 5.63system 0:12.62elapsed 106%CPU (0avgtext+0avgdata 70296maxresident)k
+	3640inputs+42656outputs (9major+414106minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 3??? --quiet
+	8.04user 5.74system 0:12.92elapsed 106%CPU (0avgtext+0avgdata 70396maxresident)k
+	1456inputs+44200outputs (8major+414767minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 4??? --quiet
+	8.66user 5.57system 0:13.45elapsed 105%CPU (0avgtext+0avgdata 69620maxresident)k
+	768inputs+45768outputs (11major+416364minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 5??? --quiet
+	8.83user 5.58system 0:13.92elapsed 103%CPU (0avgtext+0avgdata 69956maxresident)k
+	5448inputs+47128outputs (45major+415820minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 6??? --quiet
+	9.00user 5.49system 0:13.60elapsed 106%CPU (0avgtext+0avgdata 70340maxresident)k
+	20864inputs+48768outputs (270major+418220minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 7??? --quiet
+	9.23user 5.56system 0:13.90elapsed 106%CPU (0avgtext+0avgdata 70352maxresident)k
+	12736inputs+50024outputs (492major+417882minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 8??? --quiet
+	9.27user 5.62system 0:13.89elapsed 107%CPU (0avgtext+0avgdata 70236maxresident)k
+	11320inputs+51816outputs (411major+419672minor)pagefaults 0swaps
+	joey@darkstar:~/tmp/bench>/usr/bin/time git-annex add 9??? --quiet
+	9.33user 5.80system 0:14.04elapsed 107%CPU (0avgtext+0avgdata 70380maxresident)k
+	10952inputs+53128outputs (281major+419771minor)pagefaults 0swaps
+
+There's some growth here, but it seems linear in the number of new files in the git index.
+
+Doing the same with the last git-annex release:
+
+0:10.97elapsed, 0:11.24elapsed, 0:11.65elapsed, 0:11.90elapsed,
+0:12.25elapsed, 0:12.86elapsed, 0:12.74elapsed, 0:12.84elapsed,
+0:13.26elapsed
+
+So close to the same, the slowdown feels minimal.
+
+Then on OSX:
+
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 1??? --quiet
+	       18.26 real         6.03 user         8.35 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 2??? --quiet
+	       26.68 real         6.63 user         9.08 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 3??? --quiet
+	       32.13 real         6.75 user         9.09 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 4??? --quiet
+	       34.49 real         7.11 user         9.40 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 5??? --quiet
+	       34.61 real         7.08 user         9.18 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 6??? --quiet
+	       36.66 real         7.20 user         9.31 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 7??? --quiet
+	       38.29 real         7.35 user         9.42 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 8??? --quiet
+	       35.85 real         7.39 user         9.57 sys
+	datalads-imac:bench joey$ PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add 9??? --quiet
+	       36.20 real         7.56 user         9.59 sys
+
+Well, this does not seem exponential but it's certainly growing
+faster than on linux.
+
+Then tried on OSX with 100 chunks of 100 files each:
+
+	datalads-imac:bench joey$ for x in $(seq 10 99); do echo $x; PATH=$HOME:$PATH /usr/bin/time ~/git-annex  add $x?? --quiet; done
+	...
+	23
+	        3.91 real         0.83 user         1.17 sys
+	24
+	        3.31 real         0.83 user         1.25 sys
+	25
+	        3.32 real         0.83 user         1.26 sys
+	...
+	80
+	        5.68 real         1.05 user         1.60 sys
+	81
+	        6.18 real         1.06 user         1.60 sys
+	82
+	        7.06 real         1.06 user         1.59 sys
+
+There's more overhead because git-annex has to start up and diff
+the index trees, which takes longer than processing the 100 files
+in the chunk. But this is not taking hours to run, so either the
+test case involves a *lot* more than 10k files, or there is
+something else involved.
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_19_80372cd15a6a8812072a446c41fed4e7._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_19_80372cd15a6a8812072a446c41fed4e7._comment
new file mode 100644
index 000000000..96aec33cd
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_19_80372cd15a6a8812072a446c41fed4e7._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 19"
+ date="2021-06-08T22:02:33Z"
+ content="""
+and we seem to not run that test on windows (marked as `@known_failure_windows` ;)) so we do not see that \"regression\" on windows runs (that was yet another part of the mystery to me -- why OSX only ;))
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_18_3e57d7d4640de95a059b2ab3d1e9d54e._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_18_3e57d7d4640de95a059b2ab3d1e9d54e._comment
new file mode 100644
index 000000000..029c41078
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_18_3e57d7d4640de95a059b2ab3d1e9d54e._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 18"
+ date="2021-06-08T21:58:20Z"
+ content="""
+clarification: the \"(which takes over an hour with splitting)\" was intended for the \"bleeding edge\" one, not the older one. That one is ok even with splitting.
+"""]]

Added a comment: OSX mystery resolved. add --batch is effective mitigation
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_17_8df0e346ed85e52ec94763d11c174ab3._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_17_8df0e346ed85e52ec94763d11c174ab3._comment
new file mode 100644
index 000000000..1c3242e79
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_17_8df0e346ed85e52ec94763d11c174ab3._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="OSX mystery resolved. add --batch is effective mitigation"
+ date="2021-06-08T21:56:53Z"
+ content="""
+> Perhaps on OSX something is making the write-tree significantly slower. Or something is making it run the command more with fewer files per run.
+
+The latter would be my guess... We seem to get 2 vs 5 splits on Linux vs OSX... our `datalad.utils.CMD_MAX_ARG` (logic is [here](https://github.com/datalad/datalad/blob/HEAD/datalad/utils.py#L103)) gets set to 1048576 on linux and 524288 on OSX.  So \"matches\" and \"OSX mystery resolved\"! ;)
+
+Meanwhile confirming that using `add --batch` mitigates it. With [ad-hoc patch to add add --batch to datalad](https://github.com/datalad/datalad/pull/5722) I get a 187.8053s run for our test using `annex add --batch` and bleeding edge annex 57b567ac8, and 183.1654s (so about the same) using annex 9a5981a15 from 20210525 (which takes over an hour with splitting); and then with our old splitting and that old git-annex 9a5981a15 I get a 188.3987s run.
+"""]]

comments
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_16_65aaf8e15cd15187cd63863634f25091._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_16_65aaf8e15cd15187cd63863634f25091._comment
new file mode 100644
index 000000000..39e0f92db
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_16_65aaf8e15cd15187cd63863634f25091._comment
@@ -0,0 +1,33 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 16"""
+ date="2021-06-08T20:56:50Z"
+ content="""
+This is starting to make some sense. If you're running git-annex add
+N times adding M files each time, then each run will now diff the
+changes in the index made by the previous run.
+
+And the first part of diffing the index is generating a tree from all the
+files in it, which is to some extent `O(N*M)` (though IDK, git may have
+optimisations involving subtrees or such). So the combined N git-annex add
+runs come to `O(N*N*M)`
+
+On linux, `git write-tree` with 100,000 files in the index runs in under 1
+second, so the `N*M` is not too bad. And then there's the overhead of
+git-annex processing the resulting diff, which takes more time but is what
+I've been optimising.
+
+Perhaps on OSX something is making the write-tree significantly slower.
+Or something is making it run the command more with fewer files per run.
+Although IIRC OSX has close to the same maximum command line length as
+linux.
+
+Or maybe the index diffing is diffing from the wrong start point.
+One way I can think of where this would happen is if
+it somehow misdetects the index as being locked.
+
+A --debug trace of the one of the later git-annex add runs in that
+test case would probably shed some useful light.
+
+Yes, --batch should avoid the problem...
+"""]]
diff --git a/doc/todo/display_when_reconcileStaged_is_taking_a_long_time/comment_2_b713931f34be1f06b52a75789c3b58ee._comment b/doc/todo/display_when_reconcileStaged_is_taking_a_long_time/comment_2_b713931f34be1f06b52a75789c3b58ee._comment
new file mode 100644
index 000000000..bc731d0c7
--- /dev/null
+++ b/doc/todo/display_when_reconcileStaged_is_taking_a_long_time/comment_2_b713931f34be1f06b52a75789c3b58ee._comment
@@ -0,0 +1,22 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2021-06-08T16:03:13Z"
+ content="""
+I tried making reconcileStaged display the message itself, this is the
+result:
+
+	add foo
+	100%  30 B             73 KiB/s 0s(scanning for annexed files...)
+	ok
+
+So for that to be done, showSideAction would need to clear the progress
+bar display first. Note that the display is ok when concurrent output is
+enabled:
+
+	add c (scanning for annexed files...)
+	ok
+
+Ok.. Fixed that display glitch, and made reconcileStaged display
+the message itself when it's taking a while to run.
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_15_c9488d6180e741dfec0793f546c9eb29._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_15_c9488d6180e741dfec0793f546c9eb29._comment
new file mode 100644
index 000000000..a8051f172
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_15_c9488d6180e741dfec0793f546c9eb29._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 15"
+ date="2021-06-08T20:23:09Z"
+ content="""
+ok -- I think it (or at least a part of it, the datalad test is still running) boils down to `git-annex add` now doing some `O(n-staged * n-cmdline-paths)` lookup/operation (instead of just `O(n-cmdline-paths)` as before) whenever we have a series of `annex add --json cmdline-paths ...` invocations. This is reflected by the fact that where before we had about `~30 sec` for each invocation of `annex add`, now we get `30 284 528 720`.
+
+In any case in datalad we should finally switch to using `annex add --batch`; filed [an issue](https://github.com/datalad/datalad/issues/5721), but I guess maybe it could also be addressed on the git-annex side, since it sounds like some suboptimal data structure is used for path matching.
+"""]]

Added a comment: getting closer...
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_14_4d2998ea843dd8adee8b7b066d97d942._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_14_4d2998ea843dd8adee8b7b066d97d942._comment
new file mode 100644
index 000000000..8204fee07
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_14_4d2998ea843dd8adee8b7b066d97d942._comment
@@ -0,0 +1,28 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="getting closer..."
+ date="2021-06-08T19:21:59Z"
+ content="""
+I think I have localized the slowdown to a single particular test in datalad which operates on a very heavy tree with tiny files.
+Good and bad runs:
+
+```
+*$> grep -h -A3 'datalad.support.tests.test_annexrepo.test_files_split(' builds/2021/05/*/cron-*/44196064/Build\ git-annex\ on\ macOS-29{2,3}-failed/1_test-datalad\ \(master\).txt
+2021-05-25T04:34:39.7723910Z datalad.support.tests.test_annexrepo.test_files_split(<class 'datalad.support.gitrepo.GitRepo'>,) ... ok
+2021-05-25T04:39:31.3031220Z datalad.support.tests.test_annexrepo.test_files_split(<class 'datalad.support.annexrepo.AnnexRepo'>,) ... ok
+2021-05-25T04:39:31.3032670Z datalad.support.tests.test_annexrepo.test_get_size_from_key ... ok
+2021-05-25T04:39:31.3043440Z datalad.support.tests.test_annexrepo.test_done_deprecation ... ok
+2021-05-25T04:39:31.3104830Z datalad.support.tests.test_ansi_colors.test_color_enabled ... ok
+--
+2021-05-26T05:01:12.6881120Z datalad.support.tests.test_annexrepo.test_files_split(<class 'datalad.support.gitrepo.GitRepo'>,) ... ok
+2021-05-26T06:47:04.8547640Z datalad.support.tests.test_annexrepo.test_files_split(<class 'datalad.support.annexrepo.AnnexRepo'>,) ... ok
+2021-05-26T06:47:04.8549600Z datalad.support.tests.test_annexrepo.test_get_size_from_key ... ok
+2021-05-26T06:47:04.8559760Z datalad.support.tests.test_annexrepo.test_done_deprecation ... ok
+2021-05-26T06:47:04.8636720Z datalad.support.tests.test_ansi_colors.test_color_enabled ... ok
+
+```
+you can see from the timestamps (I guess github prepends the time stamp AFTER getting the full line) that over 1h30m is spent there on `test_files_split(<class 'datalad.support.annexrepo.AnnexRepo'>,)`.  [Here is the actual test etc for posterity](https://github.com/datalad/datalad/blob/master/datalad/support/tests/test_annexrepo.py#L2354).  Yet to pinpoint more specifically what is going on, but most likely some interplay with command line length limits (specific to OSX) etc.
+
+So the good news is that it is not some widespread drastic slow-down effect, as far as I can see.
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_13_f6bec279f43603719694e44d99309fc7._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_13_f6bec279f43603719694e44d99309fc7._comment
new file mode 100644
index 000000000..e8a5dbb31
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_13_f6bec279f43603719694e44d99309fc7._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="comment 13"
+ date="2021-06-08T19:08:01Z"
+ content="""
+One other test for “some coincidence with OSX VM getting slower” is to try the current github CI test with an older git-annex.
+"""]]

diff --git a/doc/forum/import_and_export_treeish_for_rsync_and_webdav.mdwn b/doc/forum/import_and_export_treeish_for_rsync_and_webdav.mdwn
new file mode 100644
index 000000000..c4e03503f
--- /dev/null
+++ b/doc/forum/import_and_export_treeish_for_rsync_and_webdav.mdwn
@@ -0,0 +1,3 @@
+is it possible to have import/export (pull/push) functionality for the rsync special remote? similar to how the adb special remote handles importing and exporting treeishes?  an example use case is that i have a pinePhone which i push files (music, podcasts, pdfs) to using the rsync special remote so it can be done over my wifi network, but it would be cool to also be able to import any pictures i take, or any documents that I pull from the internet and then store on the pinePhone, back into my annex.
+
+some of the reasons that I do not just make another full annex repo on the pinePhone come down to the computational resource limitations of the pinephone, and I have a very large git annex (150,000+ keys), which can make the git repo fairly large.

removed
diff --git a/doc/special_remotes/webdav/comment_23_d1b44f0cf171fb8cab85add778f2949b._comment b/doc/special_remotes/webdav/comment_23_d1b44f0cf171fb8cab85add778f2949b._comment
deleted file mode 100644
index c375606c3..000000000
--- a/doc/special_remotes/webdav/comment_23_d1b44f0cf171fb8cab85add778f2949b._comment
+++ /dev/null
@@ -1,9 +0,0 @@
-[[!comment format=mdwn
- username="jenkin.schibel@286264d9ceb79998aecff0d5d1a4ffe34f8b8421"
- nickname="jenkin.schibel"
- avatar="http://cdn.libravatar.org/avatar/692d82fb5c42fc86d97cc44ae0fb61ca"
- subject="using import tree and export tree"
- date="2021-06-06T14:43:39Z"
- content="""
-hey will being able to import a treeish from a webdav remote ever be supported? my use case is that i have a nextcloud instance where i store photo backups for all the smart devices in my family which all get backed up to a single shared directory.  since this tree would be ever changing due to the many smart phones connecting to it, and storing data in it, i figured a push and pull method similar to what can be done with the adb special remote could be useful to keep all the files tracked in my annex.
-"""]]

Added a comment: all recent builds/logs are fetched to smaug
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_12_c841c9db2000c65c56ce8e26e58a3f62._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_12_c841c9db2000c65c56ce8e26e58a3f62._comment
new file mode 100644
index 000000000..4d8479053
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_12_c841c9db2000c65c56ce8e26e58a3f62._comment
@@ -0,0 +1,153 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="all recent builds/logs are fetched to smaug"
+ date="2021-06-08T16:50:11Z"
+ content="""
+Finished fetching all the builds (logs + artifacts, i.e. built installers and packages) to smaug.  FWIW the `joey` user on smaug should be able to access them at /mnt/datasets/datalad/ci/git-annex as well (it is all git/git-annex with submodules, should you happen to want to get a \"clone\" ;))
+
+<details>
+<summary>grep confirms timing for datalad testson OSX</summary> 
+
+```shell
+(tinuous-dev) datalad@smaug:/mnt/datasets/datalad/ci/git-annex$ grep -h -i 'Ran .*tests in' builds/2021/0[56]/*/cron-*/*/*macOS*/*_test-datalad\ \(master\).txt
+2021-05-01T05:47:43.6623300Z Ran 1269 tests in 10381.041s
+2021-05-02T04:27:47.4468150Z Ran 1269 tests in 5835.199s
+2021-05-03T04:25:06.4741780Z Ran 1269 tests in 5817.953s
+2021-05-04T04:26:25.2725400Z Ran 1269 tests in 5925.938s
+2021-05-05T04:15:45.7241110Z Ran 1270 tests in 6003.093s
+2021-05-06T04:14:18.9203140Z Ran 1270 tests in 5821.084s
+2021-05-07T04:19:12.5257430Z Ran 1270 tests in 5887.912s
+2021-05-08T04:17:21.4790320Z Ran 1270 tests in 5775.691s
+2021-05-09T04:20:53.2085730Z Ran 1270 tests in 5961.437s
+2021-05-10T04:21:52.2492300Z Ran 1270 tests in 5698.905s
+2021-05-11T04:28:39.1451570Z Ran 1270 tests in 5899.918s
+2021-05-12T04:28:26.6181110Z Ran 1270 tests in 5964.851s
+2021-05-13T04:30:47.2978230Z Ran 1277 tests in 5938.297s
+2021-05-14T04:37:51.3172970Z Ran 1277 tests in 5921.571s
+2021-05-15T04:31:29.6466840Z Ran 1277 tests in 6060.762s
+2021-05-16T04:33:26.5552670Z Ran 1277 tests in 5789.405s
+2021-05-17T04:36:52.4460760Z Ran 1277 tests in 5855.382s
+2021-05-18T04:37:24.5613580Z Ran 1278 tests in 6007.685s
+2021-05-19T04:34:37.3253830Z Ran 1279 tests in 5940.204s
+2021-05-20T04:36:10.3184010Z Ran 1280 tests in 5941.388s
+2021-05-21T04:41:10.2664330Z Ran 1280 tests in 6218.775s
+2021-05-22T04:40:39.2754540Z Ran 1289 tests in 5884.267s
+2021-05-23T04:37:51.5005200Z Ran 1289 tests in 5750.672s
+2021-05-24T04:38:24.5894000Z Ran 1289 tests in 5911.655s
+2021-05-25T04:51:35.2189100Z Ran 1289 tests in 6266.836s
+2021-05-26T06:56:33.6660300Z Ran 1293 tests in 12584.992s
+2021-05-27T07:08:27.4015580Z Ran 1293 tests in 11901.552s
+2021-05-28T07:39:17.3481450Z Ran 1292 tests in 12094.343s
+2021-05-29T08:05:14.3586440Z Ran 1294 tests in 12205.434s
+2021-05-30T08:09:33.0780990Z Ran 1294 tests in 12028.089s
+2021-05-31T08:16:36.3910830Z Ran 1294 tests in 12329.455s
+2021-06-01T08:42:24.3167200Z Ran 1294 tests in 12115.378s
+2021-06-02T08:42:23.1985610Z Ran 1294 tests in 12432.309s
+2021-06-03T08:00:08.2576030Z Ran 1294 tests in 11556.974s
+2021-06-04T07:25:39.1674160Z Ran 1294 tests in 11946.195s
+2021-06-05T07:14:00.5262620Z Ran 1294 tests in 12456.432s
+2021-06-06T06:53:14.7001370Z Ran 1294 tests in 11677.647s
+2021-06-07T07:12:57.5076610Z Ran 1294 tests in 12042.332s
+2021-06-08T06:48:01.3977250Z Ran 1294 tests in 12126.463
+```
+
+Interestingly the top one is in the 10k range, but prior ones were ok
+
+```shell
+(tinuous-dev) datalad@smaug:/mnt/datasets/datalad/ci/git-annex$ grep -h -i 'Ran .*tests in' builds/2021/04/*/cron-*/*/*macOS*/*_test-datalad\ \(master\).txt
+2021-04-27T03:58:30.4829510Z Ran 1265 tests in 6312.819s
+2021-04-28T03:44:23.1655040Z Ran 1265 tests in 5622.562s
+2021-04-29T03:50:14.0670840Z Ran 1269 tests in 6196.774s
+2021-04-30T04:21:45.1727310Z Ran 1269 tests in 5879.034s
+```
+</details>
+
+<details>
+<summary>and absence of the drastic effect confirmed for linux and windows</summary> 
+
+```shell
+(tinuous-dev) datalad@smaug:/mnt/datasets/datalad/ci/git-annex$ grep -h -i 'Ran .*tests in' builds/2021/0[56]/*/cron-*/*/*buntu*/*_test-datalad\ \(master\).txt
+2021-05-01T04:16:34.1539917Z Ran 1269 tests in 3999.905s
+2021-05-02T04:17:04.5604428Z Ran 1269 tests in 3745.951s
+2021-05-03T04:22:46.1508135Z Ran 1269 tests in 4133.703s
+2021-05-04T04:22:19.3612212Z Ran 1269 tests in 4341.995s
+2021-05-05T04:23:04.3327570Z Ran 1270 tests in 4383.268s
+2021-05-06T04:07:33.2967674Z Ran 1270 tests in 3457.688s
+2021-05-07T04:18:26.7140893Z Ran 1270 tests in 3778.162s
+2021-05-08T04:13:13.9041377Z Ran 1270 tests in 3728.503s
+2021-05-09T04:16:59.6844666Z Ran 1270 tests in 3898.714s
+2021-05-11T04:21:45.5447035Z Ran 1270 tests in 4032.457s
+2021-05-12T04:20:48.6349049Z Ran 1270 tests in 3940.342s
+2021-05-13T04:26:53.8376010Z Ran 1277 tests in 4130.863s
+2021-05-14T04:37:41.5689917Z Ran 1277 tests in 4476.880s
+2021-05-15T04:20:12.0126613Z Ran 1277 tests in 4007.927s
+2021-05-16T04:30:50.0329536Z Ran 1277 tests in 4077.010s
+2021-05-17T04:25:03.3302428Z Ran 1277 tests in 4034.791s
+2021-05-18T04:33:23.5513354Z Ran 1278 tests in 4422.760s
+2021-05-19T04:19:47.5664435Z Ran 1279 tests in 3910.031s
+2021-05-20T04:13:45.4617209Z Ran 1280 tests in 3508.342s
+2021-05-21T04:23:56.1103373Z Ran 1280 tests in 3946.563s
+2021-05-22T04:30:52.1649178Z Ran 1289 tests in 4406.863s
+2021-05-23T04:35:53.5526207Z Ran 1289 tests in 3818.054s
+2021-05-24T04:33:44.9770098Z Ran 1289 tests in 4069.751s
+2021-05-25T04:38:28.6369074Z Ran 1289 tests in 3809.039s
+2021-05-26T05:16:56.6913239Z Ran 1293 tests in 4340.973s
+2021-05-27T05:21:33.5230620Z Ran 1293 tests in 4303.062s
+2021-05-28T05:44:02.6763227Z Ran 1292 tests in 4013.921s
+2021-05-29T06:05:40.8639499Z Ran 1294 tests in 3581.055s
+2021-05-30T06:17:23.4265104Z Ran 1294 tests in 4257.736s
+2021-05-31T06:36:09.6446865Z Ran 1294 tests in 4782.606s
+2021-06-01T06:57:26.6049829Z Ran 1294 tests in 4391.030s
+2021-06-02T06:59:17.8345547Z Ran 1294 tests in 4737.580s
+2021-06-03T06:10:51.5533496Z Ran 1294 tests in 3453.557s
+2021-06-04T05:44:15.7873867Z Ran 1294 tests in 4113.770s
+2021-06-05T05:25:15.5789949Z Ran 1294 tests in 4383.014s
+2021-06-06T05:20:48.7371175Z Ran 1294 tests in 4386.210s
+2021-06-07T05:33:24.8643085Z Ran 1294 tests in 4303.855s
+2021-06-08T05:03:38.3420188Z Ran 1294 tests in 4371.387s
+```
+
+```shell
+(tinuous-dev) datalad@smaug:/mnt/datasets/datalad/ci/git-annex$ grep -h -i 'Ran .*tests in' builds/2021/0[56]/*/cron-*/*/*indows*/*_test-datalad\ \(master\).txt
+2021-05-01T05:29:21.7469495Z Ran 1250 tests in 4904.324s
+2021-05-02T05:33:35.7321469Z Ran 1250 tests in 4835.792s
+2021-05-03T05:32:36.9342468Z Ran 1250 tests in 5040.210s
+2021-05-04T05:18:53.5090364Z Ran 1250 tests in 4508.693s
+2021-05-05T05:29:53.0946195Z Ran 1251 tests in 4989.032s
+2021-05-06T05:26:05.6627667Z Ran 1251 tests in 4766.005s
+2021-05-07T05:19:54.3788683Z Ran 1251 tests in 4305.963s
+2021-05-08T05:32:49.9787816Z Ran 1251 tests in 4999.028s
+2021-05-09T05:38:03.4560444Z Ran 1251 tests in 5482.509s
+2021-05-10T05:34:58.3502335Z Ran 1251 tests in 5067.805s
+2021-05-11T05:29:07.7916431Z Ran 1251 tests in 4797.435s
+2021-05-12T05:23:10.5488731Z Ran 1251 tests in 4276.426s
+2021-05-13T05:25:09.1884039Z Ran 1258 tests in 4275.409s
+2021-05-14T05:39:20.0295201Z Ran 1258 tests in 4999.023s
+2021-05-15T05:31:48.8021685Z Ran 1258 tests in 4736.174s
+2021-05-16T05:33:12.0136713Z Ran 1258 tests in 4310.326s
+2021-05-17T05:30:27.7704400Z Ran 1258 tests in 4561.060s
+2021-05-18T05:31:34.3746293Z Ran 1259 tests in 4426.723s
+2021-05-19T05:31:11.8604863Z Ran 1260 tests in 4903.506s
+2021-05-20T05:22:56.6087010Z Ran 1261 tests in 4299.536s
+2021-05-21T05:28:25.7511300Z Ran 1261 tests in 4498.286s
+2021-05-22T05:44:54.2975226Z Ran 1270 tests in 5376.292s
+2021-05-23T05:35:29.5504579Z Ran 1270 tests in 4289.031s
+2021-05-24T05:33:40.4688984Z Ran 1270 tests in 4202.429s
+2021-05-25T05:45:30.0310577Z Ran 1270 tests in 4882.292s
+2021-05-26T06:03:08.4980635Z Ran 1274 tests in 4441.783s
+2021-05-27T06:05:25.0347805Z Ran 1274 tests in 4381.865s
+2021-05-28T06:35:10.3614775Z Ran 1273 tests in 4529.720s
+2021-05-29T07:10:28.5980900Z Ran 1275 tests in 4639.335s
+2021-05-30T07:01:37.6908114Z Ran 1275 tests in 5123.743s
+2021-05-31T07:01:41.6742606Z Ran 1275 tests in 4672.112s
+2021-06-01T07:36:38.0359361Z Ran 1275 tests in 5183.674s
+2021-06-02T07:21:46.3022239Z Ran 1275 tests in 4610.975s
+2021-06-03T07:20:14.4589015Z Ran 1275 tests in 5795.353s
+2021-06-04T06:44:13.4142157Z Ran 1275 tests in 4574.473s
+2021-06-08T06:10:00.4654735Z Ran 1275 tests in 5406.636s
+```
+</details>
+
+now I will try to do some timing on a local OSX machine to see whether it is indeed just some coincidence with the OSX VM getting slower, or whether it is \"real\".
+"""]]

display scanning message whenever reconcileStaged has enough files to chew on
Clear visible progress bar first.
Removed showSideActionAfter because it can't be used in reconcileStaged
(import loop). Instead, it counts the number of files it
processes and displays the message after it's seen a sufficient number to
know it's taking a while.
Sponsored-by: Dartmouth College's Datalad project
diff --git a/Annex/Concurrent.hs b/Annex/Concurrent.hs
index cb22b6a46..1701120da 100644
--- a/Annex/Concurrent.hs
+++ b/Annex/Concurrent.hs
@@ -100,23 +100,3 @@ mergeState st = do
 		uncurry addCleanupAction
 	Annex.Queue.mergeFrom st'
 	changeState $ \s -> s { errcounter = errcounter s + errcounter st' }
-
-{- Display a message, only when the action runs for a long enough
- - amount of time.
- - 
- - The action should not display any other messages, progress, etc;
- - if it did there could be some scrambling of the display since the
- - message display could happen at the same time as other output,
- - or after it.
- -}
-showSideActionAfter :: Microseconds -> String -> Annex a -> Annex a
-showSideActionAfter t m a = do
-	waiter <- liftIO $ async $ unboundDelay t
-	let display = liftIO (waitCatch waiter) >>= \case
-		Left _ -> return ()
-		Right _ -> showSideAction m
-	displayer <- liftIO . async =<< forkState display
-	let cleanup = do
-		liftIO $ cancel waiter
-		join (liftIO (wait displayer))
-	a `finally` cleanup
diff --git a/Annex/WorkTree.hs b/Annex/WorkTree.hs
index 33d909487..9a3dd5077 100644
--- a/Annex/WorkTree.hs
+++ b/Annex/WorkTree.hs
@@ -15,8 +15,6 @@ import Annex.Content
 import Annex.ReplaceFile
 import Annex.CurrentBranch
 import Annex.InodeSentinal
-import Annex.Concurrent
-import Utility.ThreadScheduler
 import Utility.InodeCache
 import Git.FilePath
 import Git.CatFile
@@ -81,7 +79,7 @@ ifAnnexed file yes no = maybe no yes =<< lookupKey file
  - as-is.
  -}
 scanAnnexedFiles :: Bool -> Annex ()
-scanAnnexedFiles initscan = showSideActionAfter oneSecond "scanning for annexed files" $ do
+scanAnnexedFiles initscan = do
 	-- This gets the keys database populated with all annexed files,
 	-- by running Database.Keys.reconcileStaged.
 	Database.Keys.runWriter (const noop)
diff --git a/Database/Keys.hs b/Database/Keys.hs
index 3c0e3bede..6cbfecda1 100644
--- a/Database/Keys.hs
+++ b/Database/Keys.hs
@@ -7,6 +7,7 @@
 
 {-# LANGUAGE ScopedTypeVariables #-}
 {-# LANGUAGE OverloadedStrings #-}
+{-# LANGUAGE BangPatterns #-}
 
 module Database.Keys (
 	DbHandle,
@@ -403,7 +404,7 @@ reconcileStaged qh = do
 				proct <- liftIO $ async $
 					procthread mdreader catfeeder
 						`finally` void catcloser
-				dbchanged <- dbwriter False catreader
+				dbchanged <- dbwriter False largediff catreader
 				-- Flush database changes now
 				-- so other processes can see them.
 				when dbchanged $
@@ -420,8 +421,27 @@ reconcileStaged qh = do
 			Just _ -> procthread mdreader catfeeder
 			Nothing -> return ()
 
-		dbwriter dbchanged catreader = liftIO catreader >>= \case
+		dbwriter dbchanged n catreader = liftIO catreader >>= \case
 			Just (ka, content) -> do
 				changed <- ka (parseLinkTargetOrPointerLazy =<< content)
-				dbwriter (dbchanged || changed) catreader
+				!n' <- countdownToMessage n
+				dbwriter (dbchanged || changed) n' catreader
 			Nothing -> return dbchanged
+
+	-- When the diff is large, the scan can take a while,
+	-- so let the user know what's going on.
+	countdownToMessage n
+		| n < 1 = return 0
+		| n == 1 = do
+			showSideAction "scanning for annexed files"
+			return 0
+		| otherwise = return (pred n)
+
+	-- How large is large? Too large and there will be a long
+	-- delay before the message is shown; too short and the message
+	-- will clutter things up unnecessarily. It's uncommon for 1000
+	-- files to change in the index, and processing that many files
+	-- takes less than half a second, so that seems about right.
+	largediff :: Int
+	largediff = 1000
+
diff --git a/Messages.hs b/Messages.hs
index 81c96a764..563bd7e81 100644
--- a/Messages.hs
+++ b/Messages.hs
@@ -124,12 +124,14 @@ showSideAction m = Annex.getState Annex.output >>= go
   where
 	go st
 		| sideActionBlock st == StartBlock = do
-			p
+			go' st
 			let st' = st { sideActionBlock = InBlock }
 			Annex.changeState $ \s -> s { Annex.output = st' }
 		| sideActionBlock st == InBlock = return ()
-		| otherwise = p
-	p = outputMessage JSON.none $ encodeBS' $ "(" ++ m ++ "...)\n"
+		| otherwise = go' st
+	go' st = do
+		liftIO $ clearProgressMeter st
+		outputMessage JSON.none $ encodeBS' $ "(" ++ m ++ "...)\n"
 			
 showStoringStateAction :: Annex ()
 showStoringStateAction = showSideAction "recording state in git"
diff --git a/Messages/Progress.hs b/Messages/Progress.hs
index da60716e6..397aebbb3 100644
--- a/Messages/Progress.hs
+++ b/Messages/Progress.hs
@@ -11,6 +11,7 @@
 module Messages.Progress where
 
 import Common
+import qualified Annex
 import Messages
 import Utility.Metered
 import Types
@@ -66,7 +67,7 @@ instance MeterSize KeySizer where
 
 {- Shows a progress meter while performing an action.
  - The action is passed the meter and a callback to use to update the meter.
- --}
+ -}
 metered
 	:: MeterSize sizer
 	=> Maybe MeterUpdate
@@ -75,28 +76,38 @@ metered
 	-> Annex a
 metered othermeter sizer a = withMessageState $ \st -> do
 	sz <- getMeterSize sizer
-	metered' st othermeter sz showOutput a
+	metered' st setclear othermeter sz showOutput a
+  where
+	setclear c = Annex.changeState $ \st -> st
+		{ Annex.output = (Annex.output st) { clearProgressMeter = c } }
 
 metered'
 	:: (Monad m, MonadIO m, MonadMask m)
 	=> MessageState
+	-> (IO () -> m ())
+	-- ^ This should set clearProgressMeter when progress meters
+	-- are being displayed; not needed when outputType is not
+	-- NormalOutput.
 	-> Maybe MeterUpdate
 	-> Maybe TotalSize
 	-> m ()
 	-- ^ this should run showOutput
 	-> (Meter -> MeterUpdate -> m a)
 	-> m a
-metered' st othermeter msize showoutput a = go st
+metered' st setclear othermeter msize showoutput a = go st
   where
 	go (MessageState { outputType = QuietOutput }) = nometer
 	go (MessageState { outputType = NormalOutput, concurrentOutputEnabled = False }) = do
 		showoutput
 		meter <- liftIO $ mkMeter msize $ 
 			displayMeterHandle stdout bandwidthMeter
+		let clear = clearMeterHandle meter stdout
+		setclear clear
 		m <- liftIO $ rateLimitMeterUpdate consoleratelimit meter $
 			updateMeter meter
 		r <- a meter (combinemeter m)
-		liftIO $ clearMeterHandle meter stdout
+		setclear noop
+		liftIO clear
 		return r
 	go (MessageState { outputType = NormalOutput, concurrentOutputEnabled = True }) =
 		withProgressRegion st $ \r -> do
@@ -149,7 +160,7 @@ metered' st othermeter msize showoutput a = go st
 	jsonratelimit = 0.1
 
 	minratelimit = min consoleratelimit jsonratelimit
-
+		
 {- Poll file size to display meter. -}
 meteredFile :: FilePath -> Maybe MeterUpdate -> Key -> Annex a -> Annex a
 meteredFile file combinemeterupdate key a = 
diff --git a/Messages/Serialized.hs b/Messages/Serialized.hs
index ff2517a90..bf1d09578 100644
--- a/Messages/Serialized.hs
+++ b/Messages/Serialized.hs
@@ -65,9 +65,10 @@ relaySerializedOutput getso sendsor meterreport runannex = go Nothing

(Diff truncated)
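The Messages.hs part of this change boils down to stashing a clear-the-meter
action while a progress bar is visible, and having side-action messages run it
before printing. A standalone sketch of that pattern, assuming a plain IORef
in place of git-annex's MessageState:

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)

-- Holds the action that erases any currently-visible progress bar;
-- a no-op when no meter is on screen.
type ClearMeter = IORef (IO ())

showSideActionSketch :: ClearMeter -> String -> IO ()
showSideActionSketch ref msg = do
    clear <- readIORef ref
    clear  -- wipe the meter first, so the message doesn't land mid-line
    putStrLn ("(" ++ msg ++ "...)")

main :: IO ()
main = do
    ref <- newIORef (return ())
    writeIORef ref (putStr "\r\ESC[K")  -- e.g. clear the current line
    showSideActionSketch ref "scanning for annexed files"
```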
clarification
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment
index 8cc8b8623..b9d18337d 100644
--- a/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment
@@ -17,9 +17,12 @@ wastes some time, eg:
 	[2021-06-08 11:52:44.092620256] (Database.Keys) reconcileStaged end
 	ok
 
-All the new work happens in between those two debugs, so you could check if
+All the new work happens in between those two debugs[1], so you could check if
 the time sink is there or elsewhere.
 
 (Note that the last release takes 2 seconds longer for that init than
 it does now..)
+
+[1] With the exception of a single call to `git write-tree`, but that
+should be very fast.
 """]]

add debugging for reconcileStaged calls for benchmarking
diff --git a/Database/Keys.hs b/Database/Keys.hs
index aca13f94e..3c0e3bede 100644
--- a/Database/Keys.hs
+++ b/Database/Keys.hs
@@ -240,6 +240,7 @@ reconcileStaged qh = do
 	go cur indexcache (Just newtree) = do
 		oldtree <- getoldtree
 		when (oldtree /= newtree) $ do
+			fastDebug "Database.Keys" "reconcileStaged start"
 			g <- Annex.gitRepo
 			void $ catstream $ \mdfeeder -> 
 				void $ updatetodiff g
@@ -251,6 +252,7 @@ reconcileStaged qh = do
 			-- get garbage collected, and is available to diff
 			-- against next time.
 			inRepo $ update' lastindexref newtree
+			fastDebug "Database.Keys" "reconcileStaged end"
 	-- git write-tree will fail if the index is locked or when there is
 	-- a merge conflict. To get up-to-date with the current index, 
 	-- diff --staged with the old index tree. The current index tree
@@ -262,6 +264,7 @@ reconcileStaged qh = do
 	-- version of the files that are conflicted. So a second diff
 	-- is done, with --staged but no old tree.
 	go _ _ Nothing = do
+		fastDebug "Database.Keys" "reconcileStaged start (in conflict)"
 		oldtree <- getoldtree
 		g <- Annex.gitRepo
 		catstream $ \mdfeeder -> do
@@ -270,6 +273,7 @@ reconcileStaged qh = do
 			when conflicted $
 				void $ updatetodiff g Nothing "--staged"
 					(procmergeconflictdiff mdfeeder)
+		fastDebug "Database.Keys" "reconcileStaged end"
 	
 	updatetodiff g old new processor = do
 		(l, cleanup) <- pipeNullSplit' (diff old new) g
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment
new file mode 100644
index 000000000..8cc8b8623
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_11_51a3a86a290b0e3994507f3f64e7c72a._comment
@@ -0,0 +1,25 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 11"""
+ date="2021-06-08T15:41:23Z"
+ content="""
+I can't think of anything OSX specific in the recent changes.
+
+I have added debugging of when reconcileStaged wakes up and possibly
+wastes some time, eg:
+
+	joey@darkstar:~/tmp/big> git config annex.debug true
+	joey@darkstar:~/tmp/big> git config annex.debugfilter Database.Keys
+	joey@darkstar:~/tmp/big> git-annex init
+	init  
+	[2021-06-08 11:52:11.854202926] (Database.Keys) reconcileStaged start
+	(scanning for annexed files...)
+	[2021-06-08 11:52:44.092620256] (Database.Keys) reconcileStaged end
+	ok
+
+All the new work happens in between those two debugs, so you could check if
+the time sink is there or elsewhere.
+
+(Note that the last release takes 2 seconds longer for that init than
+it does now..)
+"""]]

scanAnnexedFiles in smudge --update
This makes git checkout and git merge hooks do the work to catch up with
changes that they made to the tree. Rather than doing it at some later
point when the user is not thinking about that past operation.
Sponsored-by: Dartmouth College's Datalad project
diff --git a/Annex/Init.hs b/Annex/Init.hs
index 4bd0955ea..a552046a3 100644
--- a/Annex/Init.hs
+++ b/Annex/Init.hs
@@ -37,7 +37,6 @@ import Annex.UUID
 import Annex.WorkTree
 import Annex.Fixup
 import Annex.Path
-import Annex.Concurrent
 import Config
 import Config.Files
 import Config.Smudge
@@ -134,8 +133,7 @@ initialize' mversion = checkInitializeAllowed $ do
 		then configureSmudgeFilter
 		else deconfigureSmudgeFilter
 	unlessM isBareRepo $ do
-		showSideActionAfter oneSecond "scanning for annexed files" $
-			scanAnnexedFiles
+		scanAnnexedFiles True
 		hookWrite postCheckoutHook
 		hookWrite postMergeHook
 	AdjustedBranch.checkAdjustedClone >>= \case
diff --git a/Annex/WorkTree.hs b/Annex/WorkTree.hs
index ac9c49b27..33d909487 100644
--- a/Annex/WorkTree.hs
+++ b/Annex/WorkTree.hs
@@ -15,6 +15,8 @@ import Annex.Content
 import Annex.ReplaceFile
 import Annex.CurrentBranch
 import Annex.InodeSentinal
+import Annex.Concurrent
+import Utility.ThreadScheduler
 import Utility.InodeCache
 import Git.FilePath
 import Git.CatFile
@@ -78,8 +80,8 @@ ifAnnexed file yes no = maybe no yes =<< lookupKey file
  - But if worktree file does not have a pointer file's content, it is left
  - as-is.
  -}
-scanAnnexedFiles :: Annex ()
-scanAnnexedFiles = whenM (inRepo Git.Ref.headExists <&&> not <$> isBareRepo) $ do
+scanAnnexedFiles :: Bool -> Annex ()
+scanAnnexedFiles initscan = showSideActionAfter oneSecond "scanning for annexed files" $ do
 	-- This gets the keys database populated with all annexed files,
 	-- by running Database.Keys.reconcileStaged.
 	Database.Keys.runWriter (const noop)
@@ -88,14 +90,19 @@ scanAnnexedFiles = whenM (inRepo Git.Ref.headExists <&&> not <$> isBareRepo) $ d
 	-- annex object file already exists, but its inode is not yet
 	-- cached and annex.thin is set. So, the rest of this makes
 	-- another pass over the tree to do that.
-	whenM (annexThin <$> Annex.getGitConfig) $ do
-		g <- Annex.gitRepo
-		(l, cleanup) <- inRepo $ Git.LsTree.lsTree
-			Git.LsTree.LsTreeRecursive
-			(Git.LsTree.LsTreeLong True)
-			Git.Ref.headRef
-		catObjectStreamLsTree l want g go
-		liftIO $ void cleanup
+	whenM
+		( pure initscan
+		<&&> annexThin <$> Annex.getGitConfig
+		<&&> inRepo Git.Ref.headExists
+		<&&> not <$> isBareRepo
+		) $ do
+			g <- Annex.gitRepo
+			(l, cleanup) <- inRepo $ Git.LsTree.lsTree
+				Git.LsTree.LsTreeRecursive
+				(Git.LsTree.LsTreeLong True)
+				Git.Ref.headRef
+			catObjectStreamLsTree l want g go
+			liftIO $ void cleanup
   where
 	-- Want to process symlinks, and regular files.
 	want i = case Git.Types.toTreeItemType (Git.LsTree.mode i) of
diff --git a/Command/Smudge.hs b/Command/Smudge.hs
index cbecd055f..70fc9235a 100644
--- a/Command/Smudge.hs
+++ b/Command/Smudge.hs
@@ -13,6 +13,7 @@ import Annex.Link
 import Annex.FileMatcher
 import Annex.Ingest
 import Annex.CatFile
+import Annex.WorkTree
 import Logs.Smudge
 import Logs.Location
 import qualified Database.Keys
@@ -262,6 +263,11 @@ getMoveRaceRecovery k file = void $ tryNonAsync $
 
 update :: CommandStart
 update = do
+	-- This gets run after a git checkout or merge, so it's a good
+	-- point to refresh the keys database for changes to annexed files.
+	-- Doing it explicitly here avoids a later pause in the middle of
+	-- some other action.
+	scanAnnexedFiles False
 	updateSmudged (Restage True)
 	stop
 
diff --git a/Upgrade/V5.hs b/Upgrade/V5.hs
index 2db92d57f..22200b0c8 100644
--- a/Upgrade/V5.hs
+++ b/Upgrade/V5.hs
@@ -47,7 +47,7 @@ upgrade automatic = flip catchNonAsync onexception $ do
 		, do
 			checkGitVersionForIndirectUpgrade
 		)
-	scanAnnexedFiles
+	scanAnnexedFiles True
 	configureSmudgeFilter
 	-- Inode sentinal file was only used in direct mode and when
 	-- locking down files as they were added. In v6, it's used more
diff --git a/doc/todo/display_when_reconcileStaged_is_taking_a_long_time/comment_1_7efa1d29b475b445cea6fe44d402b275._comment b/doc/todo/display_when_reconcileStaged_is_taking_a_long_time/comment_1_7efa1d29b475b445cea6fe44d402b275._comment
new file mode 100644
index 000000000..517e507df
--- /dev/null
+++ b/doc/todo/display_when_reconcileStaged_is_taking_a_long_time/comment_1_7efa1d29b475b445cea6fe44d402b275._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-08T15:21:02Z"
+ content="""
+Made `git-annex smudge --update` run the scan, and so the post-checkout or
+post-merge hook will call it. 
+
+That avoids the scenario shown above. But adding a lot of files to the
+index can still cause a later pause for reconcileStaged without any
+indication of what it's doing.
+"""]]

Added a comment: slow down is OSX specific
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_10_37d5186dfa2da31526c4eafbbbfdbc33._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_10_37d5186dfa2da31526c4eafbbbfdbc33._comment
new file mode 100644
index 000000000..dddce7ad1
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_10_37d5186dfa2da31526c4eafbbbfdbc33._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="slow down is OSX specific"
+ date="2021-06-08T14:28:18Z"
+ content="""
+Thank you Joey for all the OPTs. 
+Looking at it again: the significant impact seems to be specific to the OSX builds: Linux and Windows builds got just a bit slower (if at all).  So maybe the situation is not that dire ;-)  Two possibilities:
+
+- Even though coincidences like this are unlikely, it is possible that OSX VMs on Github actions got slower?
+- Maybe the RFing is somehow affecting OSX specifically?
+
+I will try to do some timings locally.  Also, I should finally get those logs (and built packages) into some more convenient form using [tinuous](https://github.com/con/tinuous/), which John has been developing -- that would allow easy greps instead of all this jumping through CIs etc.
+"""]]

comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_9_431582f1c218f27e50849df1444dc845._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_9_431582f1c218f27e50849df1444dc845._comment
new file mode 100644
index 000000000..2676d6509
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_9_431582f1c218f27e50849df1444dc845._comment
@@ -0,0 +1,33 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 9"""
+ date="2021-06-08T13:45:18Z"
+ content="""
+I do not see 1h40min anywhere on the page you linked. It says
+2h 6m at the top. Oh I see, you are clicking around to get to
+https://github.com/datalad/git-annex/runs/2661363385?check_suite_focus=true,
+and that number is in there next to "Run datalad tests".
+
+My improvements yesterday did not improve your test suite time any. But
+they certainly sped it up a *lot* in my benchmarks. So I think what you
+think is taking more time, eg the scanning at init, does not really have
+much to do with whatever is really taking more time. And if your test suite
+is mostly not cloning repos but is initializing new empty repos, then the
+scanning at init was and still is effectively a noop, it will not have
+gotten more expensive.
+
+It might be that the incremental updating git-annex now does when it sees
+changes to the index is making your test suite run a bit longer. But twice
+as long? That is not a very expensive process.
+
+Also notice that the git-annex test part of the CI job used to take
+15m54s and is now taking 17m29s. That's doing some cloning and some adding
+of files etc, and it didn't double in run time.
+
+I did find another nice optimisation this morning
+[[!commit c831a562f550e6ff93e081f555eced3a8a0d524f]], so we can see if that
+improves things in the next run.
+
+I suspect you're going to have to look at specific parts of your test suite
+and/or the CI system to identify what's slower though.
+"""]]

Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_8_29bde900ef625702b49cf020118f940f._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_8_29bde900ef625702b49cf020118f940f._comment
new file mode 100644
index 000000000..e773865a7
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_8_29bde900ef625702b49cf020118f940f._comment
@@ -0,0 +1,16 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 8"
+ date="2021-06-07T21:39:05Z"
+ content="""
+FWIW:
+
+> I assume these tests are creating lots of clones of repositories.
+
+lots of `git annex init` in new git repos.  Also a good number of clones AFAIK.  FWIW: we used to do more clones instead of recreating a new repo per test, but then we switched to just creating a new repo.  We still do a good number of clones, though.
+
+> Are they also doing lots of merges or switching between branches?
+
+Not that many AFAIK.
+"""]]

Added a comment: clarification
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_7_35217e61ccf6936e105df8de7e1775d0._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_7_35217e61ccf6936e105df8de7e1775d0._comment
new file mode 100644
index 000000000..8a32e5a42
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_7_35217e61ccf6936e105df8de7e1775d0._comment
@@ -0,0 +1,16 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="clarification"
+ date="2021-06-07T21:20:35Z"
+ content="""
+For completeness:
+
+> Looking at the CI log, I don't see any past runs that took 1h 46min. A month ago they were taking 2h 6min. 
+
+I was looking at the OSX ones. And the timing was for the run right before the slowdown: it was [this one](https://github.com/datalad/git-annex/actions/runs/873390021), but even the [previous one](https://github.com/datalad/git-annex/actions/runs/869970007) was 1h40min
+
+> Let's see if the changes I'm pushing now drop it back to that.
+
+🤞
+"""]]

todo
diff --git a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_20_e9a36e9600561201969c4d21499833af._comment b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_20_e9a36e9600561201969c4d21499833af._comment
index 8b8ffef1f..ea193d0c2 100644
--- a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_20_e9a36e9600561201969c4d21499833af._comment
+++ b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_20_e9a36e9600561201969c4d21499833af._comment
@@ -13,4 +13,16 @@ Fixed not to use reconcileStaged it took 37 seconds.
 (Keeping reconcileStaged and removing scanAnnexedFiles it took 47 seconds.
 That makes sense; reconcileStaged is an incremental updater and is not
 able to use SQL as efficiently as scanAnnexedFiles.)
+
+---
+
+Also the git clone of that 100,000 file repo itself, from another repo on
+the same SSD, takes 9 seconds. git-annex init taking 4x as long as
+a fast local git clone to do a scan is not bad.
+
+This is EOT for me, but I will accept patches if someone wants to make
+git-annex faster.
+
+(Also see
+[[todo/display_when_reconcileStaged_is_taking_a_long_time]])
 """]]
diff --git a/doc/todo/display_when_reconcileStaged_is_taking_a_long_time.mdwn b/doc/todo/display_when_reconcileStaged_is_taking_a_long_time.mdwn
new file mode 100644
index 000000000..20b645d6e
--- /dev/null
+++ b/doc/todo/display_when_reconcileStaged_is_taking_a_long_time.mdwn
@@ -0,0 +1,36 @@
+Consider this, where branch foo has ten to a hundred thousand files
+not in the master branch:
+	
+	git checkout foo
+	touch newfile
+	git annex add newfile
+
+After recent changes to reconcileStaged, the result can be:
+
+	add newfile 0b 100% # cursor sits here for several seconds
+
+This is because it has to look in the keys db to see if there's an
+associated file that's unlocked and needs populating with the content of
+this newly available key, so it does reconcileStaged, which can take some
+time.
+
+One fix would be, if reconcileStaged is taking a long time, make it display
+a note about what it's doing:
+
+	add newfile 0b 100% (scanning annexed files...)
+
+It would also be possible to do the scan before starting to add files,
+which would look more consistent and would avoid it getting stuck
+with the progress display in view:
+
+	(scanning annexed files...)
+	add newfile ok
+
+It might also be possible to make reconcileStaged run a less expensive
+scan in this case, eg the scan it did before
+[[!commit 428c91606b434512d1986622e751c795edf4df44]]. In this case, it
+only really cares about associated files that are unlocked, and so
+diffing from HEAD to the index is sufficient, because the git checkout
+will have run the smudge filter on all the unlocked ones in HEAD and so it
+will already know about those associated files. However, I can't say I like
+this idea much because it complicates using the keys db significantly.
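
As a rough illustration of the cheaper scan floated at the end of this
todo (diffing from HEAD to the index), a minimal standalone sketch,
using a plain process call rather than git-annex's internal git runner:

	import System.Process (readProcess)

	-- Split the NUL-delimited output produced by git diff -z.
	splitNul :: String -> [String]
	splitNul s = case break (== '\0') s of
	    (a, [])     -> [a | not (null a)]
	    (a, _:rest) -> a : splitNul rest

	-- Raw records for what is staged relative to HEAD; with -z the
	-- info record and the file path alternate in the output.
	diffStagedRaw :: IO [String]
	diffStagedRaw = splitNul <$> readProcess "git"
	    ["diff", "--cached", "--raw", "-z", "--no-abbrev"] ""

	main :: IO ()
	main = diffStagedRaw >>= mapM_ putStrLn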

avoid double work in git-annex init
reconcileStaged was doing a scan redundant with scanAnnexedFiles.
It would probably make sense to move the body of scanAnnexedFiles
into reconcileStaged; the separation does not really serve any purpose.
Sponsored-by: Dartmouth College's Datalad project
diff --git a/Annex/WorkTree.hs b/Annex/WorkTree.hs
index b9d45b483..8d09adf7a 100644
--- a/Annex/WorkTree.hs
+++ b/Annex/WorkTree.hs
@@ -70,27 +70,29 @@ ifAnnexed file yes no = maybe no yes =<< lookupKey file
 
 {- Find all annexed files and update the keys database for them.
  - 
- - This is expensive, and so normally the associated files are updated
- - incrementally when changes are noticed. So, this only needs to be done
- - when initializing/upgrading a repository.
- -
  - Also, the content for an unlocked file may already be present as
  - an annex object. If so, populate the pointer file with it.
  - But if worktree file does not have a pointer file's content, it is left
  - as-is.
+ -
+ - Normally the keys database is updated incrementally when changes are
+ - noticed. For an initial scan, this is faster than that incremental
+ - update.
  -}
 scanAnnexedFiles :: Annex ()
-scanAnnexedFiles = whenM (inRepo Git.Ref.headExists <&&> not <$> isBareRepo) $ do
-	g <- Annex.gitRepo
-	Database.Keys.runWriter $
-		liftIO . Database.Keys.SQL.dropAllAssociatedFiles
-	(l, cleanup) <- inRepo $ Git.LsTree.lsTree
-		Git.LsTree.LsTreeRecursive
-		(Git.LsTree.LsTreeLong True)
-		Git.Ref.headRef
-	catObjectStreamLsTree l want g go
-	liftIO $ void cleanup
+scanAnnexedFiles = whenM (inRepo Git.Ref.headExists <&&> not <$> isBareRepo) $
+	Database.Keys.runWriter' (Just reconciler) (const noop)
   where
+	reconciler dbh = do
+		g <- Annex.gitRepo
+		liftIO $ Database.Keys.SQL.dropAllAssociatedFiles dbh
+		(l, cleanup) <- inRepo $ Git.LsTree.lsTree
+			Git.LsTree.LsTreeRecursive
+			(Git.LsTree.LsTreeLong True)
+			Git.Ref.headRef
+		catObjectStreamLsTree l want g (go dbh)
+		liftIO $ void cleanup
+	
 	-- Want to process symlinks, and regular files.
 	want i = case Git.Types.toTreeItemType (Git.LsTree.mode i) of
 		Just Git.Types.TreeSymlink -> Just (i, False)
@@ -103,17 +105,16 @@ scanAnnexedFiles = whenM (inRepo Git.Ref.headExists <&&> not <$> isBareRepo) $ d
 		Just n | n < maxPointerSz -> Just (i, True)
 		_ -> Nothing
 	
-	go getnext = liftIO getnext >>= \case
+	go dbh getnext = liftIO getnext >>= \case
 		Just ((i, isregfile), Just c) -> do
-			maybe noop (add i isregfile)
+			maybe noop (add i isregfile dbh)
 				(parseLinkTargetOrPointer (L.toStrict c))
-			go getnext
+			go dbh getnext
 		_ -> return ()
 	
-	add i isregfile k = do
+	add i isregfile dbh k = do
 		let tf = Git.LsTree.file i
-		Database.Keys.runWriter $
-			liftIO . Database.Keys.SQL.addAssociatedFileFast k tf
+		liftIO $ Database.Keys.SQL.addAssociatedFileFast k tf dbh
 		whenM (pure isregfile <&&> inAnnex k) $ do
 			f <- fromRepo $ fromTopFilePath tf
 			liftIO (isPointerFile f) >>= \case
diff --git a/Database/Keys.hs b/Database/Keys.hs
index 8b6edd80f..778a0af1a 100644
--- a/Database/Keys.hs
+++ b/Database/Keys.hs
@@ -23,6 +23,7 @@ module Database.Keys (
 	removeInodeCache,
 	isInodeKnown,
 	runWriter,
+	runWriter',
 ) where
 
 import qualified Database.Keys.SQL as SQL
@@ -73,7 +74,7 @@ runReader a = do
 		v <- a (SQL.ReadHandle qh)
 		return (v, st)
 	go DbClosed = do
-		st' <- openDb True DbClosed
+		st' <- openDb True Nothing DbClosed
 		v <- case st' of
 			(DbOpen qh) -> a (SQL.ReadHandle qh)
 			_ -> return mempty
@@ -87,7 +88,15 @@ runReaderIO a = runReader (liftIO . a)
  -
  - The database is created if it doesn't exist yet. -}
 runWriter :: (SQL.WriteHandle -> Annex ()) -> Annex ()
-runWriter a = do
+runWriter = runWriter' Nothing
+
+runWriterIO :: (SQL.WriteHandle -> IO ()) -> Annex ()
+runWriterIO a = runWriter (liftIO . a)
+
+{- When a reconcile action is passed, it is run by reconcileStaged instead
+ - of its usual scan, and must update the database in the same way. -}
+runWriter' :: Maybe (SQL.WriteHandle -> Annex ()) -> (SQL.WriteHandle -> Annex ()) -> Annex ()
+runWriter' reconciler a = do
 	h <- Annex.getRead Annex.keysdbhandle
 	withDbState h go
   where
@@ -95,15 +104,12 @@ runWriter a = do
 		v <- a (SQL.WriteHandle qh)
 		return (v, st)
 	go st = do
-		st' <- openDb False st
+		st' <- openDb False reconciler st
 		v <- case st' of
 			DbOpen qh -> a (SQL.WriteHandle qh)
 			_ -> error "internal"
 		return (v, st')
 
-runWriterIO :: (SQL.WriteHandle -> IO ()) -> Annex ()
-runWriterIO a = runWriter (liftIO . a)
-
 {- Opens the database, creating it if it doesn't exist yet.
  -
  - Multiple readers and writers can have the database open at the same
@@ -112,10 +118,10 @@ runWriterIO a = runWriter (liftIO . a)
  - the database doesn't exist yet, one caller wins the lock and
  - can create it undisturbed.
  -}
-openDb :: Bool -> DbState -> Annex DbState
-openDb _ st@(DbOpen _) = return st
-openDb False DbUnavailable = return DbUnavailable
-openDb forwrite _ = catchPermissionDenied permerr $ withExclusiveLock gitAnnexKeysDbLock $ do
+openDb :: Bool -> (Maybe (SQL.WriteHandle -> Annex ())) -> DbState -> Annex DbState
+openDb _ _ st@(DbOpen _) = return st
+openDb False _ DbUnavailable = return DbUnavailable
+openDb forwrite reconciler _ = catchPermissionDenied permerr $ withExclusiveLock gitAnnexKeysDbLock $ do
 	dbdir <- fromRepo gitAnnexKeysDb
 	let db = dbdir P.</> "db"
 	dbexists <- liftIO $ R.doesPathExist db
@@ -133,7 +139,7 @@ openDb forwrite _ = catchPermissionDenied permerr $ withExclusiveLock gitAnnexKe
 	
 	open db = do
 		qh <- liftIO $ H.openDbQueue H.MultiWriter db SQL.containedTable
-		reconcileStaged qh
+		reconcileStaged qh reconciler
 		return $ DbOpen qh
 
 {- Closes the database if it was open. Any writes will be flushed to it.
@@ -223,8 +229,8 @@ isInodeKnown i s = or <$> runReaderIO ((:[]) <$$> SQL.isInodeKnown i s)
  - So when using getAssociatedFiles, have to make sure the file still
  - is an associated file.
  -}
-reconcileStaged :: H.DbQueue -> Annex ()
-reconcileStaged qh = do
+reconcileStaged :: H.DbQueue -> Maybe (SQL.WriteHandle -> Annex ()) -> Annex ()
+reconcileStaged qh mreconciler = do
 	gitindex <- inRepo currentIndexFile
 	indexcache <- fromRawFilePath <$> fromRepo gitAnnexKeysDbIndexCache
 	withTSDelta (liftIO . genInodeCache gitindex) >>= \case
@@ -246,12 +252,21 @@ reconcileStaged qh = do
 	go cur indexcache (Just newtree) = do
 		oldtree <- getoldtree
 		when (oldtree /= newtree) $ do
+			case mreconciler of
+				Just reconciler ->
+					reconciler (SQL.WriteHandle qh)
+				Nothing -> noop
 			g <- Annex.gitRepo
-			void $ catstream $ \mdfeeder -> 
-				void $ updatetodiff g
-					(Just (fromRef oldtree)) 
-					(fromRef newtree)
-					(procdiff mdfeeder)
+			void $ catstream $ \mdfeeder -> do
+				case mreconciler of
+					Nothing -> void $ updatetodiff g
+						(Just (fromRef oldtree)) 
+						(fromRef newtree)
+						(procdiff mdfeeder)
+					Just _ -> void $ updatetodiff g
+						Nothing "--staged"
+						(procdiff mdfeeder)
+
 			liftIO $ writeFile indexcache $ showInodeCache cur
 			-- Storing the tree in a ref makes sure it does not
 			-- get garbage collected, and is available to diff
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_4_85d1031d2b51c0fc1271c283d8ee7888._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_4_85d1031d2b51c0fc1271c283d8ee7888._comment
new file mode 100644
index 000000000..27c306ba7
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_4_85d1031d2b51c0fc1271c283d8ee7888._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 4"""
+ date="2021-06-07T20:22:24Z"
+ content="""
+Turns out `git-annex init` got a lot slower than it had to; it was doing

(Diff truncated)
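
The shape of the change above -- an optional action that, when supplied,
replaces the default incremental update -- boils down to a Maybe
dispatch. A hypothetical sketch (the names are illustrative, not the
real git-annex API):

	import Data.Maybe (fromMaybe)

	-- When a reconciler is passed it runs instead of the usual
	-- incremental scan; Nothing falls back to the default.
	reconcileWith :: Maybe (a -> IO ()) -> (a -> IO ()) -> a -> IO ()
	reconcileWith mreconciler defaultScan =
	    fromMaybe defaultScan mreconciler

	main :: IO ()
	main = do
	    let incremental name = putStrLn ("incremental update of " ++ name)
	        fullscan name = putStrLn ("full scan of " ++ name)
	    reconcileWith Nothing incremental "keys db"
	    reconcileWith (Just fullscan) incremental "keys db"
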
optimise reconcileStaged with git cat-file streaming
Commit 428c91606b434512d1986622e751c795edf4df44 made it need to do more
work in situations like switching between very different branches.
Compare with seekFilteredKeys which has a similar optimisation. Might be
possible to factor out the common part from these?
Sponsored-by: Dartmouth College's Datalad project
diff --git a/Database/Keys.hs b/Database/Keys.hs
index 633b654c1..8b6edd80f 100644
--- a/Database/Keys.hs
+++ b/Database/Keys.hs
@@ -34,7 +34,6 @@ import Annex.Locations
 import Annex.Common hiding (delete)
 import qualified Annex
 import Annex.LockFile
-import Annex.CatFile
 import Annex.Content.PointerFile
 import Annex.Link
 import Utility.InodeCache
@@ -45,6 +44,7 @@ import Git.Command
 import Git.Types
 import Git.Index
 import Git.Sha
+import Git.CatFile
 import Git.Branch (writeTreeQuiet, update')
 import qualified Git.Ref
 import Config.Smudge
@@ -53,6 +53,7 @@ import qualified Utility.RawFilePath as R
 import qualified Data.ByteString as S
 import qualified Data.ByteString.Char8 as S8
 import qualified System.FilePath.ByteString as P
+import Control.Concurrent.Async
 
 {- Runs an action that reads from the database.
  -
@@ -241,12 +242,16 @@ reconcileStaged qh = do
 		<$> catchMaybeIO (readFile indexcache)
 
 	getoldtree = fromMaybe emptyTree <$> inRepo (Git.Ref.sha lastindexref)
-
+	
 	go cur indexcache (Just newtree) = do
 		oldtree <- getoldtree
 		when (oldtree /= newtree) $ do
-			updatetodiff (Just (fromRef oldtree)) (fromRef newtree) procdiff
-				>>= flushdb . fst
+			g <- Annex.gitRepo
+			void $ catstream $ \mdfeeder -> 
+				void $ updatetodiff g
+					(Just (fromRef oldtree)) 
+					(fromRef newtree)
+					(procdiff mdfeeder)
 			liftIO $ writeFile indexcache $ showInodeCache cur
 			-- Storing the tree in a ref makes sure it does not
 			-- get garbage collected, and is available to diff
@@ -264,24 +269,18 @@ reconcileStaged qh = do
 	-- is done, with --staged but no old tree.
 	go _ _ Nothing = do
 		oldtree <- getoldtree
-		(changed, conflicted) <- updatetodiff
-			(Just (fromRef oldtree)) "--staged" procdiff
-		changed' <- if conflicted
-			then fst <$> updatetodiff Nothing "--staged"
-				procmergeconflictdiff
-			else pure False
-		flushdb (changed || changed')
-		
-	updatetodiff old new processor = do
-		(l, cleanup) <- inRepo $ pipeNullSplit' $ diff old new
-		processor l False False
-			`finally` void (liftIO cleanup)
+		g <- Annex.gitRepo
+		catstream $ \mdfeeder -> do
+			conflicted <- updatetodiff g
+				(Just (fromRef oldtree)) "--staged" (procdiff mdfeeder)
+			when conflicted $
+				void $ updatetodiff g Nothing "--staged"
+					(procmergeconflictdiff mdfeeder)
 	
-	-- Flush database changes immediately
-	-- so other processes can see them.
-	flushdb changed
-		| changed = liftIO $ H.flushDbQueue qh
-		| otherwise = noop
+	updatetodiff g old new processor = do
+		(l, cleanup) <- pipeNullSplit' (diff old new) g
+		processor l False
+			`finally` void cleanup
 	
 	-- Avoid running smudge clean filter, which would block trying to
 	-- access the locked database. git write-tree sometimes calls it,
@@ -320,22 +319,21 @@ reconcileStaged qh = do
 		, Param "--no-ext-diff"
 		]
 	
-	procdiff (info:file:rest) changed conflicted
+	procdiff mdfeeder (info:file:rest) conflicted
 		| ":" `S.isPrefixOf` info = case S8.words info of
 			(_colonsrcmode:dstmode:srcsha:dstsha:status:[]) -> do
 				let conflicted' = status == "U"
 				-- avoid removing associated file when
 				-- there is a merge conflict
-				removed <- if not conflicted'
-					then catKey (Ref srcsha) >>= \case
+				unless conflicted' $
+					send mdfeeder (Ref srcsha) $ \case
 						Just oldkey -> do
 							liftIO $ SQL.removeAssociatedFile oldkey
 								(asTopFilePath file)
 								(SQL.WriteHandle qh)
 							return True
 						Nothing -> return False
-					else return False
-				added <- catKey (Ref dstsha) >>= \case
+				send mdfeeder (Ref dstsha) $ \case
 					Just key -> do
 						liftIO $ SQL.addAssociatedFile key
 							(asTopFilePath file)
@@ -344,32 +342,30 @@ reconcileStaged qh = do
 							reconcilerace (asTopFilePath file) key
 						return True
 					Nothing -> return False
-				procdiff rest
-					(changed || removed || added)
+				procdiff mdfeeder rest
 					(conflicted || conflicted')
-			_ -> return (changed, conflicted) -- parse failed
-	procdiff _ changed conflicted = return (changed, conflicted)
-	
+			_ -> return conflicted -- parse failed
+	procdiff _ _ conflicted = return conflicted
+
 	-- Processing a diff --index when there is a merge conflict.
 	-- This diff will have the new local version of a file as the
 	-- first sha, and a null sha as the second sha, and we only
 	-- care about files that are in conflict.
-	procmergeconflictdiff (info:file:rest) changed conflicted
+	procmergeconflictdiff mdfeeder (info:file:rest) conflicted
 		| ":" `S.isPrefixOf` info = case S8.words info of
 			(_colonmode:_mode:sha:_sha:status:[]) -> do
-				let conflicted' = status == "U"
-				added <- catKey (Ref sha) >>= \case
+				send mdfeeder (Ref sha) $ \case
 					Just key -> do
 						liftIO $ SQL.addAssociatedFile key
 							(asTopFilePath file)
 							(SQL.WriteHandle qh)
 						return True
 					Nothing -> return False
-				procmergeconflictdiff rest
-					(changed || added)
+				let conflicted' = status == "U"
+				procmergeconflictdiff mdfeeder rest
 					(conflicted || conflicted')
-			_ -> return (changed, conflicted) -- parse failed
-	procmergeconflictdiff _ changed conflicted = return (changed, conflicted)
+			_ -> return conflicted -- parse failed
+	procmergeconflictdiff _ _ conflicted = return conflicted
 
 	reconcilerace file key = do
 		caches <- liftIO $ SQL.getInodeCaches key (SQL.ReadHandle qh)
@@ -385,3 +381,41 @@ reconcileStaged qh = do
 						SQL.addInodeCaches key [ic] (SQL.WriteHandle qh)
 			(False, True) -> depopulatePointerFile key p
 			_ -> return ()
+	
+	send :: ((Maybe Key -> Annex a, Ref) -> IO ()) -> Ref -> (Maybe Key -> Annex a) -> IO ()
+	send feeder r withk = feeder (withk, r)
+
+	-- Streaming through git cat-file like this is significantly
+	-- faster than using catKey.
+	catstream a = do
+		g <- Annex.gitRepo
+		catObjectMetaDataStream g $ \mdfeeder mdcloser mdreader ->
+			catObjectStream g $ \catfeeder catcloser catreader -> do
+				feedt <- liftIO $ async $
+					a mdfeeder
+						`finally` void mdcloser
+				proct <- liftIO $ async $
+					procthread mdreader catfeeder
+						`finally` void catcloser
+				dbchanged <- dbwriter False catreader
+				-- Flush database changes now
+				-- so other processes can see them.
+				when dbchanged $
+					liftIO $ H.flushDbQueue qh
+				() <- liftIO $ wait feedt
+				liftIO $ wait proct
+				return ()
+	  where
+		procthread mdreader catfeeder = mdreader >>= \case
+			Just (ka, Just (sha, size, _type))
+				| size < maxPointerSz -> do
+					() <- catfeeder (ka, sha)
+					procthread mdreader catfeeder
+			Just _ -> procthread mdreader catfeeder
+			Nothing -> return ()
+
+		dbwriter dbchanged catreader = liftIO catreader >>= \case
+			Just (ka, content) -> do
+				changed <- ka (parseLinkTargetOrPointerLazy =<< content)
+				dbwriter (dbchanged || changed) catreader
+			Nothing -> return dbchanged
diff --git a/doc/todo/speed_up_keys_db_update_with_git_streaming.mdwn b/doc/todo/speed_up_keys_db_update_with_git_streaming.mdwn
index 61df4706a..ce67b6a99 100644
--- a/doc/todo/speed_up_keys_db_update_with_git_streaming.mdwn
+++ b/doc/todo/speed_up_keys_db_update_with_git_streaming.mdwn

(Diff truncated)
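
The streaming used in this commit pairs a feeder thread with a reader,
instead of one blocking round-trip per object the way catKey works. A
generic sketch of that shape (a plain Chan pipeline, not the actual
catObjectStream interface):

	import Control.Concurrent.Async (concurrently_)
	import Control.Concurrent.Chan (newChan, readChan, writeChan)

	-- One thread feeds requests while another consumes responses,
	-- so neither side waits for a full round-trip per item.
	main :: IO ()
	main = do
	    ch <- newChan
	    let feeder = do
	            mapM_ (writeChan ch . Just) ([1 .. 10] :: [Int])
	            writeChan ch Nothing  -- end-of-stream marker
	        reader = do
	            v <- readChan ch
	            case v of
	                Just n -> print n >> reader
	                Nothing -> return ()
	    concurrently_ feeder reader
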
Added a comment: deferring the scan
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_4_4011786784a140442dd7ecd1cefe559b._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_4_4011786784a140442dd7ecd1cefe559b._comment
new file mode 100644
index 000000000..40b04f2f7
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_4_4011786784a140442dd7ecd1cefe559b._comment
@@ -0,0 +1,11 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="deferring the scan"
+ date="2021-06-07T17:41:44Z"
+ content="""
+>The only way to fix it that would not have an impact on performance would be to remove include= and exclude= from the preferred content expression syntax
+
+What about deferring the scan until the first command that uses `include=/exclude=` gets run, if `annex.supportunlocked=false`?
+Also, could the scan be limited to the files that match one of the include/exclude globs?
+"""]]

comment
diff --git a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_19_b34a9cf4114ed943fe4ba2de78eb0bbc._comment b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_19_b34a9cf4114ed943fe4ba2de78eb0bbc._comment
new file mode 100644
index 000000000..2f68e4129
--- /dev/null
+++ b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_19_b34a9cf4114ed943fe4ba2de78eb0bbc._comment
@@ -0,0 +1,19 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 19"""
+ date="2021-06-07T17:07:26Z"
+ content="""
+It can't know if there are unlocked files without doing this scan.
+
+Except for when annex.supportunlocked=false, but then that config option
+would have the side effect of making git-annex *slower* at some point after
+init, with the situations where it does so being hard to enumerate and
+probably growing. This would be a hard behavior to explain to the user.
+
+And there are numerous other points than the ones you listed where
+git-annex accesses the keys db and would trigger a deferred scan. Eg, anytime
+it might need to update a pointer file. Eg, when `git annex get` is run. 
+Avoiding using the keys db when annex.supportunlocked=false in all such
+cases in order to avoid the scan would be effectively the same complexity
+as continuing to support v5 repos, which I've NAKed before.
+""]]

fix link
diff --git a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment
index 6927ab008..2d9c0444a 100644
--- a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment
+++ b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment
@@ -7,7 +7,7 @@
 >The scan could be done lazily, but there are situations that use the database where unexpectedly taking a much longer time than usual would be a real problem
 
 For unlocked files, certainly.  When `annex.supportunlocked=false`, it sounded like the only situation that uses the database is `drop --auto`, or a [[matching expression|git-annex-matching-options]] with `--includesamecontent/--excludesamecontent`?  (And maybe [[todo/git-annex_whereused]]).
-Personally I would prefer an unexpected delay in these rare cases, to a [delay](https://git-annex.branchable.com/bugs/significant_performance_regression_impacting_data!) in the more common case of checking out or switching branches.
+Personally I would prefer an unexpected delay in these rare cases, to a [delay](https://git-annex.branchable.com/bugs/significant_performance_regression_impacting_datal) in the more common case of checking out or switching branches.
  
 
 

correctly update keys db in merge conflict
This is quite a subtle edge case, see the bug report for full details.
The second git diff is needed only when there's a merge conflict.
It would be possible to speed it up marginally by using
--diff-filter=Unmerged, but probably not enough to bother with.
Sponsored-by: Graham Spencer on Patreon
diff --git a/Database/Keys.hs b/Database/Keys.hs
index 3154118fd..633b654c1 100644
--- a/Database/Keys.hs
+++ b/Database/Keys.hs
@@ -245,7 +245,8 @@ reconcileStaged qh = do
 	go cur indexcache (Just newtree) = do
 		oldtree <- getoldtree
 		when (oldtree /= newtree) $ do
-			updatetodiff (fromRef oldtree) (fromRef newtree)
+			updatetodiff (Just (fromRef oldtree)) (fromRef newtree) procdiff
+				>>= flushdb . fst
 			liftIO $ writeFile indexcache $ showInodeCache cur
 			-- Storing the tree in a ref makes sure it does not
 			-- get garbage collected, and is available to diff
@@ -253,22 +254,34 @@ reconcileStaged qh = do
 			inRepo $ update' lastindexref newtree
 	-- git write-tree will fail if the index is locked or when there is
 	-- a merge conflict. To get up-to-date with the current index, 
-	-- diff --cached with the old index tree. The current index tree
+	-- diff --staged with the old index tree. The current index tree
 	-- is not known, so not recorded, and the inode cache is not updated,
 	-- so the next time git-annex runs, it will diff again, even
 	-- if the index is unchanged.
+	--
+	-- When there is a merge conflict, that will not see the new local
+	-- version of the files that are conflicted. So a second diff
+	-- is done, with --staged but no old tree.
 	go _ _ Nothing = do
 		oldtree <- getoldtree
-		updatetodiff (fromRef oldtree) "--cached"
+		(changed, conflicted) <- updatetodiff
+			(Just (fromRef oldtree)) "--staged" procdiff
+		changed' <- if conflicted
+			then fst <$> updatetodiff Nothing "--staged"
+				procmergeconflictdiff
+			else pure False
+		flushdb (changed || changed')
 		
-	updatetodiff old new = do
+	updatetodiff old new processor = do
 		(l, cleanup) <- inRepo $ pipeNullSplit' $ diff old new
-		changed <- procdiff l False
-		void $ liftIO cleanup
-		-- Flush database changes immediately
-		-- so other processes can see them.
-		when changed $
-			liftIO $ H.flushDbQueue qh
+		processor l False False
+			`finally` void (liftIO cleanup)
+	
+	-- Flush database changes immediately
+	-- so other processes can see them.
+	flushdb changed
+		| changed = liftIO $ H.flushDbQueue qh
+		| otherwise = noop
 	
 	-- Avoid running smudge clean filter, which would block trying to
 	-- access the locked database. git write-tree sometimes calls it,
@@ -288,8 +301,8 @@ reconcileStaged qh = do
 		-- (The -G option may make it be used otherwise.)
 		[ Param "-c", Param "diff.external="
 		, Param "diff"
-		, Param old
-		, Param new
+		] ++ maybeToList (Param <$> old) ++
+		[ Param new
 		, Param "--raw"
 		, Param "-z"
 		, Param "--no-abbrev"
@@ -307,12 +320,13 @@ reconcileStaged qh = do
 		, Param "--no-ext-diff"
 		]
 	
-	procdiff (info:file:rest) changed
+	procdiff (info:file:rest) changed conflicted
 		| ":" `S.isPrefixOf` info = case S8.words info of
 			(_colonsrcmode:dstmode:srcsha:dstsha:status:[]) -> do
+				let conflicted' = status == "U"
 				-- avoid removing associated file when
 				-- there is a merge conflict
-				removed <- if status /= "U" 
+				removed <- if not conflicted'
 					then catKey (Ref srcsha) >>= \case
 						Just oldkey -> do
 							liftIO $ SQL.removeAssociatedFile oldkey
@@ -330,9 +344,32 @@ reconcileStaged qh = do
 							reconcilerace (asTopFilePath file) key
 						return True
 					Nothing -> return False
-				procdiff rest (changed || removed || added)
-			_ -> return changed -- parse failed
-	procdiff _ changed = return changed
+				procdiff rest
+					(changed || removed || added)
+					(conflicted || conflicted')
+			_ -> return (changed, conflicted) -- parse failed
+	procdiff _ changed conflicted = return (changed, conflicted)
+	
+	-- Processing a diff --index when there is a merge conflict.
+	-- This diff will have the new local version of a file as the
+	-- first sha, and a null sha as the second sha, and we only
+	-- care about files that are in conflict.
+	procmergeconflictdiff (info:file:rest) changed conflicted
+		| ":" `S.isPrefixOf` info = case S8.words info of
+			(_colonmode:_mode:sha:_sha:status:[]) -> do
+				let conflicted' = status == "U"
+				added <- catKey (Ref sha) >>= \case
+					Just key -> do
+						liftIO $ SQL.addAssociatedFile key
+							(asTopFilePath file)
+							(SQL.WriteHandle qh)
+						return True
+					Nothing -> return False
+				procmergeconflictdiff rest
+					(changed || added)
+					(conflicted || conflicted')
+			_ -> return (changed, conflicted) -- parse failed
+	procmergeconflictdiff _ changed conflicted = return (changed, conflicted)
 
 	reconcilerace file key = do
 		caches <- liftIO $ SQL.getInodeCaches key (SQL.ReadHandle qh)
diff --git a/doc/bugs/case_where_keys_db_lags_reality.mdwn b/doc/bugs/case_where_keys_db_lags_reality.mdwn
index 17a799d2f..87ea7fa67 100644
--- a/doc/bugs/case_where_keys_db_lags_reality.mdwn
+++ b/doc/bugs/case_where_keys_db_lags_reality.mdwn
@@ -1,10 +1,10 @@
 Found a case where the associated files in the keys db end up out-of-date.
-Make a repo with an unlocked file, clone it to a second repo, and set up a
+Make a repo with a locked file, clone it to a second repo, and set up a
 conflict involving that file in both repos, using git-annex add to add the
-conflicting version, and not running other git-annex commands after that,
-before pulling the conflicting branch. When the associated files db
-gets updated in the conflict situation, only 1 key has the conflicting file
-associated with it, rather than 2 or 3.
+conflicting version, committing, and not running other git-annex commands
+after that, before pulling the conflicting branch. When the associated
+files db gets updated in the conflict situation, only 1 key has the
+conflicting file associated with it, rather than 2 or 3.
 
 The original key before the conflict has the file associated with it, but
 the new local key and new remote key do not.
@@ -21,8 +21,35 @@ git-annex updates the keys db. So, one solution to this bug will be for
 git-annex to also update the keys db when staging locked files.
 (Unfortunately this would make mass adds somewhat slower.)
 
-Or, possibly, for reconcileStaged to not use git diff --index in this case,
+Or, possibly, for reconcileStaged to not use git diff --cached in this case,
 but git diff with -1 and -3. That lets both sides of the merge conflict be
 accessed, and it could then add the file to both keys. As well as not
 slowing down git-annex add, this would let it honor the preferred content
 of the conflicting file for all 3 keys. --[[Joey]]
+
+> On second thought, it's not really necessary that all 3 keys have the
+> conflicted file associated with them. The original key doesn't because
+> the user has already changed the file to use the new key. The new remote
+> key does not really need to, and there might not even be any effect if it
+> did. The new local key is the one that this bug is really about.
+> 
+> Consider that checkDrop uses catKeyFile to double-check the associated
+> files. And that will see the file pointing to the new local key. So
+> if the original key or new remote key are also associated with the file,
+> it will ignore them and drop anyway. And that's ok, from the user's
+> perspective the one it needs to retain is the one that the file in the
+> working tree uses, which is the new local key.
+> 
+> > Hmm, -1 and -3 are not what's needed to get the new local key.
+> > It's using `git diff oldtree --cached`, and the code preserves the old
+> > key when it sees a merge conflict. Using instead
+> > `git diff HEAD --cached` has the new key as the src sha, and nullsha as
+> > the dst sha.
+> >
+> > However, the diff with the old tree is needed to incrementally
+> > update when it's not in the middle of a merge conflict.
+> > So what can be done is do the diff as now; when it sees a merge
+> > conflict, run diff a second time with `HEAD --cached` to get the new
+> > key. 
+> > 
+> > > [[done]] --[[Joey]]
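
The conflict detection in procdiff above keys off the status field of
each raw diff record. A tiny standalone sketch of that check
(illustrative only, using Strings rather than the ByteStrings above):

	-- A raw diff info record looks like
	--   :100644 100644 <srcsha> <dstsha> U
	-- and a trailing "U" status marks an unmerged (conflicted) path.
	isUnmergedRecord :: String -> Bool
	isUnmergedRecord info = case words info of
	    (_srcmode:_dstmode:_srcsha:_dstsha:status:[]) -> status == "U"
	    _ -> False

	main :: IO ()
	main = do
	    print (isUnmergedRecord ":100644 100644 aaaa bbbb U")  -- True
	    print (isUnmergedRecord ":100644 100644 aaaa bbbb M")  -- False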

Added a comment: deferring the keys-to-files scan
diff --git a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment
new file mode 100644
index 000000000..6927ab008
--- /dev/null
+++ b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_18_4e1e8fd89ea9be43d89e72562236c979._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="deferring the keys-to-files scan"
+ date="2021-06-07T16:11:00Z"
+ content="""
+>The scan could be done lazily, but there are situations that use the database where unexpectedly taking a much longer time than usual would be a real problem
+
+For unlocked files, certainly.  When `annex.supportunlocked=false`, it sounded like the only situation that uses the database is `drop --auto`, or a [[matching expression|git-annex-matching-options]] with `--includesamecontent/--excludesamecontent`?  (And maybe [[todo/git-annex_whereused]]).
+Personally I would prefer an unexpected delay in these rare cases, to a [delay](https://git-annex.branchable.com/bugs/significant_performance_regression_impacting_data!) in the more common case of checking out or switching branches.
+ 
+
+
+"""]]

comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_3_d30a515103d04fd966aadee7f141aeee._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_3_d30a515103d04fd966aadee7f141aeee._comment
new file mode 100644
index 000000000..e59b347f7
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_3_d30a515103d04fd966aadee7f141aeee._comment
@@ -0,0 +1,13 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2021-06-07T15:49:51Z"
+ content="""
+Also worth saying: When I have a choice between a bug fix and a performance
+change, I kind of have to pick the bug fix.
+[[bugs/indeterminite_preferred_content_state_for_duplicated_file]] was a
+longstanding bug that could cause *very* expensive misbehavior. The only
+way to fix it that would not have an on performance would be to
+remove include= and exclude= from the preferred content expression syntax,
+which would prevent a lot of current uses of preferred content.
+"""]]

comment
diff --git a/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_6_d40eb166d8d0e5e61fe18d42ad794e75._comment b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_6_d40eb166d8d0e5e61fe18d42ad794e75._comment
new file mode 100644
index 000000000..01f7f8328
--- /dev/null
+++ b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_6_d40eb166d8d0e5e61fe18d42ad794e75._comment
@@ -0,0 +1,9 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 6"""
+ date="2021-06-07T15:48:07Z"
+ content="""
+annex.supportunlocked=false still prevents the smudge/clean filter from
+being used, which can significantly speed up git if the repository has a
+lot of files stored in git.
+"""]]

comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_2_521f784d686fcaeb5e081599fbdf9903._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_2_521f784d686fcaeb5e081599fbdf9903._comment
new file mode 100644
index 000000000..7d3c87606
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_2_521f784d686fcaeb5e081599fbdf9903._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2021-06-07T15:41:43Z"
+ content="""
+I assume these tests are creating lots of clones of repositories. Are they
+also doing lots of merges or switching between branches?
+
+The init scan time has already been optimised as much as seems feasible.
+The update scan time can still be optimised, see
+[[speed_up_keys_db_update_with_git_streaming]].
+"""]]

added suggestion to match keys by file extension in the key
diff --git a/doc/todo/find__47__prefer_keys_by_file_extension_in_key.mdwn b/doc/todo/find__47__prefer_keys_by_file_extension_in_key.mdwn
new file mode 100644
index 000000000..635f6a7ba
--- /dev/null
+++ b/doc/todo/find__47__prefer_keys_by_file_extension_in_key.mdwn
@@ -0,0 +1 @@
+Add [[preferred content expression|git-annex-preferred-content]] and [[matching option|git-annex-matching-options]] to match the file extension incorporated into a `*E` [[key|backends]], e.g. `keyext=.mp3`.  This would help address the limitation that `include=*.mp3` does not work with `--all` or `--unused`.

Added a comment: keeping connected files together
diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_a33437b28ea7fc74eb221d90efaec487._comment b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_a33437b28ea7fc74eb221d90efaec487._comment
new file mode 100644
index 000000000..73139ec72
--- /dev/null
+++ b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_a33437b28ea7fc74eb221d90efaec487._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="keeping connected files together"
+ date="2021-06-07T14:45:35Z"
+ content="""
+One other option is to `tar` up each movie and all associated files into one archive, and annex that.
+
+There's a [special remote in DataLad](https://github.com/datalad/datalad/blob/master/datalad/customremotes/archives.py) for accessing individual files inside annexed archives, though I guess in your case you'd normally want all files anyway.
+"""]]

diff --git a/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn b/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn
new file mode 100644
index 000000000..8c0e36293
--- /dev/null
+++ b/doc/bugs/Build_fails_on_Windows_as_of_commit_a706708d1.mdwn
@@ -0,0 +1,4 @@
+As of commit a706708d1, trying to build git-annex on Windows fails because the import of `oneSecond` from `Utility.ThreadScheduler` is not available.  [This patch](https://raw.githubusercontent.com/datalad/git-annex/master/patches/20210607-a706708d1-fix-oneSecond.patch) fixes that.
+
+[[!meta author=jwodder]]
+[[!tag projects/datalad]]

removed
diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_1d73e3bcbda6c28556e2d5a6473de245._comment b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_1d73e3bcbda6c28556e2d5a6473de245._comment
deleted file mode 100644
index dfcb6f8de..000000000
--- a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_1d73e3bcbda6c28556e2d5a6473de245._comment
+++ /dev/null
@@ -1,10 +0,0 @@
-[[!comment format=mdwn
- username="Ilya_Shlyakhter"
- avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
- subject="specifying preferred content by metadata"
- date="2021-06-07T14:26:57Z"
- content="""
->I'd prefer if git-annex could be made to operate on full series (as defined by subdirectories or metadata
-
-Maybe I'm missing something, but doesn't [[git-annex-preferred-content]] support `metadata=field=glob`?
-"""]]

Added a comment: specifying preferred content by metadata
diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_1d73e3bcbda6c28556e2d5a6473de245._comment b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_1d73e3bcbda6c28556e2d5a6473de245._comment
new file mode 100644
index 000000000..dfcb6f8de
--- /dev/null
+++ b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_4_1d73e3bcbda6c28556e2d5a6473de245._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="specifying preferred content by metadata"
+ date="2021-06-07T14:26:57Z"
+ content="""
+>I'd prefer if git-annex could be made to operate on full series (as defined by subdirectories or metadata
+
+Maybe I'm missing something, but doesn't [[git-annex-preferred-content]] support `metadata=field=glob`?
+"""]]

Added a comment: specifying preferred content by metadata
diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_3_7b14aa71a166cf14e852ae896b05ee74._comment b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_3_7b14aa71a166cf14e852ae896b05ee74._comment
new file mode 100644
index 000000000..2a7a65aeb
--- /dev/null
+++ b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_3_7b14aa71a166cf14e852ae896b05ee74._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="specifying preferred content by metadata"
+ date="2021-06-07T14:26:27Z"
+ content="""
+>I'd prefer if git-annex could be made to operate on full series (as defined by subdirectories or metadata
+
+Maybe I'm missing something, but doesn't [[git-annex-preferred-content]] support `metadata=field=glob`?
+"""]]

Added a comment
diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_2_20e8dba904eff7cc8f72ee12b8119632._comment b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_2_20e8dba904eff7cc8f72ee12b8119632._comment
new file mode 100644
index 000000000..f20e1db04
--- /dev/null
+++ b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_2_20e8dba904eff7cc8f72ee12b8119632._comment
@@ -0,0 +1,17 @@
+[[!comment format=mdwn
+ username="Atemu"
+ avatar="http://cdn.libravatar.org/avatar/d1f0f4275931c552403f4c6707bead7a"
+ subject="comment 2"
+ date="2021-06-06T20:47:31Z"
+ content="""
+> You could make the subtitles wanted in every repo so that all subtitles are present in every repo. Since they are small, the overhead shouldn't be large.
+
+This is probably what I'll end up doing.
+
+> Or you could directly add them to git (\"small files\") so they are also present everywhere.
+
+I do that for one type of metadata file that isn't important for consumption but I want everything else to be annex files so that I can assign metadata etc. to them.
+
+
+
+"""]]

Added a comment
diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__/comment_1_89e04a7dc25a3df849aa2ada4dbc8439._comment b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_1_89e04a7dc25a3df849aa2ada4dbc8439._comment
new file mode 100644
index 000000000..e5c379bf6
--- /dev/null
+++ b/doc/forum/How_to_keep_connected_files_with_another__63__/comment_1_89e04a7dc25a3df849aa2ada4dbc8439._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="Lukey"
+ avatar="http://cdn.libravatar.org/avatar/c7c08e2efd29c692cc017c4a4ca3406b"
+ subject="comment 1"
+ date="2021-06-06T18:01:08Z"
+ content="""
+You could make the subtitles wanted in every repo so that all subtitles are present in every repo. Since they are small, the overhead shouldn't be large.
+
+Or you could directly add them to git (\"small files\") so they are also present everywhere. On a fresh repo, this would help a bit with speed too since git-annex then doesn't need to keep track of the location of these small files.
+
+Or (depending on how you configured your preferred content) you could increase numcopies just for the small files. See [[walkthrough/backups/]].
+
+Unfortunately, preferred-content can't directly relate multiple files with each other. git-annex iterates over each file in the tree and checks if preferred-content matches for that particular file.
+"""]]

diff --git a/doc/forum/How_to_keep_connected_files_with_another__63__.mdwn b/doc/forum/How_to_keep_connected_files_with_another__63__.mdwn
new file mode 100644
index 000000000..c1af975d9
--- /dev/null
+++ b/doc/forum/How_to_keep_connected_files_with_another__63__.mdwn
@@ -0,0 +1,5 @@
+I've got a repo full of (legally acquired) movies and series, most of which have corresponding metadata JSON and ASS subtitle files. When distributing them over many cold storage drives, I've noticed that git-annex would try to fill them up with many of the (much smaller) text files when there isn't enough space for another video file, leaving video and subtitle files on separate drives.
+
+This isn't a critical issue since there are still enough copies and everything, but it'd be annoying to have to search for and connect two or more drives to get videos + subtitles for a single series.
+
+I was wondering if there was perhaps a clever solution to prevent this from happening. Everything is organised into subfolders, so *ideally* I'd prefer if git-annex could be made to operate on full series (as defined by subdirectories or metadata perhaps?) instead of context-less files somehow.

rename bugs/delayadd_doesn__39__t_work.mdwn to bugs/delayadd_doesn__39__t_work_with_smallfiles.mdwn
diff --git a/doc/bugs/delayadd_doesn__39__t_work.mdwn b/doc/bugs/delayadd_doesn__39__t_work_with_smallfiles.mdwn
similarity index 100%
rename from doc/bugs/delayadd_doesn__39__t_work.mdwn
rename to doc/bugs/delayadd_doesn__39__t_work_with_smallfiles.mdwn
diff --git a/doc/bugs/delayadd_doesn__39__t_work/comment_1_1501fc7de682c0f2920c6c592204268c._comment b/doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_1_1501fc7de682c0f2920c6c592204268c._comment
similarity index 100%
rename from doc/bugs/delayadd_doesn__39__t_work/comment_1_1501fc7de682c0f2920c6c592204268c._comment
rename to doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_1_1501fc7de682c0f2920c6c592204268c._comment
diff --git a/doc/bugs/delayadd_doesn__39__t_work/comment_2_3aa7b34ff3d0606f97fb9e80ece34255._comment b/doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_2_3aa7b34ff3d0606f97fb9e80ece34255._comment
similarity index 100%
rename from doc/bugs/delayadd_doesn__39__t_work/comment_2_3aa7b34ff3d0606f97fb9e80ece34255._comment
rename to doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_2_3aa7b34ff3d0606f97fb9e80ece34255._comment
diff --git a/doc/bugs/delayadd_doesn__39__t_work/comment_3_f0b5e6f0554eb43f55bfc99d178c506d._comment b/doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_3_f0b5e6f0554eb43f55bfc99d178c506d._comment
similarity index 100%
rename from doc/bugs/delayadd_doesn__39__t_work/comment_3_f0b5e6f0554eb43f55bfc99d178c506d._comment
rename to doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_3_f0b5e6f0554eb43f55bfc99d178c506d._comment
diff --git a/doc/bugs/delayadd_doesn__39__t_work/comment_4_dbe41188bc6650418b68f52ec479fc11._comment b/doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_4_dbe41188bc6650418b68f52ec479fc11._comment
similarity index 100%
rename from doc/bugs/delayadd_doesn__39__t_work/comment_4_dbe41188bc6650418b68f52ec479fc11._comment
rename to doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_4_dbe41188bc6650418b68f52ec479fc11._comment
diff --git a/doc/bugs/delayadd_doesn__39__t_work/comment_5_31a194407e433b17450725170552b8f7._comment b/doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_5_31a194407e433b17450725170552b8f7._comment
similarity index 100%
rename from doc/bugs/delayadd_doesn__39__t_work/comment_5_31a194407e433b17450725170552b8f7._comment
rename to doc/bugs/delayadd_doesn__39__t_work_with_smallfiles/comment_5_31a194407e433b17450725170552b8f7._comment

Added a comment: using import tree and export tree
diff --git a/doc/special_remotes/webdav/comment_23_d1b44f0cf171fb8cab85add778f2949b._comment b/doc/special_remotes/webdav/comment_23_d1b44f0cf171fb8cab85add778f2949b._comment
new file mode 100644
index 000000000..c375606c3
--- /dev/null
+++ b/doc/special_remotes/webdav/comment_23_d1b44f0cf171fb8cab85add778f2949b._comment
@@ -0,0 +1,9 @@
+[[!comment format=mdwn
+ username="jenkin.schibel@286264d9ceb79998aecff0d5d1a4ffe34f8b8421"
+ nickname="jenkin.schibel"
+ avatar="http://cdn.libravatar.org/avatar/692d82fb5c42fc86d97cc44ae0fb61ca"
+ subject="using import tree and export tree"
+ date="2021-06-06T14:43:39Z"
+ content="""
+Hey, will being able to import a treeish from a webdav remote ever be supported? My use case is that I have a nextcloud instance where I store photo backups for all the smart devices in my family, which all get backed up to a single shared directory. Since this tree would be ever-changing due to the many smart phones connecting to it and storing data in it, I figured a push and pull method similar to what can be done with the adb special remote could be useful to keep all the files tracked in my annex.
+"""]]

Add bug report
diff --git a/doc/bugs/__34__failed_to_send_content_to_remote__34__.mdwn b/doc/bugs/__34__failed_to_send_content_to_remote__34__.mdwn
new file mode 100644
index 000000000..a03582fdc
--- /dev/null
+++ b/doc/bugs/__34__failed_to_send_content_to_remote__34__.mdwn
@@ -0,0 +1,218 @@
+### Please describe the problem.
+
+I am unable to copy files from one git-annex repo to another, either using `copy --from=...` from the destination repo or `copy --to=...` from the source repo.
+
+Even with the added output from `--debug` I have no idea what is going wrong. The example below focusses on one file, but I think no files work.
+
+`git annex unused` shows `partially transferred data` after the failures. As far as I can tell, the rsync commands that copy the files to `.git/annex/objects/tmp` are working, so I guess the failure happens after that.
+
+From the source repo, located at `/home/james/w/annex-neu`:
+
+	james copter annex-neu $ git remote show -n nature
+	* remote nature
+	  Fetch URL: /d/nature-ext2/falsifian/w/annex-neu
+	  Push  URL: /d/nature-ext2/falsifian/w/annex-neu
+	  HEAD branch: (not queried)
+	  Remote branches: (status not queried)
+	    git-annex
+	    master
+	    synced/git-annex
+	    synced/master
+	  Local ref configured for 'git push' (status not queried):
+	    (matching) pushes to (matching)
+
+	james copter annex-neu $ git annex copy --to=nature datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz
+	copy datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz (to nature...)
+
+
+	  failed to send content to remote
+
+
+
+	  failed to send content to remote
+	failed
+	git-annex: copy: 1 failed
+
+And with `--debug`:
+
+	james copter annex-neu $ git annex copy --debug --to=nature datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz
+	[2021-06-05 20:24:58.016674809] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
+	[2021-06-05 20:24:58.021106113] process done ExitSuccess
+	[2021-06-05 20:24:58.021381484] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
+	[2021-06-05 20:24:58.025170816] process done ExitSuccess
+	[2021-06-05 20:24:58.025705324] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..803d71e5e3f15ee4a2a30e63c5e0e9faccf0defe","--pretty=%H","-n1"]
+	[2021-06-05 20:24:58.029655191] process done ExitSuccess
+	[2021-06-05 20:24:58.030257913] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..4a7de0d2b9a7f4a2b9cf88c57636a83886f6b340","--pretty=%H","-n1"]
+	[2021-06-05 20:24:58.034210664] process done ExitSuccess
+	[2021-06-05 20:24:58.034470983] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..bbfde8eb1c05e46c1a4e25719c8eb04189fa7cf2","--pretty=%H","-n1"]
+	[2021-06-05 20:24:58.038954155] process done ExitSuccess
+	[2021-06-05 20:24:58.039404429] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
+	[2021-06-05 20:24:58.039881197] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
+	[2021-06-05 20:24:58.050584672] read: git ["config","--null","--list"]
+	[2021-06-05 20:24:58.053085725] process done ExitSuccess
+	[2021-06-05 20:24:58.053918467] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
+	[2021-06-05 20:24:58.056598571] process done ExitSuccess
+	[2021-06-05 20:24:58.056750274] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
+	[2021-06-05 20:24:58.059953296] process done ExitSuccess
+	[2021-06-05 20:24:58.060270357] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--stage","-z","--","datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz"]
+	[2021-06-05 20:24:58.060602372] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)","--buffer"]
+	[2021-06-05 20:24:58.061149131] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
+	[2021-06-05 20:24:58.061754169] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
+	copy datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz (to nature...)
+	[2021-06-05 20:24:58.124257351] read: cp ["--reflink=always","--preserve=timestamps",".git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz","../../../../nature-ext2/falsifian/w/annex-neu/.git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	[2021-06-05 20:24:58.125931854] process done ExitFailure 1
+	[2021-06-05 20:24:58.12621772] read: rsync ["--progress","--inplace","--perms",".git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz","../../../../nature-ext2/falsifian/w/annex-neu/.git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	100%  288.37 MiB      330 MiB/s 0s  [2021-06-05 20:24:59.003840882] process done ExitSuccess
+
+	[2021-06-05 20:24:59.004436111] read: rsync ["--progress","--inplace","--perms",".git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz","../../../../nature-ext2/falsifian/w/annex-neu/.git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	100%  288.37 MiB      463 MiB/s 0s [2021-06-05 20:24:59.629830876] process done ExitSuccess
+
+	  failed to send content to remote
+
+	[2021-06-05 20:24:59.631304163] read: rsync ["--progress","--inplace","--perms",".git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz","../../../../nature-ext2/falsifian/w/annex-neu/.git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	100%  288.37 MiB      462 MiB/s 0s [2021-06-05 20:25:00.257185993] process done ExitSuccess
+
+	[2021-06-05 20:25:00.257891486] read: rsync ["--progress","--inplace","--perms",".git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz","../../../../nature-ext2/falsifian/w/annex-neu/.git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	100%  288.37 MiB      437 MiB/s 0s [2021-06-05 20:25:00.92075523] process done ExitSuccess
+
+	  failed to send content to remote
+	failed
+	[2021-06-05 20:25:00.921692771] process done ExitSuccess
+	[2021-06-05 20:25:00.921771903] process done ExitSuccess
+	[2021-06-05 20:25:00.921837622] process done ExitSuccess
+	git-annex: copy: 1 failed
+
+
+Trying in the other direction (note: /home/falsifian is a symlink to /home/james):
+
+	james copter annex-neu $ git remote show -n local
+	* remote local
+	  Fetch URL: /home/falsifian/w/annex-neu
+	  Push  URL: /home/falsifian/w/annex-neu
+	  HEAD branch: (not queried)
+	  Remote branches: (status not queried)
+	    git-annex
+	    master
+	    synced/git-annex
+	    synced/master
+	  Local ref configured for 'git push' (status not queried):
+	    (matching) pushes to (matching)
+
+	james copter annex-neu $ git annex get --debug --from=local datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz
+	[2021-06-05 20:26:43.459693059] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","git-annex"]
+	[2021-06-05 20:26:43.4637773] process done ExitSuccess
+	[2021-06-05 20:26:43.464083545] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","--hash","refs/heads/git-annex"]
+	[2021-06-05 20:26:43.467747946] process done ExitSuccess
+	[2021-06-05 20:26:43.468371444] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..2585bc984868658a30bac238f343e14ee0ab4a40","--pretty=%H","-n1"]
+	[2021-06-05 20:26:43.472230994] process done ExitSuccess
+	[2021-06-05 20:26:43.472821833] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..4026e21e431b23f3ed8bab4a59d076a07eb38aba","--pretty=%H","-n1"]
+	[2021-06-05 20:26:43.476524704] process done ExitSuccess
+	[2021-06-05 20:26:43.476704328] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","log","refs/heads/git-annex..4a7de0d2b9a7f4a2b9cf88c57636a83886f6b340","--pretty=%H","-n1"]
+	[2021-06-05 20:26:43.480419871] process done ExitSuccess
+	[2021-06-05 20:26:43.480903262] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
+	[2021-06-05 20:26:43.481438522] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
+	[2021-06-05 20:26:43.484464329] read: git ["config","--null","--list"]
+	[2021-06-05 20:26:43.484899963] read: git ["config","--null","--list"]
+	[2021-06-05 20:26:43.494876192] process done ExitSuccess
+	[2021-06-05 20:26:43.495606623] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","symbolic-ref","-q","HEAD"]
+	[2021-06-05 20:26:43.498087081] process done ExitSuccess
+	[2021-06-05 20:26:43.498254949] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","show-ref","refs/heads/master"]
+	[2021-06-05 20:26:43.50112263] process done ExitSuccess
+	[2021-06-05 20:26:43.501446299] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","ls-files","--stage","-z","--","datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz"]
+	[2021-06-05 20:26:43.501858278] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)","--buffer"]
+	[2021-06-05 20:26:43.503867361] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
+	[2021-06-05 20:26:43.504653623] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch=%(objectname) %(objecttype) %(objectsize)","--buffer"]
+	[2021-06-05 20:26:43.50750836] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch"]
+	[2021-06-05 20:26:43.50796496] chat: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","cat-file","--batch-check=%(objectname) %(objecttype) %(objectsize)"]
+	get datasets/reddit/files.pushshift.io/reddit/daily/RS_2019_10_01.gz (from local...)
+	[2021-06-05 20:26:43.568982372] read: cp ["--reflink=always","--preserve=timestamps","../../../../../home/falsifian/w/annex-neu/.git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz",".git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	[2021-06-05 20:26:43.571638865] process done ExitFailure 1
+	[2021-06-05 20:26:43.572002434] read: rsync ["--progress","--inplace","--perms","../../../../../home/falsifian/w/annex-neu/.git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz",".git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	100%  288.37 MiB      351 MiB/s 0s   [2021-06-05 20:26:44.396269512] process done ExitSuccess
+
+	  failed to retrieve content from remote
+
+	[2021-06-05 20:26:44.397675637] read: rsync ["--progress","--inplace","--perms","../../../../../home/falsifian/w/annex-neu/.git/annex/objects/vv/Km/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz",".git/annex/tmp/SHA256E-s302380112--fb63e37bf5af350af35be3400a47bc1bf3a46253646b710579b9162560cea087.gz"]
+	0%    32 KiB           23 MiB/s 12s [2021-06-05 20:26:45.0364163] process done ExitSuccess
+
+	  failed to retrieve content from remote
+	failed
+	[2021-06-05 20:26:45.037805309] process done ExitSuccess
+	[2021-06-05 20:26:45.03799324] process done ExitSuccess
+	[2021-06-05 20:26:45.038069266] process done ExitSuccess
+	[2021-06-05 20:26:45.038244936] process done ExitSuccess
+	[2021-06-05 20:26:45.03831897] process done ExitSuccess
+	git-annex: get: 1 failed
+
+

(Diff truncated)
Added a comment
diff --git a/doc/bugs/significant_performance_regression_impacting_datal/comment_1_3eb1b092f41dc5c04d74f7a03249aa0f._comment b/doc/bugs/significant_performance_regression_impacting_datal/comment_1_3eb1b092f41dc5c04d74f7a03249aa0f._comment
new file mode 100644
index 000000000..d761dae81
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal/comment_1_3eb1b092f41dc5c04d74f7a03249aa0f._comment
@@ -0,0 +1,10 @@
+[[!comment format=mdwn
+ username="yarikoptic"
+ avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4"
+ subject="comment 1"
+ date="2021-06-05T13:50:42Z"
+ content="""
+Oh, fresh runs of the datalad tests fail because, I guess, you implemented that \"don't display the message unless it takes a while\" idea I spotted in one of the comments but didn't have a chance to follow up on - I guess from now on we won't be able to test that we pass through messages from git-annex (unless there is some setting with which we could disable that feature during tests).
+
+That is unrelated to this slowdown issue, though.
+"""]]

Initial report on performance regression
diff --git a/doc/bugs/significant_performance_regression_impacting_datal.mdwn b/doc/bugs/significant_performance_regression_impacting_datal.mdwn
new file mode 100644
index 000000000..87cd4b1e8
--- /dev/null
+++ b/doc/bugs/significant_performance_regression_impacting_datal.mdwn
@@ -0,0 +1,16 @@
+### Please describe the problem.
+
+With the recent refactoring of the scanning for unlocked/annexed files (I guess), a sweep of datalad tests on OSX started to take about 3h 30min instead of the prior 1h 46min - so pretty much twice as long. Besides possibly affecting user experience, I am afraid it would cause too many ripples through our CI setup, which might now run out of time.
+
+Logs etc are at https://github.com/datalad/git-annex/actions/workflows/build-macos.yaml 
+
+The first red is ok, just a fluke, but then they all fail due to a change in an output log string (for which there is a fix, but somehow the behavior on OSX seems different - yet to check).
+ 
+
+
+
+### What version of git-annex are you using? On what operating system?
+
+Currently 8.20210428+git282-gd39dfed2a; it first got slow with
+8.20210428+git228-g13a6bfff4 and was ok with 8.20210428+git202-g9a5981a15.
+

Added a comment
diff --git a/doc/forum/distributed_borg/comment_4_cf173923509ec826a1962f9a934b5a64._comment b/doc/forum/distributed_borg/comment_4_cf173923509ec826a1962f9a934b5a64._comment
new file mode 100644
index 000000000..77599215d
--- /dev/null
+++ b/doc/forum/distributed_borg/comment_4_cf173923509ec826a1962f9a934b5a64._comment
@@ -0,0 +1,11 @@
+[[!comment format=mdwn
+ username="alt"
+ subject="comment 4"
+ date="2021-06-05T13:07:47Z"
+ content="""
+Indeed, we had been following that page. Considering that a git-annex repo is stored in its entirety within a Borg archive, the explanation that “`git-annex` sync scans the borg repository to find out what annexed files are stored in it” likely led to the mistaken assumption that simply adding the special remote would be enough for git-annex to know how to handle it. (We had also tried specifying `subdir` to tell it exactly where to look but clearly were cargo culting by that point.)
+
+In retrospect, the outline of our intended implementation reveals a predisposition to make such an assumption: that the git-annex repo on each workstation would be “accessible only to the workstation user” suggests that there would be no provision for cloning in the first place. (Instead, there simply would be a bunch of repos that happened to be privately initialized more or less the same way, all sharing a bunch of special remotes that also happened to be initialized more or less the same way. Highly technical, I know!)
+
+Also, I had read the man pages, but due to Borg being an “unusual kind of remote”—a _special_ special remote, if you will—I was unsure how much of the information applied. Thus, the difference between `initremote` and `enableremote` in this case was not immediately clear.
+"""]]

Added a comment
diff --git a/doc/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_4_4560eff896578ac2779b03ec3484d0b2._comment b/doc/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_4_4560eff896578ac2779b03ec3484d0b2._comment
new file mode 100644
index 000000000..1c89c746e
--- /dev/null
+++ b/doc/todo/option_for___40__fast__41___compression_on_special_remotes_like___34__directory__34__/comment_4_4560eff896578ac2779b03ec3484d0b2._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="lucas.gautheron@09f1983993dfb0907d02ba268b3ca672f1dc3eea"
+ nickname="lucas.gautheron"
+ avatar="http://cdn.libravatar.org/avatar/ae142c662ed23018ca47390ab00f7374"
+ subject="comment 4"
+ date="2021-06-05T10:10:57Z"
+ content="""
+This would be _extremely_ useful!
+
+Especially if it were possible to turn compression on or off based on each file's name or metadata.
+
+"""]]

Added a comment: "why all these wild ideas are being thrown out there"
diff --git a/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_5_c6071f417bebc13aeba79832802f9fc4._comment b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_5_c6071f417bebc13aeba79832802f9fc4._comment
new file mode 100644
index 000000000..34d418686
--- /dev/null
+++ b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_5_c6071f417bebc13aeba79832802f9fc4._comment
@@ -0,0 +1,8 @@
+[[!comment format=mdwn
+ username="Ilya_Shlyakhter"
+ avatar="http://cdn.libravatar.org/avatar/1647044369aa7747829c38b9dcc84df0"
+ subject="&quot;why all these wild ideas are being thrown out there&quot;"
+ date="2021-06-04T22:15:32Z"
+ content="""
+It just seemed like all the speedup possibilities from `annex.supportunlocked=false` are getting undone to optimize a not-too-common scenario?
+"""]]

--size-limit exit 101
Sponsored-by: Mark Reidenbach on Patreon
diff --git a/Annex.hs b/Annex.hs
index c9feb7540..f659a4f88 100644
--- a/Annex.hs
+++ b/Annex.hs
@@ -189,6 +189,7 @@ data AnnexState = AnnexState
 	, sentinalstatus :: Maybe SentinalStatus
 	, useragent :: Maybe String
 	, errcounter :: Integer
+	, skippedfiles :: Bool
 	, adjustedbranchrefreshcounter :: Integer
 	, unusedkeys :: Maybe (S.Set Key)
 	, tempurls :: M.Map Key URLString
@@ -248,6 +249,7 @@ newAnnexState c r = do
 		, sentinalstatus = Nothing
 		, useragent = Nothing
 		, errcounter = 0
+		, skippedfiles = False
 		, adjustedbranchrefreshcounter = 0
 		, unusedkeys = Nothing
 		, tempurls = M.empty
diff --git a/Benchmark.hs b/Benchmark.hs
index c48cbbd09..434d9764d 100644
--- a/Benchmark.hs
+++ b/Benchmark.hs
@@ -32,9 +32,9 @@ mkGenerator cmds userinput = do
 		forM_ l $ \(cmd, seek, st) ->
 			-- The cmd is run for benchmarking without startup or
 			-- shutdown actions.
-			Annex.eval st $ performCommandAction cmd seek noop
+			Annex.eval st $ performCommandAction False cmd seek noop
   where
-	-- Simplified versio of CmdLine.dispatch, without support for fuzzy
+	-- Simplified version of CmdLine.dispatch, without support for fuzzy
 	-- matching or out-of-repo commands.
 	parsesubcommand ps = do
 		(cmd, seek, globalconfig) <- liftIO $ O.handleParseResult $
diff --git a/CmdLine.hs b/CmdLine.hs
index eb75e66e1..1cb795366 100644
--- a/CmdLine.hs
+++ b/CmdLine.hs
@@ -62,7 +62,7 @@ dispatch' subcommandname args fuzzy cmds allargs allcmds fields getgitrepo progn
 			forM_ fields $ uncurry Annex.setField
 			prepRunCommand cmd globalsetter
 			startup
-			performCommandAction cmd seek $
+			performCommandAction True cmd seek $
 				shutdown $ cmdnocommit cmd
 	go (Left norepo) = do
 		let ingitrepo = \a -> a =<< Git.Config.global
diff --git a/CmdLine/Action.hs b/CmdLine/Action.hs
index 29baea29e..6d7932bb0 100644
--- a/CmdLine/Action.hs
+++ b/CmdLine/Action.hs
@@ -30,19 +30,32 @@ import qualified Data.Map.Strict as M
 import qualified System.Console.Regions as Regions
 
 {- Runs a command, starting with the check stage, and then
- - the seek stage. Finishes by running the continutation, and 
- - then showing a count of any failures. -}
-performCommandAction :: Command -> CommandSeek -> Annex () -> Annex ()
-performCommandAction Command { cmdcheck = c, cmdname = name } seek cont = do
+ - the seek stage. Finishes by running the continuation.
+ -
+ - Can exit when there was a problem or when files were skipped.
+ - Also shows a count of any failures when that is enabled.
+ -}
+performCommandAction :: Bool -> Command -> CommandSeek -> Annex () -> Annex ()
+performCommandAction canexit (Command { cmdcheck = c, cmdname = name }) seek cont = do
 	mapM_ runCheck c
 	Annex.changeState $ \s -> s { Annex.errcounter = 0 }
 	seek
 	finishCommandActions
 	cont
-	showerrcount =<< Annex.getState Annex.errcounter
+	st <- Annex.getState id
+	when canexit $ liftIO $ case (Annex.errcounter st, Annex.skippedfiles st) of
+		(0, False) -> noop
+		(errcnt, False) -> do
+			showerrcount errcnt
+			exitWith $ ExitFailure 1
+		(0, True) -> exitskipped
+		(errcnt, True) -> do
+			showerrcount errcnt
+			exitskipped
   where
-	showerrcount 0 = noop
-	showerrcount cnt = giveup $ name ++ ": " ++ show cnt ++ " failed"
+	showerrcount cnt = hPutStrLn stderr $
+		name ++ ": " ++ show cnt ++ " failed"
+	exitskipped = exitWith $ ExitFailure 101
 
 commandActions :: [CommandStart] -> Annex ()
 commandActions = mapM_ commandAction
@@ -315,7 +328,7 @@ checkSizeLimit (Just sizelimitvar) startmsg a =
 			Nothing -> do
 				fsz <- catchMaybeIO $ withObjectLoc k $
 					liftIO . getFileSize
-				maybe noop go fsz
+				maybe skipped go fsz
 		Nothing -> a
   where
 	go sz = do
@@ -327,4 +340,8 @@ checkSizeLimit (Just sizelimitvar) startmsg a =
 					writeTVar sizelimitvar n'
 					return True
 				else return False
-		when fits a
+		if fits 
+			then a
+			else skipped
+	
+	skipped = Annex.changeState $ \s -> s { Annex.skippedfiles = True }
diff --git a/doc/git-annex-common-options.mdwn b/doc/git-annex-common-options.mdwn
index cf44cf40e..7b8efed2b 100644
--- a/doc/git-annex-common-options.mdwn
+++ b/doc/git-annex-common-options.mdwn
@@ -89,6 +89,9 @@ Most of these options are accepted by all git-annex commands.
   In some cases, an annexed file's size is not known. This option will
   prevent git-annex from processing such files.
 
+  When the size limit prevents git-annex from acting on any files,
+  it will exit with a special code, 101.
+
 * `--semitrust=repository`
 * `--untrust=repository`
 
diff --git a/doc/todo/size_limits_for_drop__47__move__47__copy__47__get.mdwn b/doc/todo/size_limits_for_drop__47__move__47__copy__47__get.mdwn
index a93f700df..9742d3c02 100644
--- a/doc/todo/size_limits_for_drop__47__move__47__copy__47__get.mdwn
+++ b/doc/todo/size_limits_for_drop__47__move__47__copy__47__get.mdwn
@@ -5,3 +5,5 @@ This way you could quickly "garbage collect" a few dozen GiB from your annex rep
 Another issue this could be used to mitigate is that, for some reason, git-annex doesn't properly auto-stop the transfer when the repos on my external drives are full.
 
 I imagine there are many more use-cases where quickly being able to set a limit for the amount of data a command should act on could come in handy.
+
+> [[done]] --[[Joey]]
diff --git a/doc/todo/size_limits_for_drop__47__move__47__copy__47__get/comment_2_8687c2bb69c83d3d8ebfa877a0d7b32f._comment b/doc/todo/size_limits_for_drop__47__move__47__copy__47__get/comment_2_8687c2bb69c83d3d8ebfa877a0d7b32f._comment
new file mode 100644
index 000000000..341f1f384
--- /dev/null
+++ b/doc/todo/size_limits_for_drop__47__move__47__copy__47__get/comment_2_8687c2bb69c83d3d8ebfa877a0d7b32f._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2021-06-04T20:35:26Z"
+ content="""
+--size-limit is implemented, for most git-annex commands.
+
+Ones like `git-annex add` that don't operate on annexed files don't support
+it, at least yet.
+
+Ones like git-annex export/import/sync I'm not sure it makes sense to
+support it, since they kind of operate at a higher level than individual
+files.
+"""]]

diff --git a/doc/todo/trust_based_on_time_since_last_fsck.mdwn b/doc/todo/trust_based_on_time_since_last_fsck.mdwn
new file mode 100644
index 000000000..2edee94a5
--- /dev/null
+++ b/doc/todo/trust_based_on_time_since_last_fsck.mdwn
@@ -0,0 +1,5 @@
+It'd be really useful if I could specify my level of trust in a remote holding a file as a function of the time since the file has last been fsck'd in that remote.
+
+This way, if I haven't fsck'd say my off-site cold storage in x amount of time, git-annex would automatically try to create additional copies of its files in other remotes for example.
+
+Expiry can be used in a similar way but declaring the remote as dead is overkill and has unwanted side-effects.
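
As a rough sketch of the predicate being asked for here (hypothetical names; git-annex does log fsck activity, but this exact lookup is illustrative only):

	import Data.Time.Clock.POSIX (POSIXTime)

	-- A remote's copy would count toward numcopies only while its last
	-- fsck there is fresher than some configured maximum age.
	copyStillTrusted :: POSIXTime -> POSIXTime -> POSIXTime -> Bool
	copyStillTrusted now maxAge lastFsck = now - lastFsck <= maxAge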

add --size-limit option
When this option is not used, there should be effectively no added
overhead, thanks to the optimisation in
b3cd0cc6ba4e5b9e2ae0abd9c8b2ec32475e09d2.
When an action fails on a file, the size of the file still counts toward
the size limit. This was necessary to support concurrency, but also
generally seems like the right choice.
Most commands that operate on annexed files support the option.
export and import do not, and I don't know if it would make sense for
export to... why would you want an incomplete export? sync doesn't, and
while it would be easy to make it support it for transferring files,
it's not clear if dropping files should also take the size limit into
account. Commands like add that don't operate on annexed files don't
support the option either.
Exiting 101 is not yet implemented.
Sponsored-by: Denis Dzyubenko on Patreon
diff --git a/Annex.hs b/Annex.hs
index 5168b6411..c9feb7540 100644
--- a/Annex.hs
+++ b/Annex.hs
@@ -174,6 +174,7 @@ data AnnexState = AnnexState
 	, forcemincopies :: Maybe MinCopies
 	, limit :: ExpandableMatcher Annex
 	, timelimit :: Maybe (Duration, POSIXTime)
+	, sizelimit :: Maybe (TVar Integer)
 	, uuiddescmap :: Maybe UUIDDescMap
 	, preferredcontentmap :: Maybe (FileMatcherMap Annex)
 	, requiredcontentmap :: Maybe (FileMatcherMap Annex)
@@ -232,6 +233,7 @@ newAnnexState c r = do
 		, forcemincopies = Nothing
 		, limit = BuildingMatcher []
 		, timelimit = Nothing
+		, sizelimit = Nothing
 		, uuiddescmap = Nothing
 		, preferredcontentmap = Nothing
 		, requiredcontentmap = Nothing
diff --git a/CHANGELOG b/CHANGELOG
index 05b190b13..47d1185ef 100644
--- a/CHANGELOG
+++ b/CHANGELOG
@@ -28,6 +28,7 @@ git-annex (8.20210429) UNRELEASED; urgency=medium
   * init: When annex.commitmessage is set, use that message for the commit
     that creates the git-annex branch.
   * Added annex.adviceNoSshCaching config.
+  * Added --size-limit option.
 
  -- Joey Hess <id@joeyh.name>  Mon, 03 May 2021 10:33:10 -0400
 
diff --git a/CmdLine/Action.hs b/CmdLine/Action.hs
index 008e8fc99..29baea29e 100644
--- a/CmdLine/Action.hs
+++ b/CmdLine/Action.hs
@@ -1,6 +1,6 @@
 {- git-annex command-line actions and concurrency
  -
- - Copyright 2010-2020 Joey Hess <id@joeyh.name>
+ - Copyright 2010-2021 Joey Hess <id@joeyh.name>
  -
  - Licensed under the GNU AGPL version 3 or higher.
  -}
@@ -15,9 +15,11 @@ import Annex.Concurrent
 import Annex.WorkerPool
 import Types.Command
 import Types.Concurrency
+import Annex.Content
 import Messages.Concurrent
 import Types.Messages
 import Types.WorkerPool
+import Types.ActionItem
 import Remote.List
 
 import Control.Concurrent
@@ -58,21 +60,29 @@ commandAction :: CommandStart -> Annex ()
 commandAction start = do
 	st <- Annex.getState id
 	case getConcurrency' (Annex.concurrency st) of
-		NonConcurrent -> runnonconcurrent
+		NonConcurrent -> runnonconcurrent (Annex.sizelimit st)
 		Concurrent n
-			| n > 1 -> runconcurrent (Annex.workers st)
-			| otherwise -> runnonconcurrent
-		ConcurrentPerCpu -> runconcurrent (Annex.workers st)
+			| n > 1 -> runconcurrent (Annex.sizelimit st) (Annex.workers st)
+			| otherwise -> runnonconcurrent (Annex.sizelimit st)
+		ConcurrentPerCpu -> runconcurrent (Annex.sizelimit st) (Annex.workers st)
   where
-	runnonconcurrent = void $ includeCommandAction start
-	runconcurrent Nothing = runnonconcurrent
-	runconcurrent (Just tv) = 
-		liftIO (atomically (waitStartWorkerSlot tv)) >>=
-			maybe runnonconcurrent (runconcurrent' tv)
-	runconcurrent' tv (workerstrd, workerstage) = do
+	runnonconcurrent sizelimit = start >>= \case
+		Nothing -> noop
+		Just (startmsg, perform) -> 
+			checkSizeLimit sizelimit startmsg $ do
+				showStartMessage startmsg
+				void $ accountCommandAction startmsg $
+					performCommandAction' startmsg perform
+
+	runconcurrent sizelimit Nothing = runnonconcurrent sizelimit
+	runconcurrent sizelimit (Just tv) = 
+		liftIO (atomically (waitStartWorkerSlot tv)) >>= maybe
+			(runnonconcurrent sizelimit)
+			(runconcurrent' sizelimit tv)
+	runconcurrent' sizelimit tv (workerstrd, workerstage) = do
 		aid <- liftIO $ async $ snd 
 			<$> Annex.run workerstrd
-				(concurrentjob (fst workerstrd))
+				(concurrentjob sizelimit (fst workerstrd))
 		liftIO $ atomically $ do
 			pool <- takeTMVar tv
 			let !pool' = addWorkerPool (ActiveWorker aid workerstage) pool
@@ -88,10 +98,11 @@ commandAction start = do
 				let !pool' = deactivateWorker pool aid workerstrd'
 				putTMVar tv pool'
 	
-	concurrentjob workerst = start >>= \case
+	concurrentjob sizelimit workerst = start >>= \case
 		Nothing -> noop
 		Just (startmsg, perform) ->
-			concurrentjob' workerst startmsg perform
+			checkSizeLimit sizelimit startmsg $
+				concurrentjob' workerst startmsg perform
 	
 	concurrentjob' workerst startmsg perform = case mkActionItem startmsg of
 		OnlyActionOn k _ -> ensureOnlyActionOn k $
@@ -126,7 +137,7 @@ commandAction start = do
 			Nothing -> do
 				showEndMessage startmsg False
 				return False
-
+	
 {- Waits for all worker threads to finish and merges their AnnexStates
  - back into the current Annex's state.
  -}
@@ -294,3 +305,26 @@ ensureOnlyActionOn k a = debugLocks $
 					writeTVar tv $! M.insert k mytid m
 					return $ liftIO $ atomically $
 						modifyTVar tv $ M.delete k
+
+checkSizeLimit :: Maybe (TVar Integer) -> StartMessage -> Annex () -> Annex ()
+checkSizeLimit Nothing _ a = a
+checkSizeLimit (Just sizelimitvar) startmsg a =
+	case actionItemKey (mkActionItem startmsg) of
+		Just k -> case fromKey keySize k of
+			Just sz -> go sz
+			Nothing -> do
+				fsz <- catchMaybeIO $ withObjectLoc k $
+					liftIO . getFileSize
+				maybe noop go fsz
+		Nothing -> a
+  where
+	go sz = do
+		fits <- liftIO $ atomically $ do
+			n <- readTVar sizelimitvar
+			let !n' = n - sz
+			if n' >= 0
+				then do
+					writeTVar sizelimitvar n'
+					return True
+				else return False
+		when fits a
diff --git a/CmdLine/GitAnnex/Options.hs b/CmdLine/GitAnnex/Options.hs
index 1e2f15c32..ab7331b95 100644
--- a/CmdLine/GitAnnex/Options.hs
+++ b/CmdLine/GitAnnex/Options.hs
@@ -12,6 +12,7 @@ module CmdLine.GitAnnex.Options where
 import Control.Monad.Fail as Fail (MonadFail(..))
 import Options.Applicative
 import Data.Time.Clock.POSIX
+import Control.Concurrent.STM
 import qualified Data.Map as M
 
 import Annex.Common
@@ -37,6 +38,7 @@ import CmdLine.GlobalSetter
 import qualified Backend
 import qualified Types.Backend as Backend
 import Utility.HumanTime
+import Utility.DataUnits
 import Annex.Concurrent
 
 -- Global options that are accepted by all git-annex sub-commands,
@@ -233,11 +235,12 @@ annexedMatchingOptions = concat
 	, fileMatchingOptions' Limit.LimitAnnexFiles
 	, combiningOptions
 	, timeLimitOption
+	, sizeLimitOption
 	]
 
 -- Matching options that can operate on keys as well as files.
 keyMatchingOptions :: [GlobalOption]
-keyMatchingOptions = keyMatchingOptions' ++ combiningOptions ++ timeLimitOption
+keyMatchingOptions = keyMatchingOptions' ++ combiningOptions ++ timeLimitOption ++ sizeLimitOption
 
 keyMatchingOptions' :: [GlobalOption]
 keyMatchingOptions' = 
@@ -435,6 +438,19 @@ timeLimitOption =
 		let cutoff = start + durationToPOSIXTime duration
 		Annex.changeState $ \s -> s { Annex.timelimit = Just (duration, cutoff) }
 
+sizeLimitOption :: [GlobalOption]
+sizeLimitOption =
+	[ globalOption setsizelimit $ option (maybeReader (readSize dataUnits))
+		( long "size-limit" <> metavar paramSize
+		<> help "total size of annexed files to process"
+		<> hidden
+		)
+	]
+  where
+	setsizelimit n = setAnnexState $ do
+		v <- liftIO $ newTVarIO n
+		Annex.changeState $ \s -> s { Annex.sizelimit = Just v }
+
 data DaemonOptions = DaemonOptions
 	{ foregroundDaemonOption :: Bool

(Diff truncated)
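
The core of the checkSizeLimit function in the CmdLine/Action.hs hunk above is an atomic budget decrement. A minimal sketch of that pattern in plain STM (deductBudget is a hypothetical name, not git-annex API):

	import Control.Concurrent.STM

	-- Deduct sz from the remaining byte budget. True means the action
	-- fits (and the budget was decremented); False means skip it.
	deductBudget :: TVar Integer -> Integer -> IO Bool
	deductBudget budget sz = atomically $ do
	    n <- readTVar budget
	    let n' = n - sz
	    if n' >= 0
	        then writeTVar budget n' >> return True
	        else return False

Doing the read and the write in a single atomically block is what makes this safe under -J concurrency: two workers cannot both claim the last of the budget.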
comment
diff --git a/doc/forum/distributed_borg/comment_3_f56ac8a714facc8aafb60d3da815021b._comment b/doc/forum/distributed_borg/comment_3_f56ac8a714facc8aafb60d3da815021b._comment
new file mode 100644
index 000000000..c15084376
--- /dev/null
+++ b/doc/forum/distributed_borg/comment_3_f56ac8a714facc8aafb60d3da815021b._comment
@@ -0,0 +1,17 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2021-06-04T17:53:51Z"
+ content="""
+Yeah, that's the pattern for any special remote: Initialize once and
+enableremote everywhere else. Otherwise you have a bunch of different
+special remotes that happened to be initialized more or less the same
+but git-annex doesn't know you consider them all to be the same place.
+
+I'd be keen to improve whatever docs might have led to the multiple
+initremote mistake. The man page for initremote does
+say to use enableremote in other clones. But maybe you were
+following a page like 
+[[tips/using_borg_for_efficient_storage_of_old_annexed_files]] and just
+assumed you'd follow that same procedure in each clone?
+"""]]

comment
diff --git a/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_17_c41b669837be95b731bd68f1b3269fef._comment b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_17_c41b669837be95b731bd68f1b3269fef._comment
new file mode 100644
index 000000000..50ec39f1f
--- /dev/null
+++ b/doc/todo/Avoid_lengthy___34__Scanning_for_unlocked_files_...__34__/comment_17_c41b669837be95b731bd68f1b3269fef._comment
@@ -0,0 +1,11 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 17"""
+ date="2021-06-04T17:37:30Z"
+ content="""
+The scan could be done lazily, but there are situations that use the
+database where unexpectedly taking a much longer time than usual
+would be a real problem. For example "git add".
+
+The bloom filter idea does not work.
+"""]]
diff --git a/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_3_f4a816ecc43528913f6a97bf1c50dd38._comment b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_3_f4a816ecc43528913f6a97bf1c50dd38._comment
new file mode 100644
index 000000000..8b4bcb317
--- /dev/null
+++ b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_3_f4a816ecc43528913f6a97bf1c50dd38._comment
@@ -0,0 +1,12 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 3"""
+ date="2021-06-04T17:45:21Z"
+ content="""
+It is not very useful to detect if a key is used by more than one file if
+you don't know the files. In any case, yes, the keys db is used for a large
+number of things, when it comes to unlocked files.
+
+[[todo/numcopies_check_other_files_using_same_key]] has some thoughts on
+--all, but I doubt it will make sense to change --all.
+"""]]
diff --git a/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_4_2d39ba9ea61fd8006d1fd4fdde8fc055._comment b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_4_2d39ba9ea61fd8006d1fd4fdde8fc055._comment
new file mode 100644
index 000000000..b16ba840a
--- /dev/null
+++ b/doc/todo/speed_up_keys_db_update_with_git_streaming/comment_4_2d39ba9ea61fd8006d1fd4fdde8fc055._comment
@@ -0,0 +1,14 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 4"""
+ date="2021-06-04T17:49:43Z"
+ content="""
+Keys with extensions do not necessarily have the same extension as used in
+the worktree files that include/exclude match on.
+
+I'm not sure why all these wild ideas are being thrown out there when this
+todo is about a specific, simple improvement that will speed up the git
+part of the scanning by about 3x? It's like you somehow consider this an
+emergency where increasingly wild measures have to be taken to prevent me
+from making a terrible mistake?
+"""]]

comment
diff --git a/doc/todo/whishlist__58___kde-connect_as_a_transport/comment_1_14426ad1fa5dfc32d290da484c1e4c7b._comment b/doc/todo/whishlist__58___kde-connect_as_a_transport/comment_1_14426ad1fa5dfc32d290da484c1e4c7b._comment
new file mode 100644
index 000000000..a6a412057
--- /dev/null
+++ b/doc/todo/whishlist__58___kde-connect_as_a_transport/comment_1_14426ad1fa5dfc32d290da484c1e4c7b._comment
@@ -0,0 +1,21 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-04T17:23:06Z"
+ content="""
+From what I can see, kde-connect lets the phone be used to run specific,
+pre-selected commands on the linux computer it's paired with. Not the other
+way around. <https://userbase.kde.org/KDE_Connect/Tutorials/Adding_commands>
+Also being able to run a command on a phone is rather a long way from what
+git-annex actually needs.
+
+It seems like it would be possible to use its remote-filesystem-via-sftp
+feature. <https://rafaelc.org/tech/p/mounting-kde-connect-filesystem-via-cli/>
+That would be a useful alternative to adb. I don't think git-annex needs
+any changes to use it that way, just mount it with sshfs and
+initremote a directory special remote with exporttree=yes and
+importtree=yes.
+
+See also [[add_sftp_special_remote]] which if it were implemented would
+avoid needing to use sshfs.
+"""]]

avoid displaying the scanning annexed files message when repo is not large
Avoids users thinking this scan is a big deal, when it's not in the
majority of repos.
showSideActionAfter has some ugly caveats, since it has to display in
the background of another action. I could not see a better way to do it
and it works fine in this particular case. It also doesn't really belong
in Annex.Concurrent, but cannot go in Messages due to an import loop.
Sponsored-by: Dartmouth College's Datalad project
diff --git a/Annex/Concurrent.hs b/Annex/Concurrent.hs
index f341e21d0..cb22b6a46 100644
--- a/Annex/Concurrent.hs
+++ b/Annex/Concurrent.hs
@@ -1,6 +1,6 @@
 {- git-annex concurrent state
  -
- - Copyright 2015-2020 Joey Hess <id@joeyh.name>
+ - Copyright 2015-2021 Joey Hess <id@joeyh.name>
  -
  - Licensed under the GNU AGPL version 3 or higher.
  -}
@@ -19,8 +19,10 @@ import Types.Concurrency
 import Types.CatFileHandles
 import Annex.CheckAttr
 import Annex.CheckIgnore
+import Utility.ThreadScheduler
 
 import qualified Data.Map as M
+import Control.Concurrent.Async
 
 setConcurrency :: ConcurrencySetting -> Annex ()
 setConcurrency (ConcurrencyCmdLine s) = setConcurrency' s ConcurrencyCmdLine
@@ -98,3 +100,23 @@ mergeState st = do
 		uncurry addCleanupAction
 	Annex.Queue.mergeFrom st'
 	changeState $ \s -> s { errcounter = errcounter s + errcounter st' }
+
+{- Display a message, only when the action runs for a long enough
+ - amount of time.
+ - 
+ - The action should not display any other messages, progress, etc;
+ - if it did there could be some scrambling of the display since the
+ - message display could happen at the same time as other output,
+ - or after it.
+ -}
+showSideActionAfter :: Microseconds -> String -> Annex a -> Annex a
+showSideActionAfter t m a = do
+	waiter <- liftIO $ async $ unboundDelay t
+	let display = liftIO (waitCatch waiter) >>= \case
+		Left _ -> return ()
+		Right _ -> showSideAction m
+	displayer <- liftIO . async =<< forkState display
+	let cleanup = do
+		liftIO $ cancel waiter
+		join (liftIO (wait displayer))
+	a `finally` cleanup
diff --git a/Annex/Init.hs b/Annex/Init.hs
index 124e28865..4bd0955ea 100644
--- a/Annex/Init.hs
+++ b/Annex/Init.hs
@@ -37,6 +37,7 @@ import Annex.UUID
 import Annex.WorkTree
 import Annex.Fixup
 import Annex.Path
+import Annex.Concurrent
 import Config
 import Config.Files
 import Config.Smudge
@@ -133,8 +134,8 @@ initialize' mversion = checkInitializeAllowed $ do
 		then configureSmudgeFilter
 		else deconfigureSmudgeFilter
 	unlessM isBareRepo $ do
-		showSideAction "scanning for annexed files"
-		scanAnnexedFiles
+		showSideActionAfter oneSecond "scanning for annexed files" $
+			scanAnnexedFiles
 		hookWrite postCheckoutHook
 		hookWrite postMergeHook
 	AdjustedBranch.checkAdjustedClone >>= \case
diff --git a/Utility/ThreadScheduler.hs b/Utility/ThreadScheduler.hs
index ef69ead81..9ab94d911 100644
--- a/Utility/ThreadScheduler.hs
+++ b/Utility/ThreadScheduler.hs
@@ -15,6 +15,7 @@ module Utility.ThreadScheduler (
 	threadDelaySeconds,
 	waitForTermination,
 	oneSecond,
+	unboundDelay,
 ) where
 
 import Control.Monad
diff --git a/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_2_2cbfb22e8f1d89d4f78ade565554cb1c._comment b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_2_2cbfb22e8f1d89d4f78ade565554cb1c._comment
new file mode 100644
index 000000000..0142d7855
--- /dev/null
+++ b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_2_2cbfb22e8f1d89d4f78ade565554cb1c._comment
@@ -0,0 +1,9 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 2"""
+ date="2021-06-04T17:14:42Z"
+ content="""
+Made the scanning message not be displayed unless it takes at least 1
+second. Of course, if some test suite is still looking at that message,
+it will break...
+"""]]

comment
diff --git a/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_1_2ff094dd4ff1fb05db90782d381441d3._comment b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_1_2ff094dd4ff1fb05db90782d381441d3._comment
new file mode 100644
index 000000000..7e90b2c5f
--- /dev/null
+++ b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch/comment_1_2ff094dd4ff1fb05db90782d381441d3._comment
@@ -0,0 +1,22 @@
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 1"""
+ date="2021-06-04T16:03:34Z"
+ content="""
+Well, the init scanning is quite optimised by now, and since it will find no
+annexed objects, there are no database writes needed, which are the slower
+part of that.
+
+You will pay the price though when you later check out the master
+branch, since it then has to scan the delta. And that scanning is less
+optimised. It would be more beneficial to speed up that scanning
+(reconcileStaged), which should be doable by using the git cat-file --batch
+trick.
+
+I think the effect is somewhat psychological; if it says it's doing
+a scan then people are going to feel on guard that it's expensive. I know that
+sometimes git avoids displaying certain progress messages unless it's
+determined the operation is going to take a long time (eg git status will
+show a progress display in some circumstances but not commonly). That could
+be useful here, and probably in other parts of git-annex.
+"""]]

about "scanning for annexed" while in git-annex branch
diff --git a/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn
new file mode 100644
index 000000000..b5584f13a
--- /dev/null
+++ b/doc/todo/init__58__do_not_bother_scanning_if_in_git-annex_branch.mdwn
@@ -0,0 +1,31 @@
+a very minor issue, and probably should not happen in real life since no one should check out the git-annex branch (besides maybe to fix up something manually), but I still do not see the point of scanning for unlocked/annexed files if `init` is run in that branch.  ATM ([8.20210428-270-g651fe3f39](https://git.kitenet.net/index.cgi/git-annex.git/commit/?id=8.20210429-g651fe3f39)) I see
+
+```
+$> bash annex-scanning-git-annex
+> set -eu
+>> mktemp -d /home/yoh/.tmp/dl-XXXXXXX
+> cd /home/yoh/.tmp/dl-yHMtPGl
+> git clone --branch git-annex http://datasets.datalad.org/.git out
+Cloning into 'out'...
+Fetching objects: 11612, done.
+> cd out
+> git branch
+* git-annex
+> git annex init
+init  (scanning for annexed files...)
+ok
+(recording state in git...)
+
+real	0m0.622s
+user	0m0.435s
+sys	0m0.163s
+
+```
+
+so it reports scanning while in the git-annex branch, although it goes quickly (there are only about 1000 keys there), so maybe it doesn't even actually do any scanning and it is just a matter of reporting?
+
+
+Actually -- it made me think: is that scanning branch-specific? Then what would happen if e.g. master has no unlocked files in the tree and some other branch has unlocked files in the tree -- could I checkout/switch between branches without causing `git-annex` to redo its scanning, and would that require a manual `git annex init`?
+
+[[!meta author=yoh]]
+[[!tag projects/datalad]]

Added a comment
diff --git a/doc/forum/distributed_borg/comment_2_1dcf100a28991c664681d78ed4c12e39._comment b/doc/forum/distributed_borg/comment_2_1dcf100a28991c664681d78ed4c12e39._comment
new file mode 100644
index 000000000..205e2f2b8
--- /dev/null
+++ b/doc/forum/distributed_borg/comment_2_1dcf100a28991c664681d78ed4c12e39._comment
@@ -0,0 +1,13 @@
+[[!comment format=mdwn
+ username="alt"
+ subject="comment 2"
+ date="2021-06-04T10:13:06Z"
+ content="""
+Leave it to the Grand Wizard himself :)
+
+With a few tweaks based on your explanation, this appears to be working smoothly. I think our issue was caused by attempting to connect repos that were individually initialized (i.e., with `git init`, `git annex init`, and `git annex initremote` on each workstation); by performing this initialization routine only on a single workstation, then following through with `git clone` and `git annex enableremote` on each additional workstation, the syncing works as expected.
+
+Thank you for your work and guidance! This is very exciting.
+
+The next thing to figure out is Borg repo mirroring to alleviate the overhead caused by Step 1(b) in the procedure above. Currently, the number of `borg create` operations each workstation must perform is multiplied by the number of Borg special remotes, which obviously doesn’t scale well. Ideally, a workstation could create an archive on a single server—say, the nearest available—offloading to the server the burden of creating archives on the remaining Borg repos. It sounds good in my head, but I struggle to find prior art for something like Borg-based swarms for eventual consistency.
+"""]]

diff --git a/doc/todo/whishlist__58___kde-connect_as_a_transport.mdwn b/doc/todo/whishlist__58___kde-connect_as_a_transport.mdwn
new file mode 100644
index 000000000..4961f3680
--- /dev/null
+++ b/doc/todo/whishlist__58___kde-connect_as_a_transport.mdwn
@@ -0,0 +1,7 @@
+Running an SSH server on a phone takes quite some setup and could perhaps even be considered insecure.
+
+KDE-connect provides an easy-to-set-up method of creating a transport tunnel between phone and computer. Many users might actually have it set up already.
+
+Apparently, it can even be used to connect multiple desktops with one another, but I haven't tested that. This could be an alternative (or perhaps even a replacement?) to git-annex's current pairing mechanism.
+
+KDE-connect offers remote command execution and file sharing, so it should be possible to use it for git-annex's purposes.