Recent changes to this wiki:
Added a comment
diff --git a/doc/special_remotes/rclone/comment_4_dc8884bf9f6bd6bdac73f26fad417355._comment b/doc/special_remotes/rclone/comment_4_dc8884bf9f6bd6bdac73f26fad417355._comment new file mode 100644 index 0000000000..66a9df0119 --- /dev/null +++ b/doc/special_remotes/rclone/comment_4_dc8884bf9f6bd6bdac73f26fad417355._comment @@ -0,0 +1,16 @@ +[[!comment format=mdwn + username="mike@2d6d71f56ce2a992244350475251df87c26fe351" + nickname="mike" + avatar="http://cdn.libravatar.org/avatar/183fa439752e2f0c6f39ede658d81050" + subject="comment 4" + date="2024-09-12T15:40:24Z" + content=""" +Here are a few pointers for switching from `git-annex-remote-rclone` (old helper program) to `rclone gitannex` (rclone's builtin support): + +0. Figure out `rcloneprefix` (directory relative to the rclone remote (rclone term here)) and `rclonelayout` (layout of the git-annex content therein). If you set it up just like in `git-annex-remote-rclone`'s README, those are `git-annex` and `lower`. +1. Update rclone and git-annex +2. Rename the old remote, `git remote rename my_rclone_remote my_rclone_remote.old; git annex renameremote my_rclone_remote my_rclone_remote.old` +3. Create a new remote, copying the encryption settings: `git annex initremote my_rclone_remote --sameas=my_rclone_remote.old type=rclone rcloneremotename=my_rclone_remote rcloneprefix=git-annex rclonelayout=lower` + +It might be possible to just change the type of the remote but at the time I'm writing this, that didn't work so I renamed the old remote and created a new one, with `--sameas` to not lose any encryption settings. +"""]]
Added a comment: 👍 +1 for encrypting the annex on regular git remotes
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_1_d8a2d51f7caec8cf3ce836e79897e8c1._comment b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_1_d8a2d51f7caec8cf3ce836e79897e8c1._comment new file mode 100644 index 0000000000..1726d34832 --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site/comment_1_d8a2d51f7caec8cf3ce836e79897e8c1._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="nobodyinperson" + avatar="http://cdn.libravatar.org/avatar/736a41cd4988ede057bae805d000f4f5" + subject="👍 +1 for encrypting the annex on regular git remotes" + date="2024-09-12T14:51:20Z" + content=""" +Funny, playing around with my own forgejo-aneksajo instance, I thought about exactly that 😀 Being able to encrypt only the annex but keeping the repo open would be cool. +"""]]
diff --git a/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn new file mode 100644 index 0000000000..7064048b83 --- /dev/null +++ b/doc/todo/encrypt_just_the_annex_on_git+annex_hosting_site.mdwn @@ -0,0 +1,14 @@ +Some git hosting sites (e.g. forgejo instances tweaked to support git-annex) can store annexed contents. +The goal here would be to encrypt the annexed file contents, but not the git repo. +What would it take? + +Git-annex encryption can be enabled for special remotes, but in this case there is only a "regular" git remote so there is no way to set the config. + +My first intuition was to initialize a type=git special remote pointing to the same location, but it does not support encryption +(`initremote` fails with `git-annex: Unexpected parameters: encryption keyid`). + +There is the gcrypt special remote (and it worked with the forgejo instance I tried), but it encrypts / obfuscates everything (file names, commits etc.) and turns each push into a force push. + +The advantage of having the annexed files but not the git repo encrypted is that the file tree, commit history, readme and all the things typically displayed by the site would still be viewable (communicating repository layout, contents), but GPG keys would be used to control practical access (possibly on top of the site's access permissions). + +Thanks in advance for considering! -- MSz
Added a comment
diff --git a/doc/forum/Change_a_special_remote__39__s_type/comment_4_8486b348a126f6889b648999d39f3631._comment b/doc/forum/Change_a_special_remote__39__s_type/comment_4_8486b348a126f6889b648999d39f3631._comment new file mode 100644 index 0000000000..8791123ff1 --- /dev/null +++ b/doc/forum/Change_a_special_remote__39__s_type/comment_4_8486b348a126f6889b648999d39f3631._comment @@ -0,0 +1,22 @@ +[[!comment format=mdwn + username="mike@2d6d71f56ce2a992244350475251df87c26fe351" + nickname="mike" + avatar="http://cdn.libravatar.org/avatar/183fa439752e2f0c6f39ede658d81050" + subject="comment 4" + date="2024-09-12T05:22:18Z" + content=""" +When trying to change a remote to the new rclone special remote (from `type=external externaltype=rclone`), I encountered this: + +``` +$ git annex enableremote halde-pcloud type=rclone +enableremote halde-pcloud +git-annex: getRemoteConfigValue externaltype found value of unexpected type PassedThrough. This is a bug in git-annex! +CallStack (from HasCallStack): + error, called at ./Annex/SpecialRemote/Config.hs:192:28 in main:Annex.SpecialRemote.Config + getRemoteConfigValue, called at ./Remote/External.hs:920:35 in main:Remote.External +failed +enableremote: 1 failed +``` + +(The reason I tried it this way is that I didn't want to lose the encrypted files (`encryption=shared`)) +"""]]
initial report that addunlocked is not respected during import
diff --git a/doc/bugs/addunlocked_true_is_not_in_effect_for_import.mdwn b/doc/bugs/addunlocked_true_is_not_in_effect_for_import.mdwn new file mode 100644 index 0000000000..9814cca7f4 --- /dev/null +++ b/doc/bugs/addunlocked_true_is_not_in_effect_for_import.mdwn @@ -0,0 +1,56 @@ +### Please describe the problem. + +Here is a reproducer +``` +#!/bin/bash + +export PS4='> ' +set -x +set -eu +cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)" + +mkdir d-in d-repo +echo content >| d-in/file + +function dance() { + git annex import master --from d-in + # but we need to merge it + git merge d-in/master + ls -l + grep -e . * +} + +cd d-repo +git init +git annex init +git annex initremote d-in type=directory directory=../d-in exporttree=yes importtree=yes encryption=none +git config annex.addunlocked true + +ls -l ../d-in +dance + +echo "sample" > samplefile +git annex add samplefile +git commit -m 'Committing explicitly samplefile' +ls -l samplefile +git show + +dance + +``` + +which, even using a super fresh annex 10.20240831+git21-gd717e9aca0-1~ndall+1, shows that files obtained via `annex import` are not added unlocked, whereas files that are `git annex add`ed directly are: + +``` +> ls -l +total 8 +lrwxrwxrwx 1 yoh yoh 178 Sep 11 16:45 file -> .git/annex/objects/zm/2W/SHA256E-s8--434728a410a78f56fc1b5899c3593436e61ab0c731e9072d95e96db290205e53/SHA256E-s8--434728a410a78f56fc1b5899c3593436e61ab0c731e9072d95e96db290205e53 +-rw-rw-r-- 1 yoh yoh 7 Sep 11 16:45 samplefile +``` + +IMHO the behavior of `import` should respect the setting of `annex.addunlocked`. + +This came up while considering `import` for a folder with DANDI stats. For now I will just add them directly. + +[[!meta author=yoh]] +[[!tag projects/dandi]]
initial report on incorrect handling of empty files in adjusted branches mode
diff --git a/doc/bugs/git_diff_in_adj_unlock_reports_diff_for_empty_file.mdwn b/doc/bugs/git_diff_in_adj_unlock_reports_diff_for_empty_file.mdwn new file mode 100644 index 0000000000..fe5de013fb --- /dev/null +++ b/doc/bugs/git_diff_in_adj_unlock_reports_diff_for_empty_file.mdwn @@ -0,0 +1,93 @@ +### Please describe the problem. + +Came up in the course of +- [BF: allow for empty output directory to be specified to run](https://github.com/datalad/datalad/pull/7654#issuecomment-2334087030) + +### What steps will reproduce the problem? + +Here is a bash script + +``` +#!/bin/bash +# https://github.com/datalad/datalad/pull/7654#issuecomment-2334087030 + +export PS4='> ' +set -x +set -eu + +cd "$(mktemp -d ${TMPDIR:-/tmp}/dl-XXXXXXX)" + +function annexsync() { + # call we have in datalad + # git -c diff.ignoreSubmodules=none -c core.quotepath=false annex sync --no-push --no-pull --no-resolvemerge --no-content -c annex.dotfiles=true --no-commit + git annex sync + : +} + +git init +git annex init + +mkdir empty full +touch emptyfile full/emptyfile + +echo c1 > full/withcontent +echo c2 > withcontent + +git annex add * +git commit -m 'Initial commit' + +echo content >| empty/withcontent +touch empty/emptyfile + +git annex add empty/* +git commit -m 'Added empty/ files' + +annexsync + +pwd +ls -l + +git status +git diff + +``` + +which, if run with TMPDIR on a crippled FS, e.g. vfat, reports at the end a `git diff` for all empty files, **but not for files with content**, e.g. using our [eval_under_testloopfs helper](https://github.com/datalad/datalad/blob/maint/tools/eval_under_testloopfs) + +```shell +❯ DATALAD_TESTS_TEMP_FSSIZE=300 tools/eval_under_testloopfs ../trash/adjusted-git-diff.sh +... 
+> ls -l +total 24 +drwxr-xr-x 2 yoh root 8192 Sep 6 09:59 empty +-rwxr-xr-x 1 yoh root 0 Sep 6 09:59 emptyfile +drwxr-xr-x 2 yoh root 8192 Sep 6 09:59 full +-rwxr-xr-x 1 yoh root 3 Sep 6 09:59 withcontent +> git status +On branch adjusted/master(unlocked) +nothing to commit, working tree clean +> git diff +diff --git a/empty/emptyfile b/empty/emptyfile +--- a/empty/emptyfile ++++ b/empty/emptyfile +@@ -1 +0,0 @@ +-/annex/objects/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +diff --git a/emptyfile b/emptyfile +--- a/emptyfile ++++ b/emptyfile +@@ -1 +0,0 @@ +-/annex/objects/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +diff --git a/full/emptyfile b/full/emptyfile +--- a/full/emptyfile ++++ b/full/emptyfile +@@ -1 +0,0 @@ +-/annex/objects/SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855 +I: done, unmounting +``` + +### What version of git-annex are you using? On what operating system? + +first locally with 10.20240430 and then current `10.20240831+git21-gd717e9aca0-1~ndall+1` + +[[!meta author=yoh]] +[[!tag projects/repronim]]
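An aside on reading the diff output above: every hunk shows the same pointer, because all three files are empty and so share git-annex's key for zero-length content — the hex part is the SHA-256 of the empty string and `-s0` records the zero size. A quick Python check (the helper name is mine, not git-annex code; it only rebuilds the key string seen in the diff):

```python
import hashlib

def empty_sha256e_key():
    # Rebuild the "SHA256E-s<size>--<sha256>" key shape from the diff,
    # for zero-length content (size 0, hash of the empty string).
    return "SHA256E-s0--" + hashlib.sha256(b"").hexdigest()

print(empty_sha256e_key())
# -> SHA256E-s0--e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```

This matches the key in all three `/annex/objects/...` pointer lines, confirming the bug is specifically about how empty (zero-byte) files are handled in the adjusted unlocked branch.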
Added a comment
diff --git a/doc/bugs/assistant___40__webapp__41___commited_unlocked_link_to_annex/comment_10_ae87af8282fcc4a06c0b703b3c1f8710._comment b/doc/bugs/assistant___40__webapp__41___commited_unlocked_link_to_annex/comment_10_ae87af8282fcc4a06c0b703b3c1f8710._comment new file mode 100644 index 0000000000..2a90baa16b --- /dev/null +++ b/doc/bugs/assistant___40__webapp__41___commited_unlocked_link_to_annex/comment_10_ae87af8282fcc4a06c0b703b3c1f8710._comment @@ -0,0 +1,64 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="comment 10" + date="2024-09-05T14:52:51Z" + content=""" +but maybe it is actually a separate issue of the unlocked mode, since it does drop the file + +``` +reprostim@reproiner:/data/reprostim$ find .git/annex -iname *377.mkv +.git/annex/objects/Qp/XF/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv +.git/annex/objects/Qp/XF/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv +reprostim@reproiner:/data/reprostim$ git annex drop Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv +drop Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv (locking rolando...) ok +(recording state in git...) 
+reprostim@reproiner:/data/reprostim$ find .git/annex -iname *377.mkv +reprostim@reproiner:/data/reprostim$ cat Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv +/annex/objects/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv + +``` + +but then when I get it, it does not actually copy into the tree: + +``` +reprostim@reproiner:/data/reprostim$ git annex get --json Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv +{\"command\":\"get\",\"error-messages\":[],\"file\":\"Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv\",\"input\":[\"Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv\"],\"key\":\"MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv\",\"note\":\"from rolando...\",\"success\":true} +reprostim@reproiner:/data/reprostim$ cat Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv +/annex/objects/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv +reprostim@reproiner:/data/reprostim$ find .git/annex -iname *377.mkv +.git/annex/objects/Qp/XF/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv +.git/annex/objects/Qp/XF/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv + +``` + +``` +reprostim@reproiner:/data/reprostim$ cat .git/config +[core] + repositoryformatversion = 0 + filemode = true + bare = false + logallrefupdates = true +[annex] + uuid = 9806a90e-4cdd-48cb-b03d-7a113663fce7 + version = 10 + addunlocked = false +[filter \"annex\"] + smudge = git-annex smudge -- %f + clean = git-annex smudge --clean -- %f + process = git-annex filter-process +[remote \"rolando\"] + url = bids@rolando.cns.dartmouth.edu:VIDS/ + fetch = +refs/heads/*:refs/remotes/rolando/* + annex-uuid = 285d851e-77a8-4d31-b24c-fa72deb4d3cc +[branch \"master\"] + remote = rolando + merge = refs/heads/master + +reprostim@reproiner:/data/reprostim$ git annex version +git-annex version: 10.20240831-1~ndall+1 + +``` + + 
+"""]]
Added a comment: ping on this issue : how to recover?
diff --git a/doc/bugs/assistant___40__webapp__41___commited_unlocked_link_to_annex/comment_9_f3ac5f4aab89893b88fb30cced56827b._comment b/doc/bugs/assistant___40__webapp__41___commited_unlocked_link_to_annex/comment_9_f3ac5f4aab89893b88fb30cced56827b._comment new file mode 100644 index 0000000000..35d611dce0 --- /dev/null +++ b/doc/bugs/assistant___40__webapp__41___commited_unlocked_link_to_annex/comment_9_f3ac5f4aab89893b88fb30cced56827b._comment @@ -0,0 +1,23 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="ping on this issue : how to recover?" + date="2024-09-05T14:49:06Z" + content=""" +I got back to this issue, since even after upgrading git-annex to `10.20240831-1~ndall+1` and trying on a sample file which I guess was screwed up, the problem persists: + +``` +reprostim@reproiner:/data/reprostim$ git annex get --json Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv + +reprostim@reproiner:/data/reprostim$ git annex find --in here +Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv + +reprostim@reproiner:/data/reprostim$ ls -lL Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv +-rw-r--r-- 2 reprostim reprostim 72 Sep 5 10:42 Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv + +reprostim@reproiner:/data/reprostim$ cat Videos/2024/08/2024.08.30-11.31.56.000--2024.08.30-11.48.03.377.mkv +/annex/objects/MD5E-s20610854--4fa8311cf5fc0ea247dca2b0ae556bab.377.mkv +``` + +so, I need to figure out how to actually get that key/file here. +"""]]
comment
diff --git a/doc/todo/use_copy__95__file__95__range_for_get_and_copy/comment_1_c04b8a2a226f361bc77876875b6d17f8._comment b/doc/todo/use_copy__95__file__95__range_for_get_and_copy/comment_1_c04b8a2a226f361bc77876875b6d17f8._comment new file mode 100644 index 0000000000..6940730938 --- /dev/null +++ b/doc/todo/use_copy__95__file__95__range_for_get_and_copy/comment_1_c04b8a2a226f361bc77876875b6d17f8._comment @@ -0,0 +1,27 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 1""" + date="2024-09-05T12:56:27Z" + content=""" +The reason for reflink=always is that git-annex wants it to fail when +reflink is not supported and the copy is going to be slow. +Then it falls back to copying the file itself, which allows an interrupted +copy of a large file to be resumed, rather than restarted from the beginning +as cp would do when it's not making a reflink. + +So, at first it seemed to me that the solution will need to involve +git-annex using `copy_file_range` itself. + +But, git-annex would like to checksum the file as it's copying it (unless +annex.verify is not set), in order to avoid needing to re-read it to hash it +after the fact, which would double the disk IO in many cases. +Using `copy_file_range` by default would prevent git-annex from doing that. + +So it needs to either be probed, or be a config setting. And whichever way +git-annex determines it, it may as well use `cp reflink=auto` then +rather than using `copy_file_range` itself. + +I'd certainly rather avoid a config setting if I can. But if this is specific to +NFS on ZFS, I don't know what would be a good way to probe for that? Or is this +happening on NFS when not on ZFS as well? +"""]]
Added a comment: PS
diff --git a/doc/git-annex-fsck/comment_2_44ca99f975e75d0236b8ae6be4d23f72._comment b/doc/git-annex-fsck/comment_2_44ca99f975e75d0236b8ae6be4d23f72._comment new file mode 100644 index 0000000000..ecf8d62307 --- /dev/null +++ b/doc/git-annex-fsck/comment_2_44ca99f975e75d0236b8ae6be4d23f72._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="tapesafer" + avatar="http://cdn.libravatar.org/avatar/8a62b25ea58309a6e15cac10a5c33f1d" + subject="PS" + date="2024-09-04T15:48:01Z" + content=""" +If I am understanding the documentation of the borg special remote, then having something like `appendonly=yes` for the special directory remote would likely help in my scenario. +"""]]
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 17889fff43..a3833085e7 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -29,6 +29,7 @@ Planned schedule of work: ## work notes * Currently working in [[todo/proving_preferred_content_behavior]] + in the `sim` branch. ## items deferred until later for balanced preferred content and maxsize tracking
Added a comment: numcopies & force-trusting is ignored by fsck on readonly directory remotes?
diff --git a/doc/git-annex-fsck/comment_1_80ec8617d99d3f520c22b1e7fd741c16._comment b/doc/git-annex-fsck/comment_1_80ec8617d99d3f520c22b1e7fd741c16._comment new file mode 100644 index 0000000000..ca895f210f --- /dev/null +++ b/doc/git-annex-fsck/comment_1_80ec8617d99d3f520c22b1e7fd741c16._comment @@ -0,0 +1,56 @@ +[[!comment format=mdwn + username="tapesafer" + avatar="http://cdn.libravatar.org/avatar/8a62b25ea58309a6e15cac10a5c33f1d" + subject="numcopies & force-trusting is ignored by fsck on readonly directory remotes?" + date="2024-09-04T14:50:16Z" + content=""" +I have old readonly backup media, say something like + +- `tapeA1/apples.txt` +- `tapeA2/apples.txt` +- `tapeB1/earth.svg` +- `tapeB2/earth.svg` + +I use git-annex special directory remotes to be able to navigate the directory tree that lives on those media (e.g. to decide if and which media I need to find to copy a file from that I need). +I added the remotes like so (they are too big to import with content): + +``` +git annex initremote tapeA1 type=directory directory=/tapes/tapeA1 encryption=none importtree=yes +git annex import master:tapeA1 --from tapeA1 --no-content +git annex merge --allow-unrelated-histories tapeA1/main +``` + +At some point I may buy new hardware and recreate those backup media as proper git-annex remotes, but wouldn't it be great to keep the existing backups as long as they show no sign of bitrot and together hold enough copies? + +Though, git-annex fsck behaves unexpectedly: it seems I cannot force trust these remotes, nor does `--numcopies=0 --mincopies=0` have the desired effect. + +Concretely, when calling `git annex fsck --from=tapeA1 --numcopies=0 --mincopies=0 --trust=tapeA1 --force`, +for every file that is still intact on tapeA1, git-annex fsck reports a failure as follows + +``` +fsck tapeA1/apples.txt + Only these untrusted locations may have copies of tapeA1/apples.txt + abc-def-ghi -- [tapeA1] + Back it up to trusted locations with git-annex copy. 
+failed +``` + +while I'd be happy to (semi)trust tapeA1 or to accept no copies whatsoever. So fsck ignores `--trust=tapeA1 --force` and/or `--numcopies=0 --mincopies=0`, which are common git-annex options that should work for fsck? + +Ideally, I would be able to (semi)trust my readonly tape remotes (which likely should be behind a `--force` as it may lead to data loss in classical directory remote settings). Then I can use git-annex to index those tapes, but also to monitor their health via fsck (so I can over the years replace the tapes that are showing signs of corruption). + +As for the corruption, I emulated bitrot on a test directory remote, which then leads to a fsck failure as follows: + +``` +fsck tapeB2/earth.svg + verification of content failed +(checksum...) + tapeB2/earth.svg: Bad file content; failed to drop fromtapeB2: dropping content from this remote is not supported because it is configured with importtree=yes +``` + +This suffices to detect tapes that should be replaced, and it's kinda expected that files cannot be dropped. + +Somehow fsck does not work as I would expect -- am I misunderstanding the numcopies/mincopies arguments here? Is there really no way to force-trust a directory remote, which to me seems appropriate in this case? Is there another way to achieve what I have in mind with git-annex? + +Thanks for this great piece of software – I also use the assistant in another day-to-day use case and it's simply great! +"""]]
Added a comment: Similar Borg sync issue
diff --git a/doc/bugs/git_annex_sync_borg_fails/comment_5_7929e8cdc6008bb41ee38a7be42cddf6._comment b/doc/bugs/git_annex_sync_borg_fails/comment_5_7929e8cdc6008bb41ee38a7be42cddf6._comment new file mode 100644 index 0000000000..a57439ea6b --- /dev/null +++ b/doc/bugs/git_annex_sync_borg_fails/comment_5_7929e8cdc6008bb41ee38a7be42cddf6._comment @@ -0,0 +1,53 @@ +[[!comment format=mdwn + username="Rick" + avatar="http://cdn.libravatar.org/avatar/bbc227c89f7136fbb191127764e9d02c" + subject="Similar Borg sync issue" + date="2024-09-03T19:40:57Z" + content=""" +I'm also getting `list borg failed` when I run `git annex sync borg`. In my case, syncing succeeds after creating the first borg archive but fails when the borg repo contains a second archive. + +I'm running: + +- git-annex 10.20240731 +- borg 1.4.0 +- NixOS 24.11.20240821.c374d94 (Vicuna) + +To reproduce this problem: + +``` +borg init --encryption=keyfile /path/to/borgrepo +git annex initremote borg type=borg borgrepo=/path/to/borgrepo +borg create /path/to/borgrepo::archive1 `pwd` +git annex sync borg +git annex add newfile +borg create /path/to/borgrepo::archive2 `pwd` +git annex sync borg +``` + +From the debug output of the first run of `git-annex sync`, this was the only ExitFailure line: + +``` +[2024-08-28 19:13:31.056388087] (Utility.Process) process [79595] done ExitFailure 1 +ok +``` + +And the first appearance of process 79595: + +``` +[2024-08-28 19:13:31.011783181] (utility.process) process [79595] call: git [\"--git-dir=.git\",\"--work-tree=.\",\"--literal-pathspecs\",\"commit\",\"-a\",\"-m\",\"git-annex in user@nixos:~/sandbox/gr\"] +``` + +Only once, after running the command a second time, I got the following additional lines: + +``` +[2024-08-28 19:48:41.942245332] (Utility.Process) process [122585] read: borg [\"list\",\"--format\",\"{size}{NUL}{path}{NUL}{extra}{NUL}\",\"/home/user/sandbox/br::archive2\",\"\"] +... +borg list: error: argument PATH: Empty strings are not accepted as paths. 
+[2024-08-28 19:48:42.296294751] (Utility.Process) process [122585] done ExitFailure 2 +``` + +I have set `LANG=C` and `git annex enableremote borg subdir=` as suggested in this thread to no avail. + +Thanks in advance for your help! I have used and loved git-annex for years and am very thankful for the work Joey and others have put into it. I'm planning to buy a git-annex backpack soon. + +"""]]
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index b3ffb08065..17889fff43 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -28,6 +28,8 @@ Planned schedule of work: ## work notes +* Currently working in [[todo/proving_preferred_content_behavior]] + ## items deferred until later for balanced preferred content and maxsize tracking * `git-annex assist --rebalance` of `balanced=foo:2`
sim design
diff --git a/doc/todo/proving_preferred_content_behavior.mdwn b/doc/todo/proving_preferred_content_behavior.mdwn index d352373ea0..534a6decf9 100644 --- a/doc/todo/proving_preferred_content_behavior.mdwn +++ b/doc/todo/proving_preferred_content_behavior.mdwn @@ -80,4 +80,36 @@ Be sure to enforce invariants like numcopies the same as git-annex does. Since users can write preferred content expressions, this should be targeted at being used by end users. +The sim could be run in a clone of a repository, and update location +logs as it runs. This would let the user query with `whereis` and +`maxsize` etc to see what happens. + +Such a repository's location tracking would no longer match reality, +so it would need to be clearly marked as a simulation result, and be +prevented from merging back into another repository. This can be done by +adding a new Difference to the repository. + +The sim would need a map of repositories with connections between them. +Perhaps `git-annex map` could be used? + +For each step of the sim, it would pick a repository from the map +(excluding special remotes), and simulate an operation being run in that +repository, affecting it and its remotes. + +Split brain needs to be simulated, so the operations in the sim should +include pushing and fetching the git-annex branch. The ref of each +git-annex branch of each repository would be stored, with +refs/heads/git-annex being set to the git-annex branch of the repository +it is currently simulating an operation in. + +The other operations would include get, drop, copy, move, sync, all +with preferred content respected. + +May want to also simulate adding files to a repository, which would be +generated (without any actual content) according to simulation parameters. +Also file moves and deletions. `git-annex fuzztest` has some prior art. + +The location log history could be examined at the end of the simulation +to find problems like instability. + [[!tag projects/openneuro]]
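The loop the sim design describes — pick a repository from the map (excluding special remotes), simulate an operation against it, and check invariants like numcopies afterwards — can be caricatured in a few lines of Python. Everything here (the repo map, the two-operation set, the single simulated key) is invented for illustration; the real sim would evaluate actual preferred content expressions and simulate git-annex branch pushes and fetches:

```python
import random

NUMCOPIES = 2

def copies(repos, key):
    # Count how many repositories currently hold the key.
    return sum(key in r["files"] for r in repos.values())

def sim_step(rng, repos, key="file1"):
    # Pick a repository from the map, excluding special remotes.
    name = rng.choice(sorted(n for n, r in repos.items() if not r["special"]))
    repo, op = repos[name], rng.choice(["get", "drop"])
    if op == "get":
        repo["files"].add(key)
    elif key in repo["files"] and copies(repos, key) > NUMCOPIES:
        # Drop only when enough other copies remain, like git-annex does.
        repo["files"].discard(key)
    return name, op

def run(seed=0, steps=100):
    rng = random.Random(seed)
    repos = {
        "laptop": {"special": False, "files": {"file1"}},
        "server": {"special": False, "files": {"file1"}},
        "s3":     {"special": True,  "files": set()},
    }
    for _ in range(steps):
        sim_step(rng, repos)
        assert copies(repos, "file1") >= NUMCOPIES  # invariant holds each step
    return repos
```

Examining the per-step history (rather than just the end state) is what would surface problems like the instability discussed below in the location log.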
fix typo
diff --git a/doc/tuning.mdwn b/doc/tuning.mdwn index 1860295465..45d0edd20c 100644 --- a/doc/tuning.mdwn +++ b/doc/tuning.mdwn @@ -16,7 +16,7 @@ in `.git/annex/objects`: It's very important to keep in mind that this makes a nonstandard format git-annex repository. In general, this cannot safely be used with -git-annex older than version 5.20150128. Older version of git-annex will +git-annex older than version 5.20150128. Older versions of git-annex will not understand and will get confused and perhaps do bad things. Also, it's not safe to merge two separate git repositories that have been
treat "not present" in preferred content as invalid
Detect when a preferred content expression contains "not present", which
would lead to repeatedly getting and then dropping files, and make it never
match. This also applies to "not balanced" and "not sizebalanced".
--explain will tell the user when this happens
Note that getMatcher calls matchMrun' and does not check for unstable
negated limits. While there is no --present anyway, if there was,
it would not make sense for --not --present to complain about
instability and fail to match.
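The instability this change guards against is easy to see in a toy model: with "not present" as the preferred content expression, an absent file is wanted (so it gets fetched), and a present file is unwanted (so it gets dropped), forever. A minimal Python sketch of that oscillation (not git-annex's matcher, just the boolean logic):

```python
def wants(present):
    # The preferred content expression "not present":
    # a file is wanted exactly when it is not here.
    return not present

def simulate(steps):
    # Each step: get the file if wanted, drop it if not wanted.
    present = False
    history = []
    for _ in range(steps):
        present = True if wants(present) else False
        history.append(present)
    return history

print(simulate(6))  # -> [True, False, True, False, True, False]
```

The state never converges, which is why such an expression is now made to never match instead of driving endless get/drop churn.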
diff --git a/Annex/FileMatcher.hs b/Annex/FileMatcher.hs index a2bfd23dce..474680e75c 100644 --- a/Annex/FileMatcher.hs +++ b/Annex/FileMatcher.hs @@ -95,10 +95,26 @@ checkMatcher' (matcher, (MatcherDesc matcherdesc)) mi lu notpresent = go = do (matches, desc) <- runWriterT $ matchMrun' matcher $ \op -> matchAction op lu notpresent mi - explain (mkActionItem mi) $ UnquotedString <$> - describeMatchResult matchDesc desc - ((if matches then "matches " else "does not match ") ++ matcherdesc ++ ": ") - return matches + let descmsg = UnquotedString <$> + describeMatchResult + (\o -> matchDesc o . Just) desc + ((if matches then "matches " else "does not match ") ++ matcherdesc ++ ": ") + let unstablenegated = filter matchNegationUnstable (findNegated matcher) + if null unstablenegated + then do + explain (mkActionItem mi) descmsg + return matches + else do + let s = concat + [ ", but that expression is not stable due to negated use of " + , unwords $ nub $ + map (fromMatchDesc . flip matchDesc Nothing) + unstablenegated + , ", so will not be used" + ] + explain (mkActionItem mi) $ Just $ + fromMaybe mempty descmsg <> UnquotedString s + return False fileMatchInfo :: RawFilePath -> Maybe Key -> Annex MatchInfo fileMatchInfo file mkey = do @@ -282,6 +298,7 @@ call desc (Right sub) = Right $ Operation $ MatchFiles , matchNeedsKey = any matchNeedsKey sub , matchNeedsLocationLog = any matchNeedsLocationLog sub , matchNeedsLiveRepoSize = any matchNeedsLiveRepoSize sub + , matchNegationUnstable = any matchNegationUnstable sub , matchDesc = matchDescSimple desc } call _ (Left err) = Left err diff --git a/CHANGELOG b/CHANGELOG index 8fc1eacbcf..9abab02c85 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -1,6 +1,10 @@ git-annex (10.20240832) UNRELEASED; urgency=medium - * Fix --debug display of onlyingroup preferred content expression. 
+ * Detect when a preferred content expression contains "not present", + which would lead to repeatedly getting and then dropping files, + and make it never match. This also applies to + "not balanced" and "not sizebalanced". + * Fix --explain display of onlyingroup preferred content expression. -- Joey Hess <id@joeyh.name> Tue, 03 Sep 2024 12:38:42 -0400 diff --git a/Limit.hs b/Limit.hs index 40b571f8e2..134d70e826 100644 --- a/Limit.hs +++ b/Limit.hs @@ -69,7 +69,8 @@ getMatcher = run <$> getMatcher' Utility.Matcher.matchMrun' matcher $ \o -> matchAction o NoLiveUpdate S.empty i explain (mkActionItem i) $ UnquotedString <$> - Utility.Matcher.describeMatchResult matchDesc desc + Utility.Matcher.describeMatchResult + (\o -> matchDesc o . Just) desc (if match then "matches:" else "does not match:") return match @@ -115,6 +116,7 @@ limitInclude glob = Right $ MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = "include" =? glob } @@ -130,6 +132,7 @@ limitExclude glob = Right $ MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = "exclude" =? glob } @@ -156,6 +159,7 @@ limitIncludeSameContent glob = Right $ MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = "includesamecontent" =? glob } @@ -172,6 +176,7 @@ limitExcludeSameContent glob = Right $ MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = "excludesamecontent" =? glob } @@ -249,6 +254,7 @@ matchMagic limitname querymagic selectprovidedinfo selectuserprovidedinfo (Just , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = limitname =? 
glob } where @@ -277,6 +283,7 @@ addUnlocked = addLimit $ Right $ MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple "unlocked" } @@ -288,6 +295,7 @@ addLocked = addLimit $ Right $ MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple "locked" } @@ -324,6 +332,7 @@ addIn s = do , matchNeedsKey = True , matchNeedsLocationLog = not inhere , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = "in" =? s } checkinuuid u notpresent key @@ -355,6 +364,7 @@ addExpectedPresent = do , matchNeedsKey = True , matchNeedsLocationLog = True , matchNeedsLiveRepoSize = False + , matchNegationUnstable = True , matchDesc = matchDescSimple "expected-present" } @@ -373,6 +383,7 @@ limitPresent u = MatchFiles , matchNeedsKey = True , matchNeedsLocationLog = not (isNothing u) , matchNeedsLiveRepoSize = False + , matchNegationUnstable = True , matchDesc = matchDescSimple "present" } @@ -385,6 +396,7 @@ limitInDir dir desc = MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple desc } where @@ -418,6 +430,7 @@ limitCopies want = case splitc ':' want of , matchNeedsKey = True , matchNeedsLocationLog = True , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = "copies" =? 
want } go' n good notpresent key = do @@ -444,6 +457,7 @@ limitLackingCopies desc approx want = case readish want of , matchNeedsKey = True , matchNeedsLocationLog = True , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple desc } Nothing -> Left "bad value for number of lacking copies" @@ -475,6 +489,7 @@ limitUnused = MatchFiles , matchNeedsKey = True , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple "unused" } where @@ -499,6 +514,7 @@ limitAnything = MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple "anything" } @@ -515,6 +531,7 @@ limitNothing = MatchFiles , matchNeedsKey = False , matchNeedsLocationLog = False , matchNeedsLiveRepoSize = False + , matchNegationUnstable = False , matchDesc = matchDescSimple "nothing" } (Diff truncated)
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index e76b32fb5c..b3ffb08065 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -21,8 +21,8 @@ Planned schedule of work: * July: p2p protocol over http * August, part 1: git-annex proxy support for exporttree * August, part 2: balanced preferred content -* September: streaming through proxy to special remotes (especially S3) * October: proving behavior of balanced preferred content with proxies +* September: streaming through proxy to special remotes (especially S3) [[!tag projects/openneuro]] diff --git a/doc/todo/proving_preferred_content_behavior.mdwn b/doc/todo/proving_preferred_content_behavior.mdwn index b98166fade..0b1f1fee0b 100644 --- a/doc/todo/proving_preferred_content_behavior.mdwn +++ b/doc/todo/proving_preferred_content_behavior.mdwn @@ -35,18 +35,9 @@ matter the sizes of the underlying repositories, but balanced preferred content does take repository fullness into account, which further complicates fully understanding the behavior. -Notice that `balanced()` (in the current design) is not stable when used -on its own, and has to be used as part of a larger expression to make it -stable, eg: - - ((balanced(backup) and not (copies=backup:1)) or present - -So perhaps `balanced()` should include the other checks in it, -to avoid the user shooting themselves in the foot. On the other -hand, if `balanced()` implicitly contains `present`, then `not balanced()` -would include `not present`, which is bad! - -(For that matter, what does `not balanced()` even do currently?) +Notice that `fullbalanced()` is not stable when used +on its own, and so `balanced()` adds an "or present" to stabilize it. +And so `not balanced()` includes `not present`, which is bad! ## proof
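The instability described in that doc change — `balanced()` carrying an implicit `or present` to stabilize `fullbalanced()`, so that `not balanced()` implies `not present` — is just De Morgan's law at work. A small illustrative sketch (plain Python with made-up names, not git-annex code) of why the negated form oscillates:

```python
# Illustrative sketch (not git-annex code): why negating an expression
# that contains an implicit "or present" is unstable.
def balanced(wants_content: bool, present: bool) -> bool:
    # balanced() = fullbalanced() or present  (the stabilized form)
    return wants_content or present

def not_balanced(wants_content: bool, present: bool) -> bool:
    # By De Morgan: not (fullbalanced or present)
    #             = (not fullbalanced) and (not present)
    # The "not present" term flips the result as soon as the content
    # arrives, causing a get/drop cycle.
    return not balanced(wants_content, present)

# A repository that does not want the file and does not have it:
assert not_balanced(False, False) is True   # wants to get it...
# ...but once the file is present, the expression no longer matches:
assert not_balanced(False, True) is False   # so it drops it again
```

This is exactly the repeated get-then-drop behavior that the "Detect when a preferred content expression contains 'not present'" changelog entry above guards against.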
2 level toc
diff --git a/doc/design/p2p_protocol_over_http.mdwn b/doc/design/p2p_protocol_over_http.mdwn index 90d68fa371..ec20de471e 100644 --- a/doc/design/p2p_protocol_over_http.mdwn +++ b/doc/design/p2p_protocol_over_http.mdwn @@ -1,4 +1,4 @@ -[[!toc ]] +[[!toc levels=2]] ## introduction
fix number of headers
diff --git a/doc/design/p2p_protocol_over_http.mdwn b/doc/design/p2p_protocol_over_http.mdwn index 372c21998c..90d68fa371 100644 --- a/doc/design/p2p_protocol_over_http.mdwn +++ b/doc/design/p2p_protocol_over_http.mdwn @@ -295,7 +295,7 @@ Same as v3, except the JSON will not include "plusuuids". Identical to v1. -## POST /git-annex/$uuid/v3/remove-before +### POST /git-annex/$uuid/v3/remove-before Remove a key's content from the server, but only before a specified time. @@ -313,7 +313,7 @@ removal will fail and the server will respond with: `{"removed": false}` This is used to avoid removing content after a point in time where it is no longer locked in other repostitories. -## POST /git-annex/$uuid/v3/gettimestamp +### POST /git-annex/$uuid/v3/gettimestamp Gets the current timestamp from the server.
diff --git a/doc/forum/Copy_portion_of_file_from_remote.mdwn b/doc/forum/Copy_portion_of_file_from_remote.mdwn index 4701835e6b..60470a3f90 100644 --- a/doc/forum/Copy_portion_of_file_from_remote.mdwn +++ b/doc/forum/Copy_portion_of_file_from_remote.mdwn @@ -6,7 +6,7 @@ Unfortunately, to do so, we must download entire recordings of ~1GB each, even i This can take hours. My question is: how hard would it be to download specific ranges of bytes from a remote repository (given start/end cursors)? -Given that git annex can resume interrupted downloads, I assume there is already some code for readings bytes from a remote, starting a specific position in the file. +Given that git annex can resume interrupted downloads, I assume there is already some code for readings bytes from a remote, starting from specific positions in a file. What would be the easiest way of doing this? Is it achievable via git annex' interface? Or would it require a change in git annex itself? (My colleagues and I aren't proficient in Haskell and we can't really maintain binaries for the multiple platforms we work with).
diff --git a/doc/forum/Copy_portion_of_file_from_remote.mdwn b/doc/forum/Copy_portion_of_file_from_remote.mdwn new file mode 100644 index 0000000000..4701835e6b --- /dev/null +++ b/doc/forum/Copy_portion_of_file_from_remote.mdwn @@ -0,0 +1,15 @@ +Hi, + +My peers and I work with longform audio recordings (10-20 hours each). +We often need to sub-sample small portions (typically 1 percent) of many of these recordings in unpredictable ways, for various reasons. +Unfortunately, to do so, we must download entire recordings of ~1GB each, even if we end up using only 1% of each of them. +This can take hours. + +My question is: how hard would it be to download specific ranges of bytes from a remote repository (given start/end cursors)? +Given that git annex can resume interrupted downloads, I assume there is already some code for readings bytes from a remote, starting a specific position in the file. +What would be the easiest way of doing this? Is it achievable via git annex' interface? Or would it require a change in git annex itself? +(My colleagues and I aren't proficient in Haskell and we can't really maintain binaries for the multiple platforms we work with). + +If this requires a patch to git-annex, maybe this would be a feature of interest to a broader share of people? + +Best,
add news item for git-annex 10.20240831
diff --git a/doc/news/version_10.20240430.mdwn b/doc/news/version_10.20240430.mdwn deleted file mode 100644 index 03577d2c9e..0000000000 --- a/doc/news/version_10.20240430.mdwn +++ /dev/null @@ -1,24 +0,0 @@ -git-annex 10.20240430 released with [[!toggle text="these changes"]] -[[!toggleable text=""" * Bug fix: While redundant concurrent transfers were already - prevented in most cases, it failed to prevent the case where - two different repositories were sending the same content to - the same repository. - * addurl, importfeed: Added --verifiable option, which improves - the safety of --fast or --relaxed by letting the content of - annexed files be verified with a checksum that is calculated - on a later download from the web. This will become the default later. - * Added rclone special remote, which can be used without needing - to install the git-annex-remote-rclone program. This needs - a forthcoming version of rclone (1.67.0), which supports - "rclone gitannex". - * sync, assist, import: Allow -m option to be specified multiple - times, to provide additional paragraphs for the commit message. - * reregisterurl: New command that can change an url from being - used by a special remote to being used by the web remote. - * annex.maxextensions configuration controls how many filename - extensions to preserve. - * find: Fix --help for --copies. - Thanks, Gergely Risko - * Windows: Fix escaping output to terminal when using old - versions of MinTTY. - * Added dependency on unbounded-delays."""]] \ No newline at end of file diff --git a/doc/news/version_10.20240831.mdwn b/doc/news/version_10.20240831.mdwn new file mode 100644 index 0000000000..39b752e198 --- /dev/null +++ b/doc/news/version_10.20240831.mdwn @@ -0,0 +1,28 @@ +git-annex 10.20240831 released with [[!toggle text="these changes"]] +[[!toggleable text=""" * Special remotes configured with exporttree=yes annexobjects=yes + can store objects in .git/annex/objects, as well as an exported tree. 
+ * Support proxying to special remotes configured with + exporttree=yes annexobjects=yes, and allow such remotes to be used as + cluster nodes. + * post-retrieve: When proxying is enabled for an exporttree=yes + special remote (or it is a cluster node) and the configured + remote.name.annex-tracking-branch is received, the tree is + exported to the special remote. + * Support "balanced=", "fullybalanced=", "sizebalanced=" and + "fullysizebalanced=" in preferred content expressions. + * Added --rebalance option. + * Added the annex.fullybalancedthreshhold git config. + * maxsize: New command to tell git-annex how large the expected maximum + size of a repository is, and to display repository sizes. + * vicfg: Include maxsize configuration. + * info: Improved speed by using new repository size tracking. + * lookupkey: Allow using --ref in a bare repository. + * export: Added --from option. + * git-remote-annex: Store objects in exportree=yes special remotes + in the same paths used by annexobjects=yes. This is a backwards + compatible change. + * updateproxy, updatecluster: Prevent using an exporttree=yes special + remote that does not have annexobjects=yes, since it will not work. + * The config versioning=true is now reserved for use by versioned special + remotes. External special remotes should not use that config for their + own purposes."""]] \ No newline at end of file
mention sizebalanced as well as balanced
diff --git a/doc/git-annex-common-options.mdwn b/doc/git-annex-common-options.mdwn index c978c96bbc..37e2e0aaf7 100644 --- a/doc/git-annex-common-options.mdwn +++ b/doc/git-annex-common-options.mdwn @@ -79,8 +79,10 @@ Most of these options are accepted by all git-annex commands. * `--rebalance` Changes the behavior of the "balanced" preferred content expression - to be the same as "fullbalanced". When that expression is used, - this can cause a lot of work to be done to rebalance repositories. + to be the same as "fullbalanced" and the "sizebalanced" expression + to be the same as "fullsizebalanced". When those expressions are + used, this can cause a lot of work to be done to rebalance + repositories. * `--time-limit=time`
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 73e9ded242..e76b32fb5c 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -20,22 +20,21 @@ Planned schedule of work: * June: git-annex proxies and clusters * July: p2p protocol over http * August, part 1: git-annex proxy support for exporttree -* August, part 2: [[track_free_space_in_repos_via_git-annex_branch]] -* September, part 1: balanced preferred content -* September, part 2: streaming through proxy to special remotes (especially S3) -* October, part 1: streaming through proxy continued -* October, part 2: proving behavior of balanced preferred content with proxies +* August, part 2: balanced preferred content +* September: streaming through proxy to special remotes (especially S3) +* October: proving behavior of balanced preferred content with proxies [[!tag projects/openneuro]] ## work notes +## items deferred until later for balanced preferred content and maxsize tracking + * `git-annex assist --rebalance` of `balanced=foo:2` sometimes needs several runs to stabalize. May not be a bug, needs reproducing and analysis. - -* Test that live repo size data is correct and really works. + Deferred for proving behavior of balanced preferred content stage. * The assistant is using NoLiveUpdate, but it should be posssible to plumb a LiveUpdate through it from preferred content checking to location log @@ -49,7 +48,8 @@ Planned schedule of work: * getLiveRepoSizes has a filterM getRecentChange over the live updates. This could be optimised to a single sql join. There are usually not many live updates, but sometimes there will be a great many recent changes, - so it might be worth doing this optimisation. + so it might be worth doing this optimisation. Persistent is not capable + of this, would need dependency added on esquelito. 
## completed items for August's work on balanced preferred content diff --git a/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn b/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn index bace415615..5c0dcd5cbf 100644 --- a/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn +++ b/doc/todo/track_free_space_in_repos_via_git-annex_branch.mdwn @@ -110,5 +110,4 @@ sizeOfDownloadsInProgress. It would be possible to make a [[!tag projects/openneuro]] -> Current status: This is implemented, but concurrency issues remain. -> --[[Joey]] +> [[done]]! > --[[Joey]]
document using balanced preferred content in a cluster
diff --git a/doc/tips/clusters.mdwn b/doc/tips/clusters.mdwn index f166558596..d0eaa139ad 100644 --- a/doc/tips/clusters.mdwn +++ b/doc/tips/clusters.mdwn @@ -63,31 +63,6 @@ clusters. A cluster is not a git repository, and so `git pull bigserver-mycluster` will not work. -## preferred content of clusters - -The preferred content of the cluster can be configured. This tells -users what files the cluster as a whole should contain. - -To configure the preferred content of a cluster, as well as other related -things like [[groups|git-annex-group]] and [[required_content]], it's easiest -to do the configuration in a repository that has the cluster as a remote. - -For example: - - $ git-annex wanted bigserver-mycluster standard - $ git-annex group bigserver-mycluster archive - -By default, when a file is uploaded to a cluster, it is stored on every node of -the cluster. To control which nodes to store to, the [[preferred_content]] of -each individual node can be configured. - -It's also a good idea to configure the preferred content of the cluster's -gateway. To avoid files redundantly being stored on the gateway -(which remember, is not a node of the cluster), you might make it not want -any files: - - $ git-annex wanted bigserver nothing - ## setting up a cluster A new cluster first needs to be initialized. Run [[git-annex-initcluster]] in @@ -131,6 +106,41 @@ on more than one at a time will likely be faster. $ git config annex.jobs cpus +## preferred content of clusters + +The preferred content of the cluster can be configured. This tells +users what files the cluster as a whole should contain. + +To configure the preferred content of a cluster, as well as other related +things like [[groups|git-annex-group]] and [[required_content]], it's easiest +to do the configuration in a repository that has the cluster as a remote. 
+ +For example: + + $ git-annex wanted bigserver-mycluster standard + $ git-annex group bigserver-mycluster archive + +By default, when a file is uploaded to a cluster, it is stored on every node +of the cluster. To control which nodes to store to, the [[preferred_content]] +of each individual node can be configured. + +For example, to balance content evenly across nodes: + + $ git-annex groupwanted bigserver-node balanced=bigserver-node + $ git-annex group bigserver-node1 bigserver-node + $ git-annex group bigserver-node2 bigserver-node + $ git-annex group bigserver-node3 bigserver-node + $ git-annex wanted bigserver-node1 groupwanted + $ git-annex wanted bigserver-node2 groupwanted + $ git-annex wanted bigserver-node3 groupwanted + +It's also a good idea to configure the preferred content of the cluster's +gateway. To avoid files redundantly being stored on the gateway +(which remember, is not a node of the cluster), you might make it not want +any files: + + $ git-annex wanted bigserver nothing + ## special remotes as cluster nodes Cluster nodes don't have to be regular git remotes. They can @@ -138,7 +148,7 @@ also be special remotes. Even special remotes with `exporttree=yes` can be used as cluster nodes. Those also need to be configured with -`annexobjects=yes` though. And, will also need to configure +`annexobjects=yes` though. And, you will also need to configure `remote.name.annex-tracking-branch` to the branch that will trigger an update of the exported tree when it is pushed to the cluster gateway.
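One way to picture what `balanced=bigserver-node` does in the cluster example above is deterministic key-to-node assignment across a group. The following is a hypothetical sketch under the assumption that selection hashes the key over the sorted node list; the actual git-annex expression also accounts for repository fullness and wanted copy counts, so treat this only as intuition:

```python
import hashlib

# Hypothetical sketch of balanced node selection: each key is assigned
# to one node of the group deterministically, so every repository can
# agree on placement without coordination. Not the real algorithm.
def pick_nodes(key: str, nodes: list[str], copies: int = 1) -> list[str]:
    ordered = sorted(nodes)
    h = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
    start = h % len(ordered)
    # take `copies` consecutive nodes, wrapping around the ring
    return [ordered[(start + i) % len(ordered)] for i in range(copies)]

nodes = ["bigserver-node1", "bigserver-node2", "bigserver-node3"]
placement = pick_nodes("SHA256E-s1024--abcdef", nodes)
assert len(placement) == 1 and placement[0] in nodes
```

The important property for the cluster setup above is determinism: every node evaluating its own `groupwanted` expression reaches the same answer about which node should hold a given key.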
lookupkey: Allow using --ref in a bare repository.
diff --git a/CHANGELOG b/CHANGELOG index 37b3688b86..db03e5af79 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -18,6 +18,7 @@ git-annex (10.20240831) UNRELEASED; urgency=medium * The config versioning=true is now reserved for use by versioned special remotes. External special remotes should not use that config for their own purposes. + * lookupkey: Allow using --ref in a bare repository. -- Joey Hess <id@joeyh.name> Wed, 31 Jul 2024 15:52:03 -0400 diff --git a/Command/LookupKey.hs b/Command/LookupKey.hs index f191aa1e8b..32df886532 100644 --- a/Command/LookupKey.hs +++ b/Command/LookupKey.hs @@ -15,7 +15,7 @@ import Utility.Terminal import Utility.SafeOutput cmd :: Command -cmd = notBareRepo $ noCommit $ noMessages $ +cmd = noCommit $ noMessages $ command "lookupkey" SectionPlumbing "looks up key used for file" (paramRepeating paramFile) @@ -35,9 +35,11 @@ optParser = LookupKeyOptions run :: LookupKeyOptions -> SeekInput -> String -> Annex Bool run o _ file | refOption o = catKey (Ref (toRawFilePath file)) >>= display - | otherwise = seekSingleGitFile file >>= \case - Nothing -> return False - Just file' -> catKeyFile file' >>= display + | otherwise = do + checkNotBareRepo + seekSingleGitFile file >>= \case + Nothing -> return False + Just file' -> catKeyFile file' >>= display display :: Maybe Key -> Annex Bool display (Just k) = do diff --git a/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn index 7c11340b57..b30487ec90 100644 --- a/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn +++ b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn @@ -37,3 +37,4 @@ git-annex: You cannot run this command in a bare repository. ### Have you had any luck using git-annex before? 
(Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) +> [[fixed|done]] --[[Joey]] diff --git a/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_3_d88061641439dd029939584cc5679a40._comment b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_3_d88061641439dd029939584cc5679a40._comment new file mode 100644 index 0000000000..e191890c70 --- /dev/null +++ b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_3_d88061641439dd029939584cc5679a40._comment @@ -0,0 +1,14 @@ +[[!comment format=mdwn + username="joey" + subject="""comment 3""" + date="2024-08-30T14:47:41Z" + content=""" +Fixed that. + +It kind of seems like metadata could have an option to get the metadata for +a specific ref as well, but since it already has --branch which takes a +branch ref, adding a --ref which takes a file ref seems confusing. Maybe +--fileref? There are a decent number of other commands that also use +parseKeyOptions to support --branch/--key/--all that would also get the new +option if it were implemented. +"""]]
Added a comment
diff --git a/doc/todo/copy__47__move_support_for_pushinsteadOf_/comment_4_34d2d12c16a829498360fbb2515a685f._comment b/doc/todo/copy__47__move_support_for_pushinsteadOf_/comment_4_34d2d12c16a829498360fbb2515a685f._comment new file mode 100644 index 0000000000..23d1d1ceb4 --- /dev/null +++ b/doc/todo/copy__47__move_support_for_pushinsteadOf_/comment_4_34d2d12c16a829498360fbb2515a685f._comment @@ -0,0 +1,16 @@ +[[!comment format=mdwn + username="yarikoptic" + avatar="http://cdn.libravatar.org/avatar/f11e9c84cb18d26a1748c33b48c924b4" + subject="comment 4" + date="2024-08-29T18:35:47Z" + content=""" +just ran into this again with `datalad push` which surprised me (since I do not get into it with regular `git push`), and took me a bit to figure out/find this issue. + +> Unless there's some reason why you need git to pull from the http url rather than from the ssh url? + +It is my pattern of working with git -- clone via public URL whenever possible (so I do not have to load/use any ssh key without necessity; could use the same URLs on public and private hosts alike) and only when needed to push, automagically push via ssh. FWIW I really love such workflow and use it not only for github but other hosting providers too! + +And IMHO indeed it would make total sense for a similar separation of \"use public public/read access route regardless of having or not credentials for private/write, and use secure/authenticated route only if write/push is necessary\" for git-annex too. The utility of `insteadOf` is not allowing for such separation, but at least indeed would allow \"location-wide\" overload of using secure/authenticated even when simpler public access route possible. + +Indeed adding such a feature parity with `git` might break existing setups, but I would say it should only fix a possible divergence and remove the surprise that annex is behaving differently from how git does it. 
IMHO it is unlikely someone had `pushInsteadOf` configured to have `git` push somewhere else (thus git-annex branch going there too) while still somehow interested to use original URL for git-annex. +"""]]
remove stale live changes from reposize database
Reorganized the reposize database directory, and split up a column.
checkStaleSizeChanges needs to run before needLiveUpdate,
otherwise the process won't be holding a lock on its pid file, and
another process could go in and expire the live update it records. It
just so happens that they do get called in the correct order, since
checking balanced preferred content calls getLiveRepoSizes before
needLiveUpdate.
The 1 minute delay between checks is arbitrary, but will avoid excess
work. The downside of it is that, if a process is dropping a file and
gets interrupted, for 1 minute another process can expect a repository
will soon be smaller than it is. And so a process might send data to a
repository when a file is not really going to be dropped from it. But
note that can already happen if a drop takes some time in eg locking and
then fails. So it seems possible that live updates should only be
allowed to increase, rather than decrease, the size of a repository.
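The mechanism this commit message describes — a per-process pid file under a `live` directory, registered by taking an exclusive lock, with staleness detected by attempting a shared lock, and scans rate-limited to once a minute — can be sketched roughly as follows. This is illustrative Python, not the Haskell implementation; `fcntl.flock` stands in for git-annex's lock pool, and the function and state names are made up:

```python
import fcntl
import os
import time

CHECK_INTERVAL = 60  # seconds between staleness scans, as in the commit

def check_stale(livedir: str, state: dict) -> list[int]:
    """Return pids whose lock files are stale (holder no longer running).

    `state` carries our own lock fd and the last-check time between
    calls, playing the role of the MVar in the Haskell code.
    """
    now = time.monotonic()
    if "last_check" in state and now < state["last_check"] + CHECK_INTERVAL:
        return []  # checked recently; skip the scan
    os.makedirs(livedir, exist_ok=True)
    if "own_fd" not in state:
        # Register ourselves first: hold an exclusive lock on our own
        # pid file so other processes will not consider us stale.
        fd = os.open(os.path.join(livedir, str(os.getpid())),
                     os.O_CREAT | os.O_RDWR)
        fcntl.flock(fd, fcntl.LOCK_EX)
        state["own_fd"] = fd
    stale = []
    for name in os.listdir(livedir):
        if name == str(os.getpid()) or not name.isdigit():
            continue
        path = os.path.join(livedir, name)
        fd = os.open(path, os.O_RDWR)
        try:
            # A live process holds an exclusive lock; if a shared lock
            # succeeds, the process that made this file is gone.
            fcntl.flock(fd, fcntl.LOCK_SH | fcntl.LOCK_NB)
            stale.append(int(name))
            os.unlink(path)
        except BlockingIOError:
            pass  # still live
        finally:
            os.close(fd)
    state["last_check"] = now
    return stale
```

The shared-lock probe is what makes this safe against the interrupted-process case the commit describes: a crashed process releases its exclusive lock automatically, so its pid file becomes detectably stale without any timestamp heuristics.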
diff --git a/Annex/Locations.hs b/Annex/Locations.hs index 18b4dd2ed0..da0b4f800a 100644 --- a/Annex/Locations.hs +++ b/Annex/Locations.hs @@ -77,6 +77,7 @@ module Annex.Locations ( gitAnnexImportFeedDbLock, gitAnnexRepoSizeDbDir, gitAnnexRepoSizeDbLock, + gitAnnexRepoSizeLiveDir, gitAnnexScheduleState, gitAnnexTransferDir, gitAnnexCredsDir, @@ -157,7 +158,21 @@ objectDir :: RawFilePath objectDir = P.addTrailingPathSeparator $ annexDir P.</> "objects" {- Annexed file's possible locations relative to the .git directory - - in a non-bare repository. + - in a non-bare eepository. + +{- Checks for other git-annex processes that might have been interrupted + - and left the database populated with stale live size changes. Those + - are removed from the database. + - + - Also registers the current process so that other calls to this will not + - consider it stale while it's running. + - + - This checks the first time it is called, and again if it's been more + - than 1 minute since the last check. + -} +checkStaleSizeChanges :: Db.RepoSizeHandle -> Annex () +checkStaleSizeChanges h = do + undefined - - Normally it is hashDirMixed. However, it's always possible that a - bare repository was converted to non-bare, or that the cripped @@ -520,11 +535,17 @@ gitAnnexImportFeedDbLock r c = gitAnnexImportFeedDbDir r c <> ".lck" {- Directory containing reposize database. -} gitAnnexRepoSizeDbDir :: Git.Repo -> GitConfig -> RawFilePath gitAnnexRepoSizeDbDir r c = - fromMaybe (gitAnnexDir r) (annexDbDir c) P.</> "reposize" + fromMaybe (gitAnnexDir r) (annexDbDir c) P.</> "reposize" P.</> "db" {- Lock file for the reposize database. -} gitAnnexRepoSizeDbLock :: Git.Repo -> GitConfig -> RawFilePath -gitAnnexRepoSizeDbLock r c = gitAnnexRepoSizeDbDir r c <> ".lck" +gitAnnexRepoSizeDbLock r c = + fromMaybe (gitAnnexDir r) (annexDbDir c) P.</> "reposize" P.</> "lock" + +{- Directory containing liveness pid files. 
-} +gitAnnexRepoSizeLiveDir :: Git.Repo -> GitConfig -> RawFilePath +gitAnnexRepoSizeLiveDir r c = + fromMaybe (gitAnnexDir r) (annexDbDir c) P.</> "reposize" P.</> "live" {- .git/annex/schedulestate is used to store information about when - scheduled jobs were last run. -} diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index 084c2c3efd..8da005e2a8 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -17,6 +17,7 @@ import qualified Annex import Annex.Branch (UnmergedBranches(..), getBranch) import qualified Database.RepoSize as Db import Annex.Journal +import Annex.RepoSize.LiveUpdate import Logs import Logs.Location import Logs.UUID @@ -55,6 +56,7 @@ getLiveRepoSizes quiet = do where go sizemap = do h <- Db.getRepoSizeHandle + checkStaleSizeChanges h liveoffsets <- liftIO $ Db.liveRepoOffsets h let calc u (RepoSize size, SizeOffset startoffset) = case M.lookup u liveoffsets of @@ -86,12 +88,12 @@ calcRepoSizes quiet rsv = go `onException` failed calculatefromscratch h = do unless quiet $ showSideAction "calculating repository sizes" - (sizemap, branchsha) <- calcBranchRepoSizes - liftIO $ Db.setRepoSizes h sizemap branchsha - calcJournalledRepoSizes h sizemap branchsha + use h =<< calcBranchRepoSizes - incrementalupdate h oldsizemap oldbranchsha currbranchsha = do - (sizemap, branchsha) <- diffBranchRepoSizes quiet oldsizemap oldbranchsha currbranchsha + incrementalupdate h oldsizemap oldbranchsha currbranchsha = + use h =<< diffBranchRepoSizes quiet oldsizemap oldbranchsha currbranchsha + + use h (sizemap, branchsha) = do liftIO $ Db.setRepoSizes h sizemap branchsha calcJournalledRepoSizes h sizemap branchsha diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index 602e8f374a..49431d0173 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -11,13 +11,20 @@ module Annex.RepoSize.LiveUpdate where import Annex.Common import Logs.Presence.Pure -import qualified Database.RepoSize as Db +import 
Database.RepoSize.Handle import Annex.UUID import Types.FileMatcher +import Annex.LockFile +import Annex.LockPool +import qualified Database.RepoSize as Db import qualified Utility.Matcher as Matcher import Control.Concurrent import System.Process +import Text.Read +import Data.Time.Clock.POSIX +import qualified Utility.RawFilePath as R +import qualified System.FilePath.ByteString as P {- Called when a location log change is journalled, so the LiveUpdate - is done. This is called with the journal still locked, so no concurrent @@ -124,3 +131,59 @@ finishedLiveUpdate lu u k sc = tryNonAsync (putMVar (liveUpdateDone lu) (Just (u, k, sc, finishv))) >>= \case Right () -> void $ tryNonAsync $ takeMVar finishv Left _ -> noop + +{- Checks for other git-annex processes that might have been interrupted + - and left the database populated with stale live size changes. Those + - are removed from the database. + - + - Also registers the current process so that other calls to this will not + - consider it stale while it's running. + - + - This checks the first time it is called, and again if it's been more + - than 1 minute since the last check. + -} +checkStaleSizeChanges :: RepoSizeHandle -> Annex () +checkStaleSizeChanges h@(RepoSizeHandle (Just _) livev) = do + livedir <- calcRepo' gitAnnexRepoSizeLiveDir + pid <- liftIO getCurrentPid + let pidlockfile = show pid + now <- liftIO getPOSIXTime + liftIO (takeMVar livev) >>= \case + Nothing -> do + lck <- takeExclusiveLock $ + livedir P.</> toRawFilePath pidlockfile + go livedir lck pidlockfile now + Just v@(lck, lastcheck) + | now >= lastcheck + 60 -> + go livedir lck pidlockfile now + | otherwise -> + liftIO $ putMVar livev (Just v) + where + go livedir lck pidlockfile now = do + void $ tryNonAsync $ do + lockfiles <- liftIO $ filter (not . 
dirCruft) + <$> getDirectoryContents (fromRawFilePath livedir) + stale <- forM lockfiles $ \lockfile -> + if (lockfile /= pidlockfile) + then case readMaybe lockfile of + Nothing -> return Nothing + Just pid -> checkstale livedir lockfile pid + else return Nothing + let stale' = catMaybes stale + unless (null stale') $ liftIO $ do + Db.removeStaleLiveSizeChanges h (map fst stale') + mapM_ snd stale' + liftIO $ putMVar livev (Just (lck, now)) + + checkstale livedir lockfile pid = + let f = livedir P.</> toRawFilePath lockfile + in tryLockShared Nothing f >>= \case + Nothing -> return Nothing + Just lck -> do + return $ Just + ( StaleSizeChanger (SizeChangeProcessId pid) + , do + dropLock lck + removeWhenExistsWith R.removeLink f + ) +checkStaleSizeChanges (RepoSizeHandle Nothing _) = noop diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index 1ea8c702a2..1fbee8f660 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -29,6 +29,7 @@ module Database.RepoSize ( startingLiveSizeChange, successfullyFinishedLiveSizeChange, removeStaleLiveSizeChange, + removeStaleLiveSizeChanges, recordedRepoOffsets, liveRepoOffsets, ) where @@ -50,6 +51,7 @@ import qualified System.FilePath.ByteString as P import qualified Data.Map.Strict as M import qualified Data.Set as S import Control.Exception +import Control.Concurrent share [mkPersist sqlSettings, mkMigrate "migrateRepoSizes"] [persistLowerCase| -- Corresponds to location log information from the git-annex branch. @@ -67,9 +69,10 @@ AnnexBranch (Diff truncated)
combine 2 queries
diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index c1f1e98d51..1ea8c702a2 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -247,19 +247,18 @@ removeLiveSizeChange u k sc sid = , LiveSizeChangesChange ==. sc ] -getLiveSizeChanges :: SqlPersistM (M.Map UUID [(Key, (SizeChange, SizeChangeId))]) -getLiveSizeChanges = M.fromListWith (++) . map conv <$> selectList [] [] +getLiveSizeChangesMap :: SqlPersistM (M.Map UUID [(Key, (SizeChange, SizeChangeId))]) +getLiveSizeChangesMap = M.fromListWith (++) . map conv <$> getLiveSizeChanges where - conv entity = - let LiveSizeChanges u k sid sc = entityVal entity - in (u, [(k, (sc, sid))]) + conv (LiveSizeChanges u k sid sc) = (u, [(k, (sc, sid))]) -getLiveSizeChanges' :: SqlPersistM [(UUID, Key, SizeChange)] -getLiveSizeChanges' = map conv <$> selectList [] [] +getLiveSizeChangesList :: SqlPersistM [(UUID, Key, SizeChange)] +getLiveSizeChangesList = map conv <$> getLiveSizeChanges where - conv entity = - let LiveSizeChanges u k _sid sc = entityVal entity - in (u, k, sc) + conv (LiveSizeChanges u k _sid sc) = (u, k, sc) + +getLiveSizeChanges :: SqlPersistM [LiveSizeChanges] +getLiveSizeChanges = map entityVal <$> selectList [] [] getSizeChanges :: SqlPersistM (M.Map UUID FileSize) getSizeChanges = M.fromList . map conv <$> selectList [] [] @@ -310,7 +309,7 @@ getRecentChanges = map conv <$> selectList [] [] - redundant with a recent change. 
-} clearRecentChanges :: SqlPersistM () clearRecentChanges = do - live <- getLiveSizeChanges' + live <- getLiveSizeChangesList if null live then deleteWhere ([] :: [Filter RecentChanges]) else do @@ -354,7 +353,7 @@ recordedRepoOffsets (RepoSizeHandle Nothing) = pure mempty liveRepoOffsets :: RepoSizeHandle -> IO (M.Map UUID SizeOffset) liveRepoOffsets (RepoSizeHandle (Just h)) = H.queryDb h $ do sizechanges <- getSizeChanges - livechanges <- getLiveSizeChanges + livechanges <- getLiveSizeChangesMap let us = S.toList $ S.fromList $ M.keys sizechanges ++ M.keys livechanges M.fromList <$> forM us (go sizechanges livechanges) diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 7621f4f878..5db949cec6 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -58,18 +58,14 @@ Planned schedule of work: used. annex.pidlock might otherwise prevent running more than one git-annex at a time. - , or alternatively - when checking a preferred content expression that uses balanced preferred - content. - * The assistant is using NoLiveUpdate, but it should be posssible to plumb a LiveUpdate through it from preferred content checking to location log updating. * `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using - overLocationLogs. In the other path it does not, and this should be fixed - for consistency and correctness. + overLocationLogs. In the other path it does not (any more; it used to), + and this should be fixed for consistency and correctness. * getLiveRepoSizes has a filterM getRecentChange over the live updates. This could be optimised to a single sql join. There are usually not many
avoid reposize database locking overhead when not needed
Only when the preferred content expression being matched uses balanced
preferred content is this overhead needed.
It might be possible to eliminate the locking entirely. Eg, check the
live changes before and after the action and re-run if they are not
stable. For now, this is good enough; it avoids existing preferred
content getting slow. If balanced preferred content turns out to be too
slow to check, that could be tried later.
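The lock-free alternative mentioned above, re-running the check when the live changes are not stable, amounts to an optimistic retry loop. Here is a hypothetical sketch of that idea; `snapshot` is an assumed stand-in for reading the live size changes table, not git-annex's real API:

```haskell
-- Optimistic-concurrency sketch: snapshot the live changes before and
-- after running the action; if the two snapshots differ, the picture
-- was not stable, so run the action again. Hypothetical illustration
-- of the alternative described in the commit message above.
retryUntilStable :: Eq s => IO s -> IO a -> IO a
retryUntilStable snapshot action = do
    before <- snapshot
    r <- action
    after <- snapshot
    if before == after
        then return r
        else retryUntilStable snapshot action
```

The trade-off is that the action may run more than once, so it has to be safe to repeat, which is one reason the commit settles for locking for now.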
diff --git a/Annex/FileMatcher.hs b/Annex/FileMatcher.hs index 3c2840c73d..a2bfd23dce 100644 --- a/Annex/FileMatcher.hs +++ b/Annex/FileMatcher.hs @@ -90,7 +90,7 @@ checkMatcher matcher mkey afile lu notpresent notconfigured d checkMatcher' :: FileMatcher Annex -> MatchInfo -> LiveUpdate -> AssumeNotPresent -> Annex Bool checkMatcher' (matcher, (MatcherDesc matcherdesc)) mi lu notpresent = - checkLiveUpdate lu go + checkLiveUpdate lu matcher go where go = do (matches, desc) <- runWriterT $ matchMrun' matcher $ \op -> @@ -281,6 +281,7 @@ call desc (Right sub) = Right $ Operation $ MatchFiles , matchNeedsFileContent = any matchNeedsFileContent sub , matchNeedsKey = any matchNeedsKey sub , matchNeedsLocationLog = any matchNeedsLocationLog sub + , matchNeedsLiveRepoSize = any matchNeedsLiveRepoSize sub , matchDesc = matchDescSimple desc } call _ (Left err) = Left err diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index aaacb31450..602e8f374a 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -13,6 +13,8 @@ import Annex.Common import Logs.Presence.Pure import qualified Database.RepoSize as Db import Annex.UUID +import Types.FileMatcher +import qualified Utility.Matcher as Matcher import Control.Concurrent import System.Process @@ -95,9 +97,16 @@ needLiveUpdate lu = liftIO $ void $ tryPutMVar (liveUpdateNeeded lu) () -- This serializes calls to the action, so that if the action -- queries getLiveRepoSizes it will not race with another such action -- that may also be starting a live update. 
-checkLiveUpdate :: LiveUpdate -> Annex Bool -> Annex Bool -checkLiveUpdate NoLiveUpdate a = a -checkLiveUpdate lu a = Db.lockDbWhile (const go) go +checkLiveUpdate + :: LiveUpdate + -> Matcher.Matcher (MatchFiles Annex) + -> Annex Bool + -> Annex Bool +checkLiveUpdate NoLiveUpdate _ a = a +checkLiveUpdate lu matcher a + | Matcher.introspect matchNeedsLiveRepoSize matcher = + Db.lockDbWhile (const go) go + | otherwise = a where go = do r <- a diff --git a/Limit.hs b/Limit.hs index b824bcc640..778a7e65d8 100644 --- a/Limit.hs +++ b/Limit.hs @@ -114,6 +114,7 @@ limitInclude glob = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = "include" =? glob } @@ -128,6 +129,7 @@ limitExclude glob = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = "exclude" =? glob } @@ -153,6 +155,7 @@ limitIncludeSameContent glob = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = "includesamecontent" =? glob } @@ -168,6 +171,7 @@ limitExcludeSameContent glob = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = "excludesamecontent" =? glob } @@ -244,6 +248,7 @@ matchMagic limitname querymagic selectprovidedinfo selectuserprovidedinfo (Just , matchNeedsFileContent = True , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = limitname =? 
glob } where @@ -271,6 +276,7 @@ addUnlocked = addLimit $ Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "unlocked" } @@ -281,6 +287,7 @@ addLocked = addLimit $ Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "locked" } @@ -316,6 +323,7 @@ addIn s = do , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = not inhere + , matchNeedsLiveRepoSize = False , matchDesc = "in" =? s } checkinuuid u notpresent key @@ -346,6 +354,7 @@ addExpectedPresent = do , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = True + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "expected-present" } @@ -363,6 +372,7 @@ limitPresent u = MatchFiles , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = not (isNothing u) + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "present" } @@ -374,6 +384,7 @@ limitInDir dir desc = MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple desc } where @@ -406,6 +417,7 @@ limitCopies want = case splitc ':' want of , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = True + , matchNeedsLiveRepoSize = False , matchDesc = "copies" =? 
want } go' n good notpresent key = do @@ -431,6 +443,7 @@ limitLackingCopies desc approx want = case readish want of , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = True + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple desc } Nothing -> Left "bad value for number of lacking copies" @@ -461,6 +474,7 @@ limitUnused = MatchFiles , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "unused" } where @@ -484,6 +498,7 @@ limitAnything = MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "anything" } @@ -499,6 +514,7 @@ limitNothing = MatchFiles , matchNeedsFileContent = False , matchNeedsKey = False , matchNeedsLocationLog = False + , matchNeedsLiveRepoSize = False , matchDesc = matchDescSimple "nothing" } @@ -522,6 +538,7 @@ limitInAllGroup getgroupmap groupname = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = True + , matchNeedsLiveRepoSize = False , matchDesc = "inallgroup" =? groupname } where @@ -547,6 +564,7 @@ limitOnlyInGroup getgroupmap groupname = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = True + , matchNeedsLiveRepoSize = False , matchDesc = "inallgroup" =? groupname (Diff truncated)
Added a comment
diff --git a/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_2_be3a369499c5a745536169c36eb9a321._comment b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_2_be3a369499c5a745536169c36eb9a321._comment new file mode 100644 index 0000000000..076052de6a --- /dev/null +++ b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_2_be3a369499c5a745536169c36eb9a321._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="matrss" + avatar="http://cdn.libravatar.org/avatar/59541f50d845e5f81aff06e88a38b9de" + subject="comment 2" + date="2024-08-28T14:11:36Z" + content=""" +@mih if you need a workaround now, you can parse the key from `git show <branch/commit/tag>:<path>` or even just `git show <blob-id-of-the-file>`. In the case of locked files, it will return something like `.git/annex/objects/.../.../<key>/<key>` (i.e. the symlink target), and in the case of unlocked files it is something like `/annex/objects/<key>`. This is what forgejo-aneksajo does here: <https://codeberg.org/matrss/forgejo-aneksajo/src/branch/forgejo/modules/annex/annex.go#L48-L105>. `lookupkey --ref` would massively simplify that code though. +"""]]
Added a comment: Needed to retrieve single file metadata from bare repo
diff --git a/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_1_19224e627bca195bfcb9a9ebbe45ff55._comment b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_1_19224e627bca195bfcb9a9ebbe45ff55._comment new file mode 100644 index 0000000000..6b66f545b5 --- /dev/null +++ b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository/comment_1_19224e627bca195bfcb9a9ebbe45ff55._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="mih" + avatar="http://cdn.libravatar.org/avatar/f881df265a423e4f24eff27c623148fd" + subject="Needed to retrieve single file metadata from bare repo" + date="2024-08-28T13:58:30Z" + content=""" +I ran into the same issue. My actual goal is to retrieve git-annex metadata for a specific file from a bare repo. I only know branch/commit and the path. `git-annex metadata` can only report for a tree or a key. For the former I need to implement path matching for a potentially voluminous output. For the latter I need to look up the key -- which currently is not supported for a bare repo. +"""]]
Added a comment
diff --git a/doc/todo/migration_to_VURL_by_default/comment_1_1c0e952c2d7da41e53aab3f334de87e0._comment b/doc/todo/migration_to_VURL_by_default/comment_1_1c0e952c2d7da41e53aab3f334de87e0._comment new file mode 100644 index 0000000000..f5c9e56909 --- /dev/null +++ b/doc/todo/migration_to_VURL_by_default/comment_1_1c0e952c2d7da41e53aab3f334de87e0._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="matrss" + avatar="http://cdn.libravatar.org/avatar/59541f50d845e5f81aff06e88a38b9de" + subject="comment 1" + date="2024-08-28T08:47:32Z" + content=""" +While a migration to VURL by default would be great, I think this issue when dealing with external special remotes should first be fixed: <https://git-annex.branchable.com/bugs/VURL_verification_failure_on_first_download/>. Right now, my datalad-cds extension does not explicitly set the URL backend, so it would break if the default was to change to VURL, but I would really like to use VURL with it if it was possible. +"""]]
thoughts
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 05cbf74c44..481df391df 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -41,6 +41,10 @@ Planned schedule of work: expression that does use balanced preferred content. No reason to pay its time penalty otherwise. + Alternatively, make it not use file locking. It could rely on a database + transaction, or it could check the live changes before and after and + re-run the Annex action if they are not stable. + * When loading the live update table, check if PIDs in it are still running (and are still git-annex), and if not, remove stale entries from it, which can accumulate when processes are interrupted. @@ -52,25 +56,19 @@ Planned schedule of work: But then, how to check if a PID is git-annex or not? /proc of course, but what about other OS's? Windows? - How? Possibly have a thread that - waits on an empty MVar. Thread MVar through somehow to location log - update. (Seems this would need checking preferred content to return - the MVar? Or alternatively, the MVar could be passed into it, which - seems better..) Fill MVar on location log update. If MVar gets - GCed without being filled, the thread will get an exception and can - remove from table and cache then. This does rely on GC behavior, but if - the GC takes some time, it will just cause a failed upload to take - longer to get removed from the table and cache, which will just prevent - another upload of a different key from running immediately. - (Need to check if MVar GC behavior operates like this. - See https://stackoverflow.com/questions/10871303/killing-a-thread-when-mvar-is-garbage-collected ) - Perhaps stale entries can be found in a different way. Require the live - update table to be updated with a timestamp every 5 minutes. The thread - that waits on the MVar can do that, as long as the transfer is running. 
If - interrupted, it will become stale in 5 minutes, which is probably good - enough? Could do it every minute, depending on overhead. This could - also be done by just repeatedly touching a file named with the processes's - pid in it, to avoid sqlite overhead. + A plan: Have git-annex lock a per-pid file at startup. Then before + loading the live updates table, check each other per-pid file, by + try to take a shared lock. If able to, that process is no longer running, + and its live updates should be considered stale, and can be removed + while loading the live updates table. + + Might be better to not lock at startup, but only once live updates are + used. annex.pidlock might otherwise prevent running more than one + git-annex at a time. + + , or alternatively + when checking a preferred content expression that uses balanced preferred + content. * The assistant is using NoLiveUpdate, but it should be posssible to plumb a LiveUpdate through it from preferred content checking to location log
locking in checkLiveUpdate
This makes sure that two threads don't check balanced preferred content at the
same time, so each thread always sees a consistent picture of what is
happening.
This does add a fairly expensive file level lock to every check of
preferred content, in commands that use prepareLiveUpdate. It would
be good to only do that when live updates are actually needed, eg when
the preferred content expression uses balanced preferred content.
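The serialization this commit adds can be illustrated with a minimal in-process analogue. Names here (`Live`, `checkSerialized`) are assumptions for illustration; the real code takes an exclusive lock on a file, so the serialization also holds across processes, which a plain `MVar` does not:

```haskell
import Control.Concurrent.MVar
import Control.Monad (when, void)
import Data.Maybe (isJust)

-- Stand-in for the MVars inside git-annex's LiveUpdate record.
data Live = Live { liveNeeded :: MVar (), liveStarted :: MVar () }

-- Serialized analogue of checkLiveUpdate: the whole check-and-start
-- sequence runs under one lock, so two threads cannot interleave and
-- each sees a consistent picture of the live size changes.
checkSerialized :: MVar () -> Live -> IO Bool -> IO Bool
checkSerialized dblock lu check = withMVar dblock $ \() -> do
    r <- check
    needed <- isJust <$> tryTakeMVar (liveNeeded lu)
    when (r && needed) $
        void $ tryPutMVar (liveStarted lu) ()
    return r
```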
diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index 6f27617369..17dae1dec7 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -15,8 +15,6 @@ import qualified Database.RepoSize as Db import Annex.UUID import Control.Concurrent -import qualified Data.Map.Strict as M -import qualified Data.Set as S import System.Process {- Called when a location log change is journalled, so the LiveUpdate @@ -93,15 +91,21 @@ needLiveUpdate lu = liftIO $ void $ tryPutMVar (liveUpdateNeeded lu) () -- -- This can be called more than once on the same LiveUpdate. It will -- only start it once. +-- +-- This serializes calls to the action, so that if the action +-- queries getLiveRepoSizes it will not race with another such action +-- that may also be starting a live update. checkLiveUpdate :: LiveUpdate -> Annex Bool -> Annex Bool checkLiveUpdate NoLiveUpdate a = a -checkLiveUpdate lu a = do - r <- a - needed <- liftIO $ isJust <$> tryTakeMVar (liveUpdateNeeded lu) - when (r && needed) $ do - liftIO $ void $ tryPutMVar (liveUpdateStart lu) () - liftIO $ void $ readMVar (liveUpdateReady lu) - return r +checkLiveUpdate lu a = Db.lockDbWhile (const go) go + where + go = do + r <- a + needed <- liftIO $ isJust <$> tryTakeMVar (liveUpdateNeeded lu) + when (r && needed) $ do + liftIO $ void $ tryPutMVar (liveUpdateStart lu) () + liftIO $ void $ readMVar (liveUpdateReady lu) + return r finishedLiveUpdate :: LiveUpdate -> UUID -> Key -> SizeChange -> IO () finishedLiveUpdate NoLiveUpdate _ _ _ = noop diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index 0b3e15fe98..d25ca5374b 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -23,6 +23,7 @@ module Database.RepoSize ( getRepoSizeHandle, openDb, closeDb, + lockDbWhile, getRepoSizes, setRepoSizes, startingLiveSizeChange, @@ -48,6 +49,7 @@ import Database.Persist.TH import qualified System.FilePath.ByteString as P import qualified Data.Map.Strict as M import qualified 
Data.Set as S +import Control.Exception share [mkPersist sqlSettings, mkMigrate "migrateRepoSizes"] [persistLowerCase| -- Corresponds to location log information from the git-annex branch. @@ -101,16 +103,14 @@ getRepoSizeHandle = Annex.getState Annex.reposizehandle >>= \case - can create it undisturbed. -} openDb :: Annex RepoSizeHandle -openDb = do - lck <- calcRepo' gitAnnexRepoSizeDbLock - catchPermissionDenied permerr $ withExclusiveLock lck $ do - dbdir <- calcRepo' gitAnnexRepoSizeDbDir - let db = dbdir P.</> "db" - unlessM (liftIO $ R.doesPathExist db) $ do - initDb db $ void $ - runMigrationSilent migrateRepoSizes - h <- liftIO $ H.openDb db "repo_sizes" - return $ RepoSizeHandle (Just h) +openDb = lockDbWhile permerr $ do + dbdir <- calcRepo' gitAnnexRepoSizeDbDir + let db = dbdir P.</> "db" + unlessM (liftIO $ R.doesPathExist db) $ do + initDb db $ void $ + runMigrationSilent migrateRepoSizes + h <- liftIO $ H.openDb db "repo_sizes" + return $ RepoSizeHandle (Just h) where -- If permissions don't allow opening the database, -- just don't use it. Since this database is just a cache @@ -123,6 +123,13 @@ closeDb :: RepoSizeHandle -> Annex () closeDb (RepoSizeHandle (Just h)) = liftIO $ H.closeDb h closeDb (RepoSizeHandle Nothing) = noop +-- This does not prevent another process that has already +-- opened the db from changing it at the same time. +lockDbWhile :: (IOException -> Annex a) -> Annex a -> Annex a +lockDbWhile permerr a = do + lck <- calcRepo' gitAnnexRepoSizeDbLock + catchPermissionDenied permerr $ withExclusiveLock lck a + {- Gets the sizes of repositories as of a commit to the git-annex - branch. 
-} getRepoSizes :: RepoSizeHandle -> IO (M.Map UUID RepoSize, Maybe Sha) diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index bace31f5e7..05cbf74c44 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -35,9 +35,11 @@ Planned schedule of work: May not be a bug, needs reproducing and analysis. -* Make sure that two threads don't check balanced preferred content at the - same time, so each thread always sees a consistent picture of what is - happening. Use locking as necessary. +* Test that live repo size data is correct and really works. + +* Avoid using checkLiveUpdate except when checking a preferred content + expression that does use balanced preferred content. No reason to pay + its time penalty otherwise. * When loading the live update table, check if PIDs in it are still running (and are still git-annex), and if not, remove stale entries
closing in on finishing live reposizes
Fixed successfullyFinishedLiveSizeChange to not update the rolling total
when a redundant change is in RecentChanges.
Made setRepoSizes clear RecentChanges that are no longer needed.
It might be possible to clear those earlier; this is only a convenient
point to do it.
The reason it's safe to clear RecentChanges here is that, in order for a
live update to call successfullyFinishedLiveSizeChange, a change must be
made to a location log. If a RecentChange gets cleared, and just after
that a new live update is started, making the same change, the location
log has already been changed (since the RecentChange exists), and
so when the live update succeeds, it won't call
successfullyFinishedLiveSizeChange. The reason it doesn't
clear RecentChanges when there is a redundant live update is because
I didn't want to think through whether or not all races are avoided in
that case.
The rolling total in SizeChanges is never cleared. Instead,
calcJournalledRepoSizes gets the initial value of it, and then
getLiveRepoSizes subtracts that initial value from the current value.
Since the rolling total can only be updated by updateRepoSize,
which is called with the journal locked, locking the journal in
calcJournalledRepoSizes ensures that the database does not change while
reading the journal.
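The offset arithmetic this describes is small: the cached repository size is adjusted by how far the rolling total has moved since the cache was taken. A self-contained sketch (the newtypes mirror git-annex's, but this is an illustration of the per-repository calculation in getLiveRepoSizes, not its actual module):

```haskell
newtype RepoSize = RepoSize Integer deriving (Eq, Show)
newtype SizeOffset = SizeOffset Integer deriving (Eq, Show)

-- Live size = cached size + (current rolling total - rolling total at
-- the time the cached size was calculated). The rolling total itself
-- is never reset; only the difference matters.
liveRepoSize :: RepoSize -> SizeOffset -> SizeOffset -> RepoSize
liveRepoSize (RepoSize sz) (SizeOffset startoffset) (SizeOffset offset) =
    RepoSize (sz + (offset - startoffset))
```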
diff --git a/Annex.hs b/Annex.hs index 4208e5c741..9e4d0a45c3 100644 --- a/Annex.hs +++ b/Annex.hs @@ -133,7 +133,7 @@ data AnnexRead = AnnexRead , forcenumcopies :: Maybe NumCopies , forcemincopies :: Maybe MinCopies , forcebackend :: Maybe String - , reposizes :: MVar (Maybe (M.Map UUID RepoSize)) + , reposizes :: MVar (Maybe (M.Map UUID (RepoSize, SizeOffset))) , rebalance :: Bool , useragent :: Maybe String , desktopnotify :: DesktopNotify diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index b9a6ffe95c..084c2c3efd 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -13,10 +13,10 @@ module Annex.RepoSize ( ) where import Annex.Common -import Annex.RepoSize.LiveUpdate import qualified Annex import Annex.Branch (UnmergedBranches(..), getBranch) import qualified Database.RepoSize as Db +import Annex.Journal import Logs import Logs.Location import Logs.UUID @@ -36,7 +36,10 @@ import qualified Data.Set as S - was called. It does not update while git-annex is running. -} getRepoSizes :: Bool -> Annex (M.Map UUID RepoSize) -getRepoSizes quiet = do +getRepoSizes quiet = M.map fst <$> getRepoSizes' quiet + +getRepoSizes' :: Bool -> Annex (M.Map UUID (RepoSize, SizeOffset)) +getRepoSizes' quiet = do rsv <- Annex.getRead Annex.reposizes liftIO (takeMVar rsv) >>= \case Just sizemap -> do @@ -47,22 +50,24 @@ getRepoSizes quiet = do {- Like getRepoSizes, but with live updates. -} getLiveRepoSizes :: Bool -> Annex (M.Map UUID RepoSize) getLiveRepoSizes quiet = do - h <- Db.getRepoSizeHandle - liftIO (Db.estimateLiveRepoSizes h) >>= \case - Just (m, annexbranchsha) -> return m - Nothing -> do - -- Db.estimateLiveRepoSizes needs the - -- reposizes to be calculated first. 
- m <- getRepoSizes quiet - liftIO (Db.estimateLiveRepoSizes h) >>= \case - Just (m', annexbranchsha) -> return m' - Nothing -> return m + sizemap <- getRepoSizes' quiet + go sizemap `onException` return (M.map fst sizemap) + where + go sizemap = do + h <- Db.getRepoSizeHandle + liveoffsets <- liftIO $ Db.liveRepoOffsets h + let calc u (RepoSize size, SizeOffset startoffset) = + case M.lookup u liveoffsets of + Nothing -> RepoSize size + Just (SizeOffset offset) -> RepoSize $ + size + (offset - startoffset) + return $ M.mapWithKey calc sizemap {- Fills an empty Annex.reposizes MVar with current information - from the git-annex branch, supplimented with journalled but - not yet committed information. -} -calcRepoSizes :: Bool -> MVar (Maybe (M.Map UUID RepoSize)) -> Annex (M.Map UUID RepoSize) +calcRepoSizes :: Bool -> MVar (Maybe (M.Map UUID (RepoSize, SizeOffset))) -> Annex (M.Map UUID (RepoSize, SizeOffset)) calcRepoSizes quiet rsv = go `onException` failed where go = do @@ -73,7 +78,7 @@ calcRepoSizes quiet rsv = go `onException` failed Just oldbranchsha -> do currbranchsha <- getBranch if oldbranchsha == currbranchsha - then calcJournalledRepoSizes oldsizemap oldbranchsha + then calcJournalledRepoSizes h oldsizemap oldbranchsha else incrementalupdate h oldsizemap oldbranchsha currbranchsha liftIO $ putMVar rsv (Just sizemap) return sizemap @@ -83,12 +88,12 @@ calcRepoSizes quiet rsv = go `onException` failed showSideAction "calculating repository sizes" (sizemap, branchsha) <- calcBranchRepoSizes liftIO $ Db.setRepoSizes h sizemap branchsha - calcJournalledRepoSizes sizemap branchsha + calcJournalledRepoSizes h sizemap branchsha incrementalupdate h oldsizemap oldbranchsha currbranchsha = do (sizemap, branchsha) <- diffBranchRepoSizes quiet oldsizemap oldbranchsha currbranchsha liftIO $ Db.setRepoSizes h sizemap branchsha - calcJournalledRepoSizes sizemap branchsha + calcJournalledRepoSizes h sizemap branchsha failed = do liftIO $ putMVar rsv (Just M.empty) @@ 
-120,13 +125,21 @@ calcBranchRepoSizes = do - data from journalled location logs. -} calcJournalledRepoSizes - :: M.Map UUID RepoSize + :: Db.RepoSizeHandle + -> M.Map UUID RepoSize -> Sha - -> Annex (M.Map UUID RepoSize) -calcJournalledRepoSizes startmap branchsha = - overLocationLogsJournal startmap branchsha - (\k v m -> pure (accumRepoSizes k v m)) - Nothing + -> Annex (M.Map UUID (RepoSize, SizeOffset)) +calcJournalledRepoSizes h startmap branchsha = + -- Lock the journal to prevent updates to the size offsets + -- in the repository size database while this is processing + -- the journal files. + lockJournal $ \_jl -> do + sizemap <- overLocationLogsJournal startmap branchsha + (\k v m' -> pure (accumRepoSizes k v m')) + Nothing + offsets <- liftIO $ Db.recordedRepoOffsets h + let getoffset u = fromMaybe (SizeOffset 0) $ M.lookup u offsets + return $ M.mapWithKey (\u sz -> (sz, getoffset u)) sizemap {- Incremental update by diffing. -} diffBranchRepoSizes :: Bool -> M.Map UUID RepoSize -> Sha -> Sha -> Annex (M.Map UUID RepoSize, Sha) @@ -180,3 +193,22 @@ diffBranchRepoSizes quiet oldsizemap oldbranchsha newbranchsha = do (\m u -> M.insertWith (flip const) u (RepoSize 0) m) newsizemap knownuuids + +addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize +addKeyRepoSize k mrs = case mrs of + Just (RepoSize sz) -> Just $ RepoSize $ sz + ksz + Nothing -> Just $ RepoSize ksz + where + ksz = fromMaybe 0 $ fromKey keySize k + +removeKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize +removeKeyRepoSize k mrs = case mrs of + Just (RepoSize sz) -> Just $ RepoSize $ sz - ksz + Nothing -> Nothing + where + ksz = fromMaybe 0 $ fromKey keySize k + +accumRepoSizes :: Key -> (S.Set UUID, S.Set UUID) -> M.Map UUID RepoSize -> M.Map UUID RepoSize +accumRepoSizes k (newlocs, removedlocs) sizemap = + let !sizemap' = foldl' (flip $ M.alter $ addKeyRepoSize k) sizemap newlocs + in foldl' (flip $ M.alter $ removeKeyRepoSize k) sizemap' removedlocs diff --git 
a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index 5d015b2585..6f27617369 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -31,25 +31,6 @@ updateRepoSize lu u k s = liftIO $ finishedLiveUpdate lu u k sc InfoMissing -> RemovingKey InfoDead -> RemovingKey -addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize -addKeyRepoSize k mrs = case mrs of - Just (RepoSize sz) -> Just $ RepoSize $ sz + ksz - Nothing -> Just $ RepoSize ksz - where - ksz = fromMaybe 0 $ fromKey keySize k - -removeKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize -removeKeyRepoSize k mrs = case mrs of - Just (RepoSize sz) -> Just $ RepoSize $ sz - ksz - Nothing -> Nothing - where - ksz = fromMaybe 0 $ fromKey keySize k - -accumRepoSizes :: Key -> (S.Set UUID, S.Set UUID) -> M.Map UUID RepoSize -> M.Map UUID RepoSize -accumRepoSizes k (newlocs, removedlocs) sizemap = - let !sizemap' = foldl' (flip $ M.alter $ addKeyRepoSize k) sizemap newlocs - in foldl' (flip $ M.alter $ removeKeyRepoSize k) sizemap' removedlocs - -- When the UUID is Nothing, it's a live update of the local repository. prepareLiveUpdate :: Maybe UUID -> Key -> SizeChange -> Annex LiveUpdate prepareLiveUpdate mu k sc = do diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index 1247fa81ca..0b3e15fe98 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -25,10 +25,11 @@ module Database.RepoSize ( closeDb, getRepoSizes, setRepoSizes, - estimateLiveRepoSizes, startingLiveSizeChange, successfullyFinishedLiveSizeChange, removeStaleLiveSizeChange, + recordedRepoOffsets, + liveRepoOffsets, ) where import Annex.Common @@ -164,6 +165,7 @@ setRepoSizes (RepoSizeHandle (Just h)) sizemap branchcommitsha = (Diff truncated)
Added contributions section to track my bugs and inquiries
diff --git a/doc/users/Spencer.mdwn b/doc/users/Spencer.mdwn new file mode 100644 index 0000000000..b521ab0894 --- /dev/null +++ b/doc/users/Spencer.mdwn @@ -0,0 +1,15 @@ +--- + +## Contributions + +### Contributed Pages + +[[!map pages="author(Spencer) and !internal(recentchanges/*) and !comment(*)" + show="title" + sort="title"]] + +### Comments + +[[!map pages="author(Spencer) and comment(*)" + show="title" + sort="title"]]
started work on getLiveRepoSizes
Doesn't quite compile
diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index 3507ee8207..46e6e01ef2 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -94,7 +94,7 @@ prepareLiveUpdate mu k sc = do | otherwise -> waitdone donev finishv h u cid Right Nothing -> abandoned h u cid Left _ -> abandoned h u cid - abandoned h u cid = Db.staleLiveSizeChange h u k sc cid + abandoned h u cid = Db.removeStaleLiveSizeChange h u k sc cid -- Called when a preferred content check indicates that a live update is -- needed. Can be called more than once on the same LiveUpdate. diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index 76bb1dae64..8c6c1b66bc 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -25,11 +25,10 @@ module Database.RepoSize ( closeDb, getRepoSizes, setRepoSizes, - getLiveSizeChanges, + getLiveRepoSizes, startingLiveSizeChange, successfullyFinishedLiveSizeChange, - staleLiveSizeChange, - getSizeChanges, + removeStaleLiveSizeChange, ) where import Annex.Common @@ -46,7 +45,8 @@ import qualified Utility.RawFilePath as R import Database.Persist.Sql hiding (Key) import Database.Persist.TH import qualified System.FilePath.ByteString as P -import qualified Data.Map as M +import qualified Data.Map.Strict as M +import qualified Data.Set as S share [mkPersist sqlSettings, mkMigrate "migrateRepoSizes"] [persistLowerCase| -- Corresponds to location log information from the git-annex branch. @@ -73,6 +73,13 @@ SizeChanges repo UUID rollingtotal FileSize UniqueRepoRollingTotal repo +-- The most recent size changes that were removed from LiveSizeChanges +-- upon successful completion. +RecentChanges + repo UUID + key Key + change SizeChange + UniqueRecentChange repo key |] {- Gets a handle to the database. It's cached in Annex state. 
-} @@ -115,19 +122,21 @@ closeDb :: RepoSizeHandle -> Annex () closeDb (RepoSizeHandle (Just h)) = liftIO $ H.closeDb h closeDb (RepoSizeHandle Nothing) = noop +{- Gets the sizes of repositories as of a commit to the git-annex + - branch. -} getRepoSizes :: RepoSizeHandle -> IO (M.Map UUID RepoSize, Maybe Sha) getRepoSizes (RepoSizeHandle (Just h)) = H.queryDb h $ do - sizemap <- M.fromList . map conv <$> getRepoSizes' + sizemap <- M.fromList <$> getRepoSizes' annexbranchsha <- getAnnexBranchCommit return (sizemap, annexbranchsha) +getRepoSizes (RepoSizeHandle Nothing) = return (mempty, Nothing) + +getRepoSizes' :: SqlPersistM [(UUID, RepoSize)] +getRepoSizes' = map conv <$> selectList [] [] where conv entity = let RepoSizes u sz = entityVal entity in (u, RepoSize sz) -getRepoSizes (RepoSizeHandle Nothing) = return (mempty, Nothing) - -getRepoSizes' :: SqlPersistM [Entity RepoSizes] -getRepoSizes' = selectList [] [] getAnnexBranchCommit :: SqlPersistM (Maybe Sha) getAnnexBranchCommit = do @@ -149,8 +158,8 @@ getAnnexBranchCommit = do setRepoSizes :: RepoSizeHandle -> M.Map UUID RepoSize -> Sha -> IO () setRepoSizes (RepoSizeHandle (Just h)) sizemap branchcommitsha = H.commitDb h $ do - l <- getRepoSizes' - forM_ (map entityVal l) $ \(RepoSizes u _) -> + l <- getRepoSizes' + forM_ (map fst l) $ \u -> unless (M.member u sizemap) $ unsetRepoSize u forM_ (M.toList sizemap) $ @@ -173,8 +182,6 @@ recordAnnexBranchCommit branchcommitsha = do deleteWhere ([] :: [Filter AnnexBranch]) void $ insertUniqueFast $ AnnexBranch $ toSSha branchcommitsha -{- If there is already a size change for the same UUID, Key, - - and SizeChangeId, it is overwritten with the new size change. 
-} startingLiveSizeChange :: RepoSizeHandle -> UUID -> Key -> SizeChange -> SizeChangeId -> IO () startingLiveSizeChange (RepoSizeHandle (Just h)) u k sc sid = H.commitDb h $ void $ upsertBy @@ -188,10 +195,11 @@ startingLiveSizeChange (RepoSizeHandle Nothing) _ _ _ _ = noop successfullyFinishedLiveSizeChange :: RepoSizeHandle -> UUID -> Key -> SizeChange -> SizeChangeId -> IO () successfullyFinishedLiveSizeChange (RepoSizeHandle (Just h)) u k sc sid = H.commitDb h $ do - -- Update the rolling total and remove the live change in the - -- same transaction. + -- Update the rolling total, add as a recent change, + -- and remove the live change in the same transaction. rollingtotal <- getSizeChangeFor u - setSizeChangeFor u (updaterollingtotal rollingtotal) + setSizeChangeFor u (updateRollingTotal rollingtotal sc k) + addRecentChange u k sc removeLiveSizeChange u k sc sid where updaterollingtotal t = case sc of @@ -200,10 +208,17 @@ successfullyFinishedLiveSizeChange (RepoSizeHandle (Just h)) u k sc sid = ksz = fromMaybe 0 $ fromKey keySize k successfullyFinishedLiveSizeChange (RepoSizeHandle Nothing) _ _ _ _ = noop -staleLiveSizeChange :: RepoSizeHandle -> UUID -> Key -> SizeChange -> SizeChangeId -> IO () -staleLiveSizeChange (RepoSizeHandle (Just h)) u k sc sid = +updateRollingTotal :: FileSize -> SizeChange -> Key -> FileSize +updateRollingTotal t sc k = case sc of + AddingKey -> t + ksz + RemovingKey -> t - ksz + where + ksz = fromMaybe 0 $ fromKey keySize k + +removeStaleLiveSizeChange :: RepoSizeHandle -> UUID -> Key -> SizeChange -> SizeChangeId -> IO () +removeStaleLiveSizeChange (RepoSizeHandle (Just h)) u k sc sid = H.commitDb h $ removeLiveSizeChange u k sc sid -staleLiveSizeChange (RepoSizeHandle Nothing) _ _ _ _ = noop +removeStaleLiveSizeChange (RepoSizeHandle Nothing) _ _ _ _ = noop removeLiveSizeChange :: UUID -> Key -> SizeChange -> SizeChangeId -> SqlPersistM () removeLiveSizeChange u k sc sid = @@ -214,25 +229,15 @@ removeLiveSizeChange u k sc sid = 
, LiveSizeChangesChange ==. sc ] -getLiveSizeChanges :: RepoSizeHandle -> IO (M.Map UUID (Key, SizeChange, SizeChangeId)) -getLiveSizeChanges (RepoSizeHandle (Just h)) = H.queryDb h $ do - m <- M.fromList . map conv <$> getLiveSizeChanges' - return m +getLiveSizeChanges :: SqlPersistM (M.Map UUID [(Key, (SizeChange, SizeChangeId))]) +getLiveSizeChanges = M.fromListWith (++) . map conv <$> selectList [] [] where conv entity = let LiveSizeChanges u k sid sc = entityVal entity - in (u, (k, sc, sid)) -getLiveSizeChanges (RepoSizeHandle Nothing) = return mempty - -getLiveSizeChanges' :: SqlPersistM [Entity LiveSizeChanges] -getLiveSizeChanges' = selectList [] [] + in (u, [(k, (sc, sid))]) -getSizeChanges :: RepoSizeHandle -> IO (M.Map UUID FileSize) -getSizeChanges (RepoSizeHandle (Just h)) = H.queryDb h getSizeChanges' -getSizeChanges (RepoSizeHandle Nothing) = return mempty - -getSizeChanges' :: SqlPersistM (M.Map UUID FileSize) -getSizeChanges' = M.fromList . map conv <$> selectList [] [] +getSizeChanges :: SqlPersistM (M.Map UUID FileSize) +getSizeChanges = M.fromList . map conv <$> selectList [] [] where conv entity = let SizeChanges u n = entityVal entity @@ -251,3 +256,96 @@ setSizeChangeFor u sz = (UniqueRepoRollingTotal u) (SizeChanges u sz) [SizeChangesRollingtotal =. sz] + +addRecentChange :: UUID -> Key -> SizeChange -> SqlPersistM () +addRecentChange u k sc = + void $ upsertBy + (UniqueRecentChange u k) + (RecentChanges u k sc) + [RecentChangesChange =. sc] + +getRecentChange :: UUID -> Key -> SqlPersistM (Maybe SizeChange) +getRecentChange u k = do + l <- selectList + [ RecentChangesRepo ==. u + , RecentChangesKey ==. k + ] [] + return $ case l of + (s:_) -> Just $ recentChangesChange $ entityVal s + [] -> Nothing + +{- Gets the sizes of Repos as of a commit to the git-annex branch + - (which is not necessarily the current commit), adjusted with all + - live changes that have happened since then or are happening now. 
+ - + - This does not necessarily include all changes that have been journalled, + - only ones that had startingLiveSizeChange called for them will be + - included. Also live changes or recent changes that were to a UUID not in + - the RepoSizes map are not included. + - (Diff truncated)
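The `updateRollingTotal` helper factored out in the diff above is a small pure function; a self-contained sketch of its behavior (with simplified stand-ins for git-annex's `Key` and `FileSize` types, which are assumptions here — the real code reads the size with `fromKey keySize`) looks like this:

```haskell
import Data.Maybe (fromMaybe)

-- Simplified stand-ins for git-annex's types.
type FileSize = Integer

data SizeChange = AddingKey | RemovingKey
	deriving (Show, Eq)

newtype Key = Key { keySize :: Maybe FileSize }

-- Mirrors the updateRollingTotal in the diff above: adjust the
-- repository's rolling size total by the key's size, treating a key
-- of unknown size as 0.
updateRollingTotal :: FileSize -> SizeChange -> Key -> FileSize
updateRollingTotal t sc k = case sc of
	AddingKey -> t + ksz
	RemovingKey -> t - ksz
  where
	ksz = fromMaybe 0 (keySize k)

main :: IO ()
main = do
	let k = Key (Just 1000)
	print (updateRollingTotal 0 AddingKey k)      -- 1000
	print (updateRollingTotal 1000 RemovingKey k) -- 0
	print (updateRollingTotal 5 AddingKey (Key Nothing)) -- 5 (size unknown)
```

Treating an unknown size as 0 means keys without recorded sizes (e.g. some URL keys) never perturb the total.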
partially fix concurrency issue in updating the rollingtotal
It's possible for two processes or threads to both be doing the same
operation at the same time. Eg, both dropping the same key. If one
finishes and updates the rollingtotal, then the other one needs to be
prevented from later updating the rollingtotal as well. And they could
finish at the same time, or with some time in between.
Addressed this by making updateRepoSize be called with the journal
locked, and only once it's been determined that there is an actual
location change to record in the log. updateRepoSize waits for the
database to be updated.
When there is a redundant operation, updateRepoSize won't be called,
and the redundant LiveUpdate will be removed from the database on
garbage collection.
But: There will be a window where the redundant LiveUpdate is still
visible in the db, and processes can see it, combine it with the
rollingtotal, and arrive at the wrong size. This is a small window, but
it still ought to be addressed. It's unclear whether it would always be safe
to remove the redundant LiveUpdate. Consider the case where two drops and a
get are all running concurrently somehow, and the order they finish is
[drop, get, drop]. The second drop seems redundant to the first, but
it would not be safe to remove it. While this seems unlikely, it's hard
to rule out that a get and drop at different stages can both be running
at the same time.
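The invariant described above — the rolling total is only adjusted when the location log actually changes — can be sketched as a pure function. The names and the simplified state type below are hypothetical, not git-annex's actual API; the set of present keys stands in for the location log and the total for the sizechanges table:

```haskell
import qualified Data.Set as S

type Key = String
type FileSize = Integer

data SizeChange = AddingKey | RemovingKey
	deriving (Show, Eq)

-- Hypothetical simplified state for one repository.
data RepoState = RepoState
	{ present :: S.Set Key
	, rollingTotal :: FileSize
	}
	deriving (Show, Eq)

-- Apply a size change only when presence actually flips, so a
-- redundant operation (e.g. the second of two concurrent drops to
-- finish) leaves the rolling total untouched.
applyChange :: FileSize -> Key -> SizeChange -> RepoState -> RepoState
applyChange ksz k sc st = case sc of
	AddingKey
		| k `S.member` present st -> st  -- redundant get
		| otherwise -> RepoState
			(S.insert k (present st))
			(rollingTotal st + ksz)
	RemovingKey
		| k `S.notMember` present st -> st  -- redundant drop
		| otherwise -> RepoState
			(S.delete k (present st))
			(rollingTotal st - ksz)
```

With the [drop, get, drop] ordering from the message, the second drop is not redundant (a get landed in between), and the total correctly ends at 0; two back-to-back drops, by contrast, decrement only once.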
diff --git a/Annex/Branch.hs b/Annex/Branch.hs index 8afe6f9912..945acf724b 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -412,9 +412,11 @@ change ru file f = lockJournal $ \jl -> f <$> getToChange ru file >>= set jl ru {- Applies a function which can modify the content of a file, or not. - - - Returns True when the file was modified. -} -maybeChange :: Journalable content => RegardingUUID -> RawFilePath -> (L.ByteString -> Maybe content) -> Annex Bool -maybeChange ru file f = lockJournal $ \jl -> do + - When the file was modified, runs the onchange action, and returns + - True. The action is run while the journal is still locked, + - so another concurrent call to this cannot happen while it is running. -} +maybeChange :: Journalable content => RegardingUUID -> RawFilePath -> (L.ByteString -> Maybe content) -> Annex () -> Annex Bool +maybeChange ru file f onchange = lockJournal $ \jl -> do v <- getToChange ru file case f v of Just jv -> @@ -422,6 +424,7 @@ maybeChange ru file f = lockJournal $ \jl -> do in if v /= b then do set jl ru file b + onchange return True else return False _ -> return False diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index a8cca6f97a..3507ee8207 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -10,7 +10,6 @@ module Annex.RepoSize.LiveUpdate where import Annex.Common -import qualified Annex import Logs.Presence.Pure import qualified Database.RepoSize as Db import Annex.UUID @@ -20,22 +19,17 @@ import qualified Data.Map.Strict as M import qualified Data.Set as S import System.Process +{- Called when a location log change is journalled, so the LiveUpdate + - is done. This is called with the journal still locked, so no concurrent + - changes can happen while it's running. Waits for the database + - to be updated. 
-} updateRepoSize :: LiveUpdate -> UUID -> Key -> LogStatus -> Annex () -updateRepoSize lu u k s = do - liftIO $ finishedLiveUpdate lu u k sc - rsv <- Annex.getRead Annex.reposizes - liftIO (takeMVar rsv) >>= \case - Nothing -> liftIO (putMVar rsv Nothing) - Just sizemap -> do - let !sizemap' = M.adjust - (fromMaybe (RepoSize 0) . f k . Just) - u sizemap - liftIO $ putMVar rsv (Just sizemap') +updateRepoSize lu u k s = liftIO $ finishedLiveUpdate lu u k sc where - (sc, f) = case s of - InfoPresent -> (AddingKey, addKeyRepoSize) - InfoMissing -> (RemovingKey, removeKeyRepoSize) - InfoDead -> (RemovingKey, removeKeyRepoSize) + sc = case s of + InfoPresent -> AddingKey + InfoMissing -> RemovingKey + InfoDead -> RemovingKey addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize addKeyRepoSize k mrs = case mrs of @@ -88,11 +82,8 @@ prepareLiveUpdate mu k sc = do {- Wait for finishedLiveUpdate to be called, or for the LiveUpdate - to get garbage collected in the case where the change didn't - - actually happen. -} + - actually happen. Updates the database. -} waitdone donev finishv h u cid = tryNonAsync (takeMVar donev) >>= \case - -- TODO need to update local state too, and it must be done - -- with locking around the state update and this database - -- update. Right (Just (u', k', sc')) | u' == u && k' == k && sc' == sc -> do Db.successfullyFinishedLiveSizeChange h u k sc cid diff --git a/Logs/ContentIdentifier.hs b/Logs/ContentIdentifier.hs index bf8fef5b2e..d370374839 100644 --- a/Logs/ContentIdentifier.hs +++ b/Logs/ContentIdentifier.hs @@ -36,6 +36,7 @@ recordContentIdentifier (RemoteStateHandle u) cid k = do (Annex.Branch.RegardingUUID [u]) (remoteContentIdentifierLogFile config k) (addcid c . 
parseLog) + noop where addcid c v | cid `elem` l = Nothing -- no change needed diff --git a/Logs/Location.hs b/Logs/Location.hs index 7e7b0e3ad9..608020899a 100644 --- a/Logs/Location.hs +++ b/Logs/Location.hs @@ -84,13 +84,12 @@ logChange lu key u@(UUID _) s | isClusterUUID u = noop | otherwise = do config <- Annex.getGitConfig - changed <- maybeAddLog + void $ maybeAddLog (Annex.Branch.RegardingUUID [u]) (locationLogFile config key) s (LogInfo (fromUUID u)) - when changed $ - updateRepoSize lu u key s + (updateRepoSize lu u key s) logChange _ _ NoUUID _ = noop {- Returns a list of repository UUIDs that, according to the log, have diff --git a/Logs/Presence.hs b/Logs/Presence.hs index 6763e4676a..810ce6462d 100644 --- a/Logs/Presence.hs +++ b/Logs/Presence.hs @@ -50,17 +50,19 @@ addLog' ru file logstatus loginfo c = - older timestamp, that LogLine is preserved, rather than updating the log - with a newer timestamp. - - - Returns True when the log was changed. + - When the log was changed, the onchange action is run (with the journal + - still locked to prevent any concurrent changes) and True is returned. 
-} -maybeAddLog :: Annex.Branch.RegardingUUID -> RawFilePath -> LogStatus -> LogInfo -> Annex Bool -maybeAddLog ru file logstatus loginfo = do +maybeAddLog :: Annex.Branch.RegardingUUID -> RawFilePath -> LogStatus -> LogInfo -> Annex () -> Annex Bool +maybeAddLog ru file logstatus loginfo onchange = do c <- currentVectorClock - Annex.Branch.maybeChange ru file $ \b -> + let f = \b -> let old = parseLog b line = genLine logstatus loginfo c old in do m <- insertNewStatus line $ logMap old return $ buildLog $ mapLog m + Annex.Branch.maybeChange ru file f onchange genLine :: LogStatus -> LogInfo -> CandidateVectorClock -> [LogLine] -> LogLine genLine logstatus loginfo c old = LogLine c' logstatus loginfo diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index c2e4af6b8a..4c0b74d97b 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -77,10 +77,16 @@ Planned schedule of work: Listing process ID, thread ID, UUID, key, addition or removal (done) + Add to reposizes db a table for sizechanges. This has for each UUID + a rolling total which is the total size changes that have accumulated + since the last update of the reposizes table. + So adding the reposizes table to sizechanges gives the current + size. + Make checking the balanced preferred content limit record a live update in the table (done) - ... and use other live updates in making its decision + ... and use other live updates and sizechanges in making its decision Note: This will only work when preferred content is being checked. If a git-annex copy without --auto is run, for example, it won't @@ -92,33 +98,19 @@ Planned schedule of work: same time, so each thread always sees a consistent picture of what is happening. Use locking as necessary. - In the unlikely event that one thread of a process is storing a key and - another thread is dropping the same key from the same uuid, at the same - time, reconcile somehow. How? 
Or is this perhaps something that cannot - happen? Could just record the liveupdate for one, and not for the - other. - - Also keep an in-memory cache of the live updates being performed by - the current process. For use in location log update as follows.. + When updating location log for a key, when there is actually a change, + update the db, remove the live update (done) and update the sizechanges + table in the same transaction. - Make updating location log for a key that is in the in-memory cache - of the live update table update the db, removing it from that table, - and updating the in-memory reposizes. (done) - - Make updading location log have locking to make sure redundant - information is never visible: - Take lock, journal update, remove from live update table. + Two concurrent processes might both start the same action, eg dropping + a key, and both succeed, and so both update the location log. One needs + to update the log and the sizechanges table. The other needs to see + that it has no actual change to report, and so avoid updating the + location log (already the case) and avoid updating the sizechanges + table. (done) Detect when an upload (or drop) fails, and remove from the live - update table and in-memory cache. (done) - - Have a counter in the reposizes table that is updated on write. This (Diff truncated)
todo
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 8b6584a7ee..c2e4af6b8a 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -153,6 +153,24 @@ Planned schedule of work: * Still implementing LiveUpdate. Check for TODO XXX markers +* Could two processes both doing the same operation end up both + calling successfullyFinishedLiveSizeChange with the same repo uuid and + key? If so, the rolling total would get out of wack. + + Logs.Location.logChange only calls updateRepoSize when the presence + actually changed. So if one process does something and then the other + process also does the same thing (eg both drop), the second process + will see what the first process recorded, and won't update the size + redundantly. + + But: What if they're running at the same time? It seems + likely that Annex.Branch.maybeChange does not handle that in a way + that will guarantee this doesn't happen. Does anything else guarantee + it? + + Can additional locking be added to avoid it? Probably, but it + will add overhead and so should be avoided in the NoLiveUpdate case. + * In the case where a copy to a remote fails (due eg to annex.diskreserve), the LiveUpdate thread can not get a chance to catch its exception when the LiveUpdate is gced, before git-annex exits. In this case, the
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index e147aa462c..8b6584a7ee 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -78,8 +78,9 @@ Planned schedule of work: (done) Make checking the balanced preferred content limit record a - live update in the table and use other live updates in making its - decision. (done) + live update in the table (done) + + ... and use other live updates in making its decision Note: This will only work when preferred content is being checked. If a git-annex copy without --auto is run, for example, it won't
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 6893faa532..e147aa462c 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -79,7 +79,7 @@ Planned schedule of work: Make checking the balanced preferred content limit record a live update in the table and use other live updates in making its - decision. With locking as necessary. + decision. (done) Note: This will only work when preferred content is being checked. If a git-annex copy without --auto is run, for example, it won't @@ -87,6 +87,10 @@ Planned schedule of work: That seems ok though, because if the user is running a command like that, they are ok with a remote filling up. + Make sure that two threads don't check balanced preferred content at the + same time, so each thread always sees a consistent picture of what is + happening. Use locking as necessary. + In the unlikely event that one thread of a process is storing a key and another thread is dropping the same key from the same uuid, at the same time, reconcile somehow. How? Or is this perhaps something that cannot @@ -98,23 +102,14 @@ Planned schedule of work: Make updating location log for a key that is in the in-memory cache of the live update table update the db, removing it from that table, - and updating the in-memory reposizes. This needs to have - locking to make sure redundant information is never visible: + and updating the in-memory reposizes. (done) + + Make updading location log have locking to make sure redundant + information is never visible: Take lock, journal update, remove from live update table. - Somehow detect when an upload (or drop) fails, and remove from the live - update table and in-memory cache. How? Possibly have a thread that - waits on an empty MVar. Thread MVar through somehow to location log - update. (Seems this would need checking preferred content to return - the MVar? Or alternatively, the MVar could be passed into it, which - seems better..) 
Fill MVar on location log update. If MVar gets - GCed without being filled, the thread will get an exception and can - remove from table and cache then. This does rely on GC behavior, but if - the GC takes some time, it will just cause a failed upload to take - longer to get removed from the table and cache, which will just prevent - another upload of a different key from running immediately. - (Need to check if MVar GC behavior operates like this. - See https://stackoverflow.com/questions/10871303/killing-a-thread-when-mvar-is-garbage-collected ) + Detect when an upload (or drop) fails, and remove from the live + update table and in-memory cache. (done) Have a counter in the reposizes table that is updated on write. This can be used to quickly determine if it has changed. On every check of @@ -135,6 +130,18 @@ Planned schedule of work: But then, how to check if a PID is git-annex or not? /proc of course, but what about other OS's? Windows? + How? Possibly have a thread that + waits on an empty MVar. Thread MVar through somehow to location log + update. (Seems this would need checking preferred content to return + the MVar? Or alternatively, the MVar could be passed into it, which + seems better..) Fill MVar on location log update. If MVar gets + GCed without being filled, the thread will get an exception and can + remove from table and cache then. This does rely on GC behavior, but if + the GC takes some time, it will just cause a failed upload to take + longer to get removed from the table and cache, which will just prevent + another upload of a different key from running immediately. + (Need to check if MVar GC behavior operates like this. + See https://stackoverflow.com/questions/10871303/killing-a-thread-when-mvar-is-garbage-collected ) Perhaps stale entries can be found in a different way. Require the live update table to be updated with a timestamp every 5 minutes. The thread that waits on the MVar can do that, as long as the transfer is running. If
improve live update starting
In an expression like "balanced=foo and exclude=bar", avoid starting
a live update when the overall expression doesn't match.
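The gating this commit introduces can be sketched with the needed/start MVars from the diff's `LiveUpdate` record (the `ready`, `done`, and `finish` MVars of the real record are omitted in this simplified sketch):

```haskell
import Control.Concurrent.MVar
import Control.Monad (void, when)
import Data.Maybe (isJust)

-- Simplified LiveUpdate: just two of the record's MVars.
data LiveUpdate = LiveUpdate
	{ liveUpdateNeeded :: MVar ()
	, liveUpdateStart :: MVar ()
	}

-- A balanced= term calls this when it matches.
needLiveUpdate :: LiveUpdate -> IO ()
needLiveUpdate lu = void $ tryPutMVar (liveUpdateNeeded lu) ()

-- Run the whole preferred content check; start the live update only
-- if some term requested it AND the overall expression matched.
checkLiveUpdate :: LiveUpdate -> IO Bool -> IO Bool
checkLiveUpdate lu check = do
	r <- check
	needed <- isJust <$> tryTakeMVar (liveUpdateNeeded lu)
	when (r && needed) $
		void $ tryPutMVar (liveUpdateStart lu) ()
	return r

main :: IO ()
main = do
	lu <- LiveUpdate <$> newEmptyMVar <*> newEmptyMVar
	-- "balanced=foo and exclude=bar": balanced matches, overall False
	_ <- checkLiveUpdate lu (needLiveUpdate lu >> return False)
	started <- isJust <$> tryTakeMVar (liveUpdateStart lu)
	print started  -- False: no live update was started
```

Because `needed` is only consulted after the full expression has been evaluated, a matching `balanced=` term inside a non-matching expression no longer triggers a live update.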
diff --git a/Annex/FileMatcher.hs b/Annex/FileMatcher.hs index fc84a9c02b..3c2840c73d 100644 --- a/Annex/FileMatcher.hs +++ b/Annex/FileMatcher.hs @@ -42,6 +42,7 @@ import Git.FilePath import Types.Remote (RemoteConfig) import Types.ProposedAccepted import Annex.CheckAttr +import Annex.RepoSize.LiveUpdate import qualified Git.Config #ifdef WITH_MAGICMIME import Annex.Magic @@ -88,13 +89,16 @@ checkMatcher matcher mkey afile lu notpresent notconfigured d go mi = checkMatcher' matcher mi lu notpresent checkMatcher' :: FileMatcher Annex -> MatchInfo -> LiveUpdate -> AssumeNotPresent -> Annex Bool -checkMatcher' (matcher, (MatcherDesc matcherdesc)) mi lu notpresent = do - (matches, desc) <- runWriterT $ matchMrun' matcher $ \op -> - matchAction op lu notpresent mi - explain (mkActionItem mi) $ UnquotedString <$> - describeMatchResult matchDesc desc - ((if matches then "matches " else "does not match ") ++ matcherdesc ++ ": ") - return matches +checkMatcher' (matcher, (MatcherDesc matcherdesc)) mi lu notpresent = + checkLiveUpdate lu go + where + go = do + (matches, desc) <- runWriterT $ matchMrun' matcher $ \op -> + matchAction op lu notpresent mi + explain (mkActionItem mi) $ UnquotedString <$> + describeMatchResult matchDesc desc + ((if matches then "matches " else "does not match ") ++ matcherdesc ++ ": ") + return matches fileMatchInfo :: RawFilePath -> Maybe Key -> Annex MatchInfo fileMatchInfo file mkey = do diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index 3a05796bb5..8bd92921db 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -63,30 +63,36 @@ prepareLiveUpdate :: Maybe UUID -> Key -> SizeChange -> Annex LiveUpdate prepareLiveUpdate mu k sc = do h <- Db.getRepoSizeHandle u <- maybe getUUID pure mu + needv <- liftIO newEmptyMVar startv <- liftIO newEmptyMVar + readyv <- liftIO newEmptyMVar donev <- liftIO newEmptyMVar finishv <- liftIO newEmptyMVar - void $ liftIO $ forkIO $ waitstart startv donev 
finishv h u - return (LiveUpdate startv donev finishv) + void $ liftIO $ forkIO $ waitstart startv readyv donev finishv h u + return (LiveUpdate needv startv readyv donev finishv) where - {- Wait for startLiveUpdate, or for the LiveUpdate to get garbage - - collected in the case where it is never going to start. -} - waitstart startv donev finishv h u = tryNonAsync (takeMVar startv) >>= \case - Right _ -> do - {- Deferring updating the database until here - - avoids overhead except in cases where preferred - - content expressions need live updates. -} - Db.startingLiveSizeChange h u k sc - waitdone donev finishv h u - Left _ -> noop + {- Wait for checkLiveUpdate to request a start, or for the + - LiveUpdate to get garbage collected in the case where + - it is not needed. -} + waitstart startv readyv donev finishv h u = + tryNonAsync (takeMVar startv) >>= \case + Right () -> do + {- Deferring updating the database until + - here avoids overhead except in cases + - where preferred content expressions + - need live updates. -} + Db.startingLiveSizeChange h u k sc + putMVar readyv () + waitdone donev finishv h u + Left _ -> noop - {- Wait for finishedLiveUpdate to be called, or for the LiveUpdate to - - get garbage collected in the case where the change didn't + {- Wait for finishedLiveUpdate to be called, or for the LiveUpdate + - to get garbage collected in the case where the change didn't - actually happen. -} waitdone donev finishv h u = tryNonAsync (takeMVar donev) >>= \case -- TODO need to update RepoSize db -- in same transaction as Db.finishedLiveSizeChange - Right (u', k', sc') + Right (Just (u', k', sc')) | u' == u && k' == k && sc' == sc -> do done h u putMVar finishv () @@ -94,19 +100,37 @@ prepareLiveUpdate mu k sc = do -- causes fanout and so this is called with -- other UUIDs. 
| otherwise -> waitdone donev finishv h u + Right Nothing -> done h u Left _ -> done h u done h u = Db.finishedLiveSizeChange h u k sc -- Called when a preferred content check indicates that a live update is --- needed. Can be called more than once. -startLiveUpdate :: LiveUpdate -> Annex () -startLiveUpdate (LiveUpdate startv _donev _finishv) = - liftIO $ void $ tryPutMVar startv () -startLiveUpdate NoLiveUpdate = noop +-- needed. Can be called more than once on the same LiveUpdate. +needLiveUpdate :: LiveUpdate -> Annex () +needLiveUpdate NoLiveUpdate = noop +needLiveUpdate lu = liftIO $ void $ tryPutMVar (liveUpdateNeeded lu) () + +-- needLiveUpdate has to be called inside this to take effect. If the +-- action calls needLiveUpdate and then returns True, the live update is +-- started. If the action calls needLiveUpdate and then returns False, +-- the live update is not started. +-- +-- This can be called more than once on the same LiveUpdate. It will +-- only start it once. +checkLiveUpdate :: LiveUpdate -> Annex Bool -> Annex Bool +checkLiveUpdate NoLiveUpdate a = a +checkLiveUpdate lu a = do + r <- a + needed <- liftIO $ isJust <$> tryTakeMVar (liveUpdateNeeded lu) + when (r && needed) $ do + liftIO $ void $ tryPutMVar (liveUpdateStart lu) () + liftIO $ void $ readMVar (liveUpdateReady lu) + return r finishedLiveUpdate :: LiveUpdate -> UUID -> Key -> SizeChange -> IO () -finishedLiveUpdate (LiveUpdate _startv donev finishv) u k sc = do - tryNonAsync (putMVar donev (u, k, sc)) >>= \case - Right () -> void $ tryNonAsync $ readMVar finishv - Left _ -> noop finishedLiveUpdate NoLiveUpdate _ _ _ = noop +finishedLiveUpdate lu u k sc = do + tryNonAsync (putMVar (liveUpdateDone lu) (Just (u, k, sc))) >>= \case + Right () -> void $ + tryNonAsync $ readMVar $ liveUpdateFinish lu + Left _ -> noop diff --git a/Limit.hs b/Limit.hs index 3f8d480881..f05a3856db 100644 --- a/Limit.hs +++ b/Limit.hs @@ -672,7 +672,7 @@ limitFullyBalanced''' filtercandidates termname mu 
getgroupmap g n want = Right u `elem` picker candidates key n _ -> False when wanted $ - startLiveUpdate lu + needLiveUpdate lu return wanted , matchNeedsFileName = False , matchNeedsFileContent = False diff --git a/Types/RepoSize.hs b/Types/RepoSize.hs index 78f5d06ea7..f09aeff27d 100644 --- a/Types/RepoSize.hs +++ b/Types/RepoSize.hs @@ -31,7 +31,13 @@ newtype MaxSize = MaxSize { fromMaxSize :: Integer } -- the changes to its size into account. If NoLiveUpdate is used, it -- prevents that. data LiveUpdate - = LiveUpdate (MVar ()) (MVar (UUID, Key, SizeChange)) (MVar ()) + = LiveUpdate + { liveUpdateNeeded :: MVar () + , liveUpdateStart :: MVar () + , liveUpdateReady :: MVar () + , liveUpdateDone :: MVar (Maybe (UUID, Key, SizeChange)) + , liveUpdateFinish :: MVar () + } | NoLiveUpdate data SizeChange = AddingKey | RemovingKey diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 9ed68b1e83..6893faa532 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -145,20 +145,6 @@ Planned schedule of work: * Still implementing LiveUpdate. Check for TODO XXX markers -* In an expression like "balanced=foo and exclude=bar", - it will start a live update even if the overall expression doesn't - match. That is suboptimal, but also this will probably be a rare case, - it doesn't really make sense to to that. What will happen in that case - is the repo will temporarily be treated as having that key going - into it, even when it is not. As soon as the LiveUpdate gets GCed, - that resolves. Until then, other keys may not match that usually would, - if the repo would have been filled up by that key. - - What could be done in this case is, after checking preferred content, - when it's not preferred content, call stopLiveUpdate immediately, - rather than relying on GC. - That would also help with the next problem... 
- * In the case where a copy to a remote fails (due eg to annex.diskreserve), the LiveUpdate thread can not get a chance to catch its exception when the LiveUpdate is gced, before git-annex exits. In this case, the
todo
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 6893faa532..9ed68b1e83 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -145,6 +145,20 @@ Planned schedule of work: * Still implementing LiveUpdate. Check for TODO XXX markers +* In an expression like "balanced=foo and exclude=bar", + it will start a live update even if the overall expression doesn't + match. That is suboptimal, but also this will probably be a rare case, + it doesn't really make sense to to that. What will happen in that case + is the repo will temporarily be treated as having that key going + into it, even when it is not. As soon as the LiveUpdate gets GCed, + that resolves. Until then, other keys may not match that usually would, + if the repo would have been filled up by that key. + + What could be done in this case is, after checking preferred content, + when it's not preferred content, call stopLiveUpdate immediately, + rather than relying on GC. + That would also help with the next problem... + * In the case where a copy to a remote fails (due eg to annex.diskreserve), the LiveUpdate thread can not get a chance to catch its exception when the LiveUpdate is gced, before git-annex exits. In this case, the
LiveUpdate db updates working
I've tested the behavior of the thread that waits for the LiveUpdate to
be finished, and it does get signaled and exit cleanly when the
LiveUpdate is GCed instead.
Made finishedLiveUpdate wait for the thread to finish updating the
database.
There is a case where GC doesn't happen in time and the database is left
with a live update recorded in it. This should not be a problem as such
stale data can also happen when interrupted and will need to be detected
when loading the database.
Balanced preferred content expressions now call startLiveUpdate.
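The GC-based signaling described above can be sketched standalone. This is illustrative only (not git-annex's actual code) and relies on GHC's runtime deadlock detection: a thread blocked on an MVar that is reachable only from blocked threads receives BlockedIndefinitelyOnMVar at the next major GC, which is what lets the waiter exit cleanly when the LiveUpdate is GCed without being finished.

```haskell
import Control.Concurrent
import Control.Exception
import System.Mem (performGC)
import System.Timeout (timeout)

main :: IO ()
main = do
    result <- newEmptyMVar
    _ <- forkIO $ do
        -- Stands in for the "done" MVar of a LiveUpdate that the
        -- main program dropped without ever finishing it.
        donev <- newEmptyMVar
        r <- try (takeMVar donev)
        putMVar result $ case r of
            Left BlockedIndefinitelyOnMVar ->
                "signaled by GC; cleaning up stale live update"
            Right () -> "finished normally"
    threadDelay 100000  -- let the waiter block first
    performGC           -- deadlock detection runs at GC time
    mr <- timeout 2000000 (takeMVar result)
    putStrLn (maybe "no GC signal within 2s (runtime-dependent)" id mr)
```

Whether and when the exception arrives depends on the GC, which is why the real code treats a late signal as harmless: the stale database entry is detected when the database is next loaded.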
diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index 0b519538ba..3a05796bb5 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -21,7 +21,10 @@ import qualified Data.Set as S updateRepoSize :: LiveUpdate -> UUID -> Key -> LogStatus -> Annex () updateRepoSize lu u k s = do - -- XXX call finishedLiveUpdate + -- TODO update reposizes db + -- FIXME locking so the liveupdate is remove in the same + -- transaction that updates reposizes and the db too. + liftIO $ finishedLiveUpdate lu u k sc rsv <- Annex.getRead Annex.reposizes liftIO (takeMVar rsv) >>= \case Nothing -> liftIO (putMVar rsv Nothing) @@ -31,10 +34,10 @@ updateRepoSize lu u k s = do u sizemap liftIO $ putMVar rsv (Just sizemap') where - f = case s of - InfoPresent -> addKeyRepoSize - InfoMissing -> removeKeyRepoSize - InfoDead -> removeKeyRepoSize + (sc, f) = case s of + InfoPresent -> (AddingKey, addKeyRepoSize) + InfoMissing -> (RemovingKey, removeKeyRepoSize) + InfoDead -> (RemovingKey, removeKeyRepoSize) addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize addKeyRepoSize k mrs = case mrs of @@ -62,40 +65,48 @@ prepareLiveUpdate mu k sc = do u <- maybe getUUID pure mu startv <- liftIO newEmptyMVar donev <- liftIO newEmptyMVar - void $ liftIO $ forkIO $ waitstart startv donev h u - return (LiveUpdate startv donev) + finishv <- liftIO newEmptyMVar + void $ liftIO $ forkIO $ waitstart startv donev finishv h u + return (LiveUpdate startv donev finishv) where {- Wait for startLiveUpdate, or for the LiveUpdate to get garbage - collected in the case where it is never going to start. -} - waitstart startv donev h u = tryNonAsync (takeMVar startv) >>= \case + waitstart startv donev finishv h u = tryNonAsync (takeMVar startv) >>= \case Right _ -> do + {- Deferring updating the database until here + - avoids overhead except in cases where preferred + - content expressions need live updates. 
-} Db.startingLiveSizeChange h u k sc - waitdone donev h u + waitdone donev finishv h u Left _ -> noop {- Wait for finishedLiveUpdate to be called, or for the LiveUpdate to - get garbage collected in the case where the change didn't - actually happen. -} - waitdone donev h u = tryNonAsync (takeMVar donev) >>= \case - -- TODO if succeeded == True, need to update RepoSize db + waitdone donev finishv h u = tryNonAsync (takeMVar donev) >>= \case + -- TODO need to update RepoSize db -- in same transaction as Db.finishedLiveSizeChange - Right (succeeded, u', k', sc') - | u' == u && k' == k && sc' == sc -> done h u + Right (u', k', sc') + | u' == u && k' == k && sc' == sc -> do + done h u + putMVar finishv () -- This can happen when eg, storing to a cluster -- causes fanout and so this is called with -- other UUIDs. - | otherwise -> waitdone donev h u + | otherwise -> waitdone donev finishv h u Left _ -> done h u done h u = Db.finishedLiveSizeChange h u k sc -- Called when a preferred content check indicates that a live update is -- needed. Can be called more than once. 
startLiveUpdate :: LiveUpdate -> Annex () -startLiveUpdate (LiveUpdate startv _donev) = +startLiveUpdate (LiveUpdate startv _donev _finishv) = liftIO $ void $ tryPutMVar startv () startLiveUpdate NoLiveUpdate = noop -finishedLiveUpdate :: LiveUpdate -> Bool -> UUID -> Key -> SizeChange -> IO () -finishedLiveUpdate (LiveUpdate _startv donev) succeeded u k sc = - putMVar donev (succeeded, u, k, sc) -finishedLiveUpdate NoLiveUpdate _ _ _ _ = noop +finishedLiveUpdate :: LiveUpdate -> UUID -> Key -> SizeChange -> IO () +finishedLiveUpdate (LiveUpdate _startv donev finishv) u k sc = do + tryNonAsync (putMVar donev (u, k, sc)) >>= \case + Right () -> void $ tryNonAsync $ readMVar finishv + Left _ -> noop +finishedLiveUpdate NoLiveUpdate _ _ _ = noop diff --git a/CmdLine.hs b/CmdLine.hs index c90d92a886..417b0f2819 100644 --- a/CmdLine.hs +++ b/CmdLine.hs @@ -31,7 +31,7 @@ import Types.Messages {- Parses input arguments, finds a matching Command, and runs it. -} dispatch :: Bool -> Bool -> CmdParams -> [Command] -> [(String, String)] -> IO Git.Repo -> String -> String -> IO () -dispatch addonok fuzzyok allargs allcmds fields getgitrepo progname progdesc = +dispatch addonok fuzzyok allargs allcmds fields getgitrepo progname progdesc = do go addonok allcmds $ findAddonCommand subcommandname >>= \case Just c -> go addonok (c:allcmds) noop diff --git a/Limit.hs b/Limit.hs index 5493517390..3f8d480881 100644 --- a/Limit.hs +++ b/Limit.hs @@ -18,6 +18,7 @@ import Annex.WorkTree import Annex.UUID import Annex.Magic import Annex.RepoSize +import Annex.RepoSize.LiveUpdate import Logs.MaxSize import Annex.Link import Types.Link @@ -598,7 +599,7 @@ limitFullyBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex limitFullyBalanced = limitFullyBalanced' "fullybalanced" limitFullyBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex -limitFullyBalanced' = limitFullyBalanced'' $ \n key candidates -> do +limitFullyBalanced' = limitFullyBalanced'' $ \lu n key candidates 
-> do maxsizes <- getMaxSizes sizemap <- getRepoSizes False threshhold <- annexFullyBalancedThreshhold <$> Annex.getGitConfig @@ -632,7 +633,7 @@ repoHasSpace keysize inrepo (RepoSize reposize) (MaxSize maxsize) reposize + keysize <= maxsize limitFullyBalanced'' - :: (Int -> Key -> S.Set UUID -> Annex (S.Set UUID)) + :: (LiveUpdate -> Int -> Key -> S.Set UUID -> Annex (S.Set UUID)) -> String -> Maybe UUID -> Annex GroupMap @@ -650,7 +651,7 @@ limitFullyBalanced'' filtercandidates termname mu getgroupmap want = getgroupmap (toGroup s) n want limitFullyBalanced''' - :: (Int -> Key -> S.Set UUID -> Annex (S.Set UUID)) + :: (LiveUpdate -> Int -> Key -> S.Set UUID -> Annex (S.Set UUID)) -> String -> Maybe UUID -> Annex GroupMap @@ -662,13 +663,17 @@ limitFullyBalanced''' filtercandidates termname mu getgroupmap g n want = Right gm <- getgroupmap let groupmembers = fromMaybe S.empty $ M.lookup g (uuidsByGroup gm) - candidates <- filtercandidates n key groupmembers - return $ if S.null candidates + -- TODO locking for liveupdate + candidates <- filtercandidates lu n key groupmembers + let wanted = if S.null candidates then False else case (mu, M.lookup g (balancedPickerByGroup gm)) of (Just u, Just picker) -> u `elem` picker candidates key n _ -> False + when wanted $ + startLiveUpdate lu + return wanted , matchNeedsFileName = False , matchNeedsFileContent = False , matchNeedsKey = True @@ -685,7 +690,7 @@ limitFullySizeBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex limitFullySizeBalanced = limitFullySizeBalanced' "fullysizebalanced" limitFullySizeBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex -limitFullySizeBalanced' = limitFullyBalanced'' $ \n key candidates -> do +limitFullySizeBalanced' = limitFullyBalanced'' $ \lu n key candidates -> do maxsizes <- getMaxSizes sizemap <- getRepoSizes False filterCandidatesFullySizeBalanced maxsizes sizemap n key candidates diff --git a/Types/RepoSize.hs b/Types/RepoSize.hs index bf28a8eac0..78f5d06ea7 
100644 --- a/Types/RepoSize.hs +++ b/Types/RepoSize.hs @@ -27,19 +27,11 @@ newtype MaxSize = MaxSize { fromMaxSize :: Integer } -- Used when an action is in progress that will change the current size of -- a repository. -- --- The live update has been recorded as starting, and filling the MVar with --- the correct UUID, Key, and SizeChange will record the live update --- as complete. The Bool should be True when the action successfully --- added/removed the key from the repository. --- --- If the MVar gets garbage collected before it is filled, the live update --- will be removed. --- -- This allows other concurrent changes to the same repository take -- the changes to its size into account. If NoLiveUpdate is used, it -- prevents that. data LiveUpdate - = LiveUpdate (MVar ()) (MVar (Bool, UUID, Key, SizeChange)) + = LiveUpdate (MVar ()) (MVar (UUID, Key, SizeChange)) (MVar ()) | NoLiveUpdate data SizeChange = AddingKey | RemovingKey (Diff truncated)
LiveUpdate for clusters
diff --git a/Annex/Cluster.hs b/Annex/Cluster.hs index 0b61790431..9f8fd7deae 100644 --- a/Annex/Cluster.hs +++ b/Annex/Cluster.hs @@ -5,7 +5,7 @@ - Licensed under the GNU AGPL version 3 or higher. -} -{-# LANGUAGE RankNTypes, OverloadedStrings #-} +{-# LANGUAGE RankNTypes, OverloadedStrings, TupleSections #-} module Annex.Cluster where @@ -19,6 +19,7 @@ import P2P.IO import Annex.Proxy import Annex.UUID import Annex.BranchState +import Annex.RepoSize.LiveUpdate import Logs.Location import Logs.PreferredContent import Types.Command @@ -108,10 +109,15 @@ clusterProxySelector clusteruuid protocolversion (Bypass bypass) = do , proxyPUT = \af k -> do locs <- S.fromList <$> loggedLocations k let l = filter (flip S.notMember locs . Remote.uuid . remote) nodes - --- XXX FIXME TODO NoLiveUpdate should not be used - -- here. Doing a live update here is exactly why - -- live update is needed. - l' <- filterM (\n -> isPreferredContent NoLiveUpdate (Just (Remote.uuid (remote n))) mempty (Just k) af True) l + let checkpreferred n = do + let u = Just (Remote.uuid (remote n)) + lu <- prepareLiveUpdate u k AddingKey + ifM (isPreferredContent lu u mempty (Just k) af True) + ( return $ Just $ n + { remoteLiveUpdate = lu } + , return Nothing + ) + l' <- catMaybes <$> mapM checkpreferred l -- PUT to no nodes doesn't work, so fall -- back to all nodes. 
return $ nonempty [l', l] nodes diff --git a/Annex/Proxy.hs b/Annex/Proxy.hs index c73855d7b3..48222872c1 100644 --- a/Annex/Proxy.hs +++ b/Annex/Proxy.hs @@ -365,6 +365,6 @@ canProxyForRemote rs myproxies myclusters remoteuuid = mkProxyMethods :: ProxyMethods mkProxyMethods = ProxyMethods - { removedContent = \u k -> logChange NoLiveUpdate k u InfoMissing - , addedContent = \u k -> logChange NoLiveUpdate k u InfoPresent + { removedContent = \lu u k -> logChange lu k u InfoMissing + , addedContent = \lu u k -> logChange lu k u InfoPresent } diff --git a/P2P/Proxy.hs b/P2P/Proxy.hs index cbd28902a1..fc3a5ad094 100644 --- a/P2P/Proxy.hs +++ b/P2P/Proxy.hs @@ -43,6 +43,7 @@ data RemoteSide = RemoteSide , remoteConnect :: Annex (Maybe (RunState, P2PConnection, ProtoCloser)) , remoteTMVar :: TMVar (RunState, P2PConnection, ProtoCloser) , remoteSideId :: RemoteSideId + , remoteLiveUpdate :: LiveUpdate } instance Show RemoteSide where @@ -54,6 +55,7 @@ mkRemoteSide r remoteconnect = RemoteSide <*> pure remoteconnect <*> liftIO (atomically newEmptyTMVar) <*> liftIO (RemoteSideId <$> newUnique) + <*> pure NoLiveUpdate runRemoteSide :: RemoteSide -> Proto a -> Annex (Either ProtoFailure a) runRemoteSide remoteside a = @@ -103,9 +105,9 @@ singleProxySelector r = ProxySelector - all other actions that a proxy needs to do are provided - here. 
-} data ProxyMethods = ProxyMethods - { removedContent :: UUID -> Key -> Annex () + { removedContent :: LiveUpdate -> UUID -> Key -> Annex () -- ^ called when content is removed from a repository - , addedContent :: UUID -> Key -> Annex () + , addedContent :: LiveUpdate -> UUID -> Key -> Annex () -- ^ called when content is added to a repository } @@ -443,7 +445,7 @@ proxyRequest proxydone proxyparams requestcomplete requestmessage protoerrhandle _ -> Nothing let v' = map join v let us = concatMap snd $ catMaybes v' - mapM_ (\u -> removedContent (proxyMethods proxyparams) u k) us + mapM_ (\u -> removedContent (proxyMethods proxyparams) NoLiveUpdate u k) us protoerrhandler requestcomplete $ client $ net $ sendMessage $ let nonplussed = all (== proxyUUID proxyparams) us @@ -511,13 +513,19 @@ proxyRequest proxydone proxyparams requestcomplete requestmessage protoerrhandle requestcomplete () relayPUTRecord k remoteside SUCCESS = do - addedContent (proxyMethods proxyparams) (Remote.uuid (remote remoteside)) k + addedContent (proxyMethods proxyparams) + (remoteLiveUpdate remoteside) + (Remote.uuid (remote remoteside)) + k return $ Just [Remote.uuid (remote remoteside)] relayPUTRecord k remoteside (SUCCESS_PLUS us) = do - let us' = (Remote.uuid (remote remoteside)) : us - forM_ us' $ \u -> - addedContent (proxyMethods proxyparams) u k - return $ Just us' + addedContent (proxyMethods proxyparams) + (remoteLiveUpdate remoteside) + (Remote.uuid (remote remoteside)) + k + forM_ us $ \u -> + addedContent (proxyMethods proxyparams) NoLiveUpdate u k + return $ Just (Remote.uuid (remote remoteside) : us) relayPUTRecord _ _ _ = return Nothing diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index f7b9fc53e7..c4b13ce4f2 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -142,11 +142,10 @@ Planned schedule of work: also be done by just repeatedly touching a file named with the processes's pid in it, to avoid sqlite 
overhead. -* Check for TODO XXX markers +* Still implementing LiveUpdate. Check for TODO XXX markers * Check all uses of NoLiveUpdate to see if a live update can be started and - performed there. There is one in Annex.Cluster in particular that needs a - live update. + performed there. * The assistant is using NoLiveUpdate, but it should be posssible to plumb a LiveUpdate through it from preferred content checking to location log
punt on LiveUpdate plumbing through assistant for now
diff --git a/Assistant/Threads/Committer.hs b/Assistant/Threads/Committer.hs index 229ad17d1a..2f7e03c43c 100644 --- a/Assistant/Threads/Committer.hs +++ b/Assistant/Threads/Committer.hs @@ -322,7 +322,7 @@ handleAdds lockdowndir havelsof largefilematcher annexdotfiles delayadd cs = ret | not annexdotfiles && dotfile f = return (Right change) | otherwise = - ifM (liftAnnex $ checkFileMatcher largefilematcher f) + ifM (liftAnnex $ checkFileMatcher NoLiveUpdate largefilematcher f) ( return (Left change) , return (Right change) ) @@ -395,7 +395,7 @@ handleAdds lockdowndir havelsof largefilematcher annexdotfiles delayadd cs = ret return Nothing done change file key = liftAnnex $ do - logStatus key InfoPresent + logStatus NoLiveUpdate key InfoPresent mode <- liftIO $ catchMaybeIO $ fileMode <$> R.getFileStatus (toRawFilePath file) stagePointerFile (toRawFilePath file) mode =<< hashPointerFile key showEndOk diff --git a/Assistant/Threads/TransferScanner.hs b/Assistant/Threads/TransferScanner.hs index 970516a380..230194cb2c 100644 --- a/Assistant/Threads/TransferScanner.hs +++ b/Assistant/Threads/TransferScanner.hs @@ -171,9 +171,9 @@ expensiveScan urlrenderer rs = batch <~> do "expensive scan found too many copies of object" present key af (SeekInput []) [] callCommandAction ts <- if present - then liftAnnex . filterM (wantGetBy True (Just key) af . Remote.uuid . fst) + then liftAnnex . filterM (wantGetBy NoLiveUpdate True (Just key) af . Remote.uuid . 
fst) =<< use syncDataRemotes (genTransfer Upload False) - else ifM (liftAnnex $ wantGet True (Just key) af) + else ifM (liftAnnex $ wantGet NoLiveUpdate True (Just key) af) ( use downloadRemotes (genTransfer Download True) , return [] ) let unwanted' = S.difference unwanted slocs return (unwanted', ts) diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 0ab04b1e47..f7b9fc53e7 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -142,11 +142,15 @@ Planned schedule of work: also be done by just repeatedly touching a file named with the processes's pid in it, to avoid sqlite overhead. +* Check for TODO XXX markers + * Check all uses of NoLiveUpdate to see if a live update can be started and performed there. There is one in Annex.Cluster in particular that needs a - live update + live update. -* Check for TODO XXX markers +* The assistant is using NoLiveUpdate, but it should be posssible to plumb + a LiveUpdate through it from preferred content checking to location log + updating. * `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using
initial report on desire to handle pathspecs
diff --git a/doc/bugs/get__47__metadata__47____63____63____63____58___does_not_handle_pathspec_correct.mdwn b/doc/bugs/get__47__metadata__47____63____63____63____58___does_not_handle_pathspec_correct.mdwn
new file mode 100644
index 0000000000..1afa94d559
--- /dev/null
+++ b/doc/bugs/get__47__metadata__47____63____63____63____58___does_not_handle_pathspec_correct.mdwn
@@ -0,0 +1,67 @@
+### Please describe the problem.
+
+Wanted to use `metadata` (to annotate anatomical T1s with metadata), and then tried `get` on a pathspec.
+`git annex` then incorrectly claims that no files patch although I show with `git ls-files` on the same pathspec that there are files:
+
+```shell
+❯ git annex version
+git-annex version: 10.20240731+git17-g6d1592f857-1~ndall+1
+build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV
+...
+
+❯ git ls-files '**/*.nii.gz' | head -n 1
+sub-0001/ses-01/anat/sub-0001_ses-01_acq-MPRAGEXp3X08mm_T1w.nii.gz
+
+❯ git annex metadata '**/*.nii.gz'
+error: pathspec './**/*.nii.gz' did not match any file(s) known to git
+Did you forget to 'git add'?
+metadata: 1 failed
+
+# git-annex changed pathspec to have leading ./ -- let's try with that too:
+❯ git ls-files './**/*.nii.gz' | head -n 1
+sub-0001/ses-01/anat/sub-0001_ses-01_acq-MPRAGEXp3X08mm_T1w.nii.gz
+
+# annex get -- the same story
+❯ git annex get '**/*.nii.gz'
+error: pathspec './**/*.nii.gz' did not match any file(s) known to git
+Did you forget to 'git add'?
+(merging typhon/git-annex into git-annex...)
+(recording state in git...)
+get: 1 failed
+```
+
+From `annex --debug` we can see that annex unconditionally uses `--literal-pathspecs`
+
+```shell
+❯ git annex --debug get '**/*.nii.gz'
+[2024-08-23 21:29:36.951044831] (Utility.Process) process [3889124] read: git ["--git-dir=.git","--work-tree=.","--literal-pathspecs","-c","annex.debug=true","ls-files","--stage","-z","--error-unmatch","--","./**/*.nii.gz"]
+```
+
+so, I think then annex should have at least used "literal" in the error, e.g.
+
+```
+error: literal pathspec './**/*.nii.gz' did not match any file(s) known to git
+```
+
+and ideally also hinted on how to disable such behavior (if possible) and do allow for "magical" etc pathspecs there.
+
+FWIW, I have tried with `GIT_GLOB_PATHSPECS=1` env var but that didn't help.... not sure if possible at all looking at the code
+
+```
+fixupRepo :: Repo -> GitConfig -> IO Repo
+fixupRepo r c = do
+	let r' = disableWildcardExpansion r
+	r'' <- fixupUnusualRepos r' c
+	if annexDirect c
+		then return (fixupDirect r'')
+		else return r''
+
+{- Disable git's built-in wildcard expansion, which is not wanted
+ - when using it as plumbing by git-annex. -}
+disableWildcardExpansion :: Repo -> Repo
+disableWildcardExpansion r = r
+	{ gitGlobalOpts = gitGlobalOpts r ++ [Param "--literal-pathspecs"] }
+```
+
+[[!meta author=yoh]]
+[[!tag projects/openneuro]]
initial idea on another ability for get
diff --git a/doc/todo/get__58___allow_for_both_--branch_and_pathspec.mdwn b/doc/todo/get__58___allow_for_both_--branch_and_pathspec.mdwn
new file mode 100644
index 0000000000..e80cfb14bd
--- /dev/null
+++ b/doc/todo/get__58___allow_for_both_--branch_and_pathspec.mdwn
@@ -0,0 +1,8 @@
+It is desired to be able to get keys which correspond for some commit which would otherwise be not easy/undesired to checkout (too big tree, tree used actively etc) .
+But it seems it is impossible to do so ATM:
+
+```shell
+yoh@typhon:/mnt/DATA/data/yoh/1076_spacetop$ git annex get --branch e6888f70ed97099f83a77d5bcf3372a9a75a2b5e^ '**/*.nii.gz'
+git-annex: Can only specify one of file names, --all, --branch, --unused, --failed, --key, or --incomplete
+
+```
plumb in LiveUpdate (WIP)
Each command that first checks preferred content (and/or required
content) and then does something that can change the sizes of
repositories needs to call prepareLiveUpdate, and plumb it through the
preferred content check and the location log update.
So far, only Command.Drop is done. Many other commands that don't need
to do this have been updated to keep working.
There may be some calls to NoLiveUpdate in places where that should be
done. All will need to be double checked.
Not currently in a compilable state.
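The plumbing pattern this commit describes can be sketched in miniature. This is an illustrative simplification, not the real git-annex types (no UUID/Key payload, plain IO instead of the Annex monad, and only two of the MVars): the caller prepares a LiveUpdate before the preferred content check, the check signals that the update is starting, and the location log update marks it finished.

```haskell
import Control.Concurrent
import Control.Exception
import Control.Monad (void)

data LiveUpdate = LiveUpdate (MVar ()) (MVar ()) | NoLiveUpdate

-- Prepare before checking preferred content; a helper thread records
-- the pending size change only if the update actually starts.
prepareLiveUpdate :: IO LiveUpdate
prepareLiveUpdate = do
    startv <- newEmptyMVar
    donev <- newEmptyMVar
    _ <- forkIO $ do
        r <- try (takeMVar startv) :: IO (Either SomeException ())
        case r of
            Right () -> do
                putStrLn "recording live size change"
                void (try (takeMVar donev) :: IO (Either SomeException ()))
                putStrLn "removing live size change"
            Left _ -> return ()  -- LiveUpdate was GCed; never started
    return (LiveUpdate startv donev)

-- Called from the preferred content check when the expression matches.
startLiveUpdate :: LiveUpdate -> IO ()
startLiveUpdate (LiveUpdate startv _) = void (tryPutMVar startv ())
startLiveUpdate NoLiveUpdate = return ()

-- Called after the location log has been updated.
finishedLiveUpdate :: LiveUpdate -> IO ()
finishedLiveUpdate (LiveUpdate _ donev) = putMVar donev ()
finishedLiveUpdate NoLiveUpdate = return ()

main :: IO ()
main = do
    lu <- prepareLiveUpdate
    startLiveUpdate lu    -- preferred content matched
    finishedLiveUpdate lu -- location log updated
    threadDelay 100000    -- give the helper thread time to run
```

Passing the same LiveUpdate value through both the check and the log update is what ties the "recording" and "removing" steps to one logical size change; NoLiveUpdate makes both calls no-ops for callers that do not need this.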
diff --git a/Annex.hs b/Annex.hs index cbec4befca..4208e5c741 100644 --- a/Annex.hs +++ b/Annex.hs @@ -1,6 +1,6 @@ {- git-annex monad - - - Copyright 2010-2021 Joey Hess <id@joeyh.name> + - Copyright 2010-2024 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. -} @@ -79,6 +79,7 @@ import Types.RepoSize import Annex.VectorClock.Utility import Annex.Debug.Utility import qualified Database.Keys.Handle as Keys +import Database.RepoSize.Handle import Utility.InodeCache import Utility.Url import Utility.ResourcePool @@ -225,6 +226,7 @@ data AnnexState = AnnexState , insmudgecleanfilter :: Bool , getvectorclock :: IO CandidateVectorClock , proxyremote :: Maybe (Either ClusterUUID (Types.Remote.RemoteA Annex)) + , reposizehandle :: Maybe RepoSizeHandle } newAnnexState :: GitConfig -> Git.Repo -> IO AnnexState @@ -280,6 +282,7 @@ newAnnexState c r = do , insmudgecleanfilter = False , getvectorclock = vc , proxyremote = Nothing + , reposizehandle = Nothing } {- Makes an Annex state object for the specified git repo. diff --git a/Annex/Cluster.hs b/Annex/Cluster.hs index f3283094d3..0b61790431 100644 --- a/Annex/Cluster.hs +++ b/Annex/Cluster.hs @@ -108,7 +108,10 @@ clusterProxySelector clusteruuid protocolversion (Bypass bypass) = do , proxyPUT = \af k -> do locs <- S.fromList <$> loggedLocations k let l = filter (flip S.notMember locs . Remote.uuid . remote) nodes - l' <- filterM (\n -> isPreferredContent (Just (Remote.uuid (remote n))) mempty (Just k) af True) l + --- XXX FIXME TODO NoLiveUpdate should not be used + -- here. Doing a live update here is exactly why + -- live update is needed. + l' <- filterM (\n -> isPreferredContent NoLiveUpdate (Just (Remote.uuid (remote n))) mempty (Just k) af True) l -- PUT to no nodes doesn't work, so fall -- back to all nodes. 
return $ nonempty [l', l] nodes diff --git a/Annex/Common.hs b/Annex/Common.hs index 0fc602205a..37644dd857 100644 --- a/Annex/Common.hs +++ b/Annex/Common.hs @@ -11,6 +11,7 @@ import Annex.Locations as X import Annex.Debug as X (fastDebug, debug) import Messages as X import Git.Quote as X +import Types.RepoSize as X #ifndef mingw32_HOST_OS import System.Posix.IO as X hiding (createPipe, append) #endif diff --git a/Annex/Content.hs b/Annex/Content.hs index 4ad045d763..93d111140f 100644 --- a/Annex/Content.hs +++ b/Annex/Content.hs @@ -788,7 +788,7 @@ moveBad key = do createAnnexDirectory (parentDir dest) cleanObjectLoc key $ liftIO $ moveFile src dest - logStatus key InfoMissing + logStatus NoLiveUpdate key InfoMissing return dest data KeyLocation = InAnnex | InAnywhere diff --git a/Annex/Drop.hs b/Annex/Drop.hs index ccbc18e6e1..49c15746c4 100644 --- a/Annex/Drop.hs +++ b/Annex/Drop.hs @@ -29,9 +29,9 @@ type Reason = String - required content, and numcopies settings. - - Skips trying to drop from remotes that are appendonly, since those drops - - would presumably fail. Also skips dropping from exporttree/importtree remotes, - - which don't allow dropping individual keys, and from thirdPartyPopulated - - remotes. + - would presumably fail. Also skips dropping from exporttree/importtree + - remotes, which don't allow dropping individual keys, and from + - thirdPartyPopulated remotes. - - The UUIDs are ones where the content is believed to be present. - The Remote list can include other remotes that do not have the content; @@ -92,11 +92,12 @@ handleDropsFrom locs rs reason fromhere key afile si preverified runner = do dropr fs r n >>= go fs rest | otherwise = pure n - checkdrop fs n u a = + checkdrop fs n u a = do let afs = map (AssociatedFile . 
Just) fs - pcc = Command.Drop.PreferredContentChecked True - in ifM (wantDrop True u (Just key) afile (Just afs)) - ( dodrop n u (a pcc) + let pcc = Command.Drop.PreferredContentChecked True + lu <- prepareLiveUpdate u key RemovingKey + ifM (wantDrop lu True u (Just key) afile (Just afs)) + ( dodrop n u (a lu pcc) , return n ) @@ -116,12 +117,16 @@ handleDropsFrom locs rs reason fromhere key afile si preverified runner = do , return n ) - dropl fs n = checkdrop fs n Nothing $ \pcc numcopies mincopies -> + dropl fs n = checkdrop fs n Nothing $ \lu pcc numcopies mincopies -> stopUnless (inAnnex key) $ - Command.Drop.startLocal pcc afile ai si numcopies mincopies key preverified (Command.Drop.DroppingUnused False) + Command.Drop.startLocal lu pcc afile ai si + numcopies mincopies key preverified + (Command.Drop.DroppingUnused False) - dropr fs r n = checkdrop fs n (Just $ Remote.uuid r) $ \pcc numcopies mincopies -> - Command.Drop.startRemote pcc afile ai si numcopies mincopies key (Command.Drop.DroppingUnused False) r + dropr fs r n = checkdrop fs n (Just $ Remote.uuid r) $ \lu pcc numcopies mincopies -> + Command.Drop.startRemote lu pcc afile ai si + numcopies mincopies key + (Command.Drop.DroppingUnused False) r ai = mkActionItem (key, afile) diff --git a/Annex/FileMatcher.hs b/Annex/FileMatcher.hs index c2490711ae..fc84a9c02b 100644 --- a/Annex/FileMatcher.hs +++ b/Annex/FileMatcher.hs @@ -53,22 +53,22 @@ import Control.Monad.Writer type GetFileMatcher = RawFilePath -> Annex (FileMatcher Annex) -checkFileMatcher :: GetFileMatcher -> RawFilePath -> Annex Bool -checkFileMatcher getmatcher file = - checkFileMatcher' getmatcher file (return True) +checkFileMatcher :: LiveUpdate -> GetFileMatcher -> RawFilePath -> Annex Bool +checkFileMatcher lu getmatcher file = + checkFileMatcher' lu getmatcher file (return True) -- | Allows running an action when no matcher is configured for the file. 
-checkFileMatcher' :: GetFileMatcher -> RawFilePath -> Annex Bool -> Annex Bool -checkFileMatcher' getmatcher file notconfigured = do +checkFileMatcher' :: LiveUpdate -> GetFileMatcher -> RawFilePath -> Annex Bool -> Annex Bool +checkFileMatcher' lu getmatcher file notconfigured = do matcher <- getmatcher file - checkMatcher matcher Nothing afile S.empty notconfigured d + checkMatcher matcher Nothing afile lu S.empty notconfigured d where afile = AssociatedFile (Just file) -- checkMatcher will never use this, because afile is provided. d = return True -checkMatcher :: FileMatcher Annex -> Maybe Key -> AssociatedFile -> AssumeNotPresent -> Annex Bool -> Annex Bool -> Annex Bool -checkMatcher matcher mkey afile notpresent notconfigured d +checkMatcher :: FileMatcher Annex -> Maybe Key -> AssociatedFile -> LiveUpdate -> AssumeNotPresent -> Annex Bool -> Annex Bool -> Annex Bool +checkMatcher matcher mkey afile lu notpresent notconfigured d | isEmpty (fst matcher) = notconfigured | otherwise = case (mkey, afile) of (_, AssociatedFile (Just file)) -> @@ -85,12 +85,12 @@ checkMatcher matcher mkey afile notpresent notconfigured d in go (MatchingInfo i) (Nothing, _) -> d where - go mi = checkMatcher' matcher mi notpresent + go mi = checkMatcher' matcher mi lu notpresent -checkMatcher' :: FileMatcher Annex -> MatchInfo -> AssumeNotPresent -> Annex Bool -checkMatcher' (matcher, (MatcherDesc matcherdesc)) mi notpresent = do +checkMatcher' :: FileMatcher Annex -> MatchInfo -> LiveUpdate -> AssumeNotPresent -> Annex Bool +checkMatcher' (matcher, (MatcherDesc matcherdesc)) mi lu notpresent = do (matches, desc) <- runWriterT $ matchMrun' matcher $ \op -> - matchAction op notpresent mi + matchAction op lu notpresent mi explain (mkActionItem mi) $ UnquotedString <$> describeMatchResult matchDesc desc ((if matches then "matches " else "does not match ") ++ matcherdesc ++ ": ") @@ -259,9 +259,9 @@ addUnlockedMatcher = AddUnlockedMatcher <$> matchalways True = return (MOp 
limitAnything, matcherdesc) matchalways False = return (MOp limitNothing, matcherdesc) -checkAddUnlockedMatcher :: AddUnlockedMatcher -> MatchInfo -> Annex Bool -checkAddUnlockedMatcher (AddUnlockedMatcher matcher) mi = - checkMatcher' matcher mi S.empty +checkAddUnlockedMatcher :: LiveUpdate -> AddUnlockedMatcher -> MatchInfo -> Annex Bool +checkAddUnlockedMatcher lu (AddUnlockedMatcher matcher) mi = + checkMatcher' matcher mi lu S.empty simply :: MatchFiles Annex -> ParseResult (MatchFiles Annex) simply = Right . Operation @@ -271,8 +271,8 @@ usev a v = Operation <$> a v call :: String -> Either String (Matcher (MatchFiles Annex)) -> ParseResult (MatchFiles Annex) (Diff truncated)
add live size changes to RepoSize database
Not yet used.
diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index 40599fddf8..c75ecb16e6 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -6,6 +6,7 @@ -} {-# LANGUAGE CPP #-} +{-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE QuasiQuotes, TypeFamilies, TemplateHaskell #-} {-# LANGUAGE OverloadedStrings, GADTs, FlexibleContexts #-} {-# LANGUAGE MultiParamTypeClasses, GeneralizedNewtypeDeriving #-} @@ -24,6 +25,9 @@ module Database.RepoSize ( closeDb, getRepoSizes, setRepoSizes, + getLiveSizeChanges, + startingLiveSizeChange, + finishedLiveSizeChange, ) where import Annex.Common @@ -40,6 +44,7 @@ import Database.Persist.Sql hiding (Key) import Database.Persist.TH import qualified System.FilePath.ByteString as P import qualified Data.Map as M +import qualified Data.Text as T newtype RepoSizeHandle = RepoSizeHandle (Maybe H.DbHandle) @@ -53,6 +58,12 @@ RepoSizes AnnexBranch commit SSha UniqueCommit commit +-- Changes that are currently being made that affect repo sizes. +LiveSizeChanges + repo UUID + key Key + change SizeChange + UniqueLiveSizeChange repo key |] {- Opens the database, creating it if it doesn't exist yet. @@ -143,3 +154,45 @@ recordAnnexBranchCommit :: Sha -> SqlPersistM () recordAnnexBranchCommit branchcommitsha = do deleteWhere ([] :: [Filter AnnexBranch]) void $ insertUniqueFast $ AnnexBranch $ toSSha branchcommitsha + +data SizeChange = AddingKey | RemovingKey + +{- If there is already a size change for the same UUID and Key, it is + - overwritten with the new size change. -} +startingLiveSizeChange :: UUID -> Key -> SizeChange -> SqlPersistM () +startingLiveSizeChange u k sc = + void $ upsertBy + (UniqueLiveSizeChange u k) + (LiveSizeChanges u k sc) + [LiveSizeChangesChange =. sc] + +finishedLiveSizeChange :: UUID -> Key -> SizeChange -> SqlPersistM () +finishedLiveSizeChange u k sc = deleteWhere + [ LiveSizeChangesRepo ==. u + , LiveSizeChangesKey ==. k + , LiveSizeChangesChange ==. 
sc + ] + +getLiveSizeChanges :: RepoSizeHandle -> IO (M.Map UUID (Key, SizeChange)) +getLiveSizeChanges (RepoSizeHandle (Just h)) = H.queryDb h $ do + m <- M.fromList . map conv <$> getLiveSizeChanges' + return m + where + conv entity = + let LiveSizeChanges u k sc = entityVal entity + in (u, (k, sc)) +getLiveSizeChanges (RepoSizeHandle Nothing) = return mempty + +getLiveSizeChanges' :: SqlPersistM [Entity LiveSizeChanges] +getLiveSizeChanges' = selectList [] [] + +instance PersistField SizeChange where + toPersistValue AddingKey = toPersistValue (1 :: Int) + toPersistValue RemovingKey = toPersistValue (-1 :: Int) + fromPersistValue b = fromPersistValue b >>= \case + (1 :: Int) -> Right AddingKey + -1 -> Right RemovingKey + v -> Left $ T.pack $ "bad serialized SizeChange "++ show v + +instance PersistFieldSql SizeChange where + sqlType _ = SqlInt32 diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 76febcdf61..1475edfd8e 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -79,6 +79,7 @@ Planned schedule of work: Add to reposizes db a table for live updates. Listing process ID, thread ID, UUID, key, addition or removal + (done) Make checking the balanced preferred content limit record a live update in the table and use other live updates in making its
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index f1f213f326..76febcdf61 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -35,6 +35,10 @@ Planned schedule of work: May not be a bug, needs reproducing and analysis. +* Check if reposizes updates works when using `git-annex transferrer`. + Eg, does the location log update happen in the parent process or in + the transferrer process? + * Concurrency issues with RepoSizes calculation and balanced content: * What if 2 concurrent threads are considering sending two different @@ -102,13 +106,17 @@ Planned schedule of work: Somehow detect when an upload (or drop) fails, and remove from the live update table and in-memory cache. How? Possibly have a thread that - waits on an empty MVar. Fill MVar on location log update. If MVar gets + waits on an empty MVar. Thread MVar through somehow to location log + update. (Seems this would need checking preferred content to return + the MVar? Or alternatively, the MVar could be passed into it, which + seems better..) Fill MVar on location log update. If MVar gets GCed without being filled, the thread will get an exception and can remove from table and cache then. This does rely on GC behavior, but if the GC takes some time, it will just cause a failed upload to take longer to get removed from the table and cache, which will just prevent another upload of a different key from running immediately. - (Need to check if MVar GC behavior operates like this.) + (Need to check if MVar GC behavior operates like this. + See https://stackoverflow.com/questions/10871303/killing-a-thread-when-mvar-is-garbage-collected ) Have a counter in the reposizes table that is updated on write. This can be used to quickly determine if it has changed. On every check of @@ -118,7 +126,7 @@ Planned schedule of work: The counter could also be a per-UUID counter, so two processes operating on different remotes would not have overhead. 
- When loading the live update table, check if processes in it are still + When loading the live update table, check if PIDs in it are still running (and are still git-annex), and if not, remove stale entries from it, which can accumulate when processes are interrupted. Note that it will be ok for the wrong git-annex process, running again @@ -126,6 +134,17 @@ Planned schedule of work: is unlikely and exponentially unlikely to happen repeatedly, so stale information will only be used for a short time. + But then, how to check if a PID is git-annex or not? /proc of course, + but what about other OS's? Windows? + + Perhaps stale entries can be found in a different way. Require the live + update table to be updated with a timestamp every 5 minutes. The thread + that waits on the MVar can do that, as long as the transfer is running. If + interrupted, it will become stale in 5 minutes, which is probably good + enough? Could do it every minute, depending on overhead. This could + also be done by just repeatedly touching a file named with the processes's + pid in it, to avoid sqlite overhead. + * `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using overLocationLogs. In the other path it does not, and this should be fixed
possible design to address reposizes concurrency issues
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 5445b65419..f1f213f326 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -71,6 +71,61 @@ Planned schedule of work: command behave non-ideally, the same as the thread concurrency problems. + * Possible solution: + + Add to reposizes db a table for live updates. + Listing process ID, thread ID, UUID, key, addition or removal + + Make checking the balanced preferred content limit record a + live update in the table and use other live updates in making its + decision. With locking as necessary. + + Note: This will only work when preferred content is being checked. + If a git-annex copy without --auto is run, for example, it won't + tell other processes that it is in the process of filling up a remote. + That seems ok though, because if the user is running a command like + that, they are ok with a remote filling up. + + In the unlikely event that one thread of a process is storing a key and + another thread is dropping the same key from the same uuid, at the same + time, reconcile somehow. How? Or is this perhaps something that cannot + happen? + + Also keep an in-memory cache of the live updates being performed by + the current process. For use in location log update as follows.. + + Make updating location log for a key that is in the in-memory cache + of the live update table update the db, removing it from that table, + and updating the in-memory reposizes. This needs to have + locking to make sure redundant information is never visible: + Take lock, journal update, remove from live update table. + + Somehow detect when an upload (or drop) fails, and remove from the live + update table and in-memory cache. How? Possibly have a thread that + waits on an empty MVar. Fill MVar on location log update. If MVar gets + GCed without being filled, the thread will get an exception and can + remove from table and cache then. 
This does rely on GC behavior, but if + the GC takes some time, it will just cause a failed upload to take + longer to get removed from the table and cache, which will just prevent + another upload of a different key from running immediately. + (Need to check if MVar GC behavior operates like this.) + + Have a counter in the reposizes table that is updated on write. This + can be used to quickly determine if it has changed. On every check of + balanced preferred content, check the counter, and if it's been changed + by another process, re-run calcRepoSizes. This would be expensive, but + it would only happen when another process is running at the same time. + The counter could also be a per-UUID counter, so two processes + operating on different remotes would not have overhead. + + When loading the live update table, check if processes in it are still + running (and are still git-annex), and if not, remove stale entries + from it, which can accumulate when processes are interrupted. + Note that it will be ok for the wrong git-annex process, running again + at a pid to keep a stale item in the live update table, because that + is unlikely and exponentially unlikely to happen repeatedly, so stale + information will only be used for a short time. + * `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using overLocationLogs. In the other path it does not, and this should be fixed
Added a comment: No root privileges server - annex-shell replaced by git-annex-shell
diff --git a/doc/tips/get_git-annex-shell_into_PATH/comment_1_a1b1e9a23e8f58ed809386b54fa076f2._comment b/doc/tips/get_git-annex-shell_into_PATH/comment_1_a1b1e9a23e8f58ed809386b54fa076f2._comment new file mode 100644 index 0000000000..ff31c041fd --- /dev/null +++ b/doc/tips/get_git-annex-shell_into_PATH/comment_1_a1b1e9a23e8f58ed809386b54fa076f2._comment @@ -0,0 +1,26 @@ +[[!comment format=mdwn + username="gauss@055c9051f507c97fa5612f46c74ce636f5ecde10" + nickname="gauss" + avatar="http://cdn.libravatar.org/avatar/07c3a0c551ecfe4aa8c047ff5f6f4e79" + subject="No root privileges server - annex-shell replaced by git-annex-shell" + date="2024-08-23T01:51:49Z" + content=""" +I've cloned a git repository through ssh from a server which I don't have root privileges. +The clone command is something like: + + git clone ssh://johndoe@somedomain.com:23/home/johndoe/Downloads/gitannextest4/ + +I tried to enable the remote and I get the error: *Remote gitannextest4 does not have git-annex installed; setting annex-ignore*. +I had no success following the steps here. + +I believe there is an error in the last of the alternatives presented here: + + git config remote.annoyingserver.annex-shell /home/me/bin/git-annex-shell (does not work) + git config remote.annoyingserver.git-annex-shell /home/me/bin/git-annex-shell (works!) + +So, **annex-shell** should be replaced by **git-annex-shell**. + +Hope it helps. + + +"""]]
improve docs
diff --git a/doc/git-annex-maxsize.mdwn b/doc/git-annex-maxsize.mdwn index 04870ce777..5bbfb99d1a 100644 --- a/doc/git-annex-maxsize.mdwn +++ b/doc/git-annex-maxsize.mdwn @@ -14,24 +14,31 @@ git annex maxsize This configures the maximum combined size of annexed files that can be stored in a repository. When run with a repository but without a size, -it displays the currently configured maxsize. When run without a -repository, it displays an overview of the size and maxsize of all +it displays the currently configured maxsize. When run without any +parameters, it displays an overview of the size and maxsize of all repositories. The repository can be specified by git remote name or by uuid. For the current repository, use "here". The size can be specified using any units. For example "100 gigabytes" or -"0.8 TB" +"0.8TB" This is advisory only, it does not prevent git-annex from trying to store -more data than that in a repository. So use this to tell git-annex about -hard repository size limits that are enforced in some other way. +more data in a repository. When a repository has a preferred content +expression configured using "balanced" or "sizebalanced", it will take the +maxsize into account when checking preferred content. It is still possible +for the maxsize to be exceeded, eg when there are multiple writers to the +same repository. + +A hard repository size limit has to be enforced in some other way, +eg by putting the repository on a partition of the desired size. +This command can then be used to tell git-annex about that size limit. For example, if a git repository is on a 1 terabyte drive, and is the only thing stored on that drive, and `annex.diskreserve` is configured to 1 -gigabyte, then it would make sense to run -`git-annex maxsize here "999 gigabytes"`. 
+gigabyte, then it would make sense to run +`git-annex maxsize here "999 gigabytes"` # OPTIONS @@ -49,6 +56,8 @@ gigabyte, then it would make sense to run [[git-annex]](1) +[[git-annex-preferred-content]](1) + # AUTHOR Joey Hess <id@joeyh.name>
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index ebdcae3433..5445b65419 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -33,6 +33,8 @@ Planned schedule of work: * `git-annex assist --rebalance` of `balanced=foo:2` sometimes needs several runs to stabalize. + May not be a bug, needs reproducing and analysis. + * Concurrency issues with RepoSizes calculation and balanced content: * What if 2 concurrent threads are considering sending two different @@ -79,6 +81,8 @@ Planned schedule of work: * Balanced preferred content basic implementation, including --rebalance option. * Implemented [[track_free_space_in_repos_via_git-annex_branch]] +* `git-annex maxsize` +* annex.fullybalancedthreshhold ## completed items for August's work on git-annex proxy support for exporttre
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index f8606e5aba..ebdcae3433 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -69,24 +69,6 @@ Planned schedule of work: command behave non-ideally, the same as the thread concurrency problems. -* implement size-based balancing, so all balanced repositories are around - the same percent full, perhaps as another preferred - content expression. - -* `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg, - when 2 out of 3 repositories are full, and the 3rd is mostly empty, - it is no longer possible to add new files to 2 repositories. - Moving some files from one of the full repositories to the empty one - would improve things, but is there any way for fullybalanced to know - when it makes sense to do that? - - If this is not resolved, it may be better to lose the ":number" part - of balanced and fullybalanced. With 1 copy balanced, this situation does - not occur. Users wanting 2 copies can have 2 groups which are each - balanced, although that would mean more repositories on more drives. - - Size based rebalancing may offer a solution; see design. - * `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using overLocationLogs. In the other path it does not, and this should be fixed
Added the annex.fullybalancedthreshhold git config.
diff --git a/CHANGELOG b/CHANGELOG index d5c63b3a45..95b7579d5e 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -23,6 +23,7 @@ git-annex (10.20240831) UNRELEASED; urgency=medium * Added --rebalance option. * maxsize: New command to tell git-annex how large the expected maximum size of a repository is, and to display repository sizes. + * Added the annex.fullybalancedthreshhold git config. * vicfg: Include maxsize configuration. * info: Improved speed. diff --git a/Limit.hs b/Limit.hs index 4ed4853f78..72175d8a44 100644 --- a/Limit.hs +++ b/Limit.hs @@ -602,7 +602,7 @@ limitFullyBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex limitFullyBalanced' = limitFullyBalanced'' $ \n key candidates -> do maxsizes <- getMaxSizes sizemap <- getRepoSizes False - let threshhold = 0.9 :: Double + threshhold <- annexFullyBalancedThreshhold <$> Annex.getGitConfig let toofull u = case (M.lookup u maxsizes, M.lookup u sizemap) of (Just (MaxSize maxsize), Just (RepoSize reposize)) -> diff --git a/Types/GitConfig.hs b/Types/GitConfig.hs index 68ce298d02..4b9c546e86 100644 --- a/Types/GitConfig.hs +++ b/Types/GitConfig.hs @@ -163,6 +163,7 @@ data GitConfig = GitConfig , annexAdviceNoSshCaching :: Bool , annexViewUnsetDirectory :: ViewUnset , annexClusters :: M.Map RemoteName ClusterUUID + , annexFullyBalancedThreshhold :: Double } extractGitConfig :: ConfigSource -> Git.Repo -> GitConfig @@ -296,6 +297,9 @@ extractGitConfig configsource r = GitConfig M.mapMaybe (mkClusterUUID . 
toUUID) $ M.mapKeys removeclusterprefix $ M.filterWithKey isclusternamekey (config r) + , annexFullyBalancedThreshhold = + fromMaybe 0.9 $ (/ 100) <$> getmayberead + (annexConfig "fullybalancedthreshhold") } where getbool k d = fromMaybe d $ getmaybebool k diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 14520d0a03..f8606e5aba 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -30,8 +30,6 @@ Planned schedule of work: ## work notes -* Implement annex.fullybalancedthreshhold - * `git-annex assist --rebalance` of `balanced=foo:2` sometimes needs several runs to stabalize.
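The patch parses the config as a percentage, dividing by 100 and falling back to 0.9 when the value is unset or unparseable. A standalone sketch of that logic (the `threshholdFromConfig` wrapper is invented here for illustration):

```haskell
import Data.Maybe (fromMaybe)
import Text.Read (readMaybe)

-- The git config value is a percentage; 0.9 is the default when
-- the config is absent or does not parse as a number.
threshholdFromConfig :: Maybe String -> Double
threshholdFromConfig v = fromMaybe 0.9 ((/ 100) <$> (readMaybe =<< v))

main :: IO ()
main = do
  print (threshholdFromConfig (Just "75"))  -- prints 0.75
  print (threshholdFromConfig Nothing)      -- prints 0.9
```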
display new empty repos in maxsize table
A new repo that has no location log info yet, but does have an entry in
uuid.log, has a size of 0, so make RepoSize aware of that.
Note that a new repo that does not yet appear in uuid.log will still not
be displayed.
When a remote is added but not synced with yet, it has no uuid.log
entry. If git-annex maxsize is used to configure that remote, it needs
to appear in the maxsize table, and the change to Command.MaxSize takes
care of that.
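The commit relies on `M.insertWith (flip const)` to seed a size of 0 for known repos without overwriting sizes that were already computed. A self-contained sketch of that idiom (the `addEmpty` name is made up here):

```haskell
import Data.List (foldl')
import qualified Data.Map as M

-- M.insertWith (flip const) only takes effect when the key is absent,
-- leaving any existing entry untouched -- so known-but-empty repos get
-- size 0 while computed sizes survive.
addEmpty :: [String] -> M.Map String Integer -> M.Map String Integer
addEmpty known m = foldl' (\acc u -> M.insertWith (flip const) u 0 acc) m known

main :: IO ()
main = print (addEmpty ["a", "b"] (M.fromList [("a", 42)]))
-- prints fromList [("a",42),("b",0)]
```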
diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index d7c0f0a436..ff053ba03f 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -126,7 +126,9 @@ diffBranchRepoSizes quiet oldsizemap oldbranchsha newbranchsha = do newsizemap <- readpairs 100000 reader oldsizemap Nothing liftIO $ wait feedtid ifM (liftIO cleanup) - ( return (newsizemap, newbranchsha) + ( do + newsizemap' <- addemptyrepos newsizemap + return (newsizemap', newbranchsha) , return (oldsizemap, oldbranchsha) ) where @@ -156,3 +158,10 @@ diffBranchRepoSizes quiet oldsizemap oldbranchsha newbranchsha = do readpairs n' reader sizemap' Nothing Nothing -> return sizemap parselog = maybe mempty (S.fromList . parseLoggedLocationsWithoutClusters) + + addemptyrepos newsizemap = do + knownuuids <- M.keys <$> uuidDescMap + return $ foldl' + (\m u -> M.insertWith (flip const) u (RepoSize 0) m) + newsizemap + knownuuids diff --git a/Command/MaxSize.hs b/Command/MaxSize.hs index 58684311b6..c6d2ff0aad 100644 --- a/Command/MaxSize.hs +++ b/Command/MaxSize.hs @@ -76,9 +76,16 @@ sizeOverview o = do descmap <- Remote.uuidDescriptions deadset <- S.fromList <$> trustGet DeadTrusted maxsizes <- getMaxSizes - reposizes <- flip M.withoutKeys deadset <$> getRepoSizes True + reposizes <- getRepoSizes True + -- Add repos too new and empty to have a reposize, + -- whose maxsize has been set. 
+ let reposizes' = foldl' + (\m u -> M.insertWith (flip const) u (RepoSize 0) m) + reposizes + (M.keys maxsizes) + let reposizes'' = flip M.withoutKeys deadset reposizes' let l = reverse $ sortOn snd $ M.toList $ - M.mapWithKey (gather maxsizes) reposizes + M.mapWithKey (gather maxsizes) reposizes'' v <- Remote.prettyPrintUUIDsWith' False (Just "size") "repositories" descmap showsizes l showRaw $ encodeBS $ tablerow (zip widths headers) @@ -89,11 +96,10 @@ sizeOverview o = do sizefield = "size" :: T.Text maxsizefield = "maxsize" :: T.Text - gather maxsizes u (RepoSize currsize) = Just $ - M.fromList - [ (sizefield, Just currsize) - , (maxsizefield, fromMaxSize <$> M.lookup u maxsizes) - ] + gather maxsizes u (RepoSize currsize) = Just $ M.fromList + [ (sizefield, Just currsize) + , (maxsizefield, fromMaxSize <$> M.lookup u maxsizes) + ] (widths, headers) = unzip [ (7, "size") diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index f8a7dbdd06..14520d0a03 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -35,21 +35,6 @@ Planned schedule of work: * `git-annex assist --rebalance` of `balanced=foo:2` sometimes needs several runs to stabalize. -* Bug: - - git init foo - cd foo - git-annex init - git commit --allow-empty -m foo - cd .. - git clone foo bar - cd foo - git remote add bar ../bar - git-annex maxsize bar - - Now `git-annex maxsize` will omit displaying bar. This is because the - DB got written without it. - * Concurrency issues with RepoSizes calculation and balanced content: * What if 2 concurrent threads are considering sending two different
Added a comment: Precise Workflow
diff --git a/doc/forum/git_annex_sync__58___only_git-annex/comment_6_48f1a48e20a3200c657c448398e4cf70._comment b/doc/forum/git_annex_sync__58___only_git-annex/comment_6_48f1a48e20a3200c657c448398e4cf70._comment new file mode 100644 index 0000000000..9cf87a7136 --- /dev/null +++ b/doc/forum/git_annex_sync__58___only_git-annex/comment_6_48f1a48e20a3200c657c448398e4cf70._comment @@ -0,0 +1,13 @@ +[[!comment format=mdwn + username="Spencer" + avatar="http://cdn.libravatar.org/avatar/2e0829f36a68480155e09d0883794a55" + subject="Precise Workflow" + date="2024-08-22T00:18:27Z" + content=""" +To be more precise on how to accomplish this - say for synchronizing special remotes for repos that are otherwise completely different - one might consider: + +1. `git push destination git-annex:synced/git-annex`. This is what git-annex does under the hood during the `push` step of the `sync`. +1. In `destination`, run `git annex merge`. This performs the merging of `synced/git-annex` into `git-annex`. + +I found this useful when I was trying to set up multiple repositories to use one central location (an rclone special remote) for file content sharing. Since the repos had a shared context (a project), but were otherwise disjoint from one another, `sync` was not an option. However, I felt odd running `git annex initremote` for each repo separately because then I could end up with myriad special remotes with the same configuration but different UUIDs for each. Ultimately this is not a problem - to have the same special remote have different UUIDs in different repositories - so long as the repos **never** come in contact. But I, novice as I was, had already muddled the git-annex branches of these repos together already, so for sake of cleanliness I went back and reimplemented these special remotes as the same UUID on every repo. 
This often involved adding repos as remotes to one another, fetching - which implicitly performs some merging - and then pushing (as above) any metadata changes to the repo, leaving content changes untouched. +"""]]
make --rebalance of balanced use fullysizebalanced when useful
When the specified number of copies is > 1, and some repositories are
too full, it can be better to move content from them to other less full
repositories, in order to make space for new content.
annex.fullybalancedthreshhold is documented, but not implemented yet
This is not tested very well yet, and is known to sometimes take several
runs to stabilize.
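The switch-over condition from the patch, paraphrased as a standalone sketch (the names and the modeling of repositories as (maxsize, reposize) pairs are invented here): with --rebalance, size balancing kicks in when fewer than the wanted number of copies can fit in repositories under the 90% fullness threshold.

```haskell
import qualified Data.Set as S

-- A repository is "too full" when at or past the threshold fraction
-- of its configured maxsize.
tooFull :: Double -> Integer -> Integer -> Bool
tooFull threshhold maxsize reposize =
  fromIntegral reposize >= fromIntegral maxsize * threshhold

-- Fall back to fully-size-balanced picking when n > 1 and fewer than
-- n candidates remain after discarding too-full repositories.
needsSizeBalance :: Int -> S.Set (Integer, Integer) -> Bool
needsSizeBalance n candidates =
  n > 1 && n > S.size candidates - S.size (S.filter full candidates)
  where full (maxsize, reposize) = tooFull 0.9 maxsize reposize

main :: IO ()
main = do
  print (needsSizeBalance 2 (S.fromList [(100, 95), (100, 99), (100, 10)]))
  print (needsSizeBalance 1 (S.fromList [(100, 95), (100, 99), (100, 10)]))
```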
diff --git a/Limit.hs b/Limit.hs index 9d403c5ff2..4ed4853f78 100644 --- a/Limit.hs +++ b/Limit.hs @@ -599,18 +599,31 @@ limitFullyBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex limitFullyBalanced = limitFullyBalanced' "fullybalanced" limitFullyBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex -limitFullyBalanced' = limitFullyBalanced'' filtercandidates - where - filtercandidates _ key candidates = do - maxsizes <- getMaxSizes - sizemap <- getRepoSizes False - currentlocs <- S.fromList <$> loggedLocations key - let keysize = fromMaybe 0 (fromKey keySize key) - let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of - (Just maxsize, Just reposize) -> - repoHasSpace keysize (u `S.member` currentlocs) reposize maxsize - _ -> True - return $ S.filter hasspace candidates +limitFullyBalanced' = limitFullyBalanced'' $ \n key candidates -> do + maxsizes <- getMaxSizes + sizemap <- getRepoSizes False + let threshhold = 0.9 :: Double + let toofull u = + case (M.lookup u maxsizes, M.lookup u sizemap) of + (Just (MaxSize maxsize), Just (RepoSize reposize)) -> + fromIntegral reposize >= fromIntegral maxsize * threshhold + _ -> False + needsizebalance <- ifM (Annex.getRead Annex.rebalance) + ( return $ n > 1 && + n > S.size candidates + - S.size (S.filter toofull candidates) + , return False + ) + if needsizebalance + then filterCandidatesFullySizeBalanced maxsizes sizemap n key candidates + else do + currentlocs <- S.fromList <$> loggedLocations key + let keysize = fromMaybe 0 (fromKey keySize key) + let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of + (Just maxsize, Just reposize) -> + repoHasSpace keysize (u `S.member` currentlocs) reposize maxsize + _ -> True + return $ S.filter hasspace candidates repoHasSpace :: Integer -> Bool -> RepoSize -> MaxSize -> Bool repoHasSpace keysize inrepo (RepoSize reposize) (MaxSize maxsize) @@ -673,23 +686,31 @@ limitFullySizeBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex 
limitFullySizeBalanced = limitFullySizeBalanced' "fullysizebalanced" limitFullySizeBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex -limitFullySizeBalanced' = limitFullyBalanced'' filtercandidates +limitFullySizeBalanced' = limitFullyBalanced'' $ \n key candidates -> do + maxsizes <- getMaxSizes + sizemap <- getRepoSizes False + filterCandidatesFullySizeBalanced maxsizes sizemap n key candidates + +filterCandidatesFullySizeBalanced + :: M.Map UUID MaxSize + -> M.Map UUID RepoSize + -> Int + -> Key + -> S.Set UUID + -> Annex (S.Set UUID) +filterCandidatesFullySizeBalanced maxsizes sizemap n key candidates = do + currentlocs <- S.fromList <$> loggedLocations key + let keysize = fromMaybe 0 (fromKey keySize key) + let go u = case (M.lookup u maxsizes, M.lookup u sizemap, u `S.member` currentlocs) of + (Just maxsize, Just reposize, inrepo) + | repoHasSpace keysize inrepo reposize maxsize -> + proportionfree keysize inrepo u reposize maxsize + | otherwise -> Nothing + _ -> Nothing + return $ S.fromList $ + map fst $ take n $ reverse $ sortOn snd $ + mapMaybe go $ S.toList candidates where - filtercandidates n key candidates = do - maxsizes <- getMaxSizes - sizemap <- getRepoSizes False - currentlocs <- S.fromList <$> loggedLocations key - let keysize = fromMaybe 0 (fromKey keySize key) - let go u = case (M.lookup u maxsizes, M.lookup u sizemap, u `S.member` currentlocs) of - (Just maxsize, Just reposize, inrepo) - | repoHasSpace keysize inrepo reposize maxsize -> - proportionfree keysize inrepo u reposize maxsize - | otherwise -> Nothing - _ -> Nothing - return $ S.fromList $ - map fst $ take n $ reverse $ sortOn snd $ - mapMaybe go $ S.toList candidates - proportionfree keysize inrepo u (RepoSize reposize) (MaxSize maxsize) | maxsize > 0 = Just ( u diff --git a/doc/git-annex-preferred-content.mdwn b/doc/git-annex-preferred-content.mdwn index f8b8a1db26..a9e1cb0b5f 100644 --- a/doc/git-annex-preferred-content.mdwn +++ 
b/doc/git-annex-preferred-content.mdwn @@ -318,6 +318,16 @@ elsewhere to allow removing it). When the `--rebalance` option is used, `balanced` is the same as `fullybalanced`. + When the specified number is greater than 1, and too many repositories + in the group are more than 90% full (as configured by + annex.fullybalancedthreshhold), this behaves like `fullysizebalanced`. + + For example, `fullybalanced=foo:3`, when group foo has 5 repositories, + two 50% full and three 99% full, will make some content move from the + full repositories to the others. Moving content like that is expensive, + but it allows new files to continue to be stored on the specified number + of repositories. + * `sizebalanced=groupname:number` Distributes content amoung repositories in the group, keeping diff --git a/doc/git-annex.mdwn b/doc/git-annex.mdwn index b46526e8dc..2615ff4d2c 100644 --- a/doc/git-annex.mdwn +++ b/doc/git-annex.mdwn @@ -928,6 +928,12 @@ repository, using [[git-annex-config]]. See its man page for a list.) The default reserve is 100 megabytes. +* `annex.fullybalancedthreshhold` + + Configures the percent full a repository must be in order for + the "fullybalanced" preferred content expression to consider it + to be full. The default is 90. + * `annex.skipunknown` Set to true to make commands like "git-annex get" silently skip over diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 2c14e07fef..f8a7dbdd06 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -30,6 +30,11 @@ Planned schedule of work: ## work notes +* Implement annex.fullybalancedthreshhold + +* `git-annex assist --rebalance` of `balanced=foo:2` + sometimes needs several runs to stabalize. + * Bug: git init foo
Support "sizebalanced=" and "fullysizebalanced=" too
Might want to make --rebalance turn balanced=group:N where N > 1
to fullysizebalanced=group:N. Have not yet determined if that will
improve situations enough to be worth the extra work.
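The `fullysizebalanced` expression ranks candidates by their proportion of free space and keeps the n with the most room. A simplified standalone sketch of that calculation (repositories modeled here as (name, reposize, maxsize) triples, and inrepo fixed to False for brevity; in the real code a key already present counts its own size as free):

```haskell
import Data.List (sortOn)
import Data.Maybe (mapMaybe)

-- Fraction of a repository's maxsize that is free; Nothing when no
-- meaningful maxsize is configured.
proportionFree :: Integer -> Bool -> (String, Integer, Integer) -> Maybe (String, Double)
proportionFree keysize inrepo (u, reposize, maxsize)
  | maxsize > 0 = Just (u, fromIntegral free / fromIntegral maxsize)
  | otherwise   = Nothing
  where free = maxsize - reposize + if inrepo then keysize else 0

-- Keep the n candidates with the largest free proportion.
pickN :: Int -> Integer -> [(String, Integer, Integer)] -> [String]
pickN n keysize =
  map fst . take n . reverse . sortOn snd . mapMaybe (proportionFree keysize False)

main :: IO ()
main = print (pickN 2 10 [("a", 90, 100), ("b", 20, 100), ("c", 50, 100)])
-- prints ["b","c"]
```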
diff --git a/Annex/FileMatcher.hs b/Annex/FileMatcher.hs index b7b5ca1553..c2490711ae 100644 --- a/Annex/FileMatcher.hs +++ b/Annex/FileMatcher.hs @@ -173,6 +173,8 @@ preferredContentTokens pcd = , ValueToken "onlyingroup" (usev $ limitOnlyInGroup $ getGroupMap pcd) , ValueToken "balanced" (usev $ limitBalanced (repoUUID pcd) (getGroupMap pcd)) , ValueToken "fullybalanced" (usev $ limitFullyBalanced (repoUUID pcd) (getGroupMap pcd)) + , ValueToken "sizebalanced" (usev $ limitSizeBalanced (repoUUID pcd) (getGroupMap pcd)) + , ValueToken "fullysizebalanced" (usev $ limitFullySizeBalanced (repoUUID pcd) (getGroupMap pcd)) ] ++ commonTokens LimitAnnexFiles where preferreddir = maybe "public" fromProposedAccepted $ diff --git a/CHANGELOG b/CHANGELOG index a56638aeb1..d5c63b3a45 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -19,6 +19,7 @@ git-annex (10.20240831) UNRELEASED; urgency=medium remotes. External special remotes should not use that config for their own purposes. * Support "balanced=" and "fullybalanced=" in preferred content expressions. + * Support "sizebalanced=" and "fullysizebalanced=" too. * Added --rebalance option. * maxsize: New command to tell git-annex how large the expected maximum size of a repository is, and to display repository sizes. 
diff --git a/Limit.hs b/Limit.hs index 3ea93f91f4..9d403c5ff2 100644 --- a/Limit.hs +++ b/Limit.hs @@ -558,7 +558,11 @@ limitOnlyInGroup getgroupmap groupname = Right $ MatchFiles limitBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex limitBalanced mu getgroupmap groupname = do - fullybalanced <- limitFullyBalanced mu getgroupmap groupname + fullybalanced <- limitFullyBalanced' "balanced" mu getgroupmap groupname + limitBalanced' "balanced" fullybalanced mu groupname + +limitBalanced' :: String -> MatchFiles Annex -> Maybe UUID -> MkLimit Annex +limitBalanced' termname fullybalanced mu groupname = do copies <- limitCopies $ if ':' `elem` groupname then groupname else groupname ++ ":1" @@ -588,38 +592,65 @@ limitBalanced mu getgroupmap groupname = do matchNeedsLocationLog present || matchNeedsLocationLog fullybalanced || matchNeedsLocationLog copies - , matchDesc = "balanced" =? groupname + , matchDesc = termname =? groupname } limitFullyBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex -limitFullyBalanced mu getgroupmap want = +limitFullyBalanced = limitFullyBalanced' "fullybalanced" + +limitFullyBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex +limitFullyBalanced' = limitFullyBalanced'' filtercandidates + where + filtercandidates _ key candidates = do + maxsizes <- getMaxSizes + sizemap <- getRepoSizes False + currentlocs <- S.fromList <$> loggedLocations key + let keysize = fromMaybe 0 (fromKey keySize key) + let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of + (Just maxsize, Just reposize) -> + repoHasSpace keysize (u `S.member` currentlocs) reposize maxsize + _ -> True + return $ S.filter hasspace candidates + +repoHasSpace :: Integer -> Bool -> RepoSize -> MaxSize -> Bool +repoHasSpace keysize inrepo (RepoSize reposize) (MaxSize maxsize) + | inrepo = + reposize <= maxsize + | otherwise = + reposize + keysize <= maxsize + +limitFullyBalanced'' + :: (Int -> Key -> S.Set UUID -> Annex (S.Set UUID)) + -> String + 
-> Maybe UUID + -> Annex GroupMap + -> MkLimit Annex +limitFullyBalanced'' filtercandidates termname mu getgroupmap want = case splitc ':' want of [g] -> go g 1 [g, n] -> maybe - (Left "bad number for fullybalanced") + (Left $ "bad number for " ++ termname) (go g) (readish n) - _ -> Left "bad value for fullybalanced" + _ -> Left $ "bad value for " ++ termname where - go s n = limitFullyBalanced' mu getgroupmap (toGroup s) n want + go s n = limitFullyBalanced''' filtercandidates termname mu + getgroupmap (toGroup s) n want -limitFullyBalanced' :: Maybe UUID -> Annex GroupMap -> Group -> Int -> MkLimit Annex -limitFullyBalanced' mu getgroupmap g n want = Right $ MatchFiles +limitFullyBalanced''' + :: (Int -> Key -> S.Set UUID -> Annex (S.Set UUID)) + -> String + -> Maybe UUID + -> Annex GroupMap + -> Group + -> Int + -> MkLimit Annex +limitFullyBalanced''' filtercandidates termname mu getgroupmap g n want = Right $ MatchFiles { matchAction = const $ checkKey $ \key -> do gm <- getgroupmap let groupmembers = fromMaybe S.empty $ M.lookup g (uuidsByGroup gm) - maxsizes <- getMaxSizes - sizemap <- getRepoSizes False - let keysize = fromMaybe 0 (fromKey keySize key) - currentlocs <- S.fromList <$> loggedLocations key - let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of - (Just (MaxSize maxsize), Just (RepoSize reposize)) -> - if u `S.member` currentlocs - then reposize <= maxsize - else reposize + keysize <= maxsize - _ -> True - let candidates = S.filter hasspace groupmembers + candidates <- filtercandidates n key groupmembers return $ if S.null candidates then False else case (mu, M.lookup g (balancedPickerByGroup gm)) of @@ -630,9 +661,46 @@ limitFullyBalanced' mu getgroupmap g n want = Right $ MatchFiles , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = False - , matchDesc = "fullybalanced" =? want + , matchDesc = termname =? 
want } +limitSizeBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex +limitSizeBalanced mu getgroupmap groupname = do + fullysizebalanced <- limitFullySizeBalanced' "sizebalanced" mu getgroupmap groupname + limitBalanced' "sizebalanced" fullysizebalanced mu groupname + +limitFullySizeBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex +limitFullySizeBalanced = limitFullySizeBalanced' "fullysizebalanced" + +limitFullySizeBalanced' :: String -> Maybe UUID -> Annex GroupMap -> MkLimit Annex +limitFullySizeBalanced' = limitFullyBalanced'' filtercandidates + where + filtercandidates n key candidates = do + maxsizes <- getMaxSizes + sizemap <- getRepoSizes False + currentlocs <- S.fromList <$> loggedLocations key + let keysize = fromMaybe 0 (fromKey keySize key) + let go u = case (M.lookup u maxsizes, M.lookup u sizemap, u `S.member` currentlocs) of + (Just maxsize, Just reposize, inrepo) + | repoHasSpace keysize inrepo reposize maxsize -> + proportionfree keysize inrepo u reposize maxsize + | otherwise -> Nothing + _ -> Nothing + return $ S.fromList $ + map fst $ take n $ reverse $ sortOn snd $ + mapMaybe go $ S.toList candidates + + proportionfree keysize inrepo u (RepoSize reposize) (MaxSize maxsize) + | maxsize > 0 = Just + ( u + , fromIntegral freespacesanskey / fromIntegral maxsize + :: Double + ) + | otherwise = Nothing + where + freespacesanskey = maxsize - reposize + + if inrepo then keysize else 0 + {- Adds a limit to skip files not using a specified key-value backend. -} addInBackend :: String -> Annex () addInBackend = addLimit . limitInBackend diff --git a/doc/git-annex-preferred-content.mdwn b/doc/git-annex-preferred-content.mdwn index 7ef98d14c8..f8b8a1db26 100644 --- a/doc/git-annex-preferred-content.mdwn +++ b/doc/git-annex-preferred-content.mdwn @@ -268,38 +268,40 @@ elsewhere to allow removing it). The number is the number of repositories in the group that will want each file. When not specified, the default is 1. 
+ + For example, "balanced=backup:2", when there are 3 members of the backup + group, will make each backup repository want 2/3rds of the files. For this to work, each repository in the group should have its preferred content set to the same expression. Using `groupwanted` is a good way to do that. - - For example, "balanced=backup:2", when there are 3 members of the backup - group, will make each backup repository want 2/3rds of the files. The sizes of files are not taken into account, so it's possible for one repository to get larger than usual files and so fill up before the other repositories. But files are only wanted by repositories that have enough free space to hold them. So once a repository is full, the remaining repositories will have any additional files balanced - amoung them. In order for this to work, you must use - [[git-annex-maxsize]](1) to specify the size of each repository in the (Diff truncated)
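The new `limitFullySizeBalanced'` code in the `Limit.hs` diff above filters candidate repositories to those with room for the key (the new `repoHasSpace`) and then keeps the `n` with the highest proportion of free space, crediting back the key's own size to repositories that already hold it. A rough Python transliteration of that selection logic — the repository data here is hypothetical, and the real implementation is the Haskell in the diff:

```python
def repo_has_space(key_size, in_repo, repo_size, max_size):
    # Mirrors repoHasSpace: a repo that already holds the key only needs
    # to be within max_size; otherwise the key must still fit.
    if in_repo:
        return repo_size <= max_size
    return repo_size + key_size <= max_size

def pick_size_balanced(n, key_size, current_locs, sizes, max_sizes):
    """Keep the n candidate repos with the highest proportion of free space,
    as in limitFullySizeBalanced's filtercandidates."""
    scored = []
    for u, repo_size in sizes.items():
        max_size = max_sizes.get(u)
        if max_size is None or max_size <= 0:
            continue  # repos without a known maxsize are skipped here
        in_repo = u in current_locs
        if not repo_has_space(key_size, in_repo, repo_size, max_size):
            continue
        # Free space is computed as if the key were not present, so a repo
        # is not ranked as fuller merely for containing the file.
        free_sans_key = max_size - repo_size + (key_size if in_repo else 0)
        scored.append((u, free_sans_key / max_size))
    scored.sort(key=lambda p: p[1], reverse=True)
    return {u for u, _ in scored[:n]}
```

Note how a repository that is exactly full but already contains the key still passes `repo_has_space`, matching the Haskell's `reposize <= maxsize` branch.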
bug
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index cfd3530e7f..2c14e07fef 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -30,6 +30,21 @@ Planned schedule of work: ## work notes +* Bug: + + git init foo + cd foo + git-annex init + git commit --allow-empty -m foo + cd .. + git clone foo bar + cd foo + git remote add bar ../bar + git-annex maxsize bar + + Now `git-annex maxsize` will omit displaying bar. This is because the + DB got written without it. + * Concurrency issues with RepoSizes calculation and balanced content: * What if 2 concurrent threads are considering sending two different
implement fullybalanced=group:N
Rebalancing this when it gets into a suboptimal situation will need
further work.
diff --git a/Annex/Balanced.hs b/Annex/Balanced.hs index ad917ef1e5..ab643287d6 100644 --- a/Annex/Balanced.hs +++ b/Annex/Balanced.hs @@ -17,16 +17,18 @@ import Data.Bits (shiftL) import qualified Data.Set as S import qualified Data.ByteArray as BA -type BalancedPicker = S.Set UUID -> Key -> UUID +-- The Int is how many UUIDs to pick. +type BalancedPicker = S.Set UUID -> Key -> Int -> [UUID] -- The set of UUIDs provided here are all the UUIDs that are ever -- expected to be picked amoung. A subset of that can be provided -- when later using the BalancedPicker. Neither set can be empty. balancedPicker :: S.Set UUID -> BalancedPicker -balancedPicker s = \s' key -> +balancedPicker s = \s' key num -> let n = calcMac tointeger HmacSha256 combineduuids (serializeKey' key) m = fromIntegral (S.size s') - in S.elemAt (fromIntegral (n `mod` m)) s' + in map (\i -> S.elemAt (fromIntegral ((n + i) `mod` m)) s') + [0..fromIntegral (num - 1)] where combineduuids = mconcat (map fromUUID (S.toAscList s)) @@ -36,7 +38,10 @@ balancedPicker s = \s' key -> {- The selection for a given key never changes. -} prop_balanced_stable :: Bool -prop_balanced_stable = balancedPicker us us k == toUUID "332" +prop_balanced_stable = and + [ balancedPicker us us k 1 == [toUUID "332"] + , balancedPicker us us k 3 == [toUUID "332", toUUID "333", toUUID "334"] + ] where us = S.fromList $ map (toUUID . 
show) [1..500 :: Int] k = fromJust $ deserializeKey "WORM--test" diff --git a/Limit.hs b/Limit.hs index d74befbc31..d1b222f456 100644 --- a/Limit.hs +++ b/Limit.hs @@ -592,7 +592,19 @@ limitBalanced mu getgroupmap groupname = do } limitFullyBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex -limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles +limitFullyBalanced mu getgroupmap want = + case splitc ':' want of + [g] -> go g 1 + [g, n] -> maybe + (Left "bad number for fullybalanced") + (go g) + (readish n) + _ -> Left "bad value for fullybalanced" + where + go s n = limitFullyBalanced' mu getgroupmap (toGroup s) n want + +limitFullyBalanced' :: Maybe UUID -> Annex GroupMap -> Group -> Int -> MkLimit Annex +limitFullyBalanced' mu getgroupmap g n want = Right $ MatchFiles { matchAction = const $ checkKey $ \key -> do gm <- getgroupmap let groupmembers = fromMaybe S.empty $ @@ -611,16 +623,14 @@ limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles return $ if S.null candidates then False else case (mu, M.lookup g (balancedPickerByGroup gm)) of - (Just u, Just picker) -> u == picker candidates key + (Just u, Just picker) -> u == picker candidates key n _ -> False , matchNeedsFileName = False , matchNeedsFileContent = False , matchNeedsKey = True , matchNeedsLocationLog = False - , matchDesc = "fullybalanced" =? groupname + , matchDesc = "fullybalanced" =? want } - where - g = toGroup groupname {- Adds a limit to skip files not using a specified key-value backend. -} addInBackend :: String -> Annex () diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 2ab97d85e6..cfd3530e7f 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -66,6 +66,10 @@ Planned schedule of work: command behave non-ideally, the same as the thread concurrency problems. 
+* implement size-based balancing, so all balanced repositories are around + the same percent full, perhaps as another preferred + content expression. + * `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg, when 2 out of 3 repositories are full, and the 3rd is mostly empty, it is no longer possible to add new files to 2 repositories. @@ -80,17 +84,11 @@ Planned schedule of work: Size based rebalancing may offer a solution; see design. -* "fullybalanced=foo:2" is not currently actually implemented! - * `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using overLocationLogs. In the other path it does not, and this should be fixed for consistency and correctness. -* implement size-based balancing, so all balanced repositories are around - the same percent full, either as the default or as another preferred - content expression. - ## completed items for August's work on balanced preferred content * Balanced preferred content basic implementation, including --rebalance
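The `BalancedPicker` change in the `Annex/Balanced.hs` diff above generalizes the picker to return `num` repositories, taking consecutive offsets from the HMAC value modulo the candidate count so the selection for a given key stays stable. A hedged Python sketch of the same idea — the real code builds the MAC key with `calcMac` over the ascending UUID list and the serialized key, which is simplified here:

```python
import hashlib
import hmac

def balanced_picker(all_uuids):
    """Return a stable picker over a fixed universe of repository UUIDs.

    The MAC key combines every UUID ever expected to be picked among, so
    the choice for a given annex key never changes while that universe
    is stable, even when picking from a subset of candidates.
    """
    mac_key = b"".join(sorted(all_uuids))

    def pick(candidates, annex_key, num):
        n = int.from_bytes(
            hmac.new(mac_key, annex_key, hashlib.sha256).digest(), "big")
        ordered = sorted(candidates)  # like S.elemAt over the sorted set
        m = len(ordered)
        # Consecutive offsets mod m yield num distinct repos (for num <= m),
        # matching the (n + i) `mod` m indexing in the diff.
        return [ordered[(n + i) % m] for i in range(num)]

    return pick
```

As in `prop_balanced_stable`, repeated picks for the same key return the same repositories.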
Added a comment: Help with .nfsXXXX files
diff --git a/doc/tips/git-annex_on_NFS/comment_3_1d018c2ec628db5a2df15fb24be162af._comment b/doc/tips/git-annex_on_NFS/comment_3_1d018c2ec628db5a2df15fb24be162af._comment new file mode 100644 index 0000000000..ac11ec28ee --- /dev/null +++ b/doc/tips/git-annex_on_NFS/comment_3_1d018c2ec628db5a2df15fb24be162af._comment @@ -0,0 +1,13 @@ +[[!comment format=mdwn + username="Matthew" + avatar="http://cdn.libravatar.org/avatar/495960d189a9cdc26f8e449bbf28aaf4" + subject="Help with .nfsXXXX files" + date="2024-08-19T21:20:58Z" + content=""" +I have many dozens of .nfs files that I cannot seem to remove. I have had IT reboot the machine I was using with git-annex, as well as the file server in hopes of killing the process that have the files open. The files stubbornly remain, and cannot be removed with 'rm -f .nfsXXXX' with resulting \"rm: cannot remove ‘.nfsXXXX’: Permission denied\", even after the reboots. + +Any thoughts are appreciated, as I have a few hundred gigabytes tied up in these files. + +My next step is to see about working with IT to put the file server in single-user mode, and getting root access to see if we can remove the files. But, I'm hoping maybe there are some other suggestion before taking such a drastic step. + +"""]]
Added a comment
diff --git a/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_4_ae6134d491a3a404ab7b4468111c47af._comment b/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_4_ae6134d491a3a404ab7b4468111c47af._comment new file mode 100644 index 0000000000..d52673f000 --- /dev/null +++ b/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_4_ae6134d491a3a404ab7b4468111c47af._comment @@ -0,0 +1,16 @@ +[[!comment format=mdwn + username="matrss" + avatar="http://cdn.libravatar.org/avatar/59541f50d845e5f81aff06e88a38b9de" + subject="comment 4" + date="2024-08-19T10:25:13Z" + content=""" +When `annex.dotfiles` is set to true, git-annex should treat dotfiles just like other files, so it should apply .gitattributes to them. With that in mind, I'd expect something like this to work: + +``` +.* annex.largefiles=nothing +.caltrash annex.largefiles=anything +.calnotes annex.largefiles=anything +``` + +I haven't tested it though. +"""]]
size based rebalancing design
diff --git a/doc/design/balanced_preferred_content.mdwn b/doc/design/balanced_preferred_content.mdwn index 163c28b7d9..2898a2eebc 100644 --- a/doc/design/balanced_preferred_content.mdwn +++ b/doc/design/balanced_preferred_content.mdwn @@ -58,6 +58,17 @@ If the maximum size of some but not others is known, what then? Balancing this way would fall back to the method above when several repos are equally good candidates to hold a key. +The problem with size balancing is that in a split brain situation, +the known sizes are not accurate, and so one repository will end up more +full than others. Consider, for example, a group of 2 repositories of the +same size, where one repository is 50% full and the other is 75%. Sending +files to that group will put them all in the 50% repository until it gets +to 75%. But if another clone is doing the same thing and sending different +files, the 50% full repository will end up 100% full. + +Rebalancing could fix that, but it seems better generally to use `N mod M` +balancing amoung the repositories known/believed to have enough free space. + ## stability Note that this preferred content expression will not be stable. A change in @@ -90,10 +101,11 @@ key. However, once 3 of those 5 repos get full, new keys will only be able to be stored on 2 of them. At that point one or more new repos will need to be -added to reach the goal of each key being stored in 3 of them. It would be -possible to rebalance the 3 full repos by moving some keys from them to the -other 2 repos, and eke out more storage before needing to add new -repositories. A separate rebalancing pass, that does not use preferred +added to reach the goal of each key being stored in 3 of them. + +It would be possible to rebalance the 3 full repos by moving some keys from +them to the other 2 repos, and eke out more storage before needing to add +new repositories. 
A separate rebalancing pass, that does not use preferred content alone, could be implemented to handle this (see below). ## use case: geographically distinct datacenters @@ -183,4 +195,20 @@ users who want it, then `balanced=group:N == (fullybalanced=group:N and not copies=group:N) or present` usually and when --rebalance is used, `balanced=group:N == fullybalanced=group:N)` - +In the balanced=group:3 example above, some content needs to be moved from +the 3 full repos to the 2 less full repos. To handle this, +fullybalanced=group:N needs to look at how full the repositories in +the group are. What could be done is make it use size based balancing +when rebalancing `group:N (>1) + +While size based balancing generally has problems as described above with +split brain, rebalancing is probably run in a single repository, so split +brain won't be an issue. + +Note that size based rebalancing will need to take into account the size +if the content is moved from one of the repositories that contains it to +the candidate repository. For example, if one repository is 75% full and +the other is 60% full, and the annex object in the 75% full repo is 20% +of the size of the repositories, then it doesn't make sense to make the +repo that currently contains it not want it any more, because the other +repo would end up more full. diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 6843ec7d38..2ab97d85e6 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -78,8 +78,9 @@ Planned schedule of work: not occur. Users wanting 2 copies can have 2 groups which are each balanced, although that would mean more repositories on more drives. - Also note that "fullybalanced=foo:2" is not currently actually - implemented! + Size based rebalancing may offer a solution; see design. + +* "fullybalanced=foo:2" is not currently actually implemented! 
* `git-annex info` in the limitedcalc path in cachedAllRepoData double-counts redundant information from the journal due to using
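The design note above about size-based rebalancing — that moving an object out of a 75% full repository into a 60% full one makes no sense when the object is 20% of the repository size — amounts to comparing how full each repository would be after the move. A small illustrative check, not git-annex's actual decision procedure:

```python
def move_makes_sense(src_used, dst_used, obj_size, capacity):
    """Would moving an object reduce the fullness of the fuller repo
    without making the destination even fuller than the source was?"""
    src_after = (src_used - obj_size) / capacity
    dst_after = (dst_used + obj_size) / capacity
    return max(src_after, dst_after) < max(src_used / capacity,
                                           dst_used / capacity)

# The design doc's example: 75% vs 60% full, object 20% of capacity.
# Moving leaves the repos at 55% and 80% -- worse than before, so the
# repo that currently contains the object should keep wanting it.
```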
maxsize overview display and --json support
diff --git a/CHANGELOG b/CHANGELOG index bd9330e2f5..a56638aeb1 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -21,7 +21,7 @@ git-annex (10.20240831) UNRELEASED; urgency=medium * Support "balanced=" and "fullybalanced=" in preferred content expressions. * Added --rebalance option. * maxsize: New command to tell git-annex how large the expected maximum - size of a repository is. + size of a repository is, and to display repository sizes. * vicfg: Include maxsize configuration. * info: Improved speed. diff --git a/Command/Info.hs b/Command/Info.hs index 0e75c84245..bb3471ad06 100644 --- a/Command/Info.hs +++ b/Command/Info.hs @@ -573,7 +573,8 @@ reposizes_stats count desc m = stat desc $ nojson $ do let maxlen = maximum (map (length . snd) l) descm <- lift Remote.uuidDescriptions -- This also handles json display. - s <- lift $ Remote.prettyPrintUUIDsWith (Just "size") desc descm (Just . show) $ + s <- lift $ Remote.prettyPrintUUIDsWith (Just "size") desc descm + (\sz -> Just $ show sz ++ ": ") $ map (\(u, sz) -> (u, Just $ mkdisp sz maxlen)) l return $ if count then countRepoList (length l) s diff --git a/Command/MaxSize.hs b/Command/MaxSize.hs index 19788ad3eb..3efaed5880 100644 --- a/Command/MaxSize.hs +++ b/Command/MaxSize.hs @@ -5,21 +5,28 @@ - Licensed under the GNU AGPL version 3 or higher. 
-} +{-# LANGUAGE OverloadedStrings #-} + module Command.MaxSize where import Command import qualified Remote +import Annex.RepoSize +import Types.RepoSize import Logs.MaxSize -import Utility.SafeOutput +import Logs.Trust import Utility.DataUnits import qualified Data.Map as M +import qualified Data.Set as S +import qualified Data.Text as T cmd :: Command -cmd = noMessages $ command "maxsize" SectionSetup - "configure maximum size of repositoriy" - (paramPair paramRepository (paramOptional paramSize)) - (seek <$$> optParser) +cmd = noMessages $ withAnnexOptions [jsonOptions] $ + command "maxsize" SectionSetup + "configure maximum size of repositoriy" + (paramPair paramRepository (paramOptional paramSize)) + (seek <$$> optParser) data MaxSizeOptions = MaxSizeOptions { cmdparams :: CmdParams @@ -37,16 +44,17 @@ optParser desc = MaxSizeOptions seek :: MaxSizeOptions -> CommandSeek seek o = case cmdparams o of (rname:[]) -> commandAction $ do - u <- Remote.nameToUUID rname - startingCustomOutput (ActionItemOther Nothing) $ do + enableNormalOutput + showCustom "maxsize" (SeekInput [rname]) $ do + u <- Remote.nameToUUID rname v <- M.lookup u <$> getMaxSizes - liftIO $ putStrLn $ safeOutput $ case v of + maybeAddJSONField "maxsize" (fromMaxSize <$> v) + showRaw $ encodeBS $ case v of Just (MaxSize n) -> - if bytesOption o - then show n - else preciseSize storageUnits False n + formatSize o (preciseSize storageUnits True) n Nothing -> "" - next $ return True + return True + stop (rname:sz:[]) -> commandAction $ do u <- Remote.nameToUUID rname let si = SeekInput (cmdparams o) @@ -57,4 +65,51 @@ seek o = case cmdparams o of Just n -> do recordMaxSize u (MaxSize n) next $ return True - _ -> giveup "Specify a repository." 
+ [] -> commandAction $ sizeOverview o + _ -> giveup "Too many parameters" + +sizeOverview :: MaxSizeOptions -> CommandStart +sizeOverview o = do + enableNormalOutput + showCustom "maxsize" (SeekInput []) $ do + descmap <- Remote.uuidDescriptions + deadset <- S.fromList <$> trustGet DeadTrusted + maxsizes <- getMaxSizes + reposizes <- flip M.withoutKeys deadset <$> getRepoSizes True + let l = reverse $ sortOn snd $ M.toList $ + M.mapWithKey (gather maxsizes) reposizes + v <- Remote.prettyPrintUUIDsWith' False (Just "size") + "repositories" descmap showsizes l + showRaw $ encodeBS $ tableheader + showRaw $ encodeBS $ dropWhileEnd (== '\n') v + return True + stop + where + sizefield = "size" :: T.Text + maxsizefield = "maxsize" :: T.Text + + gather maxsizes u (RepoSize currsize) = Just $ + M.fromList + [ (sizefield, Just currsize) + , (maxsizefield, fromMaxSize <$> M.lookup u maxsizes) + ] + + tableheader = tablerow ["size", "maxsize", "repository"] + + showsizes m = do + size <- M.lookup sizefield m + maxsize <- M.lookup maxsizefield m + return $ tablerow [formatsize size, formatsize maxsize, ""] + + formatsize = maybe "" (formatSize o (roughSize' storageUnits True 0)) + + padcolumn s = replicate (7 - length s) ' ' ++ s + + tablerow [] = "" + tablerow (s:[]) = " " ++ s + tablerow (s:l) = padcolumn s ++ " " ++ tablerow l + +formatSize :: MaxSizeOptions -> (ByteSize -> String) -> ByteSize -> String +formatSize o f n + | bytesOption o = show n + | otherwise = f n diff --git a/Messages.hs b/Messages.hs index c6ba6ed40a..b989d1dd8b 100644 --- a/Messages.hs +++ b/Messages.hs @@ -55,6 +55,7 @@ module Messages ( mkPrompter, sanitizeTopLevelExceptionMessages, countdownToMessage, + enableNormalOutput, ) where import Control.Concurrent @@ -87,9 +88,7 @@ showStartMessage (StartMessage command ai si) = where json = JSON.startActionItem command ai si showStartMessage (StartUsualMessages command ai si) = do - outputType <$> Annex.getState Annex.output >>= \case - QuietOutput -> 
Annex.setOutput NormalOutput - _ -> noop + enableNormalOutput showStartMessage (StartMessage command ai si) showStartMessage (StartNoMessage _) = noop showStartMessage (CustomOutput _) = @@ -379,3 +378,9 @@ countdownToMessage n showmsg | otherwise = do let !n' = pred n return n' + +enableNormalOutput :: Annex () +enableNormalOutput = + outputType <$> Annex.getState Annex.output >>= \case + QuietOutput -> Annex.setOutput NormalOutput + _ -> noop diff --git a/Remote.hs b/Remote.hs index eea052e254..326fd59fca 100644 --- a/Remote.hs +++ b/Remote.hs @@ -40,6 +40,7 @@ module Remote ( prettyPrintUUIDs, prettyPrintUUIDsDescs, prettyPrintUUIDsWith, + prettyPrintUUIDsWith', prettyListUUIDs, prettyUUID, remoteFromUUID, @@ -229,11 +230,25 @@ prettyPrintUUIDsWith -> (v -> Maybe String) -> [(UUID, Maybe v)] -> Annex String -prettyPrintUUIDsWith optfield header descm showval uuidvals = do +prettyPrintUUIDsWith = prettyPrintUUIDsWith' True + +prettyPrintUUIDsWith' + :: ToJSON' v + => Bool + -> Maybe String + -> String (Diff truncated)
Added a comment
diff --git a/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_3_4c7b9602176c2dfc75a320cbed9adec5._comment b/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_3_4c7b9602176c2dfc75a320cbed9adec5._comment new file mode 100644 index 0000000000..edf8bec5d8 --- /dev/null +++ b/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_3_4c7b9602176c2dfc75a320cbed9adec5._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="xentac" + avatar="http://cdn.libravatar.org/avatar/773b6c7b0dc34f10b66aa46d2730a5b3" + subject="comment 3" + date="2024-08-18T03:17:12Z" + content=""" +Thanks! That does seem to have been it. + +Do you know if there's a way to only treat some dotfiles as annexed? Like I want .caltrash and .calnotes to be annexed, but maybe not if I create .gitattributes. +"""]]
consistently don't show sizes of empty repositories
This used to be the case, and when matching options are used, that code
path still omits them, so also omit them in the getRepoSize code path.
diff --git a/Command/Info.hs b/Command/Info.hs index 3f120c1920..0e75c84245 100644 --- a/Command/Info.hs +++ b/Command/Info.hs @@ -654,7 +654,7 @@ cachedAllRepoData = do usereposizes s = do sizemap <- lift $ getRepoSizes True deadset <- lift $ S.fromList <$> trustGet DeadTrusted - let sizemap' = M.withoutKeys sizemap deadset + let sizemap' = M.filter (> 0) $ M.withoutKeys sizemap deadset lift $ unlessM (null <$> getUnmergedRefs) warnunmerged return $ s diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 3c7b8fa8d5..21fdc8fd99 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -81,6 +81,11 @@ Planned schedule of work: Also note that "fullybalanced=foo:2" is not currently actually implemented! +* `git-annex info` in the limitedcalc path in cachedAllRepoData + double-counts redundant information from the journal due to using + overLocationLogs. In the other path it does not, and this should be fixed + for consistency and correctness. + * `git-annex info` can use maxsize to display how full repositories are * implement size-based balancing, so all balanced repositories are around
git-annex info speed up using getRepoSizes
diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index 14b6da4a92..bdd446460c 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -31,21 +31,21 @@ import qualified Data.Map.Strict as M import qualified Data.Set as S {- Gets the repo size map. Cached for speed. -} -getRepoSizes :: Annex (M.Map UUID RepoSize) -getRepoSizes = do +getRepoSizes :: Bool -> Annex (M.Map UUID RepoSize) +getRepoSizes quiet = do rsv <- Annex.getRead Annex.reposizes liftIO (takeMVar rsv) >>= \case Just sizemap -> do liftIO $ putMVar rsv (Just sizemap) return sizemap - Nothing -> calcRepoSizes rsv + Nothing -> calcRepoSizes quiet rsv {- Fills an empty Annex.reposizes MVar with current information - from the git-annex branch, supplimented with journalled but - not yet committed information. -} -calcRepoSizes :: MVar (Maybe (M.Map UUID RepoSize)) -> Annex (M.Map UUID RepoSize) -calcRepoSizes rsv = bracket setup cleanup $ \h -> go h `onException` failed +calcRepoSizes :: Bool -> MVar (Maybe (M.Map UUID RepoSize)) -> Annex (M.Map UUID RepoSize) +calcRepoSizes quiet rsv = bracket setup cleanup $ \h -> go h `onException` failed where go h = do (oldsizemap, moldbranchsha) <- liftIO $ Db.getRepoSizes h @@ -60,13 +60,14 @@ calcRepoSizes rsv = bracket setup cleanup $ \h -> go h `onException` failed return sizemap calculatefromscratch h = do - showSideAction "calculating repository sizes" + unless quiet $ + showSideAction "calculating repository sizes" (sizemap, branchsha) <- calcBranchRepoSizes liftIO $ Db.setRepoSizes h sizemap branchsha calcJournalledRepoSizes sizemap branchsha incrementalupdate h oldsizemap oldbranchsha currbranchsha = do - (sizemap, branchsha) <- diffBranchRepoSizes oldsizemap oldbranchsha currbranchsha + (sizemap, branchsha) <- diffBranchRepoSizes quiet oldsizemap oldbranchsha currbranchsha liftIO $ Db.setRepoSizes h sizemap branchsha calcJournalledRepoSizes sizemap branchsha @@ -113,8 +114,8 @@ calcJournalledRepoSizes startmap branchsha = Nothing {- Incremental update by 
diffing. -} -diffBranchRepoSizes :: M.Map UUID RepoSize -> Sha -> Sha -> Annex (M.Map UUID RepoSize, Sha) -diffBranchRepoSizes oldsizemap oldbranchsha newbranchsha = do +diffBranchRepoSizes :: Bool -> M.Map UUID RepoSize -> Sha -> Sha -> Annex (M.Map UUID RepoSize, Sha) +diffBranchRepoSizes quiet oldsizemap oldbranchsha newbranchsha = do g <- Annex.gitRepo catObjectStream g $ \feeder closer reader -> do (l, cleanup) <- inRepo $ @@ -148,8 +149,10 @@ diffBranchRepoSizes oldsizemap oldbranchsha newbranchsha = do removedlocs = S.difference prevlog currlog !sizemap' = accumRepoSizes k (newlocs, removedlocs) sizemap in do - n' <- countdownToMessage n $ - showSideAction "calculating repository sizes" + n' <- if quiet + then pure n + else countdownToMessage n $ + showSideAction "calculating repository sizes" readpairs n' reader sizemap' Nothing Nothing -> return sizemap parselog = maybe mempty (S.fromList . parseLoggedLocationsWithoutClusters) diff --git a/CHANGELOG b/CHANGELOG index 542fd3d864..bd9330e2f5 100644 --- a/CHANGELOG +++ b/CHANGELOG @@ -23,6 +23,7 @@ git-annex (10.20240831) UNRELEASED; urgency=medium * maxsize: New command to tell git-annex how large the expected maximum size of a repository is. * vicfg: Include maxsize configuration. + * info: Improved speed. -- Joey Hess <id@joeyh.name> Wed, 31 Jul 2024 15:52:03 -0400 diff --git a/Command/Info.hs b/Command/Info.hs index 1161bec939..3f120c1920 100644 --- a/Command/Info.hs +++ b/Command/Info.hs @@ -1,16 +1,18 @@ {- git-annex command - - - Copyright 2011-2023 Joey Hess <id@joeyh.name> + - Copyright 2011-2024 Joey Hess <id@joeyh.name> - - Licensed under the GNU AGPL version 3 or higher. 
-} -{-# LANGUAGE BangPatterns, DeriveDataTypeable, PackageImports, OverloadedStrings #-} +{-# LANGUAGE BangPatterns, DeriveDataTypeable, PackageImports #-} +{-# LANGUAGE OverloadedStrings, FlexibleContexts #-} module Command.Info where import "mtl" Control.Monad.State.Strict import qualified Data.Map.Strict as M +import qualified Data.Set as S import qualified Data.Vector as V import qualified System.FilePath.ByteString as P import System.PosixCompat.Files (isDirectory) @@ -33,7 +35,7 @@ import Annex.WorkTree import Logs.UUID import Logs.Trust import Logs.Location -import Annex.Branch (UnmergedBranches(..)) +import Annex.Branch (UnmergedBranches(..), getUnmergedRefs) import Annex.NumCopies import Git.Config (boolConfig) import qualified Git.LsTree as LsTree @@ -48,6 +50,8 @@ import Types.Availability import qualified Limit import Messages.JSON (DualDisp(..), ObjectMap(..)) import Annex.BloomFilter +import Annex.RepoSize +import Types.RepoSize import qualified Command.Unused import qualified Utility.RawFilePath as R @@ -640,28 +644,51 @@ cachedAllRepoData = do case allRepoData s of Just _ -> return s Nothing -> do - matcher <- lift getKeyOnlyMatcher - r <- lift $ overLocationLogs False False (emptyKeyInfo, mempty) $ \k locs (d, rd) -> do - ifM (matchOnKey matcher k) - ( do - alivelocs <- snd - <$> trustPartition DeadTrusted locs - let !d' = addKeyCopies (genericLength alivelocs) k d - let !rd' = foldl' (flip (accumrepodata k)) rd alivelocs - return (d', rd') - , return (d, rd) - ) - case r of - NoUnmergedBranches (!(d, rd), _) -> do - let s' = s { allRepoData = Just d, repoData = rd } - put s' - return s' - UnmergedBranches _ -> do - lift $ warning "This repository is read-only, and there are unmerged git-annex branches. Information from those branches is not included here." 
- return s + s' <- ifM (lift Limit.limited) + ( limitedcalc s + , usereposizes s + ) + put s' + return s' where + usereposizes s = do + sizemap <- lift $ getRepoSizes True + deadset <- lift $ S.fromList <$> trustGet DeadTrusted + let sizemap' = M.withoutKeys sizemap deadset + lift $ unlessM (null <$> getUnmergedRefs) + warnunmerged + return $ s + { allRepoData = Just $ + convsize (sum (M.elems sizemap')) + , repoData = M.map convsize sizemap' + } + + limitedcalc s = do + matcher <- lift getKeyOnlyMatcher + r <- lift $ overLocationLogs False False (emptyKeyInfo, mempty) $ \k locs (d, rd) -> do + ifM (matchOnKey matcher k) + ( do + alivelocs <- snd + <$> trustPartition DeadTrusted locs + let !d' = addKeyCopies (genericLength alivelocs) k d + let !rd' = foldl' (flip (accumrepodata k)) rd alivelocs + return (d', rd') + , return (d, rd) + ) + (!(d, rd), _) <- case r of + NoUnmergedBranches v -> + return v + UnmergedBranches v -> do + lift warnunmerged + return v + return $ s { allRepoData = Just d, repoData = rd } + accumrepodata k = M.alter (Just . addKey k . fromMaybe emptyKeyInfo) + convsize (RepoSize sz) = emptyKeyInfo { sizeKeys = sz } + + warnunmerged = warning "There are unmerged git-annex branches. Information from those branches is not included here." + cachedNumCopiesStats :: StatState (Maybe NumCopiesStats) cachedNumCopiesStats = numCopiesStats <$> get diff --git a/Limit.hs b/Limit.hs index 13ba824fd8..d74befbc31 100644 (Diff truncated)
update RepoSize database from git-annex branch incrementally
The use of catObjectStream is optimally fast, although it might be
possible to combine this with git-annex branch merge to avoid some
redundant work.
In benchmarking, a git-annex branch that had 100000 files changed
took less than 1.88 seconds to run through this.
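The size-accumulation step driving this incremental update can be sketched with simplified stand-in types (UUID as a String, a key reduced to its byte size — all names here are illustrative, not git-annex's real types): for each changed location log, diff the old and new location sets, then adjust per-repo totals.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type UUID = String
newtype RepoSize = RepoSize Integer deriving (Eq, Show)

-- Diff the previous and current sets of locations for one key, then
-- add the key's size to repos that gained it and subtract it from
-- repos that lost it (the shape of accumRepoSizes in the patch below).
applyLogChange :: Integer -> S.Set UUID -> S.Set UUID -> M.Map UUID RepoSize -> M.Map UUID RepoSize
applyLogChange ksz prevlocs currlocs sizemap =
    let newlocs = S.difference currlocs prevlocs
        removedlocs = S.difference prevlocs currlocs
        sizemap' = foldl' (flip (M.alter (addKey ksz))) sizemap (S.toList newlocs)
    in foldl' (flip (M.alter (removeKey ksz))) sizemap' (S.toList removedlocs)

addKey, removeKey :: Integer -> Maybe RepoSize -> Maybe RepoSize
addKey ksz (Just (RepoSize sz)) = Just (RepoSize (sz + ksz))
addKey ksz Nothing = Just (RepoSize ksz)
removeKey ksz (Just (RepoSize sz)) = Just (RepoSize (sz - ksz))
removeKey _ Nothing = Nothing
```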
diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index a75e0440ba..14b6da4a92 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -17,12 +17,18 @@ import qualified Annex import Annex.Branch (UnmergedBranches(..), getBranch) import Types.RepoSize import qualified Database.RepoSize as Db +import Logs import Logs.Location import Logs.UUID import Git.Types (Sha) +import Git.FilePath +import Git.CatFile +import qualified Git.DiffTree as DiffTree import Control.Concurrent +import Control.Concurrent.Async import qualified Data.Map.Strict as M +import qualified Data.Set as S {- Gets the repo size map. Cached for speed. -} getRepoSizes :: Annex (M.Map UUID RepoSize) @@ -49,10 +55,7 @@ calcRepoSizes rsv = bracket setup cleanup $ \h -> go h `onException` failed currbranchsha <- getBranch if oldbranchsha == currbranchsha then calcJournalledRepoSizes oldsizemap oldbranchsha - else do - -- XXX todo incremental update by diffing - -- from old to new branch. - calculatefromscratch h + else incrementalupdate h oldsizemap oldbranchsha currbranchsha liftIO $ putMVar rsv (Just sizemap) return sizemap @@ -62,6 +65,11 @@ calcRepoSizes rsv = bracket setup cleanup $ \h -> go h `onException` failed liftIO $ Db.setRepoSizes h sizemap branchsha calcJournalledRepoSizes sizemap branchsha + incrementalupdate h oldsizemap oldbranchsha currbranchsha = do + (sizemap, branchsha) <- diffBranchRepoSizes oldsizemap oldbranchsha currbranchsha + liftIO $ Db.setRepoSizes h sizemap branchsha + calcJournalledRepoSizes sizemap branchsha + setup = Db.openDb cleanup = Db.closeDb @@ -75,7 +83,7 @@ calcRepoSizes rsv = bracket setup cleanup $ \h -> go h `onException` failed - branch commit that was used. - - The map includes the UUIDs of all known repositories, including - - repositories that are empty. + - repositories that are empty. But clusters are not included. - - Note that private repositories, which do not get recorded in - the git-annex branch, will have 0 size. 
journalledRepoSizes @@ -100,8 +108,48 @@ calcJournalledRepoSizes -> Sha -> Annex (M.Map UUID RepoSize) calcJournalledRepoSizes startmap branchsha = - overLocationLogsJournal startmap branchsha accumsizes Nothing + overLocationLogsJournal startmap branchsha + (\k v m -> pure (accumRepoSizes k v m)) + Nothing + +{- Incremental update by diffing. -} +diffBranchRepoSizes :: M.Map UUID RepoSize -> Sha -> Sha -> Annex (M.Map UUID RepoSize, Sha) +diffBranchRepoSizes oldsizemap oldbranchsha newbranchsha = do + g <- Annex.gitRepo + catObjectStream g $ \feeder closer reader -> do + (l, cleanup) <- inRepo $ + DiffTree.diffTreeRecursive oldbranchsha newbranchsha + feedtid <- liftIO $ async $ do + forM_ l $ feedpairs feeder + closer + newsizemap <- readpairs 500000 reader oldsizemap Nothing + liftIO $ wait feedtid + ifM (liftIO cleanup) + ( return (newsizemap, newbranchsha) + , return (oldsizemap, oldbranchsha) + ) where - accumsizes k (newlocs, removedlocs) m = return $ - let m' = foldl' (flip $ M.alter $ addKeyRepoSize k) m newlocs - in foldl' (flip $ M.alter $ removeKeyRepoSize k) m' removedlocs + feedpairs feeder ti = + let f = getTopFilePath (DiffTree.file ti) + in case extLogFileKey locationLogExt f of + Nothing -> noop + Just k -> do + feeder (k, DiffTree.srcsha ti) + feeder (k, DiffTree.dstsha ti) + + readpairs n reader sizemap Nothing = liftIO reader >>= \case + Just (_k, oldcontent) -> readpairs n reader sizemap (Just oldcontent) + Nothing -> return sizemap + readpairs n reader sizemap (Just oldcontent) = liftIO reader >>= \case + Just (k, newcontent) -> + let prevlog = parselog oldcontent + currlog = parselog newcontent + newlocs = S.difference currlog prevlog + removedlocs = S.difference prevlog currlog + !sizemap' = accumRepoSizes k (newlocs, removedlocs) sizemap + in do + n' <- countdownToMessage n $ + showSideAction "calculating repository sizes" + readpairs n' reader sizemap' Nothing + Nothing -> return sizemap + parselog = maybe mempty (S.fromList . 
parseLoggedLocationsWithoutClusters) diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index ebd656b4bf..3b8cb40d4f 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -16,6 +16,7 @@ import Logs.Presence.Pure import Control.Concurrent import qualified Data.Map.Strict as M +import qualified Data.Set as S updateRepoSize :: UUID -> Key -> LogStatus -> Annex () updateRepoSize u k s = do @@ -46,3 +47,8 @@ removeKeyRepoSize k mrs = case mrs of Nothing -> Nothing where ksz = fromMaybe 0 $ fromKey keySize k + +accumRepoSizes :: Key -> (S.Set UUID, S.Set UUID) -> M.Map UUID RepoSize -> M.Map UUID RepoSize +accumRepoSizes k (newlocs, removedlocs) sizemap = + let !sizemap' = foldl' (flip $ M.alter $ addKeyRepoSize k) sizemap newlocs + in foldl' (flip $ M.alter $ removeKeyRepoSize k) sizemap' removedlocs diff --git a/Database/Keys.hs b/Database/Keys.hs index 0af2a4446c..9704b6ff4c 100644 --- a/Database/Keys.hs +++ b/Database/Keys.hs @@ -476,19 +476,11 @@ reconcileStaged dbisnew qh = ifM isBareRepo dbwriter dbchanged n catreader = liftIO catreader >>= \case Just (ka, content) -> do changed <- ka (parseLinkTargetOrPointerLazy =<< content) - !n' <- countdownToMessage n + n' <- countdownToMessage n $ + showSideAction "scanning for annexed files" dbwriter (dbchanged || changed) n' catreader Nothing -> return dbchanged - -- When the diff is large, the scan can take a while, - -- so let the user know what's going on. - countdownToMessage n - | n < 1 = return 0 - | n == 1 = do - showSideAction "scanning for annexed files" - return 0 - | otherwise = return (pred n) - -- How large is large? Too large and there will be a long -- delay before the message is shown; too short and the message -- will clutter things up unnecessarily. 
It's uncommon for 1000 diff --git a/Logs.hs b/Logs.hs index 91d4566bdd..52968ca575 100644 --- a/Logs.hs +++ b/Logs.hs @@ -179,7 +179,10 @@ migrationTreeGraftPoint = "migrate.tree" {- The pathname of the location log file for a given key. -} locationLogFile :: GitConfig -> Key -> RawFilePath locationLogFile config key = - branchHashDir config key P.</> keyFile key <> ".log" + branchHashDir config key P.</> keyFile key <> locationLogExt + +locationLogExt :: S.ByteString +locationLogExt = ".log" {- The filename of the url log for a given key. -} urlLogFile :: GitConfig -> Key -> RawFilePath diff --git a/Logs/Location.hs b/Logs/Location.hs index dad2ddc808..3948c71a33 100644 --- a/Logs/Location.hs +++ b/Logs/Location.hs @@ -35,6 +35,8 @@ module Logs.Location ( overLocationLogs, overLocationLogs', overLocationLogsJournal, + parseLoggedLocations, + parseLoggedLocationsWithoutClusters, ) where import Annex.Common @@ -110,7 +112,10 @@ loggedLocationsHistorical = getLoggedLocations . historicalLogInfo loggedLocationsRef :: Ref -> Annex [UUID] loggedLocationsRef ref = map (toUUID . fromLogInfo) . getLog <$> catObject ref -{- Parses the content of a log file and gets the locations in it. -} +{- Parses the content of a log file and gets the locations in it. + - + - Adds the UUIDs of any clusters whose nodes are in the list. + -} parseLoggedLocations :: Clusters -> L.ByteString -> [UUID] parseLoggedLocations clusters = addClusterUUIDs clusters . parseLoggedLocationsWithoutClusters @@ -127,7 +132,6 @@ getLoggedLocations getter key = do clusters <- getClusters return $ addClusterUUIDs clusters locs (Diff truncated)
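The countdown helper factored out in the Database/Keys.hs hunk above can be sketched on its own (simplified to plain IO; the real code runs in the Annex monad):

```haskell
-- Show a side-action message only after a large number of items have
-- been processed, so short runs stay quiet but long scans explain the
-- delay. Returns the new countdown value to thread into the next call.
countdownToMessage :: Int -> IO () -> IO Int
countdownToMessage n showmsg
    | n < 1 = return 0
    | n == 1 = do
        showmsg
        return 0
    | otherwise = return (pred n)
```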
Added a comment: Remote Helper?
diff --git a/doc/install/OSX/Homebrew/comment_3_8906cf93e9d42081cc24ba560aa86a6f._comment b/doc/install/OSX/Homebrew/comment_3_8906cf93e9d42081cc24ba560aa86a6f._comment new file mode 100644 index 0000000000..776975c8db --- /dev/null +++ b/doc/install/OSX/Homebrew/comment_3_8906cf93e9d42081cc24ba560aa86a6f._comment @@ -0,0 +1,19 @@ +[[!comment format=mdwn + username="Spencer" + avatar="http://cdn.libravatar.org/avatar/2e0829f36a68480155e09d0883794a55" + subject="Remote Helper?" + date="2024-08-17T05:33:01Z" + content=""" +Homebrew doesn't seem to install the remote helper (`git remote-annex` is not a known command). + +Building from source doesn't work because brew installs base>4.20 which is incompatible with filepath-bytestring. Since homebrew is against backward compatibility I presume changing base version by installing a different ghc is out of the question. + +Maybe there's a way to do this with sandboxes? I'm not familiar with haskell, can anyone update the build recipe on how to build git-annex on MacOS (Apple silicon)? As I understand it one would need: + +1. `brew install ghc cabal-install haskell-stack` instead of `haskell-platform` +1. option `--bindir -> --installdir` +1. To specify `extra-lib-dirs` and `extra-include-dirs` to `/opt/homebrew/(lib|include)` respectively in cabal config or as additional options +1. `base` version `< 4.20` must be installed when installing `ghc` + +This is where I got stuck because I can't reinstall `base` without understanding sandboxes or installing a different GHC version (I think? This is effectively my first exposure to haskell) +"""]]
diff --git a/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn new file mode 100644 index 0000000000..7c11340b57 --- /dev/null +++ b/doc/bugs/lookupkey_--ref_refuses_to_run_in_bare_repository.mdwn @@ -0,0 +1,39 @@ +### Please describe the problem. + +`git annex lookupkey --ref` refuses to run in a bare repository, even though the `--ref` option was added specifically for that use-case. + + +### What steps will reproduce the problem? + +`git annex lookupkey --ref` in a bare repository. + + +### What version of git-annex are you using? On what operating system? + +``` +git-annex version: 10.20240732-g1e0f13ad7ffed0d75e3944d6189d984faefdb4af +build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Servant Benchmark Feeds Testsuite S3 WebDAV +dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.33 DAV-1.3.4 feed-1.3.2.1 ghc-9.4.7 http-client-0.7.14 persistent-sqlite-2.13.2.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: linux x86_64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 +``` + + +### Please provide any additional information below. + +[[!format sh """ +# If you can, paste a complete transcript of the problem occurring here. 
+# If the problem is with the git-annex assistant, paste in .git/annex/daemon.log + +$ git annex lookupkey --ref 84ec55fbac12fd09a679fca6f94ea98be83356df +git-annex: You cannot run this command in a bare repository. + +# End of transcript or log. +"""]] + +### Have you had any luck using git-annex before? (Sometimes we get tired of reading bug reports all day and a lil' positive end note does wonders) + +
Added a comment
diff --git a/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_2_6fe90bada4dd1ad956c0098a27c9a8be._comment b/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_2_6fe90bada4dd1ad956c0098a27c9a8be._comment new file mode 100644 index 0000000000..3111c5b338 --- /dev/null +++ b/doc/bugs/Assistant_commits_Calibre_.caltrash_files/comment_2_6fe90bada4dd1ad956c0098a27c9a8be._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="matrss" + avatar="http://cdn.libravatar.org/avatar/59541f50d845e5f81aff06e88a38b9de" + subject="comment 2" + date="2024-08-16T15:45:45Z" + content=""" +By default, git-annex puts all dotfiles into git and doesn't consider the largefiles setting for them. I think that is what you are seeing. You can change that behavior with the `annex.dotfiles` configuration. +"""]]
fix Annex.reposizes sharing between threads
diff --git a/Annex.hs b/Annex.hs index 1de9c234f8..cbec4befca 100644 --- a/Annex.hs +++ b/Annex.hs @@ -132,6 +132,7 @@ data AnnexRead = AnnexRead , forcenumcopies :: Maybe NumCopies , forcemincopies :: Maybe MinCopies , forcebackend :: Maybe String + , reposizes :: MVar (Maybe (M.Map UUID RepoSize)) , rebalance :: Bool , useragent :: Maybe String , desktopnotify :: DesktopNotify @@ -149,6 +150,7 @@ newAnnexRead c = do tp <- newTransferrerPool cm <- newTMVarIO M.empty cc <- newTMVarIO (CredentialCache M.empty) + rs <- newMVar Nothing return $ AnnexRead { branchstate = bs , activekeys = emptyactivekeys @@ -166,6 +168,7 @@ newAnnexRead c = do , forcebackend = Nothing , forcenumcopies = Nothing , forcemincopies = Nothing + , reposizes = rs , rebalance = False , useragent = Nothing , desktopnotify = mempty @@ -202,7 +205,6 @@ data AnnexState = AnnexState , remoteconfigmap :: Maybe (M.Map UUID RemoteConfig) , clusters :: Maybe (Annex Clusters) , maxsizes :: Maybe (M.Map UUID MaxSize) - , reposizes :: Maybe (M.Map UUID RepoSize) , forcetrust :: TrustMap , trustmap :: Maybe TrustMap , groupmap :: Maybe GroupMap @@ -258,7 +260,6 @@ newAnnexState c r = do , remoteconfigmap = Nothing , clusters = Nothing , maxsizes = Nothing - , reposizes = Nothing , forcetrust = M.empty , trustmap = Nothing , groupmap = Nothing diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index 28a3874723..5a30321db4 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -5,7 +5,7 @@ - Licensed under the GNU AGPL version 3 or higher. 
-} -{-# LANGUAGE OverloadedStrings #-} +{-# LANGUAGE OverloadedStrings, BangPatterns #-} module Annex.RepoSize ( getRepoSizes, @@ -15,43 +15,60 @@ import Annex.Common import Annex.RepoSize.LiveUpdate import qualified Annex import Annex.Branch (UnmergedBranches(..), getBranch) -import Annex.Journal (lockJournal) import Types.RepoSize import qualified Database.RepoSize as Db import Logs.Location import Logs.UUID import Git.Types (Sha) +import Control.Concurrent import qualified Data.Map.Strict as M {- Gets the repo size map. Cached for speed. -} getRepoSizes :: Annex (M.Map UUID RepoSize) -getRepoSizes = maybe calcRepoSizes return =<< Annex.getState Annex.reposizes +getRepoSizes = do + rsv <- Annex.getRead Annex.reposizes + liftIO (takeMVar rsv) >>= \case + Just sizemap -> do + liftIO $ putMVar rsv (Just sizemap) + return sizemap + Nothing -> calcRepoSizes rsv -{- Sets Annex.reposizes with current information from the git-annex - - branch, supplimented with journalled but not yet committed information. - - - - This should only be called when Annex.reposizes = Nothing. +{- Fills an empty Annex.reposizes MVar with current information + - from the git-annex branch, supplimented with journalled but + - not yet committed information. -} -calcRepoSizes :: Annex (M.Map UUID RepoSize) -calcRepoSizes = bracket Db.openDb Db.closeDb $ \h -> do - (oldsizemap, moldbranchsha) <- liftIO $ Db.getRepoSizes h - case moldbranchsha of - Nothing -> calculatefromscratch h - Just oldbranchsha -> do - currbranchsha <- getBranch - if oldbranchsha == currbranchsha - then calcJournalledRepoSizes oldsizemap oldbranchsha - else do - -- XXX todo incremental update by diffing - -- from old to new branch. 
- calculatefromscratch h +calcRepoSizes :: MVar (Maybe (M.Map UUID RepoSize)) -> Annex (M.Map UUID RepoSize) +calcRepoSizes rsv = bracket setup cleanup $ \h -> go h `onException` failed where + go h = do + (oldsizemap, moldbranchsha) <- liftIO $ Db.getRepoSizes h + !sizemap <- case moldbranchsha of + Nothing -> calculatefromscratch h + Just oldbranchsha -> do + currbranchsha <- getBranch + if oldbranchsha == currbranchsha + then calcJournalledRepoSizes oldsizemap oldbranchsha + else do + -- XXX todo incremental update by diffing + -- from old to new branch. + calculatefromscratch h + liftIO $ putMVar rsv (Just sizemap) + return sizemap + calculatefromscratch h = do showSideAction "calculating repository sizes" (sizemap, branchsha) <- calcBranchRepoSizes liftIO $ Db.setRepoSizes h sizemap branchsha calcJournalledRepoSizes sizemap branchsha + + setup = Db.openDb + + cleanup = Db.closeDb + + failed = do + liftIO $ putMVar rsv (Just M.empty) + return M.empty {- Sum up the sizes of all keys in all repositories, from the information - in the git-annex branch, but not the journal. Retuns the sha of the @@ -77,19 +94,13 @@ calcBranchRepoSizes = do {- Given the RepoSizes calculated from the git-annex branch, updates it with - data from journalled location logs. - - - - This should only be called when Annex.reposizes = Nothing. -} -calcJournalledRepoSizes :: M.Map UUID RepoSize -> Sha -> Annex (M.Map UUID RepoSize) -calcJournalledRepoSizes startmap branchsha = lockJournal $ \_jl -> do - sizemap <- overLocationLogsJournal startmap branchsha accumsizes - -- Set while the journal is still locked. Since Annex.reposizes - -- was Nothing until this point, any other thread that might be - -- journalling a location log change at the same time will - -- be blocked from running updateRepoSize concurrently with this. 
- Annex.changeState $ \st -> st - { Annex.reposizes = Just sizemap } - return sizemap +calcJournalledRepoSizes + :: M.Map UUID RepoSize + -> Sha + -> Annex (M.Map UUID RepoSize) +calcJournalledRepoSizes startmap branchsha = + overLocationLogsJournal startmap branchsha accumsizes where accumsizes k (newlocs, removedlocs) m = return $ let m' = foldl' (flip $ M.alter $ addKeyRepoSize k) m newlocs diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs index d9a7d6c35c..ebd656b4bf 100644 --- a/Annex/RepoSize/LiveUpdate.hs +++ b/Annex/RepoSize/LiveUpdate.hs @@ -14,17 +14,19 @@ import qualified Annex import Types.RepoSize import Logs.Presence.Pure +import Control.Concurrent import qualified Data.Map.Strict as M updateRepoSize :: UUID -> Key -> LogStatus -> Annex () -updateRepoSize u k s = Annex.getState Annex.reposizes >>= \case - Nothing -> noop - Just sizemap -> do - let !sizemap' = M.adjust - (fromMaybe (RepoSize 0) . f k . Just) - u sizemap - Annex.changeState $ \st -> st - { Annex.reposizes = Just sizemap' } +updateRepoSize u k s = do + rsv <- Annex.getRead Annex.reposizes + liftIO (takeMVar rsv) >>= \case + Nothing -> liftIO (putMVar rsv Nothing) + Just sizemap -> do + let !sizemap' = M.adjust + (fromMaybe (RepoSize 0) . f k . Just) + u sizemap + liftIO $ putMVar rsv (Just sizemap') where f = case s of InfoPresent -> addKeyRepoSize diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 3e28307a47..f7da42a4bd 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -39,10 +39,6 @@ Planned schedule of work: (Diff truncated)
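The fix above moves the cache into an MVar in AnnexRead so all threads share it. The read-through pattern can be sketched in plain IO (simplified types; on exception this version refills the MVar with Nothing so a later caller can retry, a slight simplification of the patch's `failed` handler, which caches an empty map):

```haskell
import Control.Concurrent.MVar
import Control.Exception (onException)
import qualified Data.Map.Strict as M

type SizeMap = M.Map String Integer

-- The first caller finds Nothing, runs the expensive calculation, and
-- caches the result; later callers get the cached map. Because the
-- MVar is left empty while calculating, concurrent callers block on
-- takeMVar instead of recomputing.
getCachedSizes :: MVar (Maybe SizeMap) -> IO SizeMap -> IO SizeMap
getCachedSizes mv calc = takeMVar mv >>= \cached -> case cached of
    Just m -> do
        putMVar mv (Just m)
        return m
    Nothing -> do
        m <- calc `onException` putMVar mv Nothing
        putMVar mv (Just m)
        return m
```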
todo
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 89d196d603..3e28307a47 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -39,6 +39,10 @@ Planned schedule of work: Note ideas in above todo about doing this at git-annex branch merge time to reuse the git diff done there. + * Annex.reposizes is not shared amoung threads, so duplicate work + to populate it, and threads won't learn about changes made by other + threads. + * What if 2 concurrent threads are considering sending two different keys to a repo at the same time. It can hold either but not both. It should avoid sending both in this situation. (Also discussed in @@ -57,11 +61,23 @@ Planned schedule of work: the provisional update was made until that is called.... But what if it is never called for some reason? + Also, in a race between two threads at the checking preferred content + stage, neither would have started sending yet, and so both would think + it was ok for them to. + This race only really matters when the repo becomes full, then the second thread will fail to send because it's full. Or will send more than the configured maxsize. Still this would be good to fix. + * If all the above thread concurrency problems are fixed, separate + processes will still have concurrency problems. One case where that is + bad is a cluster accessed via ssh. Each connection to the cluster is + a separate process. So each will be unaware of changes made by others. + When `git-annex copy --to cluster -Jn` is used, this makes a single + command behave non-ideally, the same as the thread concurrency + problems. + * `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg, when 2 out of 3 repositories are full, and the 3rd is mostly empty, it is no longer possible to add new files to 2 repositories.
todo
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index fd55c01152..89d196d603 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -44,6 +44,24 @@ Planned schedule of work: It should avoid sending both in this situation. (Also discussed in above todo) + * There can also be a race with 2 concurrent threads where one just + finished sending to a repo, but has not yet updated the location log. + So the other one won't see an updated repo size. + + The fact that location log changes happen in CommandCleanup makes + this difficult to fix. + + Could provisionally update Annex.reposizes before starting to send a + key, and roll it back if the send fails. But then Logs.Location + would update Annex.reposizes redundantly. So would need to remember + the provisional update was made until that is called.... But what if it + is never called for some reason? + + This race only really matters when the repo becomes full, + then the second thread will fail to send because it's full. Or will + send more than the configured maxsize. Still this would be good to + fix. + * `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg, when 2 out of 3 repositories are full, and the 3rd is mostly empty, it is no longer possible to add new files to 2 repositories.
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 6db1b549d4..fd55c01152 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -36,6 +36,14 @@ Planned schedule of work: database is older than the current git-annex branch. Diff from old to new branch to efficiently update. + Note ideas in above todo about doing this at git-annex branch merge + time to reuse the git diff done there. + + * What if 2 concurrent threads are considering sending two different + keys to a repo at the same time. It can hold either but not both. + It should avoid sending both in this situation. (Also discussed in + above todo) + * `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg, when 2 out of 3 repositories are full, and the 3rd is mostly empty, it is no longer possible to add new files to 2 repositories.
RepoSize concurrency fix
When loading the journalled repo sizes, make sure that the current
process is prevented from making changes to the journal in another
thread.
diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index d9fa13794d..28a3874723 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -15,6 +15,7 @@ import Annex.Common import Annex.RepoSize.LiveUpdate import qualified Annex import Annex.Branch (UnmergedBranches(..), getBranch) +import Annex.Journal (lockJournal) import Types.RepoSize import qualified Database.RepoSize as Db import Logs.Location @@ -25,35 +26,32 @@ import qualified Data.Map.Strict as M {- Gets the repo size map. Cached for speed. -} getRepoSizes :: Annex (M.Map UUID RepoSize) -getRepoSizes = maybe updateRepoSizes return =<< Annex.getState Annex.reposizes +getRepoSizes = maybe calcRepoSizes return =<< Annex.getState Annex.reposizes -{- Updates Annex.reposizes with current information from the git-annex +{- Sets Annex.reposizes with current information from the git-annex - branch, supplimented with journalled but not yet committed information. + - + - This should only be called when Annex.reposizes = Nothing. -} -updateRepoSizes :: Annex (M.Map UUID RepoSize) -updateRepoSizes = bracket Db.openDb Db.closeDb $ \h -> do +calcRepoSizes :: Annex (M.Map UUID RepoSize) +calcRepoSizes = bracket Db.openDb Db.closeDb $ \h -> do (oldsizemap, moldbranchsha) <- liftIO $ Db.getRepoSizes h case moldbranchsha of - Nothing -> calculatefromscratch h >>= set + Nothing -> calculatefromscratch h Just oldbranchsha -> do currbranchsha <- getBranch if oldbranchsha == currbranchsha - then journalledRepoSizes oldsizemap oldbranchsha - >>= set + then calcJournalledRepoSizes oldsizemap oldbranchsha else do -- XXX todo incremental update by diffing -- from old to new branch. 
- calculatefromscratch h >>= set + calculatefromscratch h where calculatefromscratch h = do showSideAction "calculating repository sizes" (sizemap, branchsha) <- calcBranchRepoSizes liftIO $ Db.setRepoSizes h sizemap branchsha - journalledRepoSizes sizemap branchsha - set sizemap = do - Annex.changeState $ \st -> st - { Annex.reposizes = Just sizemap } - return sizemap + calcJournalledRepoSizes sizemap branchsha {- Sum up the sizes of all keys in all repositories, from the information - in the git-annex branch, but not the journal. Retuns the sha of the @@ -79,10 +77,19 @@ calcBranchRepoSizes = do {- Given the RepoSizes calculated from the git-annex branch, updates it with - data from journalled location logs. + - + - This should only be called when Annex.reposizes = Nothing. -} -journalledRepoSizes :: M.Map UUID RepoSize -> Sha -> Annex (M.Map UUID RepoSize) -journalledRepoSizes startmap branchsha = - overLocationLogsJournal startmap branchsha accumsizes +calcJournalledRepoSizes :: M.Map UUID RepoSize -> Sha -> Annex (M.Map UUID RepoSize) +calcJournalledRepoSizes startmap branchsha = lockJournal $ \_jl -> do + sizemap <- overLocationLogsJournal startmap branchsha accumsizes + -- Set while the journal is still locked. Since Annex.reposizes + -- was Nothing until this point, any other thread that might be + -- journalling a location log change at the same time will + -- be blocked from running updateRepoSize concurrently with this. 
+ Annex.changeState $ \st -> st + { Annex.reposizes = Just sizemap } + return sizemap where accumsizes k (newlocs, removedlocs) m = return $ let m' = foldl' (flip $ M.alter $ addKeyRepoSize k) m newlocs diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index b60822aee5..6db1b549d4 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -32,12 +32,6 @@ Planned schedule of work: * Implement [[track_free_space_in_repos_via_git-annex_branch]]: - * When calling journalledRepoSizes make sure that the current - process is prevented from making changes to the journal in another - thread. Probably lock the journal? (No need to worry about changes made - by other processes; Annex.reposizes does not need to be kept current - with what other processes might be doing.) - * updateRepoSizes incrementally when the git-annex branch sha in the database is older than the current git-annex branch. Diff from old to new branch to efficiently update.
update Annex.reposizes when changing location logs
The live update is only needed when Annex.reposizes has already been
populated.
diff --git a/Annex/Branch.hs b/Annex/Branch.hs index d75b8f249b..8afe6f9912 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -410,15 +410,21 @@ getRef ref file = withIndex $ catFile ref file change :: Journalable content => RegardingUUID -> RawFilePath -> (L.ByteString -> content) -> Annex () change ru file f = lockJournal $ \jl -> f <$> getToChange ru file >>= set jl ru file -{- Applies a function which can modify the content of a file, or not. -} -maybeChange :: Journalable content => RegardingUUID -> RawFilePath -> (L.ByteString -> Maybe content) -> Annex () +{- Applies a function which can modify the content of a file, or not. + - + - Returns True when the file was modified. -} +maybeChange :: Journalable content => RegardingUUID -> RawFilePath -> (L.ByteString -> Maybe content) -> Annex Bool maybeChange ru file f = lockJournal $ \jl -> do v <- getToChange ru file case f v of Just jv -> let b = journalableByteString jv - in when (v /= b) $ set jl ru file b - _ -> noop + in if v /= b + then do + set jl ru file b + return True + else return False + _ -> return False data ChangeOrAppend t = Change t | Append t diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index dac089a962..d9fa13794d 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -12,6 +12,7 @@ module Annex.RepoSize ( ) where import Annex.Common +import Annex.RepoSize.LiveUpdate import qualified Annex import Annex.Branch (UnmergedBranches(..), getBranch) import Types.RepoSize @@ -86,17 +87,3 @@ journalledRepoSizes startmap branchsha = accumsizes k (newlocs, removedlocs) m = return $ let m' = foldl' (flip $ M.alter $ addKeyRepoSize k) m newlocs in foldl' (flip $ M.alter $ removeKeyRepoSize k) m' removedlocs - -addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize -addKeyRepoSize k mrs = case mrs of - Just (RepoSize sz) -> Just $ RepoSize $ sz + ksz - Nothing -> Just $ RepoSize ksz - where - ksz = fromMaybe 0 $ fromKey keySize k - -removeKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize 
-removeKeyRepoSize k mrs = case mrs of - Just (RepoSize sz) -> Just $ RepoSize $ sz - ksz - Nothing -> Nothing - where - ksz = fromMaybe 0 $ fromKey keySize k diff --git a/Annex/RepoSize/LiveUpdate.hs b/Annex/RepoSize/LiveUpdate.hs new file mode 100644 index 0000000000..d9a7d6c35c --- /dev/null +++ b/Annex/RepoSize/LiveUpdate.hs @@ -0,0 +1,46 @@ +{- git-annex repo sizes, live updates + - + - Copyright 2024 Joey Hess <id@joeyh.name> + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +{-# LANGUAGE BangPatterns #-} + +module Annex.RepoSize.LiveUpdate where + +import Annex.Common +import qualified Annex +import Types.RepoSize +import Logs.Presence.Pure + +import qualified Data.Map.Strict as M + +updateRepoSize :: UUID -> Key -> LogStatus -> Annex () +updateRepoSize u k s = Annex.getState Annex.reposizes >>= \case + Nothing -> noop + Just sizemap -> do + let !sizemap' = M.adjust + (fromMaybe (RepoSize 0) . f k . Just) + u sizemap + Annex.changeState $ \st -> st + { Annex.reposizes = Just sizemap' } + where + f = case s of + InfoPresent -> addKeyRepoSize + InfoMissing -> removeKeyRepoSize + InfoDead -> removeKeyRepoSize + +addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize +addKeyRepoSize k mrs = case mrs of + Just (RepoSize sz) -> Just $ RepoSize $ sz + ksz + Nothing -> Just $ RepoSize ksz + where + ksz = fromMaybe 0 $ fromKey keySize k + +removeKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize +removeKeyRepoSize k mrs = case mrs of + Just (RepoSize sz) -> Just $ RepoSize $ sz - ksz + Nothing -> Nothing + where + ksz = fromMaybe 0 $ fromKey keySize k diff --git a/Logs/ContentIdentifier.hs b/Logs/ContentIdentifier.hs index 6448693ae7..bf8fef5b2e 100644 --- a/Logs/ContentIdentifier.hs +++ b/Logs/ContentIdentifier.hs @@ -32,7 +32,7 @@ recordContentIdentifier :: RemoteStateHandle -> ContentIdentifier -> Key -> Anne recordContentIdentifier (RemoteStateHandle u) cid k = do c <- currentVectorClock config <- Annex.getGitConfig - Annex.Branch.maybeChange 
+ void $ Annex.Branch.maybeChange (Annex.Branch.RegardingUUID [u]) (remoteContentIdentifierLogFile config k) (addcid c . parseLog) diff --git a/Logs/Location.hs b/Logs/Location.hs index 73c1c5fe48..78ad36d60a 100644 --- a/Logs/Location.hs +++ b/Logs/Location.hs @@ -40,6 +40,7 @@ module Logs.Location ( import Annex.Common import qualified Annex.Branch import Annex.Branch (FileContents) +import Annex.RepoSize.LiveUpdate import Logs import Logs.Presence import Types.Cluster @@ -81,11 +82,13 @@ logChange key u@(UUID _) s | isClusterUUID u = noop | otherwise = do config <- Annex.getGitConfig - maybeAddLog + changed <- maybeAddLog (Annex.Branch.RegardingUUID [u]) (locationLogFile config key) s (LogInfo (fromUUID u)) + when changed $ + updateRepoSize u key s logChange _ NoUUID _ = noop {- Returns a list of repository UUIDs that, according to the log, have @@ -162,14 +165,15 @@ setDead key = do ls <- compactLog <$> readLog logfile mapM_ (go logfile) (filter (\l -> status l == InfoMissing) ls) where - go logfile l = + go logfile l = do let u = toUUID (fromLogInfo (info l)) c = case date l of VectorClock v -> CandidateVectorClock $ v + realToFrac (picosecondsToDiffTime 1) Unknown -> CandidateVectorClock 0 - in addLog' (Annex.Branch.RegardingUUID [u]) logfile InfoDead + addLog' (Annex.Branch.RegardingUUID [u]) logfile InfoDead (info l) c + updateRepoSize u key InfoDead data Unchecked a = Unchecked (Annex (Maybe a)) diff --git a/Logs/Presence.hs b/Logs/Presence.hs index 5c8dcdb343..6763e4676a 100644 --- a/Logs/Presence.hs +++ b/Logs/Presence.hs @@ -49,8 +49,10 @@ addLog' ru file logstatus loginfo c = {- When a LogLine already exists with the same status and info, but an - older timestamp, that LogLine is preserved, rather than updating the log - with a newer timestamp. + - + - Returns True when the log was changed. 
-} -maybeAddLog :: Annex.Branch.RegardingUUID -> RawFilePath -> LogStatus -> LogInfo -> Annex () +maybeAddLog :: Annex.Branch.RegardingUUID -> RawFilePath -> LogStatus -> LogInfo -> Annex Bool maybeAddLog ru file logstatus loginfo = do c <- currentVectorClock Annex.Branch.maybeChange ru file $ \b -> diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 89edd2bb71..b60822aee5 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -32,11 +32,6 @@ Planned schedule of work: * Implement [[track_free_space_in_repos_via_git-annex_branch]]: - * Update Annex.reposizes in Logs.Location.logChange, - when it makes a change and when Annex.reposizes has a size - for the UUID. So Annex.reposizes is kept up-to-date - for each transfer and drop. - * When calling journalledRepoSizes make sure that the current (Diff truncated)
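The per-key size accounting added in `Annex/RepoSize/LiveUpdate.hs` above boils down to two `Map`-alter callbacks. A minimal standalone sketch of that behavior (with `UUID` simplified to `String` and the key's size passed as a plain `Integer` rather than read from a `Key`):

```haskell
import qualified Data.Map.Strict as M

newtype RepoSize = RepoSize Integer deriving (Eq, Show)

-- adding a key: grow the repo's total, starting from 0 if unknown
addKeySize :: Integer -> Maybe RepoSize -> Maybe RepoSize
addKeySize ksz (Just (RepoSize sz)) = Just (RepoSize (sz + ksz))
addKeySize ksz Nothing              = Just (RepoSize ksz)

-- removing a key: shrink the total; no entry means nothing to adjust
removeKeySize :: Integer -> Maybe RepoSize -> Maybe RepoSize
removeKeySize ksz (Just (RepoSize sz)) = Just (RepoSize (sz - ksz))
removeKeySize _   Nothing              = Nothing

-- add 100, add 50, then remove the first key again
demo :: Maybe RepoSize
demo = M.lookup "u1" $
    M.alter (removeKeySize 100) "u1" $
    M.alter (addKeySize 50) "u1" $
    M.alter (addKeySize 100) "u1" (M.empty :: M.Map String RepoSize)
```

Here `demo` ends at `RepoSize 50`, mirroring how `updateRepoSize` adjusts `Annex.reposizes` per `InfoPresent`/`InfoMissing` log change.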
show message when doing possibly expensive from scratch reposize calculation
diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index 6e823c98b9..dac089a962 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -5,6 +5,8 @@ - Licensed under the GNU AGPL version 3 or higher. -} +{-# LANGUAGE OverloadedStrings #-} + module Annex.RepoSize ( getRepoSizes, ) where @@ -43,6 +45,7 @@ updateRepoSizes = bracket Db.openDb Db.closeDb $ \h -> do calculatefromscratch h >>= set where calculatefromscratch h = do + showSideAction "calculating repository sizes" (sizemap, branchsha) <- calcBranchRepoSizes liftIO $ Db.setRepoSizes h sizemap branchsha journalledRepoSizes sizemap branchsha diff --git a/doc/git-annex-preferred-content.mdwn b/doc/git-annex-preferred-content.mdwn index 8069cb201f..7ef98d14c8 100644 --- a/doc/git-annex-preferred-content.mdwn +++ b/doc/git-annex-preferred-content.mdwn @@ -294,6 +294,14 @@ elsewhere to allow removing it). expression, which will make repositories want to move files around as necessary in order to get fully balanced. + Using this in a perferred content expression makes git-annex need to do + some additional work to keep track of how full repositories are. Usually + that won't affect performance much. However, the first time git-annex + processes this in a given git repository, it will need to examine + all the locations of all files, which can be slow when there are a lot of + them. When this causes git-annex to do a lot of work, it will + display "(calculating repository sizes)". + Note that `not balanced` is a bad thing to put in a preferred content expression for the same reason `not present` is.
implement getRepoSizes
At this point the RepoSize database is getting populated, and it
all seems to be working correctly. Incremental updates still need to be
done to make it performant.
diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index 65b49ce005..6e823c98b9 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -5,17 +5,52 @@ - Licensed under the GNU AGPL version 3 or higher. -} -module Annex.RepoSize where +module Annex.RepoSize ( + getRepoSizes, +) where import Annex.Common -import Annex.Branch (UnmergedBranches(..)) +import qualified Annex +import Annex.Branch (UnmergedBranches(..), getBranch) import Types.RepoSize +import qualified Database.RepoSize as Db import Logs.Location import Logs.UUID import Git.Types (Sha) import qualified Data.Map.Strict as M +{- Gets the repo size map. Cached for speed. -} +getRepoSizes :: Annex (M.Map UUID RepoSize) +getRepoSizes = maybe updateRepoSizes return =<< Annex.getState Annex.reposizes + +{- Updates Annex.reposizes with current information from the git-annex + - branch, supplimented with journalled but not yet committed information. + -} +updateRepoSizes :: Annex (M.Map UUID RepoSize) +updateRepoSizes = bracket Db.openDb Db.closeDb $ \h -> do + (oldsizemap, moldbranchsha) <- liftIO $ Db.getRepoSizes h + case moldbranchsha of + Nothing -> calculatefromscratch h >>= set + Just oldbranchsha -> do + currbranchsha <- getBranch + if oldbranchsha == currbranchsha + then journalledRepoSizes oldsizemap oldbranchsha + >>= set + else do + -- XXX todo incremental update by diffing + -- from old to new branch. + calculatefromscratch h >>= set + where + calculatefromscratch h = do + (sizemap, branchsha) <- calcBranchRepoSizes + liftIO $ Db.setRepoSizes h sizemap branchsha + journalledRepoSizes sizemap branchsha + set sizemap = do + Annex.changeState $ \st -> st + { Annex.reposizes = Just sizemap } + return sizemap + {- Sum up the sizes of all keys in all repositories, from the information - in the git-annex branch, but not the journal. Retuns the sha of the - branch commit that was used. 
diff --git a/Limit.hs b/Limit.hs index bf78bedfde..13ba824fd8 100644 --- a/Limit.hs +++ b/Limit.hs @@ -598,12 +598,10 @@ limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles let groupmembers = fromMaybe S.empty $ M.lookup g (uuidsByGroup gm) maxsizes <- getMaxSizes - -- XXX do not calc this every time! - (sizemap, sha) <- calcBranchRepoSizes - sizemap' <- journalledRepoSizes sizemap sha + sizemap <- getRepoSizes let keysize = fromMaybe 0 (fromKey keySize key) currentlocs <- S.fromList <$> loggedLocations key - let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap') of + let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of (Just (MaxSize maxsize), Just (RepoSize reposize)) -> if u `S.member` currentlocs then reposize <= maxsize diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index a93f462829..89edd2bb71 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -30,6 +30,23 @@ Planned schedule of work: ## work notes +* Implement [[track_free_space_in_repos_via_git-annex_branch]]: + + * Update Annex.reposizes in Logs.Location.logChange, + when it makes a change and when Annex.reposizes has a size + for the UUID. So Annex.reposizes is kept up-to-date + for each transfer and drop. + + * When calling journalledRepoSizes make sure that the current + process is prevented from making changes to the journal in another + thread. Probably lock the journal? (No need to worry about changes made + by other processes; Annex.reposizes does not need to be kept current + with what other processes might be doing.) + + * updateRepoSizes incrementally when the git-annex branch sha in the + database is older than the current git-annex branch. Diff from old to + new branch to efficiently update. + * `fullybalanced=foo:2` can get stuck in suboptimal situations. Eg, when 2 out of 3 repositories are full, and the 3rd is mostly empty, it is no longer possible to add new files to 2 repositories. 
@@ -45,52 +62,13 @@ Planned schedule of work: Also note that "fullybalanced=foo:2" is not currently actually implemented! -* implement size-based balancing, so all balanced repositories are around - the same percent full, either as the default or as another preferred - content expression. +* Make `git-annex info` use Annex.reposizes. * `git-annex info` can use maxsize to display how full repositories are -* Implement [[track_free_space_in_repos_via_git-annex_branch]]: - - * Goal is for limitFullyBalanced not to need to calcRepoSizes. - - * When Annex.reposizes does not list the size of a UUID, and - that UUID's size is needed eg for balanced preferred - content, use calcRepoSizes and store in - Database.RepoSizes. - - * Load Annex.reposizes from Database.RepoSizes on demand, - supplimenting with journalledRepoSizes. - - * Update Annex.reposizes in Logs.Location.logChange, - when it makes a change and when Annex.reposizes has a size - for the UUID. So Annex.reposizes is kept up-to-date - for each transfer and drop. - - * When calling journalledRepoSizes make sure that the current - process is prevented from making changes to the journal in another - thread. Probably lock the journal? (No need to worry about changes made - by other processes; Annex.reposizes does not need to be kept current - with what other processes might be doing.) - - * Update Database.RepoSizes incrementally during merge of - git-annex branch, and after commit of git-annex branch. - (Also update Annex.reposizes) - - (Annex.reposizes can be updated to the resulting values as well.) - - * Perhaps: setRepoSize to 0 when initializing a new repo or a - new special remote (but not when reinitializing), - and also update Annex.reposizes for that uuid. - - Whether it makes sense to do this will depend on how expensive - it is to update Database.RepoSize on git-annex branch merge and commit. 
- If it is not expensive, will want to track reposizes from the beginning - whenever possible, to avoid a later expensive read of the git-annex - branch to calculate the reposizes. - -* Make `git-annex info` use Annex.reposizes. +* implement size-based balancing, so all balanced repositories are around + the same percent full, either as the default or as another preferred + content expression. ## completed items for August's work on balanced preferred content
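The `getRepoSizes` added above is a compute-once cache keyed on `Annex` state: return the stored map if present, otherwise calculate it and remember the result. The same shape, reduced to an `IORef` sketch with a hypothetical `calc` action standing in for `updateRepoSizes`:

```haskell
import Data.IORef

-- return the cached value if present; otherwise run the (possibly
-- expensive) calculation once and store the result for next time
cached :: IORef (Maybe a) -> IO a -> IO a
cached ref calc = readIORef ref >>= \mv -> case mv of
    Just v  -> return v
    Nothing -> do
        v <- calc
        writeIORef ref (Just v)
        return v
```

Subsequent callers see the stored map without touching the database again, which is why `limitFullyBalanced` can call it on every matched file.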
finalize RepoSize database
Including locking on creation, handling of permissions errors, and
setting repo sizes.
I'm confident that locking is not needed while using this database,
since writes happen in a single transaction. When there are two writers
that are recording sizes based on different git-annex branch commits,
one will overwrite what the other one recorded. That is fine; it's only
necessary that the database stays consistent with the content of a
git-annex branch commit.
diff --git a/Annex/Locations.hs b/Annex/Locations.hs index 0b1ad4d556..18b4dd2ed0 100644 --- a/Annex/Locations.hs +++ b/Annex/Locations.hs @@ -76,6 +76,7 @@ module Annex.Locations ( gitAnnexImportFeedDbDir, gitAnnexImportFeedDbLock, gitAnnexRepoSizeDbDir, + gitAnnexRepoSizeDbLock, gitAnnexScheduleState, gitAnnexTransferDir, gitAnnexCredsDir, @@ -521,6 +522,10 @@ gitAnnexRepoSizeDbDir :: Git.Repo -> GitConfig -> RawFilePath gitAnnexRepoSizeDbDir r c = fromMaybe (gitAnnexDir r) (annexDbDir c) P.</> "reposize" +{- Lock file for the reposize database. -} +gitAnnexRepoSizeDbLock :: Git.Repo -> GitConfig -> RawFilePath +gitAnnexRepoSizeDbLock r c = gitAnnexRepoSizeDbDir r c <> ".lck" + {- .git/annex/schedulestate is used to store information about when - scheduled jobs were last run. -} gitAnnexScheduleState :: Git.Repo -> RawFilePath diff --git a/Database/RepoSize.hs b/Database/RepoSize.hs index c4a6814e1a..ff70affbfc 100644 --- a/Database/RepoSize.hs +++ b/Database/RepoSize.hs @@ -22,17 +22,18 @@ module Database.RepoSize ( RepoSizeHandle, openDb, closeDb, - getRepoSizes, - setRepoSize, - updateRepoSize, + getRepoSizes, + setRepoSizes, ) where +import Annex.Common +import Annex.LockFile import Types.RepoSize -import Database.Types () +import Git.Types import qualified Database.Queue as H import Database.Init -import Annex.Locations -import Annex.Common +import Database.Utility +import Database.Types import qualified Utility.RawFilePath as R import Database.Persist.Sql hiding (Key) @@ -40,62 +41,107 @@ import Database.Persist.TH import qualified System.FilePath.ByteString as P import qualified Data.Map as M -newtype RepoSizeHandle = RepoSizeHandle H.DbQueue +newtype RepoSizeHandle = RepoSizeHandle (Maybe H.DbQueue) share [mkPersist sqlSettings, mkMigrate "migrateRepoSizes"] [persistLowerCase| +-- Corresponds to location log information from the git-annex branch. 
RepoSizes repo UUID size Integer UniqueRepo repo +-- The last git-annex branch commit that was used to update RepoSizes. +AnnexBranch + commit SSha + UniqueCommit commit |] {- Opens the database, creating it if it doesn't exist yet. - - - No locking is done by this, so caller must prevent multiple processes - - running this at the same time. + - Multiple readers and writers can have the database open at the same + - time. Database.Handle deals with the concurrency issues. + - The lock is held while opening the database, so that when + - the database doesn't exist yet, one caller wins the lock and + - can create it undisturbed. -} openDb :: Annex RepoSizeHandle openDb = do - dbdir <- calcRepo' gitAnnexRepoSizeDbDir - let db = dbdir P.</> "db" - unlessM (liftIO $ R.doesPathExist db) $ do - initDb db $ void $ - runMigrationSilent migrateRepoSizes - h <- liftIO $ H.openDbQueue db "reposizes" - return $ RepoSizeHandle h + lck <- calcRepo' gitAnnexRepoSizeDbLock + catchPermissionDenied permerr $ withExclusiveLock lck $ do + dbdir <- calcRepo' gitAnnexRepoSizeDbDir + let db = dbdir P.</> "db" + unlessM (liftIO $ R.doesPathExist db) $ do + initDb db $ void $ + runMigrationSilent migrateRepoSizes + h <- liftIO $ H.openDbQueue db "repo_sizes" + return $ RepoSizeHandle (Just h) + where + -- If permissions don't allow opening the database, + -- just don't use it. Since this database is just a cache + -- of information available in the git-annex branch, the same + -- information can be queried from the branch, though much less + -- efficiently. + permerr _e = return (RepoSizeHandle Nothing) closeDb :: RepoSizeHandle -> Annex () -closeDb (RepoSizeHandle h) = liftIO $ H.closeDbQueue h +closeDb (RepoSizeHandle (Just h)) = liftIO $ H.closeDbQueue h +closeDb (RepoSizeHandle Nothing) = noop -{- Doesn't see changes that were just made with setRepoSize or - - updateRepoSize before flushing the queue. 
-} -getRepoSizes :: RepoSizeHandle -> IO (M.Map UUID RepoSize) -getRepoSizes (RepoSizeHandle h) = H.queryDbQueue h $ - M.fromList . map conv <$> getRepoSizes' +getRepoSizes :: RepoSizeHandle -> IO (M.Map UUID RepoSize, Maybe Sha) +getRepoSizes (RepoSizeHandle (Just h)) = H.queryDbQueue h $ do + sizemap <- M.fromList . map conv <$> getRepoSizes' + annexbranchsha <- getAnnexBranchCommit + return (sizemap, annexbranchsha) where conv entity = let RepoSizes u sz = entityVal entity in (u, RepoSize sz) +getRepoSizes (RepoSizeHandle Nothing) = return (mempty, Nothing) getRepoSizes' :: SqlPersistM [Entity RepoSizes] getRepoSizes' = selectList [] [] -setRepoSize :: UUID -> RepoSize -> RepoSizeHandle -> IO () -setRepoSize u (RepoSize sz) (RepoSizeHandle h) = H.queueDb h checkCommit $ +getAnnexBranchCommit :: SqlPersistM (Maybe Sha) +getAnnexBranchCommit = do + l <- selectList ([] :: [Filter AnnexBranch]) [] + case l of + (s:[]) -> return $ Just $ fromSSha $ + annexBranchCommit $ entityVal s + _ -> return Nothing + +{- Updates the recorded sizes of all repositories. + - + - This can be called without locking since the update runs in a single + - transaction. + - + - Any repositories that are not in the provided map, but do have a size + - recorded in the database will have it cleared. This is unlikely to + - happen, but ensures that the database is consistent. 
+ -} +setRepoSizes :: RepoSizeHandle -> M.Map UUID RepoSize -> Sha -> IO () +setRepoSizes (RepoSizeHandle (Just h)) sizemap branchcommitsha = + H.queueDb h commitimmediately $ do + l <- getRepoSizes' + forM_ (map entityVal l) $ \(RepoSizes u _) -> + unless (M.member u sizemap) $ + unsetRepoSize u + forM_ (M.toList sizemap) $ + uncurry setRepoSize + recordAnnexBranchCommit branchcommitsha + where + commitimmediately _ _ = pure True +setRepoSizes (RepoSizeHandle Nothing) _ _ = noop + +setRepoSize :: UUID -> RepoSize -> SqlPersistM () +setRepoSize u (RepoSize sz) = void $ upsertBy (UniqueRepo u) (RepoSizes u sz) [RepoSizesSize =. sz] -{- Applies an offset to the size. If no size is recorded for the repo, does - - nothing. -} -updateRepoSize :: UUID -> Integer -> RepoSizeHandle -> IO () -updateRepoSize u offset (RepoSizeHandle h) = H.queueDb h checkCommit $ - void $ updateWhere - [RepoSizesRepo ==. u] - [RepoSizesSize +=. offset] +unsetRepoSize :: UUID -> SqlPersistM () +unsetRepoSize u = deleteWhere [RepoSizesRepo ==. u] -checkCommit :: H.QueueSize -> H.LastCommitTime -> IO Bool -checkCommit sz _lastcommittime - | sz > 1000 = return True - | otherwise = return False +recordAnnexBranchCommit :: Sha -> SqlPersistM () +recordAnnexBranchCommit branchcommitsha = do + deleteWhere ([] :: [Filter AnnexBranch]) + void $ insertUniqueFast $ AnnexBranch $ toSSha branchcommitsha diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index e049ed3c13..a93f462829 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -55,8 +55,6 @@ Planned schedule of work: * Goal is for limitFullyBalanced not to need to calcRepoSizes. - * Add git-annex branch sha to Database.RepoSizes. - * When Annex.reposizes does not list the size of a UUID, and (Diff truncated)
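`setRepoSizes` above walks the existing rows and deletes any repo that is missing from the newly calculated map before upserting the rest, all in one transaction. The set of rows it clears is just a map difference; a small sketch of that bookkeeping (hypothetical `staleRows` helper, sizes as plain `Integer`):

```haskell
import qualified Data.Map.Strict as M

-- the repos whose recorded size must be cleared: present in the old
-- recording but absent from the newly calculated map
staleRows :: Ord u => M.Map u Integer -> M.Map u Integer -> [u]
staleRows old new = M.keys (old `M.difference` new)
```

Everything in the new map is then written with `upsertBy`, so the database ends up exactly matching the calculation regardless of what it held before.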
Added a comment
diff --git a/doc/bugs/__96__git_annex_info__96___hangs_with_git_special_remote/comment_9_80bf050d4c244b69ae42570782976456._comment b/doc/bugs/__96__git_annex_info__96___hangs_with_git_special_remote/comment_9_80bf050d4c244b69ae42570782976456._comment new file mode 100644 index 0000000000..97c56a4d4a --- /dev/null +++ b/doc/bugs/__96__git_annex_info__96___hangs_with_git_special_remote/comment_9_80bf050d4c244b69ae42570782976456._comment @@ -0,0 +1,31 @@ +[[!comment format=mdwn + username="Atemu" + avatar="http://cdn.libravatar.org/avatar/6ac9c136a74bb8760c66f422d3d6dc32" + subject="comment 9" + date="2024-08-15T15:40:19Z" + content=""" +I just hit this bug again and it's even nastier than I remembered. + +I also found a super simple reproducer: + +1. Have two machines A and B +2. Init a git-annex repo on A +3. Clone the git-annex repo on B (`git clone ssh://A:/tmp/testrepo`) +4. Make A unreachable for B (i.e. `systemctl suspend`) +5. Execute `git annex info` on B. +6. It hangs forever + +I have not found a way to get out of this situation (`--fast` does not help) other than restoring the connection to A which is sometimes simply not possible. 
+ +``` +git-annex version: 10.20240701 +build flags: Assistant Webapp Pairing Inotify DBus DesktopNotify TorrentParser MagicMime Feeds Testsuite S3 WebDAV +dependency versions: aws-0.24.1 bloomfilter-2.0.1.2 crypton-0.34 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.6 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: linux x86_64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 +local repository version: 10 +``` +"""]]
implement journalledRepoSizes
Plan is to run this when populating Annex.reposizes on demand.
So Annex.reposizes will be up-to-date with the journal, including
crucially journal entries for private repositories. But also
anything that has been written to the journal by another process,
especially if the process was run with annex.alwayscommit=false.
From there, Annex.reposizes can be kept up to date with changes made
by the running process.
diff --git a/Annex/Branch.hs b/Annex/Branch.hs index ed6f648641..41d8f5354b 100644 --- a/Annex/Branch.hs +++ b/Annex/Branch.hs @@ -21,6 +21,7 @@ module Annex.Branch ( updateTo, get, getHistorical, + getRef, getUnmergedRefs, RegardingUUID(..), change, @@ -39,6 +40,7 @@ module Annex.Branch ( UnmergedBranches(..), overBranchFileContents, overJournalFileContents, + combineStaleJournalWithBranch, updatedFromTree, ) where @@ -1010,7 +1012,7 @@ overBranchFileContents -- and in this case it's also possible for the callback to be -- passed some of the same file content repeatedly. -> (RawFilePath -> Maybe v) - -> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex a) + -> (Annex (Maybe (v, RawFilePath, Maybe (L.ByteString, Maybe Bool))) -> Annex a) -> Annex (UnmergedBranches (a, Git.Sha)) overBranchFileContents ignorejournal select go = do st <- update @@ -1024,7 +1026,7 @@ overBranchFileContents ignorejournal select go = do overBranchFileContents' :: (RawFilePath -> Maybe v) - -> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex a) + -> (Annex (Maybe (v, RawFilePath, Maybe (L.ByteString, Maybe Bool))) -> Annex a) -> BranchState -> Annex (a, Git.Sha) overBranchFileContents' select go st = do @@ -1038,11 +1040,14 @@ overBranchFileContents' select go st = do buf <- liftIO newEmptyMVar let go' reader = go $ liftIO reader >>= \case Just ((v, f), content) -> do - content' <- checkjournal f content + content' <- checkjournal f content >>= return . \case + Nothing -> Nothing + Just c -> Just (c, Just False) return (Just (v, f, content')) Nothing | journalIgnorable st -> return Nothing - | otherwise -> overJournalFileContents' buf (handlestale branchsha) select + | otherwise -> + overJournalFileContents' buf (handlestale branchsha) select res <- catObjectStreamLsTree l (select' . getTopFilePath . 
Git.LsTree.file) g go' `finally` liftIO (void cleanup) return (res, branchsha) @@ -1059,29 +1064,33 @@ overBranchFileContents' select go st = do handlestale branchsha f journalledcontent = do -- This is expensive, but happens only when there is a -- private journal file. - content <- getRef branchsha f - return (content <> journalledcontent) + branchcontent <- getRef branchsha f + return (combineStaleJournalWithBranch branchcontent journalledcontent, Just True) + +combineStaleJournalWithBranch :: L.ByteString -> L.ByteString -> L.ByteString +combineStaleJournalWithBranch branchcontent journalledcontent = + branchcontent <> journalledcontent {- Like overBranchFileContents but only reads the content of journalled - - files. Note that when there are private UUIDs, the journal files may - - only include information about the private UUID, while information about - - other UUIDs has been committed to the git-annex branch. + - files. -} overJournalFileContents - :: (RawFilePath -> Maybe v) - -> (Annex (Maybe (v, RawFilePath, Maybe L.ByteString)) -> Annex a) + :: (RawFilePath -> L.ByteString -> Annex (L.ByteString, Maybe b)) + -- ^ Called with the journalled file content when the journalled + -- content may be stale or lack information committed to the + -- git-annex branch. 
+ -> (RawFilePath -> Maybe v) + -> (Annex (Maybe (v, RawFilePath, Maybe (L.ByteString, Maybe b))) -> Annex a) -> Annex a -overJournalFileContents select go = do +overJournalFileContents handlestale select go = do buf <- liftIO newEmptyMVar go $ overJournalFileContents' buf handlestale select - where - handlestale _f journalledcontent = return journalledcontent overJournalFileContents' :: MVar ([RawFilePath], [RawFilePath]) - -> (RawFilePath -> L.ByteString -> Annex L.ByteString) + -> (RawFilePath -> L.ByteString -> Annex (L.ByteString, Maybe b)) -> (RawFilePath -> Maybe a) - -> Annex (Maybe (a, RawFilePath, Maybe L.ByteString)) + -> Annex (Maybe (a, RawFilePath, (Maybe (L.ByteString, Maybe b)))) overJournalFileContents' buf handlestale select = liftIO (tryTakeMVar buf) >>= \case Nothing -> do @@ -1096,7 +1105,7 @@ overJournalFileContents' buf handlestale select = content <- getJournalFileStale (GetPrivate True) f >>= \case NoJournalledContent -> return Nothing JournalledContent journalledcontent -> - return (Just journalledcontent) + return (Just (journalledcontent, Nothing)) PossiblyStaleJournalledContent journalledcontent -> Just <$> handlestale f journalledcontent return (Just (v, f, content)) diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs index 936101f02a..65b49ce005 100644 --- a/Annex/RepoSize.hs +++ b/Annex/RepoSize.hs @@ -22,23 +22,43 @@ import qualified Data.Map.Strict as M - - The map includes the UUIDs of all known repositories, including - repositories that are empty. + - + - Note that private repositories, which do not get recorded in + - the git-annex branch, will have 0 size. journalledRepoSizes + - takes care of getting repo sizes for those. 
-} calcBranchRepoSizes :: Annex (M.Map UUID RepoSize, Sha) calcBranchRepoSizes = do knownuuids <- M.keys <$> uuidDescMap let startmap = M.fromList $ map (\u -> (u, RepoSize 0)) knownuuids - overLocationLogs True startmap accum >>= \case + overLocationLogs True startmap accumsizes >>= \case UnmergedBranches v -> return v NoUnmergedBranches v -> return v where - addksz ksz (Just (RepoSize sz)) = Just $ RepoSize $ sz + ksz - addksz ksz Nothing = Just $ RepoSize ksz - accum k locs m = return $ - let sz = fromMaybe 0 $ fromKey keySize k - in foldl' (flip $ M.alter $ addksz sz) m locs + accumsizes k locs m = return $ + foldl' (flip $ M.alter $ addKeyRepoSize k) m locs {- Given the RepoSizes calculated from the git-annex branch, updates it with - data from journalled location logs. -} journalledRepoSizes :: M.Map UUID RepoSize -> Sha -> Annex (M.Map UUID RepoSize) -journalledRepoSizes m branchsha = undefined --- XXX +journalledRepoSizes startmap branchsha = + overLocationLogsJournal startmap branchsha accumsizes + where + accumsizes k (newlocs, removedlocs) m = return $ + let m' = foldl' (flip $ M.alter $ addKeyRepoSize k) m newlocs + in foldl' (flip $ M.alter $ removeKeyRepoSize k) m' removedlocs + +addKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize +addKeyRepoSize k mrs = case mrs of + Just (RepoSize sz) -> Just $ RepoSize $ sz + ksz + Nothing -> Just $ RepoSize ksz + where + ksz = fromMaybe 0 $ fromKey keySize k + +removeKeyRepoSize :: Key -> Maybe RepoSize -> Maybe RepoSize +removeKeyRepoSize k mrs = case mrs of + Just (RepoSize sz) -> Just $ RepoSize $ sz - ksz + Nothing -> Nothing + where + ksz = fromMaybe 0 $ fromKey keySize k diff --git a/Database/ImportFeed.hs b/Database/ImportFeed.hs index 5f35ff3505..e78f6ca9aa 100644 --- a/Database/ImportFeed.hs +++ b/Database/ImportFeed.hs @@ -197,7 +197,7 @@ updateFromLog db@(ImportFeedDbHandle h) (oldtree, currtree) | otherwise = Nothing goscan reader = reader >>= \case - Just ((), f, Just content) + Just ((), f, 
Just (content, _)) | isUrlLog f -> do knownurls (parseUrlLog content) goscan reader diff --git a/Limit.hs b/Limit.hs index 753aa4469a..bf78bedfde 100644 --- a/Limit.hs +++ b/Limit.hs @@ -599,10 +599,11 @@ limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles M.lookup g (uuidsByGroup gm) maxsizes <- getMaxSizes -- XXX do not calc this every time! - (sizemap, _sha) <- calcBranchRepoSizes + (sizemap, sha) <- calcBranchRepoSizes + sizemap' <- journalledRepoSizes sizemap sha let keysize = fromMaybe 0 (fromKey keySize key) currentlocs <- S.fromList <$> loggedLocations key - let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of + let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap') of (Just (MaxSize maxsize), Just (RepoSize reposize)) -> if u `S.member` currentlocs then reposize <= maxsize diff --git a/Logs/Location.hs b/Logs/Location.hs index ddcbb58234..73842ddd97 100644 --- a/Logs/Location.hs +++ b/Logs/Location.hs (Diff truncated)
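The `journalledRepoSizes` added above folds each journalled key's added and removed locations into the branch-derived map. A simplified sketch of that accumulator (the key's size passed as a plain `Integer` and UUIDs as `String`s; the real `accumsizes` reads the size from the `Key`):

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- fold a key of the given size into the totals of its journal-added
-- locations, then back out of its journal-removed locations
accumSizes :: Integer -> ([String], [String]) -> M.Map String Integer -> M.Map String Integer
accumSizes ksz (newlocs, removedlocs) m =
    let add    = M.alter (Just . maybe ksz (+ ksz))
        remove = M.alter (fmap (subtract ksz))
        m'     = foldl' (flip add) m newlocs
    in  foldl' (flip remove) m' removedlocs
```

Run over every journalled location log, this brings the branch-calculated sizes up to date with uncommitted changes, including those in private journals.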
Added a comment: parameter --from not accepted
diff --git a/doc/git-annex-export/comment_1_62c53e59a315e5c014bf2de6b0e447b5._comment b/doc/git-annex-export/comment_1_62c53e59a315e5c014bf2de6b0e447b5._comment new file mode 100644 index 0000000000..42ac87ff0b --- /dev/null +++ b/doc/git-annex-export/comment_1_62c53e59a315e5c014bf2de6b0e447b5._comment @@ -0,0 +1,10 @@ +[[!comment format=mdwn + username="pedro-lopes-de-azevedo" + avatar="http://cdn.libravatar.org/avatar/492b7020bff4e7cb466e95dfd72fd206" + subject="parameter --from not accepted" + date="2024-08-14T14:27:53Z" + content=""" +Recently I tested the export command adding the `--from` parameter and it was not accepted. + +git-annex version: 10.20240701 +"""]]
Added a comment
diff --git a/doc/bugs/migrate_removes_associated_URLs_with_custom_scheme/comment_2_e464d062c2539b8eb2d65a72bdab6709._comment b/doc/bugs/migrate_removes_associated_URLs_with_custom_scheme/comment_2_e464d062c2539b8eb2d65a72bdab6709._comment new file mode 100644 index 0000000000..91130d1e5c --- /dev/null +++ b/doc/bugs/migrate_removes_associated_URLs_with_custom_scheme/comment_2_e464d062c2539b8eb2d65a72bdab6709._comment @@ -0,0 +1,8 @@ +[[!comment format=mdwn + username="bvaa" + avatar="http://cdn.libravatar.org/avatar/1c36fa5fed5065f59842ebce35b10299" + subject="comment 2" + date="2024-08-14T07:18:25Z" + content=""" +I have the same issue but with an external special remote that claims some https: URL based on a specific domain name. +"""]]
plan
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 7853268073..5e1bd0c817 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -51,6 +51,17 @@ Planned schedule of work: * `git-annex info` can use maxsize to display how full repositories are +* overBranchFileContents can improve its handling of journalled files + by first going over the branch, and then at the end, feeding + the journalled filenames into catObjectStream (run on the same branch + sha) to check if the file was in the branch. Only pass the journalled + file to the callback when it was not. This will avoid innaccuracies + in calcRepoSizes and git-annex info. + + calcRepoSizes currently skips log files in private journals, + when they are for a key that does not appear in the git-annex branch. + It needs to include those. + * Implement [[track_free_space_in_repos_via_git-annex_branch]]: * Goal is for limitFullyBalanced not to need to calcRepoSizes. @@ -83,10 +94,6 @@ Planned schedule of work: (Annex.reposizes can be updated to the resulting values.) - * calcRepoSizes currently skips log files in private journals, - when they are for a key that does not appear in the git-annex branch. - It needs to include those. - * Perhaps: setRepoSize to 0 when initializing a new repo or a new special remote (but not when reinitializing), and also update Annex.reposizes for that uuid.
update
diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 00b7c4030b..7853268073 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -53,15 +53,19 @@ Planned schedule of work: * Implement [[track_free_space_in_repos_via_git-annex_branch]]: - * Load Annex.reposizes from Database.RepoSizes on startup. + * Goal is for limitFullyBalanced not to need to calcRepoSizes. + + * Load Annex.reposizes from Database.RepoSizes on demand. * When Annex.reposizes does not list the size of a UUID, and that UUID's size is needed eg for balanced preferred - content, read the git-annex branch to get all repo sizes, - the same way `git-annex info` gets repo sizes. And store in + content, use calcRepoSizes and store in Database.RepoSizes. - * Update Annex.reposizes after each successful transfer. + * Update Annex.reposizes in Logs.Location.logChange, + when it makes a change and when Annex.reposizes has a size + for the UUID. So Annex.reposizes is kept up-to-date + for each transfer and drop. * Update Database.RepoSizes during merge of git-annex branch. (Also update Annex.reposizes) @@ -75,8 +79,14 @@ Planned schedule of work: part of the commit. So need to read all the changed location logs, and update Database.RepoSize accordingly. + Also private journals complicate this. + (Annex.reposizes can be updated to the resulting values.) + * calcRepoSizes currently skips log files in private journals, + when they are for a key that does not appear in the git-annex branch. + It needs to include those. + * Perhaps: setRepoSize to 0 when initializing a new repo or a new special remote (but not when reinitializing), and also update Annex.reposizes for that uuid.
fixed --rebalance stability on drop
Was checking the wrong uuid, oops
diff --git a/Limit.hs b/Limit.hs index 81e0d91f47..3e0bd80654 100644 --- a/Limit.hs +++ b/Limit.hs @@ -603,7 +603,7 @@ limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles let keysize = fromMaybe 0 (fromKey keySize key) let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of (Just (MaxSize maxsize), Just (RepoSize reposize)) -> - if maybe False (`S.member` notpresent) mu + if u `S.member` notpresent then reposize <= maxsize else reposize + keysize <= maxsize _ -> True diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 1b9607c713..00b7c4030b 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -51,10 +51,6 @@ Planned schedule of work: * `git-annex info` can use maxsize to display how full repositories are -* --rebalance is not stable. It will drop a key that was just stored in a - repo. Seems that limitFullyBalanced needs to take AssumeNotPresent - into account to handle dropping correctly. - * Implement [[track_free_space_in_repos_via_git-annex_branch]]: * Load Annex.reposizes from Database.RepoSizes on startup.
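The one-line Limit.hs fix above changes which repositories get the key's size counted against their maxsize. A standalone Python sketch of the corrected check — illustrative names, not the real Haskell code:

```python
# Sketch of the fixed has-space check from the diff above. A repository
# in the `notpresent` set is one the key is assumed not present in
# (e.g. it is being considered for a drop), so its recorded size is
# compared without adding the key's size on top.

def has_space(uuid, notpresent, key_size, maxsizes, reposizes):
    if uuid not in maxsizes or uuid not in reposizes:
        return True                # no known limit or size: assume it fits
    if uuid in notpresent:
        return reposizes[uuid] <= maxsizes[uuid]
    return reposizes[uuid] + key_size <= maxsizes[uuid]
```

The previous code checked the wrong uuid (mu), so a repo that had just stored a key could immediately fail the check and be rebalanced away from.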
take maxsize into account for balanced preferred content
This is very inefficient; it will need to be optimised not to
calculate the sizes of repos every time.
Also, fixed a bug in balancedPicker that caused it to pick too high an
index when some repos were excluded due to being full.
diff --git a/Annex/Balanced.hs b/Annex/Balanced.hs index 46089ea0be..ad917ef1e5 100644 --- a/Annex/Balanced.hs +++ b/Annex/Balanced.hs @@ -21,14 +21,14 @@ type BalancedPicker = S.Set UUID -> Key -> UUID -- The set of UUIDs provided here are all the UUIDs that are ever -- expected to be picked amoung. A subset of that can be provided --- when later using the BalancedPicker. +-- when later using the BalancedPicker. Neither set can be empty. balancedPicker :: S.Set UUID -> BalancedPicker balancedPicker s = \s' key -> let n = calcMac tointeger HmacSha256 combineduuids (serializeKey' key) + m = fromIntegral (S.size s') in S.elemAt (fromIntegral (n `mod` m)) s' where combineduuids = mconcat (map fromUUID (S.toAscList s)) - m = fromIntegral (S.size s) tointeger :: Digest a -> Integer tointeger = foldl' (\i b -> (i `shiftL` 8) + fromIntegral b) 0 diff --git a/Annex/RepoSize.hs b/Annex/RepoSize.hs new file mode 100644 index 0000000000..e2f1702254 --- /dev/null +++ b/Annex/RepoSize.hs @@ -0,0 +1,33 @@ +{- git-annex repo sizes + - + - Copyright 2024 Joey Hess <id@joeyh.name> + - + - Licensed under the GNU AGPL version 3 or higher. + -} + +module Annex.RepoSize where + +import Annex.Common +import Types.RepoSize +import Logs.Location +import Logs.UUID + +import qualified Data.Map.Strict as M + +{- Sum up the sizes of all keys in all repositories, from the information + - in the git-annex branch. Can be slow. + - + - The map includes the UUIDs of all known repositories, including + - repositories that are empty. 
+ -} +calcRepoSizes :: Annex (M.Map UUID RepoSize) +calcRepoSizes = do + knownuuids <- M.keys <$> uuidDescMap + let startmap = M.fromList $ map (\u -> (u, RepoSize 0)) knownuuids + overLocationLogs startmap $ \k locs m -> + return $ + let sz = fromMaybe 0 $ fromKey keySize k + in foldl' (flip $ M.alter $ addksz sz) m locs + where + addksz ksz (Just (RepoSize sz)) = Just $ RepoSize $ sz + ksz + addksz ksz Nothing = Just $ RepoSize ksz diff --git a/Limit.hs b/Limit.hs index 850abfaaef..4f558a7fa9 100644 --- a/Limit.hs +++ b/Limit.hs @@ -17,6 +17,9 @@ import Annex.Content import Annex.WorkTree import Annex.UUID import Annex.Magic +import Annex.RepoSize +import Types.RepoSize +import Logs.MaxSize import Annex.Link import Types.Link import Logs.Trust @@ -590,14 +593,24 @@ limitBalanced mu getgroupmap groupname = do limitFullyBalanced :: Maybe UUID -> Annex GroupMap -> MkLimit Annex limitFullyBalanced mu getgroupmap groupname = Right $ MatchFiles - { matchAction = const $ checkKey $ \key -> do + { matchAction = \notpresent -> checkKey $ \key -> do gm <- getgroupmap let groupmembers = fromMaybe S.empty $ M.lookup g (uuidsByGroup gm) - -- TODO free space checking - return $ case (mu, M.lookup g (balancedPickerByGroup gm)) of - (Just u, Just picker) -> u == picker groupmembers key - _ -> False + maxsizes <- getMaxSizes + -- XXX do not calc this every time! 
+ sizemap <- calcRepoSizes + let hasspace u = case (M.lookup u maxsizes, M.lookup u sizemap) of + (Just (MaxSize maxsize), Just (RepoSize reposize)) -> + reposize + fromMaybe 0 (fromKey keySize key) + <= maxsize + _ -> True + let candidates = S.filter hasspace groupmembers + return $ if S.null candidates + then False + else case (mu, M.lookup g (balancedPickerByGroup gm)) of + (Just u, Just picker) -> u == picker candidates key + _ -> False , matchNeedsFileName = False , matchNeedsFileContent = False , matchNeedsKey = True diff --git a/doc/todo/git-annex_proxies.mdwn b/doc/todo/git-annex_proxies.mdwn index 145f3cf27e..1b9607c713 100644 --- a/doc/todo/git-annex_proxies.mdwn +++ b/doc/todo/git-annex_proxies.mdwn @@ -45,13 +45,15 @@ Planned schedule of work: Also note that "fullybalanced=foo:2" is not currently actually implemented! -* implement size-based balancing, either as the default or as another - preferred content expression. +* implement size-based balancing, so all balanced repositories are around + the same percent full, either as the default or as another preferred + content expression. * `git-annex info` can use maxsize to display how full repositories are -* balanced= and fullybalanced= need to limit the set of repositories to - ones with enough free space to contain a key. +* --rebalance is not stable. It will drop a key that was just stored in a + repo. Seems that limitFullyBalanced needs to take AssumeNotPresent + into account to handle dropping correctly. * Implement [[track_free_space_in_repos_via_git-annex_branch]]: diff --git a/git-annex.cabal b/git-annex.cabal index 713c9e658c..b54a55c7a6 100644 --- a/git-annex.cabal +++ b/git-annex.cabal @@ -574,6 +574,7 @@ Executable git-annex Annex.Queue Annex.ReplaceFile Annex.RemoteTrackingBranch + Annex.RepoSize Annex.SafeDropProof Annex.SpecialRemote Annex.SpecialRemote.Config
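The balancedPicker fix in the diff above moves the modulus inside the lambda so it uses the size of the candidate subset actually passed in, not the original full set — otherwise filtering out full repos could index past the end. A toy Python sketch of the idea, with a plain SHA-256 standing in for the real HMAC-SHA256 keyed on the combined UUIDs (names illustrative; the candidate set must be non-empty):

```python
# Toy sketch of stable balanced picking: hash the key to an integer,
# then index into the sorted candidate set. The crucial detail, per the
# bug fix above, is taking the modulus by the size of the candidate
# subset passed in, not by the size of the original full set.
import hashlib

def balanced_pick(candidates, key):
    """Deterministically pick one UUID from a non-empty candidate set."""
    ordered = sorted(candidates)
    digest = hashlib.sha256(key.encode()).digest()
    n = int.from_bytes(digest, "big")
    return ordered[n % len(ordered)]   # modulus by the *subset* size
```

Sorting the candidates makes the pick independent of set iteration order, analogous to the `S.toAscList`/`S.elemAt` pairing in the Haskell code.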
Added a comment: Workaround: --force-small
diff --git a/doc/bugs/unannex_vs_unlock_hook_confusion/comment_2_5bdabf2c6d8f62f407da7038ff5ce202._comment b/doc/bugs/unannex_vs_unlock_hook_confusion/comment_2_5bdabf2c6d8f62f407da7038ff5ce202._comment new file mode 100644 index 0000000000..b05fe1702b --- /dev/null +++ b/doc/bugs/unannex_vs_unlock_hook_confusion/comment_2_5bdabf2c6d8f62f407da7038ff5ce202._comment @@ -0,0 +1,17 @@ +[[!comment format=mdwn + username="Spencer" + avatar="http://cdn.libravatar.org/avatar/2e0829f36a68480155e09d0883794a55" + subject="Workaround: --force-small" + date="2024-08-13T07:05:57Z" + content=""" +One workaround I've (finally) found is `git annex add --force-small` instead of `git add`. This **forces** annex to add the content to git. Phew! + +What's even more interesting is that all along, `git hash-object` has been hashing the contents of the pointer file without me even knowing it. On my system when a file is a pointer file and I have the file contents in my annex: + +- `ls -l` shows the file content size. Dropping the file from the annex changes this number to the pointer file string size (tens of bytes). +- `git hash-object FILE` hashes the **pointer file contents**. Reproduce the hash via `git cat-file -p :/path/to/FILE | git hash-object --stdin`. Trying `echo \"pointer\" | git hash-object --stdin` won't work with or without spaces. Also, I can `cat <file> | git hash-object --stdin` to see the real hash of the file contents. + +In summary, *annex is committing what I want*: the hash of the actual contents stored in git. `hash-object`, annex, and git are somehow recognizing the file as a pointer file where `ls` cannot. I assume this is done by annex behind the scenes, which fascinates me because `git hash-object` otherwise isn't affected by repositories and can be run anywhere on any file. + +Going forward - for others who run into this issue - you can use `git annex add --force-small` to overcome this confusion with unlock. +"""]]
Added a comment: Exact Moment Things Go Wrong
diff --git a/doc/bugs/unannex_vs_unlock_hook_confusion/comment_1_76ef9e8f1aeee084ec69dc3c9a16c0e8._comment b/doc/bugs/unannex_vs_unlock_hook_confusion/comment_1_76ef9e8f1aeee084ec69dc3c9a16c0e8._comment new file mode 100644 index 0000000000..c340b43834 --- /dev/null +++ b/doc/bugs/unannex_vs_unlock_hook_confusion/comment_1_76ef9e8f1aeee084ec69dc3c9a16c0e8._comment @@ -0,0 +1,20 @@ +[[!comment format=mdwn + username="Spencer" + avatar="http://cdn.libravatar.org/avatar/2e0829f36a68480155e09d0883794a55" + subject="Exact Moment Things Go Wrong" + date="2024-08-13T06:22:11Z" + content=""" +Hopefully this specific issue can be reproduced: + +1. Have a repo with an annexed file committed. +2. Run `git annex unannex` on the locked file. +3. Run `git commit` to save the file as deleted on the index. +4. Drop the file contents in git annex (useful to have a remote with contents so you don't have to --force) by key (`git annex drop --key KEY`) +4a. Has to be done by key because `git annex unused` does NOT show the key as unused. +4b. Instead, `git annex whereused --key KEY --historical` should show `[here] branch~X:path/to/file` i.e. it's used X commits prior to the head `branch` +5. `git annex findkeys` to see key not there. +6. `git add FILE` +7. Key now back in annex, e.g. under `findkeys`. +7a. At this point, dropping the file contents appears to change the file size in `ls -Al`: a tiny (tens of bytes) file tells you that it's really a pointer file. +8. Never during this process will `ls -Al` show any indication that the file isn't a normal file after unannexing. inode = 1, no symlink. Just the file size changes if the contents aren't in the annex. +"""]]
.md linting
diff --git a/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn b/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn index 9737248994..0893842dab 100644 --- a/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn +++ b/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn @@ -21,6 +21,7 @@ local repository version: 10 ``` Install Details (Brew) + ``` ==> git-annex: stable 10.20240808 (bottled), HEAD Manage files with git without checking in file contents
diff --git a/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn b/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn new file mode 100644 index 0000000000..9737248994 --- /dev/null +++ b/doc/bugs/unannex_vs_unlock_hook_confusion.mdwn @@ -0,0 +1,48 @@ +# Committing File Contents to Git: Unlock Confusion + +I cannot convert a file from being annexed to its content being committed to git. Instead, annex commits a pointer to git as if the file were to be unlocked. This is regardless of if the key exists in `git/annex/objects`. There is no workaround it seems. At this point I have an annexed file committed to a repo. If I want to go back and commit the file contents to git instead, I tried the workaround of committing the deletion after running `git annex unannex` and then committing the file again via `git commit`. However, this still only commits the pointer contents to git as shown by `git annex HASH`. What's worse, the HASH - found from `git log --raw` is the same hash that can be gotten from `git hash-object FILE`. So it looks like the file content committed correctly but it's not. + +It appears to be that the hook hashes the file content, and if that content has ever been logged in the git-annex branch logs, it assumes the user just unlocked the file. I would hope that what is shown in `git log --raw` is in fact representative of the *content* saved to the git repo. I would assert then that annex should commit a git object with a hash for a pointer file that is **different** than for the file contents. So, if I have a "unlocked" pointer file of contents `/annex/objects/MD5E-s87104--942e5878169ea672dc8ab47889694974.txt` the object should be `6a/0da5de8f1a16a30b713b180972dadacb1edd7a`. Then if I manually hash-object the file and see `80d6030a72be1bb60644df613b1597793263a8d5` (the hash of the actual contents in my case) I can see that this content is in fact NOT within my git history yet. 
+ +I notice that when I truly unlock a file, because I have (by default) `annex.thin=false`, the file content moves out of the annex on unlock, but *folder structure remains*. This is in contrast to unannex where the emptied `annex/objects/` tree is deleted. Maybe the hook checks for the existence of empty folders in the annex as a signal of unlock versus unannex? More trivially, if `annex.thin=true`, then maybe the inode count can indicate unlocking. + +In case this is platform dependent here is my info: + +``` +git-annex version: 10.20240701 +build flags: Assistant Webapp Pairing FsEvents TorrentParser MagicMime Benchmark Feeds Testsuite S3 WebDAV +dependency versions: aws-0.24.2 bloomfilter-2.0.1.2 crypton-1.0.0 DAV-1.3.4 feed-1.3.2.1 ghc-9.6.3 http-client-0.7.17 persistent-sqlite-2.13.3.0 torrent-10000.1.3 uuid-1.3.15 yesod-1.6.2.1 +key/value backends: SHA256E SHA256 SHA512E SHA512 SHA224E SHA224 SHA384E SHA384 SHA3_256E SHA3_256 SHA3_512E SHA3_512 SHA3_224E SHA3_224 SHA3_384E SHA3_384 SKEIN256E SKEIN256 SKEIN512E SKEIN512 BLAKE2B256E BLAKE2B256 BLAKE2B512E BLAKE2B512 BLAKE2B160E BLAKE2B160 BLAKE2B224E BLAKE2B224 BLAKE2B384E BLAKE2B384 BLAKE2BP512E BLAKE2BP512 BLAKE2S256E BLAKE2S256 BLAKE2S160E BLAKE2S160 BLAKE2S224E BLAKE2S224 BLAKE2SP256E BLAKE2SP256 BLAKE2SP224E BLAKE2SP224 SHA1E SHA1 MD5E MD5 WORM URL GITBUNDLE GITMANIFEST VURL X* +remote types: git gcrypt p2p S3 bup directory rsync web bittorrent webdav adb tahoe glacier ddar git-lfs httpalso borg rclone hook external +operating system: darwin aarch64 +supported repository versions: 8 9 10 +upgrade supported from repository versions: 0 1 2 3 4 5 6 7 8 9 10 +local repository version: 10 +``` + +Install Details (Brew) +``` +==> git-annex: stable 10.20240808 (bottled), HEAD +Manage files with git without checking in file contents +https://git-annex.branchable.com/ +Installed +/opt/homebrew/Cellar/git-annex/10.20240701 (11 files, 167.2MB) * + Poured from bottle using the formulae.brew.sh API on 2024-07-18 at 
13:46:03 +From: https://github.com/Homebrew/homebrew-core/blob/HEAD/Formula/g/git-annex.rb +License: AGPL-3.0-or-later and BSD-2-Clause and BSD-3-Clause and GPL-2.0-only and GPL-3.0-or-later and MIT +==> Dependencies +Build: cabal-install ✘, ghc@9.8 ✘, pkg-config ✔ +Required: libmagic ✔ +==> Options +--HEAD + Install HEAD version +==> Caveats +To start git-annex now and restart at login: + brew services start git-annex +Or, if you don't want/need a background service you can just run: + /opt/homebrew/opt/git-annex/bin/git-annex assistant --autostart +==> Analytics +install: 542 (30 days), 1,832 (90 days), 6,629 (365 days) +install-on-request: 439 (30 days), 1,574 (90 days), 5,735 (365 days) +build-error: 0 (30 days) +```