Please describe the problem.
Doing backup of data using git annex move
from local to dropbox via git-annex-remote-rclone
.
Full invocation here is
( source ~/git-annexes/10.20231227+git24-gd37dbd62b8.env; OUR_LOG_FILE=~/.cache/rclone/logs/`date --iso-8601=seconds`-movedebug; RCLONE_LOG_LEVEL=DEBUG RCLONE_LOG_FILE=${OUR_LOG_FILE}_rclone.log git annex --debug move --to dandi-dandisets-dropbox sub-484677/sub-484677_ses-20210419T161140_behavior+ecephys+ogen.nwb 2>&1 | tee ${OUR_LOG_FILE}_git-annex.log ; )
under /mnt/backup/dandi/dandisets/000363
. On git-annex side we see
dandi@drogon:~/.cache/rclone/logs$ grep -4 'Transfer stalled' 2024-01-04T14:28:29-05:00-movedebug_git-annex.log
[2024-01-04 14:43:24.5722525] (Annex.TransferrerPool) < op 295000000000
75% 274.74 GiB 7 MiB/s 3h46m2024/01/04 14:44:49 DEBUG : Setting --log-level "DEBUG" from environment variable RCLONE_LOG_LEVEL="DEBUG"
2024/01/04 14:44:49 DEBUG : Setting --log-file "/home/dandi/.cache/rclone/logs/2024-01-04T14:28:29-05:00-movedebug_rclone.log" from environment variable RCLONE_LOG_FILE="/home/dandi/.cache/rclone/logs/2024-01-04T14:28:29-05:00-movedebug_rclone.log"
Transfer stalled
[2024-01-04 14:45:40.042053663] (Utility.Process) process [2329029] chat: /home/dandi/git-annexes/10.20231227+git24-gd37dbd62b8/usr/lib/git-annex.linux/git-annex ["transferrer","-c","annex.debug=true"]
and rclone logs for around that time point(s):
2024/01/04 14:44:49 DEBUG : rclone: Version "v1.61.1-DEV" starting with parameters ["rclone" "copy" ".git/annex/tmp/SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb" "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757/"]
2024/01/04 14:44:49 DEBUG : Creating backend with remote ".git/annex/tmp/SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb"
2024/01/04 14:44:49 DEBUG : Using config file from "/home/dandi/.rclone.conf"
2024/01/04 14:44:49 DEBUG : fs cache: adding new entry for parent of ".git/annex/tmp/SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb", "/mnt/backup/dandi/dandisets/000363/.git/annex/tmp"
2024/01/04 14:44:49 DEBUG : Creating backend with remote "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757/"
2024/01/04 14:44:49 DEBUG : fs cache: renaming cache item "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757/" to be canonical "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757"
2024/01/04 14:44:49 DEBUG : SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Need to transfer - File not found at Destination
2024/01/04 14:44:50 DEBUG : SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 1/20
...
2024/01/04 14:45:36 DEBUG : SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 19/20
2024/01/04 14:45:38 DEBUG : SHA256E-s393314879887-S1000000000-C296--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 20/20
2024/01/04 14:45:39 INFO : Signal received: interrupt
2024/01/04 14:45:39 INFO : Dropbox root 'dandi-dandisets/annexstore/b4f/757': Committing uploads - please wait...
2024/01/04 14:45:39 INFO : Exiting...
using older git-annex I got similar crash but on earlier chunk, so it is not pertinent to last chunk
2024/01/04 13:56:29 DEBUG : rclone: Version "v1.61.1-DEV" starting with parameters ["rclone" "copy" ".git/annex/tmp/SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb" "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757/"]
2024/01/04 13:56:29 DEBUG : Creating backend with remote ".git/annex/tmp/SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb"
2024/01/04 13:56:29 DEBUG : Using config file from "/home/dandi/.rclone.conf"
2024/01/04 13:56:29 DEBUG : fs cache: adding new entry for parent of ".git/annex/tmp/SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb", "/mnt/backup/dandi/dandisets/000363/.git/annex/tmp"
2024/01/04 13:56:29 DEBUG : Creating backend with remote "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757/"
2024/01/04 13:56:30 DEBUG : fs cache: renaming cache item "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757/" to be canonical "dandi-dandisets-dropbox:dandi-dandisets/annexstore/b4f/757"
2024/01/04 13:56:30 DEBUG : SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Need to transfer - File not found at Destination
2024/01/04 13:56:30 DEBUG : SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 1/20
...
2024/01/04 13:57:08 DEBUG : SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 17/20
2024/01/04 13:57:10 DEBUG : SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 18/20
2024/01/04 13:57:11 DEBUG : SHA256E-s393314879887-S1000000000-C287--38c8c93c66ece7cbe8eeca9621ab72bf3968873e842f3e376ddda012eb0a0e5f.nwb: Uploading chunk 19/20
2024/01/04 13:57:14 INFO : Signal received: interrupt
2024/01/04 13:57:14 INFO : Dropbox root 'dandi-dandisets/annexstore/b4f/757': Committing uploads - please wait...
2024/01/04 13:57:14 INFO : Exiting...
So it seems that stall detection (I didn't look how it even decides on that) is flawed in case of large (chunked) files being transferred through external custom remotes which have no way to report progress as it is the case here in https://github.com/git-annex-remote-rclone/git-annex-remote-rclone/blob/master/git-annex-remote-rclone (filed a bug report to see that addressed).
What version of git-annex are you using? On what operating system?
first was a year old version, then tried with bleeding edge 10.20231227+git24-gd37dbd62b8
It looks like you must have annex.stalldetection (or the per-remote config) set. git-annex does not behave this way without that configuration. What is it set to?
You are probably right in that it involves rclone special remote not reporting transfer progress back to git-annex.
Normally, when a special remote does not do progress reporting, git-annex does not do any stall detection, because there must have been at least some previous progress update in order for it to detect a stall.
But when chunking is enabled (as it was in your case with 1 gb chunks), git-annex itself updates the progress after each chunk. When the special remote does not do any progress reporting, and chunk size is large, that means that the progress will be updated very infrequently.
So for example, if it takes 2 minutes to upload a chunk, and you had annex.stalldetection set to eg "10mb/1m", then in a chunk after the 1st one, git-annex would wake up after 1 minute, see that no data seemed to have been sent, and conclude there was a stall. You would need to change the time period in the config to something less granular eg "100mb/10m" to avoid that.
This might be a documentation problem, it may not be clear to the user that "10mb/1m" is any different than "100mb/10m". And finding a value that works depends on knowing details of how often the progress gets updated for a given remote.
But, your transcript show that the stall was detected on chunk 296. (git-annex's chunk, rclone is doing its own chunking to dropbox) So the annex.stalldetection configuration must have been one that worked most of the time, for it to transfer the previous 295 chunks without a stall having been detected. Unless this was a resume after previous command(s) had uploaded those other chunks.
I think that what git-annex could do is detect when progress updates are happening with too low a granularity for the annex.stalldetection configuration.
When waiting for the first progress update, it can keep track of how much time has elapsed. If annex.stalldetection is "10mb/2m" and it took 20 minutes to get the first progress update, the granularity is clearly too low.
And then it could disregard the configuration, or suggest a better configuration value, or adjust what it's expecting to match the observed granularity.
(The stall detection auto-prober uses a similar heuristic to that already. It observes the granularity and only if it's sufficiently low (an update every 30 seconds or less) does it assume that 60 seconds without an update may be a stall.)
To reproduce this (more or less), I modified the example.sh external special remote to sleep for 10 seconds before each key store. Set up a remote with chunk=1mb, and annex.stalldetection = "0.001mb/1s".
Uploading a 100 mb file, a stall is detected after the first chunk is uploaded. As expected, since 1 second passed with no update.
When I resume the upload, the second chunk is uploaded and then a stall is detected on the third. And so on.
I've implemented dynamic granularity scaling now, and in this test case, it notices it took 11 seconds for the first chunk, and behaves as if it were configured with annex.stalldetection of "0.022mb/22s". Which keeps it from detecting a stall.
Thank you Joey for looking into it! FWIW, indeed I had
stalldetection = 1KB/120s
. But I believe I introduced it primarily for data "download", where git-annex actually IIRC keeps its monitoring of the file size as it is arriving. For thecopy
out, situation is indeed different and in case of assymmetric connections I even wondered if worth allowing different values for stalldetection-get and -put?In this particular case, since original report we did adjust rclone special remote to report PROGRESS back. I think it completed fine on few 1 large file uploads. I will tomorrow (after we get daily git-annex build in) run it "on full" for a good number of files and see what happens.
git-annex does do file size monitoring in a few cases (when running
cp
, and when downloading an url withcurl
), but it does not do it for downloads in general, and not for external special remotes specifically.If it did for external special remotes, and if rclone grows the destination file as it downloads, rather than atomically at the end (which I have not verified), that would have avoided this problem. Probably it will avoids this type of problem for downloads from the majority of external special remotes. Though of course not for uploads.
Ok, implemented that for external special remotes.
Re asymmetric connections, stall detection doesn't need to be configured anywhere near the total available bandwidth, so generally it would be ok to configure it for a slow uplink even when a fast downlink is available. And if you're only interested in detecting total stalls (as opposed to some kind of "too slow" condition), "1k" per some unit of time is sufficient.
But maybe it does make sense to be able to configure it to abandon slow downloads but allow slow uploads. For example,
git-annex get
with the content on several remotes, where the download speed from one remote is often fast but occasionally slows down, and another remote is consistently medium speed.So you might set "10gb/1m" for downloads from remote, knowing that if it is slow it will abort the download from it and fall back to the medium speed remote. But when sending content to the variable speed remote, would not want to give up only because it was a little slow.
Ok, added annex.stalldetection-download, annex.stalldetection-upload, etc.