Recent comments posted to this site:

On the ssh config, one way to do it is to pass -F with a config file that git-annex generates. It could look like:

Include ~/.ssh/config
Include /etc/ssh/ssh_config
ServerAliveInterval 60

Since ssh uses the first config setting it sees, if ~/.ssh/config or /etc/ssh/ssh_config set a ServerAliveInterval that one will be used, and otherwise the value git-annex sets will be used.
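A rough sketch of generating such a config file and passing it with -F. This is only an illustration, not git-annex's actual code; the function names and the `ssh_config` filename are made up here. Note that the Include directive needs a reasonably recent OpenSSH (7.3 or later, as far as I know).

```python
import os

def write_ssh_config(directory, interval=60):
    """Write a wrapper ssh config (hypothetical helper).

    The Include lines come first, so a ServerAliveInterval set in the
    user's or the system's config is seen first and wins; the value
    written here only acts as a fallback.
    """
    lines = [
        "Include ~/.ssh/config",
        "Include /etc/ssh/ssh_config",
        f"ServerAliveInterval {interval}",
    ]
    path = os.path.join(directory, "ssh_config")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return path

def ssh_command(config_path, host, remote_command):
    """Build an ssh argv that uses the generated config via -F."""
    return ["ssh", "-F", config_path, host, remote_command]
```

The order of the lines is the whole point: moving ServerAliveInterval above the Include lines would make it override the user's setting instead.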

But ssh enables TCPKeepAlive by default. You'd think that would be enough to detect this kind of problem.

There do seem to be reasons for users to disable TCPKeepAlive; perhaps it causes annoying disconnects when there's a minor hiccough, or a firewall does not support it.

If the problem is that users are disabling TCPKeepAlive, then having git-annex enable ServerAliveInterval makes sense.

Ok; implemented this.

Comment by joey Wed Oct 26 19:55:18 2016

The most common way a network connection can stall like this is when moving to a different wifi network: the connection is open but no more data will be received. I suppose other kinds of network glitches could also lead to this kind of situation.

ssh has some things, like ServerAliveInterval and TCPKeepAlive, that it can use to detect such problems. You may find them useful.

As for retrying once a stall is detected, some transfers use forwardRetry, which will automatically retry as long as the failed try managed to send some data. But the get/move/copy commands currently use noRetry. I can't find any justification for not always using forwardRetry; I think that it was added for the assistant originally and the other stuff just never switched over.
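The idea behind retrying only while progress is being made can be sketched like this. It is a simplified illustration in Python, not the real forwardRetry from git-annex; `attempt` is a hypothetical callback that tries the transfer from a given offset and reports how far it got.

```python
def transfer_with_forward_retry(attempt, start=0):
    """Retry a transfer as long as each failed try moved forward.

    attempt(offset) is a hypothetical callable returning (ok, new_offset),
    where new_offset is how many bytes have been transferred in total.
    """
    offset = start
    while True:
        ok, new_offset = attempt(offset)
        if ok:
            return True
        if new_offset <= offset:
            # The failed try sent no new data, so give up rather than loop.
            return False
        offset = new_offset
```

A try that fails without sending anything ends the loop, so a remote that is simply down does not get retried forever.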

Only problem I can think of is, if there actually is a ssh password prompt, it would prompt again on retry. But most people using git-annex with ssh have something in place to make ssh not prompt repeatedly for passwords.

So, I've gone ahead and enabled forwardRetry for everything.

Occurs to me that git-annex could try to notice when a transfer is not progressing, by reusing the existing progress metering code.

Since some remotes don't update the progress meter, this could only be used to detect stalls after the progress meter has been updated at least once. A stall that occurs earlier than that could not be detected.

It seems quite hard to come up with a good timeout value to detect a stalled connection. Often progress meters are updated after every small (eg 32kb) chunk transferred. But others might poll periodically, or might use a larger chunk size. It's even possible that some special remotes are looking at a percent output by some program, and only update the meter when the percent transferred changes -- in which case it could be many minutes in between each meter update when a large file is being transferred.

If the timeout is too short, git-annex will stall in a new way, by constantly killing "stalled" connections before they can send enough data.
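To make the constraints above concrete, here is a minimal sketch of a meter-based stall detector. The class and its names are hypothetical, not anything that exists in git-annex, and the timeout is left as a parameter precisely because there is no obviously good value for it.

```python
import time

class StallDetector:
    """Flag a transfer as stalled when the progress meter stops advancing."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds without progress before "stalled"
        self.clock = clock          # injectable so the logic can be tested
        self.last_bytes = None
        self.last_update = None

    def update(self, bytes_done):
        """Call this whenever the remote updates its progress meter."""
        if self.last_bytes is None or bytes_done > self.last_bytes:
            self.last_bytes = bytes_done
            self.last_update = self.clock()

    def stalled(self):
        if self.last_update is None:
            # The meter was never updated; a stall cannot be told apart
            # from a remote that does not report progress at all.
            return False
        return self.clock() - self.last_update > self.timeout
```

With a remote that only updates the meter every few minutes, any timeout shorter than that interval would report a stall on a perfectly healthy transfer, which is the failure mode described above.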

So it really seems better to fix the ssh connection to not stall, since that is not so heuristic a fix. Seems like git-annex could force ServerAliveInterval to be set, and perhaps lower ServerAliveCountMax from 3 to 1. The ssh BatchMode setting sets the former to 300, so a stalled connection will time out after 15 minutes. But BatchMode also disables prompting, and git-annex should not disable that.
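The 15 minute figure comes from multiplying the two settings. A tiny worked version follows; the function name is made up, and the result is only approximate, since ssh gives up after roughly interval times count seconds without a reply.

```python
def ssh_stall_timeout(server_alive_interval, server_alive_count_max=3):
    """Approximate seconds before ssh drops an unresponsive connection."""
    return server_alive_interval * server_alive_count_max

# BatchMode's interval of 300 with the default count of 3:
batchmode_timeout = ssh_stall_timeout(300)       # 900 seconds, i.e. 15 minutes
# An interval of 60 with the count lowered to 1:
lowered_timeout = ssh_stall_timeout(60, 1)       # 60 seconds
```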

Catch is, what if the user has configured ssh with some other ServerAliveInterval value? We don't want git-annex to override that.

(git-annex does have a rudimentary .ssh/config parser, but it's not good enough to handle eg, "Host * ")

Comment by joey Wed Oct 26 18:26:35 2016

This kind of thing tends to be due to a problem with locales, or a filename in the repository that can't be represented under the current locale.

Just so happens that the version you upgraded to changed how the standalone tarball for linux handles locales. Did you install using that tarball?

What does locale say?

Comment by joey Wed Oct 26 18:21:13 2016

I don't know what you mean when you say "FAT link". Do you mean it's a regular file that contains what looks like the pointer of a symlink?

Comment by joey Wed Oct 26 18:18:57 2016

The first try at making git-annex ignore the fsck lines about duplicate entries didn't quite work. The second try landed 8 days ago and has not been in a release yet, so that's probably why you continue to see the problem.

I don't see how deleting a special remote could lead to this. But we know that git annex adjust --unlock did, for another user.

Comment by joey Wed Oct 26 18:14:30 2016

Well these look like the names I'd expect to see used for encrypted files.

If you had the assistant watching some files that were frequently changing, then it could lead to something like this, since many commits would be made of many versions of a file, and each version backed up to a new encrypted file in the special remote. You can take a look at git log -S in the git repository and see if there are many commits of some of your files.

Comment by joey Wed Oct 26 18:07:51 2016

You need to provide more detail about "lock contention on .git/config". Ideally an actual error message.

Normally it's perfectly fine to run multiple git annex copy, or any other git-annex command for that matter.

And the only thing I know of that locks .git/config is when the configuration is being changed, which doesn't normally happen when copying to a remote, unless perhaps this remote has never been used before.

Comment by joey Wed Oct 26 18:02:43 2016

The behavior your comment describes is only the case with v5 repositories in indirect mode. With v5 direct mode repositories and with the newer v6 repository format, git annex unannex is able to safely handle files that have been added and not committed yet.

Comment by joey Wed Oct 26 17:53:23 2016

@scottgorlin bare git repositories cannot in general be detected when looking at a remote, so git annex sync picks a behavior that works whether a remote is a bare git repository or not.

A bare repo and an rsync special remote should have pretty similar performance.

Comment by joey Wed Oct 26 17:44:55 2016

Well no, the filename passed to "TRANSFER STORE" is wherever the content of the file is; in most circumstances it will not be a file in the working tree.

(And even if the filename is a worktree file in some case, the special remote needs to support storing multiple versions of a file. So trying to use the name used in the working tree on the special remote seems very problematic.)

In any case, the external special remote protocol already has SETURLPRESENT which can be used if a TRANSFER STORE makes a key be available at an url.
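A sketch of how an external special remote might answer a TRANSFER STORE request and announce a url with SETURLPRESENT. The protocol messages are the ones from the external special remote protocol; everything else is an assumption for illustration: `store` and `url_for` are hypothetical callbacks, and a real remote would read requests from stdin and write replies to stdout.

```python
def handle_transfer_store(line, store, url_for):
    """Handle one 'TRANSFER STORE Key File' request and return reply lines.

    store(key, path) is a hypothetical upload function that raises on
    failure; url_for(key) is a hypothetical function returning the url
    where the stored key can be fetched, or None if there is none.
    """
    # Keys never contain spaces, but the filename may, so split at most
    # three times and keep the rest of the line as the filename.
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) != 4 or parts[:2] != ["TRANSFER", "STORE"]:
        return ["UNSUPPORTED-REQUEST"]
    key, path = parts[2], parts[3]
    try:
        store(key, path)
    except Exception as e:
        # Replies are single lines, so flatten any newlines in the error.
        msg = str(e).replace("\n", " ")
        return [f"TRANSFER-FAILURE STORE {key} {msg}"]
    replies = []
    url = url_for(key)
    if url is not None:
        replies.append(f"SETURLPRESENT {key} {url}")
    replies.append(f"TRANSFER-SUCCESS STORE {key}")
    return replies
```

Note that the url is derived from the key, not from the filename that was passed in, since that file is usually somewhere under .git/annex and not in the working tree.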

Comment by joey Wed Oct 26 17:29:46 2016