Recent comments posted to this site:

This has occurred for me yet again, after starting a new remote.

Strangely, 'git fsck' succeeds, showing only dangling objects, but 'git annex sync' fails on commit with this error.

litelog is a set of service scripts I'm making which automatically record and log from devices when they are connected. Voice recordings and sensor logs are copied off Android phones from a handful of supported apps.

Comment by xloem Sun May 29 02:02:38 2016

Union merge on Windows does indeed add \r onto lines.

Looks like hashBlob is at fault; it writes a string to a temp file, and the IO layer does CRLF conversion at that point.

The git-annex branch transition code also uses hashBlob so would also do it.
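
To illustrate the kind of conversion described above, here is a minimal Python sketch. It is not git-annex's Haskell code and does not reproduce hashBlob; the helper names are hypothetical, and `newline="\r\n"` is only used to imitate a Windows text-mode handle on any platform. It shows that the same string written in text mode and in binary mode ends up as different bytes, and so gets a different git blob hash.

```python
import hashlib
import os
import tempfile


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA-1 git computes for a blob with these bytes."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def write_text_mode(path: str, content: str) -> None:
    # newline="\r\n" imitates a Windows text-mode handle on any platform:
    # every "\n" written is translated to "\r\n".
    with open(path, "w", encoding="utf-8", newline="\r\n") as f:
        f.write(content)


def write_binary_mode(path: str, content: str) -> None:
    # Binary mode writes the bytes exactly as given, with no translation.
    with open(path, "wb") as f:
        f.write(content.encode("utf-8"))


def bytes_on_disk(writer, content: str) -> bytes:
    """Write content to a temp file using writer, then read the raw bytes back."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        writer(path, content)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


log = "1444995329.589s 1 0866153a-19e5-4382-aeb6-30e8210706cc\n"
converted = bytes_on_disk(write_text_mode, log)
untouched = bytes_on_disk(write_binary_mode, log)
print(converted.endswith(b"\r\n"), untouched.endswith(b"\r\n"))
print(git_blob_sha1(converted) == git_blob_sha1(untouched))
```

Writing the temp file in binary mode (or otherwise disabling newline translation on the handle) is the general shape of a fix for this class of problem.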

So I've reproduced the root cause of this now. Fixing..

Comment by joey Fri May 27 19:03:05 2016

Non-alphanumerics are stripped.

This also results in a file "foo.ba________________________r" having ".bar" picked as its extension.

I think the fix is to filter out over-long "extensions" before stripping the non-alphanumerics.
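
The ordering problem can be sketched in Python as follows. This is only an illustration, not the git-annex implementation: the function names are hypothetical and the length limit of 4 is an assumed value chosen for the example.

```python
MAX_EXT_LEN = 4  # assumed limit, for illustration only


def strip_non_alnum(s: str) -> str:
    """Drop every character that is not a letter or digit."""
    return "".join(c for c in s if c.isalnum())


def ext_strip_then_filter(filename: str) -> str:
    """Problematic order: strip first, then check the length."""
    _, dot, raw = filename.rpartition(".")
    if not dot:
        return ""
    cleaned = strip_non_alnum(raw)
    return "." + cleaned if 0 < len(cleaned) <= MAX_EXT_LEN else ""


def ext_filter_then_strip(filename: str) -> str:
    """Proposed order: reject over-long candidates before stripping."""
    _, dot, raw = filename.rpartition(".")
    if not dot or len(raw) > MAX_EXT_LEN:
        return ""
    cleaned = strip_non_alnum(raw)
    return "." + cleaned if cleaned else ""


name = "foo.ba________________________r"
print(repr(ext_strip_then_filter(name)))
print(repr(ext_filter_then_strip(name)))
```

With the length check done first, the long run of underscores disqualifies the candidate before stripping can shrink it to something that looks like a short extension.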

Comment by joey Fri May 27 17:06:41 2016

Please open todo items for patches, don't send them as comments here.

(I suspect that the patch as provided might break compilation with old versions of ghc, or old versions of yesod..)

Comment by joey Fri May 27 15:45:49 2016

I've developed a patch which yields a good whereis display in this repo.

It still remains to be seen whether there's some code path that causes '\r' to get added in the current version of git-annex on Windows.

Comment by joey Fri May 27 15:40:16 2016

In git-annex version 5.20140606, we have:

  * Windows: Fix bug introduced in last release that caused files
    in the git-annex branch to have lines terminated with \r.

There was only 1 week where git-annex had that bug AFAIK, but of course the broken version could have kept on being used for some time.

It's also always possible that this same problem popped up again in the code. The fix in 1ab3d7c81049e4ce7e8e47800e2ef1fecb3a9ab4 was to Annex.Journal and that still seems ok. The merge code is another place this could perhaps happen.

Comment by joey Fri May 27 15:02:05 2016

I tried reproducing this artificially by duplicating a presence log line. That didn't work; whereis showed only 1 copy, not multiple ones. Which makes sense, because the log reader makes a map that has the UUID as the key. So a single UUID can't appear more than once at that point. Good!

So, there's something else going on in the problem repository that allows this behavior to occur.

Aha! It's sequences of one or more \r at the end of the line.

fromList [("0866153a-19e5-4382-aeb6-30e8210706cc",LogLine {date = 1444995329.589s, status = InfoPresent, info = "0866153a-19e5-4382-aeb6-30e8210706cc"}),("0866153a-19e5-4382-aeb6-30e8210706cc\r",LogLine {date = 1444995329.589s, status = InfoPresent, info = "0866153a-19e5-4382-aeb6-30e8210706cc\r"}),("0866153a-19e5-4382-aeb6-30e8210706cc\r\r",LogLine {date = 1444995329.589s, status = InfoPresent, info = "0866153a-19e5-4382-aeb6-30e8210706cc\r\r"}) ...

So, 0866153a-19e5-4382-aeb6-30e8210706cc seems to appear multiple times, but it's really due to these \r's.

Suspect this means that this problem only impacts display. It should not lead to data loss, because no remote will have a UUID ending in `\r', so there should be no way for git-annex to somehow count a remote twice as containing a copy of a file when dropping. Indeed, we can see in the whereis output that it only matches up some instances of the "duplicate" uuid with remotes -- because the other instances have carriage returns appended.

Also, this suggests that the reason the duplicate lines occurred in the first place was something to do with a Windows system, which presumably added some \r\n that is being stumbled over.
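
A small Python sketch of why a map keyed by UUID does not collapse these entries. It is not the actual log reader; the function name and the `strip_cr` switch are hypothetical, and the line format (timestamp, status, UUID separated by spaces) is taken from the log lines quoted in this thread.

```python
def read_presence_log(text: str, strip_cr: bool = False) -> dict:
    """Parse 'timestamp status uuid' lines into a dict keyed by uuid."""
    entries = {}
    for line in text.split("\n"):
        if strip_cr:
            # Drop any run of trailing carriage returns before parsing.
            line = line.rstrip("\r")
        if not line.strip():
            continue
        timestamp, status, uuid = line.split(" ", 2)
        # A later line with the same key overwrites the earlier one.
        entries[uuid] = (float(timestamp.rstrip("s")), status)
    return entries


uuid = "0866153a-19e5-4382-aeb6-30e8210706cc"
raw = "".join(
    "1444995329.589s 1 " + uuid + cr + "\n" for cr in ("", "\r", "\r\r")
)

print(len(read_presence_log(raw)))
print(len(read_presence_log(raw, strip_cr=True)))
```

Without stripping, each variant of the UUID with trailing \r is a distinct key, so the map holds several entries that look identical when displayed.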

Comment by joey Fri May 27 14:47:40 2016

Received a clone of this repository (in git-annex-test-repos/annex.bundle here), and was able to reproduce the bug.

Looking at one duplicate UUID for one file, I see:

1444995510.830128s 1 0866153a-19e5-4382-aeb6-30e8210706cc
1444995510.830128s 1 0866153a-19e5-4382-aeb6-30e8210706cc
1444995510.830128s 1 0866153a-19e5-4382-aeb6-30e8210706cc
1444995510.830128s 1 0866153a-19e5-4382-aeb6-30e8210706cc
1444995510.830128s 1 0866153a-19e5-4382-aeb6-30e8210706cc

The notable thing here is not that there are multiple lines for a UUID, but that they somehow have the exact same timestamp down to the microsecond.

I'm a) unsure how this could happen and b) afraid that the log file compaction fails in this case, with catastrophic results.

Regarding how this could happen, git blame shows a single commit adding duplicate lines with the same timestamp. Commit message was "update". The commit touched a wide swath of the repository, including even non-location-log files like trust.log, which also got duplicate lines with the same timestamp.

Some of the lines were entirely new, but some existing lines also got duplicated.

There were some duplicate lines before this commit, so it was not an isolated incident.

Clearly, log compaction needs to collapse down lines that are identical except for timestamp. The location log code also needs to throw out all but one current item for a given uuid, since other code treats each returned location as a copy, expecting there to not be any duplicate UUIDs. With these changes, whatever caused these duplicate lines to occur in the first place at least won't result in weird output or data loss. I have not verified yet if data loss can actually occur in this case.
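
As a rough sketch of the two changes described above, in Python rather than git-annex's Haskell: the `LogLine` tuple and both function names are hypothetical, and treating status "1" as "present" is an assumption based on the sample lines.

```python
from typing import List, NamedTuple


class LogLine(NamedTuple):
    timestamp: float
    status: str  # assumed: "1" means present, "0" means not present
    info: str    # the UUID for a location log


def compact(lines: List[LogLine]) -> List[LogLine]:
    """Keep only the newest line for each info value.

    Lines that differ only in timestamp, and exact duplicates,
    collapse down to a single entry.
    """
    newest = {}
    for line in lines:
        kept = newest.get(line.info)
        if kept is None or line.timestamp > kept.timestamp:
            newest[line.info] = line
    return sorted(newest.values(), key=lambda l: (l.timestamp, l.info))


def current_locations(lines: List[LogLine]) -> List[str]:
    """Return each UUID at most once, and only if its newest status is present."""
    return [l.info for l in compact(lines) if l.status == "1"]


uuid = "0866153a-19e5-4382-aeb6-30e8210706cc"
duplicates = [LogLine(1444995510.830128, "1", uuid)] * 5
print(len(compact(duplicates)))
print(current_locations(duplicates))
```

Because callers count each returned location as a copy, returning each UUID at most once keeps duplicate log lines from inflating the copy count.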

Comment by joey Fri May 27 14:26:18 2016

I have a patch:

https://github.com/ggreif/git-annex/tree/patch-1

It heals e.g.

Assistant/WebApp/Form.hs:52:1: warning: [-Wredundant-constraints]
    • Redundant constraint: Monad m
    • In the type signature for:
           withNote :: (Monad m, ToWidget (HandlerSite m) a) =>
                       Field m v -> a -> Field m v

Comment by ggreif Fri May 27 14:21:11 2016

"to remote host " so it was "--to". annex is already aware that it has those files in that remote (see below).

$> git annex copy --to=datalad-public --fast .        
git annex copy --to=datalad-public --fast .  7.33s user 0.91s system 55% cpu 14.772 total

$> git annex info
repository mode: indirect
trusted repositories: 0
semitrusted repositories: 5
    00000000-0000-0000-0000-000000000001 -- web
    00000000-0000-0000-0000-000000000002 -- bittorrent
    123c73e5-a8dc-4cff-8ffc-679c7ea67f94 -- yoh@smaug:/mnt/datasets/datalad/crawl/neurovault [here]
    48c1556f-6241-45de-9497-338d437fcb62 -- yoh@falkor:/srv/datasets.datalad.org/www/neurovault/snapshots [datalad-public]
    af2785da-2538-4346-a6f6-f2f30fc3f025 -- [datalad-archives]
untrusted repositories: 0
transfers in progress: none
available local disk space: 31.42 terabytes (+1 megabyte reserved)
local annex keys: 6615
local annex size: 12.77 gigabytes
annexed files in working tree: 6628
size of annexed files in working tree: 6.31 gigabytes
bloom filter size: 32 mebibytes (1.3% full)
backend usage: 
    SHA256E: 6628

$> git annex whereis | head -30               
whereis 1003/13873.nii.gz (3 copies) 
    123c73e5-a8dc-4cff-8ffc-679c7ea67f94 -- yoh@smaug:/mnt/datasets/datalad/crawl/neurovault [here]
    48c1556f-6241-45de-9497-338d437fcb62 -- yoh@falkor:/srv/datasets.datalad.org/www/neurovault/snapshots [datalad-public]
    af2785da-2538-4346-a6f6-f2f30fc3f025 -- [datalad-archives]

  datalad-archives: dl+archive:SHA256E-s6460020224--710cc05117e2290e2f793271d11e26452cdc111121e09a937dbf5a34b3cc0107.tar/neurovault_snapshot/1003/13873.nii.gz#size=23262
ok
whereis 1003/13874.nii.gz (3 copies) 
    123c73e5-a8dc-4cff-8ffc-679c7ea67f94 -- yoh@smaug:/mnt/datasets/datalad/crawl/neurovault [here]
    48c1556f-6241-45de-9497-338d437fcb62 -- yoh@falkor:/srv/datasets.datalad.org/www/neurovault/snapshots [datalad-public]
    af2785da-2538-4346-a6f6-f2f30fc3f025 -- [datalad-archives]
...
> git annex copy --to=datalad-public .       
copy 1003/13873.nii.gz (checking datalad-public...) yoh@datasets.datalad.org's password: 

Comment by EbvxpTI_xP9Aod7Mg4cwGhgjrCrdM5s- [me.yahoo.com/a] Wed May 25 01:09:56 2016