Joey blogs about his work here on a semi-daily basis.

Preparing for a release tomorrow. Yury fixed the Windows autobuilder over the weekend. The OSX autobuilder was broken by my changes Friday, which turned out to have a simple bug that took quite a long time to chase down.

Also added git annex sync --content-of=path to sync the contents of files in a path, rather than in the whole work tree as --content does. I would have rather made this be --content=path but optparse-applicative does not support options that can be either boolean or have a string value. Really, I'd rather git annex sync path do it, but that would be ambiguous with the remote name parameter.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Mar 20 21:17:38 2017

Found a bug in git-annex-shell where verbose messages would sometimes make it output things git-annex didn't expect.

While fixing that, I wanted to add a test case, but the test suite actually does not test git-annex-shell at all. It would need to ssh, which test suites should not do. So, I took a detour..

Support for GIT_SSH and GIT_SSH_COMMAND has been requested before for various reasons. So I implemented that, which took 4 hours. (With one little possible compatibility caveat, since git-annex needs to pass the -n parameter to ssh sometimes, and git's interface doesn't allow for such a parameter.)

Now the test suite can use those environment variables to make mock ssh remotes be accessed using local sh instead of ssh.
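A minimal sketch of the idea, not the code in git-annex's test suite: whatever GIT_SSH_COMMAND names gets called with ssh-style arguments (options, then a hostname, then the remote command), so a stand-in can drop the options and hostname and hand the command to a local sh. The name mock_ssh and the short list of options it knows how to skip are assumptions made for illustration.

```shell
#!/bin/sh
# mock_ssh [ssh options] HOST COMMAND...
# Hypothetical stand-in for ssh: ignores the options and the hostname,
# then runs COMMAND with the local sh instead of on a remote host.
mock_ssh() {
    # Skip leading options. Treating -p and -o as the only options that
    # take a value is an assumption; real ssh has many more.
    while [ $# -gt 0 ]; do
        case "$1" in
            -p|-o) [ $# -ge 2 ] || return 255
                   shift 2 ;;
            -*)    shift ;;
            *)     break ;;
        esac
    done
    [ $# -gt 0 ] || return 255   # no hostname given
    shift                        # discard the hostname
    # Like ssh, join the remaining words with spaces and let a shell parse them.
    sh -c "$*"
}
```

Wrapped in an executable script that ends with mock_ssh "$@", something like this could be pointed to by GIT_SSH_COMMAND so that an "ssh remote" is really just a local directory.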

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Sat Mar 18 00:02:35 2017

The new annex.securehashesonly config setting prevents annexed content that does not use a cryptographically secure hash from being downloaded or otherwise added to a repository.

Using that and signed commits prevents SHA1 collisions from causing problems with annexed files. See using signed git commits for details about how to use it, and why I believe it makes git-annex safe despite git's vulnerability to SHA1 collisions in general.

If you are using git-annex to publish binary files in a repository, you should follow the instructions in using signed git commits.

If you're using git to publish binary files, you can improve the security of your repository by switching to git-annex and signed commits.

Today's work was sponsored by Riku Voipio.

Posted Mon Feb 27 20:12:00 2017

Yesterday I said that a git-annex repository using signed commits and SHA2 backend would be secure from SHA1 collision attacks. Then I noticed that there were two ways to embed the necessary collision generation data inside git-annex key names. I've fixed both of them today, and cannot find any other ways to embed collision generation data in between a signed commit and the annexed files.

I also have a design for a way to configure git-annex to expect to see only keys using secure hash backends, which will make it easier to work with repositories that want to use signed commits and SHA2. Planning to implement that tomorrow.

sha1 collision embedding in git-annex keys has the details.

Posted Sat Feb 25 00:06:43 2017

The first SHA1 collision was announced today, produced by an identical-prefix collision attack.

After looking into it all day, it does not appear to impact git's security immediately, except for targeted attacks against specific projects by very wealthy attackers. But we're well past the time when it seemed ok that git uses SHA1. If this gets improved into a chosen-prefix collision attack, git will start to be rather insecure.

Projects that store binary files in git, and that might be worth $100k for an attacker to backdoor, should be concerned by the SHA1 collisions. A good example of such a project is <git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git>.

Using git-annex (with a suitable backend like SHA256) and signed commits together is a good way to secure such repositories.

Update 12:25 am: However, there are some ways to embed SHA1-colliding data in the names of git-annex keys. That makes git-annex with signed commits be no more secure than git with signed commits. I am working to fix git-annex to not use keys that have such problems.

Posted Thu Feb 23 20:44:24 2017

Today was all about writing the "making a remote repo update when changes are pushed to it" page.

That's a fairly simple page, because I added workarounds for all the complexity of making it work in direct mode repos, adjusted branches, and repos on filesystems not supporting executable git hooks. Basically, the user should be able to set the standard receive.denyCurrentBranch=updateInstead configuration on a remote, and then git push or git annex sync should update that remote's working tree.
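For a plain git repository (no direct mode, no adjusted branch), the setup boils down to one config setting on the remote. The sketch below builds two throwaway repositories to show it; the directory layout, file names and the demo commit identity are all made up, and it assumes a git new enough to know updateInstead.

```shell
#!/bin/sh
# g: git with a throwaway identity, so commits work without any user config.
g() { git -c user.name=demo -c user.email=demo@example.com "$@"; }

# demo_update_instead DIR
# Creates DIR/server (a non-bare repo) and DIR/client (a clone of it),
# pushes a new commit from client to server, then lists the server's files.
demo_update_instead() {
    dir=$1

    git init -q "$dir/server" || return 1
    # Without this setting, git refuses pushes to the checked-out branch.
    git -C "$dir/server" config receive.denyCurrentBranch updateInstead
    echo one >"$dir/server/first.txt"
    g -C "$dir/server" add first.txt
    g -C "$dir/server" commit -q -m "initial commit" || return 1

    git clone -q "$dir/server" "$dir/client" || return 1
    echo two >"$dir/client/second.txt"
    g -C "$dir/client" add second.txt
    g -C "$dir/client" commit -q -m "add second.txt" || return 1

    # Push to the branch the server has checked out; with updateInstead
    # the server's work tree is updated to match.
    g -C "$dir/client" push -q origin HEAD || return 1
    ls "$dir/server"
}
```

Running demo_update_instead "$(mktemp -d)" should list second.txt among the server's files, even though it was only ever created in the client.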

There are a couple of unhandled cases; git push to a remote on a filesystem like FAT won't update it, and git annex sync will only update it if it's local, not accessed over ssh. Also, the emulation of git's updateInstead behavior is not perfect for direct mode repos and adjusted branches.

Still, it's good enough that most users should find it meets their needs, I hope. How to set this kind of thing up is a fairly common FAQ, and this makes it much simpler.

(Oh yeah, the first ancient kernel arm build is still running. May finish before tomorrow.)

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Fri Feb 17 19:56:56 2017

When you see a command like "ssh somehost rm -f file", you probably don't think that consumes stdin. After all, the rm -f doesn't. But, ssh can pass stdin over the network even if it's not being consumed, and it turns out git-annex was bitten by this.

That bug made git-annex-checkpresentkey --batch with a remote accessed over ssh not see all the batch-mode input that was passed into it, because ssh sometimes consumed some of it.

Shell scripts using git-annex could also be impacted by the bug, for example:

#!/bin/sh
find . -type l -atime +100 | \
    while read file; do
        echo "gonna drop $file that has not been used in a while"
        git annex drop "$file"
    done

Depending on what remotes git annex drop talks to, it might consume parts of the output of find.

I've fixed this in git-annex now (using ssh -n when running commands that are not fed some stdin of their own), but this seems like a class of bug that could impact lots of programs that run ssh.
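The effect is easy to reproduce without ssh. In this sketch, greedy is a made-up stand-in for any command that drains stdin without needing it, and redirecting its stdin from /dev/null is the shell-level equivalent of ssh -n.

```shell
#!/bin/sh
# greedy stands in for something like "ssh somehost rm -f file": it has no
# use for stdin, but reads all of it anyway.
greedy() { cat >/dev/null; }

# Buggy loop: greedy shares the loop's stdin, so after the first iteration
# it has eaten the remaining lines and the loop ends early.
buggy() {
    printf 'a\nb\nc\n' | while read -r file; do
        echo "processing $file"
        greedy
    done
}

# Fixed loop: greedy gets /dev/null as stdin and cannot touch the list.
fixed() {
    printf 'a\nb\nc\n' | while read -r file; do
        echo "processing $file"
        greedy </dev/null
    done
}
```

Here buggy only ever processes a, while fixed processes all three lines. Any script that runs ssh inside a while read loop can protect itself the same way.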


I've been thinking about simpler setup for remote worktree update on push.

One nice way to make a remote update its worktree on push is available in recent-ish gits, receive.denyCurrentBranch=updateInstead. That could already be used with git annex sync, but it hid any error messages when pushing the master branch to the remote (since that push fails with a large error message in default configurations). Found a way to make the error message be displayed when the remote's receive.denyCurrentBranch does not have the default configuration.

The remaining problem is that direct mode and adjusted branch remotes won't get their work trees updated even when configured that way. I am thinking about adding a post-update hook to support those.


Also continuing to bring up the ancient kernel arm autobuilder. It's running its first build now.

Today's work was sponsored by Riku Voipio.

Posted Wed Feb 15 20:44:28 2017

Last week I only had the energy either to work most of each day on git-annex or to blog about it. I chose quiet work. The changelog did grow a good amount.

Today, fixed some autobuilder problems, and I am gearing up to add another autobuild, targeting arm boxes with older linux kernels, since I got a chance to upgrade the arm autobuilder's disk this weekend.

Also, some work on the S3 special remote, and worked around a bug in sqlite's handling of umask.

Backlog is down to 243 messages.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Mon Feb 13 21:42:21 2017

Finished the repository-clone-global configuration settings I started adding on Monday. Came up with a nice type-driven way to make sure that configuration is loaded when needed, and only loaded once. Then it was easy to make annex.autocommit be configurable by git-annex config. Also added a new annex.synccontent configuration, which can also be set by git-annex config.

Also resolved a tricky situation with providing an appid to magic wormhole. It will happen on a flag day in 2021. I've marked my calendar..

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Fri Feb 3 19:46:04 2017

Spent rather too long today tracking down a memory leak in git annex unused. Actually, it was three memory leaks; one of them was a reversion introduced while otherwise improving a function to not be partial. Another only happened in very rare circumstances. The third, which took several more hours staring at the code, turned out to simply be an unnecessary use of an accumulating list. Feel like I should have seen that one sooner, but then I am under the weather and was running profiles in a daze for several hours.. In the end, git-annex unused went from needing 1 gb of memory to 150 mb in my big repo.

One advantage to all the profiling though, was I noticed that the split function was allocating a lot of memory, and seemed generally inefficient. This has to do with it splitting on a string; splitting on a single character can run twice as fast and churn the GC quite a bit less, so I wrote up a specialized version of that, and it's used extensively in git-annex now, so it may run up to 50% faster in some cases. Seems like haskell libraries with a split function should perhaps use the more optimal version when splitting on a single character, and I'm going to file bugs to that effect.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Feb 1 00:02:26 2017

First day working on git-annex in over a month. I've been away preparing for and giving two talks at Linux Conf Australia and then recovering from conference flu, but am now raring to dive back into git-annex development!

The backlog stood at over 300 messages this morning, and is down to 274 now. So still lots of catching up to do. But nothing seems to have blown up badly in my absence. The antipatterns page was a nice development while I was away, listing some ways people sometimes find to shoot their feet. Read and responded to lots of questions, including one user who mentioned a scientific use case: "We are exploring use of git-annex to manage the large boundary conditions used within our weather model."

The main bit of coding today was adding a new git annex config command. This is fairly similar to git config, but it stores the settings in the git-annex branch, so they're visible in all clones of the repo (aka "global"). Not every setting will be configurable this way (that would be too expensive, and too foot-shooty), but starting with annex.autocommit I plan to enable selected settings that make sense to be able to set globally. If you've wanted to be able to configure some part of git-annex in all clones of a repository, suggestions are welcome in the todo item about this.

git annex vicfg can also be used to edit the global settings, and I also made it able to edit the global git annex numcopies setting which was omitted before. There's no real reason to have a separate git annex numcopies command now, since git annex config could configure global annex.numcopies.. but it's probably not worth changing that.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Mon Jan 30 21:35:05 2017

The webapp's wormhole pairing almost worked perfectly on the first test. Turned out the remotedaemon was not noticing that the tor hidden service got enabled. After fixing that, it worked perfectly!

So, I've merged that feature, and removed XMPP support from the assistant at the same time. If all goes well, the autobuilds will be updated soon, and it'll be released in time for new year's.

Anyone who's been using XMPP to keep repositories in sync will need to either switch to Tor, or could add a remote on a ssh server to sync by instead. See http://git-annex.branchable.com/assistant/share_with_a_friend_walkthrough/ for the pointy-clicky way to do it, and http://git-annex.branchable.com/tips/peer_to_peer_network_with_tor/ for the command-line way.

Posted Wed Dec 28 16:42:45 2016

Added the Magic Wormhole UI to the webapp for pairing Tor remotes. This replaces the XMPP pairing UI when using "Share with a friend" and "Share with your other devices" in the webapp.

I have not been able to fully test it yet, and it's part of the no-xmpp branch until I can.

It's been a while since I worked on the webapp. It was not as hard as I remembered to deal with Yesod. The inversion of control involved in coding for the web is as annoying as I remembered.


Today's work was sponsored by Riku Voipio.

Posted Tue Dec 27 21:18:44 2016

Have been working on some improvements to git annex enable-tor. Made it su to root, using any su-like program that's available. And made it test the hidden service it sets up, and wait until it's propagated to the Tor directory authorities. The webapp will need these features, so I thought I might as well add them at the command-line level.

Also some messing about with locale and encoding issues. About most of which the less said the better. One significant thing is that I've made the filesystem encoding be used for all IO by git-annex, rather than needing to explicitly enable it for each file and process. So, there should be much less bother with encoding problems going forward.

Posted Sat Dec 24 21:16:25 2016

git annex p2p --pair implemented, using Magic Wormhole codes that have to be exchanged between the repositories being paired.

It looks like this, with the same thing being done at the same time in the other repository.

joey@elephant:~/tmp/bench3/a>git annex p2p --pair
p2p pair peer1 (using Magic Wormhole) 

This repository's pairing code is: 1-select-bluebird

Enter the other repository's pairing code: (here I entered 8-fascinate-sawdust) 
Exchanging pairing data...
Successfully exchanged pairing data. Connecting to peer1...
ok

And just that simply, the two repositories find one another, Tor onion addresses and authentication data are exchanged, and a git remote is set up connecting via Tor.

joey@elephant:~/tmp/bench3/a>git annex sync peer1
commit  
ok
pull peer1 
warning: no common commits
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (5/5), done.
From tor-annex::5vkpoyz723otbmzo.onion:61900
 * [new branch]      git-annex  -> peer1/git-annex

Very pleased with this, and also the whole thing worked on the very first try!

It might be slightly annoying to have to exchange two codes during pairing. It would be possible to make this work with only one code. I decided to go with two codes, even though it's only marginally more secure than one, mostly for UI reasons. The pairing interface and instructions for using it are simplified by being symmetric.

(I also decided to revert the work I did on Friday to make p2p --link set up a bidirectional link. Better to keep --link the simplest possible primitive, and pairing makes bidirectional links more easily.)

Next: Some more testing of this and the Tor hidden services, a webapp UI for P2P peering, and then finally removing XMPP support. I hope to finish that by New Years.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Sun Dec 18 21:44:53 2016

Improved git annex p2p --link to create a bi-directional link automatically. Bi-directional links are desirable more often than not, so it's the default behavior.

Also continued thinking about using magic wormhole for communicating p2p addresses for pairing. And filed some more bugs on magic wormhole.

Posted Fri Dec 16 21:54:45 2016

Quite a backlog developed in the couple of weeks I was concentrating on tor support. I've taken a first pass through it and fixed the most pressing issues now.

Most important was an ugly memory corruption problem in the GHC runtime system that may have led to data corruption when using git-annex with Linux kernels older than 4.5. All the Linux standalone builds of git-annex have been updated to fix that issue.

Today dealt with several more things, including fixing a buggy timestamp issue with metadata --batch, reverting the ssh ServerAliveInterval setting (broke on too many systems with old ssh or complicated ssh configurations), making batch input not be rejected when it can't be decoded as UTF-8, and more.

Also, spent some time learning a little bit about Magic Wormhole and SPAKE, as a way to exchange tor remote addresses. Using Magic Wormhole for that seems like a reasonable plan. I did file a couple bugs on it which will need to get fixed, and then using it is mostly a question of whether it's easy enough to install that git-annex can rely on it.

Posted Tue Dec 13 19:49:18 2016

More improvements to tor support. Yesterday, debugged a reversion that broke push/pull over tor, and made actual useful error messages be displayed when there were problems. Also fixed a memory leak, although I fixed it by reorganizing code and could not figure out quite why it happened, other than that the ghc runtime was not managing to be as lazy as I would expect.

Today, added git ref change notification to the P2P protocol, and made the remotedaemon automatically fetch changes from tor remotes. So, it should work to use the assistant to keep repositories in sync over tor. I have not tried it yet, and linking over tor still needs to be done at the command line, so it's not really ready for webapp users yet.

Also fixed a denial of service attack in git-annex-shell and git-annex when talking to a remote git-annex-shell. It was possible to feed either of them a large amount of data when they tried to read a line of data, and summon the OOM killer. Next release will be expedited some because of that.

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Fri Dec 9 20:46:53 2016

Git annex transfers over Tor worked correctly the first time I tried them today. I had been expecting protocol implementation bugs, so this was a nice surprise!

Of course there were some bugs to fix. I had forgotten to add UUID discovery to git annex p2p --link. And, resuming interrupted transfers was buggy.

Spent some time adding progress updates to the Tor remote. I was curious to see what speed transfers would run. Speed will of course vary depending on the Tor relays being used, but this example with a 100 mb file is not bad:

copy big4 (to peer1...) 
62%          1.5MB/s 24s

There are still a couple of known bugs, but I've merged the tor branch into master already.


Alpernebbi has built a GUI for editing git-annex metadata. Something I always wanted!
Read about it here


Today's work was sponsored by Ethan Aubin.

Posted Wed Dec 7 19:51:20 2016

Friday and today were spent implementing both sides of the P2P protocol for git-annex content transfers.

There were some tricky cases to deal with. For example, when a file is being sent from a direct mode repository, or v6 annex.thin repository, the content of the file can change as it's being transferred. Including being appended to or truncated. Had to find a way to deal with that, to avoid breaking the protocol by not sending the indicated number of bytes of data.

It all seems to be done now, but it's not been tested at all, and there are probably some bugs to find. (And progress info is not wired up yet.)

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Tue Dec 6 21:09:21 2016

Today I finished the second-to-last big missing piece for tor hidden service remotes. Networks of these remotes are P2P networks, and there needs to be a way for peers to find one another, and to authenticate with one another. The git annex p2p command sets up links between peers in such a network.

So far it has only a basic interface that sets up a one way link between two peers. In the first repository, run git annex p2p --gen-address. That outputs a long address. In the second repository, run git annex p2p --link peer1, and paste the address into it. That sets up a git remote named "peer1" that connects back to the first repository over tor.

That is a one-directional link, while a bi-directional link would be much more convenient to have between peers. Worse, the address can be reused by anyone who sees it, to link into the repository. And, the address is far too long to communicate in any way except for pasting it.

So I want to improve that later. What I'd really like to have is an interface that displays a one-time-use phrase of five to ten words, that can be read over the phone or across the room. Exchange phrases with a friend, and get your repositories securely linked together with tor.

But, git annex p2p is good enough for now. I can move on to the final keystone of the tor support, which is file transfer over tor. That should, fingers crossed, be relatively easy, and the tor branch is close to mergeable now.

Today's work was sponsored by Riku Voipio.

Posted Wed Nov 30 21:06:46 2016

Debian's tor daemon is very locked down in the directories it can read from, and so I've had a hard time finding a place to put the unix socket file for git-annex's tor hidden service. Painful details in http://bugs.debian.org/846275. At least for now, I'm putting it under /etc/tor/, which is probably an FHS violation, but seems to be the only option that doesn't involve a lot of added complexity.


The Windows autobuilder is moving, since NEST is shutting down the server it has been using. Yury Zaytsev has set up a new Windows autobuilder, hosted at Dartmouth College this time.

Posted Tue Nov 29 21:39:03 2016

The tor branch is coming along nicely.

This weekend, I continued working on the P2P protocol, implementing it for network sockets, and extending it to support connecting up git-send-pack/git-receive-pack.

There was a bit of a detour when I split the Free monad into two separate ones, one for Net operations and the other for Local filesystem operations.

This weekend's work was sponsored by Thomas Hochstein on Patreon.


Today, implemented a git-remote-tor-annex command that git will use for tor-annex:: urls, and made git annex remotedaemon serve the tor hidden service.

Now I have git push/pull working to the hidden service, for example:

git pull tor-annex::eeaytkuhaupbarfi.onion:47651

That works very well, but does not yet check that the user is authorized to use the repo, beyond knowing the onion address. And currently it only works in git-annex repos; with some tweaks it should also work in plain git repos.

Next, I need to teach git-annex how to access tor-annex remotes. And after that, an interface in the webapp for setting them up and connecting them together.

Today's work was sponsored by Josh Taylor on Patreon.

Posted Tue Nov 22 02:38:08 2016

For a Haskell programmer, any day where a big thing is implemented without the least scrap of code that touches the IO monad is a good day. And this was a good day for me!

Implemented the p2p protocol for tor hidden services. Its needs are somewhat similar to the external special remote protocol, but the two protocols are not fully overlapping with one another. Rather than try to unify them, and so complicate both cases, I prefer to reuse as much code as possible between separate protocol implementations. The generating and parsing of messages is largely shared between them. I let the new p2p protocol otherwise develop in its own direction.

But, I do want to make this p2p protocol reusable for other types of p2p networks than tor hidden services. This was an opportunity to use the Free monad, which I'd never used before. It worked out great, letting me write monadic code to handle requests and responses in the protocol, that reads the content of files and resumes transfers and so on, all independent of any concrete implementation.

The whole implementation of the protocol only needed 74 lines of monadic code. It helped that I was able to factor out functions like this one, that is used both for handling a download, and by the remote when an upload is sent to it:

receiveContent :: Key -> Offset -> Len -> Proto Bool
receiveContent key offset len = do
        content <- receiveBytes len
        ok <- writeKeyFile key offset content
        sendMessage $ if ok then SUCCESS else FAILURE
        return ok

To get transcripts of the protocol in action, the Free monad can be evaluated purely, providing the other side of the conversation:

ghci> putStrLn $ protoDump $ runPure (put (fromJust $ file2key "WORM--foo")) [PUT_FROM (Offset 10), SUCCESS]
> PUT WORM--foo
< PUT-FROM 10
> DATA 90
> bytes
< SUCCESS
result: True

ghci> putStrLn $ protoDump $ runPure (serve (toUUID "myuuid")) [GET (Offset 0) (fromJust $ file2key "WORM--foo")]
< GET 0 WORM--foo
> PROTO-ERROR must AUTH first
result: ()

Am very happy with all this pure code and that I'm finally using Free monads. Next I need to get down to the dirty business of wiring this up to actual IO actions, and an actual network connection.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Thu Nov 17 21:20:39 2016

Fixed one howler of a bug today. Turns out that git annex fsck --all --from remote didn't actually check the content of the remote, but checked the local repository. Only --all was buggy; git annex fsck --from remote was ok. Don't think this is crash priority enough to make a release for, since only --all is affected.

Somewhat uncomfortably made git annex sync pass --allow-unrelated-histories to git merge. While I do think that git's recent refusal to merge unrelated histories is good in general, the problem is that initializing a direct mode repository involves making an empty commit. So merging from a remote into such a direct mode repository means merging unrelated histories, while an indirect mode repository doesn't. Seems best to avoid such inconsistencies, and the only way I could see to do it is to always use --allow-unrelated-histories. May revisit this once direct mode is finally removed.

Using the git-annex arm standalone bundle on some WD NAS boxes used to work, and then it seems they changed their kernel to use a nonstandard page size, and broke it. This actually seems to be a bug in the gold linker, which defaults to an unnecessarily small page size on arm. The git-annex arm bundle is being adjusted to try to deal with this.

ghc 8 made error include some backtrace information. While it's really nice to have backtraces for unexpected exceptions in Haskell, it turns out that git-annex used error a lot with the intent of showing an error message to the user, and a backtrace clutters up such messages. So, bit the bullet and checked through every error in git-annex and made such ones not include a backtrace.

Also, I've been considering what protocol to use between git-annex nodes when communicating over tor. One way would be to make it very similar to git-annex-shell, using rsync etc, and possibly reusing code from git-annex-shell. However, it can take a while to make a connection across the tor network, and that method seems to need a new connection for each file transferred etc. Also thought about using a http based protocol. The servant library is great for that, you get both http client and server implementations almost for free. Resuming interrupted transfers might complicate it, and the hidden service side would need to listen on a unix socket, instead of the regular http port. It might be worth it to use http for tor, if it could be reused for git-annex http servers not on the tor network. But, then I'd have to make the http server support git pull and push over http in a way that's compatible with how git uses http, including authentication. Which is a whole nother ball of complexity. So, I'm leaning instead to using a simple custom protocol something like:

    > AUTH $localuuid $token
    < AUTH-SUCCESS $remoteuuid
    > SENDPACK $length
    > $gitdata
    < RECVPACK $length
    < $gitdata
    > GET $pos $key
    < DATA $length
    < $bytes
    > SUCCESS
    > PUT $key
    < PUT-FROM $pos
    > DATA $length
    > $bytes
    < SUCCESS
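
Since every message starts with a verb on its own line, even a shell loop can follow the conversation. This toy sketch covers only the AUTH gate; the real implementation is Haskell and handles far more. The function name serve_sketch, the shared-token check and the AUTH-FAILURE reply are assumptions for illustration, while the PROTO-ERROR wording is the one from the transcript in the later post.

```shell
#!/bin/sh
# serve_sketch MYUUID TOKEN
# Reads protocol lines from stdin and answers on stdout. Only AUTH is
# handled; anything else is refused, with a different error depending on
# whether the peer has authenticated yet.
serve_sketch() {
    myuuid=$1
    expected=$2
    authed=no
    # Each line is "VERB arg1 arg2"; for AUTH that is the peer uuid and token.
    while read -r verb arg1 arg2; do
        case "$verb" in
            AUTH)
                if [ "$arg2" = "$expected" ]; then
                    authed=yes
                    echo "AUTH-SUCCESS $myuuid"
                else
                    echo "AUTH-FAILURE"   # assumed reply, not from the post
                    return 1
                fi ;;
            *)
                if [ "$authed" = no ]; then
                    echo "PROTO-ERROR must AUTH first"
                else
                    echo "PROTO-ERROR $verb not handled by this sketch"
                fi
                return 1 ;;
        esac
    done
}
```

For example, printf 'GET 0 WORM--foo\n' | serve_sketch myuuid secret is refused because no AUTH line came first.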

Today's work was sponsored by Riku Voipio.

Posted Wed Nov 16 20:18:30 2016

Have waited too long for some next-generation encrypted P2P network, like telehash to emerge. Time to stop waiting; tor hidden services are not as cutting edge, but should work. Updated the design and started implementation in the tor branch.

Unfortunately, Tor's default configuration does not enable the ControlPort. And, changing that in the configuration could be problematic. This makes it harder than it ought to be to register a tor hidden service. So, I implemented a git annex enable-tor command, which can be run as root to set it up. The webapp will probably use su-to-root or gksu to run it. There are some Linux-specific parts in there, and it uses a socket for communication between tor and the hidden service, which may cause problems for Windows porting later.

Next step will be to get git annex remotedaemon to run as a tor hidden service.

Also made a no-xmpp branch which removes xmpp support from the assistant. That will remove 3000 lines of code when it's merged. Will probably wait until after tor hidden services are working.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Nov 14 20:50:26 2016

Worked on several bug reports today, fixing some easy ones, and following up on others. And then there are the hard bugs.. Very pleased that I was able to eventually reproduce a bug based entirely on the information that git-annex's output did not include a filename. Didn't quite get that bug fixed though.

At the end of the day, got a bug report that git annex add of filenames containing spaces has broken. This is a recent reversion and I'm pushing out a release with a fix ASAP.

Posted Mon Oct 31 22:43:37 2016

Made a significant change today: Enabled automatic retrying of transfers that fail. It's only done if the previous try managed to advance the progress by some amount. The assistant has already had that retrying for many years, but now it will also be done when using git-annex at the command line.
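The retry rule can be sketched in a few lines of shell. This is not how git-annex does it internally; here progress is measured by the size of a partially transferred file, which is an assumption, and the names size_of and retry_while_progress are invented for the example.

```shell
#!/bin/sh
# size_of FILE: byte size of FILE, or 0 if it does not exist yet.
size_of() {
    if [ -f "$1" ]; then
        wc -c <"$1" | tr -d ' '
    else
        echo 0
    fi
}

# retry_while_progress DEST COMMAND...
# Runs COMMAND until it succeeds, but only retries when the failed attempt
# made DEST grow; a failure with no progress is final.
retry_while_progress() {
    dest=$1
    shift
    before=$(size_of "$dest")
    until "$@"; do
        after=$(size_of "$dest")
        # Give up if the failed attempt did not move the transfer forward.
        [ "$after" -gt "$before" ] || return 1
        before=$after
    done
    return 0
}
```

COMMAND needs to be something that resumes from the partial file rather than starting over, otherwise the size check says nothing useful.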

One good reason for a transfer to fail and need a retry is when the network connection stalls. You'd think that TCP keepalives would detect this kind of thing and kill the connection, but I've had enough complaints that I suppose that doesn't always work or gets disabled. Ssh has a ServerAliveInterval that detects such stalls nicely for the kind of batch transfers git-annex uses ssh for, but it's not enabled by default. So I found a way to make git-annex enable it, while still letting ~/.ssh/config settings override that.

Also got back to analyzing an old bug report about proliferating ".nfs*.lock" files when using git-annex on nfs; this was caused by the wacky NFS behavior of renaming deleted files, and I found a change to the ssh connection caching cleanup code that should avoid the problem.

Posted Wed Oct 26 20:48:52 2016

Several bug fixes involving v6 unlocked files today. Several related bugs were caused by relying on the inode cache information, without a fallback to handle the case where the inode cache had not gotten updated. While the inode cache is generally kept up-to-date well by the smudge/clean filtering, it is just a cache and can be out of date. Did some auditing for such problems and hopefully I've managed to find them all.

Also, there was a tricky upgrade case where a v5 repository contained a v6 unlocked file, and the annexed content got copied into it. This triggered the above-described bugs, and in this case the worktree needs to be updated on upgrade, to replace the pointer file with the content.

As I caught up with recent activity, it was nice to see some contributions from others. James MacMahon sent in a patch to improve the filenames generated by importfeed. And, xloem is writing workflow documentation for git-annex in Workflow guide.

Posted Mon Oct 17 20:50:11 2016

Finished up where I left off yesterday, writing test cases and fixing bugs with syncing in adjusted branches. While adjusted branches need v6 mode, and v6 mode is still considered experimental, this is still a rather nasty bug, since it can make files go missing (though still available in git history of course). So, planning to release a new version with these fixes as soon as the autobuilders build it.

Posted Tue Oct 11 20:03:25 2016

Over a month ago, I had some reports that syncing into adjusted branches was losing some files that had been committed. I couldn't reproduce it, but IIRC both felix and tbm reported problems in this area. And, felix kindly sent me enough of his git repo to hopefully reproduce the problem.

Finally got back to that today. Luckily, I was able to reproduce the bug using felix's repo. The bug only occurs when there's a change deep in a tree of an adjusted branch, and not always then. After staring at it for a couple of hours, I finally found the problem; a modification flag was not getting propagated in this case, and some changes made deep in the tree were not getting included into parent trees.

So, I think I've fixed it, but need to look at it some more to be sure, and develop a test case. And fixing that exposed another bug in the same code. Gotta run unfortunately, so will finish this tomorrow..

Today's work was sponsored by Riku Voipio.

Posted Mon Oct 10 19:05:19 2016

Several bug fixes today and got caught up on most recent messages. Backlog is 157.

The most significant one prevents git-annex from reading in the whole content of a large git object when it wants to check if it's an annex symlink. In several situations where large files were committed to git, or staged, git-annex could do a lot of work, and use a lot of memory and maybe crash. Fixed by checking the size of an object before asking git cat-file for its content.
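
The idea of checking the size first can be sketched like this in Python. Everything here is an assumption made for illustration: the function name, the two callbacks, and the 8192 byte cutoff are not git-annex's. With real git, the size lookup could be a wrapper around `git cat-file -s`.

```python
from typing import Callable, Optional

# Assumed cutoff for this sketch; the limit git-annex actually uses may differ.
MAX_LINK_SIZE = 8192

def read_possible_link(
    sha: str,
    get_size: Callable[[str], int],
    get_content: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the object's content only if it is small enough to be a link."""
    if get_size(sha) > MAX_LINK_SIZE:
        # Too big to be an annex symlink target; never load it into memory.
        return None
    return get_content(sha)
```

The expensive read is skipped entirely for large objects, so memory use stays bounded no matter how big the committed file is.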

Also a couple of improvements around versions and upgrading. IIRC git-annex used to only support one repository version at a time, but this was changed to support V6 as an optional upgrade from V5, and so the supported versions became a list. Since V3 repositories are identical to V5 other than the version, I added it to the supported version list, and any V3 repos out there can be used without upgrading. Particularly useful if they're on read-only media.

And, there was a bug in the automatic upgrading of a remote that caused it to be upgraded all the way to V6. Now it will only be upgraded to V5.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Oct 5 21:11:07 2016

Realized recently that despite all the nice concurrency support in git-annex, external special remotes were limited to handling one request at a time.

While the external special remote protocol could almost support concurrent requests, that would complicate implementing them, and would probably need a version flag to enable it, to avoid breaking existing ones.

Instead, made git-annex start up multiple external special remote processes as needed to handle concurrency.
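
One way to picture "start up more processes as needed" is a small pool, sketched below in Python. This is an assumption about the general shape, not git-annex's implementation; `ProcessPool` and its `start` callback are hypothetical names.

```python
import threading
from typing import Callable, Generic, List, TypeVar

P = TypeVar("P")

class ProcessPool(Generic[P]):
    """Hand each concurrent request its own process, starting more as needed."""

    def __init__(self, start: Callable[[], P]) -> None:
        self._start = start  # hypothetical function that launches one process
        self._idle: List[P] = []
        self._lock = threading.Lock()

    def acquire(self) -> P:
        with self._lock:
            if self._idle:
                return self._idle.pop()
        # No idle process is available, so start another one.
        return self._start()

    def release(self, proc: P) -> None:
        with self._lock:
            self._idle.append(proc)
```

Each process still handles one request at a time, so the protocol itself does not need to change.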

Today's work was sponsored by Josh Taylor on Patreon.

Posted Fri Sep 30 23:52:30 2016

Did most of the optimisations that recent profiling suggested. This sped up a git annex find from 3.53 seconds to 1.73 seconds. And, git annex find --not --in remote from 12.41 seconds to 5.24 seconds. One of the optimisations sped up git-annex branch querying by up to 50%, which should also speed up use of some preferred content expressions. All in all, a very nice little optimisation pass.

Posted Thu Sep 29 21:17:29 2016

Only had a couple hours today, which were spent doing some profiling of git-annex in situations where it has to look through a large working tree in order to find files to act on. The top five hot spots this found are responsible for between 50% and 80% of git-annex's total CPU use in these situations.

The first optimisation sped up git annex find by around 18%. More tomorrow..

Posted Mon Sep 26 20:54:34 2016

Catching up on backlog today. I hope to be back to a regular work schedule now. Unanswered messages down to 156. A lot of time today spent answering questions.

There were several problems involving git branches with slashes in their name, such as "foo/bar" (but not "origin/master" or "refs/heads/foo"). Some branch names based on such a branch would take only the "bar" part. In git annex sync, this led to perhaps merging "foo/bar" into "other/bar" or "bar". And the adjusted branch code was entirely broken for such branches. I've fixed it now.
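
The kind of mistake involved can be illustrated with a small Python sketch. Both functions are hypothetical and only contrast keeping the last path component with stripping a known prefix; they are not the code that was fixed.

```python
def buggy_basename(ref: str) -> str:
    """Keeps only the last component, so 'foo/bar' collapses to 'bar'."""
    return ref.rsplit("/", 1)[-1]

def branch_name(ref: str) -> str:
    """Strip only a leading 'refs/heads/' and keep the rest intact."""
    prefix = "refs/heads/"
    return ref[len(prefix):] if ref.startswith(prefix) else ref
```

With the first function, "refs/heads/foo/bar" and "refs/heads/other/bar" both turn into "bar", which is how unrelated branches could end up being merged together.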

Also made git annex addurl behave better when the file it wants to add is gitignored.

Thinking about implementing git annex copy --from A --to B. It does not seem too hard to do that, at least with a temp file used in between. See transitive transfers.

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Wed Sep 21 22:03:16 2016

Turned out to not be very hard at all to make git annex get -JN assign different threads to different remotes that have the same cost. Something like that was requested back in 2011, but it didn't really make sense until parallel get was implemented last year.
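
A minimal sketch of the assignment idea, in Python. The function name and the plain modulo scheme are assumptions for illustration; the post does not say how git-annex picks which thread uses which remote.

```python
from typing import Dict, List, Tuple

def assign_remotes(remotes: List[Tuple[str, int]], jobs: int) -> Dict[int, str]:
    """Spread job numbers across the remotes that share the lowest cost."""
    lowest = min(cost for _, cost in remotes)
    cheapest = [name for name, cost in remotes if cost == lowest]
    return {job: cheapest[job % len(cheapest)] for job in range(jobs)}
```

With two remotes at the same cost and -J4, jobs 0 and 2 would use the first remote and jobs 1 and 3 the second, while a more expensive remote is left alone.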

(Also spent too much time fixing up broken builds.)

Posted Tue Sep 6 19:20:07 2016

Back after taking most of August off and working on other projects.

Got the unanswered messages backlog down from 222 to 170. Still scary high.

Numerous little improvements today. Notable ones:

  • Windows: Handle shebang in external special remote program. This is needed for git-annex-remote-rclone to work on Windows. Nice to see that external special remote is getting ported and apparently getting lots of use.
  • Make --json and --quiet suppress automatic init messages, and any other messages that might be output before a command starts. This was a regression introduced in the optparse-applicative changes over a year ago.

Also I'm developing a plan to improve parallel downloading when multiple remotes have the same cost. See get round robin.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Sep 5 20:39:45 2016

A user suggested adding --failed to retry failed transfers. That was a great idea and I landed a patch for it 3 hours later. Love it when a user suggests something so clearly right and I am able to quickly make it happen!


Unfortunately, my funding from the DataLad project to work on git-annex is running out. It's been a very good two years funded that way, with an enormous amount of improvements and support and bug fixes, but all good things must end. I'll continue to get some funding from them for the next year, but only for half as much time as the past two years.

I need to decide if it makes sense to keep working on git-annex to the extent I have been. There are definitely a few (hundred) things I still want to do on git-annex, starting with getting the git patches landed to make v6 mode really shine. Past that, it's mostly up to the users. If they keep suggesting great ideas and finding git-annex useful, I'll want to work on it more.

What to do about funding? Maybe some git-annex users can contribute a small amount each month to fund development. I've set up a Patreon page for this, https://www.patreon.com/joeyh


Anyhoo... Back to today's (unfunded) work.

--failed can be used with get, move, copy, and mirror. Of course those commands can all be simply re-run if some of the transfers fail and will pick up where they left off. But using --failed is faster because it does not need to scan all files to find out which still need to be transferred. And accumulated failures from multiple commands can be retried with a single use of --failed.

It's even possible to do things like git annex get --from foo; git annex get --failed --from bar, which first downloads everything it can from the foo remote and falls back to using the bar remote for the rest. Although setting remote costs is probably a better approach most of the time.

Turns out that I had earlier disabled writing failure log files, except by the assistant, because only the assistant was using them. So, that had to be undone. There's some potential for failure log files to accumulate annoyingly, so perhaps some expiry mechanism will be needed. This is why --failed is documented as retrying "recent" transfers. Anyway, the failure log files are cleaned up after successful transfers.

Posted Wed Aug 3 18:55:31 2016

With yesterday's JSON groundwork in place, I quickly implemented git annex metadata --batch today in only 45 LoC. The interface is nicely elegant; the same JSON format that git-annex metadata outputs can be fed into it to get, set, delete, and modify metadata.

Posted Wed Jul 27 20:02:41 2016

I've had to change the output of git annex metadata --json. The old output looked like this:

{"command":"metadata","file":"foo","key":"...","author":["bar"],...,"note":"...","success":true}

That was not good, because it didn't separate the metadata fields from the rest of the JSON object. What if a metadata field is named "note" or "success"? It would collide with the other "note" and "success" in the JSON.

So, changed this to a new format, which moves the metadata fields into a "fields" object:

{"command":"metadata","file":"foo","key":"...","fields":{"author":["bar"],...},"note":"...","success":true}

I don't like breaking backwards compatibility of JSON output, but in this case I could see no real alternative. I don't know if anyone is using metadata --json anyway. If you are and this will cause a problem, get in touch.
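
The collision is easy to demonstrate with a few lines of Python. The values below are made up; only the shape of the two formats comes from the examples above.

```python
import json

metadata = {"author": ["bar"], "note": ["a metadata field named note"]}
status = {"command": "metadata", "file": "foo", "note": "...", "success": True}

# Old layout: metadata fields share one object with the status keys,
# so whichever "note" is merged last silently replaces the other.
old_style = {**metadata, **status}

# New layout: metadata fields live in their own "fields" object.
new_style = {**status, "fields": metadata}

print(json.dumps(old_style))
print(json.dumps(new_style))
```

In the old layout the metadata field named "note" is lost, while in the new layout both values survive side by side.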


While making that change, I also improved the JSON output layer, so it can use Aeson. Update: And switched everything over to using Aeson, so git-annex no longer depends on two different JSON libraries.

This let me use Aeson to generate the "fields" object for metadata --json. And it was also easy enough to use Aeson to parse the output of that command (and some simplified forms of it).

So, I've laid the groundwork for git annex metadata --batch today.

Posted Tue Jul 26 19:55:20 2016

A common complaint is that git annex fsck in a bare repository complains about missing content of deleted files. That's because in a bare repository, git-annex operates on all versions of all files. Today I added a --branch option, so if you only want to check say, the master branch, you can: git annex fsck --branch master

The new option has other uses too. Want to get all the files in the v1.0 tag? git annex get --branch v1.0

It might be worth revisiting the implicit --all behavior for bare repositories. It could instead default to --branch HEAD or something like that. But I'd only want to change that if there was a strong consensus in favor.

Over 3/4 of the time spent implementing --branch was spent in adjusting the output of commands, to show "branch:file" is being operated on. How annoying.

Posted Wed Jul 20 19:59:37 2016

First release in over a month. Before making this release, a few last minute fixes, including a partial workaround for the problem that Sqlite databases don't work on Lustre filesystems.

Backlog is now down to 140 messages, and only 3 of those are from this month. Still higher than I like.

Posted Tue Jul 19 20:19:43 2016

Noticed that in one of my git-annex repositories, git-annex was spending a full second at startup checking all the git-annex branches from remotes to see if they contained changes that needed to be merged in. So, I added a cache of recently merged branches to avoid that. I remember considering this optimisation years ago; don't know why I didn't do it then. Not every day that I can speed up git-annex so much!
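
The post does not describe how the cache is stored, so the following Python sketch is only one plausible shape: remember which commit of each remote git-annex branch was merged last, and only look at branches that have moved since. The function and parameter names are hypothetical.

```python
from typing import Dict, List

def branches_to_merge(current: Dict[str, str], merged_cache: Dict[str, str]) -> List[str]:
    """Return refs whose current commit differs from the one last merged."""
    return [ref for ref, sha in current.items() if merged_cache.get(ref) != sha]
```

When nothing has changed since the last run, the list is empty and the expensive check can be skipped.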

Also, made git annex log --all show location log changes for all keys. This was tricky to get right and fast.

Posted Sun Jul 17 19:20:51 2016

Worked on recent bug reports. Two bugs fixed today were both regressions introduced when the v6 repository support was added. Backlog is down to 153.

Posted Tue Jul 12 20:47:20 2016

Revisited my enhanced smudge/clean patch set for git, updating it for code review and to deal with changes in git since I've been away. This took several hours unfortunately.

Posted Mon Jul 11 22:48:56 2016

Back from vacation, with a message backlog of 181. I'm concentrating first on low-hanging fruit of easily implemented todos, and well reproducible bugs, to get started again.

Implemented --batch mode for git annex get and git annex drop, and also enabled --json for those.

Investigated git-annex startup time; see http://git-annex.branchable.com/todo/could_standalone___39__fixed__39___git-annex_binaries_be_prelinked__63__/. Turns out that cabal has a bug that causes many thousands of unnecessary syscalls when linking in the shared libraries. Working around it halved git-annex's startup time.

Fixed a bug that caused git annex testremote to crash when testing a freshly made external special remote.

Posted Wed Jul 6 19:14:24 2016

Continued working on the enhanced smudge/clean interface in git today. Sent in a third version of the patch set, which is now quite complete.

I'll be away for the next week and a half, on vacation.

Posted Wed Jun 22 20:25:17 2016

Continued working on the enhanced smudge/clean interface in git, incorporating feedback from the git developers.

In a spare half an hour, I made an improved-smudge-filters branch that teaches git-annex smudge to use the new interface.

Doing a quick benchmark, git checkout of a deleted 1 gb file took:

  • 19 seconds before
  • 11 seconds with the new interface
  • 0.1 seconds with the new interface and annex.thin set
    (while also saving 1 gb of disk space!)

So, this new interface is very much worthwhile.

Posted Fri Jun 17 20:56:27 2016

Working on git, not git-annex the past two days, I have implemented the smudge-to-file/clean-from-file extension to the smudge/clean filter interface. Patches have been sent to the git developers, and hopefully they'll like it and include it. This will make git-annex v6 work a lot faster and better.

Amazing how much harder it is to code on git than on git-annex! While I'm certainly not as familiar with the git code base, this is mostly because C requires so much more care about innumerable details and so much verbosity to do anything. I probably could have implemented this interface in git-annex in 2 hours, not 2 days.

Posted Thu Jun 16 20:36:35 2016

There was one more test suite failure when run on FAT, which I've investigated today. It turns out that a bug report was filed about the same problem, and at root it seems to be a bug in git merge. Luckily, it was not hard to work around the strange merge behavior.

It's been very worthwhile running the test suite on FAT; it's pointed me at several problems with adjusted branches over the past weeks. It would be good to add another test suite pass to test adjusted branches explicitly, but when I tried adding that, there were a lot of failures where the test suite is confused by adjusted branch behavior and would need to be taught about it.

I've released git-annex 6.20160613. If you're using v6 repositories and especially adjusted branches, you should upgrade since it has many fixes.

Posted Mon Jun 13 20:18:22 2016

Today I was indeed able to get to the bottom of and fix the bug that had stumped me the other day.

Rest of the day was taken up by catching up to some bug reports and suggestions for v6 mode. Like making unlock and lock work for files that are not locally present. And, improving the behavior of the clean filter so it remembers what backend was used for a file before and continues using that same backend.

About ready to make a release, but IIRC there's one remaining test suite failure on FAT.

Posted Thu Jun 9 20:39:47 2016

Been having a difficult time fixing the two remaining test suite failures when run on a FAT filesystem.

On Friday, I got quite lost trying to understand the first failure. At first I thought it had something to do with queued git staging commands not being run in the right git environment when git-annex is using a different index file or work tree. I did find and fix a potential bug in that area. It might be that some reports long ago of git-annex branch files getting written to the master branch were caused by that. But, fixing it did not help with the test suite failure at hand.

Today, I quickly found the actual cause of the first failure. Of course, it had nothing to do with queued git commands at all, and was a simple fix in the end.

But, I've been staring at the second failure for hours and am not much wiser. All I know is, an invalid tree object gets generated by the adjusted branch code that contains some files more than once. (git gets very confused when a repository contains such tree objects; if you wanted to break a git repository, getting such trees into it might be a good way. cough) This invalid tree object seems to be caused by the basis ref for the adjusted branch diverging somehow from the adjusted branch itself. I have not been able to determine why or how the basis ref can diverge like that.

Also, this failure is somewhat indeterminate; it doesn't always occur, and reordering the tests in the test suite can hide it. Weird.

Well, hopefully looking at it again later with fresh eyes will help.

Posted Tue Jun 7 20:01:27 2016

A productive day of small fixes. Including a change to deal with an incompatibility in git 2.9's commit.gpgsign, and a couple of fixes involving gcrypt repositories.

Also several improvements to cloning from repositories where an adjusted branch is checked out. The clone automatically ends up with the adjusted branch checked out too.

The test suite has 3 failures when run on a FAT repository, all involving adjusted branches. Managed to fix one of them today, hope to get to the others soon.

Posted Thu Jun 2 21:03:50 2016

Release today includes a last-minute fix to parsing lines from the git-annex branch that might have one or more carriage returns at the end. This comes from Windows of course, where since some things transparently add/remove \r before the end of lines, while other things don't, it could result in quite a mess. Luckily it was not hard or expensive to handle. If you are lucky enough not to use Windows, the release also has several more interesting improvements.
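
In Python terms, the tolerant parsing amounts to something like the sketch below. The function name is hypothetical and this is not the Haskell code that went into the release.

```python
def parse_line(raw: str) -> str:
    """Drop the newline and any carriage returns left at the end of a line."""
    return raw.rstrip("\r\n")
```

Only trailing characters are removed, so a carriage return in the middle of a line is left untouched.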

Posted Fri May 27 20:51:03 2016

git-annex has always balanced implicit and explicit behavior. Enabling a git repository to be used with git-annex needs an explicit init, to avoid foot-shooting; but a clone of a repository that is already using git-annex will be implicitly initialized. Git remotes implicitly are checked to see if they use git-annex, so the user can immediately follow git remote add with git annex get to get files from it.

There's a fine line here, and implicit git remote enabling sometimes crosses it; sometimes the remote doesn't have git-annex-shell, and so there's an ugly error message and annex-ignore has to be set to avoid trying to enable that git remote again. Sometimes the probe of a remote can occur when the user doesn't really expect it to (and it can involve a ssh password prompt).

Part of the problem is, there's not an explicit way to enable a git remote to be used by git-annex. So, today, I made git annex enableremote do that, when the remote name passed to it is a git remote rather than a special remote. This way, you can avoid the implicit behavior if you want to.

I also made git annex enableremote un-set annex-ignore, so if a remote got that set due to a transient configuration problem, it can be explicitly enabled.

Posted Tue May 24 21:11:15 2016

Over the weekend, I noticed that a relative path to GIT_INDEX_FILE is interpreted in several different, inconsistent ways by git. git-annex mostly used absolute paths, but did use a relative path in git annex view. Now it will only use absolute paths to avoid git's wacky behavior.
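
A small Python sketch of the "always absolute" rule, assuming a hypothetical helper that builds the environment for a git subprocess; GIT_INDEX_FILE is the real variable, while the helper name is made up.

```python
import os
from typing import Dict

def env_with_index(index_file: str, base_env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the environment with GIT_INDEX_FILE made absolute."""
    env = dict(base_env)
    env["GIT_INDEX_FILE"] = os.path.abspath(index_file)
    return env
```

Resolving the path once, before git runs, means it no longer matters how git would have interpreted a relative value.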

Integrated some patches to support building with ghc 8.0.1, which was recently released.

The gnupg-options git configs were not always passed to gpg. Fixing this involved quite a lot of plumbing to get the options to the right functions, and consumed half of today.

Also did some design work on the external special remote protocol to avoid backwards compatibility problems when adding new protocol features.

Posted Mon May 23 22:26:56 2016

Fixed several problems with v6 mode today. The assistant was doing some pretty wrong things when changes were synced into v6 repos, and that behavior is fixed. Also dealt with a race that caused updates made to the keys database by one process to not be seen by another process. And, made git annex add of an unlocked pointer file not annex the pointer file's content, but just add it to git as-is.

Also, Thowz pointed out that adjusted branches could be used to locally adjust where annex symlinks point to, when a repository's git directory is not in the usual location. I've added that, as git annex adjust --fix. It was quite easy to implement this, which makes me very happy with the adjusted branches code!

Posted Mon May 16 21:35:43 2016

Posted a proposal for extending git smudge/clean filters with raw file access. If git gets an interface like that, it will make it easy to deal with most of the remaining v6 todo list.

Posted Thu May 12 21:20:41 2016

It's not every day I add a new special remote encryption mode to git-annex! The new encryption=sharedpubkey mode lets anyone with a clone of the git repository (and access to the remote) store files in the remote, but then only the private key owner can access those files. Which opens up some interesting new use cases...

Posted Tue May 10 21:18:39 2016

Lots of little fixes and improvements here and there over the past couple days.

The main thing was fixing several bugs with adjusted branches and Windows. They seem to work now, and commits made on the adjusted branch are propagated back to master correctly.

It would be good to finish up the last todos for v6 mode this month. The sticking point is I need a way to update the file stat in the git index when git-annex gets/drops/etc an unlocked file. I have not decided yet if it makes the most sense to add a dependency on libgit2 for that, or extend git update-index, or even write a pure haskell library to manipulate index files. Each has its pluses and its minuses.

Posted Wed May 4 18:40:43 2016

git-annex 6.20160419 has a rare security fix. A bug made encrypted special remotes that are configured to use chunks accidentally expose the checksums of content that is uploaded to the remote. Such information is supposed to be hidden from the remote's view by the encryption. The same bug also made resuming interrupted uploads to such remotes start over from the beginning.

After releasing that, I've been occupied today with fixing the Android autobuilder, which somehow got its build environment broken (unsure how), and fixing some other dependency issues.

Posted Thu Apr 28 20:19:34 2016

I'm on a long weekend. This did not prevent git-annex from getting an impressive lot of features though, as Daniel Dent contributed https://github.com/DanielDent/git-annex-remote-rclone which uses rclone to add support for a ton of additional cloud storage things, including:

Google Drive, Openstack Swift, Rackspace cloud files, Memset Memstore, Dropbox, Google Cloud Storage, Amazon Cloud Drive, Microsoft One Drive, Hubic, Backblaze B2, Yandex Disk

Wow! I hope that rclone will end up packaged in more distributions (eg Debian) so this will be easier to set up.

Posted Mon Apr 25 17:47:48 2016

Something that has come up repeatedly is that git annex reinject is too hard to use since you have to tell it which annexed file you're providing the content for. Now git-annex reinject --known can be passed a list of files and it will reinject any that hash to known annexed contents and ignore the rest. That works best when only one backend is used in a repository; otherwise it would need to be run repeatedly with different --backend values.
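
The matching step can be sketched in Python as below. It assumes the SHA256 backend and a key of the form SHA256-s<size>--<hex digest>; the helper names are hypothetical and the set of known keys is simply passed in rather than read from a repository.

```python
import hashlib
from typing import Iterable, List, Set

def sha256_key(path: str) -> str:
    """Build a SHA256-backend style key: SHA256-s<size>--<hex digest>."""
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
            size += len(chunk)
    return f"SHA256-s{size}--{digest.hexdigest()}"

def known_files(paths: Iterable[str], known_keys: Set[str]) -> List[str]:
    """Keep only the files whose key is already known; ignore the rest."""
    return [p for p in paths if sha256_key(p) in known_keys]
```

Since the key depends on the backend, a repository that mixes backends would need one such pass per backend, which matches the caveat above.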

Turns out that the GIT_COMMON_DIR feature used by adjusted branches is only a couple years old, so don't let adjusted branches be used with a too old git.

And, git merge is getting a new sanity check that prevents merging in a branch with a disconnected history. git annex sync will inherit that sanity check, but the assistant needs to let such merges happen when eg, pairing repositories, so more git version checking there.

Posted Fri Apr 22 20:10:51 2016

The past three days have felt kind of low activity days, but somehow a lot of stuff still got done, both bug fixes and small features, and I am feeling pretty well caught up with backlog for the first time in over a month. Although as always there is some left, 110 messages.

On Monday I fixed a bug that could cause a hang when dropping content, if git-annex had to verify the content was present on a ssh remote. That bug was bad enough to make an immediate release for, even though it was only a week since the last release.

Posted Wed Apr 20 20:16:16 2016

Seems I forgot about executable files entirely when implementing v6 unlocked files. Fixed that oversight today.

Posted Thu Apr 14 19:56:05 2016

Yesterday I released version 6.20160412, which is the first to support adjusted branches.

Today, some planning for ways to better support annex.thin, but that seems to be stuck on needing a way to update git's index file. Which is the main thing needed to fix various problems with v6 unlocked files.

Dove back into the backlog, got it down to 144 messages. Several bug fixes.

Posted Wed Apr 13 23:18:24 2016

Think I'm really finished with adjusted branches now. Fixed a bug in annex symlink calculation when merging into an adjusted branch. And, fixed a race condition involving a push of master from another repository.

While git annex adjust --unlock is reason enough to have adjusted branches, I do want to at some point look into implementing git annex adjust --hide-missing, and perhaps rewrite the view branches to use adjusted branches, which would allow for updating view branches when pulling from a remote.

Also, turns out Windows supports hard links, so I got annex.thin working on Windows, as well as a few other things that work better with hard links.

Posted Sat Apr 9 18:20:43 2016

Well, I had to rethink how merges into adjusted branches should be handled. The old method often led to unnecessary merge conflicts. My new approach should always avoid unnecessary merge conflicts, but it's quite a trick.

To merge origin/master into adjusted/master, it first merges origin/master into master. But, since adjusted/master is checked out, it has to do the merge in a temporary work tree. Luckily this can be done fairly inexpensively. To handle merge conflicts at this stage, git-annex's automatic merge conflict resolver is used. This approach wouldn't be feasible without a way to automatically resolve merge conflicts, because the user can't help with conflict resolution when the merge is not happening in their working tree.

Once that out-of-tree merge is done, the result is adjusted, and merged into the adjusted branch. Since we know the adjusted branch is a child of the old master branch, this merge can be forced to always be a fast-forward. This second merge will only ever have conflicts if the work tree has something uncommitted in it that causes a merge conflict.

Wow! That's super tricky, but it seems to work well. While I ended up throwing away everything I did last Thursday due to this new approach, the code is in some ways simpler than that old, busted approach.

Posted Wed Apr 6 23:32:23 2016

Feels like I've been working on adjusted branches too long.

Did make some excellent progress today. Upgrading a direct mode repo to v6 will now enter an adjusted branch where all files are unlocked. Using an adjusted branch like this avoids unlocking all files in the master branch of the repo, which means that different clones of a repo can be upgraded to v6 mode at different times. This should let me advance the timetable for enabling v6 by default, and getting rid of direct mode.

Also, cloning a repository that has an adjusted branch checked out will now work; the clone starts out in the same adjusted branch.

But, I realized today that the way merges from origin/master into adjusted/master are done will often lead to merge conflicts. I have come up with a better way to handle these merges that won't unnecessarily conflict, but didn't feel ready to implement that today.


Instead, I spent the latter half of the day getting caught up on some of the backlog. Got it down from some 200 messages to 150.

Posted Mon Apr 4 21:02:06 2016

Spent all day fixing sync in adjusted branches. I was lost in the weeds for a long time. Eventually, drawing this diagram helped me find my way to a solution:

origin/master    adjusted/master     master
A                                    A
|--------------->A'                  |
|                |                   |
|                C'- - - - - - - - > C
B                                    |
|                                    |
|--------------->M'<-----------------|

After implementing that, syncing in adjusted branches seems to work much better now. And I've finally merged support for them into master.

There are still several bugs and race conditions and upgrade things to sort out around adjusted branches. Probably another week's work all told.

Posted Thu Mar 31 23:07:49 2016

Back from Libreplanet and a week of spring break. Backlog is not too bad for two weeks mostly away; 143 messages.

Finally got the OSX app updated for the git security fix yesterday. Had to drop builds for old OSX releases.

Getting back into working on adjusted branches now. Polishing up the UI and docs today. Nearly ready to merge the feature; the only blocker is there seems to be something a little bit wrong with how pulled changes are merged into the adjusted branch that I noticed in testing.

Posted Tue Mar 29 20:47:33 2016

Pushed out a git-annex release this morning mostly because of the recent git security fix. Several git-annex builds bundle a copy of git and needed to be updated. Note that the OSX autobuilder is temporarily down and so it's not been updated yet -- hopefully soon.

Posted Fri Mar 18 15:46:07 2016

Caught up with a few last things today, before I leave for a week in Boston.

Converted several places that ran git hash-object repeatedly to feed data to a running process. This sped up git-annex add in direct mode and with v6 unlocked files, by up to 2x.

Posted Mon Mar 14 20:53:27 2016

After a real brain-bender of a day, I have commit propagation from the adjusted branch back to the original branch working, without needing to reverse adjust the whole tree. This is faster, but the really nice thing is that it makes individual adjustments simpler to write.

In fact, it's so simple that I took 10 minutes just now to implement a second adjustment!

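adjustTreeItem HideMissingAdjustment h ti@(TreeItem _ _ s) = do
         -- Look up the annex key, if any, that this tree item points to.
         mk <- catKey s
         case mk of
                 -- Annexed file: keep it only when its content is present.
                 Just k -> ifM (inAnnex k)
                         ( return (Just ti)
                         , return Nothing
                         )
                 -- Not an annexed file: always keep it.
                 Nothing -> return (Just ti)

Posted Fri Mar 11 23:55:45 2016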

Over the weekend, I converted the linux "ancient" autobuilder to use stack. This makes it easier to get all the recent versions of all the haskell dependencies installed there.

Also, merged my no-ffi branch, removing some library code from git-annex and adding new dependencies. It's good to remove code.

Today, fixed the OSX dmg file -- its bundled gpg was broken. I pushed out a new version of the OSX dmg file with the fix.

With the recent incident in mind of malware inserted into the Transmission dmg, I've added a virus scan step to the release process for all the git-annex images. This way, we'll notice if an autobuilder gets a virus.

Also caught up on some backlog, although the remaining backlog is a little larger than I'd like at 135 messages.

Hope to work some more on adjusted branches this week. A few mornings ago, I had what may be a key insight about how to reverse adjustments when propagating changes back from the adjusted branch.

Posted Mon Mar 7 20:26:38 2016

Tuesday was spent dealing with lock files. Turned out there were some bugs in the annex.pidlock configuration that prevented it from working, and could even lead to data loss.

And then more lock files today, since I needed to lock git's index file the same way git does. This involved finding out how to emulate O_EXCL under Windows. Urgh.

Finally got back to working on adjusted branches today. And, I've just gotten syncing of commits from adjusted branches back to the original branch working! Time for a short demo of what I've been building for the past couple weeks:

joey@darkstar:~/tmp/demo>ls -l
total 4
lrwxrwxrwx 1 joey joey 190 Mar  3 17:09 bigfile -> .git/annex/objects/zx/X8/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f
joey@darkstar:~/tmp/demo>git annex adjust
Switched to branch 'adjusted/master(unlocked)'
ok
joey@darkstar:~/tmp/demo#master(unlocked)>ls -l
total 4
-rw-r--r-- 1 joey joey 1048576 Mar  3 17:09 bigfile

Entering the adjusted branch unlocked all the files.

joey@darkstar:~/tmp/demo#master(unlocked)>git mv bigfile newname
joey@darkstar:~/tmp/demo#master(unlocked)>git commit -m rename
[adjusted/master(unlocked) 29e1bc8] rename
 1 file changed, 0 insertions(+), 0 deletions(-)
  rename bigfile => newname (100%)
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline
29e1bc835080298bbeeaa4a9faf42858c050cad5 rename
a195537dc5beeee73fc026246bd102bae9770389 git-annex adjusted branch
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline master
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add

The commit was made on top of the commit that generated the adjusted branch. It's not yet reached the master branch.

joey@darkstar:~/tmp/demo#master(unlocked)>git annex sync
commit  ok
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline
b60c5d6dfe55107431b80382596f14f4dcd259c9 git-annex adjusted branch
9c36848f078a2bb7a304010e962a2b7318c0877c rename
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline master
9c36848f078a2bb7a304010e962a2b7318c0877c rename
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add

Now the commit has reached master. Notice how the history of the adjusted branch was rebased on top of the updated master branch as well.

joey@darkstar:~/tmp/demo#master(unlocked)>ls -l
total 1024
-rw-r--r-- 1 joey joey 1048576 Mar  3 17:09 newname
joey@darkstar:~/tmp/demo#master(unlocked)>git checkout master
Switched to branch 'master'
joey@darkstar:~/tmp/demo>ls -l
total 4
lrwxrwxrwx 1 joey joey 190 Mar  3 17:12 newname -> .git/annex/objects/zx/X8/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f

Just as we'd want, the file is locked in master, and unlocked in the adjusted branch.

(Not shown: git annex sync will also merge in and adjust changes from remotes.)

So, that all looks great! But, it's cheating a bit, because it locks all files when updating the master branch. I need to make it remember, somehow, when files were originally unlocked, and keep them unlocked. Also want to implement other adjustments, like hiding files whose content is not present.

Posted Thu Mar 3 21:20:53 2016

Pushed out a release today, could not resist the leap day in the version number, and also there were enough bug fixes accumulated to make it worth doing.

I now have git-annex sync working inside adjusted branches, so pulls get adjusted appropriately before being merged into the adjusted branch. Seems to mostly work well, I did just find one bug in it though. Only propagating adjusted commits remains to be done to finish my adjusted branches prototype.

Posted Mon Feb 29 21:40:37 2016

Now I have a proof of concept adjusted branches implementation, that creates a branch where all locked files are adjusted to be unlocked. It works!

Building the adjusted branch is pretty fast; around 2 thousand files per second. And, I have a trick in my back pocket that could double that speed. It's important this be quite fast, because it'll be done often.

Checking out the adjusted branch can be a bit slow though, since git runs git annex smudge once per unlocked file. So that might need to be optimised somehow. On the other hand, this should be done only rarely.

I like that it generates reproducible git commits so the same adjustments of the same branch will always have the same sha, no matter when and where it's done. Implementing that involved parsing git commit objects.
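
Why this can work: a commit's sha depends only on the bytes of the commit object, so it is reproducible as long as nothing like the current time or the local user's identity goes into those bytes. The sketch below assumes the adjusted commit reuses the author and committer lines of the commit it is based on; that is an assumption for illustration, and Commit, serializeCommit, and adjustedCommit are hypothetical names.

```haskell
-- | The fields that make up a git commit object.
data Commit = Commit
        { commitTree :: String      -- sha of the tree
        , commitParents :: [String] -- shas of the parent commits
        , commitAuthor :: String    -- "Name <email> epoch timezone"
        , commitCommitter :: String -- same format as the author
        , commitMessage :: String
        }

-- | Serialize a commit the way git stores it (object header omitted).
serializeCommit :: Commit -> String
serializeCommit c = unlines $
        [ "tree " ++ commitTree c ]
        ++ map ("parent " ++) (commitParents c)
        ++ [ "author " ++ commitAuthor c
           , "committer " ++ commitCommitter c
           , ""
           , commitMessage c
           ]

-- | Build the adjusted commit from the adjusted tree and the basis
-- commit, copying the basis commit's identity and date lines.
adjustedCommit :: String -> String -> Commit -> Commit
adjustedCommit adjustedTree basisSha basis = Commit
        { commitTree = adjustedTree
        , commitParents = [basisSha]
        , commitAuthor = commitAuthor basis
        , commitCommitter = commitCommitter basis
        , commitMessage = "git-annex adjusted branch"
        }
```

Since every input is derived from the basis commit and the adjusted tree, two machines doing the same adjustment produce byte-identical commit objects.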

Next step will be merging pulled changes into the adjusted branch, while maintaining the desired adjustments.

Posted Thu Feb 25 21:13:30 2016

Getting started on adjusted branches, taking a top-down and bottom-up approach. Yesterday I worked on improving the design. Today, built a git mktree interface that supports recursive tree generation and filtering, which is the low-level core of what's needed to implement the adjusted branches.

To test that, wrote a fun program that generates a git tree with all the filenames reversed.

import Git.Tree
import Git.CurrentRepo
import Git.FilePath
import Git.Types
import System.FilePath

main = do
        r <- Git.CurrentRepo.get
        (Tree t, cleanup) <- getTree (Ref "HEAD") r
        print =<< recordTree r (Tree (map reverseTree t))
        cleanup

reverseTree :: TreeContent -> TreeContent
reverseTree (TreeBlob f m s) = TreeBlob (reverseFile f) m s
reverseTree (RecordedSubTree f s l) = NewSubTree (reverseFile f) (map reverseTree l)

reverseFile :: TopFilePath -> TopFilePath
reverseFile = asTopFilePath . joinPath . map reverse . splitPath . getTopFilePath
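
For experimenting with the filename reversal alone, without git-annex's Git.* modules, a standalone variant might look like this. The name reversePath is hypothetical. It uses splitDirectories rather than splitPath, because splitPath keeps the trailing separator on each component, and that separator would be reversed along with the name.

```haskell
import System.FilePath (joinPath, splitDirectories)

-- | Reverse each component of a relative path while keeping the
-- directory structure, so "dir/file" becomes "rid/elif".
reversePath :: FilePath -> FilePath
reversePath = joinPath . map reverse . splitDirectories
```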

Also, fixed problems with the Android, Windows, and OSX builds today. Made a point release of the OSX dmg, because the last several releases of it will SIGILL on some hardware.

Posted Tue Feb 23 20:57:15 2016

Should mention that there was a release two days ago. The main reason for the timing of that release is that the Linux standalone builds include glibc, which recently had a nasty security hole and had to be updated.

Today, fixed a memory leak, and worked on getting caught up with backlog, which now stands at 112 messages.

Posted Fri Feb 19 21:09:44 2016

In a v6 repository on a filesystem not supporting symlinks, it makes sense for commands like git annex add and git annex import to add the files unlocked, since locked files are not usable there. After implementing that, I also added an annex.addunlocked config setting, so that the same behavior can be configured in other repositories.

Rest of the day was spent fixing up the test suite's v6 repository tests to work on FAT and Windows.

Posted Tue Feb 16 21:07:39 2016

Made a no-cbits branch that removes several things that use C code and the FFI. I moved one of them out to a new haskell library, http://hackage.haskell.org/package/mountpoints. Others were replaced with other existing libraries. This will simplify git-annex's build process, and more library use is good. Planning to merge this branch in a week or two.

v6 unlocked files don't work on Windows. I had assumed that since the build was succeeding, the test suite was passing there. But, it turns out the test suite was failing and somehow not failing the build. Have now fixed several problems with v6 on Windows. Still a couple test suite problems to address.

Posted Mon Feb 15 20:56:10 2016

This was one of those days where I somehow end up dealing with tricky filename encoding problems all day.

First, worked around inability for concurrent-output to display unicode characters when in a non-unicode locale. The normal trick that git-annex uses doesn't work in this case. Since it only affected -J, I decided to make git-annex detect the problem and make -J behave as if it was not built with the concurrent-output feature. So, it just doesn't display concurrent output, which is better than crashing with an encoding error.

The other problem affects v6 repos only. Seems that not all Strings will round trip through a persistent sqlite database. In particular, unicode surrogate characters are replaced with garbage. This is really a bug in persistent. But, for git-annex's purposes, it was possible to work around it, by detecting such Strings and serializing them differently.
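
A minimal sketch of that kind of workaround, assuming a made-up encoding rather than the one git-annex actually uses: strings containing surrogate characters are stored in an escaped, ASCII-only form, and a one-character tag records which form was used. All names here are hypothetical.

```haskell
import Data.Char (ord)
import Text.Read (readMaybe)

-- | True for code points in the unicode surrogate range.
isSurrogate :: Char -> Bool
isSurrogate c = let n = ord c in n >= 0xD800 && n <= 0xDFFF

-- | Store ordinary strings as-is behind a 'P' tag. Strings with
-- surrogates are stored behind an 'E' tag in the form produced by
-- show, which escapes every non-ASCII character.
encodeField :: String -> String
encodeField s
        | any isSurrogate s = 'E' : show s
        | otherwise = 'P' : s

-- | Reverse encodeField, failing on unrecognised input.
decodeField :: String -> Maybe String
decodeField ('P':s) = Just s
decodeField ('E':s) = readMaybe s
decodeField _ = Nothing
```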

Then I had to enhance git annex fsck to fix up repositories that were affected by that problem.

Posted Sun Feb 14 22:03:49 2016

Working on a design for adjusted branches. I've been kicking this idea around for a while to replace direct mode on crippled filesystems with v6 unlocked files. And the same thing would allow for hiding not present files. It's somewhat complicated, but the design I have seems like it would work.

Posted Tue Feb 9 19:37:34 2016

The 2015 git-annex user survey is over with, and I'm reading through it and comparing with the 2013 survey.

37% fewer users responded to the 2015 survey than in 2013. It's hard to tell if this has anything to do with the total number of git-annex users; Debian's popcon suggests the number of users has doubled since 2013, although its graph also suggests the number of users has flattened off since 2014. The difference may just be that I promoted the 2013 survey better than the 2015 survey, perhaps reaching kickstarter backers who I was in touch with back then.

25% use the assistant. Of those, 20% use XMPP, which is good to know as I'd like to get rid of it.

Android use has quadrupled, and Windows use has doubled; both are now at 4%. It's not surprising that Android and Windows users still think more porting work is needed for those OSes. iOS is the only unsupported OS that more than 1% of users want. Embedded and NAS systems were mentioned much less than in 2013; probably the arm tarball build met many such needs.

About the same percentage of users prefer direct mode in 2015 as did in 2013, and ditto for indirect mode. But, more users in 2015 only use direct mode on platforms that force its use. Correlating with the OS percentages suggests that many of these users are using removable media with the FAT filesystem, rather than an OS like Windows or Android. Hopefully v6 unlocked files will eventually better meet those users' needs.

The percent of users installing git-annex from source has halved since 2013, and it seems that builds from this website have taken up most of that slack; I would have expected more installs from Debian, Homebrew etc, but that seems not to have increased.

The number of repositories per user has gone up quite a lot since 2013, when only 7% of users had more than 10 repos. Now, 23% of users do. And, 2% of users have more than 100 repos! This probably involves both more repositories for different purposes, and cloning of repositories to more devices.

Similarly, the amount of data stored has gone up. 34% have more than 1 terabyte stored, up from 18% in 2013. 2% have more than 16 terabytes.

There are some indications of more users sharing repositories or otherwise using it in teams or larger groups, although most users still use it by themselves.

Users seem happier with git-annex now than in 2013. 16% call it "one of my favorite applications of all time". And, significantly fewer find it too hard to use than in 2013.

The main blocking problems are documentation, performance with many files (a general git problem), and various issues with the assistant. Respondents suggest more focus on making it easier for nontechnical users, and for use in larger groups/organizations.

Posted Fri Feb 5 22:30:14 2016

The same parser was used for both preferred content expressions and annex.largefiles. Reworked that today, splitting it into two distinct parsers. It doesn't make any sense to use terms like "standard" or "lackingcopies" in annex.largefiles, and such are now rejected.

That groundwork also let me add a feature that only makes sense for annex.largefiles, and not for preferred content expressions: Matching by mime type, such as mimetype=text/*
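
The idea of two vocabularies can be sketched as a check of each term against the kind of expression being parsed. This is a toy model covering only a handful of terms; ExprKind and checkTerm are hypothetical names, not git-annex's parser.

```haskell
import Data.List (isPrefixOf)

-- | Which kind of expression is being parsed.
data ExprKind = PreferredContent | LargeFiles
        deriving (Eq, Show)

-- | Accept a term only if it belongs to the vocabulary of the
-- given kind of expression.
checkTerm :: ExprKind -> String -> Either String String
checkTerm kind t
        | t `elem` commonWords || hasPrefix commonPrefixes = Right t
        | kind == PreferredContent && preferredOnly = Right t
        | kind == LargeFiles && hasPrefix ["mimetype="] = Right t
        | otherwise = Left (t ++ " is not valid in a " ++ show kind ++ " expression")
  where
        hasPrefix = any (`isPrefixOf` t)
        preferredOnly = t == "standard" || hasPrefix ["lackingcopies="]
        commonWords = ["and", "or", "not", "anything", "nothing"]
        commonPrefixes = ["include=", "exclude=", "largerthan=", "smallerthan="]
```

With this, checkTerm LargeFiles "standard" is rejected, while checkTerm LargeFiles "mimetype=text/*" is accepted.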

Posted Wed Feb 3 20:59:21 2016

For use cases that mix annexed files with files stored in git, the annex.largefiles config is more important in v6 repositories than before, since it configures the behavior of git add and even git commit -a. To make it possible to set annex.largefiles so it'll stick across clones of a repository, I have now made it be supported in .gitattributes files as well as git config.

Setting it in .gitattributes looks a little bit different, since the regular .gitattributes syntax can be used to match on the filename.

* annex.largefiles=(largerthan=100kb)
*.c annex.largefiles=nothing

It seems there's no way to make a git attribute value contain whitespace. So, more complicated annex.largefiles expressions need to use parens to break up the words.

* annex.largefiles=(largerthan=100kb)and(not(include=*.c))
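
Parens can stand in for whitespace because a tokenizer can treat each paren as a token of its own, which ends the word before it. A minimal sketch of such a tokenizer follows; tokenize is a hypothetical name and this is not git-annex's parser.

```haskell
import Data.Char (isSpace)

-- | Split an expression into tokens. Parentheses are tokens of their
-- own, so they separate words just as whitespace does.
tokenize :: String -> [String]
tokenize [] = []
tokenize (c:cs)
        | isSpace c = tokenize cs
        | c `elem` "()" = [c] : tokenize cs
        | otherwise =
                let (word, rest) = break isBoundary (c:cs)
                in word : tokenize rest
  where
        isBoundary x = isSpace x || x `elem` "()"
```

Apart from the extra paren tokens, the whitespace-free expression above tokenizes the same as largerthan=100kb and not include=*.c would.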

Posted Tue Feb 2 19:35:35 2016

Bugfix release of git-annex today. The release earlier this month had a bug that caused git annex sync --content to drop files that should be preferred content. So I had to rush out a fix after that bug was reported. (Some of the builds for the new release are still updating as I post this.)

In the past week I've been dealing with a blizzard. Snowed in for 6 days and counting. That has slightly back-burnered working on git-annex, and I've mostly been making enhancements that the DataLad project needs, along the lines of more commands supporting --batch and better --json output.

Posted Tue Jan 26 19:45:23 2016

After finally releasing git-annex 6 yesterday, I did some catching up today, and got the message backlog back down from 120 to 100.

By the way, the first OSX release of git-annex 6 was broken; I had to fix an issue on the builder and update the build. If you upgraded at the wrong time, you might find that git-annex doesn't run; if so reinstall it. I now have an account on a separate OSX machine from the build machine, that automatically tests the daily build, to detect such problems.

Posted Fri Jan 15 20:56:19 2016

Added git annex benchmark which uses the excellent Criterion to benchmark parts of git-annex. What I'm interested in benchmarking right now is the sqlite database that is used to manage v6 unlocked files, but having a built-in benchmark will probably have other uses later.

The benchmark results were pretty good; queries from the database are quite fast (60 microseconds warm cache) and scale well as the size increases. I did find one scalability issue, which was fixed by adding another index to the database. The kind of schema change that it's easy to make now, but that would be a painful transition if it had to be done once this was in wide use.

Posted Tue Jan 12 18:12:11 2016

Test suite is 100% green! Fixed one remaining bug it found, and solved the strange sqlite crash, which turned out to be caused by the test suite deleting its temporary repository before sqlite was done with the database inside it.

The only remaining blocker for using v6 unlocked files is a bad interaction with shared clones. That should be easy to fix, so release of git-annex version 6 is now not far away!

While I've only talked about v6/smudge stuff here lately, I have been fixing various other bugs along the way, and have accumulated a dozen bug fixes since the last release. Earlier this week I fixed a bug in git annex unused. Yesterday I noticed that git annex migrate didn't copy over metadata. Today, fixed a crash of git annex view in a non-unicode locale. Etc. So it'll be good not to have the release blocked any longer by v6 stuff.

Posted Fri Jan 8 20:35:32 2016

Been working hard on the last several test suite failures for v6 unlocked files. Now I've solved almost all of them, which is a big improvement to my confidence in its (almost) correctness.

Frustratingly, the test suite is still not green after all this work. There's some kind of intermittent failure related to the sqlite database. Only seems to happen when the test suite is running, and the error message is simply "Error" which is making it hard to track down..

Posted Thu Jan 7 22:03:50 2016

Got the test suite passing 100%, but then added a pass that uses v6 unlocked files and 30-some more failures appeared. Fixed a couple of the bugs today. After sprinting unexpectedly hard all December on v6, I need a change of pace, so I started digging into the website message backlog and fixed some bugs and posted some comments there.

Posted Fri Jan 1 21:51:00 2016

Automatic merge conflict resolver updated to work with unlocked files in v6 repos. Fairly tricky and painful; thank goodness the test suite tests a lot of edge cases in that code.

Posted Tue Dec 29 21:49:08 2015

If you've got some free holiday time, the v6 repository mode is now available in many of the daily builds, and there's documentation at unlocked files. It would be very useful now if you can give it a try. Use a clone or new repository for safety.

Yesterday I checked all parts of the code that special case direct mode, and found a few things that needed adjusting for v6 unlocked files. Today, I added the annex.thin config. Around 4 other major todo items need to be dealt with before this is ready for more than early adopters.

Posted Sun Dec 27 21:18:51 2015

Got unexpectedly far today on optimising the database that v6 repositories use to keep track of unlocked files. The database schema may still need optimization, but everything else to do with the database is optimised. Writes to the database are queued together. And reads to the database avoid creating the database if it doesn't exist yet. Which means v5 repos, and v6 repos with no unlocked files will avoid any database overhead.
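
Queueing writes together can be sketched as below. This covers only the write side, and it is an illustration under assumptions: WriteQueue, queueWrite, and flushQueue are hypothetical names, and the commit action stands in for whatever writes a batch to the database.

```haskell
import Control.Monad (unless, when)
import Data.IORef

-- | Pending writes (newest first), plus a count of batches committed.
data WriteQueue a = WriteQueue
        { pending :: IORef [a]
        , flushes :: IORef Int
        }

newWriteQueue :: IO (WriteQueue a)
newWriteQueue = WriteQueue <$> newIORef [] <*> newIORef 0

-- | Queue a write, flushing once the queue reaches the size limit.
queueWrite :: Int -> ([a] -> IO ()) -> WriteQueue a -> a -> IO ()
queueWrite limit commit q w = do
        ws <- atomicModifyIORef' (pending q) (\l -> let l' = w : l in (l', l'))
        when (length ws >= limit) $ flushQueue commit q

-- | Commit all pending writes as one batch, oldest first.
flushQueue :: ([a] -> IO ()) -> WriteQueue a -> IO ()
flushQueue commit q = do
        ws <- atomicModifyIORef' (pending q) (\l -> ([], reverse l))
        unless (null ws) $ do
                commit ws
                modifyIORef' (flushes q) (+ 1)
```

A final flushQueue before exiting makes sure a partly filled queue is not lost.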

Posted Wed Dec 23 23:41:22 2015

Today was mostly spent making the assistant support v6 repositories. That was harder than expected, because I have not touched this part of the assistant's code much in a long time, and there are lots of tricky races and edge cases to deal with.

The smudge branch has a 4500 line diff from master now, not counting documentation changes (another 500 lines). The todo list for it is shrinking slowly now. May not get it done before the new year.

Posted Wed Dec 23 00:38:21 2015

Two more days working on v6 and the smudge branch is almost ready to be merged. The test suite is passing again for v5 repos, and is almost passing for v6 repos. Also I decided to make git annex init create v5 repos for now, so git annex init --version=6 or a git annex upgrade is needed to get a v6 repo. So while I still have plenty of todo items for v6 repos, they are working reasonably well and almost ready for early adopters.

The only real blocker to merging it is that the database stuff used by v6 is not optimised yet and probably slow, and even in v5 repos it will query the database. I hope to find an optimisation that avoids all database overhead unless unlocked files are used in a v6 repo.

I'll probably make one more release before that is merged though. Yesterday I fixed a small security hole in git annex repair, which could expose the contents of an otherwise not world-readable repository to local users.

BTW, the 2015 git-annex user survey closes in two weeks, please go fill it out if you haven't yet done so!

Posted Wed Dec 16 21:05:32 2015

New special remote alert! Chris Kastorff has made a special remote supporting Backblaze's B2 storage service.

And I'm still working on v6 unlocked files. After beating on it for 2 more days, all git-annex commands should support them. There is still plenty of work to do on testing, upgrading, optimisation, merge conflict resolution, and reconciling staged changes.

Posted Fri Dec 11 20:25:58 2015

Well, another day working on smudge filters, or unlocked files as the feature will be known when it's ready. Got both git annex get and git annex drop working for these files today.

Get was the easy part; it just has to hard link or copy the object to the work tree file(s) that point to it.

Handling dropping was hard. If the user drops a file, but it's unlocked and modified, it shouldn't reset it to the pointer file. For this, I reused the InodeCache stuff that was built for direct mode. So the sqlite database tracks the InodeCaches of unlocked files, and when a key is dropped it can check if the file is modified.

But that's not a complete solution, because when git uses a smudge filter, it will write the file itself, and git-annex won't have an InodeCache for it. To handle this case, git-annex will fall back to verifying the content of the file when dropping it if its InodeCache isn't known. Bit of a shame to need an expensive checksum to drop an unlocked file; maybe the git smudge filter interface will eventually be improved to let git-annex use it more efficiently.
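
The decision described above can be sketched as a pure function. This is an illustration under assumptions: the InodeCache fields and the names Verdict and checkUnlocked are hypothetical, and git-annex's real InodeCache may record different data.

```haskell
-- | Cheap-to-read file metadata, compared to detect modification.
data InodeCache = InodeCache
        { inode :: Integer
        , size :: Integer
        , mtime :: Integer
        } deriving (Eq, Show)

data Verdict
        = Unmodified    -- safe to replace the file with a pointer
        | Modified      -- leave the file alone
        | NeedsChecksum -- nothing recorded; verify the content instead
        deriving (Eq, Show)

-- | Decide whether a work tree file still holds the key's content.
-- The first argument is the cache recorded in the database, if any;
-- the second is the file's current metadata.
checkUnlocked :: Maybe InodeCache -> InodeCache -> Verdict
checkUnlocked Nothing _ = NeedsChecksum
checkUnlocked (Just recorded) current
        | recorded == current = Unmodified
        | otherwise = Modified
```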

Anyway, smudged aka unlocked files are working now well enough to be a proof of concept. I have several missing safety checks that need to be added to get the implementation to be really correct, and quite a lot of polishing still to do, including making unlock, lock, fsck, and merge handle them, and finishing repository upgrade code.

Posted Wed Dec 9 22:14:31 2015

Made a lot of progress today. Implemented the database mapping a key to its associated files. As expected this database, when updated by the smudge/clean filters, is not always consistent with the current git work tree. In particular, commands like git mv don't update the database with the new filename. So queries of the database will need to do some additional work first to get it updated with any staged changes. But the database is good enough for a proof of concept, I hope.
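
A toy model of such a mapping, and of bringing it up to date before a query, might look like this. It is a sketch using an in-memory Data.Map rather than a database, and AssociatedFiles, Staged, and reconcile are hypothetical names.

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

type Key = String
type AssociatedFiles = M.Map Key (S.Set FilePath)

-- | Record that a work tree file points at a key.
addFile :: Key -> FilePath -> AssociatedFiles -> AssociatedFiles
addFile k f = M.insertWith S.union k (S.singleton f)

-- | A staged change that the filters never saw, such as a git mv.
data Staged = Renamed FilePath FilePath

-- | Apply staged changes to the mapping before querying it.
reconcile :: [Staged] -> AssociatedFiles -> AssociatedFiles
reconcile changes m = foldl apply m changes
  where
        apply acc (Renamed old new) = M.map (rename old new) acc
        rename old new fs
                | old `S.member` fs = S.insert new (S.delete old fs)
                | otherwise = fs

-- | The work tree files associated with a key.
filesFor :: Key -> AssociatedFiles -> [FilePath]
filesFor k = maybe [] S.toList . M.lookup k
```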

Then I got git-annex commands treating smudged files as annexed files. So this works:

joey@darkstar:~/tmp/new>git annex init
init  ok
(recording state in git...)
joey@darkstar:~/tmp/new>cp ~/some.mp3 .
joey@darkstar:~/tmp/new>git add some.mp3
joey@darkstar:~/tmp/new>git diff --cached
diff --git a/some.mp3 b/some.mp3
new file mode 100644
index 0000000..2df8868
--- /dev/null
+++ b/some.mp3
@@ -0,0 +1 @@
+/annex/objects/SHA256E-s191213--e4b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.mp3
joey@darkstar:~/tmp/new>git annex whereis some.mp3
whereis some.mp3 (1 copy) 
    7de17427-329a-46ec-afd0-0a088f0d0b1b -- joey@darkstar:~/tmp/new [here]
ok

get/drop don't yet update the smudged files, and that's the next step.
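
The staged content in the diff above is a small pointer to the key. Recognising such a pointer could be sketched as follows, assuming only the format visible in that diff (the /annex/objects/ prefix followed by the key on the first line). The name pointerKey is hypothetical.

```haskell
import Data.List (stripPrefix)

-- | Extract the key from the content of a pointer file, or Nothing
-- if the content does not look like a pointer.
pointerKey :: String -> Maybe String
pointerKey content = case lines content of
        (l:_) -> case stripPrefix "/annex/objects/" l of
                Just k | not (null k) -> Just k
                _ -> Nothing
        [] -> Nothing
```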

Posted Mon Dec 7 21:25:32 2015

I've gotten git-annex working as a smudge/clean filter today in the smudge branch. It works ok in a local git repository. git add lets git-annex decide if it wants to annex a file's content, and checking out branches and other git commands involving those files works pretty well.

It can sometimes be slow; git's smudge interface necessarily needs to copy the content of files around, particularly when checking out files, and so it's never going to be as fast as the good old git-annex symlink approach. Most of the slow parts are things that can't be done in direct mode repos though, like switching branches, so that isn't a regression.

No git-annex commands to manage the annexed content work yet. That will need a key to worktree file mapping to be maintained, and implementing that mapping and ensuring it's always consistent is probably going to be the harder part of this.

Also there's the question of how to handle upgrades from direct mode repositories. This will be an upgrade from annex.version 5 to 6, and you won't want to do it until all computers that have clones of a repository have upgraded to git-annex 6.x, since older versions won't be able to work with the upgraded repository. So, the repository upgrade will need to be run manually initially, and it seems I'll need to keep supporting direct mode for v5 repos in a transition period, which will probably be measured in years.

Posted Fri Dec 4 22:05:32 2015

Spent a couple of days catching up on backlog, and my backlog is down to 80 messages now. Lowest in recent memory.

Made the annex.largefiles config be honored by git annex import, git annex addurl, and even git annex importfeed.

Planning to dive into smudge filters soon. The design seems ready to go, although there is some complication in needing to keep track of mappings between worktree files and annex keys.

Posted Wed Dec 2 20:02:25 2015

I'm considering ways to get rid of direct mode, replacing it with something better implemented using smudge filters.

git-lfs

I started by trying out git-lfs, to see what I can learn from it. My feeling is that git-lfs brings an admirable simplicity to using git with large files. For example, it uses a push-hook to automatically upload file contents before pushing a branch.

But its simplicity comes at the cost of being centralized. You can't make a git-lfs repository locally and clone it onto another drive and have the local repositories interoperate to pass file contents around. Everything has to go back through a centralized server. I'm willing to pay complexity costs for decentralization.

Its simplicity also means that the user doesn't have much control over what files are present in their checkout of a repository. git-lfs downloads all the files in the work tree. It doesn't have facilities for dropping files to free up space, or for configuring a repository to only want to get a subset of files in the first place. Some of this could be added to it I suppose.

I also noticed that git-lfs uses twice the disk space, at least when initially adding files. It keeps a copy of the file in .git/lfs/objects/, in addition to the copy in the working tree. That copy seems to be necessary due to the way git smudge filters work, to avoid data loss. Of course, git-annex manages to avoid that duplication when using symlinks, and its direct mode also avoids that duplication (at the cost of some robustness). I'd like to keep git-annex's single local copy feature if possible.

replacing direct mode

Anyway, as smudge/clean filters stand now, they can't be used to set up git-annex symlinks; their interface doesn't allow it. But, I was able to think up a design that uses smudge/clean filters to cover the same use cases that direct mode covers now.

Thanks to the clean filter, adding a file with git add would check in a small file that points to the git-annex object.

In the same repository, you could also use git annex add to check in a git-annex symlink, which would protect the object from modification, in the good old indirect mode way. git annex lock and git annex unlock could switch a file between those two modes.

So this allows mixing directly writable annexed files and locked down annexed files in the same repository. All regular git commands and all git-annex commands can be used on both sorts of files. Workflows could develop where a file starts out unlocked, but once it's done, is locked to prevent accidental edits and archived away or published.

That's much more flexible than the current direct mode, and I think it will be able to be implemented in a simpler, more scalable, and robust way too. I can lose the direct mode merge code, and remove hundreds of lines of other special cases for direct mode.

The downside, perhaps, is that for a repository to be usable on a crippled filesystem, all the files in it will need to be unlocked. A file can't easily be unlocked in one checkout and locked in another checkout.

Posted Mon Nov 23 20:56:38 2015

Monday: Some finishing touches on the pid locking support, and released 5.20151116. After the release I noticed that concurrent downloads didn't always include a progress meter, and made the necessary changes to fix that.

Wednesday: This was a day of minor bug fixing and responding to questions etc. Message backlog got down below 90, not bad.

Thursday: I've been distracted from coding today with an idea of making some new stickers. Hexagonal this time, and even better, composable... So they can show git-annex getting as big as you want. ;)

hex.svg

The design is done, see stickers, and seems to work well, and even better is easy to modify. May find time to get these printed at some point.

Posted Thu Nov 19 23:09:41 2015

Got the pid locks working pretty easily, as expected.

But then... Detoured into some truly insane behavior of the Lustre filesystem. It seems that Lustre is perfectly happy to let link() succeed even when there's a file there that it would overwrite. Rather than overwriting the file, Lustre picks an even more crazy way to violate POSIX.. It lets there be 2 files in a directory with the same name, but different contents. Has to be seen to be believed:

hess$ ls pidlock
-r--r--r--  1 hess root    70 Nov 13 15:07 pidlock
-r--r--r--  1 hess root    70 Nov 13 15:07 pidlock
hess$ rm pidlock; ls pidlock
-r--r--r--  1 hess root    74 Nov 13 14:35 pidlock  

git-annex's pid locking code now detects this and seems to work even on Lustre. Eep.

I'm clutching my "NO WARRANTY" disclaimer pretty hard though, if anyone wants to use git-annex on Lustre. When POSIX is being violated this badly, it's hard to anticipate what other strangeness might result.

Posted Fri Nov 13 20:35:05 2015

Been working today on getting git-annex to fall back from nice posix fcntl locks to pid locks when the former are not supported. There will be an annex.pidlock to control this. Mostly useful, I think, for networked file systems like NFS and Lustre. While these do support posix locks, I guess it can be hard sometimes to get some big server configured appropriately, especially when you don't admin it and just want to use git-annex there.

Of course, the fun part about pid locks is that it can be pretty hard to tell if one is stale or not. Especially when using a networked filesystem, because then the pid in question can be running on a different computer.

Even if you do figure out that a pid lock is stale, how do you then take over a stale pid lock, without racing with another process that also wants to take it over? This was the truly tricky question of the day.

I have a possibly slightly novel approach to solve that: Put a more modern lock file someplace else (eg, /dev/shm) and use that lock file to lock the pid lock file. Then you can tell if a local pid lock file is stale quickly locally, and take it over safely. Of course, if the pid lock is not held by a local process, this still has to fall back to the inevitable retry-and-timeout-and-fail.
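
The decision logic of that approach can be sketched as a pure function. Everything here is an assumption made for illustration: the "hostname pid" lock file format, and the names PidLock, Action, and decide. The side lock itself is represented only by a Bool saying whether it could be taken.

```haskell
import Text.Read (readMaybe)

-- | Who holds a pid lock, as read from the lock file.
data PidLock = PidLock { lockHost :: String, lockPid :: Int }
        deriving (Eq, Show)

-- | Parse a lock file containing a hostname and a pid.
parsePidLock :: String -> Maybe PidLock
parsePidLock s = case words s of
        [h, p] -> PidLock h <$> readMaybe p
        _ -> Nothing

data Action
        = TakeOver          -- local holder is gone; the lock is stale
        | HeldLocally       -- a local process still holds the side lock
        | RetryUntilTimeout -- holder is on another machine
        deriving (Eq, Show)

-- | Decide what to do about an existing pid lock, given the local
-- hostname and whether the local side lock could be taken.
decide :: String -> Bool -> PidLock -> Action
decide localHost sideLockFree l
        | lockHost l /= localHost = RetryUntilTimeout
        | sideLockFree = TakeOver
        | otherwise = HeldLocally
```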

I hope the result will work pretty well, although git-annex will not support as fine-grained concurrency when using pid locks. Will find out tomorrow when I run today's code! ;)

Posted Thu Nov 12 22:25:14 2015

Some work today on improving the standalone linux builds and the git-annex-standalone.deb. Also, improved fsck's behavior when asked to fsck a dead repo, and fixed some places in the assistant where configured ssh-options were not used. Backlog is back down to 95.

Posted Tue Nov 10 21:11:55 2015

Finally concurrent progress bars are working! After all the groundwork, it was really easy to add, under a dozen lines of code.

I've found several bugs while testing commands in -Jn mode, and the rest of today was spent fixing them. Two of them affected concurrent git annex add; the worst narrowly avoided being a data loss bug.

Posted Fri Nov 6 20:03:15 2015

Spent my time today porting concurrent-output to Windows, fixing a tricky problem with error handling/thread joining with git-annex -J, and improving the concurrent state handling to support the git command queue. Got add/addurl working in concurrent mode. No concurrent progress bars yet.. maybe tomorrow?

Posted Thu Nov 5 22:57:52 2015

Got git-annex using concurrent-output today. It works beautifully. Since the library is new, git-annex has to be explicitly configured to use it, so it'll be a while until this is available in regular builds.

There are no progress bars yet in concurrent output mode, but that will change soon.. Probably tomorrow.

Posted Wed Nov 4 22:38:04 2015

Today started with getting a release of git-annex out, to deal with a new version of the aws library, which broke the build. That also added support to the S3 remotes for creating Google Nearline buckets, although only when git-annex is built with the newest version of the aws library.

Rest of the day (and most of the past weekend) I've been working on the concurrent-output library. Today I finished making it support multi-line regions, and color, and even fully optimised its console updates to use minimal bandwidth. So, it's got everything git-annex can possibly need to display those troublesome concurrent actions. Will be starting to make git-annex use it soon!

Posted Tue Nov 3 15:15:24 2015

Things have been relatively quiet on git-annex this week. I've been distracted with other projects. But, a library that I developed for propellor to help with concurrent console output has been rapidly developing into a kind of tiling region manager for the console, which may be just the thing git-annex needs on the concurrent download progress display front.

After seeing it could go that way, and working on it around the clock to add features git-annex will need, here's a teaser of its abilities.

Probably coming soonish to a git-annex -J near you!

Posted Sat Oct 31 01:47:24 2015

The first release of git-annex was 5 years ago.

There have been a total of 187 releases, growing to 50k lines of haskell code developed by 28 contributors (and another 10 or so external special remote contributors). Approximately 2000 people have posted questions, answers, bugs, todos, etc to this website, with 18900 posts in total.

I've been funded for 3 of the 5 years to work on git-annex, with support from 1451 individuals and 6 organizations.

Released a new version today with rather more significant changes than usual (see recent devblog entries).

The 2015 git-annex user survey is now live.

Posted Mon Oct 19 19:59:14 2015

Feeling kind of ready to cut the next release of git-annex, but am giving the recent large changes just a little time to soak in and make sure they're ok.

Yesterday, changed the order that git annex sync --content and the assistant do drops. When dropping from the local repo and also some remotes, it now makes more sense to drop from the remotes first, and only then the local repo. There are scenarios where that order lets content be dropped from all the places that it should be, while the reverse order doesn't.

Today, caught up on recent bug reports, including fixing a bad merge commit that was made when git merge failed due to filenames not supported by a crippled filesystem, and cleaning up a network transport warning that was displayed incorrectly. Also developed a patch to the aws library to support google nearline when creating buckets.

Posted Thu Oct 15 20:50:12 2015

Well, I've spent all week making git annex drop --from safe.

On Tuesday I got a sinking feeling in my stomach, as I realized that there was a hole in git-annex's armor to prevent concurrent drops from violating numcopies or even losing the last copy of a file. The bug involved an unlikely race condition, and for all I know it's never happened in real life, but still this is not good.

Since this is a potential data loss bug, expect a release pretty soon with the fix. And, there are 2 things to keep in mind about the fix:

  1. If a ssh remote is using an old version of git-annex, a drop may fail. Solution will be to just upgrade the git-annex on the remote to the fixed version.
  2. When a file is present in several special remotes, but not in any accessible git repositories, dropping it from one of the special remotes will now fail, where before it was allowed.

    Instead, the file has to be moved from one of the special remotes to the git repository, and can then safely be dropped from the git repository.

    This is a worrisome behavior change, but unavoidable.

Solving this clearly called for more locking, to prevent concurrency problems. But, at first I couldn't find a solution that would allow dropping content that was only located on special remotes. I didn't want to make special remotes need to involve locking; that would be a nightmare to implement, and probably some existing special remotes don't have any way to do locking anyway.

Happily, after thinking about it all through Wednesday, I found a solution, that while imperfect (see above) is probably the best one feasible. If my analysis is correct (and it seems so, although I'd like to write a more formal proof than the ad-hoc one I have so far), no locking is needed on special remotes, as long as the locking is done just right on the git repos and remotes. While this is not able to guarantee that numcopies is always preserved, it is able to guarantee that the last copy of a file is never removed. And, numcopies will always be preserved except for when this rare race condition occurs.

So, I've been implementing that all of yesterday and today. Getting it right involves building up 4 different kinds of evidence, which can be used to make sure that the last copy of a file can't possibly end up being dropped, no matter what other concurrent drops could be happening. I ended up with a very clean and robust implementation of this, and a 2,000 line diff.

Whew!

Posted Fri Oct 9 22:06:16 2015

Lots of porting work ongoing recently:

  • I've been working with Goeke on building git-annex on Solaris/SmartOS. Who knows, this may lead to a binary distribution in some way, but to start with I got the disk free space code ported to Solaris, and have seen git-annex work there.
  • Jirib has also been working on that same disk free code, porting it to OpenBSD. Hope to land an updated patch for that.
  • Yury kindly updated the Windows autobuilder to a new Haskell Platform release, and I was able to land the winprocfix branch that fixes ssh password prompting in the webapp on Windows.
  • The arm autobuilder is fixed and back in its colo, and should be making daily builds again.

Posted Sun Oct 4 20:12:58 2015

While at the DerbyCon security conference, I got to thinking about verifying objects that git-annex downloads from remotes. This can be expensive for big files, so git-annex has never done it at download time, instead deferring it to fsck time. But, that is a divergence from git, which always verifies checksums of objects it receives. So, it violates least surprise for git-annex to not verify checksums too. And this could weaken security in some use cases.

So, today I changed that. Now whenever git-annex accepts an object into .git/annex/objects, it first verifies its checksum and size. I did add a setting to disable that and get back the old behavior: git config annex.verify false, and there's also a per-remote setting if you want to verify content from some remotes but not others.
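
To illustrate the kind of check involved, here is a rough Python sketch (git-annex itself is Haskell, and this is not its code). It assumes a SHA-256 based key whose expected size and hex digest are already known; verify_object and accept_object are hypothetical names, and the verify parameter merely stands in for the annex.verify setting.

```python
import hashlib
import os


def verify_object(path, expected_size, expected_sha256):
    """Return True if the file has the expected size and SHA-256 digest."""
    if os.path.getsize(path) != expected_size:
        return False  # cheap check first; avoids hashing a truncated file
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest() == expected_sha256


def accept_object(tmp_path, dest_path, expected_size, expected_sha256, verify=True):
    """Move a downloaded file into place, but only after it verifies."""
    if verify and not verify_object(tmp_path, expected_size, expected_sha256):
        return False
    os.replace(tmp_path, dest_path)
    return True
```

Checking the size before the checksum means an obviously incomplete download is rejected without reading the whole file.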

Posted Thu Oct 1 20:18:10 2015

I've mostly been chewing through old and new bug reports and support requests the past several days. The backlog is waaay low now -- only 82 messages! Just in time for me to go on another trip, to Louisville on Thursday.

Amazon S3 added an "Infrequent Access" storage class last week, and I got a patch into the haskell-aws library to support that, as well as partially supporting Google Nearline. That patch was accepted today, and git-annex is ready to use the new version of the library as soon as it's released.

At the end of today, I found myself rewriting git annex status to parse and adjust the output of git status --short. This new method makes it much more capable than before, including displaying Added files.
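
For a feel of what parsing that output involves, here is a small Python sketch. It relies only on the documented shape of git status --short lines (a two-character XY status, a space, then the path) and ignores renames and quoted paths. Collapsing XY into a single letter is my own simplification, not necessarily the adjustment git annex status makes.

```python
def parse_status_short(output):
    """Turn `git status --short` output into (code, path) pairs.

    "??" (untracked) becomes "?"; otherwise the first non-blank status
    character is used.
    """
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # too short to hold "XY path"
        xy, path = line[:2], line[3:]
        code = "?" if xy == "??" else xy.strip()[:1]
        entries.append((code, path))
    return entries
```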

Posted Tue Sep 22 21:47:44 2015

Made the release this morning, first one in 3 weeks. A fair lot of good stuff in there.

Just in time for the release, git-annex has support for Ceph. Thanks to Mesar Hameed for building the external special remote!

Posted Thu Sep 17 03:36:26 2015

Seems that Git for Windows was released a few weeks ago, replacing msysgit. There were a couple problems using git-annex with that package of git, which I fixed on Thursday. The next release of git-annex won't work with msysgit any longer though; only with Git for Windows.

On Friday, I improved the Windows package further, making it work even when git is not added to the system PATH. In such an installation, git-annex will now work inside the "git bash" window, and I even got the webapp starting from the menu working without git in PATH.


In other dependency fun, the daily builds for Linux got broken due to a glibc bug in Debian unstable/testing, which makes the bundled curl and ssh segfault. With some difficulty I tracked that down, and it turns out the bug has been fixed upstream for quite a while. The daily builds are now using the fixed glibc 2.21.


Today, got back to making useful improvements, rather than chasing dependencies. Improved the bash completion for remotes and backends, made annex.hardlink be used more, and made special remotes that are configured with autoenable=true get automatically enabled by git annex init.

Posted Mon Sep 14 19:10:05 2015

Today was a scramble to get caught up after weeks away. Got the message backlog down from over 160 to 123. Fixed two reversions, worked around a strange bug, implemented support for the gpg.program configuration, and made several smaller improvements.

Posted Wed Sep 9 22:12:16 2015

Did some work on Friday and Monday to let external special remotes be used in a readonly mode. This lets files that are stored in the remote be downloaded by git-annex without the user needing to install the external special remote program. For this to work, the external special remote just has to tell git-annex the urls to use. This was developed in collaboration with Benjamin Gilbert, who is developing gcsannex, a Google Cloud Storage special remote.

Today, got caught up with recent traffic, including fixing a couple of bugs. The backlog remains in the low 90's, which is a good place to be as I prepare for my August vacation week in the SF Bay Area, followed by a week for ICFP and the Haskell Symposium in Vancouver.

Posted Wed Aug 19 19:10:56 2015

Been doing a little bit of optimisation work. Which meant, first improving the --debug output to show fractions of a second, and show when commands exit.

That let me measure what takes up time when downloading files from ssh remotes. Found one place I could spawn a thread to run a cleanup action, and this simple change reduced the non-data-transfer overhead to 1/6th of what it had been!

Posted Thu Aug 13 20:23:51 2015

Catching up on weekend's traffic, and preparing for a release tomorrow.

Found another place where the optparse-applicative conversion broke some command-line parsing; using git-annex metadata to dump metadata recursively got broken. This is the second known bug caused by that transition, which is not too surprising given how large it was.

Tracked down and fixed a very tricky encoding problem with metadata values.

The arm autobuilder broke so it won't boot; got a serial console hooked up to it and looks like a botched upgrade resulting in a udev/systemd/linux version mismatch.

Posted Tue Aug 11 23:24:16 2015

The SHA-3 specification was released yesterday; git-annex got support for using SHA-3 hashes today. I had to add support for building with the new cryptonite library, as cryptohash doesn't (correctly) implement SHA-3 yet. Of course, nobody is likely to find a use for this for years, since SHA-2 is still perfectly fine, but it's nice to get support for new hashes in early. :)

Posted Thu Aug 6 22:34:16 2015

Took a half day and worked on making it simpler to set up ssh remotes. The complexity I've gotten rid of: there's no longer any need to take any action to get a ssh remote initialized as a git-annex repository. Where before, either git-annex init needed to be run on the remote, or a git-annex branch manually pushed to it, now the remote can simply be added and git annex sync will do the rest. This needed git-annex-shell changes, so will only work once servers are upgraded to use a newer version of git-annex.

Posted Wed Aug 5 18:41:44 2015

Ended up spending most of today working on git annex proxy. It had a lot of buggy edge cases, which are all cleaned up now.

Spent another couple hours catching up on recent traffic and fixing a couple other misc bugs.

Posted Tue Aug 4 20:28:05 2015

Work today has started in the git-annex bug tracker, but the real bugs were elsewhere. Got a patch into hinotify to fix its handling of filenames received from inotify events when used in a non-unicode locale. Tracked down why gitlab's git-annex-shell fails to initialize gcrypt repositories, and filed a bug on gitlab-shell.

Yesterday, I got the Android autobuilder fixed. I had started upgrading it to new versions of yesod etc, 2 months ago, and something in those new versions led to character encoding problems that broke the template haskell splicing. Had to throw away the work done for that upgrade, but at least it's building again, at last.

Posted Mon Aug 3 19:43:14 2015

Made a release this morning, mostly because the release earlier this week turns out to have accidentally removed several options from git annex copy.

Spent some time this afternoon improving how git-annex shuts down when --time-limit is used. This used to be a quick and dirty shutdown, similar to if git-annex were ctrl-c'd, but I reworked things so it does a clean shutdown, including running any buffered git commands. This made incremental fsck with --time-limit resume much better, since it saves the incremental fsck database on shutdown. Also tuned when the database gets checkpointed during an incremental fsck, to resume better after it's interrupted.
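
The shape of a clean shutdown on a time limit can be sketched like this in Python. This is only an illustration under assumptions: process stands for whatever is done per file, and flush stands for running buffered git commands and saving the incremental fsck database; neither name comes from git-annex.

```python
import time


def run_with_time_limit(items, process, flush, limit_seconds, clock=time.monotonic):
    """Process items until the time limit passes, then shut down cleanly.

    The limit is only checked between items, and flush always runs at the
    end, even if processing raised, so buffered state is not lost.
    Returns the number of items processed.
    """
    deadline = clock() + limit_seconds
    done = 0
    try:
        for item in items:
            if clock() >= deadline:
                break  # out of time: stop taking on new work
            process(item)
            done += 1
    finally:
        flush()
    return done
```

Because the flush happens on every exit path, a later run can resume from the saved state instead of starting over.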

Posted Fri Jul 31 20:57:10 2015

Made a release today, with recent work, including the optparse-applicative transition and initial gitlab.com support in the webapp.

I had time before the release to work out most of the wrinkles in the gitlab.com support, but was not able to get gcrypt encrypted repos to work with gitlab, for reasons that remain murky. Their git-annex-shell seems to be misbehaving somehow. Will need to get some debugging assistance from the gitlab.com developers to figure that out.

Posted Mon Jul 27 20:29:55 2015

I've been working on adding GitLab support to the webapp for the past 3 days.

That's not the only thing I've been working on; I've continued to work on the older parts of the backlog, which is now shrunk to 91 messages, and made some minor improvements and bugfixes.

But, GitLab support in the webapp has certainly taken longer than I'd have expected. Only had to write 82 lines of GitLab specific code so far, but it went slowly. The user will need to cut and paste repository url and ssh public key back and forth between the webapp and GitLab for now. And the way GitLab repositories use git-annex makes it a bit tricky to set up; in one case the webapp has to do a forced push dry run to check if the repository on GitLab can be accessed by ssh.

I found a way to adapt the existing code for setting up a ssh server to also support GitLab, so beyond the repo url prompt and ssh key setup, everything will be reused. I have something that works now, but there are lots of cases to test (encrypted repositories, enabling existing repositories, etc), so will need to work on it a bit more before merging this feature.

Also took some time to split the centralized git repository tutorial into three parts, one for each of GitHub, GitLab, and self-administered servers.


The git-annex package in Debian unstable hasn't been updated for 8 months. This is about to change; Richard Hartmann has stepped up and is preparing an upload of a recent version. Yay!

Posted Wed Jul 22 22:04:21 2015

Worked on bash tab completion some more. Got "git annex" to also tab complete. However, for that to work perfectly when using bash-completion to demand-load completion scripts, a small improvement is needed in git's own completion script, to have it load git-annex's completion script. I sent a patch for that to the git developers, and hopefully it'll get accepted soon.

Then fixed a relatively long-standing bug that prevented uploads to chunked remotes from resuming after the last successfully uploaded chunk.
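
Resuming after the last successfully uploaded chunk amounts to something like the following Python sketch. The has_chunk and store_chunk callbacks are hypothetical stand-ins for asking the remote about a chunk and storing one; how git-annex actually names and checks chunks is not shown here.

```python
def resume_upload(data, chunk_size, has_chunk, store_chunk):
    """Upload data in chunks, skipping the leading chunks already stored.

    Returns the index of the first chunk that had to be uploaded.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    start = 0
    while start < len(chunks) and has_chunk(start):
        start += 1  # this chunk survived the interrupted upload
    for n in range(start, len(chunks)):
        store_chunk(n, chunks[n])
    return start
```

Only the unbroken run of chunks at the start is trusted; everything from the first missing chunk onward is sent again.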

Posted Thu Jul 16 19:19:20 2015

Worked through the rest of the changes this weekend and morning, and the optparse-applicative branch has landed in master, including bash completion support.

Posted Mon Jul 13 20:57:53 2015

Day 3 of the optparse-applicative conversion.
116 files changed, 1607 insertions(+), 1135 deletions(-)
At this point, everything is done except for around 20 sub-commands. Probably takes 15 minutes work for each. Will finish plowing through it in the evenings.

Meanwhile, made the release of version 5.20150710. The Android build for this version is not available yet, since I broke the autobuilder last week and haven't fixed it yet.

Posted Fri Jul 10 21:58:14 2015

Now working on converting git-annex to use optparse-applicative for its command line parsing. I've wanted to do this for a long time, because the current code for options is generally horrible, runs in IO, and is not at all type safe, while optparse-applicative has wonderful composable parsers and lets each subcommand have its own data type representing all its options.

What pushed me over the edge is that optparse-applicative has automatic bash completion!

# source <(git-annex --bash-completion-script `which git-annex`)
# git-annex fsck -
--all                   --key                   -S
--from                  --more                  -U

Since nobody has managed to write a full bash completion for git-annex before, let alone keep it up-to-date with changes to the code, automating the problem away is a really nice win. :)

The conversion is a rather huge undertaking; the diff is already over 3000 lines large after 8 hours of work, and I'm maybe 1/3rd done, with the groundwork laid (except for global options still todo) and a few subcommands converted. This won't land for this week's release; it'll need a lot of testing before it'll be ready for any release.

Posted Thu Jul 9 00:25:19 2015

Mostly spent today getting to older messages in the backlog. This did result in a few fixes, but with 97 old messages left, I can feel the diminishing returns setting in, to try to understand old bug reports that are often unclear or lacking necessary info to reproduce them.

By the way, if you feel your bug report or question has gotten lost in my backlog, the best thing to do is post an update to it, and help me reproduce it, or clarify it.

Moved on to looking through todo, which was a more productive way to find useful things to work on.

Best change made today is that git annex unused can now be configured to look at the reflog. So, old versions of files are considered still used until the reflog expires. If you've wanted a way to only delete (or move away) unused files after they get to a certain age, this is a way to do that ...

Posted Tue Jul 7 21:38:16 2015

Now caught up on nearly all of my backlog of messages, and indeed am getting to some messages that have been waiting for months. Backlog is down to 113! Couple of bugfixes resulted, and many questions answered.

Think I'll spend a couple more days dealing with the older part of the backlog. Then, when that reaches diminishing returns, I'll move on to some big change. I have been thinking about caching database on and off..

Posted Mon Jul 6 20:46:38 2015

Back, and have spent all day focusing on new bug reports. All told, I fixed 4 bugs, followed up on all other bugs reported while I was away, and fixed the android autobuilder.

The message backlog started the day at 250 or something, and is down to 178 now. Looks like others have been following up to forum posts while I was away (thanks!) so those should clear quickly.

Posted Fri Jul 3 03:09:38 2015

Well, not the literal last push, but I've caught up on as much backlog as I can (142 messages remain) and spent today developing a few final features before tomorrow's release.

Some of the newer things displayed by git annex info were not included in the --json mode output. The json includes everything now.

git annex sync --all --content will make it consider all known annexed objects, not only those in the current work tree. By default that syncs all versions of all files, but of course preferred content can tune what repositories want.

To make that work well with preferred content settings like "include=*.mp3", it makes two passes. The first pass is over the work tree, so preferred content expressions that match files by name will work. The second pass is over all known keys, and preferred content expressions that don't care about the filename can match those keys.

Two passes feels a bit like a hack, but it's a lot better than --all making nothing be synced when a preferred content expression matches against filenames... I actually had to resort to bloom filters to make the two passes work.
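
One way the two passes and a bloom filter could fit together is sketched below in Python. This is a guess at the structure, not git-annex's code: the filter remembers keys seen in the first pass so the second pass can skip them without holding every key in memory. The wants callback stands in for the preferred content check, and the BloomFilter class is a toy written for this example.

```python
import hashlib


class BloomFilter:
    """A tiny bloom filter: no false negatives, rare false positives."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for p in self._positions(item):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.array[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def two_pass_sync(worktree, all_keys, wants):
    """Return the keys to sync.

    worktree is a list of (filename, key); all_keys is every known key;
    wants(filename_or_None, key) decides whether the content is wanted.
    """
    seen = BloomFilter()
    to_sync = []
    # Pass 1: work tree files, so filename-based expressions can match.
    for filename, key in worktree:
        seen.add(key)
        if wants(filename, key):
            to_sync.append(key)
    # Pass 2: all known keys, skipping those already handled in pass 1.
    for key in all_keys:
        if key in seen:
            continue
        if wants(None, key):
            to_sync.append(key)
    return to_sync
```

A false positive in the filter would make the second pass skip a key it should have considered, which is the price of the small memory footprint.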

This new feature led to some slightly tricky follow-on changes to the standard groups preferred content expressions.

Posted Tue Jun 16 22:56:34 2015

Ever since git annex fsck --all was added, people have complained that there's no way to stop it complaining about keys whose content is gone for good. Well, there is now: git annex dead --key can be used when you know that a key is no longer available and want fsck to stop complaining about it.

Running fsck on a directory will intentionally still complain about files in the directory with missing contents, even if the keys have been marked dead.

The crucial part was finding a good way to store the information; luckily location log files are parsed in a way that lets it be added there without breaking backwards compatibility. A bonus is that adding a key's content back to the annex will automatically bring it back from the dead.

I'm pondering making git annex drop --force automatically mark a key as dead when the last copy is dropped, but I don't know if it's too DWIM or worth the complication. Another approach would be to let fsck mark keys as dead, but that would certainly need an extra flag.

Posted Tue Jun 9 19:59:26 2015

Now git-annex can be used to set up a public S3 remote. If you've cloned a repository that knows about such a remote, you can use the S3 remote without needing any S3 credentials. Read-only of course.

This tip shows how to do it: public Amazon S3 remote

One rather neat way to use this is to configure the remote with encryption=shared. Then, the files stored in S3 will be encrypted, and anyone with access to the git repository can get and decrypt the files.

This feature will work for at least AWS S3, and for the Internet Archive's S3. It may work for other S3 services, that can be configured to publish their files over unauthenticated http. There's a publicurl configuration setting to allow specifying the url when using a service that git-annex doesn't know the url for.

Actually, there was a hack for the IA before, that added the public url to an item when it was uploaded to the IA. While that hack is now not necessary, I've left it in place for now, to avoid breaking anything that depended on it.

Posted Fri Jun 5 20:39:24 2015

Worked thru some backlog. Currently stands at 152 messages.

Merged work from Sebastian Reuße to teach the assistant to listen for systemd-networkd dbus events when the network connection changes.

Added git annex get --incomplete, which can be used to resume whatever it was you were downloading earlier and interrupted, that you've forgotten about. ;)


The Isuma Media Players project is using git-annex to "create a two-way, distributed content distribution network for communities with poor connexions to the internet". My understanding is this involves places waaay up North.

Reading over their design docs is quite interesting, both to see how they've leveraged things like git-annex metadata and preferred content expressions and the assistant, and areas where git-annex falls short.

Between DataLad, Isuma, Baobáxia, IA.BAK, and more, there are a lot of projects being built on top of git-annex now!

Posted Tue Jun 2 19:54:36 2015

On Friday I installed the CubieTruck that is the new autobuilder for arm. This autobuilder is hosted at WetKnee Books, so its physical security includes a swamp.

The hardware is not fast, but it's faster and far more stable than qemu arm emulation. By Saturday I got the build environment all installed nicely, including building libraries that use template haskell!

But, ghc crashed with an internal error building git-annex. I upgraded to ghc 7.10.1 (which took another day), but it also crashed. Was almost giving up, but I looked at the ghc parameters, and -j2 stuck out in them. Removed the -j2, and the build works w/o crashing! \o/ (Filed a bug report on ghc.)


Anarcat has been working on improving the man pages, including lots of linking to related commands.

The 2015 Haskell Communities and Activities Report is out, and includes an entry for git-annex for the first time!

Posted Mon Jun 1 22:53:42 2015

After a less active than usual week (dentist), I made a release last Friday. Unfortunately, it turns out that the Linux standalone builds in that release don't include the webapp. So, another release is planned tomorrow.

Yesterday and part of today I dug into the windows ssh webapp password entry broken reversion. Eventually cracked the problem; it seems that different versions of ssh for Windows do different things in an isatty check, and there's a flag that can be passed when starting ssh to make it not see a controlling tty. However, this needs changes to the process library, which db48x and I have now coded up. So a fix for this bug is waiting on a new release of that library. Oh well.

Rest of today was catching up on recent traffic, and improving the behavior of git annex fsck when there's a disk IO error while checksumming a file. Now it'll detect a hardware fault exception, and take that to mean the file is bad, and move it to the bad files directory, instead of just crashing.

I need better tooling to create disk IO errors on demand. Yanking disks out works, but is a blunt instrument. Anyone know of good tools for that?

Posted Wed May 27 21:06:58 2015

There's something rotten in POSIX fcntl locking. It's not composable, or thread-safe.

The most obvious problem with it is that if you have 2 threads, and they both try to take an exclusive lock of the same file (each opening it separately) ... They'll both succeed. Unlike 2 separate processes, where only one can take the lock.

Then the really crazy bit: If a process has a lock file open and fcntl locked, and then the same process opens the lock file again, for any reason, closing the new FD will release the lock that was set using the other FD.

So, that's a massive gotcha if you're writing complex multithreaded code. Or generally for composition of code. Of course, C programmers deal with this kind of thing all the time, but in the clean world of Haskell, this is a glaring problem. We don't expect to need to worry about this kind of unrelated side effect that breaks composition and thread safety.

After noticing this problem affected git-annex in at least one place, I have to assume there could be more. And I don't want to need to worry about this problem forever. So, I have been working today on a clean fix that I can cleanly switch all my lock-related code to use.

One reasonable approach would be to avoid fcntl locking, and use flock. But, flock works even less well on NFS than fcntl, and git-annex relies on some fcntl locking features. On Linux, there's an "open file description locks" feature that fixes POSIX fcntl locking to not have this horrible wart, but that's not portable.

Instead, my approach is to keep track of which files the process has locked. If it tries to do something with a lockfile that it already has locked, it avoids opening the same file again, and instead implements its own in-process locking behavior. I use STM to do that in a thread-safe manner.
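
The idea of tracking locks inside the process can be sketched in Python as follows. This swaps STM for a plain mutex, handles only exclusive non-blocking locks, and uses a hypothetical LockPool class; it is an illustration of the approach, not a port of git-annex's code.

```python
import fcntl
import os
import threading


class LockPool:
    """Remember which lock files this process holds, so no file is ever
    opened a second time while it is fcntl locked."""

    def __init__(self):
        self._guard = threading.Lock()  # stands in for STM in this sketch
        self._held = {}  # real path -> fd that holds the fcntl lock

    def try_lock(self, path):
        """Take an exclusive lock; return False if it is already held,
        whether by this process or by another one."""
        path = os.path.realpath(path)
        with self._guard:
            if path in self._held:
                return False  # in-process conflict, decided without reopening
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError:
                os.close(fd)  # safe: this process holds no lock on the file
                return False
            self._held[path] = fd
            return True

    def unlock(self, path):
        """Release a lock taken with try_lock."""
        path = os.path.realpath(path)
        with self._guard:
            fd = self._held.pop(path)
            os.close(fd)  # closing the only fd releases the fcntl lock
```

Since every open of a lock file goes through the pool, two threads can no longer both "succeed" at taking the same exclusive lock, and no stray close can drop a lock held elsewhere in the process.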

I should probably break out git-annex's lock file handling code as a library. Eventually.. This was about as much fun as a root canal, and I'm having a real one tomorrow. :-/


git-annex is now included in Stackage!

Daniel Kahn Gillmor is doing some work on reproducible builds of git-annex.

Posted Tue May 19 20:01:18 2015

Today I added a feature to git annex unused that lets the user tune which refs they are interested in using. Annexed objects that are used only by other refs are then considered unused.

Did a fairly complicated refspec format for this, with globs and include/exclude of refs. Example:

+refs/heads/*:+HEAD^:+refs/tags/*:-refs/tags/old-tag
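
To show how such a refspec might be evaluated, here is a Python sketch. The semantics are my assumption from the example: directives are separated by colons and applied left to right, + adds matching refs, - removes them, and a + entry without glob characters (like HEAD^) is kept literally. The apply_refspec name is made up, and fnmatch globbing may differ in detail from what git-annex does.

```python
from fnmatch import fnmatchcase


def apply_refspec(refspec, all_refs):
    """Return the refs selected by a +/- refspec, in order of selection."""
    selected = []
    for directive in refspec.split(":"):
        sign, pattern = directive[:1], directive[1:]
        if sign == "+":
            if any(c in pattern for c in "*?["):
                matches = [r for r in all_refs if fnmatchcase(r, pattern)]
            else:
                matches = [pattern]  # not a glob: take it as written
            for ref in matches:
                if ref not in selected:
                    selected.append(ref)
        elif sign == "-":
            selected = [r for r in selected if not fnmatchcase(r, pattern)]
        else:
            raise ValueError(f"directive must start with + or -: {directive!r}")
    return selected
```

With the example above, all branches, HEAD^, and every tag except old-tag would end up selected.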

I think that, since Google dropped openid support, there seems to have been less activity on this website. Although possibly also a higher signal to noise ratio. :) I have been working on some ikiwiki changes to make it easier for users who don't have an openid to contribute. So git-annex's website should soon let you log in and make posts with just an email address.

People sometimes ask for a git-annex mailing list. I wouldn't mind having one, and would certainly subscribe, but don't see any reason that I should be involved in running it.

Posted Thu May 14 19:57:19 2015

Implemented git annex drop --all. This also added for free drop with --unused and --key, which overlap with git annex dropunused and git annex dropkey.

The concurrentprogress branch had gone too long without being merged, and had a lot of merge conflicts. I resolved those, and went ahead and merged it into master. However, since the ascii-progress library is not ready yet, I made it a build flag, and it will build without it by default. So, git annex get -J5 can be used now, but no progress bars will display yet.

When doing concurrent downloads, either with the new -J or by hand by running multiple processes, there was a bug in the diskreserve checking code. It didn't consider the disk space that was in the process of being used by other concurrent downloads, so would let more downloads start up than there was space for.

I was able to fix this pretty easily, thanks to the transfer log files. Those were originally added just to let the webapp display transfers, but proved very helpful here!
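
The corrected check boils down to arithmetic like this Python sketch. The function and its parameters are hypothetical; in particular, the in_progress list stands in for whatever is read from the transfer log files.

```python
def can_start_download(free_bytes, reserve_bytes, size, in_progress):
    """Decide whether a new download fits without eating into the reserve.

    in_progress is a list of (total_size, bytes_so_far) for transfers that
    other threads or processes are already running.
    """
    # Space those transfers will still consume before they finish.
    pending = sum(max(total - so_far, 0) for total, so_far in in_progress)
    return free_bytes - pending - size >= reserve_bytes
```

For example, with 1000 bytes free, a reserve of 100, and another download that still needs 500 bytes, a 400 byte file just fits but a 401 byte file does not.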

Finally, made .git/annex/transfer/failed/ files stop accumulating when the assistant is not being used. Looked into also cleaning up stale .git/annex/transfer/{upload,download}/ files (from interrupted transfers). But, since those are used as lock files, it's difficult to remove them in a concurrency safe way.

Update: Unfortunately, I turned out to have stumbled over an apparent bug in haskell's implementation of file locking. https://github.com/haskell/unix/issues/44 Had to work around that.

Happily, the workaround also let me implement cleanup of stale transfer info files, left behind when a git-annex process was interrupted. So, .git/annex/transfer/ will entirely stop accumulating cruft!

Posted Tue May 12 20:36:53 2015

Lazy afternoon spent porting git-annex to build under ghc 7.10. Required rather a lot of changes to build, and even more to build cleanly after the AMP transition.

Unfortunately, ghc 7.10 has started warning about every line that uses tab for indentation. I had to add additional cruft to turn those warnings off everywhere, and cannot say I'm happy about this at all.

Posted Sun May 10 20:33:21 2015

Got the release out after more struggling with ssh on windows and a last minute fix to the quvi support.

The downloads.kitenet.net git annex repository had accumulated 6 gb of past builds that were not publicly available. I am publishing those on the Internet Archive now, so past builds can be downloaded using git-annex in that repository in the usual way. This worked great! :)

I have ordered a CubieTruck with 2 gb of ram to use for the new Arm builder. Hosting still TBD.

Looks like git-annex is almost ready to be included in stackage, which will make building it from source much less likely to fail due to broken libraries etc.

Posted Fri May 8 23:01:03 2015

I've not been blogging, but have been busy this week. Backlog is down to 113 messages.

Tuesday: I got a weird bug report where git annex get was deleting a file. This turned out to be a bug in wget ftp://... where it would delete a symlink that was not where it had been told to download the file to. I put a workaround in git-annex; wget is now run in a temp directory. But this was a legitimate wget bug, and it's now been reported to the wget developers and will hopefully get fixed there.

Wednesday: Added a --batch mode for several plumbing commands (contentlocation, examinekey, and lookupkey). This avoids startup overhead, and so lets a lot of queries be done much faster. The implementation should make it easy to add --batch to more plumbing commands as needed, and could probably extend to non-plumbing commands too.

Today: The first 5 hours involved an incompatible mess of ssh and rsync versions on Windows. A Gordian knot of brokenness and dependency hell. I finally found a solution which involves downgrading the cygwin rsync to an older version, and using msysgit's ssh rather than cygwin's.

Finished up today with more post-Debian-release changes. Landed a patch to switch from dataenc to sandi that had been waiting since 2013, and got sandi installed on all the git-annex autobuilders. Finished up with some prep for a release tomorrow.


Finally, Debian has a new enough ghc that it can build template haskell on arm! So, whenever a new version of git-annex finally gets into Debian (I hope soon), the webapp will be available on arm for those arm laptops. Yay!

This also means I have the opportunity to make the standalone arm build be done much more simply. Currently it involves qemu and a separate companion native mode container that it has to ssh to and build stuff, that has to have the same versions of all libraries. It's just enormously complicated and touchy. With template haskell building support, all that complexity can fall away.

What I'd really like to do is get a fast-ish arm box with 2gb of ram hosted somewhere, and use that to do the builds, in native mode. Anyone want to help provide such a box for git-annex arm autobuilds?

Posted Thu May 7 23:39:39 2015

Reduced activity this week (didn't work on the assistant after all), but several things got done:

Monday: Fixed fsck --fast --from remote to not fail when the remote didn't support fast copy mode. And dealt with an incompatibility in S3 bucket names; the old hS3 library supported upper-case bucket names but the new one needs them all in lower case.

Wednesday: Caught up on most recent backlog, made some improvements to error handling in import, and improved integration with KDE's file manager to work with newer versions.

Today: Made import --deduplicate/--clean-duplicates actively verify that enough copies of a file exist before deleting it. And, thinking about some options for batch mode access to git-annex plumbing, to speed up things that use it a lot.

Posted Thu Apr 30 19:53:43 2015

Posted a design for balanced preferred content. This would let preferred content expressions assign each file to N repositories out of a group, selected using Math. Adding a repository could optionally be configured to automatically rebalance the files (not very bandwidth efficiently though). I think some have asked for a feature like this before, so read the design and see if it would be useful for you.

Spent a while debugging a problem with a S3 remote, which seems to have been a misconfiguration in the end. But several improvements came out of it to make it easier to debug S3 in the future etc.

Posted Thu Apr 23 20:34:05 2015

I hope that today's git-annex release will be landing in Debian unstable toward the end of the month. And I'm looking forward to some changes that have been blocked by wanting to keep git-annex buildable on Debian 7.

Yesterday I got rid of the SHA dependency, switching git-annex to use a newer version of cryptohash for HMAC generation (which its author Vincent Hanquez kindly added to it when I requested it, waay back in 2013). I'm considering using the LambdaCase extension to clean up a lot of the code next, and there are 500+ lines of old yesod compatibility code I can eventually remove.

These changes and others will prevent backporting to the soon to be Debian oldstable, but the standalone tarball will still work there. And, the git-annex-standalone.deb that can be installed on any version of Debian is now available from the NeuroDebian repository, and its build support has been merged into the source tree.

In the run up to the release today, I also dealt with getting the Windows build tested and working, now that it's been updated to newer versions of rsync, ssh, etc from Cygwin. Had to add several more dlls to the installer. That testing also turned up a case where git-annex init could fail, which got a last-minute fix.

PS, scroll down this 10 year of git timeline and see what you find!

Posted Mon Apr 20 22:56:12 2015

Recent work has included improving fsck --from remote (and fixing a reversion caused by the relative path changes in January), and making annex.diskreserve be checked in more cases. And added a git annex required command for setting required content.

Also, I want to thank several people for their work:

  • Roy sent a patch to enable http proxy support.. despite having only learned some haskell by "30 mins with YAHT". I investigated that more, and no patch is actually necessary, but just a newer version of the http-client library.
  • CandyAngel has been posting lots of helpful comments on the website, including this tip that significantly speeds up a large git repository.
  • Øyvind fixed a lot of typos throughout the git-annex documentation.
  • Yaroslav has created a git-annex-standalone.deb package that will work on any system where debian packages can be installed, no matter how out of date it is (within reason), using the same methods as the standalone tarball.

Posted Sat Apr 18 20:15:19 2015

Mostly working on Windows recently. Fixed handling of git repos on different drive letters. Fixed crazy start menu loop. Worked around strange msysgit version problem.

Also some more work on the concurrentprogress branch, making the progress display prettier.

Added one nice new feature yesterday: git annex info $dir now includes a table of repositories that are storing files in the directory, with their sizes.

repositories containing these files: 
    288.98 MB: ca9c5d52-f03a-11df-ac14-6b772ffe59f9 -- archive-5
    288.98 MB: f1c0ce8d-d848-4d21-988c-dd78eed172e8 -- archive-8
     10.48 MB: 587b9ccf-4548-4d6f-9765-27faecc4105f -- darkstar
     15.18 kB: 42d47daa-45fd-11e0-9827-9f142c1630b3 -- origin

Nice thing about this feature is it's done for free, with no extra work other than a little bit of addition. All the heavy location lookup work was already being done to get the numcopies stats.
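The "little bit of addition" can be sketched like this, in Python rather than git-annex's Haskell. The input shape (a size plus the list of repositories holding each file) is an assumption made for illustration.

```python
from collections import defaultdict

def repo_sizes(files):
    """files: iterable of (size_in_bytes, [repository uuids holding the file])."""
    totals = defaultdict(int)
    for size, locations in files:
        for uuid in locations:
            totals[uuid] += size
    # Largest first, like the table above.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```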

Posted Tue Apr 14 20:44:28 2015

Back working on git annex get --jobs=N today. It was going very well, until I realized I had a hard problem on my hands.

The hard problem is that the AnnexState structure at the core of git-annex is not able to be shared among multiple threads at all. There's too much complicated mutable state going on in there for that to be feasible.

In the git-annex assistant, which uses many threads, I long ago worked around this problem, by having a single shared AnnexState and when a thread needs to run an Annex action, it blocks until no other thread is using it. This worked ok for the assistant, with a little bit of thought to avoid long-duration Annex actions that could stall the rest of it.

That won't work for concurrent get etc. I spent a while investigating maybe making AnnexState thread safe, but it's just not built for it. Too many ways that can go wrong. For example, there's a CatFileHandle in the AnnexState. If two threads are running, they can both try to talk to the same git cat-file --batch command at once, with bad results. Worse yet, some parts of the code do things like modifying the AnnexState's Git repo to add environment variables to use when running git commands.

It's not all gloom and doom though. Only very isolated parts of the code change the working directory or set environment variables. And the assistant has surely smoked out other thread concurrency problems already. And, separate git-annex programs can be run concurrently with no problems at all; it uses file locking to avoid different processes getting in each-others' way. So AnnexState is the only remaining obstacle to concurrency.

So, here's how I've worked around it: When git annex get -J10 is run, it will start by allocating 10 job slots. A fresh AnnexState will be created, and copied into each slot. Each time a job runs, it uses its slot's own AnnexState. This means 10 git cat-file processes, and maybe some contention over lock files, but generally, a nice, easy, and hopefully trouble-free multithreaded mode.

And indeed, I've gotten git annex get -J10 working robustly! And from there it was trivial to enable -J for move and copy and mirror too!
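The job slot idea can be sketched as follows, in Python rather than git-annex's Haskell. `run_jobs`, `new_state` and `action` are hypothetical names; the point is only that each slot owns a private state object, so nothing mutable is shared between threads.

```python
import queue
import threading

def run_jobs(jobs, nslots, new_state, action):
    """Run action(state, job) for each job, using nslots worker threads.

    Each slot gets its own state object from new_state(), so no state is
    ever shared between two threads.
    """
    pending = queue.Queue()
    for job in jobs:
        pending.put(job)
    results = []
    results_lock = threading.Lock()

    def worker():
        state = new_state()  # this slot's private state
        while True:
            try:
                job = pending.get_nowait()
            except queue.Empty:
                return
            outcome = action(state, job)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(nslots)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The cost is the one described above: every slot carries its own copy of whatever the state holds, such as a git cat-file process.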

The only real blocker to merging the concurrentprogress branch is some bugs in the ascii-progress library that make it draw very scrambled progress bars the way git-annex uses it.

Posted Fri Apr 10 21:17:29 2015

I've had to release git-annex twice this week to fix reversions. On Monday, just after I made a planned release, I discovered a bug in it, and had to update it with a .1 release. Today's release fixes 2 other reversions introduced by recent changes, both only affecting the assistant.

Before making today's release, I did a bunch of other minor bugfixes and improvements, including adding a new contentlocation plumbing command. This release also changes git annex add when annex.largefiles is configured, so it will git add the non-large files. That is particularly useful in direct mode.

I feel that the assistant needs some TLC, so I might devote a week to it in the latter part of this month. My current funding doesn't cover work on the assistant, but I should have some spare time toward the end of the month.

Posted Thu Apr 9 20:40:55 2015

Rethought distributed fsck. It's not really a fsck, but an expiration of inactive repositories, where fscking is one kind of activity. That insight let me reimplement it much more efficiently. Rather than updating all the location logs to prove it was active, git annex fsck can simply and inexpensively update an activity log. It's so cheap it'll do it by default. The git annex expire command then reads the activity log and expires (or unexpires) repositories that have not been active in the desired time period. Expiring a repository simply marks it as dead.
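A minimal sketch of the expire step, in Python rather than git-annex's Haskell. The data shapes are assumptions: the activity log is modelled as a dict of last-activity times, and "dead" as a plain set of repository names.

```python
def expire(activity_log, dead, now, max_idle):
    """Return the new set of dead repositories.

    activity_log maps repository name to the time (seconds) of its last
    recorded activity; dead is the set currently marked dead.
    """
    result = set(dead)
    for repo, last_active in activity_log.items():
        if now - last_active > max_idle:
            result.add(repo)       # expire: mark as dead
        else:
            result.discard(repo)   # unexpire: seen recently enough
    return result
```

Repositories with no entry in the activity log are left as they were in this sketch; how git-annex treats them is not covered here.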

Yesterday, finished making --quiet really be quiet. That sounds easy, but it took several hours. On the concurrentprogress branch, I have ascii-progress hooked up and working, but it's not quite ready for prime time.

Posted Sun Apr 5 16:59:20 2015

I've started work on parallel get. Today, laid the groundwork in two areas:

  1. Evaluated the ascii-progress haskell library. It can display multiple progress bars in the terminal, portably, and its author Pedro Tacla Yamada has kindly offered to improve it to meet git-annex's needs.

    I ended up filing 10 issues on it today, around 3 of them are blockers for git-annex using it.

  2. Worked on making --quiet more quiet. Commands like rsync and wget need to have their progress output disabled when run in parallel.

    Didn't quite finish this yet.


Yesterday I made some improvements to how git-annex behaves when it's passed a massive number of directories or files on the command line. Eg, when driven by xargs. There turned out to be some bugs in that scenario.

One problem I kind of had to paper over. While git-annex get normally is careful to get the files in the same order they were listed on the command line, it becomes very expensive to expand directories using git-ls-files, and reorder its output to preserve order, when a large number of files are passed on the command line. There was an O(N*M) time blowup.

I worked around it by making it only preserve the order of the first 100 files. Assumption being that if you're specifying so many files on the command line, you probably have less of an attachment to their ordering. :)
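One way to picture the workaround, in Python rather than git-annex's Haskell. This is an assumption about the general shape, not the actual implementation: only the first `limit` command-line parameters are matched against the expanded file list, and everything else is passed through in whatever order the expansion produced.

```python
def ordered_files(params, found, limit=100):
    """params: paths given on the command line, in order.
    found: files reported by the directory expansion, in its own order.
    Only the first `limit` params have their ordering preserved.
    """
    def under(path, param):
        return path == param or path.startswith(param.rstrip("/") + "/")

    remaining = list(found)
    result = []
    for param in params[:limit]:
        matched = [f for f in remaining if under(f, param)]
        result.extend(matched)
        remaining = [f for f in remaining if not under(f, param)]
    # Everything else keeps whatever order the expansion produced.
    return result + remaining
```

Capping the loop at `limit` parameters bounds the reordering cost at `limit` passes over the file list, instead of one pass per parameter.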

Posted Fri Apr 3 21:02:06 2015

Added two options to git annex fsck that allow for a form of distributed fsck. This is useful in situations where repositories cannot be trusted to continue to exist, and cannot be checked directly, but you'd still like to keep track of their status. iabackup is one use case for this.

By running a periodic fsck with the --distributed option, the repositories can verify that they still exist and that the information about their contents is still accurate. This is done by doing an extra update of the location log each time a file is verified by fsck to still be in the repository.

The other option looks like --expire="30d somerepo:60d". It checks that each specified repository has recorded a distributed fsck within the specified time period. If not, the repository is dropped from the location tracking log. Of course it can always update that later if it's really still around.
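A sketch of how a value like "30d somerepo:60d" could be read, in Python. The interpretation is an assumption: a bare period is treated as the default for all repositories and `name:period` as a per-repository override, and only the `d` (days) unit from the example is handled.

```python
def parse_expire(spec):
    """Parse a spec like "30d somerepo:60d" into (default_days, {repo: days})."""
    default = None
    per_repo = {}
    for word in spec.split():
        repo, sep, period = word.rpartition(":")
        if not period.endswith("d"):
            raise ValueError("only day units are handled in this sketch: " + word)
        days = int(period[:-1])
        if sep:
            per_repo[repo] = days
        else:
            default = days
    return default, per_repo
```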

Distributed fsck is not the default because those extra location log updates increase the size of the git-annex branch. I did one thing to keep the size increase small: An identical line is logged for each key, including the timestamp, so git's delta compression will work as well as is possible. But, there's still commit and tree update overhead.

Probably doesn't make sense to run distributed fscks too often for that and other reasons. If the git-annex branch does get too large, there's always git annex forget ...

(Update: This was later rethought and works much more efficiently now..)

Posted Wed Apr 1 21:54:00 2015

Turns out that git has a feature I didn't know about; it will expand wildcards and other stuff in filenames passed to many git commands. This is on top of the shell's expansion.

That led to some broken behavior by git annex add 'foo.*', and it could lead to other probably unwanted behavior, like git annex drop 'foo[barred]' dropping a file named food in addition to foo[barred].

For now, I've disabled this git feature throughout git-annex. If you relied on it for something, let me know, I might think about adding it back in specific places where it makes sense.
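The surprise with 'foo[barred]' can be reproduced with Python's fnmatch, used here only as an approximation of git's glob matching (git additionally matches the literal name, which fnmatch does not). Git itself has a --literal-pathspecs option that turns this expansion off; whether git-annex uses exactly that option is not stated in this post.

```python
from fnmatch import fnmatchcase

pattern = "foo[barred]"
names = ["food", "foob", "foo[barred]", "foo"]

# Treated as a glob, [barred] is a character class: one of b, a, r, e, d.
glob_matches = [n for n in names if fnmatchcase(n, pattern)]

# Treated literally, only the exact filename matches.
literal_matches = [n for n in names if n == pattern]

print(glob_matches)     # ['food', 'foob']
print(literal_matches)  # ['foo[barred]']
```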


Improved git annex importfeed to check the itemid of the feed and avoid re-downloading a file with the same itemid. Before, it would add duplicate files if a feed kept the itemid the same, but changed the url. This was easier than expected because annex.genmetadata already caused the itemid to be stored in the git-annex metadata. I just had to make it check the itemid metadata, and set itemid even when annex.genmetadata isn't set.
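The itemid check amounts to something like the following Python sketch. The dict keys and the two `known_*` sets are assumptions for illustration; in git-annex the known itemids come from the stored metadata.

```python
def items_to_download(feed_items, known_itemids, known_urls):
    """feed_items: list of dicts with "itemid" (may be None) and "url"."""
    wanted = []
    for item in feed_items:
        itemid = item.get("itemid")
        if itemid is not None:
            seen = itemid in known_itemids    # same item, even if url changed
        else:
            seen = item["url"] in known_urls  # fall back to the url
        if not seen:
            wanted.append(item)
    return wanted
```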


Also got 4 other bug reports fixed, even though I feel I'm taking it easy today. It's good to be relaxed again!

Posted Tue Mar 31 20:02:08 2015

While I plowed through a lot of backlog the past several days, I still have some 120 messages piled deep.

That work did result in a number of improvements, culminating in a rather rushed release of version 5.20150327 today, to fix a regression affecting git annex sync when using the standalone linux tarballs. Unfortunately, I then had to update those tarballs a second time after the release as the first fix was incomplete.

And, I'm feeling super stressed out. At this point, I think I should step away until the end of the month. Unfortunately, this will mean more backlog later. Including lots of noise and hand-holding that I just don't seem to have time for if I want to continue making forward progress.

Maybe I'll think of a way to deal with it while I'm away. Currently, all I have is that I may have to start ignoring irc and the forum, and de-prioritizing bug reports that don't have either a working reproduction recipe or multiple independent confirmations that it's a real bug.

Posted Fri Mar 27 23:12:59 2015

While traveling for several days, I filled dead time with a rather massive reorganization of the git-annex man page, and I finished that up this morning.

That man page had gotten rather massive, at around 3 thousand lines. I split out 87 man pages, one for each git-annex command. Many of these were expanded with additional details, and have become a lot better thanks to the added focus and space. See for example, git-annex-find, or any of the links on the new git-annex man page. (Which is still over 1 thousand lines long..)

Also, git annex help <command> can be used to pull up a command's man page now!

I'm taking the rest of the day off to R&R from the big trip north, and expect to get back into the backlog of 143 messages starting tomorrow.

Posted Wed Mar 25 16:18:56 2015

Spent a couple of days at Dartmouth hanging out in the neuroscience department with the Datalad developers. Added several new plumbing commands and a new post-update-annex hook, based on their feedback of how they're using git-annex.

Posted Sat Mar 21 13:33:30 2015

Caught up with most of the recent backlog today. Was not very bad.

Fixed remotedaemon to support gcrypt remotes, which was never quite working before.

Seem to be on track to making a release tomorrow with a whole month's changes.

Posted Mon Mar 16 20:13:43 2015

After an intense week away, I didn't mean to work on git-annex today, but I got sucked back in..

Worked on some plumbing commands for mass repository creation. Made fromkey be able to read a stream of files to create from stdin. Added a new registerurl plumbing command, that reads a stream of keys and urls from stdin.

Posted Sun Mar 15 20:50:25 2015

Did a deep dive into ipfs last night. It has great promise.

As a first step toward using it with git-annex, I built an experimental ipfs special remote. It has some nice abilities; any ipfs address can be downloaded to a file in the repository:

git annex addurl ipfs:QmYgXEfjsLbPvVKrrD4Hf6QvXYRPRjH5XFGajDqtxBnD4W --file somefile

And, any file in the git-annex repository can be published to the world via ipfs, by simply using git annex copy --to ipfs. The ipfs address for the file is then visible in git annex whereis.

Had to extend the external special remote protocol slightly for that, so that ipfs addresses can be recorded as uris in git-annex, and will show up in git annex whereis.

Posted Thu Mar 5 21:07:40 2015

Fixed a mojibake bug that affected metadata values that included both whitespace and unicode characters. This was very fiddly to get right.

Finished up Monday's work to support submodules, getting them working on filesystems that don't support symlinks.

Posted Wed Mar 4 20:16:10 2015

This month is going to be a bit more random than usual where git-annex development is concerned.

  • On Saturday, the Seven Day Roguelike competition begins, and I will be spending a week building a game in haskell, to the exclusion of almost all other work.
  • On March 18th, I'll be at the Boston Haskell User's group. (Attending, not presenting.)
  • March 19-20, I'll be at Dartmouth visiting with the DataLad developers and learning more about what it needs from git-annex.
  • March 21-22, I'll be at the FSF's LibrePlanet conference at MIT.

Got started on the randomness today with this design proposal for using git-annex to back up the entire Internet Archive. This is something the Archive Team is considering taking on, and I had several hours driving and hiking to think about it and came up with a workable design. (Assuming a large enough crowd of volunteers.)

Don't know if it will happen, but it was a useful thought problem to see how git-annex works, and doesn't work in this unusual use case.

One interesting thing to come out of that is that git-annex fsck does not currently make any record of successful fscks. In a very large distributed system, it can be useful to have successful fscks of an object's content recorded, by updating the timestamp in the location log to say "this repository still had the content at this time".

Posted Tue Mar 3 23:00:01 2015

I had thought that git-annex and git submodules couldn't mix. However, looking at it again, it turned out to be possible to use git-annex quite sanely in a submodule, with just a little tweaking of how git normally configures the repository. Details of this still experimental feature are in submodules.

There is still some work to be done to make git-annex work with submodules in repositories on filesystems that don't support symlinks.

Posted Mon Mar 2 20:45:19 2015

I'm snowed in, but keeping busy..

Developed a complete workaround for the sqlite SELECT ErrorBusy bug. So after a week, I finally have sqlite working robustly. And, I merged in the branch that uses sqlite for incremental fsck.

Benchmarking an incremental fsck --fast run, checking 40 thousand files, it used to take 4m30s using sticky bits, and using sqlite slowed it down by 10s. So one added second per 4 thousand or so files. I think that's ok. Incremental fsck is intended to be used in big repos, which are probably not checked in --fast mode, so the checksumming of files will by far swamp that overhead.
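The benchmark arithmetic, spelled out in Python using only the numbers quoted above:

```python
files = 40_000
baseline = 4 * 60 + 30   # 4m30s with sticky bits, in seconds
added = 10               # extra seconds with sqlite

print(files / added)                     # 4000.0 files per added second
print(added * 1000 / files)              # 0.25 ms of overhead per file
print(round(added / baseline * 100, 1))  # 3.7 percent slower overall
```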

Also got sqlite and persistent installed on all the autobuilders. This was easier than expected, because persistent bundles its own copy of sqlite.

That would have been a good stopping place for the day's work.. But then I got to spend 5 more hours getting the EvilSplicer to support Persistent. Urgh. :-/

Now I can look forward to using sqlite for something more interesting than incremental fsck, like metadata caching for views, or the direct mode mappings. But, given all the trouble I had with sqlite, I'm going to put that off for a little while, to make sure that I've really gotten sqlite to work robustly.

Posted Sun Feb 22 23:54:47 2015

Today's release doesn't have the database branch merged of course, but it still has a significant amount of changes.

Developed a test case for the sqlite problem, that reliably reproduces it, and sent it to the sqlite mailing list. It seems that under heavy write load, when a new connection is made to the database, SELECT can fail for a little while. Once one SELECT succeeds, that database connection becomes solid, and won't fail any more (apparently). This makes me think there might be some connection initialization steps that don't end up finishing before the SELECT goes through in this situation. I should be able to work around this problem by probing new connections for stability, and probably will have to, since it'll be years before any bug fixed sqlite is available everywhere.
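Probing a new connection for stability could look roughly like this, using Python's sqlite3 module rather than Persistent. The choice of probe query, retry count and delay are all assumptions; the idea is just to retry a cheap SELECT until one succeeds.

```python
import sqlite3
import time

def probe_connection(conn, attempts=50, delay=0.1):
    """Retry a cheap SELECT until it succeeds once, or give up."""
    for _ in range(attempts):
        try:
            conn.execute("SELECT count(*) FROM sqlite_master").fetchone()
            return True
        except sqlite3.OperationalError:
            time.sleep(delay)  # e.g. "database is locked"; wait and retry
    return False
```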

I also noticed that current git-annex incremental parallel fsck doesn't really parallelize well; eg the processes do duplicate work. So, the database branch is not really a regression in this area.

Posted Thu Feb 19 22:45:55 2015

Breaking news: gitlab.com repositories now support git-annex!

A very nice surprise! More git hosters should do this..


Back to sqlite concurrency, I thought I had it dealt with, but more testing today has turned up a lot more problems with sqlite and concurrent writers (and readers).

First, I noticed that a process can be happily writing changes to the database, but if a second process starts reading from the database, this will make the writer start failing with BUSY, and keep failing until the second process goes idle. It turns out the solution to this is to use WAL mode, which prevents readers from blocking writers.

After several hours (persistent doesn't make it easy to enable WAL mode), it seemed pretty robust with concurrent fsck.
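For comparison, enabling WAL mode through Python's sqlite3 module is a single pragma. This is only an illustration of the sqlite setting itself, not of how it is done through persistent; the temporary file path is made up.

```python
import os
import sqlite3
import tempfile

def open_wal(path):
    conn = sqlite3.connect(path)
    # The pragma reports the journal mode now in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode

path = os.path.join(tempfile.mkdtemp(), "fsck.db")
conn, mode = open_wal(path)
print(mode)  # wal
```

WAL mode is recorded in the database file, so it stays in effect for later connections.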

But then I saw SELECT fail with BUSY. I don't understand why a reader would fail in WAL mode; that's counter to the documentation. My best guess is that this happens when a checkpoint is being made.

This seems to be a real bug in sqlite. It may only affect the older versions bundled with persistent.

Posted Wed Feb 18 21:57:07 2015

Worked today on making incremental fsck's use of sqlite be safe with multiple concurrent fsck processes.

The first problem was that having fsck --incremental running and starting a new fsck --incremental caused it to crash. And with good reason, since starting a new incremental fsck deletes the old database, the old process was left writing to a database that had been deleted and recreated out from underneath it. Fixed with some locking.

Next problem is harder. Sqlite doesn't support multiple concurrent writers at all. One of them will fail to write. It's not even possible to have two processes building up separate transactions at the same time. Before using sqlite, incremental fsck could work perfectly well with multiple fsck processes running concurrently. I'd like to keep that working.

My partial solution, so far, is to make git-annex buffer writes, and every so often send them all to sqlite at once, in a transaction. So most of the time, nothing is writing to the database. (And if it gets unlucky and a write fails due to a collision with another writer, it can just wait and retry the write later.) This lets multiple processes write to the database successfully.
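A sketch of the buffering approach, using Python's sqlite3 module rather than Persistent. The class name, table name, batch size and retry settings are assumptions for illustration.

```python
import sqlite3
import time

class BufferedWriter:
    def __init__(self, conn, flush_at=1000):
        self.conn = conn
        self.flush_at = flush_at
        self.pending = []

    def record(self, key):
        self.pending.append((key,))
        if len(self.pending) >= self.flush_at:
            self.flush()

    def flush(self, retries=10, delay=0.5):
        if not self.pending:
            return
        for _ in range(retries):
            try:
                with self.conn:  # one transaction for the whole batch
                    self.conn.executemany(
                        "INSERT OR IGNORE INTO fscked (key) VALUES (?)",
                        self.pending)
                self.pending = []
                return
            except sqlite3.OperationalError:
                time.sleep(delay)  # another writer holds the lock; retry later
        # Still pending after all retries; a later flush will try again.
```

Between flushes nothing touches the database, which is what leaves room for other writers.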

But, for the purposes of concurrent, incremental fsck, it's not ideal. Each process doesn't immediately learn of files that another process has checked. So they'll tend to do redundant work. Only way I can see to improve this is to use some other mechanism for short-term IPC between the fsck processes.


Also, I made git annex fsck --from remote --incremental use a different database per remote. This is a real improvement over the sticky bits; multiple incremental fscks can be in progress at once, checking different remotes.

Posted Tue Feb 17 21:13:13 2015

Yesterday I did a little more investigation of key/value stores. I'd love a pure haskell key/value store that didn't buffer everything in memory, and that allowed concurrent readers, and was ACID, and production quality. But so far, I have not found anything that meets all those criteria. It seems that sqlite is the best choice for now.

Started working on the database branch today. The plan is to use sqlite for incremental fsck first, and if that works well, do the rest of what's planned in caching database.

At least for now, I'm going to use a dedicated database file for each different thing. (This may not be as space-efficient due to lacking normalization, but it keeps things simple.)

So, .git/annex/fsck.db will be used by incremental fsck, and it has a super simple Persistent database schema:

Fscked
  key SKey
  UniqueKey key

It was pretty easy to implement this and make incremental fsck use it. The hard part is making it both fast and robust.

At first, I was doing everything inside a single runSqlite action. Including creating the table. But, it turns out that runs as a single transaction, and if it was interrupted, this left the database in a state where it exists, but has no tables. Hard to recover from.

So, I separated out creating the database, made that be done in a separate transaction and fully atomically. Now fsck --incremental could be ctrl-c'd and resumed with fsck --more, but it would lose the transaction and so not remember anything had been checked.

To fix that, I tried making a separate transaction per file fscked. That worked, and it resumes nicely where it left off, but all those transactions made it much slower.

To fix the speed, I made it commit just one transaction per minute. This seems like an ok balance. Having fsck re-do one minute's work when restarting an interrupted incremental fsck is perfectly reasonable, and now the speed, using the sqlite database, is nearly as fast as the old sticky bit hack was. (Specifically, 6m7s old vs 6m27s new, fscking 37000 files from cold cache in --fast mode.)
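The progression above ends at something like this sketch, written with Python's sqlite3 module rather than Persistent. The class, the table layout and the injectable `clock` are assumptions for illustration: table creation is committed on its own, and rows are then committed at most once per interval.

```python
import sqlite3
import time

class FsckDb:
    def __init__(self, conn, interval=60, clock=time.monotonic):
        self.conn = conn
        self.interval = interval
        self.clock = clock
        self.last_commit = clock()
        conn.execute("CREATE TABLE IF NOT EXISTS fscked (key TEXT UNIQUE)")
        conn.commit()  # table creation is committed separately

    def mark_fscked(self, key):
        # Joins the transaction that is already open, if any.
        self.conn.execute(
            "INSERT OR IGNORE INTO fscked (key) VALUES (?)", (key,))
        if self.clock() - self.last_commit >= self.interval:
            self.commit()

    def commit(self):
        self.conn.commit()
        self.last_commit = self.clock()
```

An interrupted run loses at most one interval's worth of rows, which matches the one minute of redone work described above.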

There is still a problem with multiple concurrent fsck --more failing. Probably a concurrent writer problem? And, some porting will be required to get sqlite and persistent working on Windows and Android. So the branch isn't ready to merge yet, but it seems promising.

In retrospect, while incremental fsck has the simplest database schema, it might be one of the harder things listed in caching database, just because it involves so many writes to the database. The other use cases are more read heavy.

Posted Mon Feb 16 21:16:38 2015

Spent a couple hours to make the ssh-options git config setting be used in more places. Now it's used everywhere that git-annex supports ssh caching, including the git pull and git push done by sync and by the assistant. Also the remotedaemon and the gcrypt, rsync, and ddar special remotes.

Posted Thu Feb 12 20:23:07 2015

Many more little improvements made yesterday and part of today. While it's only been a week since the last release, it feels almost time to make another one, after so many recent bug fixes and small improvements.

I've updated the roadmap. I have been operating without a roadmap for half a year, and it would be nice to have some plans. Keeping up with bug reports and requests as they come in is a fine mode of work, but it can feel a little aimless. It's good to have a planned out course, or at least some longer term goals.

After the next release, I've penciled in the second half of this month to work on the caching database.

Posted Wed Feb 11 20:56:55 2015

Plowing through the backlog today, and fixing quite a few bugs! Got the backlog down to 87 messages from ~140. And some of the things I got to were old and/or hard.

About a third of the day was spent revisiting git-annex branch shows commit with looong commitlog. I still don't understand how that behavior can happen, but I have a donated repository where it did happen. Made several changes to try to make the problem less likely to occur, and not as annoying when it does occur, and maybe get me more info if it does happen to someone again.

Posted Mon Feb 9 22:47:51 2015

Made a release yesterday, and caught up on most recent messages earlier this week. Backlog stands at 128 messages.

Had to deal with an ugly problem with /usr/bin/glacier today. Seems that there are multiple programs all using that name, some of them shipping in some linux distributions, and the one from boto fails to fail when passed parameters it doesn't understand. Yugh! I had to make git-annex probe to make sure the right glacier program is installed.

I'm planning to deprecate the glacier special remote at some point. Instead, I'd like to make the S3 special remote support the S3-glacier lifecycle, so objects can be uploaded to S3, set to transition to glacier, and then if necessary pulled back from glacier to S3. That should be much simpler and less prone to break.

But not yet; haskell-aws needs glacier support added. Or I could use the new amazonka library, but I'd rather stick with haskell-aws.

Some other minor improvements today included adding git annex groupwanted, which makes for easier examples than using vicfg, and making git annex import support options like --include and --exclude.

Also I moved many file matching options to only be accepted by the commands that actually use them. Of the remaining common options, most of them make sense for every command to accept (eg, --force and --debug). It would make sense to move --backend, --notify-start/finish, and perhaps --user-agent. Eventually.

Posted Fri Feb 6 21:31:17 2015

Today I put together a lot of things I've been thinking about:

  • There's some evidence that git-annex needs tuning to handle some unusual repositories. In particular very big repositories might benefit from different object hashing.
  • It's really hard to handle upgrades that change the fundamentals of how git-annex repositories work. Such an upgrade would need every git-annex user to upgrade their repository, and would be very painful. It's hard to imagine a change that is worth that amount of pain.
  • There are other changes some would like to see (like lower-case object hash directory names) that are certainly not enough to warrant a flag day repo format upgrade.
  • It would be nice to let people who want to have some flexibility to play around with changes, in their own repos, as long as they don't a) make git-annex a lot more complicated, or b) negatively impact others. (Without having to fork git-annex.)

This is discussed in more depth in new repo versions.

The solution, which I've built today, is support for tuning settings, when a new repository is first created. The resulting repository will be different in some significant way from a default git-annex repository, but git-annex will support it just fine.

The main limitations are:

  • You can't change the tuning of an existing repository (unless a tool gets written to transition it).
  • You absolutely don't want to merge repo B, which has been tuned in nonstandard ways, into repo A which has not. Or A into B. (Unless you like watching slow motion car crashes.)

I built all the infrastructure for this today. Basically, the git-annex branch gets a record of all tunings that have been applied, and they're automatically propagated to new clones of a repository.

And I implemented the first tunable setting:

git -c annex.tune.objecthashlower=true annex init
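
For a rough idea of what lower-case object hash directories look like, here is a small Python sketch. It assumes the layout is two three-character directory levels taken from the md5 of the key; treat that as an assumption for illustration rather than a specification, and note that the function name is made up.

```python
import hashlib


def hash_dir_lower(key):
    """Two lower-case directory levels taken from the md5 of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return digest[:3] + "/" + digest[3:6] + "/"
```

Since hexdigest only produces the characters 0-9 and a-f, the directory names can never differ only by case, which is the point on case-insensitive filesystems.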

This is definitely an experimental feature for now. git-annex merge and similar commands will detect attempts to merge between incompatibly tuned repositories, and error out. But, there are a lot of ways to shoot yourself in the foot if you use this feature:

  • Nothing stops git merge from merging two incompatible repositories.
  • Nothing stops any version of git-annex older than today's from merging either.
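
The kind of check that git-annex merge makes can be sketched like this. It is a toy Python model, not the real implementation: representing the recorded tunings as a plain dict, and the names `check_mergeable` and `IncompatibleTuning`, are assumptions made for the example.

```python
class IncompatibleTuning(Exception):
    pass


def check_mergeable(local_tunings, remote_tunings):
    """Refuse to merge when two repositories record different tunings.

    Each argument is a dict such as {"annex.tune.objecthashlower": "true"};
    an untuned repository is the empty dict.
    """
    if local_tunings != remote_tunings:
        differing = sorted(
            name
            for name in set(local_tunings) | set(remote_tunings)
            if local_tunings.get(name) != remote_tunings.get(name)
        )
        raise IncompatibleTuning(
            "refusing to merge, tunings differ: " + ", ".join(differing)
        )
```

A check like this only helps when the merge goes through code that knows to run it, which is why plain git merge and older git-annex versions remain ways to shoot yourself in the foot.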

Now that the groundwork is laid, I can pretty easily, and inexpensively, add more tunable settings. The next two I plan to add are already documented, annex.tune.objecthashdirectories and annex.tune.branchhashdirectories. Most new tunables should take about 4 lines of code to add to git-annex.

Posted Tue Jan 27 21:39:18 2015

Today I got the pre-commit-annex hook working on Windows. It turns out that msysgit runs hook scripts even when they're not executable, and it parses the #! line itself. Now git-annex does too, on Windows.
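
Parsing the #! line by hand is not much code. Here is a minimal Python sketch of the idea; the function name is made up, and splitting the rest of the line on whitespace is a simplification (some systems pass everything after the interpreter as a single argument).

```python
def parse_shebang(script_path):
    """Return the interpreter command from a script's #! line, or None."""
    with open(script_path, "rb") as f:
        first = f.readline(4096)
    if not first.startswith(b"#!"):
        return None
    # Everything after "#!" is the interpreter plus optional arguments.
    parts = first[2:].decode("utf-8", errors="replace").split()
    return parts or None
```

A caller would then run the returned command with the script path appended, instead of relying on the operating system to treat the script as executable.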

Also, added a new chapter to the walkthrough, using special remotes. They clearly needed to be mentioned, especially to show the workflow of running initremote in one repository, then syncing another repository and running enableremote to enable the same special remote there.

Then more fun Windows porting! Turns out git-annex on Windows didn't handle files > 2 gb correctly; the way it was getting file size uses a too small data type on Windows. Luckily git-annex itself treats all file sizes as unbounded Integers, so I was easily able to swap in a getFileSize that returns correct values for large files.
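
To see why a too-small data type goes wrong just past 2 gb, here is a small Python illustration. It assumes the size was squeezed into a signed 32-bit integer; the post only says the type was too small, so that width is my assumption.

```python
import struct


def as_signed_32(n):
    """Reinterpret the low 32 bits of n as a signed 32-bit integer."""
    (value,) = struct.unpack("<i", struct.pack("<I", n & 0xFFFFFFFF))
    return value


three_gb = 3 * 1024**3
print(three_gb)                # 3221225472
print(as_signed_32(three_gb))  # -1073741824
```

An unbounded integer type has no such wraparound, which is why swapping in a correct size function was enough.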

While I haven't blogged since the 13th and have not been too active until today, there are still a number of little improvements that have been done here and there.

Including a fix for an interesting bug where the assistant would tell the remotedaemon that the network connection has been lost, twice in a row, and this would make the remotedaemon fail to reconnect to the remote when the network came up. I'm not sure what situation triggers this bug (Maybe machines with 2 interfaces? Or maybe a double disconnection event for 1 interface?), but I was able to reproduce it by sending messages to the remotedaemon, and so fixed it.
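
The general shape of a fix for this class of bug is to make the "network lost" handling idempotent. The Python toy model below is only an illustration of that idea, not the actual remotedaemon code; the message names LOSTNET and RESUME and the class itself are assumptions made for the example.

```python
class RemoteConnection:
    """Toy model of a daemon reacting to network up/down messages."""

    def __init__(self):
        self.connected = False
        self.connect_count = 0

    def handle(self, message):
        if message == "LOSTNET":
            # Idempotent: a second LOSTNET in a row changes nothing.
            self.connected = False
        elif message == "RESUME":
            if not self.connected:
                self.connected = True
                self.connect_count += 1
```

Feeding such a model the sequence RESUME, LOSTNET, LOSTNET, RESUME by hand is the same style of reproduction as sending messages to the daemon directly.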

Backlog is down to 118 messages.

Posted Tue Jan 20 21:36:12 2015

Got a release out today.

I'm feeling a little under the weather, so wanted something easy to do in the rest of the day that would be nice and constructive. Ended up going over the todo list. Old todos come in three groups: hard problems, already solved, and easy changes that never got done. I left the first group alone, closed many todos in the second group, and implemented a few easy changes. Including git annex sync -m and adding some more info to git annex info remote.

Posted Tue Jan 13 22:36:29 2015

Worked more on the relativepaths branch last night, and I am actually fairly happy with it now, and plan to merge it after I've run it for a bit longer myself.

It seems that I did manage to get a git-annex executable that is built PIE so it will work on Android 5.0. But all the C programs like busybox included in the Android app also have to be built that way. Arranging for everything to get built twice and with the right options took up most of today.

Posted Wed Jan 7 21:27:44 2015

git-annex internally uses all absolute paths all the time. For a couple of reasons, I'd like it to use relative paths. The best reason is, it would let a repository be moved while git-annex was running, without breaking. A lesser reason is that Windows has some crazy small limit on the length of a path (260 bytes?!), and using relative paths would avoid hitting it so often.

I tried to do this today, in a relativepaths branch. I eventually got the test suite to pass, but I am very unsure about this change. A lot of random assumptions broke, and the test suite won't catch them all. In a few places, git-annex commands do change the current directory, and that will break with relative paths.

A frustrating day.

Posted Tue Jan 6 22:00:56 2015

I've finally been clued into why git-annex isn't working on Android 5, and it seems fixing it is as easy as pie.. That is, passing -pie -fPIE to the linker. I've added a 5.0 build to the Android autobuilder. It is currently untested, so I hope to get feedback from someone with an Android 5 device; a test build is now available.

I've been working through the backlog of messages today, and gotten down from 170 to 128. Mostly answered a lot of interesting questions, such as "Where to start reading the source code?"

Also did some work to make git-annex check git versions at runtime more often, instead of assuming the git version it was built against. It turns out this could be done pretty inexpensively in 2 of 4 cases, and one of the 2 fixed was the git check-attr behavior change, which could lead to git-annex add hanging if used with an old version of git.
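
Checking the version at runtime amounts to asking the installed git and comparing. A minimal Python sketch of that, with made-up function names (git-annex itself does this in Haskell):

```python
import re
import subprocess


def parse_git_version(output):
    """Turn 'git version 2.1.4' into (2, 1, 4); ignore suffixes like '.windows.1'."""
    match = re.search(r"git version (\d+(?:\.\d+)*)", output)
    if match is None:
        raise ValueError("unrecognised git --version output: %r" % output)
    return tuple(int(part) for part in match.group(1).split("."))


def installed_git_version():
    """Ask the git found on PATH, rather than trusting a build-time constant."""
    out = subprocess.run(
        ["git", "--version"], stdout=subprocess.PIPE, check=True
    ).stdout.decode("utf-8", errors="replace")
    return parse_git_version(out)
```

Tuples compare element by element, so a test like `installed_git_version() >= (1, 8, 0)` works as expected; that threshold is just a placeholder, not the version where check-attr changed.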

Posted Mon Jan 5 21:13:03 2015

Took a holiday week off from git-annex development, and started a new side project building shell-monad, which might eventually be used in some parts of git-annex that generate shell scripts.

Message backlog is 165 and I have not dove back into it, but I have started spinning back up the development engines in preparation for new year takeoff.

Yesterday, added some minor new features -- git annex sync now supports git remote groups, and I added a new plumbing command setpresentkey for those times when you really need to mess with git-annex's internal bookkeeping. Also cleaned up a lot of build warning messages on OSX and Windows.

Today, first some improvements to make addurl more robust. Then the rest of the day was spent on Windows. Fixed (again) the Windows port's problem with rsync hating DOS style filenames. Got the rsync special remote fully working on Windows for the first time.

Best of all, got the Windows autobuilder to run the test suite successfully, and fixed a couple test suite failures on Windows.

Posted Tue Dec 30 21:52:44 2014

Spent a couple days adding a bittorrent special remote to git-annex. This is better than the demo external torrent remote I made on Friday: It's built into git-annex; it supports magnet links; it even parses aria2c's output so the webapp can display progress bars.

Besides needing aria2 to download torrents, it also currently depends on the btshowmetainfo command from the original bittorrent client (or bittornado). I looked into using http://hackage.haskell.org/package/torrent instead, but that package is out of date and doesn't currently build. I've got a patch fixing that, but am waiting to hear back from the library's author.

There is a bit of a behavior change here; while before git annex addurl of a torrent file would add the torrent file itself to the repository, it now will download and add the contents of the torrent. I think/hope this behavior change is ok..

Posted Wed Dec 17 19:45:48 2014

Some more work on the interface that lets remotes claim urls for git annex addurl. Added support for remotes suggesting a filename to use when adding an url. Also, added support for urls that result in multiple files when downloaded. The obvious use case for that is an url to a torrent that contains multiple files.

Then, got git annex importfeed to also check if a remote claims an url.

Finally, I put together a quick demo external remote using this new interface. git-annex-remote-torrent adds support for torrent files to git-annex, using aria2c to download them. It supports multi-file torrents, but not magnet links. (I'll probably rewrite this more robustly and efficiently in haskell sometime soon.)

Here's a demo:

# git annex initremote torrent type=external encryption=none externaltype=torrent
initremote torrent ok
(Recording state in git...)
# ls
# git annex addurl  --fast file:///home/joey/my.torrent
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   198  100   198    0     0  3946k      0 --:--:-- --:--:-- --:--:-- 3946k
addurl _home_joey_my.torrent/bar (using torrent) ok
addurl _home_joey_my.torrent/baz (using torrent) ok
addurl _home_joey_my.torrent/foo (using torrent) ok
(Recording state in git...)
# ls _home_joey_my.torrent/
bar@  baz@  foo@
# git annex get _home_joey_my.torrent/baz
get _home_joey_my.torrent/baz (from torrent...) 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:-100   198  100   198    0     0  3580k      0 --:--:-- --:--:-- --:--:-- 3580k

12/11 18:14:56 [NOTICE] IPv4 DHT: listening on UDP port 6946

12/11 18:14:56 [NOTICE] IPv4 BitTorrent: listening on TCP port 6961

12/11 18:14:56 [NOTICE] IPv6 BitTorrent: listening on TCP port 6961

12/11 18:14:56 [NOTICE] Seeding is over.
12/11 18:14:57 [NOTICE] Download complete: /home/joey/tmp/tmp.Le89hJSXyh/tor

12/11 18:14:57 [NOTICE] Your share ratio was 0.0, uploaded/downloaded=0B/0B
                                                                               
Download Results:
gid   |stat|avg speed  |path/URI
======+====+===========+=======================================================
71f6b6|OK  |       0B/s|/home/joey/tmp/tmp.Le89hJSXyh/tor/baz

Status Legend:
(OK):download completed.
ok                      
(Recording state in git...)
# git annex find
_home_joey_my.torrent/baz
# git annex whereis _home_joey_my.torrent/baz
whereis _home_joey_my.torrent/baz (2 copies) 
    1878241d-ee49-446d-8cce-041c46442d94 -- [torrent]
    52412020-2bb3-4aa4-ae16-0da22ba48875 -- joey@darkstar:~/tmp/repo [here]

  torrent: file:///home/joey/my.torrent#2
ok
Posted Thu Dec 11 22:20:57 2014

Worked on extensible addurl today. When git annex addurl is run, remotes will be asked if they claim the url, and whichever remote does will be used to download it, and location tracking will indicate that remote contains the object. This is a massive 1000 line patch touching 30 files, including follow-on changes in rmurl and whereis and even rekey.

It should now be possible to build an external special remote that handles *.torrent and magnet: urls and passes them off to a bittorrent client for download, for example.
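
The claiming logic described above can be modelled in a few lines. This Python sketch is only a model of the idea: the `Remote` class, `claims_url`, and `remote_for_url` are hypothetical names, and falling back to a catch-all web remote when nobody claims the url is my assumption about the default.

```python
class Remote:
    def __init__(self, name, claims):
        self.name = name
        self._claims = claims  # predicate: url -> bool

    def claims_url(self, url):
        return self._claims(url)


def remote_for_url(url, remotes, fallback):
    """Return the first remote that claims the url, else the fallback."""
    for remote in remotes:
        if remote.claims_url(url):
            return remote
    return fallback


torrent = Remote(
    "torrent",
    lambda url: url.startswith("magnet:") or url.endswith(".torrent"),
)
web = Remote("web", lambda url: True)
```

With these definitions, `remote_for_url("magnet:?xt=example", [torrent], web)` picks the torrent remote, and an ordinary http url falls through to web.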

Another use for this would be to make an external special remote that uses youtube-dl or some other program than quvi for downloading web videos. The builtin quvi support could probably be moved out of the web special remote, to a separate remote. I haven't tried to do that yet.

Posted Mon Dec 8 23:17:35 2014

Today's release has a month's accumulated changes, including several nice new features: git annex undo, git annex proxy, git annex diffdriver, and I was able to land the s3-aws branch in this release too, so lots of improvements to the S3 support.

Spent several hours getting the autobuilders updated, with the haskell aws library installed. Android and armel builds are still out of date.

Also fixed two Windows bugs related to the location of the bundled ssh program.

Posted Wed Dec 3 23:02:56 2014

Back from the holiday, catching up on traffic. Backlog stands at 113 messages.

Here's a nice tip that Giovanni added: publishing your files to the public (using a public S3 bucket)

Just before going on break, I added a new feature that I didn't mention here. git annex diffdriver integrates git-annex with git's external diff driver support. So if you have a smart diff program that can diff, say, genome sequences, or cat videos, or something in some useful way, it can be hooked up to git diff and will be able to see the content of annexed files.

Also today, I spent a couple hours updating the license file included in the standalone git-annex builds to include the licenses of all the haskell libraries git-annex depends on. Which I had for some reason not thought to include before, despite them getting built into the git-annex binary.

Posted Mon Dec 1 23:35:22 2014

Built the git annex undo command. This is intended to be a simple interface for users who have changed one file, and want to undo the change without the complexities of git revert or git annex proxy. It's simple enough that I added undo as an action in the file manager integration.

And yes, you can undo an undo. :)

Posted Fri Nov 14 22:19:20 2014

Ever since the direct mode guard was added a year ago, direct mode has been a lot safer to use, but very limited in the git commands that could be run in a direct mode repository.

The worst limitation was that there was no way to git revert unwanted changes. But also, there was no way to check out different branches, or run commands like git mv.

Today I made git annex proxy, which allows doing all of those things, and more. documentation here

It's so flexible that I'm not sure where the boundaries lie yet, but it seems it will work for any git command that updates both the work tree and the index. Some git commands only update one or the other and not both and won't work with the proxy. As an advanced user tool, I think this is a great solution. I still want to make a simpler undo command that can nicely integrate into file managers.

The implementation of git annex proxy is quite simple, because it reuses all the complicated work tree update code that was already written for git annex merge.


And here's the lede I buried: I've gotten two years of funding to work on git-annex part-time! Details in my personal blog.

Posted Wed Nov 12 21:06:59 2014

The OSX autobuilder has been updated to OSX 10.10 Yosemite. The resulting build might also work on 10.9 Mavericks, and I'd appreciate help testing that.

Went ahead and fixed the partial commit problem by making the pre-commit hook detect and block problematic partial commits.

Posted Tue Nov 11 21:02:53 2014

S3 multipart is finally completely working. I still don't understand the memory issue that stumped me yesterday, but rewrote the code to use a simpler approach, which avoids the problem. Various other issues, and testing it with large files, took all day.

This is now merged into the s3-aws branch, so when that branch lands, S3 support will massively improve, from the current situation of using a buggy library that buffers uploaded files in memory, and cannot support very large file uploads at all, to being able to support hopefully files of arbitrary hugeness (at least up to a few terabytes).

BTW, thanks to Aristid Breitkreuz and Junji Hashimoto for working on the multipart support in the aws library.

Posted Tue Nov 4 22:04:40 2014

More work on S3 multipart uploads, since the aws library got fixed today to return the ETAGs for the parts. I got multipart uploads fully working, including progress display.

The code takes care to stream each part in from the file and out the socket, so I'd hoped it would have good memory behavior. However, for reasons I have not tracked down, something in the aws library is causing each part to be buffered in memory. This is a problem, since I want to use 1 gb as the default part size.
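
For readers who haven't used multipart uploads: the file is cut into numbered parts, each part is uploaded separately, and the ETag returned for each part is needed to complete the upload. The part arithmetic can be sketched in Python as below; the function name is made up, and this shows only the bookkeeping, not the aws library's API.

```python
def part_ranges(file_size, part_size):
    """Yield (part_number, offset, length) for each part; numbers start at 1."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    number, offset = 1, 0
    while offset < file_size:
        length = min(part_size, file_size - offset)
        yield number, offset, length
        number += 1
        offset += length
```

Streaming means reading each of those ranges in small blocks and writing them straight to the socket; buffering a whole range first is what hurts when the part size is 1 gb.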

Posted Tue Nov 4 02:11:56 2014

Some progress on the S3 upload not using multipart bug. The aws library now includes the multipart API. However, when I dug into it, it looks like the API needs some changes to get the ETAG of each uploaded part. Once that's fixed, git-annex should be able to support S3 multipart uploads, although I think that git-annex's own chunking is better in most situations -- it supports resuming uploads and downloads better. The main use case for S3 multipart seems to be using git-annex to publish large files.

Also, managed to get the backlog down from 100 to just 65 messages, including catching up on quite old parts of backlog.

Posted Tue Oct 28 20:41:42 2014

New AWS region in Germany announced today. git-annex doesn't support it yet, unless you're using the s3-aws branch.

I cleaned up that branch, got it building again, and re-tested it with testremote, and then fixed a problem the test suite found that was caused by some changes in the haskell aws library.

Unfortunately, s3-aws is not ready to be merged because of some cabal dependency problems involving dbus and random. I did go ahead and update Debian's haskell-aws package to cherry-pick from a newer version the change needed for Internet Archive support, which allows building the s3-aws branch on Debian. Getting closer..

Posted Thu Oct 23 21:02:00 2014

Today, I've expanded git annex info to also be able to be used on annexed files and on remotes. Looking at the info for an individual remote is quite useful, especially for answering questions like: Does the remote have embedded creds? Are they encrypted? Does it use chunking? Is that old style chunking?

remote: rsync.net
description: rsync.net demo remote
uuid: 15b42f18-ebf2-11e1-bea1-f71f1515f9f1
cost: 250.0
type: rsync
url: xxx@usw-s002.rsync.net:foo
encryption: encrypted (to gpg keys: 7321FC22AC211D23 C910D9222512E3C7)
chunking: 1 MB chunks

remote: ia3
description: test [ia3]
uuid: 12817311-a189-4de3-b806-5f339d304230
cost: 200.0
type: S3
creds: embedded in git repository (not encrypted)
bucket: joeyh-test-17oct-3
internet archive item: http://archive.org/details/joeyh-test-17oct-3
encryption: not encrypted
chunking: none

Should be quite useful info for debugging too..

Yesterday, I fixed a bug that prevented retrieving files from Glacier.

Posted Tue Oct 21 19:51:54 2014

3 days spent redoing the Android autobuilder! The new version of yesod-routes generates TH splices that break the EvilSplicer. So after updating everything to new versions for the Nth time, I instead went back to older versions. The autobuilder now uses Debian jessie, instead of wheezy. And all haskell packages are pinned to use the same version as in jessie, rather than the newest versions. Since jessie is quite near to being frozen, this should make the autobuilder much less prone to getting broken by new versions of haskell packages that need patches for Android.

I happened to stumble over http://hackage.haskell.org/package/setenv while doing that. This supports setting and unsetting environment variables on Windows, which I had not known a way to do from Haskell. Cleaned up several ugly corners of the Windows port using it.

Posted Thu Oct 16 19:37:28 2014

git commit $some_unlocked_file seems like a reasonably common thing for someone to do, so it's surprising to find that it's a little bit broken, leaving the file staged in the index after (correctly) committing the annexed symlink.

This is caused by a bug in git and/or by git-annex abusing the git post-commit hook to do something it shouldn't do, although it's not unique in using the post-commit hook this way. I'm talking this over with Junio, and the fix will depend on the result of that conversation. It might involve git-annex detecting this case and canceling the commit, asking the user to git annex add the file first. Or it might involve a new git hook, although I have not had good luck getting hooks added to git before.


Meanwhile, today I did some other bug fixing. Fixed the Internet Archive support for embedcreds=yes. Made git annex map work for remote repos in a directory with an implicit ".git" prefix. And fixed a strange problem where the repository repair code caused a git gc to run and then tripped over its pid file.

I seem to have enough fixes to make another release pretty soon. Especially since the current release of git-annex doesn't build with yesod 1.4.

Backlog: 94 messages

Posted Sun Oct 12 20:13:27 2014

Made two releases of git-annex, yesterday and today, which turned out to contain only Debian changes. So no need for other users to upgrade.

This included fixing building on mips and arm architectures. The mips build was running out of memory, and I was able to work around that. Then the arm builds broke today, because of a recent change to the version of llvm that has completely trashed ghc. Luckily, I was able to work around that too.

Hopefully that will get last week's security fix into Debian testing, and otherwise have git-annex in Debian in good shape for the upcoming freeze.

Posted Sat Sep 27 20:29:54 2014

Working through the forum posts and bugs. Backlog is down to 95.

Discovered the first known security hole in git-annex! Turns out that S3 and Glacier remotes that were configured with embedcreds=yes and encryption=pubkey or encryption=hybrid didn't actually encrypt the AWS credentials that get embedded into the git repo. This doesn't affect any repos set up by the assistant.

I've fixed the problem and am going to make a release soon. If your repo is affected, see insecure embedded creds for what to do about it.

Posted Thu Sep 18 22:24:43 2014

Made a release yesterday, which was all bugfixes.

Today, a few more bug fixes. Looked into making the webapp create non-bare repositories on removable drives, but before I got too far into the code, I noticed there's a big problem with that idea.

Rest of day was spent getting caught up on forum posts etc. I'm happy to read lots of good answers that have been posted while I've been away. Here's an excellent example: http://git-annex.branchable.com/install/fromsource/#comment-5f8ceb060643ae71cd2adc72f0fca3f0

That led to rewriting the docs for building git-annex from source. New page: fromsource.

Backlog is now down to 117.

Posted Tue Sep 16 20:18:33 2014

Yesterday and today were the first good solid days working on git-annex in a while. There's a big backlog, currently of 133 messages, so I have been concentrating on bug reports first. Happily, not many new bugs have been reported lately, and I've made good progress on them, fixing 5 bugs today, including a file descriptor leak.

catching up

In this end of summer rush, I've been too busy to blog for the past 20 days, but not entirely too busy to work on git-annex. Two releases have been made in that time, and a fair amount of improvements worked on.

Including a new feature: When a local git repository is cloned with git clone --shared, git-annex detects this and defaults to a special mode where file contents get hard linked into the clone. It also makes the cloned repository be untrusted, to avoid confusing numcopies counting with the hard links. This can be useful for temporary working repositories without the overhead of lots of copies of files.

looking back

I want to look back further, over the crowdfunded year of work covered by this devblog. There were a lot of things I wanted to accomplish this past year, and I managed to get to most of them. As well as a few surprises.

  • Windows support improved more than I guessed in my wildest dreams.
    git-annex went from working not too well on the command line to being pretty solid there, as well as having a working and almost polished webapp on Windows.
    There are still warts -- it's Windows after all!

  • Android didn't get many improvements. Most of the time I had budgeted to Android porting ended up being used on Windows porting instead. I did, however, get the Android build environment cleaned up a lot from the initial hacked together one, and generally kept it building and working on Android.

  • The direct mode guard was not planned, but the need for it became clear, and it's dramatically reduced the amount of command-line foot-shooting that goes on in direct mode.

  • Repository repair was planned, and I'm very proud of git-repair. Also pleased with the webapp's UI for scheduling repository consistency checks.
    Always room for improvement in this kind of thing, but this brings a new capability to both git and git-annex.

  • The external special remote interface came together beautifully. External special remotes are now just as well supported as built-in ones, except the webapp cannot be used to configure them.

  • Using git-remote-gcrypt for fully encrypted git repositories, including support in the webapp for setting them up (and gpg keys if necessary), happened. Still needs testing/more use/improvements. Avoided doing much in the area of gpg key management, which is probably good to avoid when possible, but is probably needed to make this a really suitable option for end users.

  • Telehash is still being built, and it's not clear if they've gotten it to work at all yet. The v2 telehash has recently been superseded by a new v3. So I am not pleased that I didn't get git-annex working with telehash, but it was outside my control. This is a problem that needs to get solved outside git-annex first, either by telehash or something else. The plan is to keep an eye on everything in this space, including for example, Maidsafe.

  • In the meantime, the new notifychanges support in git-annex-shell makes XMPP/telehash/whatever unnecessary in a lot of configurations. git-annex's remotedaemon architecture supports that and is designed to support other notification methods later. And the webapp has a lot of improvements in the area of setting up ssh remotes, so fewer users will be stuck with XMPP.

  • I didn't quite get to deltas, but the final month of work on chunking provides a lot of new features and hopefully a foundation that will get to deltas eventually. There is a new haskell library that's being developed with the goal of being used for git-annex deltas.

  • I hadn't planned to make git-annex be able to upgrade itself, when installed from this website. But there was a need for that, and so it happened. Even got a gpg key trust path for the distribution of git-annex.

  • Metadata driven views was an entirely unplanned feature. The current prototype is very exciting, it opens up entire new use cases. I had to hold myself back to not work on it too much, especially as it shaded into adding a caching database to git-annex. Had too much other stuff planned to do all I wanted. Clearly this is an area I want to spend more time on!

Those are most of the big features and changes, but probably half of my work on git-annex this past year was in smaller things, and general maintenance. Lots of others have contributed, some with code (like the large effort to switch to bootstrap3), and others with documentation, bug reports, etc.

Perhaps it's best to turn to git diff --stat to sum up the activity and see just how much both the crowdfunding campaign and the previous kickstarter have pushed git-annex into high gear:

   campaign: 5410 files changed, 124159 insertions(+), 79395 deletions(-)
kickstarter: 4411 files changed, 123262 insertions(+), 13935 deletions(-)
year before: 1281 files changed,   7263 insertions(+), 55831 deletions(-)

What's next? The hope is, no more crowdfunded campaigns where I have to promise the moon anytime soon. Instead, the goal is to move to a more mature and sustainable funding model, and continue to grow the git-annex community, and the spaces where it's useful.

Posted Fri Sep 12 16:27:01 2014

Plan is to be on vacation and/or low activity this week before DebConf. However, today I got involved in fixing a bug that caused the assistant to keep files open after syncing with repositories on removable media.

Part of that bug involved lock files not being opened close-on-exec, and while fixing that I noticed again that the locking code was scattered all around and rather repetitive. That led to a lot of refactoring, which is always fun when it involves scary locking code. Thank goodness for referential transparency.
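
For anyone unfamiliar with close-on-exec: without it, a child process started while the lock file is open inherits the file descriptor and keeps the file open. The POSIX-only Python sketch below shows the flag being set explicitly; the function name is made up, and this is not how Utility.LockFile is written.

```python
import fcntl
import os


def open_lock_cloexec(path):
    """Open a lock file, mark it close-on-exec, and take an exclusive lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        # Non-blocking: raises OSError if another process holds the lock.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BaseException:
        os.close(fd)
        raise
    return fd
```

Closing the returned descriptor with os.close drops the lock.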

Now there's a Utility.LockFile that works on both POSIX and Windows. However, that module actually exports very different functions for the two. While it might be tempting to try to do a portability layer, the two locking models are really very different, and there are lots of gotchas such a portability layer would face. The only API that's completely the same between the two is dropLock.

This refactoring process and the cleaner, more expressive code it led to helped me spot a couple of bugs involving locking. See e386e26ef207db742da6d406183ab851571047ff and 0a4d301051e4933661b7b0a0791afa95bfe9a1d3 Neither bug has ever seemed to cause a problem, but it's nice to be able to spot and fix such bugs before they do.

Posted Wed Aug 20 23:46:24 2014

Over the past couple days, got the arm autobuilder working again. It had been down since June with several problems. cabal install tended to crash; apparently this has something to do with threading in user-mode qemu, because -j1 avoids that. And strange invalid character problems were fixed by downgrading file-embed. Also, with Yury's help I got the Windows autobuilder upgraded to the new Haskell Platform and working again.

Today a last few finishing touches, including getting rid of the last dependency on the old haskell HTTP library, since http-conduit is being used now. Ready for the release!

Posted Fri Aug 15 22:05:01 2014

Working on getting caught up with backlog. 73 messages remain.

Several minor bugs were fixed today. All edge cases. The most edge case one of all, I could not fix: git-annex cannot add a file that has a newline in its filename, because git cat-file --batch's interface does not support such filenames.

Added a page documenting how to verify the signatures of git-annex releases.

Over the past couple days, all the autobuilders have been updated to new dependencies needed by the recent work. Except for Windows, which needs to be updated to the new Haskell Platform first, so hopefully soon.

Turns out that upgrading unix-compat means that inode(like) numbers are available even on Windows, which will make git-annex more robust there. Win win. ;)

Posted Tue Aug 12 20:54:33 2014

Yesterday, finished converting S3 to use the aws library. Very happy with the result (no memory leaks! connection caching!), but s3-aws is not merged into master yet. Waiting on a new release of the aws library so as to not break Internet Archive S3 support.

Today, spent a few hours adding more tests to testremote. The new tests take a remote, and construct a modified version that is intentionally unavailable. Then they make sure trying to use it fails in appropriate ways. This was a very good thing to test; two bugs were immediately found and fixed.
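
The idea behind these tests can be shown with a toy model in Python. Everything here is hypothetical (the in-memory remote, the wrapper, the exception name); the real tests are part of testremote and written in Haskell. The important property is that an unreachable remote must fail loudly, including when asked whether a key is present, rather than quietly answering "no".

```python
class RemoteUnavailable(Exception):
    pass


class DictRemote:
    """Stand-in remote that stores content in memory."""

    def __init__(self):
        self.objects = {}

    def store(self, key, content):
        self.objects[key] = content

    def retrieve(self, key):
        return self.objects[key]

    def check_present(self, key):
        return key in self.objects


class Unavailable:
    """Same interface as the wrapped remote, but every action fails."""

    def __init__(self, remote):
        self.remote = remote

    def _fail(self, *args):
        raise RemoteUnavailable("remote is not reachable")

    store = retrieve = check_present = _fail


def fails_cleanly(action, *args):
    """True if the action raises RemoteUnavailable instead of succeeding."""
    try:
        action(*args)
    except RemoteUnavailable:
        return True
    return False
```

A test then builds `Unavailable(DictRemote())` and asserts that `fails_cleanly` holds for store, retrieve and check_present.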

And that wraps up several weeks of hacking on the core of git-annex's remotes support, which started with reworking chunking and kind of took on a life of its own. I plan a release of this new stuff in a week. The next week will be spent catching up on 117 messages of backlog that accumulated while I was in deep coding mode.

Posted Sun Aug 10 19:21:59 2014

Finished up webdav, and after running testremote for a long time, I'm satisfied it's good. The newchunks branch has now been merged into master completely.

Spent the rest of the day beginning to rework the S3 special remote to use the aws library. This was pretty fiddly; I want to keep all the configuration exactly the same, so had to do a lot of mapping from hS3 configuration to aws configuration. Also there is some hairy stuff involving escaping from the ResourceT monad with responses and http connection managers intact.

Stopped once initremote worked. The rest should be pretty easy, although Internet Archive support is blocked by https://github.com/aristidb/aws/issues/119. This is in the s3-aws branch until it gets usable.

Posted Sat Aug 9 03:28:00 2014

Today was spent reworking so much of the webdav special remote that it was essentially rewritten from scratch.

The main improvement is that it now keeps a http connection open and uses it to perform multiple actions. Before, one connection was made per action. This is even done for operations on chunks. So, now storing a chunked file in webdav makes only 2 http connections total. Before, it would take around 10 connections per chunk. So a big win for performance, although there is still room for improvement: It would be possible to reduce that down to just 1 connection, and indeed keep a persistent connection reused when acting on multiple files.

Finished up by making uploading a large (non-chunked) file to webdav not buffer the whole file in memory.

I still need to make downloading a file from webdav not buffer it, and test, and then I'll be done with webdav and can move on to making similar changes to S3.

Posted Fri Aug 8 00:00:31 2014

Converted the webdav special remote to the new API. All done with converting everything now!

I also updated the new API to support doing things like reusing the same http connection when removing and checking the presence of chunks.

I've been working on improving the haskell DAV library, in a number of ways that will let me improve the webdav special remote. Including making changes that will let me do connection caching, and improving its API to support streaming content without buffering a whole file in memory.

Posted Wed Aug 6 22:43:13 2014

Just finished converting both rsync and gcrypt to the new API, and testing them. Still need to fix 2 test suite failures for gcrypt. Otherwise, only WebDAV remains unconverted.

Earlier today, I investigated switching from hS3 to http://hackage.haskell.org/package/aws. Learned its API, which seemed a lot easier to comprehend than the other two times I looked at it. Wrote some test programs, which are in the s3-aws branch. I was able to stream in large files to S3, without ever buffering them in memory (which hS3's API precludes). And for chunking, it can reuse an http connection. This seems very promising. (Also, it might eventually get Glacier support..)

I have uploaded haskell-aws to Debian, and once it gets into testing and backports, I plan to switch git-annex over to it.

Posted Mon Aug 4 00:38:10 2014

Have started converting lots of special remotes to the new API. Today, S3 and hook got chunking support. I also converted several remotes to the new API without supporting chunking: bup, ddar, and glacier (which should support chunking, but there were complications).

This removed 110 lines of code while adding features! And, I seem to be able to convert them faster than testremote can test them. :)

Now that S3 supports chunks, they can be used to work around several problems with S3 remotes, including file size limits, and a memory leak in the underlying S3 library.

The S3 conversion included caching of the S3 connection when storing/retrieving chunks. [Update: Actually, it turns out it didn't; the hS3 library doesn't support persistent connections. Another reason I need to switch to a better S3 library!]

But the API doesn't yet support caching when removing or checking if chunks are present. I should probably expand the API, but got into some type checker messes when using generic enough data types to support everything. Should probably switch to ResourceT.

Also, I tried, but failed to make testremote check that storing a key is done atomically. The best I could come up with was a test that stored a key and had another thread repeatedly check if the object was present on the remote, logging the results and timestamps. It then becomes a statistical problem -- somewhere toward the end of the log it's ok if the key has become present -- but too early might indicate that it wasn't stored atomically. Perhaps it's my poor knowledge of statistics, but I could not find a way to analyze the log that reliably detected non-atomic storage. If someone would like to try to work on this, see the atomic-store-test branch.

Posted Sat Aug 2 23:13:16 2014

Built git annex testremote today.

That took a little bit longer than expected, because it actually found several fence post bugs in the chunking code.

It also found a bug in the sample external special remote script.

I am very pleased with this command. Being able to run 640 tests against any remote, without any possibility of damaging data already stored in the remote, is awesome. Should have written it a looong time ago!

Posted Fri Aug 1 21:59:23 2014

It took 9 hours, but I finally got to make c0dc134cded6078bb2e5fa2d4420b9cc09a292f7, which both removes 35 lines of code, and adds chunking support to all external special remotes!

The groundwork for that commit involved taking the type scheme I sketched out yesterday, completely failing to make it work with such high-ranked types, and falling back to a simpler set of types that both I and GHC seem better at getting our heads around.

Then I also had more fun with types, when it turned out I needed to run encryption in the Annex monad. So I had to go convert several parts of the utility libraries to use MonadIO and exception lifting. Yurk.

The final and most fun stumbling block caused git-annex to crash when retrieving a file from an external special remote that had neither encryption nor chunking. Amusingly it was because I had not put in an optimisation (namely, just renaming the file that was retrieved in this case, rather than unnecessarily reading it in and writing it back out). It's not often that a lack of an optimisation causes code to crash!

So, fun day, great result, and it should now be very simple to convert the bup, ddar, gcrypt, glacier, rsync, S3, and WebDAV special remotes to the new system. Fingers crossed.

But first, I will probably take half a day or so and write a git annex testremote that can be run in a repository and does live testing of a special remote including uploading and downloading files. There are quite a lot of cases to test now, and it seems best to get that in place before I start changing a lot of remotes without a way to test everything.


Today's work was sponsored by Daniel Callahan.

Posted Wed Jul 30 00:46:42 2014

Zap! ... My internet gateway was destroyed by lightning. Limping along regardless, and replacement ordered.

Got resuming of uploads to chunked remotes working. Easy!


Next I want to convert the external special remotes to have these nice new features. But there is a wrinkle: The new chunking interface works entirely on ByteStrings containing the content, but the external special remote interface passes content around in files.

I could just make it write the ByteString to a temp file, and pass the temp file to the external special remote to store. But then, when chunking is not being used, it would pointlessly read a file's content, only to write it back out to a temp file.

Similarly, when retrieving a key, the external special remote saves it to a file. But we want a ByteString. Except, when not doing chunking or encryption, letting the external special remote save the content directly to a file is optimal.

One approach would be to change the protocol for external special remotes, so that the content is sent over the protocol rather than in temp files. But I think this would not be ideal for some kinds of external special remotes, and it would probably be quite a lot slower and more complicated.

Instead, I am playing around with some type class trickery:

{-# LANGUAGE Rank2Types TypeSynonymInstances FlexibleInstances MultiParamTypeClasses #-}

type Storer p = Key -> p -> MeterUpdate -> IO Bool

-- For Storers that want to be provided with a file to store.
type FileStorer a = Storer (ContentPipe a FilePath)

-- For Storers that want to be provided with a ByteString to store
type ByteStringStorer a = Storer (ContentPipe a L.ByteString)

class ContentPipe src dest where
        contentPipe :: src -> (dest -> IO a) -> IO a

instance ContentPipe L.ByteString L.ByteString where
        contentPipe b a = a b

-- This feels a lot like I could perhaps use pipes or conduit...
instance ContentPipe FilePath FilePath where
        contentPipe f a = a f

instance ContentPipe L.ByteString FilePath where
        contentPipe b a = withTmpFile "tmpXXXXXX" $ \f h -> do
                L.hPut h b
                hClose h
                a f

instance ContentPipe FilePath L.ByteString where
        contentPipe f a = a =<< L.readFile f

The external special remote would be a FileStorer, so when a non-chunked, non-encrypted file is provided, it just runs on the FilePath with no extra work. When a ByteString is provided instead, it's swapped out to a temp file and the temp file provided. And many other special remotes are ByteStringStorers, so they will just pass the provided ByteString through, or read in the content of a file.
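
A compilable variation on that idea can be sketched as follows. It is only an illustration, under some assumptions: the type synonyms are replaced by two constrained helper functions (storeFromFile and storeFromByteString, both hypothetical names), withTmpFile is swapped for openBinaryTempFile from base plus removeFile from the directory package, and the Key and MeterUpdate parameters are left out.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, FlexibleContexts #-}

import qualified Data.ByteString.Lazy as L
import Control.Exception (bracket)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO (hClose, openBinaryTempFile)

-- Hand content that is available as a src to an action that wants a dest.
class ContentPipe src dest where
        contentPipe :: src -> (dest -> IO a) -> IO a

-- Same form on both sides: pass it straight through.
instance ContentPipe L.ByteString L.ByteString where
        contentPipe b a = a b

instance ContentPipe FilePath FilePath where
        contentPipe f a = a f

-- ByteString provided, file wanted: spill to a temp file, which is
-- removed once the action has finished (or thrown an exception).
instance ContentPipe L.ByteString FilePath where
        contentPipe b a = do
                tmpdir <- getTemporaryDirectory
                bracket
                        (openBinaryTempFile tmpdir "contentpipe.tmp")
                        (\(f, h) -> hClose h >> removeFile f)
                        (\(f, h) -> L.hPut h b >> hClose h >> a f)

-- File provided, ByteString wanted: read it in lazily.
instance ContentPipe FilePath L.ByteString where
        contentPipe f a = a =<< L.readFile f

-- Hypothetical helper: run a storer that wants a file.
storeFromFile :: ContentPipe src FilePath => src -> (FilePath -> IO Bool) -> IO Bool
storeFromFile = contentPipe

-- Hypothetical helper: run a storer that wants a ByteString.
storeFromByteString :: ContentPipe src L.ByteString => src -> (L.ByteString -> IO Bool) -> IO Bool
storeFromByteString = contentPipe
```

Calling storeFromFile with a FilePath does no extra work, while calling it with a ByteString goes through the temp file; the caller does not need to know which happened.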

I think that would work. Though it is not optimal for external special remotes that are chunked but not encrypted. For that case, it might be worth extending the special remote protocol with a way to say "store a chunk of this file from byte N to byte M".


Also, talked with ion about what would be involved in using rolling checksum based chunks. That would allow for rsync or zsync like behavior, where when a file changed, git-annex uploads only the chunks that changed, and the unchanged chunks are reused.

I am not ready to work on that yet, but I made some changes to the parsing of the chunk log, so that additional chunking schemes like this can be added to git-annex later without breaking backwards compatibility.

Posted Mon Jul 28 21:28:34 2014

Last night, went over the new chunking interface, tightened up exception handling, and improved the API so that things like WebDAV will be able to reuse a single connection while all of a key's chunks are being downloaded. I am pretty happy with the interface now, and expect to convert more special remotes to use it soon.

Just finished adding a killer feature: Automatic resuming of interrupted downloads from chunked remotes. Sort of a poor man's rsync, that while less efficient and awesome, is going to work on every remote that gets the new chunking interface, from S3 to WebDAV, to all of Tobias's external special remotes! Even allows for things like starting a download from one remote, interrupting, and resuming from another one, and so on.

I had forgotten about resuming while designing the chunking API. Luckily, I got the design right anyway. Implementation was almost trivial, and only took about 2 hours! (See 9d4a766cd7b8e8b0fc7cd27b08249e4161b5380a)
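
To give a feel for why resuming falls out of chunking so easily, here is a minimal sketch of the arithmetic involved. It is not the code from that commit; resumePoint and chunksToFetch are hypothetical names, chunks are assumed to be numbered from 1, and a partial trailing chunk is assumed to be thrown away and fetched again.

```haskell
import Data.Int (Int64)

-- Given the chunk size and the size of a partially downloaded file,
-- work out where to resume: (number of whole chunks already present,
-- byte offset at which to continue writing).
resumePoint :: Int64 -> Int64 -> (Int64, Int64)
resumePoint chunksize have
        | chunksize <= 0 || have <= 0 = (0, 0)
        | otherwise = (n, n * chunksize)
  where
        n = have `div` chunksize

-- The chunk numbers still to be fetched, for a key that was split
-- into total chunks.
chunksToFetch :: Int64 -> Int64 -> Int64 -> [Int64]
chunksToFetch chunksize have total = [done + 1 .. total]
  where
        (done, _) = resumePoint chunksize have
```

Since the remaining chunks are just a list of numbers, nothing ties them to the remote the download started on, which is what lets a download be resumed from a different remote.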

I'll later add resuming of interrupted uploads. It's not hard to detect such uploads with only one extra query of the remote, but in principle, it should be possible to do it with no extra overhead, since git-annex already checks if all the chunks are there before starting an upload.

Posted Sun Jul 27 23:14:55 2014

Remained frustratingly stuck until 3 pm on the same stuff that puzzled me yesterday. However, 6 hours later, I have the directory special remote 100% working with both new chunk= and legacy chunksize= configuration, both with and without encryption.


So, the root of why this was hard, since I thought about it a lot today in between beating my head into the wall: git-annex's internal API for remotes is really, really simple. It basically comes down to:

 Remote
        { storeKey :: Key -> AssociatedFile -> MeterUpdate -> Annex Bool
        , retrieveKeyFile :: Key -> AssociatedFile -> FilePath -> MeterUpdate -> Annex Bool
        , removeKey :: Key -> Annex Bool
        , hasKey :: Key -> Annex (Either String Bool)
        }

This simplicity is a Good Thing, because it maps very well to REST-type services. And it allows for quite a lot of variety in implementations of remotes. Ranging from regular git remotes, that rsync files around without git-annex ever loading them itself, to remotes like webdav that load and store files themselves, to remotes like tahoe that intentionally do not support git-annex's built-in encryption methods.

However, the simplicity of that API means that lots of complicated stuff, like handling chunking, encryption, etc, has to be handled on a per-remote basis. Or, more generally, by Remote -> Remote transformers that take a remote and add some useful feature to it.

One problem is that the API is so simple that a remote transformer that adds encryption is not feasible. In fact, every encryptable remote has had its own code that loads a file from local disk, encrypts it, and sends it to the remote. Because there's no way to make a remote transformer that converts a storeKey into an encrypted storeKey. (Ditto for retrieving keys.)
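
The shape of such transformers, and where they run out of road, can be sketched with a cut-down stand-in for the record above. This is an illustration only: it runs in IO rather than Annex, uses plain String keys, drops the AssociatedFile and MeterUpdate parameters, and readOnly and mapKeys are hypothetical names.

```haskell
-- A cut-down stand-in for the Remote record, running in IO.
type Key = String

data Remote = Remote
        { storeKey :: Key -> IO Bool
        , retrieveKeyFile :: Key -> FilePath -> IO Bool
        , removeKey :: Key -> IO Bool
        , hasKey :: Key -> IO (Either String Bool)
        }

-- A Remote -> Remote transformer: the same remote, but writes are refused.
readOnly :: Remote -> Remote
readOnly r = r
        { storeKey = \_ -> return False
        , removeKey = \_ -> return False
        }

-- Another transformer: rewrite keys before they reach the remote.
-- This much is possible, but storeKey is only handed a key, never the
-- content, so a transformer at this level has nowhere to encrypt it.
mapKeys :: (Key -> Key) -> Remote -> Remote
mapKeys f r = Remote
        { storeKey = storeKey r . f
        , retrieveKeyFile = retrieveKeyFile r . f
        , removeKey = removeKey r . f
        , hasKey = hasKey r . f
        }
```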

I almost made the API more complicated today. Twice. But both times I ended up not, and I think that was the right choice, even though it meant I had to write some quite painful code.


In the end, I instead wrote a little module that pulls together supporting both encryption and chunking. I'm not completely happy because those two things should be independent, and so separate. But, 120 lines of code that don't keep them separate is not the end of the world.

That module also contains some more powerful, less general APIs, that will work well with the kinds of remotes that will use it.

The really nice result, is that the implementation of the directory special remote melts down from 267 lines of code to just 172! (Plus some legacy code for the old style chunking, refactored out into a file I can delete one day.) It's a lot cleaner too.

With all this done, I expect I can pretty easily add the new style chunking to most git-annex remotes, and remove code from them while doing it!


Today's work was sponsored by Mark Hepburn.

Posted Sun Jul 27 00:54:38 2014

A lil bit in the weeds on the chunking rewrite right now. I did succeed in writing the core chunk generation code, which can be used for every special remote. It was pretty hairy (needs to stream large files in constant memory, separating into chunks, and get the progress display right across operations on chunks, etc). That took most of the day.
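
The splitting step on its own is small; a minimal sketch over lazy ByteStrings might look like the following. It is an assumption-laden illustration, not the git-annex code: splitChunks is a hypothetical name, progress metering is left out, and empty content is assumed to produce no chunks at all.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)

-- Split content into chunks of at most chunksize bytes. A lazy
-- ByteString is consumed incrementally, so a large file read with
-- L.readFile does not have to be held in memory all at once.
-- Content that is an exact multiple of the chunk size does not get a
-- trailing empty chunk. A chunksize of 0 or less means no chunking.
splitChunks :: Int64 -> L.ByteString -> [L.ByteString]
splitChunks chunksize b
        | chunksize <= 0 = [b]
        | L.null b = []
        | otherwise =
                let (c, rest) = L.splitAt chunksize b
                in c : splitChunks chunksize rest
```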

Ended up getting stuck in integrating the encryptable remote code, and had to revert changes that could have led to rewriting (or perhaps eliminating?) most of the per-remote encryption specific code.

Up till now, this has supported both encrypted and non-encrypted remotes; it was simply passed encrypted keys for an encrypted remote:

remove :: Key -> Annex Bool

But with chunked encrypted keys, it seems it needs to be more complicated:

remove' :: Maybe (Key -> Key) -> ChunkConfig -> Key -> Annex Bool

So that when the remote is configured to use chunking, it can look up the chunk keys, and then encrypt them, in order to remove all the encrypted chunk keys.
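
Roughly, that ordering (expand to chunk keys first, then encrypt each one, then remove) could be sketched like this. Everything here is hypothetical: the ChunkConfig below simply carries the number of chunks instead of looking it up, the chunk key naming is made up, the underlying removal action is passed in as an extra first argument, and it runs in IO rather than Annex.

```haskell
import Data.Maybe (fromMaybe)

type Key = String

-- Hypothetical: in this sketch the config already knows how many chunks
-- a key was split into, rather than looking that up.
data ChunkConfig = NoChunks | Chunked Int

-- Hypothetical naming scheme for the keys of individual chunks.
chunkKeys :: ChunkConfig -> Key -> [Key]
chunkKeys NoChunks k = [k]
chunkKeys (Chunked n) k = [ k ++ "-chunk" ++ show i | i <- [1 .. n] ]

-- Remove a key: expand it to its chunk keys, apply the key encryption
-- (if the remote is encrypted), and remove each resulting key.
-- Succeeds only if every removal succeeded.
remove' :: (Key -> IO Bool) -> Maybe (Key -> Key) -> ChunkConfig -> Key -> IO Bool
remove' rawremove enc cfg k =
        and <$> mapM (rawremove . fromMaybe id enc) (chunkKeys cfg k)
```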

I don't like that complication, so want to find a cleaner abstraction. Will sleep on it.


While I was looking at the encryptable remote generator, I realized the remote cost was being calculated wrongly for special remotes that are not encrypted. Fixed that bug.


Today's work was sponsored by bak.

Posted Sat Jul 26 01:00:04 2014

The design for new style chunks seems done, and I laid the groundwork for it today. Added chunk metadata to keys, reorganized the legacy chunking code for directory and webdav so it won't get (too badly) in the way, and implemented the chunk logs in the git-annex branch.

Today's work was sponsored by LeastAuthority.com.

Posted Thu Jul 24 20:51:58 2014

Working on designs for better chunking. Having a hard time finding a way to totally obscure file sizes, but otherwise a good design seems to be coming together. I particularly like that the new design puts the chunk count in the Key (which is then encrypted for special remotes), rather than having it be some special extension.

While thinking through chunking, I realized that the current chunking method can fail if two repositories have different chunksize settings for the same special remote and both upload the same key at the same time. Aren't races fun? The new design will eliminate this problem; in the meantime updated the docs to recommend never changing a remote's chunksize setting.

Posted Wed Jul 23 21:58:48 2014

Updated the Debian backport. (Also the git-remote-gcrypt backport.)

Made the assistant install a desktop file to integrate with Konqueror.

Improved git annex repair, fixing a bug that could cause it to leave broken branch refs and yet think that the repair was successful.


A bit surprised to see that it has now been a full year since I started doing development funded by my campaign. Not done yet!

Update on campaign rewards: https://campaign.joeyh.name/blog/stickers_soon/


Today's work was sponsored by Douglas Butts.

Posted Mon Jul 21 23:22:44 2014

Spent hours today in a 10-minute build/test cycle, tracking down a bug that caused the assistant to crash on Windows after exactly 10 minutes uptime. Eventually found the cause; this is fallout from last month's work that got it logging to the debug.log on Windows.

There was more, but that was the interesting one..

Posted Wed Jul 16 22:28:00 2014

I have mostly been thinking about gcrypt today. This issue needs to be dealt with. The question is, does it really make sense to try to hide the people a git repository is encrypted for? I have posted some thoughts and am coming to the viewpoint that obscuring the identities of users of a repository is not a problem git-annex should try to solve itself, although it also shouldn't get in the way of someone who is able and wants to do that (by using tor, etc).

Finally, I decided to go ahead and add a gcrypt.publish-participants setting to git-remote-gcrypt, and make git-annex set that by default when setting up a gcrypt repository.

Some promising news from the ghc build on arm. I got a working ghc, and even ghci works. Which would make the template haskell in the webapp etc available on arm without the current horrible hacks. Have not managed to build the debian ghc package successfully yet though.

Also, fixed a bug that made git annex sync not pull/push with a local repository that had not yet been initialized for use with git-annex.

Today's work was sponsored by Stanley Yamane.

Posted Tue Jul 15 21:47:37 2014

Yay, the Linux autobuilder is back! Also fixed the Windows build.

Fixed a reversion that prevented the webapp from starting properly on Windows, which was introduced by some bad locking when I put in the hack that makes it log to the log file on that platform.

Various other minor fixes here and there. There are almost enough to do a release again soon.

I've also been trying to bootstrap ghc 7.8 on arm, for Debian. There's a script that's supposed to allow building 7.8 using 7.6.3, dealing with a linker problem by using the gold linker. Hopefully that will work since otherwise Debian could remain stuck with an old ghc or worse lose the arm ports. Neither would be great for git-annex..

Posted Tue Jul 15 04:42:51 2014

Spent past 2 days catching up on backlog and doing bug triage and some minor bug fixes and features. Backlog is 27, lowest in quite a while so I feel well on top of things.

I was saddened to find this bug where I almost managed to analyze the ugly bug's race condition, but not quite (and then went on vacation). BTW, I have not heard from anyone else who was hit by that bug so far.

The linux autobuilders are still down; their host server had a disk crash in an electrical outage. Might be down for a while. I would not mind setting up a redundant autobuilder if anyone else would like to donate a linux VM with 4+ gb of ram.

Posted Fri Jul 11 21:03:01 2014

Important: A bug caused the assistant to sometimes remove all files from the git repository. You should check if your repository is ok. If the bug hit you, it should be possible to revert the bad commit and recover your files with no data loss. See the bug report for details.

This affected git-annex versions since 5.20140613, and only when using the assistant in direct mode. It should be fixed in today's release, 5.20140709.

I'm available at urgent2014@joeyh.name to help anyone hit by this unfortunate bug.

This is another bug in the direct mode merge code. I'm not happy about it. It's particularly annoying that I can't fix up after it automatically (because there's no way to know if any given commit in the git history that deletes all the files is the result of this bug, or a legitimate deletion of all files).

The only good thing is that the design of git-annex is pretty robust, and in this case, despite stupidly committing the deletion of all the files in the repository, git-annex did take care to preserve all their contents and so the problem should be able to be resolved without data loss.

Unfortunately, the main autobuilder is down and I've had to spin up autobuilders on a different machine (thank goodness that's very automated now!), and so I have not been able to build the fixed git-annex for android yet. I hope to get that done later this evening.


Yesterday, I fixed a few (much less bad) bugs, and did some thinking about plans for this month. The roadmap suggests working on some of chunks, deltas or gpgkeys. I don't know how to do deltas yet really. Chunks is pretty easily done. The gpg keys stuff is pretty open ended and needs some more work to define some use cases. But, after today, I am more inclined to want to spend time on better testing and other means of avoiding this kind of situation.

Posted Wed Jul 9 20:29:59 2014

Got the release out. Had to fix various autobuilder issues. The arm autobuilder is unfortunately not working currently.

Updated git-annex to build with a new version of the bloomfilter library.

Posted Mon Jul 7 20:13:28 2014

Got a bit distracted improving Haskell's directory listing code.

Only real git-annex work today was fixing Assistant merge loop, which was caused by changes in the last release (that made direct mode merging crash/interrupt-safe). This is a kind of ugly bug, that can result in the assistant making lots of empty commits in direct mode repositories. So, I plan to make a new release on Monday.

Posted Sat Jul 5 21:24:19 2014

Spent the morning improving behavior when commit.gpgsign is set. Now git-annex will let gpg sign commits that are made when eg, manually running git annex sync, but not commits implicitly made to the git-annex branch. And any commits made by the assistant are not gpg signed. This was slightly tricky, since lots of different places in git-annex ran git commit, git merge and similar.

Then got back to a test I left running over vacation, that added millions of files to a git annex repo. This was able to reproduce a problem where git annex add blew the stack and crashed at the end. There turned out to be two different memory issues, one was in git-annex and the other is in Haskell's core getDirectoryContents. Was able to entirely fix it, eventually.

Posted Fri Jul 4 22:14:33 2014

Finally back to work with a new laptop!

Did one fairly major feature today: When using git-annex to pull down podcasts, metadata from the feed is copied into git-annex's metadata store, if annex.genmetadata is set. Should be great for views etc!

Worked through a lot of the backlog, which is down to 47 messages now.

Only other bug fix of note is a fix on Android. A recent change to git made it try to chmod files, which tends to fail on the horrible /sdcard filesystem. Patched git to avoid that.

For some reason the autobuilder box rebooted while I was away, and somehow the docker containers didn't come back up -- so they got automatically rebuilt. But I have to manually finish up building the android and armel ones. Will be babysitting that build this evening.

Today's work was sponsored by Ævar Arnfjörð Bjarmason.

Posted Thu Jul 3 20:44:33 2014

I am back from the beach, but my dev laptop is dead. A replacement is being shipped, and I have spent today getting my old netbook into a usable state so I can perhaps do some work using it in the meantime.

(Backlog is 95 messages.)

Posted Mon Jun 30 22:36:21 2014

Last night, got logging to daemon.log working on Windows. Aside from XMPP not working (but it's near to being deprecated anyway), and some possible issues with unicode characters in filenames, the Windows port now seems in pretty good shape for a beta release.

Today, mostly worked on fixing the release process so the metadata accurately reflects the version from the autobuilder that is included in the release. Turns out there was version skew in the last release (now manually corrected). This should avoid that happening again, and also automates more of my release process.

Posted Thu Jun 19 02:57:57 2014

After despairing of ever solving this yesterday (and for the past 6 months really), I've got the webapp running on Windows with no visible DOS box. Also have the assistant starting up in the background on login.

It turns out a service was not the way to go. There is a way to write a VB Script that runs a "DOS" command in a hidden window, and this is what I used. Amazing how hard it was to work this out, probably partly because I don't have the Windows vocabulary to know what to look for.

Posted Tue Jun 17 18:31:11 2014

More work on windows git-annex service, but am stuck with a permissions problem.

Fixed a bug that prevented two assistants from syncing when there was only a uni-directional link between them. Only affected direct mode, and was introduced back when I added the direct mode guard.

Posted Mon Jun 16 23:56:30 2014

It's officially a Windows porting month. Now that I'm half way through it and with the last week of the month going to be a vacation, this makes sense.

Today, finished up dealing with the timezone/timestamp issues on Windows. This got stranger and stranger the closer I looked at it. After a timestamp change, a program that was already running will see one timestamp, while a program that is started after the change will see another one! My approach works pretty much no matter how Windows goes insane though, and always recovers a true timestamp. Yay.

Also fixed a regression test failure on Windows, which turned out to be rooted in a bug in the command queue runner, which neglected to pass along environment overrides on Windows.

Then I spent 5 hours tracking down a tricky test suite failure on Windows, which turned out to also affect FAT and be a recent reversion that has as its root cause a fun bug in git itself. Put in a not very good workaround. Thank goodness for test suites!

Also got the arm autobuilder unstuck. Release tomorrow.

Posted Fri Jun 13 02:09:02 2014

Spent all day on some horrible timestamp issues on legacy systems.

On FAT, timestamps have a 2s granularity, which is ok, but then Linux adds a temporary higher resolution cache, which is lost on unmount. This confused git-annex since the mtimes seemed to change and it had to re-checksum half the files to get unconfused, which was not good. I found a way to use the inode sentinel file to detect when on FAT and put in a workaround, without degrading git-annex everywhere else.

On Windows, time zones are an utter disaster; it changes the mtime it reports for files after the time zone has changed. Also there's a bug in the haskell time library which makes it return old time zone data after a time zone change. (I just finished developing a fix for that bug..)

Left with nothing but a few sticks, I rubbed them together, and actually found a way to deal with this problem too. Scary details in Windows file timestamp timezone madness. While I've implemented it, it's stuck on a branch until I find a way to make git-annex notice when the timezone changes while it's running.


Today's work was sponsored by Svenne Krap.

Posted Wed Jun 11 23:08:13 2014

Have for the first time gotten git-annex to run as a proper Windows service, using nssm. (details) Not quite ready yet though; doesn't run as the right user.

And a few other windows porting bits.

Posted Tue Jun 10 23:23:18 2014

Spent most of today improving behavior when a sync or merge is interrupted in direct mode. It was possible for an interrupt at the wrong time to leave the merge committed, but the work tree not yet updated. And then the next sync would make a commit that reverted the merged changes!

To fix this I had to avoid making any merge commit or indeed updating the index until after the work tree is updated. It looked intractable for a while; I'm still surprised I eventually succeeded.

Posted Tue Jun 10 00:16:27 2014

Did work on Windows porting today. First, fixed a reversion in the last release, that broke the git-annex branch pretty badly on Windows, causing \r to be written to files on that branch that should never have DOS line endings. Second, fixed a long-standing bug that prevented getting a file from a local bare repository on Windows.

Also refreshed all autobuilders to deal with the gnutls and openssl security holes-of-the-week. (git-annex uses gnutls only for XMPP, and does not use openssl itself, but a few programs bundled with it, like curl, do use openssl.)

A nice piece of news: OSX Homebrew now contains git-annex, so it can be easily installed with brew install git-annex

Posted Thu Jun 5 21:33:54 2014

Yesterday I recorded a new screencast, demoing using the assistant on a local network with a small server. git-annex assistant lan. That's the best screencast yet; having a real framing story was nice; recent improvements to git-annex are taken advantage of without being made a big deal; and audio and video are improved. (But there are some minor encoding glitches which I'd have to re-edit it to fix.)

The roadmap has this month dedicated to improving Android. But I think what I'd more like to do is whatever makes the assistant usable by the most people. This might mean doing more on Windows, since I hear from many who would benefit from that. Or maybe something not related to porting?

Posted Wed Jun 4 21:18:31 2014

After making a release yesterday, I've been fixing some bugs in the webapp, all to do with repository configuration stored on the git-annex branch. I was led into this by a strange little bug where the webapp stored configuration in the wrong repo in one situation. From there, I noticed that often when enabling an existing repository, the webapp would stomp on its group and preferred content and description, replacing them with defaults.

This was a systematic problem, it had to be fixed in several places. And some of the fixes were quite tricky. For example, when adding a ssh repository, and it turns out there's already a git-annex repository at the entered location, it needs to avoid changing its configuration. But also, the configuration of that repo won't be known until after the first git pull from it. So it doesn't make sense to show the repository edit form after enabling such a repository.

Also worked on a couple other bugs, and further cleaned up the bugs page. I think I am finally happy with how the bug list is displayed, with confirmed/moreinfo/etc tags.

Today's work was sponsored by François Deppierraz.

Posted Fri May 30 21:56:26 2014

Got a handle on the Android webapp static file problems (no, they were not really encoding problems!), and hopefully that's all fixed now. Also, only 3 modules use Char8 now. And updated the git-annex backport. That's all I did today.

Meanwhile, a complete ZSH completion has been contributed by Schnouki. And, Ben Gamari sent in a patch moving from the deprecated MonadCatchIO-transformers library to the exceptions library.

Posted Wed May 28 22:26:53 2014

These themed days are inadvertent, but it happened again: Nearly everything done today had to do with encoding issues.

The big news is that it turned out everything written to files in the git-annex branch had unicode characters truncated to 8 bits. Now fixed so you should always get out the same thing you put in, no matter what encoding you use (but please use utf8). This affected things like storing repository descriptions, but worse, it affected metadata. (Also preferred content expressions, I suppose.)
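
A minimal sketch of the kind of truncation described above, assuming the data passed through Data.ByteString.Char8 from the bytestring library. The helper names truncated and survivesChar8 are made up for this illustration and are not git-annex code.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

-- Data.ByteString.Char8.pack keeps only the low 8 bits of each Char,
-- so any code point above 255 is silently changed.
truncated :: String -> [Int]
truncated = map fromIntegral . B.unpack . B8.pack

-- A string survives the Char8 round trip only if every character
-- fits in a single byte.
survivesChar8 :: String -> Bool
survivesChar8 s = B8.unpack (B8.pack s) == s
```

For example, truncated "\955" (a Greek lambda, code point 955) gives [187], since 955 mod 256 is 187, and survivesChar8 "\955" is False.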

With that fixed, there are still 7 source files left that use Char8 libraries. There used to be more; nearly every use of those is a bug. I looked over the remaining uses of it, and there might be a problem with Creds using it. I should probably make a push to stamp out all remaining uses of Char8.

Other encoding bugs were less reproducible.

And just now, Sören made some progress on Bootstrap3 icons missing on Android ... and my current theory is this is actually caused by an encoding issue too.

Posted Tue May 27 20:37:23 2014

With some help from Sören, have been redoing the android build environment for git-annex. This included making propellor put it in a docker container, which was easy. But then much struggling with annoying stuff like getting the gnutls linking to work, and working around some dependency issues on hackage that make cabal's dependency resolver melt down. Finally succeeded after much more time than I had wanted to spend on this.

Posted Tue May 27 17:23:39 2014

Working on moving the android autobuilder to Docker & Propellor, which will finish containerizing all the autobuilds that I run. Updated ghc-android to use the released ghc 7.8.2, which will make it build more reliably.

Also did bug triage. Bugs are now divided into confirmed and unconfirmed categories.

Posted Sun May 25 00:38:40 2014

Keeping lots of things going these past few days..

  • Rebootstrapping the armel autobuilder with propellor. Some qemu instability and the need to update haskell library patches meant this took a lot of hand-holding. Finally got a working setup today.
  • Designing and ordering new git-annex stickers on clear vinyl backing; have put off sending those to campaign contributors for too long.
  • Added a new feature to the webapp: It now remembers the ssh remotes that it sets up, and makes it easy to enable them elsewhere, the same as other sorts of remotes. Had a very pleasant surprise building this, when I was able to reuse all the UI code for enabling rsync and gcrypt remotes. I think this will be a useful feature as we transition away from XMPP.

Posted Fri May 23 01:08:36 2014

Worked on triaging several bugs. Fixed an easy one, which involved the assistant choosing the wrong path to a repository that has multiple remotes. After today, backlog is down to 43, nearly pre-Brazil levels.

It seems that git-remote-gcrypt never quite worked on OSX. It looked like it did, but a bug prevented anything being pushed to the remote. Tracked down and fixed that bug.

This evening, getting back to working on the armel autobuilder setup using propellor. The autobuilder will use a pair of docker containers, one armel and a companion amd64, and their quite complex setup will be almost fully automated (except for the haskell library patching part).


Today's work was sponsored by Mica Semrick.

Posted Mon May 19 22:59:21 2014

Released git-annex 5.20140517 today. The changelog for this release is very unusual, because it's full of contributions from others! There are as many patches from others in this release as git-annex got in the first entire two years of its existence.

I'd like to keep that going. Also, I could really use help triaging bug reports right now. So I have updated the contribute page with more info about easy ways to contribute to git-annex. If you read this devblog, you're an ideal contributor, and you don't need to know how to write haskell either.. So take a look at the page and see if you can help out.

Posted Sun May 18 02:17:53 2014

Powered through the backlog today, and got it down to 67! Probably most of the rest is the hard ones though.

A theme today was: It's stupid hard to get git-annex-shell installed into PATH. While that should be the simplest thing in the world, I'm pinned between two problems:

  1. There's no single portable package format, so all the decades of development of nice ways to get things into PATH don't work for everybody.
  2. bash provides not a single dotfile that will work in all circumstances to configure PATH. In particular, "ssh $host git-annex-shell" causes bash to helpfully avoid looking at any dotfiles at all.

Today's flailing to work around that included:

  • Merged a patch from Fraser Tweedale to allow git config remote.origin.annex-shell /not/in/path/git-annex-shell
  • Merged a patch from Justin Lebar to allow symlinking the git-annex-shell etc from the standalone tarball to a directory that is in PATH. (Only on Linux, not OSX yet.)
  • Improved the warning message git-annex prints when a remote server does not have git-annex-shell in PATH, suggesting some things the user could do to try to fix it.

I've found out why OSX machines were retrying upgrades repeatedly. The version in the .info file did not match the actual git-annex version for OSX. I've fixed the info file version, but will need to come up with a system to avoid such mismatches.

Made a few other fixes. A notable one is that dragging and dropping repositories in the webapp to reorder the list (and configure costs) had been broken since November.

git-annex 5.20140421 finally got into Debian testing today, so I updated the backport. I recommend upgrading, especially if you're using the assistant with a ssh remote, since you'll get all of last month's nice features that make XMPP unnecessary in that configuration.


Today's work was sponsored by Geoffrey Irving.

Posted Fri May 16 21:23:47 2014

Spent the day testing the sshpasswd branch. A few interesting things:

  • I was able to get rid of 10 lines of Windows specific code for rsync.net, which had been necessary for console ssh password prompting to work. Yay!
  • git-remote-gcrypt turned out to be broken when there is no controlling tty. --no-tty has to be passed to gpg to avoid it falling over in this case, even when a gpg agent is available to be used. I fixed this with a new release of git-remote-gcrypt.

Mostly the new branch just worked! And is merged...

Merged a patch from Robie Basak that adds a new special remote that's sort of like bup but supports deletion: ddar

Backlog: 172

Today's work was sponsored by Andrew Cant.

Posted Thu May 15 20:39:48 2014

My backlog is massive -- 181 items to answer. Will probably take the rest of the month to get caught back up. Rather than digging into that yet, spent today working on the webapp's ssh password prompting.

I simplified it so the password is entered on the same form as the rest of the server's information. That made the UI easy to build, but means that when a user already has a ssh key they want to use, they need to select "existing ssh key"; the webapp no longer probes to automatically detect that case.

Got the ssh password prompting in the webapp basically working, and it's a really nice improvement! I even got it to work on Windows (eventually...). It's still only in the sshpassword branch, since I need to test it more and probably fix some bugs. In particular, when enabling a remote that already exists, I think it never prompts for the password yet.

Today's work was sponsored by Nicola Chiapolini.

Posted Wed May 14 22:22:20 2014

I have a preliminary design for requests routing. Won't be working on it immediately, but simulations show it can work well in a large ad-hoc network.

Posted Tue May 6 21:08:49 2014

Sören Brunk's massive bootstrap 3 patch has landed! This is a 43 thousand line diff, with 2 thousand lines after the javascript and CSS libraries are filtered out. Either way, the biggest patch contributed by anyone to git-annex so far, and excellent work.

Meanwhile, I built a haskell program to simulate a network of highly distributed git-annex nodes with ad-hoc connections and the selective file syncing algorithm now documented at the bottom of efficiency.

Currently around 33% of requested files never get to their destination in this simulation, but this is probably because its network is randomly generated, and so contains disconnected islands. So next, some data entry, from a map that involves an Amazon not in .com, dotted with names of people I have recently met... :)
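
A small sketch of why disconnected islands make some requests undeliverable, assuming the network is modelled as a map from each node to its directly connected neighbours. This is not the simulator itself; Network, reachable and canDeliver are names invented for the illustration.

```haskell
import qualified Data.Map as M
import qualified Data.Set as S

type Node = Int

-- Each node maps to the nodes it is directly connected to.
type Network = M.Map Node [Node]

-- All nodes reachable from a starting node by following connections.
reachable :: Network -> Node -> S.Set Node
reachable net start = go S.empty [start]
  where
    go seen [] = seen
    go seen (n : rest)
      | n `S.member` seen = go seen rest
      | otherwise = go (S.insert n seen) (M.findWithDefault [] n net ++ rest)

-- A request can only be satisfied if the destination is reachable
-- from the node that holds the file.
canDeliver :: Network -> Node -> Node -> Bool
canDeliver net from to = to `S.member` reachable net from
```

With two islands, say nodes 1, 2, 3 connected to each other and nodes 4, 5 connected only to each other, canDeliver is False for any pair that spans the islands, no matter how good the routing is.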

Posted Fri May 2 18:57:16 2014

I've moved out of implementation mode (unable to concentrate enough), and into high-level design mode.

Syncing efficiency has been an open TODO for years, to find a way to avoid flood filling the network, and find more efficient ways to ensure data only gets to the nodes that want it. Relatedly, Android devices often need a way to mark individual files they want to have. Had a very productive discussion with Vince and Fernao and I think we're heading toward a design that will address both these needs, as well as some more Brazil-specific use cases, about which more later.

Today's work was sponsored by Casa do Boneco.

Posted Thu May 1 17:23:47 2014

Reviewed Sören's updated bootstrap3 patch, which appeared while I was traveling. Sören kindly fixed it to work with Debian stable's old version of Yesod, which was quite a lot of work. The new bootstrap3 UI looks nice; I found a few minor issues, but expect to be able to merge it soon.

Started on sshpassword groundwork. Added a simple password cache to the assistant, with automatic expiration, and made git-annex be able to be run by ssh as the SSH_ASKPASS program.
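
A minimal sketch of a password cache with automatic expiration, assuming time is a plain count of seconds and the cache holds at most one entry. The types and the names store and fetch are hypothetical and only illustrate the idea; the assistant's real cache may look quite different.

```haskell
-- Seconds since some epoch; a stand-in for a real clock type.
type Seconds = Integer

-- A cached secret together with the time at which it stops being valid.
data Cached a = Cached
  { cachedValue :: a
  , expiresAt :: Seconds
  }

-- An empty cache is Nothing.
type Cache a = Maybe (Cached a)

-- Store a value that stays valid for the given lifetime.
store :: Seconds -> Seconds -> a -> Cache a
store now lifetime v = Just (Cached v (now + lifetime))

-- Look the value up, treating an expired entry as absent.
fetch :: Seconds -> Cache a -> Maybe a
fetch now (Just c) | now < expiresAt c = Just (cachedValue c)
fetch _ _ = Nothing
```

Because fetch takes the current time as an argument, any caller can find the entry gone, which is why every code path that needs the password has to be ready to prompt again.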

The main difficulty will be changing the webapp's UI to prompt for the ssh password when one is needed. There are several code paths in ssh remote setup where a password might be needed. Since the cached password expires, it may need to be prompted for at any of those points. Since a new page is loading, it can't pop up a prompt on the current page; it needs to redirect to a password prompt page and then redirect back to the action that needed the password. ...At least, that's one way to do it. I'm going to sleep on it and hope I dream up a better way.

Posted Tue Apr 29 22:33:53 2014

Today was mostly spent driving across Brazil, but I had energy this evening for a little work on git-annex.

Made the assistant delete old temporary files on startup. I've had scattered reports of a few users whose .git/annex/tmp contained many files, apparently put there by the assistant when it locks down a file prior to annexing it. That seems it could possibly be a bug -- or it could just be unclean shutdowns interrupting the assistant. Anyway, this will deal with any source of tmp cruft, and I made sure to preserve tmp files for partially downloaded content.

Posted Mon Apr 28 01:12:55 2014

Next month the roadmap has me working on sshpassword. That will be a nice UI improvement and I'd be very surprised if it takes more than a week, which is great.

Getting a jump on it today, investigating using SSH_ASKPASS. It seems this will even work on Windows! Preliminary design in sshpassword.

Time to get on a plane to a plane to a plane to Brasilia!

Posted Fri Apr 25 20:32:36 2014

Now git-annex's self-upgrade code will check the gpg signature of a new version before using it.

To do this I had to include the gpg public keys into the git-annex distribution, and that raised the question of which public keys to include. Currently I have both the dedicated git-annex distribution signing key, and my own gpg key as a backup in case I somehow misplace the former.

Also spent a while looking at the recent logs on the web server. There seem to be around 600 users of the assistant with upgrade checking enabled. That breaks down to 68% Linux amd64, 20% Linux i386, 11% OSX Mavericks, and 0.5% OSX Lion.

Most are upgrading successfully, but there are a few that seem to repeatedly fail for some reason. (Not counting the OSX Lion, which will probably never find an upgrade available.) I hope that someone who is experiencing an upgrade failure gets in touch with some debug logs.

In the same time period, around 450 unique hosts manually downloaded a git-annex distribution. Also compare with Debian popcon, which has 1200 reporting git-annex users.

Posted Wed Apr 23 21:10:28 2014

I hope this will be a really good release. Didn't get all the way to telehash this month, but the remotedaemon is pretty sweet. Updated roadmap pushes telehash back again.

The files in this release are now gpg signed, after recently moving the downloads site to a dedicated server, which has a dedicated gpg key. You can verify the detached signatures as an additional security check over trusting SSL. The automatic upgrade code doesn't check the gpg signatures yet.

Sören Brunk has ported the webapp to Bootstrap 3. https://github.com/brunksn/git-annex/tree/bootstrap3
The branch is not ready for merging yet (it would break the Debian stable backports), but that was a nice surprise.

Posted Mon Apr 21 21:23:22 2014

Sometimes you don't notice something is missing for a long time until it suddenly demands attention. Like today.

Seems the webapp never had a way to stop using XMPP and delete the XMPP password. So I added one.

The new support for instantly noticing changes on a ssh remote forgot to start up a connection to a new remote after it was created. Fixed that.

(While doing some testing on Android for unrelated reasons, I noticed that my android tablet was pushing photos to a ssh server and my laptop immediately noticed and downloaded them from there, which is an excellent demo. I will deploy this on my trip in Brazil next week. Yes, I'm spending 2 weeks in Brazil with git-annex users; more on this later.)

Finally, it turns out that "installing" git-annex from the standalone tarball, or DMG, on a server didn't make it usable by the webapp, because git-annex-shell is not in PATH on the server, and indeed git and rsync may not be in PATH either if they were installed with the git-annex bundle. Fixed this by making the bundle install a ~/.ssh/git-annex-wrapper, which the webapp will detect and use.

Also, quite a lot of other bug chasing activity.


Today's work was sponsored by Thomas Koch.

Posted Sun Apr 20 22:53:13 2014

Worked through message backlog today. Got it down from around 70 to just 37. Was able to fix some bugs, including making the webapp start up more robustly in some misconfigurations.

Added a new findref command which may be useful in a git update hook to deny pushes of refs if the annexed content has not been sent first.


BTW, I also added a new reinit command a few days ago, which can be useful if you're cloning back a deleted repository.

Also a few days ago, I made uninit a lot faster.

Posted Thu Apr 17 22:48:44 2014

After fixing a few bugs in the remotecontrol branch, it's landed in master. Try a daily build today, and see if the assistant can keep in sync using nothing more than a remote ssh repository!

So, now all the groundwork for telehash is laid too. I only need a telehash library to start developing on top of. Development on telehash-c is continuing, but I'm more excited that htelehash has been revived and is being updated to the v2 protocol, seemingly quite quickly.

Posted Tue Apr 15 01:26:57 2014

Made ssh connection caching be used in several more places. git annex sync will use it when pushing/pulling to a remote, as will the assistant. And git-annex remotedaemon also uses connection caching. So, when a push lands on a ssh remote, the assistant will immediately notice it, and pull down the change over the same TCP connection used for the notifications.

This was a bit of a pain to do. Had to set GIT_SSH=git-annex and then when git invokes git-annex as ssh, it runs ssh with the connection caching parameters.

Also, improved the network-manager and wicd code, so it detects when a connection has gone down. That propagates through to the remote-daemon, which closes all ssh connections. I need to also find out how to detect network connections/disconnections on OSX..

Otherwise, the remote-control branch seems ready to be merged. But I want to test it for a while first.


Followed up on yesterday's bug with writing some test cases for Utility.Scheduled, which led to some more bug fixes. Luckily nothing I need to rush out a release over. In the end, the code got a lot simpler and clearer.

-- Check if the new Day occurs one month or more past the old Day.
oneMonthPast :: Day -> Day -> Bool
new `oneMonthPast` old = fromGregorian y (m+1) d <= new
  where
        (y,m,d) = toGregorian old
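
One thing to watch with the snippet above: as far as I know, fromGregorian clips an out-of-range month instead of rolling over into the next year, so m+1 for a December date would be clipped back to December. A hedged variant, assuming that edge case matters, using addGregorianMonthsClip from the time library. The name oneMonthPast' and the example dates are made up for this sketch.

```haskell
import Data.Time.Calendar (Day, addGregorianMonthsClip, fromGregorian)

-- Check if the new Day occurs one month or more past the old Day.
-- addGregorianMonthsClip rolls December over into January of the next
-- year and clips the day of month to the length of the target month.
oneMonthPast' :: Day -> Day -> Bool
new `oneMonthPast'` old = addGregorianMonthsClip 1 old <= new

-- Example dates for trying the function in GHCi.
dec11, jan10, jan11 :: Day
dec11 = fromGregorian 2014 12 11
jan10 = fromGregorian 2015 1 10
jan11 = fromGregorian 2015 1 11
```

Here jan11 `oneMonthPast'` dec11 is True, while jan10 `oneMonthPast'` dec11 and dec11 `oneMonthPast'` dec11 are both False.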

Today's work was sponsored by Asbjørn Sloth Tønnesen.

Posted Sat Apr 12 22:45:29 2014

Pushed out a new release today, fixing two important bugs, followed by a second release which fixed the bugs harder.

Automatic upgrading was broken on OSX. The webapp will tell you upgrading failed, and you'll need to manually download the .dmg and install it.

With help from Maximiliano Curia, finally tracked down a bug I have been chasing for a while where the assistant would start using a lot of CPU while not seeming to be busy doing anything. Turned out to be triggered by a scheduled fsck that was configured to run once a month with no particular day specified.

That bug turned out to affect users who first scheduled such a fsck job after the 11th day of the month. So I expedited putting a release out to avoid anyone else running into it starting tomorrow.

(Oddly, the 11th day of this month also happens to be my birthday. I did not expect to have to cut 2 releases today..)

Posted Fri Apr 11 23:02:46 2014

The git-remote-daemon now robustly handles loss of signal, with reconnection backoffs. And it detects if the remote ssh server has too old a version of git-annex-shell and the webapp will display a warning message.
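
A sketch of what a reconnection backoff schedule can look like, assuming the delay doubles after each failed attempt up to a cap. That doubling policy is an assumption for illustration only, and backoffDelay and backoffSchedule are hypothetical names, not the daemon's actual code.

```haskell
-- Delay, in seconds, to wait before the nth reconnection attempt
-- (n = 0, 1, ...). Doubles each time, starting at `base`, and never
-- exceeds `cap`. The exponent is limited to 30 to keep the power small;
-- a very large `base` could still overflow Int.
backoffDelay :: Int -> Int -> Int -> Int
backoffDelay base cap n = min cap (base * 2 ^ min n 30)

-- The whole schedule as a lazy infinite list.
backoffSchedule :: Int -> Int -> [Int]
backoffSchedule base cap = map (backoffDelay base cap) [0 ..]
```

For example, take 6 (backoffSchedule 1 20) is [1,2,4,8,16,20]. A successful connection would restart the schedule from the beginning.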

Also, made the webapp show a network signal bars icon next to both ssh and xmpp remotes that it's currently connected with. And, updated the webapp's nudging to set up XMPP to now suggest either an XMPP or a ssh remote.

I think that the remotecontrol branch is nearly ready for merging!

Today's work was sponsored by Paul Tagliamonte.

Posted Wed Apr 9 20:34:46 2014

git-remote-daemon is tied into the assistant, and working! Since it's not really ready yet, this is in the remotecontrol branch.

My test case for this is two client repositories, both running the assistant. Both have a bare git repository, accessed over ssh, set up as their only remote, and no other way to keep in touch with one-another. When I change a file in one repository, the other one instantly notices the change and syncs.

This is gonna be awesome. Much less need for XMPP. Windows will be fully usable even without XMPP. Also, most of the work I did today will be fully reused when the telehash backend gets built. The telehash-c developer is making noises about it being almost ready for use, too!

Today's work was sponsored by Frédéric Schütz.

Posted Tue Apr 8 22:27:21 2014

Various bug triage today. Was not good for much after shuffling paper for the whole first part of the day, but did get a few little things done.

Re http://heartbleed.com/, git-annex does not use OpenSSL itself, but when using XMPP, the remote server's key could have been intercepted using this new technique. Also, the git-annex autobuilds and this website are served over https -- working on generating new https certificates now. Be safe out there..

Posted Mon Apr 7 23:16:10 2014

Built git-annex remotedaemon command today. It's buggy, but it already works! If you have a new enough git-annex-shell on a remote server, you can run "git annex remotedaemon" in a git-annex repository, and it will notice any pushes that get made to that remote from any other clone, and pull down the changes.

Posted Sun Apr 6 23:16:31 2014

Added git-annex-shell notifychanges command, which uses inotify (etc) to detect when git refs have changed, and informs the caller about the changes. This was relatively easy to write; I reused the existing inotify code, and factored out code for simple line-based protocols from the external special remote protocol. Also implemented the git-remote-daemon protocol. 200 lines of code total.
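
To give a flavour of what "simple line-based protocol" means here, a minimal sketch of parsing and formatting one message per line. The message names REFCHANGED and PING are invented for this example and are not the real git-remote-daemon or notifychanges messages.

```haskell
import Data.Maybe (mapMaybe)

-- Hypothetical messages; the names are made up for this sketch.
data Message
  = RefChanged String -- sent as "REFCHANGED <ref>"
  | Ping -- sent as "PING"
  deriving (Eq, Show)

-- Parse a single line; anything unrecognised is rejected.
parseMessage :: String -> Maybe Message
parseMessage l = case words l of
  ["REFCHANGED", r] -> Just (RefChanged r)
  ["PING"] -> Just Ping
  _ -> Nothing

-- Turn a message back into its one-line form.
formatMessage :: Message -> String
formatMessage (RefChanged r) = "REFCHANGED " ++ r
formatMessage Ping = "PING"

-- Parse a whole stream, one message per line, skipping lines that
-- do not parse.
parseStream :: String -> [Message]
parseStream = mapMaybe parseMessage . lines
```

Keeping parsing and formatting as a matched pair of small functions is what makes this sort of code easy to share between protocols.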

Meanwhile, Johan Kiviniemi improved the dbus notifications, making them work on Ubuntu and adding icons. Awesome!

There's going to be some fun to get git-annex-shell upgraded so that the assistant can use this new notify feature. While I have not started working on the assistant side of this, you can get a jump by installing today's upcoming release of git-annex. I had to push this out early because there was a bug that prevented the webapp from running on non-gnome systems. Since all changes in this release only affected Linux, today's release will be a Linux-only release.

Posted Sat Apr 5 20:58:08 2014

I have a plan for this month. While waiting for telehash, I am going to build git-remote-daemon, which is the infrastructure git-annex will need, to use telehash. Since it's generalized to support other protocols, I'll be able to start using it before telehash is ready.

In fact, I plan to first make it work with ssh:// remotes, where it will talk with git-annex-shell on the remote server. This will let the assistant immediately know when the server has received a commit, and that will simplify using the assistant with a ssh server -- no more need for XMPP in this case! It should also work with git-remote-gcrypt encrypted repositories, so also covers the case of an untrusted ssh server where everything is end-to-end encrypted.

Building the git-annex-shell part of this should be pretty easy, and building enough of the git-remote-daemon design to support it also not hard.

Posted Thu Apr 3 23:04:11 2014

Got caught up on all recent bugs and questions, although I still have a backlog of 27 older things that I really should find time for.

Fixed a couple of bugs. One was that the assistant set up ssh authorized_keys that didn't work with the fish shell.

Also got caught up on the current state of telehash-c. Have not quite gotten it to work, but it seems pretty close to being able to see it do something useful for the first time.

Pushing out a release this evening with a good number of changes left over from March.

Posted Wed Apr 2 21:14:45 2014

Last week's trip was productive, but I came home more tired than I realized. Found myself being snappy & stressed, so I have been on break.

I did do a little git-annex dev in the past 5 days. On Saturday I implemented preferred content (although without the active checks I think it probably ought to have.) Yesterday I had a long conversation with the Tahoe developers about improving git-annex's tahoe integration.

Today, I have been wrapping up building propellor. To test its docker support, I used propellor to build and deploy a container that is a git-annex autobuilder. I'll be replacing the old autobuilder setup with this shortly, and expect to also publish docker images for git-annex autobuilders, so anyone who wants to can run their own autobuilder really easily.


I have April penciled in on the roadmap as the month to do telehash. I don't know if telehash-c is ready for me yet, but it has had a lot of activity lately, so this schedule may still work out!

Posted Wed Apr 2 01:17:39 2014

Catching up on conference backlog. 36 messages backlog remains.

Fixed git-annex-shell configlist to automatically initialize a git remote when a git-annex branch had been pushed to it. This is necessary for gitolite to be easy to use, and I'm sure it used to work.

Updated the Debian backport and made a Debian package of the fdo-notify haskell library used for notifications.

Applied a patch from Alberto Berti to fix support for tahoe-lafs 1.10.

And various other bug fixes and small improvements.

Posted Wed Mar 26 21:04:47 2014

Attended the f-droid sprint at LibrePlanet, and have been getting a handle on how their build server works with an eye toward adding git-annex to it. Not entirely successful getting vagrant to build an image yet.

Posted Sun Mar 23 22:17:55 2014

Yesterday coded up one nice improvement on the plane -- git annex unannex (and uninit) is now tons faster. Before it did a git commit after every file processed, now there's just 1 commit at the end. This required using some locking to prevent the pre-commit hook from running in a confusing state.

Today. LibrePlanet and a surprising amount of development. I've added file manager integration, only for Nautilus so far. The main part of this was adding --notify-start and --notify-finish, which use dbus desktop notifications to provide feedback.

(Made possible thanks to Max Rabkin for updating fdo-notify to use the new dbus library, and ion for developing the initial Nautilus integration scripts.)

Today's work and LibrePlanet visit was sponsored by Jürgen Lüters.

Posted Sat Mar 22 20:21:46 2014

Yesterday, worked on cleaning up the todo list. Fixed Windows slash problem with rsync remotes. Today, more Windows work; it turns out to have been quite buggy in its handling of non-ASCII characters in filenames. Encoding stuff is never easy for me, but I eventually managed to find a way to fix that, although I think there are other filename encoding problems lurking in git-annex on Windows still to be dealt with.

Implemented an interesting metadata feature yesterday. It turns out that metadata can have metadata. Particularly, it can be useful to know when a field was last set. That was already being tracked internally (to make union merging work), so I was able to quite cheaply expose it as "$field-lastchanged" metadata that can be used like any other metadata.
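
A rough sketch of how a derived "-lastchanged" field can be answered from a timestamp that is already stored, assuming each field keeps a single value and an integer timestamp. The types, the stripSuffix helper, and the use of show to render the timestamp are all assumptions of this example, not the actual git-annex representation.

```haskell
import Data.List (stripPrefix)
import qualified Data.Map as M

type Field = String
type Timestamp = Integer

-- Each field stores its value together with the time it was last set.
type MetaData = M.Map Field (String, Timestamp)

-- Remove a suffix from a string, if it is there.
stripSuffix :: String -> String -> Maybe String
stripSuffix suf s = reverse <$> stripPrefix (reverse suf) (reverse s)

-- Look a field up. A name ending in "-lastchanged" is answered from the
-- timestamp of the underlying field instead of being stored separately.
lookupField :: Field -> MetaData -> Maybe String
lookupField f m = case stripSuffix "-lastchanged" f of
  Just base -> show . snd <$> M.lookup base m
  Nothing -> fst <$> M.lookup f m
```

Nothing extra is written to storage in this scheme; the derived field exists only at lookup time.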

I've been thinking about how to implement required content expressions, and think I have a reasonably good handle on it.

Posted Wed Mar 19 20:56:12 2014

The website broke and I spent several hours fixing it, changing the configuration to not let it break like this again, cleaning up after it, etc.

Did manage to make a few minor bugfixes and improvements, but nothing stunning.


I'll be attending LibrePlanet at MIT this weekend.

Posted Mon Mar 17 23:25:10 2014

Added some power and convenience to preferred content expressions.

Before, "standard" was a special case. Now it's a first-class keyword, so you can do things like "standard or present" to use the standard preferred content expression, modified to also want any file that happens to be present.

Also added a way to write your own reusable preferred content expressions, tied to groups. To make a repository use them, set its preferred content to "groupwanted". Of course, "groupwanted" is also a first-class keyword, so "not groupwanted" or something can also be done.

While I was at it, I made vicfg show the built-in standard preferred content expressions, for reference. This little IDE should be pretty self-explanatory, I hope.

So, preferred content is almost its own little programming language now. Except I was careful to not allow recursion. ;)
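
A toy model of keywords that expand to other expressions without allowing recursion. This assumes a much smaller language than the real one, and the one-level expansion rule is just one way to guarantee termination, not necessarily how git-annex does it. Expr, Env and eval are names invented for the sketch.

```haskell
-- A tiny expression language, loosely modelled on the keywords above.
data Expr
  = Present -- the file's content is already in this repository
  | Standard -- expands to the group's built-in expression
  | GroupWanted -- expands to the user-written expression for the group
  | Not Expr
  | And Expr Expr
  | Or Expr Expr

-- What the evaluator needs to know about one file in one repository.
data Env = Env
  { isPresent :: Bool
  , standardExpr :: Expr -- hypothetical: built-in expression for the group
  , groupWantedExpr :: Expr -- hypothetical: user-written expression
  }

-- Keywords are expanded one level only: an expansion that itself
-- mentions Standard or GroupWanted is rejected (Nothing) rather than
-- followed, so evaluation always terminates.
eval :: Env -> Expr -> Maybe Bool
eval env = go True
  where
    go _ Present = Just (isPresent env)
    go True Standard = go False (standardExpr env)
    go True GroupWanted = go False (groupWantedExpr env)
    go False Standard = Nothing
    go False GroupWanted = Nothing
    go k (Not e) = not <$> go k e
    go k (And a b) = (&&) <$> go k a <*> go k b
    go k (Or a b) = (||) <$> go k a <*> go k b
```

In this model "standard or present" is Or Standard Present, and "not groupwanted" is Not GroupWanted.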

Posted Sat Mar 15 21:46:46 2014

Did some more exploration and perf tuning and thinking on caching databases, and am pretty sure I know how I want to implement it. Will be several stages, starting with using it for generating views, and ending(?) with using it for direct mode file mappings.

Not sure I'm ready to dive into that yet, so instead spent the rest of the day working on small bugfixes and improvements. Only two significant ones..

Made the webapp use a constant time string comparison (from securemem) to check if its auth token is valid. This could help avoid a potential timing attack to guess the auth token, although that is theoretical. Just best practice to do this.
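
For illustration, the general shape of a constant time comparison, assuming plain Strings. This is not the securemem implementation, and GHC makes no timing guarantees about code like this, so real code should keep using a vetted library. The name constantTimeEq is made up here.

```haskell
import Data.Bits (xor, (.|.))
import Data.Char (ord)
import Data.List (foldl')

-- Compare two strings without stopping at the first mismatch.
-- Every pair of characters is xor-ed and the differences are or-ed
-- together, so the work done depends on the length of the strings,
-- not on where they differ. Strings of different length never match.
constantTimeEq :: String -> String -> Bool
constantTimeEq a b =
  length a == length b
    && foldl' (.|.) 0 (zipWith (\x y -> ord x `xor` ord y) a b) == 0
```

An ordinary == can return as soon as it sees the first differing character, which is what could let an attacker guess a token one character at a time.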

Seems that openssh 6.5p1 had another hidden surprise (in addition to its now-fixed bug in handling hostnames in .ssh/config) -- it broke the method git-annex was using for stopping a cached ssh connection, which led to some timeouts for failing DNS lookups. If git-annex seems to stall for a few seconds at startup/shutdown, that may be why (--debug will say for sure). I seem to have found a workaround that avoids this problem.

Posted Thu Mar 13 23:45:48 2014

Updated the Debian stable backport to the last release. Also it seems that the last release unexpectedly fixed XMPP SIGILL on some OSX machines. Apparently when I rebuilt all the libraries recently, it somehow fixed that old unsolved bug.

RichiH suggested "wrt ballooning memory on repair: can you read in broken stuff and simply stop reading once you reach a certain threshold, then start repairing, re-run fsck, etc?" .. I had considered that but was not sure it would work. I think I've gotten it to work.
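
The suggestion above can be sketched as a loop, assuming pure stand-ins for fsck and for the repair step. This is not git-repair's code; repairInBatches and its parameters are hypothetical.

```haskell
-- Repair in bounded rounds: look at no more than `limit` broken objects,
-- repair those, then check the store again, until nothing broken is left.
-- Returns the number of rounds and the final store. The loop does not
-- terminate if a repair round makes no progress.
repairInBatches
  :: Int -- limit on objects handled per round
  -> (store -> [obj]) -- hypothetical stand-in for running fsck
  -> ([obj] -> store -> store) -- hypothetical stand-in for the repair step
  -> store
  -> (Int, store)
repairInBatches limit findBroken repair = go 0
  where
    go rounds s = case take limit (findBroken s) of
      [] -> (rounds, s)
      batch -> go (rounds + 1) (repair batch s)
```

The memory bound only holds if the list of broken objects is produced lazily, so that take can stop reading once the limit is reached.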

Now working on a design for using a caching database for some parts of git-annex. My initial benchmarks using SQLite indicate it would slow down associated file lookups by nearly an order of magnitude compared with the current ".map files" implementation. (But would scale better in edge cases). OTOH, using a SQLite database to index metadata for use in views looks very promising.

Posted Wed Mar 12 22:20:32 2014

Squashed three or four more bugs today. Unanswered message backlog is down to 27.

The most interesting problem today is that the git-repair code was using too much memory when git-fsck output a lot of problems (300 thousand!). I managed to halve the memory use in the worst case (and reduced it much more in more likely cases). But, don't really feel I can close that bug yet, since really big, really badly broken repositories can still run it out of memory. It would be good to find a way to reorganize the code so that the broken objects list streams through git-repair and never has to all be buffered in memory at once. But this is not easy.

Posted Mon Mar 10 21:41:18 2014

Release made yesterday, but only finished up the armel build today. And it turns out the OSX build was missing the webapp, so it's also been updated today.

Post release bug triage including:

Added a nice piece of UI to the webapp on user request: A "Sync now" menu item in the repository for each repo. (The one for the current repo syncs with all its remotes.)

Copying files to a git repository on the same computer turns out to have had a resource leak issue, that caused 1 zombie process per file. With some tricky monad state caching, fixed that, and also eliminated 8% of the work done by git-annex in this case.

Fixed git annex unused in direct mode to not think that files that were deleted out of the work tree by the user still existed and were unused.

Posted Fri Mar 7 20:27:58 2014

Preparing for a release (probably tomorrow or Friday).

Part of that was updating the autobuilders. Had to deal with the gnutls security hole fix, and upgrading that on the OSX autobuilder turned out to be quite complicated due to library version skew. Also, I switched the linux autobuilders over to building from Debian unstable, rather than stable. That should be ok to do now that the standalone build bundles all the libraries it needs... And the arm build has always used unstable, and has been reported working on a lot of systems. So I think this will be safe, but have backed up the old autobuilder chroots just in case.

Also been catching up on bug reports and traffic and dealt with quite a lot of things today. Smarter log file rotation for the assistant, better webapp behavior when git is not installed, and a fix for the webdav 5 second timeout problem.

Perhaps the most interesting change is a new annex.startupscan setting, which can be disabled to prevent the assistant from doing the expensive startup scan. This means it misses noticing any files that changed since it last ran, but this should be useful for those really big repositories.

(Last night, did more work on the test suite, including even more checking of merge conflict resolution.)


Today's work was sponsored by Michael Alan Dorman.

Posted Wed Mar 5 22:45:22 2014

Yesterday I learned of a nasty bug in handling of merges in direct mode. It turns out that if the remote repository has added a file, and there is a conflicting file in the local work tree, which has not been added to git, the local file was overwritten when git-annex did a merge. That's really bad, I'm very unhappy this bug lurked undetected for so long.

Understanding the bug was easy. Fixing it turned out to be hard, because the automatic merge conflict resolution code was quite a mess. In particular, it wrote files to the work tree, which made it difficult for a later stage to detect and handle the abovementioned case. Also, the automatic merge resolution code had weird asymmetric structure that I never fully understood, and generally needed to be stared at for an hour to begin to understand it.

In the process of cleaning that up, I wrote several more tests, to ensure that every case was handled correctly. Coverage was about 50% of the cases, and should now be 100%.

To add to the fun, a while ago I had dealt with a bug on FAT/Windows where it sometimes lost the symlink bit during automatic merge resolution. Except it turned out my test case for it had a heisenbug, and I had not actually fixed it (I think). In any case, my old fix for it was a large part of the ugliness I was cleaning up, and had to be rewritten. Fully tracking down and dealing with that took a large part of today.

Finally this evening, I added support for automatically handling merge conflicts where one side is an annexed file, and the other side has the same filename committed to git in the normal way. This is not an important case, but it's worth it for completeness. There was an unexpected benefit to doing it; it turned out that the weird asymmetric part of the code went away.

The final core of the automatic merge conflict resolver has morphed from a mess I'd not want to paste here to a quite concise and easy to follow bit of code.

        case (kus, kthem) of
                -- Both sides of conflict are annexed files
                (Just keyUs, Just keyThem) -> resolveby $
                        if keyUs == keyThem
                                then makelink keyUs
                                else do
                                        makelink keyUs
                                        makelink keyThem
                -- Our side is annexed file, other side is not.
                (Just keyUs, Nothing) -> resolveby $ do
                        graftin them file
                        makelink keyUs
                -- Our side is not annexed file, other side is.
                (Nothing, Just keyThem) -> resolveby $ do
                        graftin us file
                        makelink keyThem
                -- Neither side is annexed file; cannot resolve.
                (Nothing, Nothing) -> return Nothing

Since the bug that started all this is so bad, I want to make a release pretty soon.. But I will probably let it soak and whale on the test suite a bit more first. (The fix is also probably worth backporting to old versions of git-annex in eg Debian stable.)

Posted Wed Mar 5 00:06:55 2014

Worked on metadata and views. Besides bugfixes, two features of note:

Made git-annex run a hook script, pre-commit-annex. And I wrote a sample script that extracts metadata from lots of kinds of files, including photos and sound files, using extract(1) to do the heavy lifting. See automatically adding metadata.

Views can be filtered to not include a tag or a field. For example, git annex view tag=* !old year!=2013

Today's work was sponsored by Stephan Schulz

Posted Mon Mar 3 00:22:04 2014

Did not plan to work on git-annex today..

Unexpectedly ended up making the webapp support HTTPS. Not by default, but if a key and certificate are provided, it'll use them. Great for using the webapp remotely! See the new tip: remote webapp setup.

Also removed support for --listen with a port, which was buggy and not necessary with HTTPS.

Also fixed several webapp/assistant bugs, including one that let it be run in a bare git repository.

And, made the quvi version be probed at runtime, rather than compile time.

Posted Sat Mar 1 03:07:54 2014

Pushed a release today. Rest of day spent beating head against Windows XMPP brick wall.

Actually made a lot of progress -- Finally found the right approach, and got a clean build of the XMPP haskell libraries. But.. ghc fails to load the libraries when running Template Haskell. "Misaligned section: 18206e5b". Filed a bug report, and I'm sure this alignment problem can be fixed, but I'm not hopeful about fixing it myself.

One workaround would be to use the EvilSplicer, building once without the XMPP library linked in, to get the TH splices expanded, and then a second time with the XMPP library and no TH. Made a winsplicehack branch with tons of ifdefs that allows doing this. However, several dozen haskell libraries would need to be patched to get it to work. I have the patches from Android, but would rather avoid doing all that again on Windows.

Another workaround would be to move XMPP into a separate process from the webapp. This is not very appealing either, the IPC between them would be fairly complicated since the webapp does stuff like show lists of XMPP buddies, etc. But, one thing this idea has to recommend it is I am already considering using a separate helper daemon like this for Telehash.

So there could be synergies between XMPP and Telehash support, possibly leading to some kind of plugin interface in git-annex for this sort of thing. But then, once Telehash or something like it is available and working well, I plan to deprecate XMPP entirely. It's been a flakey pain from the start, so that can't come too soon.

Posted Thu Feb 27 21:43:59 2014

Not a lot accomplished today. Some release prep, followed up to a few bug reports.

Split git-annex's .git/annex/tmp into two directories. .git/annex/tmp will now be used only for partially transferred objects, while .git/annex/misctmp will be used for everything else. In particular this allows symlinking .git/annex/tmp to a ram disk, if you want to do that. (It's not possible for .git/annex/misctmp to be on a different filesystem from the rest of the repository for various reasons.)

Beat on Windows XMPP for several more painful hours. Got all the haskell bindings installed, except for gnuidn. And patched network-client-xmpp to build without gnuidn. Have not managed to get it to link.

Posted Thu Feb 27 01:11:08 2014

More Windows porting. Made the build completely -Wall safe on Windows. Fixed some DOS path separator bugs that were preventing WebDav from working. Have now tested both box.com and Amazon S3 to be completely working in the webapp on Windows.

Posted Tue Feb 25 21:54:58 2014

Turns out that in the last release I broke making box.com, Amazon S3 and Glacier remotes from the webapp. Fixed that.

Also, dealt with changes in the haskell DAV library that broke support for box.com, and worked around an exception handling bug in the library.

I think I should try to enhance the test suite so it can run live tests on special remotes, which would at least have caught some of these recent problems...


Since metadata is tied to a particular key, editing an annexed file, which causes the key to change, made the metadata seem to get lost.

I've now fixed this; it copies the metadata from the old version to the new one. (Taking care to copy the log file identically, so git can reuse its blob.)

That meant that git annex add has to check every file it adds to see if there's an old version. Happily, that check is fairly fast; I benchmarked my laptop running 2500 such checks a second. So it's not going to slow things down appreciably.

Posted Mon Feb 24 23:36:58 2014

When generating a view, there's now a way to reuse part of the directory hierarchy of the parent branch. For example, git annex view tag=* podcasts/=* makes a view where the first level is the tags, and the second level is whatever podcasts/* directories the files were in.

Also, year and month metadata can be automatically recorded when adding files to the annex. I made this only be done when annex.genmetadata is turned on, to avoid polluting repositories that don't want to use metadata.

It would be nice if there was a way to add a hook script that's run when files are added, to collect their metadata. I am not sure yet if I am going to add that to git-annex though. It's already possible to do via the regular git post-commit hook. Just make it look at the commit to see what files were added, and then run git annex metadata to set their metadata appropriately. It would be good to at least have an example of such a script to eg, extract EXIF or ID3 metadata. Perhaps someone can contribute one?

Posted Sun Feb 23 04:25:46 2014

Spent the day catching up on the last week or so's traffic. Ended up making numerous small bug fixes and improvements. Message backlog stands at 44.

Here's the screencast demoing views!

Added to the design today the idea of automatically deriving metadata from the location of files in the master branch's directory tree. Eg, git annex view tag=* podcasts/=* in a repository that has a podcasts/ directory would make a tree like "$tag/$podcast". Seems promising.

So much still to do with views.. I have belatedly added them to the roadmap for this month; doing Windows and Android in the same month was too much to expect.

Posted Thu Feb 20 20:39:57 2014

Still working on views. The most important addition today is that git annex precommit notices when files have been moved/copied/deleted in a view, and updates the metadata to reflect the changes.

Also wrote some walkthrough documentation: metadata driven views.
And, recorded a screencast demoing views, which I will upload next time I have bandwidth.

Posted Thu Feb 20 01:56:42 2014

Today I built git annex view, and git annex vadd and a few related commands. A quick demo:

joey@darkstar:~/lib/talks>ls
Chaos_Communication_Congress/  FOSDEM/       Linux_Conference_Australia/
Debian/                        LibrePlanet/  README.md
joey@darkstar:~/lib/talks>git annex view tag=*
view  (searching...)
Switched to branch 'views/_'
ok
joey@darkstar:~/lib/talks#_>tree -d
.
|-- Debian
|-- android
|-- bigpicture
|-- debhelper
|-- git
|-- git-annex
`-- seen

7 directories
joey@darkstar:~/lib/talks#_>git annex vadd author=*
vadd  
Switched to branch 'views/author=_;_'
ok
joey@darkstar:~/lib/talks#author=_;_>tree -d
.
|-- Benjamin Mako Hill
|   `-- bigpicture
|-- Denis Carikli
|   `-- android
|-- Joey Hess
|   |-- Debian
|   |-- bigpicture
|   |-- debhelper
|   |-- git
|   `-- git-annex
|-- Richard Hartmann
|   |-- git
|   `-- git-annex
`-- Stefano Zacchiroli
    `-- Debian

15 directories
joey@darkstar:~/lib/talks#author=_;_>git annex vpop
vpop 1
Switched to branch 'views/_'
ok
joey@darkstar:~/lib/talks#_>git annex vadd tag=git-annex
vadd  
Switched to branch 'views/(git-annex)'
ok
joey@darkstar:~/lib/talks#(git-annex)>ls
1025_gitify_your_life_{Debian;2013;DebConf13;high}.ogv@
git_annex___manage_files_with_git__without_checking_their_contents_into_git_{FOSDEM;2012;lightningtalks}.webm@
mirror.linux.org.au_linux.conf.au_2013_mp4_gitannex_{Linux_Conference_Australia;2013}.mp4@
joey@darkstar:~/lib/talks#_>git annex vpop 2
vpop 2
Switched to branch 'master'
ok

Not 100% happy with the speed -- the generation of the view branch is close to optimal, and fast enough (unless the branch has very many matching files). And vadd can be quite fast if the view has already limited the total number of files to a smallish amount. But view has to look at every file's metadata, and this can take a while in a large repository. Needs indexes.

It also needs integration with git annex sync, so the view branches update when files are added to the master branch, and moving files around inside a view and committing them does not yet update their metadata.


Today's work was sponsored by Daniel Atlas.

Posted Wed Feb 19 01:58:19 2014

Working on building metadata filtered branches.

Spent most of the day on types and pure code. Finally at the end I wrote down two actions that I still need to implement to make it all work:

applyView' :: MkFileView -> View -> Annex Git.Branch
updateView :: View -> Git.Ref -> Git.Ref -> Annex Git.Branch

I know how to implement these, more or less. And in most cases they will be pretty fast.

The more interesting part is already done. That was the issue of how to generate filenames in the filter branches. That depends on the View being used to filter and organize the branch, but also on the original filename used in the reference branch. Each filter branch has a reference branch (such as "master"), and displays a filtered and metadata-driven reorganized tree of files from its reference branch.

fileViews :: View -> (FilePath -> FileView) -> FilePath -> MetaData -> Maybe [FileView]

So, a view that matches files tagged "haskell" or "git-annex" and with an author of "J*" will generate filenames like "haskell/Joachim/interesting_theoretical_talk.ogg" and "git-annex/Joey/mytalk.ogg".

It can also work backwards from these filenames to derive the MetaData that is encoded in them.

fromView :: View -> FileView -> MetaData

So, copying a file to "haskell/Joey/mytalk.ogg" lets it know that it's gained a "haskell" tag. I knew I was on the right track when fromView turned out to be only 6 lines of code!
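The two directions can be sketched roughly as follows. This is not the real fileViews or fromView; Field, Value, viewPaths and metaFromPath are simplified stand-ins invented for illustration, and a view is assumed to be just an ordered list of fields, one per directory level.

```haskell
import System.FilePath.Posix (joinPath, splitDirectories, takeDirectory)

-- Simplified stand-ins for the real types (assumptions for illustration).
type Field = String
type Value = String

-- | One path per combination of values: each view level contributes one
-- directory, so a file with two tags and one author gets two paths.
viewPaths :: [(Field, [Value])] -> FilePath -> [FilePath]
viewPaths levels name = [ joinPath (vs ++ [name]) | vs <- mapM snd levels ]

-- | Work backwards from a path inside the view to the metadata that the
-- directory levels encode.
metaFromPath :: [Field] -> FilePath -> [(Field, Value)]
metaFromPath fields path = zip fields dirs
  where
    -- takeDirectory yields "." for a file with no directory part.
    dirs = filter (/= ".") (splitDirectories (takeDirectory path))
```

With the levels tag and author, a file tagged "haskell" by "Joey" lands at haskell/Joey/mytalk.ogg, and metaFromPath reads the same tag and author back out of that path.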

The trickiest part of all this, which I spent most of yesterday thinking about, is what to do if the master branch has files in subdirectories. It probably does not make sense to retain that hierarchical directory structure in the filtered branch, because we instead have a non-hierarchical metadata structure to express. (And there would probably be a lot of deep directory structures containing only one file.) But throwing away the subdirectory information entirely means that two files with the same basename and same metadata would have colliding names.

I eventually decided to embed the subdirectory information into the filenames used on the filter branch. Currently that is done by converting dir/subdir/file.foo to file(dir)(subdir).foo. We'll see how this works out in practice..
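A minimal sketch of that conversion, assuming unix-style paths. The name embedDirs is made up for this example and is not the function git-annex uses.

```haskell
import System.FilePath.Posix
  (splitDirectories, splitExtension, takeDirectory, takeFileName)

-- | Hypothetical helper: fold the directory components of a path into
-- the file name, so dir/subdir/file.foo becomes file(dir)(subdir).foo.
embedDirs :: FilePath -> FilePath
embedDirs path = base ++ concatMap wrap dirs ++ ext
  where
    (base, ext) = splitExtension (takeFileName path)
    -- takeDirectory yields "." for a file with no directory part.
    dirs = filter (/= ".") (splitDirectories (takeDirectory path))
    wrap d = "(" ++ d ++ ")"
```

A file at the top of the tree has no directories to embed, so its name comes through unchanged.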

Posted Mon Feb 17 02:44:28 2014

More Windows porting.. Seem to be getting near an end of the easy stuff, and also the webapp is getting pretty usable on Windows now, the only really important thing lacking is XMPP support.

Made git-annex on Windows set HOME when it's not already set. Several of the bundled cygwin tools only look at HOME. This was made a lot harder and uglier due to there not being any way to modify the environment of the running process.. git-annex has to re-run itself with the fixed environment.

Got rsync.net working in the webapp. Although with an extra rsync.net password prompt on Windows, which I cannot find a way to avoid.

While testing that, I discovered that openssh 6.5p1 has broken support for ~/.ssh/config Host lines that contain upper case letters! I have filed a bug about this and put a quick fix in git-annex, which sometimes generated such lines.

Posted Fri Feb 14 21:02:52 2014

Windows porting all day. Fixed a lot of issues with the webapp, so quite productive. Except for the 2 hours wasted finding a way to kill a process by PID from Haskell on Windows.

Last night, made git annex metadata able to set metadata on a whole directory or list of files if desired. And added a --metadata field=value switch (and corresponding preferred content terminal) which limits git-annex to acting on files with the specified metadata.

Posted Thu Feb 13 21:37:49 2014

Built the core data types, and log for metadata storage. Making metadata union merge well is tricky, but I have a design I'm happy with, that will allow distributed changes to metadata.

Finished up the day with a git annex metadata command to get/set metadata for a file.

This is all the groundwork needed to begin experimenting with generating git branches that display different metadata-driven views of annexed files.

Posted Thu Feb 13 03:24:04 2014

There's a new design document for letting git-annex store arbitrary metadata. The really neat thing about this is the user can check out only files matching the tags or values they care about, and get an automatically structured file tree layout that can be dynamically filtered. It's going to be awesome! See metadata.

In the meantime, spent most of today working on Windows. Very good progress, possibly motivated by wanting to get it over with so I can spend some time this month on the above. ;)

  • webapp can make box.com and S3 remotes. This just involved fixing a hack where the webapp set environment variables to communicate creds to initremote. Can't change environment on Windows (or I don't know how to).
  • webapp can make repos on removable drives.
  • git annex assistant --stop works, although this is not likely to really be useful
  • The source tree now has 0 func = error "Windows TODO" type stubbed out functions to trip over.

Posted Tue Feb 11 20:21:42 2014

Pushed out the new release. This is the first one where I consider the git-annex command line beta quality on Windows.

Did some testing of the webapp on Windows, trying out every part of the UI. I now have eleven todo items involving the webapp listed in windows support. Most of them don't look too bad to fix.

Posted Mon Feb 10 22:48:12 2014

Last night I tracked down and fixed a bug in the DAV library that has been affecting WebDAV remotes. I've been deploying the fix for that today, including to the android and arm autobuilders. While I finished a clean reinstall of the android autobuilder, I ran into problems getting a clean reinstall of the arm autobuilder (some type mismatch error building yesod-core), so manually fixed its DAV for now.

The WebDAV fix and other recent fixes makes me want to make a release soon, probably Monday.

ObWindows: Fixed git-annex to not crash when run on Windows in a git repository that has a remote with a unix-style path like "/foo/bar". Seems that not everything agrees on whether such a path is absolute; even sometimes different parts of the same library disagree!

import System.FilePath.Windows

prop_windows_is_sane :: Bool
prop_windows_is_sane = isAbsolute upath || ("C:\\STUFF" </> upath /= upath)
  where upath = "/foo/bar"

Perhaps more interestingly, I've been helping dxtrish port git-annex to OpenBSD and it seems most of the way there.

Posted Sat Feb 8 21:27:35 2014

git-annex has been using MissingH's absNormPath forever, but that's not very maintained and doesn't work on Windows. I've been wanting to get rid of it for some time, and finally did today, writing a simplifyPath that does the things git-annex needs and will work with all the Windows filename craziness, and takes advantage of the more modern System.FilePath to be quite a simple piece of code. A QuickCheck test found no important divergences from absNormPath. A good first step to making git-annex not depend on MissingH at all.

That fixed one last Windows bug that was disabled in the test suite: git annex add ..\subdir\file will now work.
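For a feel of how little code System.FilePath leaves to write, here is a rough sketch of that kind of path simplification. It is an approximation written for this example (the name simplify is made up), not the simplifyPath in git-annex, and it imports the Posix module so it behaves the same everywhere; System.FilePath.Windows exports the same functions.

```haskell
import System.FilePath.Posix
  (dropTrailingPathSeparator, joinDrive, joinPath, splitDrive, splitPath)

-- | Purely textual cleanup: drop "." components and cancel "dir/.."
-- pairs. It never touches the disk, so symlinks are not resolved.
simplify :: FilePath -> FilePath
simplify path =
  dropTrailingPathSeparator (joinDrive drive (joinPath (go [] (splitPath rest))))
  where
    (drive, rest) = splitDrive path
    -- acc holds the components kept so far, most recent first.
    go acc [] = reverse acc
    go acc (p:ps)
      | p' == "." = go acc ps
      -- ".." cancels the previous component, unless that is itself "..".
      | p' == "..", (c:cs) <- acc, dropTrailingPathSeparator c /= ".." = go cs ps
      | otherwise = go (p:acc) ps
      where
        p' = dropTrailingPathSeparator p
```

A leading ".." has nothing to cancel and is kept, which is what a path like ../subdir/file needs.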

I am re-installing the Android autobuilder for 2 reasons: I noticed I had accidentally lost a patch to make a library use the Android SSL cert directory, and also a new version of GHC is very near to release and so it makes sense to update.

Down to 38 messages in the backlog.

Posted Fri Feb 7 22:08:20 2014

Added a new feature that started out with me wanting a way to undo a git-annex drop, but turned into something rather more powerful. The --in option can now be told to match files that were in a repository at some point in the past. For example, git annex get --in=here@{yesterday} will get any files that have been dropped over the past day.

While git-annex's location tracking info is stored in git and thus versioned, very little of it makes use of past versions of the location tracking info (only git annex log). I'm happy to have finally found a use for it!

OB Windows porting: Fixed a bug in the symlink calculation code. Sounds simple; took 2 hours!

Also various bug triage; updated git version on OSX; forwarded bug about DAV-0.6 being broken upstream; fixed a bug with initremote in encryption=pubkey mode. Backlog is 65 messages.


Today's work was sponsored by Brock Spratlen.

Posted Fri Feb 7 01:08:56 2014

A more test driven day than usual. Yesterday I noticed a test case was failing on Windows in a way not related to what it was intended to test, and fixed the test case to not fail.. But knew I'd need to get to the bottom of what broke it eventually.

Digging into that today, I eventually (after rather a long time stuck) determined the bug involved automatic conflict resolution, but only happened on systems without symlink support. This let me reproduce it on FAT outside Windows and do some fast TDD iterations in a much less unwieldy environment and fix the bug.

Posted Tue Feb 4 21:34:34 2014

While I've not been blogging over what amounted to a long weekend, looking over the changelog, there were quite a few things done. Mostly various improvements and fixes to git annex sync --content.

Today, got the test suite to pass on Windows 100% again.

Posted Tue Feb 4 01:26:04 2014

With yesterday's release, I'm pretty much done with the month's work. Since there was no particular goal this month, it's been a grab bag of features and bugfixes. Quite a lot of them in this last release.

I'll be away the next couple of days.. But got a start today on the next part of the roadmap, which is planned to be all about Windows and Android porting. Today, it was all about lock files, mostly on Windows.

Lock files on Windows are horrific. I especially like that programs that want to open a file, for any reason, are encouraged in the official documentation to retry repeatedly if it fails, because some other random program, like a virus checker, might have opened the file first.

Turns out Windows does support a shared file read mode. This was just barely enough for me to implement both shared and exclusive file locking a-la-flock.

Couldn't avoid a busy wait in a few places that block on a lock. Luckily, these are few, and the chances the lock will be taken for a long time is small. (I did think about trying to watch the file for close events and detect when the lock was released that way, but it seemed much too complicated and hard to avoid races.)

Also, Windows only seems to support mandatory locks, while all locking in git-annex needs to be advisory locks. Ie, git-annex's locking shouldn't prevent a program from opening an annexed file! To work around that, I am using dedicated lock files on Windows.

Also switched direct mode's annexed object locking to use dedicated lock files. AFAICS, this was pretty well broken in direct mode before.

Posted Tue Jan 28 20:52:01 2014

Built the UI to manage unused files.

Testing yesterday's work, I found several problems that prevented the assistant from moving unused files around, and fixed them. It seems to be working pretty well now.

Posted Thu Jan 23 20:57:49 2014

A big missing piece of the assistant is doing something about the content of old versions of files, and deleted files. In direct mode, editing or deleting a file necessarily loses its content from the local repository, but the content can still hang around in other repositories. So, the assistant needs to do something about that to avoid eating up disk space unnecessarily.

I built on recent work, that lets preferred content expressions be matched against keys with no associated file. This means that I can run unused keys through all the machinery in the assistant that handles file transfers, and they'll end up being moved to whatever repository wants them. To control which repositories do want to retain unused files, and which not, I added an unused keyword to preferred content expressions. Client repositories and transfer repositories do not want to retain unused files, but backup etc repos do.

One nice thing about this unused preferred content implementation is that it doesn't slow down normal matching of preferred content expressions at all. Can you guess why not? See 4b55afe9e92c045d72b78747021e15e8dfc16416

So, the assistant will run git annex unused on a daily basis, and cause unused files to flow to repositories that want them. But what if no repositories do? To guard against filling up the local disk, there's an annex.expireunused configuration setting, that can cause old unused files to be deleted by the assistant after a number of days.

I made the assistant check if there seem to be a lot of unused files piling up. (1000+, or 10% of disk used by them, or more space taken by unused files than is free.) If so, it'll pop up an alert to nudge the user to configure annex.expireunused.
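Written out as a predicate, the three thresholds look something like this. The record and function names are invented for the example, and "10% of disk" is read here as 10% of the disk's total size, which is an assumption.

```haskell
-- | Hypothetical summary of what a daily unused scan found.
data UnusedStats = UnusedStats
  { unusedCount :: Integer -- ^ number of unused files
  , unusedBytes :: Integer -- ^ space taken by unused files
  , diskBytes   :: Integer -- ^ total size of the disk (assumed meaning)
  , freeBytes   :: Integer -- ^ free space left on the disk
  }

-- | True when any one of the three thresholds is crossed.
lotsOfUnused :: UnusedStats -> Bool
lotsOfUnused s =
     unusedCount s >= 1000
  || unusedBytes s * 10 >= diskBytes s
  || unusedBytes s > freeBytes s
```

Multiplying by 10 rather than dividing the disk size keeps the comparison in exact integer arithmetic.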

Still need to build the UI to configure that, and test all of this.

Today's work was sponsored by Samuel Tardieu.

Posted Thu Jan 23 03:11:20 2014

Worked on cleaning up and reorganizing all the code that handles numcopies settings. Much nicer now. Fixed some bugs.

As expected, making the preferred content numcopies check look at .gitattributes slows it down significantly. So, exposed both the slow and accurate check and a faster version that ignores .gitattributes.

Also worked on the test suite, removing dependencies between tests. This will let tasty-rerun be used later to run only previously failing tests.

Posted Tue Jan 21 23:21:48 2014

In order to remove some hackishness in git annex sync --content, I finally fixed a bad design decision I made back at the very beginning (before I really knew haskell) when I built the command seek code, which had led to a kind of inversion of control. This took most of a night, but it made a lot of code in git-annex clearer, and it makes the command seeking code much more flexible in what it can do. Some of the oldest, and worst code in git-annex was removed in the process.

Also, I've been reworking the numcopies configuration, to allow for a preferred content numcopies check. That will let the assistant, as well as git annex sync --content proactively make copies when needed in order to satisfy numcopies.

As part of this, git config annex.numcopies is deprecated, and there's a new git annex numcopies N command that sets the numcopies value that will be used by any clone of a repository.

I got the preferred content checking of numcopies working too. However, I am unsure if checking for per-file .gitattributes annex.numcopies settings will make preferred content expressions be too slow, so I have left that out for now.

Today's work was sponsored by Josh Taylor.

Posted Mon Jan 20 21:47:17 2014

Spent the day building this new feature, which makes git annex sync --content do the same synchronization of file contents (to satisfy preferred content settings) that the assistant does. The result has not been tested a lot yet, but seems to work well.

Posted Sun Jan 19 22:12:24 2014

Activity has been a bit low again this week. It seems to make sense to do weekly releases currently (rather than bi-monthly), and Thursday's release had only one new feature (Tahoe LAFS) and a bunch of bug fixes.

Looks like git-annex will get back into Debian testing soon, after various fixes to make it build on all architectures again, and then the backport can be updated again too.

I have been struggling with a problem with the OSX builds, which fail with a SIGKILL on some machines. It seems that homebrew likes to aggressively optimise things it builds, and while I have had some success with its --build-bottle option, something in the gnutls stack used for XMPP is still over-optimised. Waiting to hear back from Kevin on cleaning up some optimised system libraries on the OSX host I use. (Is there some way to make a clean chroot on OSX that can be accessed by a non-root user?)

Today I did some minor work involving the --json switch, and also a small change (well, under 300 line diff) allowing --all to be mixed with options like --copies and --in.

Posted Sat Jan 18 21:26:34 2014

Fixed a bug that one or two people had mentioned years ago, but I was never able to reproduce myself or get anyone to reproduce in a useful way. It caused log files that were supposed to be committed to the git-annex branch to end up in master. Turned out to involve weird stuff when the environment contains two different settings for a single variable. So was easily fixed at last. (I'm pretty sure the code would have never had this bug if Data.AssocList was not buried inside an xml library, which rather discourages using it when dealing with the environment.)
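The general shape of that kind of problem can be shown with a plain association list. This is only an illustration, not the code that was fixed; the helper names and the variable "VAR" are made up.

```haskell
type Env = [(String, String)]

-- | Naive update: the old setting stays in the list, so the environment
-- ends up with two settings for one variable.
naiveSet :: String -> String -> Env -> Env
naiveSet k v env = (k, v) : env

-- | Safer update: drop every earlier setting of the variable first.
setUnique :: String -> String -> Env -> Env
setUnique k v env = (k, v) : filter ((/= k) . fst) env

-- | All values present for a variable. More than one means different
-- readers of the environment may see different settings.
settingsOf :: String -> Env -> [String]
settingsOf k env = [ v | (k', v) <- env, k' == k ]
```

With naiveSet both the old and new values survive, and which one wins depends on whether the reader takes the first or the last match.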

Also worked on, and hopefully fixed, another OSX cpu optimisations problem. This one involving shared libraries that git-annex uses for XMPP.

Also made the assistant detect corrupt .git/annex/index files on startup and remove them. It was already able to recover from corrupt .git/index files.

Today's work was sponsored by David Wagner.

Posted Tue Jan 14 21:23:27 2014

If you've been keeping an eye on the roadmap, you'll have seen that xmpp security keeps being pushed back. This was because it's a hard and annoying problem requiring custom crypto and with an ugly key validation problem built into it too. I've now removed it from the roadmap entirely, replacing it with a telehash design.

I'm excited by the possibilities of using telehash with git-annex. It seems it would be quite easy to make it significantly more peer-to-peer and flexible. The only issue is that telehash is still under heavy development and the C implementation is not even usable yet.. (I'll probably end up writing Haskell bindings to that.) So I've pushed it down the roadmap to at least March.

Spent the rest of the day making some minor improvements to external special remote protocol and doing some other minor bug fixes and backlog catch up. My backlog has exploded to nearly 50 messages remaining.


Today's work was sponsored by Chad Horohoe.

Posted Mon Jan 13 22:10:45 2014

Been on reduced activity the past several days. I did spend a full day somewhere in there building the Tahoe LAFS special remote. Also, Tobias has finished updating his full suite of external special remotes to use the new interface!

Worked on closing up the fundraising campaign today (long overdue). This included adding a new wall-o-names to thanks.

Posted Fri Jan 10 19:40:11 2014

Taught the assistant to stop reusing an existing git annex transferkeys process after it detects a network connection change. I don't think this is a complete solution to what to do about long-duration network connections in remotes. For one thing a remote could take a long time to time out when the network is disconnected, and block other transfers (eg to local drives) in the meantime. But at least if a remote loses its network connection and does not try to reconnect on its own, and so is continually failing, this will get it back into a working state eventually.

Also, fixed a problem with the OSX Mavericks build, it seems that the versions of wget and coreutils stuff that I was including in it were built by homebrew with full optimisations turned on, so didn't work on some CPUs. Replaced those with portable builds.

Posted Mon Jan 6 21:16:08 2014

Spent ages tracking down a memory leak in the assistant that showed up when a lot of files were added. Turned out to be a standard haskell laziness induced problem, fixed by adding strictness annotations. Actually there were several of them, that leaked at different rates. Eventually, I seem to have gotten them all fixed:

(Graphs of memory use before and after the fix: leakbefore.png, leakafter.png)
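For anyone who has not hit this class of bug, a generic example of the fix looks like the following. It is not the assistant's code; the Counters type and its fields are invented for illustration.

```haskell
import Data.List (foldl')

-- | Hypothetical counters. The bangs make each field strict, so building
-- a Counters value evaluates the numbers instead of storing thunks.
data Counters = Counters
  { filesAdded :: !Int
  , bytesAdded :: !Integer
  } deriving (Eq, Show)

-- | Account for one more file of the given size.
recordFile :: Counters -> Integer -> Counters
recordFile (Counters n b) size = Counters (n + 1) (b + size)

-- | foldl' forces the accumulator at every step, so no chain of
-- unevaluated updates builds up while walking a long list of files.
summarize :: [Integer] -> Counters
summarize = foldl' recordFile (Counters 0 0)
```

Without the strict fields, foldl' would only force the outer constructor and the sums inside could still pile up as thunks, which is how several leaks can hide behind one another.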

Also fixed a bug in git annex add when the disk was completely full. In that situation, it could sometimes move the file from the work tree to .git/annex/objects and fail to put the symlink in place.

Posted Mon Jan 6 01:38:44 2014

Yesterday, added per-remote, per-key state storage. This is exported via the external special remote protocol, and I expect to use it at least for Tahoe-LAFS.

Also, made the assistant write ssh config files with better permissions, so ssh won't refuse to use them. (The only case I know of where that happened was on Windows.)

Today, made addurl and importfeed honor annex.diskreserve. Found out about this the hard way, when an importfeed cron job filled up my server with youtube videos. I should probably also make import honor annex.diskreserve.


I've been working, so far inconclusively, on making the assistant deal with remotes that might open a long duration network connection. Problem being that if the connection is lost, and the remote is not smart enough to reconnect, all further use of it could fail.

In a restarttransferrer branch, I have made the assistant start separate transferkeys processes for each remote. So if a remote starts to fail, the assistant can stop its transferkeys process, and restart it, solving the problem.

But, if a resource needed for a remote is not available, this degrades to every transfer attempt to that remote restarting it. So I don't know if this is the right approach.

Other approaches being considered include asking that implementors of external special remotes deal with reconnection themselves (Tobias, do you deal with this in your remotes?), or making the assistant only restart failing remotes after it detects there's been a network connection change.

Posted Sat Jan 4 22:15:06 2014

Implemented read-only remotes. This may not cover every use case around wanting to clone a repository and use git-annex without leaking the existence of your clone back to it, but I think it hits most of them in a quite easy way, and allows for some potentially interesting stuff like partitioned networks of git-annex repositories.

Zooko and I have been talking things over (for rather too long), and I think we have now agreed on how a more advanced git-annex Tahoe-LAFS special remote should work. This includes storing the tahoe file-caps in the git-annex branch. So, I really need to add that per-special-remote data storage feature I've been thinking about.

Posted Fri Jan 3 00:29:54 2014

Various work on Debian, OSX, and Windows stuff. Mostly uninteresting, but took most of the day.

Made git annex mirror --all work. I can see why I left it out; when the mirroring wants to drop an object, in --all mode it doesn't have an associated file in the tree, so it cannot look at the annex.numcopies in gitattributes. Same reason why git annex drop --all is not implemented. But decided to go ahead and only use other numcopies configuration for mirroring.

Added GETWANTED and SETWANTED to the external special remote protocol, and that is as far as I want to go on adding git-annex plumbing stuff to the protocol. I expect Tobias will release a boatload of special remotes updated to the new protocol soon, which seems to prove it has everything that could reasonably be needed.

This is a nice public git-annex repository containing a growing collection of tech conference videos. https://github.com/RichiH/conference_proceedings

Did some design work on untracked remotes, which I think will turn out to be read-only remotes. Being able to clone a repository and use git-annex in the clone without anything leaking back upstream is often desirable when using a public repository, or a repository with many users.

Posted Thu Jan 2 00:37:06 2014

Worked on bug report and forum backlog (24 messages left), and made a few bug fixes. The main one was a fix for a Windows-specific direct mode merge bug.

This month didn't go entirely to plan. I had not expected to work on the Windows assistant and webapp and get it so close to fully working. Nor had I expected to spend time and make significant progress on porting git-annex to Linux -- particularly to embedded NAS devices! I had hoped to encourage some others to develop git-annex, but only had one bite from a student and it didn't work out. Meanwhile, automatically rewarding committers with bitcoin is an interesting alternative approach to possibly motivating contributors, and I would like to set that up, but the software is new and I haven't had time yet. The only thing that went exactly as planned was the external special remote implementation.

A special surprise this month is that I have started hearing privately from several institutions that are starting to use git-annex in interesting ways. Hope I can share details of some of that in 2014!

Posted Tue Dec 31 21:45:36 2013

Fixed a bug that could leave a direct mode repository stuck at annex.version 3. As part of that, v3 indirect mode repositories will be automatically updated to v5. There's no actual change in that upgrade, it just simplifies things to have only one supported annex.version.

Added youtube playlist support to git-annex. Seems I had almost all the pieces needed, and didn't know it. Only about a dozen lines of code!

Added PREPARE-FAILURE support to the external special remote interface.

After I found the cable my kitten stole (her apport level is high), fixed file transfers to/from Android. This broke because git-annex assistant tries to use ionice, if it's in PATH, and Android's ionice is not suitable. It could probably include ionice in the busybox build and use that one, but I wanted a quick fix for this before the upcoming release.

Posted Sun Dec 29 21:33:20 2013

The external special remote interface is now done, and tested working great! Now we just need all the old hook special remotes to be converted to use it..

I punted on per-special-remote, per-key state storage in the git-annex branch for now. If I find an example of a remote that needs it (Tahoe-LAFS may, but still TBD), I'll add it. Added support for using the same credential storage that git-annex uses for S3 and WebDAV credentials.

The main improvement I'd like to make is to add an interface for transferring files where the file is streamed to/from the external special remote, rather than using temp files as it does now. This would be more efficient (sometimes) and make the progress bars better. But it needs to either use a named pipe, which is complicated and non-portable, or serialize the file's contents over a currently line-based protocol, which would be a pain. Anyway, this can be added later, the protocol is extensible.

Posted Fri Dec 27 20:37:12 2013

Built most of the external special remote today. While I've written 600 lines of code for this, and think it's probably working, and complete (except for a couple of features), all I know is that it compiles.

I've also written an example external special remote program in shell script, so the next step is to put the two together and see how it works. I also hope that some people who have built hook special remotes in the past will update them to the new external special remote interface, which is quite a lot better.

Today's work was sponsored by Justine Lam.

Posted Thu Dec 26 22:42:26 2013

Only did a few hours today, getting started on implementing the external special remote protocol.

Mostly this involved writing down types for the various messages, and code to parse them. I'm very happy with how the parsing turned out; nearly all the work is handled by the data types and type classes, and so only one line of very simple code is needed to parse each message:

instance Receivable Response where
       parseCommand "PREPARE-SUCCESS" = parse0 PREPARE_SUCCESS
       parseCommand "TRANSFER-SUCCESS" = parse2 TRANSFER_SUCCESS
       parseCommand "TRANSFER-FAILURE" = parse3 TRANSFER_FAILURE

An especially nice part of this implementation is that it knows exactly how many parameters each message should have (and their types of course), and so can both reject invalid messages, and avoid ambiguity in tokenizing the parameters. For example, the 3rd parameter of TRANSFER-FAILURE is an error message, and as it's the last parameter, it can contain multiple words.

*Remote.External> parseMessage "TRANSFER-FAILURE STORE SHA1--foo doesn't work on Christmas" :: Maybe Response
Just (TRANSFER_FAILURE Upload (Key {keyName = "foo", keyBackendName = "SHA1", keySize = Nothing, keyMtime = Nothing}) "doesn't work on Christmas")
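To make the "last parameter can contain multiple words" point concrete, here is a minimal, self-contained sketch of arity-aware tokenizing. It is a simplification and not the real `Remote.External` code: the constructor names, `splitWord`, and `parseResponse` are hypothetical, keys are kept as plain strings rather than parsed into a `Key`, and the `RETRIEVE` keyword is an assumption (only `STORE` appears in the example above).

```haskell
data Direction = Upload | Download
    deriving (Show, Eq)

data Response
    = PrepareSuccess
    | TransferSuccess Direction String
    | TransferFailure Direction String String
    deriving (Show, Eq)

-- Split off the first word; the rest is returned untouched,
-- minus the single separating space.
splitWord :: String -> (String, String)
splitWord s = let (w, rest) = break (== ' ') s in (w, drop 1 rest)

parseDirection :: String -> Maybe Direction
parseDirection "STORE" = Just Upload
parseDirection "RETRIEVE" = Just Download -- assumed keyword
parseDirection _ = Nothing

-- Each message knows how many parameters it takes, so only that
-- many words are split off and the final parameter keeps its spaces.
parseResponse :: String -> Maybe Response
parseResponse line = case splitWord line of
    ("PREPARE-SUCCESS", "") -> Just PrepareSuccess
    ("TRANSFER-SUCCESS", ps) -> case splitWord ps of
        (d, key)
            | not (null key) && ' ' `notElem` key ->
                TransferSuccess <$> parseDirection d <*> pure key
        _ -> Nothing
    ("TRANSFER-FAILURE", ps) ->
        let (d, ps') = splitWord ps
            (key, msg) = splitWord ps'
        in if null key
            then Nothing
            else TransferFailure <$> parseDirection d <*> pure key <*> pure msg
    _ -> Nothing
```

With this sketch, the `TRANSFER-FAILURE` line from the example parses with the whole error message intact, while a `TRANSFER-SUCCESS` line carrying an extra word is rejected.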

That's the easy groundwork for external special remotes, done.

Posted Wed Dec 25 22:27:12 2013

Resurfaced today to fix some problems with the Linux standalone builds in the Solstice release. The worst of these prevented the amd64 build from running on some systems, and that build has been updated. The other problems all involved the binary shimming, and were less serious.

As part of that work, replaced the hacky shell script that handled the linux library copying and binary shimming with a haskell program.

Also worked on some Windows bugs, and fixed a typo in the test suite. Got my own little present: haskell-tasty finally got out of Incoming, so the next Debian package build will once again include the test suite.

Posted Tue Dec 24 21:48:20 2013

Got the arm webapp to build! (I have not tried to run it.) The build process for this is quite elaborate; 2 chroots, one amd64 and one armel, with the same versions of everything installed in each, and git-annex is built in the first to get the info the EvilSplicer needs to build it in the second.

Fixed a nasty bug in the assistant on OSX, where at startup it would follow symlinks in the repository that pointed to directories outside the repository, and add the files found there. Didn't cause data loss itself (in direct mode the assistant doesn't touch the files), but certainly confusingly breaks things and makes it easy to shoot your foot off. I will be moving up the next scheduled release because of this bug, probably to Saturday.

Looped the git developers in on a problem with git failing on some kernels due to RLIMIT_NOFILE not working. Looks like git will get more robust and this should make the armel build work on even more embedded devices.

Today's work was sponsored by Johan Herland.

Posted Wed Dec 18 21:53:43 2013

Fixed a few problems in the armel build, and it's been confirmed to work on Raspberry Pi and Synology NAS. Since none of the fixes were specific to those platforms, it will probably work anywhere the kernel is new enough. That covers 9+% of the missing ports in the user survey!

Thought through the possible issues with the assistant on Windows not being able to use lsof. I've convinced myself it's probably safe. (In fact, it might be safe to stop checking with lsof when using the assistant in direct mode entirely.) Also did some testing of some specific interesting circumstances (including 2 concurrent writers to a single file).

I've been working on adding the webapp to the armel build. This can mostly reuse the patches and EvilSplicer developed for Android, but it's taking some babysitting of the build to get yesod etc installed for various reasons. Will be surprised if I don't get there tomorrow.

One other thing.. I notice that http://git-annex.org/ is up and running. This was set up by Subito, who offered me the domain, but I suggested he keep it and set up a pretty start page that points new users at the relevant parts of the wiki. I think he's done a good job with that!

Posted Tue Dec 17 23:40:18 2013

Made the Linux standalone builds more self-contained, now they include their own linker and glibc, and ugly hacks to make them be used when running the included programs. This should make them more portable to older systems.

Set up an arm autobuilder. This autobuilder runs in a Debian armel chroot, using qemu-user-static (with a patch to make it support some syscalls ghc uses). No webapp yet; waiting on feedback of how well it works. I hope this build will be usable on eg, Synology NAS and Raspberry Pi.

Also worked on improving the assistant's batching of commits during the startup scan. And some other followups and bug triage.

Today's work was sponsored by Hamish Coleman.

Posted Mon Dec 16 20:33:17 2013

Made some improvements to git-annex's plumbing level commands today. Added new lookupkey and examinekey commands. Also expanded the things that git annex find can report about files. Among other things, the elusive hash directory locations can now be looked up, which IIRC a few people have asked for a way to do.

Also did some work on the linux standalone tarball and OSX app. Both now include man pages, and it's also now possible to just unpack it and symlink git-annex into ~/bin or similar to add it to PATH.

Posted Sun Dec 15 23:16:44 2013

Spent most of today catching up with a week's worth of traffic.

Fixed 2 bugs. Message backlog is 23 messages.

Posted Thu Dec 12 20:40:28 2013

I've switched over to mostly working on Windows porting in the evenings when bored, with days spent on other git-annex stuff. So, getting back to the planned roadmap for this month..

Set up a tip4commit for git-annex. Anyone who gets a commit merged in will receive a currently small amount of bitcoin. This would almost be a good way to encourage more committers other than me, by putting say, half the money I have earmarked for that into the tip jar. The problem is, I make too many commits myself, so most of the money would be quickly tipped back out to me! I have gotten in touch with the tip4commit people, and hope they will give me a way to blacklist myself from being tipped.

Designed an external special remote protocol that seems pretty good for first-class special remotes implemented outside git-annex. It's moderately complicated on the git-annex side to make it simple and flexible on the special remote side, but I estimate only a few days to build it once I have the design finalized.

windows

Tested the autobuilt windows webapp. It works! Sorted out some issues with the bundled libraries.

Reworked how git annex transferkeys communicates, to make it easier to port it to Windows. Narrowly managed to avoid needing to write Haskell bindings to Windows's equivalent of pipe(2). I think the Windows assistant can transfer keys now, and the webapp UI may even be able to be used to stop transfers. Needs testing.

Investigated what I'll need to get XMPP working on Windows. Most of the libs are available in cygwin, but gsasl would need to be built from source. Also some kind of space-in-path problem is preventing cabal installing some of the necessary dependencies.

Posted Wed Dec 11 22:04:40 2013

Got the Windows autobuilder building the webapp. Have not tried that build yet myself, but I have high hopes it will work.

Made other Windows improvements, including making the installer write a start menu entry file, and adding free disk space checking.

Spent rest of the day improving git repair code on a real-world corrupted repository.

Posted Tue Dec 10 22:30:18 2013

Fixed up a few problems with the Windows webapp, and it's now completely usable, using any browser other than MSIE. While there are missing features in the windows port, all the UI for the features it does have seems to just work in the webapp.

Fixed an ugly problem with Firefox, which turned out to have been introduced a while ago by a workaround for an ugly problem in Chrome. Web browsers are so wonderful, until they're crap.

Think I've fixed the bug in the EvilLinker that was causing it to hang on the autobuilder, but still don't have a Windows autobuild with the webapp just yet.

Also improved git annex import some more, and worked on a bug in git repository repair, which I will need to spend some more time on tomorrow.

Posted Mon Dec 9 22:08:08 2013

I have seen the glory of the webapp running on Windows.

One of the warp developers pointed me in the right direction and I developed a fix for the recv bug.

My Windows and MSIE are old and fall over on some of the javascript, so it's not glorious enough for a screenshot. But large chunks of it do seem to work.

Posted Sun Dec 8 20:27:06 2013

Windows webapp now starts, opens a web browser, and ... crashes.

This is a bug in warp or a deep level of the stack. I know that yesod apps have run on Windows before, so apparently something has changed and introduced this problem.

Also have a problem with the autobuilder; the EvilSplicer or something it runs is locking up on that system for reasons not yet determined.

Looks like I will need to wait a bit longer for the windows webapp, but I could keep working on porting the assistant in the meantime.

The most important thing that I need to port is how to check if a file is being written to at the same time the assistant adds it to the repository. No real lsof equivalent on Windows. I might be able to do something with exclusive locking to detect if there's a writer (but this would also block using the file while it was being added). Or I may be able to avoid the need for this check, at least in direct mode.

Posted Sat Dec 7 21:18:39 2013

Android has the EvilSplicer, now Windows gets the EvilLinker. Fully automated, and truly horrible solution to the too long command line problem.

Now when I run git annex webapp on windows, it almost manages to open the web browser.

At the same time, I worked with Yuri to upgrade the Windows autobuilder to a newer Haskell platform, which can install Yesod. I have not quite achieved a successful webapp build on the autobuilder, but it seems close.


Here's a nice Haskell exercise for someone. I wrote this quick and dirty function in the EvilSplicer, but it's crying out for a generalized solution.

{- Input contains something like 
 - c:/program files/haskell platform/foo -LC:/Program Files/Haskell Platform/ -L...
 - and the *right* spaces must be escaped with \
 -
 - Argh.
 -}
escapeDosPaths :: String -> String
escapeDosPaths = replace "Program Files" "Program\\ Files"
        . replace "program files" "program\\ files"
        . replace "Haskell Platform" "Haskell\\ Platform"
        . replace "haskell platform" "haskell\\ platform"
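
One possible take on that exercise, as a sketch only: instead of one `replace` per spelling, pass in the list of directory names whose spaces need escaping and match them case-insensitively. The name `escapeKnownDirs` is hypothetical, and this still relies on knowing the problem directory names up front, so it generalizes the function above without solving the underlying ambiguity about which spaces are the *right* ones.

```haskell
import Data.Char (toLower)
import Data.List (find)

-- Escape the spaces inside any occurrence of the given directory
-- names (matched case-insensitively); all other spaces are kept.
escapeKnownDirs :: [String] -> String -> String
escapeKnownDirs names = go
  where
    go [] = []
    go s@(c:cs) = case find (`isPrefixCI` s) names of
        Just n | not (null n) ->
            let (matched, rest) = splitAt (length n) s
            in concatMap escapeSpace matched ++ go rest
        _ -> c : go cs

    -- Case-insensitive prefix test.
    isPrefixCI n s = map toLower n == map toLower (take (length n) s)

    escapeSpace ' ' = "\\ "
    escapeSpace ch  = [ch]
```

Calling `escapeKnownDirs ["Program Files", "Haskell Platform"]` on the input from the comment escapes the spaces inside both directory names, in either capitalization, and leaves the spaces that separate parameters alone.
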

Posted Sat Dec 7 01:12:47 2013

Got the entire webapp to build on Windows.

Compiling was easy. One line of code had to be #ifdefed out, and the whole rest of the webapp UI just built!

Linking was epic. It seems that I really am running into a 32kb command line length limit, which causes the link command to fail on Windows. git-annex with all its bells and whistles enabled is just too big. Filed a ghc bug report, and got back a helpful response about using http://gcc.gnu.org/wiki/Response_Files to work around.

6 hours of slogging through compiling dependencies and fighting with toolchain later, I have managed to link git-annex with the webapp!

The process is not automated yet. While I was able to automate passing gcc a @file with its parameters, gcc then calls collect2, which calls ld, and both are passed too many parameters. I have not found a way to get gcc to generate a response file. So I did it manually. Urgh.
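For reference, the part that can be automated is small. The following is a minimal sketch, not the code git-annex uses: `quoteArg`, `responseFile`, and `writeResponseFile` are hypothetical names. It follows GCC's documented response file rules, where options are separated by whitespace, an option may be wrapped in double quotes to include whitespace, and a backslash escapes the following character.

```haskell
-- Quote one parameter so that spaces, quotes and backslashes
-- survive being read back from a response file.
quoteArg :: String -> String
quoteArg arg = "\"" ++ concatMap esc arg ++ "\""
  where
    esc '\\' = "\\\\"
    esc '"'  = "\\\""
    esc c    = [c]

-- One quoted parameter per line.
responseFile :: [String] -> String
responseFile = unlines . map quoteArg

-- Write the parameters to a file and return the single @file
-- parameter to pass on the command line in their place.
writeResponseFile :: FilePath -> [String] -> IO String
writeResponseFile path params = do
    writeFile path (responseFile params)
    return ('@' : path)
```

This only shortens the command line of the program that is handed the `@file` parameter. It does nothing about the commands that program goes on to run, which is exactly the collect2 and ld problem described above.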

Also, it crashes on startup with getAddrInfo failure. But some more porting is to be expected, now that the windows webapp links.. ;)

Posted Fri Dec 6 03:50:03 2013

Had planned to spend all day not working on git-annex and instead getting caught up on conference videos. However, got a little bit multitasky while watching those, and started investigating why, last time I worked on Windows port, git-annex was failing to link.

A good thing to do while watching conference videos since it involved lots of test builds with different flags. Eventually solved it. Building w/o WebDAV avoids crashing the compiler anyhow.

Thought I'd try the resulting binary and see if perhaps I had forgotten to use the threaded RTS when I was running ghc by hand to link it last time, and perhaps that was why threads seemed to have hung back then.

It was. This became clear when I saw a "deadlocked indefinitely in MVar" error message, which tells me that it's at least using the threaded RTS. So, I fixed that, and a few other minor things, and ran this command in a DOS prompt box:

git annex watch --force --foreground --debug

And I've been making changes to files in that repository, and amazingly, the watcher is noticing them, and committing them!

So, I was almost entirely there to a windows port of the watcher a month ago, and didn't know. It has some rough edges, including not doing anything to check if a newly created file is open for write when adding it, and getting the full assistant ported will be more work, and the full webapp may be a whole other set of problems, but this is a quite nice milestone for the Windows port.

Posted Wed Dec 4 21:56:29 2013

The 2013 git-annex user survey has been running for several weeks and around 375 people have answered at least the first question. While I am going to leave it up through the end of the year, I went over the data today to see what interesting preliminary conclusions I can draw.

  • 11% build git-annex from source. More than I would have guessed.

  • 20% use the prebuilt versions from the git-annex website.

    This is a number to keep in mind later, when more people have upgraded to the last release, which checks for upgrades. I can run some stats on the number of upgrade checks I receive, and multiplying that by 5 would give a good approximation of the total number of computers running git-annex.

  • I'm surprised to see so many more Linux (79%) than OSX (15%) users. Also surprising is there are more Windows (2%) than Android (1%) users. (Android numbers may be artificially low since many users will use it in addition to one of the other OSes.)

  • Android and Windows unsurprisingly lead in ports requested, but the Synology NAS is a surprise runner up, with 5% (more than IOS).

    In theory it would not be too hard to make a standalone arm tarball, which could be used on such a device, although IIRC the Synology had problems with a too old linker and libc. It would help if I could make the standalone tarball not depend on the system linker at all.

    A surprising number (3%) want some kind of port to the Raspberry Pi, which is weird because I'd think they'd just be using Raspbian on it.. but a standalone arm tarball would also cover that use case.

  • A minimum of 1664 (probably closer to 2000) git annex repositories are being used by the 248 people who answered that question. Around 7 repositories per person average, which is either one repository checked out on 7 different machines or two repositories on 3 machines, etc.

  • At least 143 terabytes of data are being stored in git-annex. This does not count redundant data. (It also excludes several hundred terabytes from one institution that I don't think are fully online yet.) Average user has more than half a terabyte of data.

  • 8% of users store scientific data in git-annex! :) A couple of users are using it for game development assets, and 5% of users are using it for some form of business data.

  • Only 10% of users are sharing a git-annex repository with at least one other person. 27% use it by themselves, but want to get others using their repositories. This probably points to it needing to be easier for nontechnical users.

  • 61% of git-annex users have good or very good knowledge of git. This question intentionally used the same wording as the general git user survey, so the results can be compared. The curves have somewhat different shapes, with git-annex's users being biased more toward the higher knowledge levels than git's users.

  • The question about how happy users are also used the same wording. While 74% of users are happy with git-annex, 94% are similarly happy with git, and while the median git-annex user is happy, the median git user is very happy.

    The 10% who wrote in "very enthusiastic, but still often bitten by quirks (so not very happy yet, but with lots of confidence in the potential)" might have thrown off this comparison some, but they certainly made their point!

  • 3% of respondents say that a bug is preventing them from using git-annex, but that they have not reported the bug yet. Frustrating! 1% say that a bug that's been reported already is blocking them.

  • 18% wrote in that they need the webapp to support using github (etc) as a central server. I've been moving in that direction with the encryption and some other changes, so it's probably time to make a UI for that.

  • 12% want more control over which files are stored locally when using the assistant.

  • A really surprising thing happened when someone wrote in that I should work on "not needing twice disk space of repo in direct mode", and 5% of people then picked this choice. This is some kind of documentation problem, because of course git-annex never needs 2x disk space, whether using direct mode or not. That's one of its advantages over git!

  • Somewhere between 59 and 161 of the survey respondents use Debian. I can compare this with Debian popularity contest data which has 400 active installations and 1000 total installations, and make guesses about what fraction of all git-annex users have answered the survey. By making different assumptions I got guesses that varied by 2 orders of magnitude, so not worth bothering with. Explicitly asking how many people use each Linux distribution would be a good idea in next year's survey.


Main work today was fixing Android DNS lookups, which were trying to use /etc/resolv.conf to look up SRV records for XMPP, and had to be changed to use a getprop command instead. Since I can't remember dealing with this before (not impossible I made some quick fix to the dns library before and lost it though), I'm wondering if XMPP was ever usable on Android before. Cannot remember. May work now, anyway...

Posted Tue Dec 3 22:04:12 2013

Still working through thanksgiving backlog. Around 55 messages to go.

Wrote hairy code to automatically fix up bad bare repositories created by recent versions of git-annex. Managed to do it with only 1 stat call overhead (per local repository). Will probably keep that code in git-annex for a year or so, despite the bug only being present for a few weeks, because the repositories that need to be fixed might be on removable drives that are rarely used.

Various other small bug fixes, including dealing with box.com having changed their WebDAV endpoint url.

Spent a while evaluating various key/value storage possibilities. The "incremental fsck should not use sticky bit" page has the details.

Posted Tue Dec 3 00:33:30 2013

Made a release yesterday to fix a bug that made git-annex init in a bare repository set core.bare=false. This bug only affected git-annex 5, it was introduced when building the direct mode guard. Currently recovering from it is a manual (pretty easy) process. Perhaps I should automate that, but I mostly wanted to get a fix out before too many people encountered the bug.

Today, I made the assistant run batch jobs with ionice and nocache, when those commands are available. Also, when the assistant transfers files, that also runs as a batch job.
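The "when those commands are available" part can be sketched as a pure function that prefixes a command with whichever wrappers exist. This is an illustration under assumptions, not the assistant's actual code: `toBatchCommand` is a hypothetical name, and the list of available programs is passed in by the caller (in a real program it could be built with `findExecutable` from `System.Directory`). The `-c3` parameter selects ionice's idle scheduling class.

```haskell
-- A command and its parameters.
type Command = (String, [String])

-- Wrap a command in ionice and nocache, using only the wrappers
-- that are listed as available on this system.
toBatchCommand :: [String] -> Command -> Command
toBatchCommand available cmd = foldr wrap cmd wrappers
  where
    -- Outermost wrapper first.
    wrappers = [("ionice", ["-c3"]), ("nocache", [])]
    wrap (w, wparams) (c, ps)
        | w `elem` available = (w, wparams ++ c : ps)
        | otherwise = (c, ps)
```

On a system with both wrappers, `toBatchCommand ["ionice", "nocache"] ("git", ["gc"])` produces a command equivalent to `ionice -c3 nocache git gc`, and on a system with neither the command is returned unchanged.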

Changed how git-annex does commits, avoiding using git commit in direct mode, since in some situations git commit (not with -a!) wants to read the contents of files in the work tree, which can be very slow.

Posted Sun Dec 1 20:18:32 2013

My last day before thanksgiving, getting caught up with some recent bug reports, and in quite a rush to get a lot of fixes in. Adding to the fun, wintery weather means very limited power today.

It was a very productive day, especially for Android, which hopefully has XMPP working again (at least it builds..), halved the size of the package, etc.

Fixed a stupid bug in the automatic v5 upgrade code; annex.version was not being set to 5, and so every git annex command was actually re-running the upgrade.

Fixed another bug I introduced last Friday, which the test suite luckily caught, that broke using some local remotes in direct mode.

Tracked down a behavior that makes git annex sync quite slow on filesystems that don't support symlinks. I need to switch direct mode to not using git commit at all, and use plumbing to make commits there. Will probably work on this over the holiday.

Posted Wed Nov 27 00:08:03 2013

Upgrades should be working on OSX Mavericks, Linux, and sort of on Android. This needs more testing, so I have temporarily made the daily builds think they are an older version than the last git-annex release. So when you install a daily build, and start the webapp, it should try to upgrade (really downgrade) to the last release. Tests appreciated.

Looking over the whole upgrade code base, it took 700 lines of code to build the whole thing, of which 75 are platform specific (and mostly come down to just 3 or 4 shell commands). Not bad..


Last night, added support for quvi 0.9, which has a completely changed command line interface from the 0.4 version.

Plan to spend tomorrow catching up on bug reports etc and then low activity for rest of the week.

Posted Mon Nov 25 19:42:52 2013

Upgrades are fully working on Linux. OSX code is written but untested, and on my evening walk I thought of one bug it certainly has. Probably another hour's work left later this evening to finish it off.

Posted Sun Nov 24 22:34:28 2013

Completely finished up with making the assistant detect when git-annex's binary has changed and handling the restart.

It's a bit tricky because during an upgrade there can be two assistant daemons running at the same time, in the same repository. Although I disable the watcher of the old one first. Luckily, git-annex has long supported running multiple concurrent git-annex processes in the same repository.

The surprisingly annoying part turned out to be how to make the webapp redirect the browser to the new url when it's upgraded. Particularly needed when automatic upgrades are enabled, since the user will not then be taking any action in the webapp that could result in a redirect. My solution to this feels like overkill; the webapp does ajax long polling until it gets an url, and then redirects to it. Had to write javascript code and ugh.

But, that turned out to also be useful when manually restarting the webapp (removed some horrible old code that ran a shell script to do it before), and also when shutting the webapp down.

assistant downloading an upgrade to itself

Getting back to upgrades, I have the assistant downloading the upgrade, and running a hook action once the key is transferred. Now all I need is some platform-specific code to install it. Will probably be hairy, especially on OSX where I need to somehow unmount the old git-annex dmg and mount the new one, from within a program running on the old dmg.


Today's work was sponsored by Evan Deaubl.

Posted Sat Nov 23 21:28:42 2013

The difference picking the right type can make! Last night, I realized that where I had a distributionSha256sum :: String, I should instead use distributionKey :: Key. This means that when git-annex is eventually downloading an upgrade, it can treat it as just another Key being downloaded from the web. So the webapp will show that transfer along with all the rest, and I can leverage tons of code for a new purpose. For example, it can simply fsck the key once it's downloaded to verify its checksum.

Also, built a DistributionUpdate program, which I'll run to generate the info files for a new version. And since I keep git-annex releases in a git-annex repo, this too leverages a lot of git-annex modules, and ended up being just 60 easy lines of code. The upgrade notification code is tested and working now.

And, I made the assistant detect when the git-annex program binary is replaced or modified. Used my existing DirWatcher code for that. The plan is to restart the assistant on upgrade, although I need to add some sanity checks (eg, reuse the lsof code) first. And yes, this will work even for apt-get upgrade!


Today's work was sponsored by Paul Tötterman

Posted Fri Nov 22 23:03:22 2013

Still working on the git repair code. Improved the test suite, which found some more bugs, and so I've been running tests all day and occasionally going and fixing a bug in the repair code. The hardest part of repairing a git repo has turned out to be reliably determining which objects in it are broken. Bugs in git don't help (but the git devs are going to fix the one I reported).

But the interesting new thing today is that I added some upgrade alert code to the webapp. Ideally everyone would get git-annex and other software as part of an OS distribution, which would include its own upgrade system -- But the survey tells me that a quarter of installs are from the prebuilt binaries I distribute.

So, those builds are going to be built with knowledge of an upgrade url, and will periodically download a small info file (over https) to see if a newer version is available, and show an alert.
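The "is a newer version available" decision could look roughly like the sketch below. It is Python, not the actual Haskell code, and it assumes versions are dotted numbers such as 5.20131118; the layout of the info file is not described here, so the sketch only covers the comparison. Both function names are hypothetical.

```python
def parse_version(text):
    """Split a dotted version such as '5.20131118' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))

def upgrade_available(running, advertised):
    """True if the advertised version is strictly newer than the running one.

    Comparing tuples of ints avoids the string-comparison trap where
    '5.9' would sort after '5.10'.
    """
    return parse_version(advertised) > parse_version(running)
```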

I think all that's working, though I have not yet put the info files in place and tested it. The actual upgrade process will be a manual download and reinstall, to start with, and then perhaps I'll automate it further, depending on how hard that is on the different platforms.

Posted Fri Nov 22 04:26:24 2013

Pushed out a minor release of git-annex today, mostly to fix build problems on Debian. No strong reason to upgrade to it otherwise.

Continued where I left off with the Git.Destroyer. Fixed quite a lot of edge cases where git repair failed due to things like a corrupted .git/HEAD file (this makes git think it's not in a git repository), corrupt git objects that have an unknown object type and so crash git hard, and an interesting failure mode where git fsck wants to allocate 116 GB of memory due to a corrupted object size header. Reported that last to the git list, as well as working around it.

At the end of the day, I ran a test creating 10000 corrupt git repositories, and all of them were recovered! Any improvements will probably involve finding new ways to corrupt git repositories that my code can't think of. ;)

Posted Wed Nov 20 23:34:30 2013

Wrote some evil code you don't want to run today. Git.Destroyer randomly generates Damage, and applies it to a git repository, in a way that is reproducible -- applying the same Damage to clones of the same git repo will always yield the same result.

This let me build a test harness for git-repair, which repeatedly clones, damages, and repairs a repository. And when it fails, I can just ask it to retry after fixing the bug and it'll re-run every attempt it's logged.

This is already yielding improvements to the git-repair code. The first randomly constructed Damage that it failed to recover turned out to be a truncated index file that hid some other corrupted object files from being repaired.

[Damage Empty (FileSelector 1),
 Damage Empty (FileSelector 2),
 Damage Empty (FileSelector 3),
 Damage Reverse (FileSelector 3),
 Damage (ScrambleFileMode 3) (FileSelector 5),
 Damage Delete (FileSelector 9),
 Damage (PrependGarbage "\SOH\STX\ENQ\f\a\ACK\b\DLE\n") (FileSelector 9),
 Damage Empty (FileSelector 12),
 Damage (CorruptByte 11 25) (FileSelector 6),
 Damage Empty (FileSelector 5),
 Damage (ScrambleFileMode 4294967281) (FileSelector 14)
]
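To show why a list of Damage like this is reproducible, here is a small sketch. It is Python, not Git.Destroyer itself, and it models a repository as a dict of file contents. The action names, the seeded generator, and the "selector modulo number of files" rule are all assumptions made for the example.

```python
import random

def generate_damage(seed, count):
    """Build a reproducible list of (action, selector) pairs from a seed."""
    rng = random.Random(seed)
    actions = ["empty", "reverse", "delete", "corrupt_byte"]
    return [(rng.choice(actions), rng.randrange(1000)) for _ in range(count)]

def apply_damage(files, damage):
    """Apply damage to a {path: bytes} dict and return the damaged copy.

    A selector picks a file by index into the sorted surviving paths,
    so the same damage always hits the same files in any clone.
    """
    result = dict(files)
    for action, selector in damage:
        paths = sorted(result)
        if not paths:
            break
        path = paths[selector % len(paths)]
        data = result[path]
        if action == "empty":
            result[path] = b""
        elif action == "reverse":
            result[path] = data[::-1]
        elif action == "delete":
            del result[path]
        elif action == "corrupt_byte" and data:
            i = selector % len(data)
            result[path] = data[:i] + bytes([data[i] ^ 0xFF]) + data[i + 1:]
    return result
```

Because the damage is plain data, a failing case can be logged and replayed against a fresh clone after the bug is fixed.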

I need to improve the ranges of files that it damages -- currently QuickCheck seems to only be selecting one of the first 20 or so files. Also, it's quite common that it will damage .git/config so badly that git thinks it's not a git repository anymore. I am not sure if that is something git-repair should try to deal with.


Today's work was sponsored by the WikiMedia Foundation.

Posted Tue Nov 19 21:35:16 2013

Release today, right on bi-weekly schedule. Rather startled at the size of the changelog for this one; along with the direct mode guard, it adds support for OS X Mavericks, Android 4.3/4.4, and fixes numerous bugs.

Posted another question in the survey, http://git-annex-survey.branchable.com/polls/2013/roadmap/.

Spun off git-repair as an independent package from git-annex. Of course, most of the source code is shared with git-annex. I need to do something with libraries eventually..

Posted Mon Nov 18 22:27:47 2013

Fixed two difficult bugs with direct mode. One happened (sometimes) when a file was deleted and replaced with a directory by the same name and then those changes were merged into a direct mode repository.

The other problem was that direct mode did not prevent writes to .git/annex/objects the way that indirect mode does, so when a file in the repository was not currently present, writing to the dangling symlink would follow it and write into the object directory.

Hmm, I was going to say that it's a pity that direct mode still has so many bugs being found and fixed, but the last real bug fix to direct mode was made last May! Instead, I probably have to thank Tim for being a very thorough tester.

Finished switching the test suite to use the tasty framework, and prepared tasty packages for Debian.

Posted Fri Nov 15 20:31:36 2013

The user survey is producing some interesting and useful results!
Added two more polls: using with and blocking problems
(There were some load issues so if you were unable to vote yesterday, try again..)

Worked on getting the autobuilder for OS X Mavericks set up. Eventually succeeded, after patching a few packages to work around a cpp that thinks it should parse haskell files as if they're C code. Also, Jimmy has resuscitated the OS X Lion autobuilder.

A not too bad bug in automatic merge conflict resolution has been reported, so I will need to dig into that tomorrow. Didn't feel up to it today, so instead have been spending the remaining time finishing up a branch that switches the test suite to use the tasty test framework.

Posted Fri Nov 15 00:09:47 2013

One of my goals for this month is to get a better sense of how git-annex is being used, how it's working out for people, and what areas need to be concentrated on. To start on that, I am doing the 2013 git-annex user survey, similar to the git user surveys. I will be adding some less general polls later (suggestions for topics appreciated!), but you can go vote in any or all of 10 polls now.


Found a workaround for yesterday's Windows build problem. Seems that only cabal runs gcc in a way that fails, so ghc --make builds it successfully. However, the watcher doesn't quite work on Windows. It does get events when files are created, but it seems to then hang before it can add the file to git, or indeed finish printing out a debug log message about the event. This looks like it could be a problem with the threaded ghc runtime on Windows, or something like that.

Main work today was improving the git repository repair to handle corrupt index files. The assistant can now start up, detect that the index file is corrupt, and regenerate it all automatically.

Posted Wed Nov 13 20:56:41 2013

Annoyingly, the Android 4.3 fix breaks git-annex on Android 4.0 (probably through 4.2), so I now have two separate builds of the Android app.


Worked on Windows porting today. I've managed to get the assistant and watcher (but not yet webapp) to build on Windows. The git annex transferrer interface needs POSIX stuff, and seems to be the main thing that will need porting for Windows for the assistant to work, besides of course file change detection. For that, I've hooked up Win32-notify.

So the watcher might work on Windows. At least in theory. Problem is, while all the code builds ok, it fails to link:

ghc.exe: could not execute: C:\Program Files (x86)\Haskell Platform\2012.4.0.0\lib/../mingw/bin/gcc.exe

I wonder if this is a case of too many parameters being passed?

This happens both on the autobuilder and on my laptop, so I'm stuck here. Oh well, I was not planning to work on this anyway until February...

Posted Wed Nov 13 01:05:04 2013

Finally found the root cause of the Android 4.3/4.4 trouble, and a fix is now in place!

As a bonus, it looks like I've fixed a problem accessing the environment on Android that had been worked around in an ugly way before.

Big thanks to my remote hands Michael Alan, Sören, and subito. All told they ran 19 separate tests to help me narrow down this tricky problem, often repeating long command lines on software keyboards.

Posted Tue Nov 12 06:54:19 2013

Been chipping away at my backlog of messages, and it's down to 23 items.

Finally managed to get ghc to build with a newer version of the NDK. This might mean a solution to git-annex on Android 4.2. I need help with testing.

Posted Sun Nov 10 20:14:20 2013

Finished the direct mode guard, including the new git annex status command.

Spent the rest of the day working on various bug fixes. One of them turned into rather a lot of work to make the webapp's UI better for git remotes that do not have an annex.uuid.

Posted Thu Nov 7 22:03:32 2013

Started by tracking down a strange bug that was apparently Ubuntu-specific and caused git-annex branch changes to get committed to master. Root cause turned out to be failing to recover from an exception. I'm kicking myself about that, because I remember looking at the code where the bug was at least twice before and thinking "hmm, should add exception handling here? nah..". Exceptions are horrible.

Made a release with a fix for that and a few other minor changes accumulated since last Friday's release. The main point of this release is to fix building without the webapp (so it will propagate to Debian testing, etc). This release does not include the direct mode guard, so I'll have a few weeks until the next release to get that tested.

Fixed the test suite in directguard. This branch is now nearly ready to merge to master, but one command that is badly needed in guarded direct mode is "git status". So I am planning to rename "git annex status" to "git annex info", and make "git annex status" display something similar to "git status".

Also took half an hour and added optional EKG support to git-annex. This is a Haskell library that can add a terrific monitoring console web UI to any program in 2 lines of code. Here we can see the git-annex webapp using resources at startup, followed in a few seconds by the assistant's startup scan of the repository.

BTW, Kevin tells me that the machine used to build git-annex for OSX is going to be upgraded to 10.9 soon. So, hopefully I'll be making autobuilds of that. I may have to stop the 10.8.2 autobuilds though.


Today's work was sponsored by Protonet.

Posted Wed Nov 6 20:39:24 2013

Long, long day coding up the direct mode guard today. About 90% of the fun is dealing with receive.denyCurrentBranch not preventing pushes that change the current branch, now that core.bare is set in direct mode. My current solution to this involves using a special branch when using direct mode, which nothing will ever push to (hopefully). A much nicer solution would be to use an update hook to deny pushes of the current branch -- but there are filesystems where repos cannot have git hooks.

The test suite is falling over, but the directguard branch otherwise seems usable.


Today's work was sponsored by Carlo Matteo Capocasa.

Posted Wed Nov 6 01:26:06 2013

I've been investigating ways to implement a direct mode guard. Preventing a stray git commit -a or git add doing bad things in a direct mode repository seems increasingly important.

First, considered moving .git, so git won't know it's a git repository. This doesn't seem too hard to do, but there will certainly be unexpected places that assume .git is the directory name.

I dislike it more and more as I think about it though, because it moves direct mode git-annex toward being entirely separate from git, and I don't want to write my own version control system. Nor do I want to complicate the git ecosystem with tools needing to know about git-annex to work in such a repository.

So, I'm happy that one of the other ideas I tried today seems quite promising. Just set core.bare=true in a direct mode repository. This nicely blocks all git commands that operate on the working tree from doing anything, which is just what's needed in direct mode, since they don't know how to handle the direct mode files. But it lets all git commands and other tools that don't touch the working tree continue to be used. You can even run git log file in such a repository (surprisingly!)

It also gives an easy out for anyone who really wants to use git commands that operate on the work tree of their direct mode repository, by just passing -c core.bare=false. And it's really easy to implement in git-annex too -- it can just notice if a repo has core.bare and annex.direct both set, and pass that parameter to every git command it runs. I should be able to get by with only modifying 2 functions to implement this.
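The "pass that parameter to every git command" idea fits in a few lines. The sketch below is Python rather than git-annex's Haskell; git_command is a hypothetical helper, and the config is assumed to be available as a dict of strings.

```python
def git_command(args, config):
    """Build the argv for a git command run on behalf of a repository.

    config is a dict of git config values, e.g. {"core.bare": "true"}.
    When both core.bare and annex.direct are set, the guard is overridden
    for this one command with git's -c option.
    """
    guarded = (config.get("core.bare") == "true"
               and config.get("annex.direct") == "true")
    prefix = ["git", "-c", "core.bare=false"] if guarded else ["git"]
    return prefix + list(args)
```

A plain bare repository (core.bare set, annex.direct unset) is left alone, so the override only applies where the guard was added by direct mode.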

Posted Mon Nov 4 21:32:03 2013

Low activity the past couple of days. Released a new version of git-annex yesterday. Today fixed three bugs (including a local pairing one that was pretty complicated) and worked on getting caught up with traffic.

Posted Sat Nov 2 21:07:19 2013

Spent today reviewing my plans for the month and filling in a couple of missing pieces.

Noticed that I had forgotten to make repository repair clean up any stale git locks, despite writing that code at the beginning of the month, and added that in.

Made the webapp notice when a repository that is being used does not have any consistency checks configured, and encourage the user to set up checks. This happens when the assistant is started (for the local repository), and when removable drives containing repositories are plugged in. If the reminders are annoying, they can be disabled with a couple clicks.

And I think that just about wraps up the month. (If I get a chance, I would still like to add recovery of git-remote-gcrypt encrypted git repositories.)

My roadmap has next month dedicated to user-driven features and polishing and bugfixing.

Posted Tue Oct 29 20:59:51 2013

All command line stuff today..

Added --want-get and --want-drop, which can be used to test preferred content settings of a repository. For example git annex find --in . --want-drop will list the same files that git annex drop --auto would try to drop. (Also renamed git annex content to git annex wanted.)

Finally laid to rest problems with git annex unannex when multiple files point to the same key. It's a lot slower, but I'll stop getting bug reports about that.

Posted Mon Oct 28 22:19:19 2013

Finally got the assistant to repair git repositories on removable drives, or other local repos. Mostly this happens entirely automatically; whatever data in the git repo on the drive has been corrupted can just be copied to it from ~/annex/.git.

And, the assistant will launch a git fsck of such a repo whenever it fails to sync with it, so the user does not even need to schedule periodic fscks. Although it's still a good idea, since some git repository problems don't prevent syncing from happening.

Watching git annex heal problems like this is quite cool!

One thing I had to defer till later is repairing corrupted gcrypt repositories. I don't see a way to do it without deleting all the objects in the gcrypt repository, and re-pushing everything. And even doing that is tricky, since the gcrypt-id needs to stay the same.

Posted Sun Oct 27 20:58:10 2013

Got well caught up on bug fixes and traffic. Backlog is down to 40.

Made the assistant wait for a few seconds before doing the startup scan when it's autostarted, since the desktop is often busy starting up at that same time.

Fixed an ugly bug with chunked webdav and directory special remotes that caused it to not write a "chunkcount" file when storing data, so it didn't think the data was present later. I was able to make it recover nicely from that mistake, by probing for what chunks are actually present.
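Probing for chunks when the "chunkcount" file is missing might look like the sketch below. It is Python, not the actual remote code, and it makes two assumptions: that chunks are numbered consecutively from 1, and that they are named by appending ".chunkN" to the key. chunk_exists stands in for whatever presence check the remote offers.

```python
def probe_chunk_count(key, chunk_exists, limit=100000):
    """Count the consecutive chunks that are present, starting at chunk 1.

    chunk_exists is a callable taking a chunk name and returning a bool.
    The limit guards against a remote that claims everything exists.
    """
    count = 0
    while count < limit and chunk_exists(f"{key}.chunk{count + 1}"):
        count += 1
    return count
```

A result of 0 means no chunks were found, so the data really is absent.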

Several people turn out to have had problems with git annex sync not working because receive.denyNonFastForwards is enabled. I made the webapp not enable it when setting up a ssh repository, and I made git annex sync print out a hint about this when it's failed to push. (I don't think this problem affects the assistant's own syncing.)

Made the assistant try to repair a damaged git repository without prompting. It will only prompt when it fails to fetch all the lost objects from remotes.

Glad to see that others have managed to get git-annex to build on Mac OS X 10.9. Now I just need someone to offer up a ssh account on that OS, and I could set up an autobuilder for it.

Posted Sat Oct 26 21:17:47 2013

The webapp now fully handles repairing damage to the repository.

Along with all the git repository repair stuff already built, I added additional repairs of the git-annex branch and git-annex's index file. That was pretty easy actually, since git-annex already handles merging git-annex branches that can sometimes be quite out of date. So when git repo repair has to throw away recent changes to the git-annex branch, it just effectively becomes out of date. Added a git annex fsck --fast run to ensure that the git-annex branch reflects the current state of the repository.

When the webapp runs a repair, it first stops the assistant from committing new files. Once the repair is done, that's started back up, and it runs a startup scan, which is just what is needed in this situation; it will add any new files, as well as any old files that the git repository damage caused to be removed from the index.

Also made git annex repair run the git repository repair code, for those with a more command-line bent. It can be used in non-git-annex repos too!


So, I'm nearly ready to wrap up working on disaster recovery. Lots has been accomplished this month. And I have put off making a release for entirely too long!

The big missing piece is repair of git remotes located on removable drives. I may make a release before adding that, but removable drives are probably where git repository corruption is most likely to occur, so I certainly need to add that.


Today's work was sponsored by Scott Robinson.

Posted Wed Oct 23 19:16:15 2013

I think that git-recover-repository is ready now. Made it deal with the index file referencing corrupt objects. The best approach I could think of for that is to just remove those objects from the index, so the user can re-add files from their work tree after recovery.

Now to integrate this git repository repair capability into the git-annex assistant. I decided to run git fsck as part of a scheduled repository consistency check. It may also make sense for the assistant to notice when things are going wrong, and suggest an immediate check. I've started on the webapp UI to run a repository repair when fsck detects problems.

Posted Tue Oct 22 20:30:39 2013

Solid day of working on repository recovery. Got git recover-repository --force working, which involves fixing up branches that refer to missing objects. Mostly straightforward traversal of git commits, trees, blobs, to find when a branch has a problem, and identify an old version of it that predates the missing object. (Can also find them in the reflog.)
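The traversal can be sketched on a toy model. This is Python, not the real implementation: history is assumed to be linear (one parent per commit), the repository is a dict of commits plus a set of object ids that are still readable, and the reflog is ignored. All names are invented for the example.

```python
def newest_intact_commit(tip, commits, present):
    """Find the newest commit on a branch whose whole history is readable.

    commits maps a commit id to {"parent": id or None, "objects": [ids]},
    where "objects" lists the trees and blobs that commit needs.
    present is the set of object ids that can still be read.
    Returns a commit id, or None if no usable version exists.
    """
    chain = []
    sha = tip
    while sha is not None:
        if sha not in present:
            # An unreadable commit hides its parent, so the walk ends here.
            return None
        chain.append(sha)
        sha = commits[sha]["parent"]
    good = None
    for sha in reversed(chain):  # oldest commit first
        if not all(obj in present for obj in commits[sha]["objects"]):
            break
        good = sha
    return good
```

The branch would then be reset to the returned commit, losing whatever came after the damage.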

The main complication turned out to be that git branch -D and git show-ref don't behave very well when the commit objects pointed to by refs are themselves missing. And git has no low-level plumbing that avoids falling over these problems, so I had to write it myself.

Testing has turned up one unexpected problem: Git's index can itself refer to missing objects, and that will break future commits, etc. So I need to find a way to validate the index, and when it's got problems, either throw it out, or possibly recover some of the staged data from it.

Posted Mon Oct 21 21:44:05 2013

Built a git-recover-repository command today. So far it only does the detection and deletion of corrupt objects, and retrieves them from remotes when possible. No handling yet of missing objects that cannot be recovered from remotes.

Here's a couple of sample runs where I do bad things to the git repository and it fixes them:

joey@darkstar:~/tmp/git-annex>chmod 644 .git/objects/pack/*
joey@darkstar:~/tmp/git-annex>echo > .git/objects/pack/pack-a1a770c1569ac6e2746f85573adc59477b96ebc5.pack 
joey@darkstar:~/tmp/git-annex>~/src/git-annex/git-recover-repository 
Running git fsck ...
git fsck found a problem but no specific broken objects. Perhaps a corrupt pack file? Unpacking all pack files.
fatal: early EOF
Unpacking objects: 100% (148/148), done.
Unpacking objects: 100% (354/354), done.
Re-running git fsck to see if it finds more problems.
Re-running git fsck to see if it finds more problems.
Initialized empty Git repository in /home/joey/tmp/tmprepo.0/.git/
Trying to recover missing objects from remote origin
Successfully recovered repository!
You should run "git fsck" to make sure, but it looks like
everything was recovered ok.

joey@darkstar:~/tmp/git-annex>chmod 644 .git/objects/00/0800742987b9f9c34caea512b413e627dd718e
joey@darkstar:~/tmp/git-annex>echo > .git/objects/00/0800742987b9f9c34caea512b413e627dd718e
joey@darkstar:~/tmp/git-annex>~/src/git-annex/git-recover-repository 
Running git fsck ...
error: unable to unpack 000800742987b9f9c34caea512b413e627dd718e header
error: inflateEnd: stream consistency error (no message)
error: unable to unpack 000800742987b9f9c34caea512b413e627dd718e header
error: inflateEnd: stream consistency error (no message)
git fsck found 1 broken objects. Unpacking all pack files.
removing 1 corrupt loose objects
Re-running git fsck to see if it finds more problems.
Re-running git fsck to see if it finds more problems.
Initialized empty Git repository in /home/joey/tmp/tmprepo.0/.git/
Trying to recover missing objects from remote origin
Successfully recovered repository!
You should run "git fsck" to make sure, but it looks like
everything was recovered ok.

Works great! I need to move this and git-union-merge out of the git-annex source tree sometime.


Today's work was sponsored by Francois Marier.

Posted Sun Oct 20 21:56:25 2013

Goal for the rest of the month is to build automatic recovery from git repository corruption. Spent today investigating how to do it and came up with a fairly detailed design. It will have two parts, first to handle repository problems that can be fixed by fetching objects from remotes, and secondly to recover from problems where data never got sent to a remote, and has been lost.

In either case, the assistant should be able to detect the problem and automatically recover well enough to keep running. Since this also affects non-git-annex repositories, it will also be available in a standalone git-recover-repository command.

Posted Fri Oct 18 20:19:22 2013

A long day of bugfixing. Split into two major parts. First I got back to a bug I filed in August to do with the assistant misbehaving when run in a subdirectory of a git repository, and did a nice type-driven fix of the underlying problem (that also found and fixed some other related bugs that would not normally occur). Then, spent 4 hours in Windows purgatory working around crazy path separator issues.

Posted Fri Oct 18 02:39:56 2013

Productive day, but I'm wiped out. Backlog down to 51.

Posted Wed Oct 16 20:58:53 2013

While I said I was done with fsck scheduling yesterday, I ended up adding one more feature to it today: Full anacron style scheduling. So a fsck can be scheduled to run once per week, or month, or year, and it'll run the fsck the next time it's available after that much time has passed. The nice thing about this is I didn't have to change Cronner at all to add this, just improved the Recurrence data type and the code that calculates when to run events.
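The anacron-style rule comes down to one comparison. A minimal sketch follows, in Python rather than the Haskell Recurrence type, and assuming a period can be approximated by a fixed timedelta (real months and years vary in length). The function name is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

def due(last_run: Optional[datetime], period: timedelta, now: datetime) -> bool:
    """Anacron-style check: run if the job never ran, or if at least
    one full period has passed since it last ran."""
    return last_run is None or now - last_run >= period
```

Unlike a fixed "every Monday" schedule, a machine that was off at the usual time still runs the job the next time it is up.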

Rest of the day I've been catching up on some bug reports. The main bug I fixed caused git-annex on Android to hang when adding files. This turns out to be because it's using a new (unreleased) version of git, and git check-attr -z output format has changed in an incompatible way.

I am currently 70 messages behind, which includes some ugly looking bug reports, so I will probably continue with this over the next couple days.

Posted Tue Oct 15 20:16:31 2013

Fixed a lot of bugs in the assistant's fsck handling today, and merged it into master. There are some enhancements that could be added to it, including fscking ssh remotes via git-annex-shell and adding the ability to schedule events to run every 30 days instead of on a specific day of the month. But enough on this feature for now.

Today's work was sponsored by Daniel Brockman.

Posted Mon Oct 14 20:33:23 2013

Built everything needed to run a fsck when a remote gets connected. Have not tested it; only testing is blocking merging the incrementalfsck branch now.

Also updated the OSX and Android builds to use a new gpg release (denial of service security fix), and updated the Debian backport, and did a small amount of bug fixing. I need to do several more days of bug fixing once I get this incremental fsck feature wrapped up before moving on to recovery of corrupt git repositories.

Posted Sun Oct 13 21:22:31 2013

Last night, built this nice user interface for configuring periodic fscks:

Rather happy that that whole UI needed only 140 lines of code to build. Though rather more work behind it, as seen in this blog..

Today I added some support to git-annex for smart fscking of remotes. So far only git repos on local drives, but this should get extended to git-annex-shell for ssh remotes. The assistant can also run periodic fscks of these.

Still need to test that, and find a way to make a removable drive's fsck job run when the drive gets plugged in. That's where picking "any time" will be useful; it'll let you configure fscking of removable drives when they're available, as long as they have not been fscked too recently.


Today's work was sponsored by Georg Bauer.

Posted Fri Oct 11 21:35:54 2013

Some neat stuff is coming up, but today was a pretty blah day for me. I did get the Cronner tested and working (only had a few little bugs). But I got stuck for quite a while making the Cronner stop git-annex fsck processes it was running when their jobs get removed. I had some code to do this that worked when run standalone, but not when run from git-annex.

After considerable head-scratching, I found out this was due to forkProcess masking async exceptions, which seems likely to be a bug. Luckily was able to work around it. Async exceptions continue to strike me as the worst part of the worst part of Haskell (the worst part being exceptions in general).

Was more productive after that.. Got the assistant to automatically queue re-downloads of any files that fsck throws out due to having bad contents, and made the webapp display an alert while fscking is running, which will go to the page to configure fsck schedules. Now all I need to do is build the UI of that page.

Posted Fri Oct 11 04:45:46 2013

Lots of progress from yesterday's modest start of building data types for scheduling. Last night I wrote the hairy calendar code to calculate when next to run a scheduled event. (This is actually quite superior to cron, which checks every second to see if it should run each event!) Today I built a "Cronner" thread that handles spawning threads to handle each scheduled event. It even notices when changes have been made to its schedule and stops/starts event threads appropriately.
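Calculating the next run time, instead of polling, can be shown for the simplest case of a daily event. The sketch is Python, not the Haskell calendar code, and it covers only "every day at HH:MM"; the function names are made up.

```python
from datetime import datetime, time, timedelta

def next_daily_run(now, at):
    """Return the next datetime strictly after `now` that falls at time `at`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

def seconds_until(now, at):
    """How long a scheduler thread could sleep before the event is due."""
    return (next_daily_run(now, at) - now).total_seconds()
```

A thread can sleep for seconds_until(...) and then run the event, with no periodic wakeups in between.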

Everything is hooked up, building, and there's a good chance it works without too many bugs, but while I've tested all the pure code (mostly automatically with quickcheck properties), I have not run the Cronner thread at all. And there is some tricky stuff in there, like noticing that the machine was asleep past when it expected to wake up, and deciding if it should still run a scheduled event, or should wait until next time. So tomorrow we'll see..

Today's work was sponsored by Ethan Aubin.

Posted Tue Oct 8 22:15:36 2013

Spent most of the day building some generic types for scheduling recurring events. Not sure if rolling my own was a good idea, but that's what I did.

In the incrementalfsck branch, I have hooked this up in git-annex vicfg, which now accepts and parses scheduled events like "fsck self every day at any time for 60 minutes" and "fsck self on day 1 of weeks divisible by 2 at 3:45 for 120 minutes", and stores them in the git-annex branch. The exact syntax is of course subject to change, but also doesn't matter a whole lot since the webapp will have a better interface.
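For a feel of what parsing such a line involves, here is a sketch that handles only the outer shape of the two examples above. It is Python with a regular expression, not the vicfg parser, and it leaves the recurrence part as uninterpreted text. The pattern and the dict layout are assumptions, and the real syntax may differ.

```python
import re

SCHEDULE = re.compile(
    r"^fsck (?P<target>\S+) (?P<recurrence>.+) "
    r"at (?P<time>any time|\d{1,2}:\d{2}) "
    r"for (?P<minutes>\d+) minutes$"
)

def parse_schedule(text):
    """Parse 'fsck TARGET RECURRENCE at TIME for N minutes' into a dict.

    The time is None for 'any time', else an (hour, minute) tuple.
    Raises ValueError if the text does not have the expected shape.
    """
    m = SCHEDULE.match(text.strip())
    if m is None:
        raise ValueError("unparseable schedule: " + text)
    if m["time"] == "any time":
        time_of_day = None
    else:
        hour, minute = m["time"].split(":")
        time_of_day = (int(hour), int(minute))
    return {
        "target": m["target"],
        "recurrence": m["recurrence"],
        "time": time_of_day,
        "duration_minutes": int(m["minutes"]),
    }
```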

Posted Tue Oct 8 03:58:26 2013

Finished up the automatic recovery from stale lock files. Turns out git has quite a few lock files; the assistant handles them all.

Improved URL and WORM keys so the filenames used for them will always work on FAT (which has a crazy assortment of illegal characters). This is a tricky thing to deal with without breaking backwards compatibility, so it's only dealt with when creating new URL or WORM keys.
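One way to make a name safe for FAT is to escape the characters it rejects. The sketch below is Python and uses a percent-escaping scheme picked for the example; it is not the encoding git-annex actually uses for keys, and fat_safe is a hypothetical name.

```python
# Characters that FAT filenames cannot contain (control characters
# below 32 are handled separately in the loop).
FAT_ILLEGAL = set('"*/:<>?\\|')

def fat_safe(name):
    """Escape characters that FAT rejects, as %XX with uppercase hex.

    '%' itself is escaped too, so that the mapping stays reversible.
    """
    out = []
    for ch in name:
        if ch in FAT_ILLEGAL or ch == "%" or ord(ch) < 32:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)
```

Applying an escape like this only to newly created keys leaves existing key names unchanged, which is the backwards compatibility concern mentioned above.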


I think my next step in this disaster recovery themed month will be adding periodic incremental fsck to the assistant. git annex fsck can already do an incremental fsck, so this should mostly involve adding a user interface to the webapp to configure when it should fsck. For example, you might choose to run it for up to 1 hour every night, with a goal of checking all your files once per month. Also will need to make the assistant do something useful when fsck finds a bad file (ie, queue a re-download).

Posted Sat Oct 5 21:26:17 2013

Started the day by getting the builds updated for yesterday's release. This included making it possible to build git-annex with Debian stable's version of cryptohash. Also updated the Debian stable backport to the previous release.


The roadmap has this month devoted to improving git-annex's support for recovering from disasters, broken repos, and so on. Today I've been working on the first thing on the list, stale git index lock files.

It's unfortunate that git uses simple files for locking, and does not use fcntl or flock to prevent the stale lock file problem. Perhaps they want it to work on broken NFS systems? The problem with that line of thinking is that it means all non-broken systems end up broken by stale lock files. Not a good tradeoff IMHO.

There are actually two lock files that can end up stale when using git-annex; both .git/index.lock and .git/annex/index.lock. Today I concentrated on the latter, because I saw a way to prevent it from ever being a problem. All updates to that index file are done by git-annex when committing to the git-annex branch. git-annex already uses fcntl locking when manipulating its journal. So, that can be extended to also cover committing to the git-annex branch, and then the git index.lock file is irrelevant, and can just be removed if it exists when a commit is started.
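The reason a kernel-managed lock cannot go stale is that it disappears when the file is closed or the process exits, even though the lock file itself stays on disk. Here is a small sketch of that behavior, in Python and using flock for the demonstration (the journal locking described above uses fcntl locks, which differ in detail). It is Unix-only, and try_lock is a hypothetical helper.

```python
import fcntl

def try_lock(path):
    """Try to take an exclusive lock on path without blocking.

    Returns the open file holding the lock, or None if someone else
    holds it. Closing the file (or exiting) releases the lock.
    """
    f = open(path, "a")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f
```

With a plain lock file, the file's existence is the lock, so a crashed process leaves it behind and every later writer is blocked.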

To ensure this makes sense, I used the type system to prove that the journal locking was in effect everywhere I need it to be. Very happy I was able to do that, although I am very much a novice at using the type system for interesting proofs. But doing so made it very easy to build up to a point where I could unlink the .git/annex/index.lock and be sure it was safe to do that.


What about stale .git/index.lock files? I don't think it's appropriate for git-annex to generally recover from those, because it would change regular git command line behavior, and risks breaking something. However, I do want the assistant to be able to recover if such a file exists when it is starting up, since that would prevent it from running. Implemented that also today, although I am less happy with the way the assistant detects when this lock file is stale, which is somewhat heuristic (but should work even on networked filesystems with multiple writing machines).


Today's work was sponsored by Torbjørn Thorsen.

Posted Thu Oct 3 20:58:40 2013

Did I say it would be easy to make the webapp detect when a gcrypt repository already existed and enable it? Well, it wasn't exactly hard, but it took over 300 lines of code and 3 hours..

So, gcrypt support is done for now. The glaring omission is gpg key management for sharing gcrypt repositories between machines and/or people. But despite that, I think it's solid, and easy to use, and covers some great use cases.

Pushed out a release.

Now I really need to start thinking about disaster recovery.


Today's work was sponsored by Dominik Wagenknecht.

Posted Wed Oct 2 20:13:45 2013

Long day, but I did finally finish up with gcrypt support. More or less.

Got both creating and enabling existing gcrypt repositories on ssh servers working in the webapp. (But I ran out of time to make it detect when the user is manually entering a gcrypt repo that already exists. Should be easy so maybe tomorrow.)

Fixed several bugs in git-annex's gcrypt support that turned up in testing. Made git-annex ensure that a gcrypt repository does not have receive.denyNonFastForwards set, because gcrypt relies on always forcing the push of the branch it stores its manifest on. Fixed a bug in git-annex-shell recvkey when it was receiving a file from an annex in direct mode.

Also had to add a new git-annex-shell gcryptsetup command, which is needed to make setting up a gcrypt repository work when the assistant has set up a locked-down ssh key that can only run git-annex-shell. Painted myself into a bit of a corner there.

And tested, tested, tested. So many possibilities and edge cases in this part of the code..


Today's work was sponsored by Hendrik Müller Hofstede.

Posted Tue Oct 1 23:21:47 2013

So close to being done with gcrypt support.. But still not quite there.

Today I made the UI changes to support gcrypt when setting up a repository on a ssh server, and improved the probing and data types so it can tell which options the server supports. Fairly happy with how that is turning out.

Have not yet hooked up the new buttons to make gcrypt repos. While I was testing that my changes didn't break other stuff, I found a bug in the webapp that caused it to sometimes fail to transfer one file to/from a remote that was just added, because the transferrer process didn't know about the new remote yet, and crashed (and was restarted knowing about it, so successfully sent any other files). So got sidetracked on fixing that.

Also did some work to make the gpg bundled with git-annex on OSX be compatible with the config files written by MacGPG. At first I was going to hack it to not crash on the options it didn't support, but it turned out that upgrading to version 1.4.14 actually fixed the problem that was making it build without support for DNS.


Today's work was sponsored by Thomas Hochstein.

Posted Sun Sep 29 20:35:34 2013

Worked on making the assistant able to merge in existing encrypted git repositories from rsync.net.

This had two parts. First, making the webapp UI where you click to enable a known special remote work with these encrypted repos. Secondly, handling the case where a user knows they have an encrypted repository on rsync.net, so enters in its hostname and path, but git-annex doesn't know about that special remote. The second case is important, for example, when the encrypted repository is a backup and you're restoring from it. It wouldn't do for the assistant, in that case, to make a new encrypted repo and push it over top of your backup!

Handling that was a neat trick. It has to do quite a lot of probing, including downloading the whole encrypted git repo so it can decrypt it and merge it, to find out about the special remote configuration used for it. This all works with just 2 ssh connections, and only 1 ssh password prompt max.

Next, on to generalizing this rsync.net specific code to work with arbitrary ssh servers!


Today's work was made possible by RMS's vision 30 years ago.

Posted Fri Sep 27 20:36:58 2013

Being still a little unsure of the UI and complexity for configuring gcrypt on ssh servers, I thought I'd start today with the special case of gcrypt on rsync.net. Since rsync.net allows running some git commands, gcrypt can be used to make encrypted git repositories on it.

Here's the UI I came up with. It's complicated a bit by needing to explain the tradeoffs between the rsync and gcrypt special remotes.

This works fine, but I did not get a chance to add support for enabling existing gcrypt repos on rsync.net. Anyway, most of the changes to make this work will also make it easier to add general support for gcrypt on ssh servers.

Also spent a while fixing a bug in git-remote-gcrypt. Oddly gpg --list-keys --fast-list --fingerprint does not show the fingerprints of some keys.

Today's work was sponsored by Cloudier - Thomas Djärv.

Posted Fri Sep 27 05:03:50 2013

Did various bug fixes and followup today. Amazing how a day can vanish that way. Made 4 actual improvements.

I still have 46 messages in unanswered backlog. Although only 8 of them are from this month.

Posted Wed Sep 25 20:13:08 2013

Added support for gcrypt remotes to git-annex-shell. Now gcrypt special remotes probe when they are set up to see if the remote system has a suitable git-annex-shell, and if so all commands are sent to it. Kept the direct rsync mode working as a fallback.

It turns out I made a bad decision when first adding gcrypt support to git-annex. To make implementation marginally easier, I decided to not put objects inside the usual annex/objects directory in a gcrypt remote. But that lack of consistency would have made adding support to git-annex-shell a lot harder. So, I decided to change this. Which means that anyone already using gcrypt with git-annex will need to manually move files around.

Today's work was sponsored by Tobias Nix.

Posted Tue Sep 24 21:51:12 2013

Finished moving the Android autobuilder over to the new clean build environment. Tested the Android app, and it still works. Whew!

There's a small chance that the issue with the Android app not working on Android 4.3 has been fixed by this rebuild. I doubt it, but perhaps someone can download the daily build and give it another try..


I have 7 days left in which I'd like to get remote gcrypt repositories working in the assistant. I think that should be fairly easy, but a prerequisite for it is making git-annex-shell support being run on a gcrypt repository. That's needed so that the assistant's normal locked down ssh key setup can also be used for gcrypt repositories.

At the same time, not all gcrypt endpoints will have git-annex-shell installed, and it seems to make sense to leave in the existing support for running raw rsync and git push commands against such a repository. So that's going to add some complication.

It will also complicate git-annex-shell to support gcrypt repos. Basically, everything it does in git-annex repos will need to be reimplemented in gcrypt repositories. Generally in a more simple form; for example it doesn't need to (and can't) update location logs in a gcrypt repo.


I also need to find a good UI to present the three available choices (unencrypted git, encrypted git, encrypted rsync) when setting up a repo on a ssh server. I don't want to just remove the encrypted rsync option, because it's useful when using xmpp to sync the git repo, and is simpler to set up since it uses shared encryption rather than gpg public keys.

My current thought is to offer just 2 choices, encrypted and non-encrypted. If they choose encrypted, offer a choice of shared encryption or encrypting to a specific key. I think I can word this so it's pretty clear what the tradeoffs are.

Posted Mon Sep 23 20:32:11 2013

Made a release on Friday. But I had to rebuild the OSX and Linux standalone builds today to fix a bug in them.

Spent the past three days redoing the whole Android build environment. I've been progressively moving from my first hacked up Android build env to something more reproducible and sane. Finally I am at the point where I can run a shell script (well, actually, 3 shell scripts) and get an Android build chroot. It's still not immune to breaking when new versions of haskell libs are uploaded, but this is much better, and should be maintainable going forward.

This is a good starting point for getting git-annex into the F-Droid app store, or for trying to build with a newer version of the Android SDK and NDK, to perhaps get it working on Android 4.3. (Eventually. I am so sick of building Android stuff right now..)

Friday was all spent struggling to get ghc-android to build. I had not built it successfully since February. I finally did, on Saturday, and I have made my own fork of it which builds using a known-good snapshot of the current development version of ghc. Building this in a Debian stable chroot means that there should be no possibility that upstream changes will break the build again.

With ghc built, I moved on to building all the haskell libs git-annex needs. Unfortunately my build script for these also has stopped working since I made it in April. I failed to pin every package at a defined version, and things broke.

So, I redid the build script, and updated all the haskell libs to the newest versions while I was at it. I have decided not to pin the library versions (at least until I find a foolproof way to do it), so this new script will break in the future, but it should break in a way I can fix up easily by just refreshing a patch.

The new ghc-android build has a nice feature of at least being able to compile Template Haskell code (though still not run it at compile time). This made the patching needed in the Haskell libs quite a lot less. Offset somewhat by me needing to make general fixes to lots of libs to build with ghc head. Including some fun with ==# changing its type from Bool to Int#. In all, I think I removed around 2.5 thousand lines of patches! (Only 6 thousand lines to go...)

Today I improved ghc-android some more so it cross builds several C libraries that are needed to build several haskell libraries needed for XMPP. I had only ever built those once, and done it by hand, and very hackishly. Now they all build automatically too.

And, I put together a script that builds the debian stable chroot and installs ghc-android.

And, I hacked on the EvilSplicer (which is sadly still needed) to work with the new ghc/yesod/etc.

At this point, I have git-annex successfully building, including the APK!


In a bored hour waiting for a compile, I also sped up git annex add on OSX by I think a factor of 10. Using cryptohash for hash calculation now, when external hash programs are not available. It's still a few percentage points slower than external hash programs, or I'd use it by default.
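
The preference order described here (external hash program when available, library implementation otherwise) can be sketched roughly as follows. This is Python with hashlib standing in for cryptohash, and the choice of SHA256 and sha256sum is an assumption made for the example.

```python
import hashlib
import shutil
import subprocess


def sha256_file(path, chunk_size=1 << 20):
    """Hash in-process, reading in chunks so large files are not held in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hash_file(path):
    """Prefer an external sha256sum if one is installed, else use the library."""
    tool = shutil.which("sha256sum")
    if tool is not None:
        out = subprocess.run([tool, path], check=True,
                             capture_output=True, text=True).stdout
        # Output is "<hex>  <name>"; a leading backslash marks escaped names.
        return out.split()[0].lstrip("\\")
    return sha256_file(path)
```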


This period of important drudgery was sponsored by an unknown bitcoin user, and by Bradley Unterrheiner and Andreas Olsson.

Posted Mon Sep 23 02:55:40 2013

Spent a few hours improving gcrypt in some minor ways, including adding a --check option that the assistant can use to find out if a given repo is encrypted with gcrypt, and also tell if the necessary gpg key is available to decrypt it. Also merged in a fix to support subkeys, developed by a git-annex user who is the first person I've heard from who is using gcrypt. I don't want to maintain gcrypt, so I am glad its author has shown up again today.

Got mostly caught up on backlog. The main bug I was able to track down today is git-annex using a lot of memory in certain repositories. This turns out to have happened when a really large file was committed right into the git repository (by mistake or on purpose). Some parts of git-annex buffer file contents in memory while trying to work out if they're git-annex keys. Fixed by making it first check if a file in git is marked as a symlink. Which was really hard to do!

At least 4 people ran into this bug, which makes me suspect that lots of people are messing up when using direct mode (probably due to not reading the documentation, or having git commit -a hardwired into their fingers), and forcing git to commit large files into their repos, rather than having git-annex manage them. Implementing direct mode guard seems more urgent now.


Today's work was sponsored by Amitai Schlair.

Posted Thu Sep 19 21:10:49 2013

Spent basically all of today getting the assistant to be able to handle gcrypt special remotes that already exist when it's told to add a USB drive. This was quite tricky! And I did have to skip handling gcrypt repos that are not git-annex special remotes.

Anyway, it's now almost easy to set up an encrypted sneakernet using a USB drive and some computers running the webapp. The only part that the assistant doesn't help with is gpg key management.

Plan is to make a release on Friday, and then try to also add support for encrypted git repositories on remote servers. Tomorrow I will try to get through some of the communications backlog that has been piling up while I was head down working on gcrypt.

Posted Wed Sep 18 20:11:59 2013

I decided to keep gpg key generation very simple for now. So it generates a special-purpose key that is only intended to be used by git-annex. It hardcodes some key parameters, like RSA and 4096 bits (maximum recommended by gpg at this time). And there is no password on the key, although you can of course edit it and set one. This is because anyone who can access the computer to get the key can also look at the files in your git-annex repository. Also because I can't rely on gpg-agent being installed everywhere. All these simplifying assumptions may be revisited later, but are enough for now for someone who doesn't know about gpg (so doesn't have a key already) and just wants an encrypted repo on a removable drive.
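
For readers unfamiliar with unattended gpg key generation, the hardcoded parameters above might translate into a batch parameter file along these lines. This is only a sketch: the function name and key name are made up, and the %no-protection line is the GnuPG 2.1+ way to ask for a key without a passphrase (gpg 1.4 may need that line left out), so treat the exact contents as an assumption rather than what git-annex writes.

```python
def genkey_params(name, key_type="RSA", key_length=4096):
    """Build a parameter file for `gpg --batch --gen-key` (illustrative only)."""
    lines = [
        "Key-Type: %s" % key_type,
        "Key-Length: %d" % key_length,
        "Name-Real: %s" % name,
        "Expire-Date: 0",    # the key never expires
        "%no-protection",    # no passphrase; GnuPG 2.1+ control statement
        "%commit",           # generate the key described above
    ]
    return "\n".join(lines) + "\n"
```

The resulting text would be fed to gpg on stdin or from a file. Keeping the parameters in one place makes it easy to revisit the simplifying assumptions later.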

Put together a simple UI to deal with gpg taking quite a while to generate a key ...

genkey.png

repoinfo.png

Then I had to patch git-remote-gcrypt again, to have a per-remote signingkey setting, so that these special-purpose keys get used for signing their repo.

Next, need to add support for adding an existing gcrypt repo as a remote (assuming it's encrypted to an available key). Then, gcrypt repos on ssh servers..


Also dealt with build breakage caused by a new version of the Haskell DNS library.


Today's work was sponsored by Joseph Liu.

Posted Wed Sep 18 00:08:57 2013

Now the webapp can set up encrypted repositories on removable drives.

encryptdrive.png

This UI needs some work, and the button to create a new key is not wired up. Also if you have no gpg agent installed, there will be lots of password prompts at the console.

Forked git-remote-gcrypt to fix a bug. Hopefully my patch will be merged; for now I recommend installing my forked version.

Today's work was sponsored by Romain Lenglet.

Posted Mon Sep 16 21:00:18 2013

Fixed a typo that broke automatic youtube video support in addurl.


Now there's an easy way to get an overview of how close your repository is to meeting the configured numcopies settings (or when it exceeds them).

# time git annex status . 
[...]
numcopies stats: 
    numcopies +0: 6686
    numcopies +1: 3793
    numcopies +3: 3156
    numcopies +2: 2743
    numcopies -1: 1242
    numcopies -4: 1098
    numcopies -3: 1009
    numcopies +4: 372

This does make git annex status slow when run on a large directory tree, so --fast disables that.
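
The tally behind that listing is simple to sketch. The Python below is an illustration, not the actual implementation: it assumes a mapping from file name to the number of known copies, and counts how many files sit at each distance from the configured numcopies.

```python
from collections import Counter


def numcopies_stats(copies_per_file, numcopies):
    """Count files by how far their copy count is from the numcopies setting."""
    return Counter(have - numcopies for have in copies_per_file.values())


def format_stats(stats):
    """Render lines like the listing above, most common variance first."""
    return ["numcopies %+d: %d" % (delta, count)
            for delta, count in stats.most_common()]
```

Working out the copy count of every file is the expensive part on a large tree, which is why --fast skips it.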

Posted Sun Sep 15 23:36:52 2013

Worked to get git-remote-gcrypt included in every git-annex autobuild bundle. (Except Windows; running a shell script there may need some work later..)

Next I want to work on making the assistant easily able to create encrypted git repositories on removable drives. Which will involve a UI to select which gpg key to use, or creating (and backing up!) a gpg key.

But, I got distracted chasing down some bugs on Windows. These were quite ugly; more direct mode mapping breakage which resulted in files not being accessible. Also fsck on Windows failed to detect and fix the problem. All fixed now. (If you use git-annex on Windows, you should certainly upgrade and run git annex fsck.)

As with most bugs in the Windows port, the underlying cause turned out to be stupid: isSymlink always returned False on Windows. Which makes sense from the perspective of Windows not quite having anything entirely like symlinks. But failed when that was being used to detect when files in the git tree being merged into the repository had the symlink bit set..
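
The check that matters here can be made on the mode recorded in the git tree rather than by asking the filesystem. A minimal Python sketch, assuming input lines in the format printed by git ls-tree (mode, type, sha, then a tab and the path); the function names are invented for the example.

```python
SYMLINK_MODE = 0o120000   # mode git records for a symlink entry
FILE_TYPE_MASK = 0o170000


def parse_ls_tree_line(line):
    """Split a '<mode> <type> <sha>\\t<path>' line into its fields."""
    meta, path = line.split("\t", 1)
    mode, objtype, sha = meta.split(" ")
    return int(mode, 8), objtype, sha, path


def is_git_symlink(mode):
    """True when the tree entry's mode marks it as a symlink."""
    return (mode & FILE_TYPE_MASK) == SYMLINK_MODE
```

Because this only looks at git's metadata, it gives the same answer on a filesystem with no real symlinks.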

Did bug triage. Backlog down to 32 (mostly messages from August).

Posted Fri Sep 13 20:09:20 2013

I've been out sick. However, some things kept happening. Mesar contributed a build host, and the linux and android builds are now happening, hourly, there. (Thanks as well to the two other people who also offered hosting.) And I made a minor release to fix a bug in the test suite that I was pleased three different people reported.

Today, my main work was getting git-annex to notice when a gcrypt remote located on some removable drive mount point is not the same gcrypt remote that was mounted there before. I was able to finesse this so it re-configures things to use the new gcrypt remote, as long as it's a special remote it knows about. (Otherwise it has to ignore the remote.) So, encrypted repos on removable drives will work just as well as non-encrypted repos!

Also spent a while with rsync.net tech support trying to work out why someone's git-annex apparently opened a lot of concurrent ssh connections to rsync.net. Have not been able to reproduce the problem though.

Also, a lot of catch-up to traffic. Still 63 messages backlogged however, and still not entirely well..

Posted Thu Sep 12 21:58:43 2013

Got git annex sync working with gcrypt. So went ahead and made a release today. Lots of nice new features!

Unfortunately the linux 64 bit daily build is failing, because my build host only has 2 gb of memory and it is no longer enough. I am looking for a new build host, ideally one that doesn't cost me $40/month for 3 gb of ram and 15 gb of disk. (Extra special ideally one that I can run multiple builds per day on, rather than the current situation of only building overnight to avoid loading the machine during the day.) Until this is sorted out, no new 64 bit linux builds..

Posted Mon Sep 9 19:37:10 2013

gcrypt is fully working now. Most of the examples in fully encrypted git repositories with gcrypt should work.

A few known problems:

  • git annex sync refuses to sync with gcrypt remotes. some url parsing issue.
  • Swapping two drives with gcrypt repositories on the same mount point doesn't work yet.
  • http urls are not supported

Posted Sun Sep 8 19:57:57 2013

About half way done with a gcrypt special remote. I can initremote it (the hard part to get working), and can send files to it. Can't yet get files back, or remove files, and only local repositories work so far, but this is enough to know it's going to be pretty nice!

Did find one issue in gcrypt that I may need to develop a patch for: https://github.com/blake2-ppc/git-remote-gcrypt/issues/3

Posted Sat Sep 7 23:10:26 2013

Woke up with a pretty solid plan for gcrypt. It will be structured as a separate special remote, so initremote will be needed, with a gitrepo= parameter (unless the remote already exists). git-annex will then set up the git remote, including pushing to it (needed to get a gcrypt-id).

Didn't feel up to implementing that today. Instead I unexpectedly spent the day doing mostly Windows work, including setting up a VM on my new laptop for development. Including a ssh server in Windows, so I can script local builds and tests on Windows without ever having to touch the desktop. Much better!

Posted Fri Sep 6 22:54:24 2013

Started work on gcrypt support.

The first question is, should git-annex leave it up to gcrypt to transport the data to the encrypted repository on a push/pull? gcrypt hooks into git nicely to make that just work. However, if I go this route, it limits the places the encrypted git repositories can be stored to regular git remotes (and rsync). The alternative is to somehow use gcrypt to generate/consume the data, but use the git-annex special remotes to store individual files. Which would allow for a git repo stored on S3, etc. For now, I am going with the simple option, but I have not ruled out trying to make the latter work. It seems it would need changes to gcrypt though.

Next question: Given a remote that uses gcrypt, how do I determine the annex.uuid of that repository? I found a nice solution to this. gcrypt has its own gcrypt-id, and I convert it to a UUID in a reproducible, and even standards-compliant way. So the same encrypted remote will automatically get the same annex.uuid wherever it's used. Nice. Does mean that git-annex cannot find a uuid until git pull or git push has been used, to let gcrypt get the gcrypt-id. Implemented that.
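
One standards-compliant way to get a reproducible UUID from an identifier is an RFC 4122 name-based (version 5) UUID. The Python sketch below assumes that approach; the namespace URL is a placeholder invented for the example, not the one git-annex uses.

```python
import uuid

# Hypothetical namespace, chosen only for this sketch.
GCRYPT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL,
                              "https://example.invalid/gcrypt-id")


def gcrypt_id_to_uuid(gcrypt_id):
    """Derive a stable UUID from a gcrypt-id via SHA-1 name-based hashing."""
    return str(uuid.uuid5(GCRYPT_NAMESPACE, gcrypt_id))
```

Since the result depends only on the namespace and the gcrypt-id, every clone that sees the same encrypted remote computes the same UUID without any coordination.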

The next step is actually making git-annex store data on gcrypt remotes. And it needs to store it encrypted of course. It seems best to avoid needing a git annex initremote for these gcrypt remotes, and just have git-annex automatically encrypt data stored on them. But I don't know. Without initializing them like a special remote is, I'm limited to using the gpg keys that gcrypt is configured to encrypt to, and cannot use the regular git-annex hybrid encryption scheme. Also, I need to generate and store a nonce anyway to HMAC encrypt keys. (Or modify gcrypt to put enough entropy in gcrypt-id that I can use it?)
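
To show why a stored random secret is needed, here is a small Python sketch of hiding key names with an HMAC. The digest choice (SHA-256), the secret length, and the sample key name are assumptions for illustration only.

```python
import hashlib
import hmac
import os


def new_secret(length=32):
    """Random secret generated once and stored with the remote's configuration."""
    return os.urandom(length)


def hmac_key_name(secret, key_name):
    """Map a key name to an opaque name that reveals nothing without the secret."""
    return hmac.new(secret, key_name.encode("utf-8"), hashlib.sha256).hexdigest()
```

Without the secret, an observer of the remote cannot tell which keys are stored there, or even confirm a guess; with too little entropy in the secret, guesses become checkable, which is the concern about reusing the gcrypt-id.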

Another concern I have is that gcrypt's own encryption scheme is simply to use a list of public keys to encrypt to. It would be nicer if the full set of git-annex encryption schemes could be used. Then the webapp could use shared encryption to avoid needing to make the user set up a gpg key, or hybrid encryption could be used to add keys later, etc.

But I see why gcrypt works the way it does. Otherwise, you can't make an encrypted repo with a friend set as one of the participants and have them be able to git clone it. Both hybrid and shared encryption store a secret inside the repo, which is not accessible if it's encrypted using that secret. There are use cases where not being able to blindly clone a gcrypt repo would be ok. For example, you use the assistant to pair with a friend and then set up an encrypted repo in the cloud for both of you to use.

Anyway, for now, I will need to deal with setting up gpg keys etc in the assistant. I don't want to tackle full gpgkeys yet. Instead, I think I will start by adding some simple stuff to the assistant:

  • When adding a USB drive, offer to encrypt the repository on the drive so that only you can see it.
  • When adding a ssh remote make a similar offer.
  • Add a UI to add an arbitrary git remote with encryption. Let the user paste in the url to an empty remote they have, which could be to eg github. (In most cases this won't be used for annexed content..)
  • When the user has no gpg key, prompt to set one up. (Securely!)
  • Maybe have an interface to add another gpg key that can access the gcrypt repo. Note that this will need to re-encrypt and re-push the whole git history.

Posted Thu Sep 5 21:19:04 2013

Now I can build git-annex twice as fast! And a typical incremental build is down to 10 seconds, from 51 seconds.

Spent a productive evening working with Guilhem to get his encryption patches reviewed and merged. Now there is a way to remove revoked gpg keys, and there is a new encryption scheme available that uses public key encryption by default rather than git-annex's usual approach. That's not for everyone, but it is a good option to have available.

Posted Thu Sep 5 04:10:43 2013

I try hard to keep this devblog about git-annex development and not me. However, it is a shame that what I wanted to be the beginning of my first real month of work funded by the new campaign has been marred by my home's internet connection being taken out by a lightning strike, and by illness. Nearly back on my feet after that, and waiting for my new laptop to finally get here.

Today's work: Finished up the git annex forget feature and merged it in. Fixed the bug that was causing the commit race detection code to incorrectly fire on the commit made by the transition code. Few other bits and pieces.

Posted Tue Sep 3 20:59:33 2013

Implemented git annex forget --drop-dead, which is finally a way to remove all references to old repositories that you've marked as dead.

I've still not merged in the forget branch, because I developed this while slightly ill, and have not tested it very well yet.

Posted Sat Aug 31 22:25:23 2013

John Millikin came through and fixed that haskell-gnutls segfault on OSX that I developed a reproducible test case for the other day. It's a bit hard to test, since the bug doesn't always happen, but the fix is already deployed for the Mountain Lion autobuilder.

However, I then found another way to make haskell-gnutls segfault, more reliably on OSX, and even sometimes on Linux. Just entering the wrong XMPP password in the assistant can trigger this crash. Hopefully John will work his magic again.


Meanwhile, I fixed the sync-after-forget problem. Now sync always forces its push of the git-annex branch (as does the assistant). I considered but rejected having sync do the kind of uuid-tagged branch push that the assistant sometimes falls back to if it's failing to do a normal sync. It's ugly, but worse, it wouldn't work in the workflow where multiple clients are syncing to a central bare repository, because they'd not pull down the hidden uuid-tagged branches, and without the assistant running on the repository, nothing would ever merge their data into the git-annex branch. Forcing the push of synced/git-annex was easy, once I satisfied myself that it was always ok to do so.

Also factored out a module that knows about all the different log files stored on the git-annex branch, which is all the support infrastructure that will be needed to make git annex forget --drop-dead work. Since this is basically a routing module, perhaps I'll get around to making it use a nice bidirectional routing library like Zwaluw one day.

Posted Fri Aug 30 00:28:45 2013

Yesterday I spent making a release, and shopping for a new laptop, since this one is dying. (Soon I'll be able to compile git-annex fast-ish! Yay!) And thinking about wishlist: dropping git-annex history.

Today, I added the git annex forget command. It's currently been lightly tested, seems to work, and is living in the forget branch until I gain confidence with it. It should be perfectly safe to use, even if it's buggy, because you can use git reflog git-annex to pull out and revert to an old version of your git-annex branch. So if you've been wanting this feature, please beta test!


I actually implemented something more generic than just forgetting git history. There's now a whole mechanism for git-annex doing distributed transitions of whatever sort is needed.

There were several subtleties involved in distributed transitions:

First is how to tell when a given transition has already been done on a branch. At first I was thinking that the transition log should include the sha of the first commit on the old branch that got rewritten. However, that would mean that after a single transition had been done, every git-annex branch merge would need to look up the first commit of the current branch, to see if it's done the transition yet. That's slow! Instead, transitions are logged with a timestamp, and as long as a branch contains a transition with the same timestamp, it's been done.
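
The timestamp check amounts to a set comparison. A minimal Python sketch, assuming a made-up log format of one "name timestamp" pair per line (the real transition log format may differ):

```python
def parse_transitions(log_text):
    """Parse lines of the form '<name> <timestamp>' into a set of pairs."""
    done = set()
    for line in log_text.splitlines():
        if line.strip():
            name, timestamp = line.split()
            done.add((name, timestamp))
    return done


def pending_transitions(local_log, other_log):
    """Transitions recorded on the other branch that the local branch lacks."""
    return parse_transitions(other_log) - parse_transitions(local_log)
```

No history walk is needed: comparing the two logs is enough to tell whether a merge would bring in a transition that still has to be performed locally.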

A really tricky problem is what to do if the local repository has transitioned, but a remote has not, and changes keep being made to the remote. What it does so far is incorporate the changes from the remote into the index, and re-run the transition code over the whole thing to yield a single new commit. This might not be very efficient (once I write the more full-featured transition code), but it lets the local repo keep up with what's going on in the remote, without directly merging with it (which would revert the transition). And once the remote repository has its git-annex upgraded to one that knows about transitions, it will finish up the transition on its side automatically, and the two branches will once again merge.

Related to the previous problem, we don't want to keep trying to merge from a remote branch when it's not yet transitioned. So a blacklist is used, of untransitioned commits that have already been integrated.
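
Put together, handling an untransitioned remote might look something like the Python sketch below. The function and its callback parameters are hypothetical stand-ins for the real index merge and transition code.

```python
def integrate_untransitioned(remote_commits, blacklist,
                             merge_into_index, rerun_transition):
    """Fold new untransitioned commits into the index, then redo the transition.

    Returns True if anything new was integrated.
    """
    new = [sha for sha in remote_commits if sha not in blacklist]
    if not new:
        return False  # everything on offer was already integrated
    for sha in new:
        merge_into_index(sha)   # incorporate the remote's changes
    rerun_transition()          # one new commit over the combined state
    blacklist.update(new)       # never integrate these commits again
    return True
```

The blacklist keeps repeated syncs with a still-untransitioned remote cheap: only commits it has not seen before trigger another pass of the transition code.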

One really subtle thing is that when the user does a transition more complicated than git annex forget, like the git annex forget --dead that I need to implement to forget dead remotes, they're not just telling git-annex to forget whatever dead remotes it knows right now. They're actually telling git-annex to perform the transition one time on every existing clone of the repository, at some point in the future. Repositories with unfinished transitions could hang around for years, and at some future point when git-annex runs in the repository again, it would merge in the current state of the world, and re-do the transition. So you might tell it to forget dead remotes today, and then the very repository you ran that in later becomes dead, and a long-slumbering repo wakes up and forgets about the repo that started the whole process! I hope users don't find this massively confusing, but that's how the implementation works right now.


I think I have at least two more days of work to do to finish up this feature.

  • I still need to add some extra features like forgetting about dead remotes, and forgetting about keys that are no longer present on any remote.

  • After git annex forget, git annex sync will fail to push the synced/git-annex branch to remotes, since the branch is no longer a fast-forward of the old one. I will probably fix this by making git annex sync do a fallback push of a unique branch in this case, like the assistant already does. Although I may need to adjust that code to handle this case, too..

  • For some reason the automatic transitioning code triggers a "(recovery from race)" commit. This is certainly a bug somewhere, because you can't have a race with only 1 participant.


Today's work was sponsored by Richard Hartmann.

Posted Wed Aug 28 21:41:55 2013

I've started a new page for my devblog, since I'm not focusing exclusively on the assistant and so keeping the blog here increasingly felt wrong. Also, my new year of crowdfunded development formally starts in September, so a new blog seemed good.

Posted Wed Aug 28 21:40:09 2013