Joey blogs about his work here on a semi-daily basis.

Today I'm releasing git-annex 10.20230626. This release got delayed for 2 months due to making some breaking changes in how filenames with unusual characters are quoted. So it has an unusual number of changes in it, and is as major as a git-annex release gets without being a repository version bump.

But I mostly wanted to announce that we are planning a joint git-annex/Datalad meeting! In the first half of 2024, probably in Germany. If you would be interested in attending, please fill out this brief survey. While Datalad is aimed at scientists, and I look forward to spending time with them, you do not have to be a scientist to attend; any interested users are welcome.

Also, if you'd like to learn about git-annex in Germany this weekend, Yann Büchau is hosting a workshop.

Posted Mon Jun 26 15:53:43 2023

Importing trees from special remotes still feels a bit like a new feature, although it was added to git-annex in 2019. I don't know if many people are using it. I've had some complaints about it being slow when the remote contains a large number of files (eg 100 thousand).

I've just finished speeding up repeated imports from a special remote a lot, when the special remote contains a large number of files, and few or no files have changed.

git-annex was spending a lot of time converting content identifiers to keys. Each conversion took a database lookup, which was slow enough to become painful in bulk.

I thought of a neat trick. Take the sha1 of a content identifier, and create a git tree of the files in the special remote, using those sha1s as the content of the files. Of course, that is not the actual content of any file that git knows about. But it doesn't matter, because once git-annex has those trees, it can diff the current tree to the tree from the previous import. And that tells it which files have changed. Then it only has to do database lookups for the changed files.
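
A minimal sketch of that trick, with plain dictionaries standing in for git trees and a dictionary comparison standing in for git diff. The function names and the lookup_key callback are hypothetical, for illustration only, and are not git-annex's actual code.

```python
import hashlib

def placeholder_tree(content_ids):
    """Map each remote path to the sha1 of its content identifier.

    The hash is not the hash of any real file content; it only needs to
    change whenever the content identifier changes.
    """
    return {path: hashlib.sha1(cid.encode()).hexdigest()
            for path, cid in content_ids.items()}

def changed_paths(previous, current):
    """Paths that were added, removed, or whose placeholder hash differs."""
    return sorted(path for path in previous.keys() | current.keys()
                  if previous.get(path) != current.get(path))

def import_changed(previous_cids, current_cids, lookup_key):
    """Run the expensive lookup only for paths that changed since last import."""
    previous = placeholder_tree(previous_cids)
    current = placeholder_tree(current_cids)
    return {path: lookup_key(current_cids[path])
            for path in changed_paths(previous, current)
            if path in current_cids}  # removed paths need no lookup
```

With 100 thousand files and a handful of changes, only that handful reaches lookup_key, which is where the speedup comes from.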

This turned out to be one of the best results I've ever gotten from a git-annex optimisation. It runs 60x faster or more with more files!

The moral is that git is really good at diffing trees fast, and so it's worth using git diff whenever possible, even if the thing being diffed is not a regular tree of files.

This work was sponsored by Mark Reidenbach and Lawrence Brogan on Patreon

Posted Thu Jun 1 22:43:33 2023

Last weekend I watched a talk "Houdini of the Terminal: The need for escaping" which shows several recent exploits of terminal emulators using escape sequences. It was eye opening that security holes like that are still being found, and also how severe some of the results can be. I was already familiar with escape sequences as a potential security hole, but it never seemed to make sense to have a program that was not a terminal emulator guard against them. This talk made me think it can make sense for some programs, as a defence in depth.

Now git does escape unusual characters when displaying filenames (most of the time). But git-annex never has. So it seems it would be a good idea to make git-annex follow git's lead on this. And git has a core.quotePath which can be used to make it not escape unicode characters, so git-annex should also support that.

Implementing that was not very easy, because there are a vast number of places where git-annex can display a filename. I had to check every error message and warning message and other output in the whole code base to find ones that displayed a filename. That took a while.

While doing that, I realized that there are some other ways a control character could be stored in the git repository that would cause git-annex to display it. It's possible for a git-annex key to have a control character in its name. And a few other things stored in the git-annex branch, like metadata, could also contain control characters.

I decided the best way to deal with those is not with some complex escaping, but just by filtering out the control characters on output. In fact, git-annex now filters out control characters in basically all its output. The exceptions are some cases where filtering is not done when it's outputting to a pipe, and commands like git-annex find that support --format, which only do escaping when requested by the format.
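
A minimal sketch of filtering control characters on output. The function name is hypothetical, and keeping newline and tab is an assumption made here so that ordinary multi-line output survives; git-annex's actual rules may differ.

```python
import unicodedata

def filter_control_characters(text):
    """Drop Unicode control characters (category Cc) from text.

    Assumption: newline and tab are kept. Everything else in category Cc,
    including ESC (0x1b), which starts terminal escape sequences, is removed.
    """
    return "".join(ch for ch in text
                   if ch in "\n\t" or unicodedata.category(ch) != "Cc")
```

Without the ESC byte, the rest of an escape sequence is displayed as harmless literal text, and non-ASCII letters are left alone.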

By the way, it turns out that git will display control characters in the names of remotes or branches. Possibly in other situations too. (I do wonder if a git remote that uses control characters in a branch could be used to exploit a terminal emulator?) So git-annex has now gone further than git in this area.

The resulting diff is 6500 lines, and I don't consider this an actual security fix in git-annex, but only a hardening measure. So I won't be hurrying out the next release for this.

This work was sponsored by Jake Vosloo, unqueued, Graham Spencer, and Erik Bjäreholt on Patreon

Posted Wed Apr 12 19:03:10 2023

(Tap tap. Oh, this devblog is still on?)

View branches are a neat corner of git-annex that have remained kind of obscure since I implemented them back in 2014. Not many improvements have been made from back then until recently.

Today I implemented a longstanding todo, unifying view branches with adjusted branches. The result is that you can enter an adjusted branch from a view branch, or a view branch from an adjusted branch, and get what you would probably expect.

For example, to sort your annexed files into directories by author and year, and have all annexed files in the view be unlocked:

git-annex adjust --unlock
git-annex view author=* year=*

Earlier this month, I addressed probably the main missing feature of view branches, by making git-annex sync work in a view branch, updating it with metadata and files pulled in from remotes. Although there is room to make it faster still.

Also, view branches can be made that include files that lack metadata. Such files are put in a directory named "_". And can be moved out of there to other directories to set their metadata. For example:

git-annex view author?=*

Views combine nicely with graphical file managers, and Yann Büchau has recently built an integration with Thunar that supports most of these new features and can be seen in action in this screencast.

This work was sponsored by Lawrence Brogan, Erik Bjäreholt, and unqueued on Patreon

Posted Mon Feb 27 20:28:28 2023

Last Thursday I implemented git-annex filter-process, which you can try enabling to make commands like git add and git checkout faster when they operate on a lot of files.

git config filter.annex.process 'git-annex filter-process'

On Friday, I benchmarked it and was not surprised to find that it's slower in some cases than the old smudge/clean filter interface, and faster in other cases. Still, good to see actual numbers (see 054c803f8d7cc43eb01fdf6141ab6572373c7d60). The surprising good news is that it only seems to make git add around 10% slower when adding a large file (to the annex presumably). Although I know I can speed that up, eventually.

Today, I used the benchmark results to build a cost model into git-annex, so it knows when it would be faster to have filter.annex.process set or unset, and temporarily unsets it when that seems best. It can only do that when it's restaging pointer files, but that was the main problem with setting filter.annex.process really.

So I'm fairly close to wanting to enable it by default. But will probably just wait until whenever v9 happens and do it then. Hopefully some people will try it out in the meantime and perhaps I can refine the cost model.

This work was sponsored by Jake Vosloo, Graham Spencer, and Dr. Land Raider on Patreon

Posted Mon Nov 8 20:21:08 2021

Would you rather that git checkout got a lot faster at checking out a lot of files, and git add got a lot faster at adding a lot of small files, if the tradeoff was that git add and git commit -a got slower at adding large files to the annex than they are now?

Being able to make that choice is what I'm working on now. Of course, we'd rather it were all fast, but because git's smudge/clean interface is suboptimal, that is not possible without improvements to git. But I seem to have a plan that will work around enough of the problems to let that choice be made.

Today I've been laying the groundwork, by implementing git's pkt-line interface, and the long-running filter process protocol. Next step will be to add support for that in git-annex smudge, so that users who want to can enable it with:

git config filter.annex.process 'git-annex filter-process'

I can imagine that becoming enabled by default at some point in v9, if most users prefer it over the current method. Which would still be available by unsetting the config.
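
For readers unfamiliar with pkt-line, here is a minimal sketch of the framing that the long-running filter process protocol is built on. Each packet starts with four hex digits giving its total length, including those four digits, and "0000" is a flush packet. The function names are made up for this example, and only the flush packet is handled among the special packets.

```python
FLUSH = b"0000"      # flush packet: marks the end of a list of packets
MAX_PACKET = 65520   # largest total packet size, length header included

def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as one pkt-line."""
    length = len(payload) + 4
    if length > MAX_PACKET:
        raise ValueError("payload too large for one pkt-line")
    return b"%04x" % length + payload

def read_pkt_lines(data: bytes):
    """Yield each payload in data, and None for each flush packet."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 4], 16)
        if length == 0:
            yield None
            pos += 4
        elif length < 4:
            raise ValueError("special packet not handled in this sketch")
        else:
            yield data[pos + 4:pos + length]
            pos += length
```

For example, pkt_line(b"version=2\n") produces b"000eversion=2\n": ten payload bytes plus the four length digits is 14, or 0e in hex.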

Today's work was sponsored by Mark Reidenbach on Patreon

Posted Wed Nov 3 20:07:03 2021

I've been unsatisfied with git-annex's handling of clock skew since day 1. Since it relies on timestamps, it needs clocks to be synchronised across users, at least to a reasonable extent. A clock in the far future or distant past could potentially confuse git-annex a lot. Vector clocks felt like the right kind of solution, but also wrong somehow.

I've finally cracked it! See git-annex branch clocks for the details, but in summary, git-annex will be able to detect clock skew and fall back to vector clocks, but will otherwise continue to use timestamps for their benefits over vector clocks (ie, having some idea about what order disconnected events actually occurred, to the extent physics makes that possible).

That is mostly implemented, only needs some more testing and cleanup before merging.

Today's work was sponsored by Graham Spencer on Patreon

Posted Tue Aug 3 21:16:06 2021

I've fallen completely out of practice on this dev blog, but I felt I had to mention a major milestone accomplished over the past week. The database that git-annex maintains about keys and worktree files used to only be guaranteed to be maintained for unlocked files, but it did not have information about locked files. Now it does, and it's automatically, and efficiently (I hope) kept up-to-date.

That let a long-standing bug get fixed, where when 2 files used the same key, the preferred content expression could match one file and not the other and cause get/drop to happen over and over.

But there are probably a lot of other ways this database could be used, now that it's fully available. For example, it would be easy to write a git-annex command that queries for which worktree files use a key, without needing to scan the whole worktree to find them.

Posted Mon May 31 14:44:45 2021

Finally gotten started on the borg special remote idea. A prerequisite of that is remotes that can be imported from, but not exported to. So I actually started by allowing setting importtree=yes without exporttree=yes. A lot of code had assumptions about that not being allowed, so it took a while to chase down everything. Finished most of that yesterday.

What I've done today is added a thirdPartyPopulated type of remote, which git-annex sync can "pull" from by using the existing import interface to list files on it, and determine which of them are annex object files. I have not started on the actual borg remote at all, but this should be all the groundwork for it done.

(I also finished up annex.stalldetection earlier this week.)

This work was sponsored by Jake Vosloo on Patreon.

Posted Tue Dec 22 20:55:25 2020

Just barely managed to get the borg special remote fully working by end of day today. I'm still a bit shocked it was possible to do this at all, let alone as neatly as it turned out, with so few changes to git-annex to support such an entirely different thing.

I put in a fair amount of effort to making it fast to keep up-to-date with changes to the borg repo; git-annex sync does need to run borg list to check what archives are in it, but it avoids rescanning archives it already knows the contents of.

There are a slew of new todo items related to this special remote: allow overriding untrust of import remotes, borg special remote add subdir config, sync --content with borg does not get content, borg sync tree not grafted, use same vector clock for content identifier updates in import

And, if this backup-as-a-remote idea does turn out to be useful, there are lots of other backup programs. It would be good to be able to write external special remotes for them, but that would need a protocol extension.

This work was sponsored by Mark Reidenbach on Patreon.

Posted Tue Dec 22 20:55:25 2020

I got annex.stalldetection implemented yesterday without much drama. It was a big patch, and it all worked on the first try. But ended up also spending all of today working on it. After sleeping on it, I realized there were several things I needed to improve. Including making a real protocol that git-annex uses to talk to the helper git-annex transferrer processes, to make it fairly future proof.

This work was sponsored by Jake Vosloo and Mark Reidenbach on Patreon.

Posted Wed Dec 9 20:21:08 2020

Finally have all the groundwork done for canceling stalled transfers. This involved taking some code that was in the assistant, and had not been touched for probably 7 years beyond basic maintenance, dusting it off, and making it suitable to be used in git-annex more generally. Now I have git-annex using transferkeys child processes, and all that seems to work well.

I'm finishing up today by designing the new git config that will enable stall detection and canceling. annex.stalldetection will be configurable to a value like "1MB/30s", which means it's stalled unless every 30 seconds a megabyte of data has been transferred. Or "0KiB/2m" will let things stall for up to 2 minutes with no data transfer. There will also be a per-remote config, so minimum transfer rates can be set for each. This can be combined with annex.retry to make it retry after detecting a stall.
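
A rough sketch of how such a value could be interpreted. The function names, the list of accepted units, and the rule that a window with no data at all counts as a stall are assumptions chosen to fit the two examples above; this is not git-annex's actual parser.

```python
import re

# Byte multipliers. Treating MB as 10**6 and MiB as 2**20 is an assumption.
UNITS = {"B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9,
         "KIB": 2**10, "MIB": 2**20, "GIB": 2**30}
SECONDS = {"s": 1, "m": 60, "h": 3600}

def parse_stall_detection(value):
    """Split a value like "1MB/30s" into (minimum bytes, window in seconds)."""
    match = re.fullmatch(r"(\d+)([A-Za-z]+)/(\d+)([smh])", value.strip())
    if match is None or match.group(2).upper() not in UNITS:
        raise ValueError(f"cannot parse stall detection value: {value!r}")
    amount, unit, count, duration = match.groups()
    return int(amount) * UNITS[unit.upper()], int(count) * SECONDS[duration]

def is_stalled(bytes_in_window, minimum_bytes):
    """Stalled if the window moved less than the minimum, or nothing at all."""
    return bytes_in_window == 0 or bytes_in_window < minimum_bytes
```

So parse_stall_detection("0KiB/2m") gives a minimum of zero bytes over 120 seconds, and a transfer only counts as stalled when a whole 2 minute window passes with no data.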

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Dec 7 20:56:53 2020

Update on 3 new features. Appropriate to the season, there's a past, a present, and a future one.

Past: The last release added git annex adjust --unlock-present which might be just what you were looking for, if you used to use direct mode. It unlocks files whose content is present, but files whose content is missing are dangling symlinks. Currently, the branch is only refreshed after git-annex finishes all requested transfers. There is an annex.adjustedbranchrefresh config that can make it refresh more frequently, but doing it after every file may be too slow in a large repo. I hope to speed it up enough eventually to perhaps make this the default later in places where --unlock is currently used.

(That work was sponsored by Gioele Barabucci ENK)

Present: This week, I've been working on an internal protocol to communicate about all console IO that git-annex does, so it can start some child processes to perform long-running tasks, like downloads. The goal is to detect stalled transfers and cancel or retry them. This is after previous attempts at doing it using threads failed. I finished the IO serialization part today, but may put off the rest until a bit later.

(This work was sponsored by Jake Vosloo, Mark Reidenbach, and Graham Spencer on Patreon)

Future: We've been thinking about a borg special remote for a while, and last night I realized that something I implemented this summer for importing from special remote without downloading might be just what's needed for this new kind of remote. That was surprising! At the time, I had been doubtful about the new feature, since it seemed only the directory special remote would benefit from it at all.

The idea is that you run a backup program, like borg, to store a copy of your git-annex repo, and then point git-annex at it, to learn what annexed content is stored in it. This is particularly exciting to me, because it's a whole new kind of special remote, and could be used for lots of backup programs beyond borg, and probably other stuff.

Imagine something like this:

borg init user@host:/annex.borg
borg create user@host:/annex.borg::{now} .git
git annex initremote borg type=borg repolocation=user@host:/annex.borg
git annex import --known --from borg
git annex drop --unused

And now all your old unused annex objects have been moved into the borg repo, where they're efficiently stored with its data deduplication. And of course, you can use git-annex get to get them from there.

I have a feeling I'll be haunted by this idea until I implement it..

Posted Fri Dec 4 19:43:15 2020

The first prerelease of git-annex was made ten years ago. Taking a look back at 0.01, which I think only I ever used, it had the commands add, get, drop, unannex, init, fix, and fromkey. There were 2600 lines of code in all, which has increased 30-fold.

Later that week, 0.02 added the move command. I've been working recently on fixing a tricky problem with resuming interrupted moves. The move command has been surprisingly subtle to get just right, since it turns out to not be as simple as a get followed by a drop -- numcopies checks that would prevent a drop sometimes need to be relaxed to allow a move. Maybe I've finally gotten it perfect now. Probably not.

Posted Wed Oct 21 14:32:33 2020

I've spent two days trying to track down a recently introduced memory leak, or leaks. This was unusually hard because all the profiler could tell me is the memory is "PINNED", but not what allocated it or anything else about it.

I probably should have bisected it, rather than staring at the code and randomly reimplementing things I thought could be pinning memory. Oops.

And there is more memory that the profiler doesn't even show being allocated, which got much bigger with a new toolchain, and I have not gotten to the bottom of that yet.


This work was sponsored by Jake Vosloo and Mark Reidenbach on Patreon.

Posted Tue Oct 13 22:58:08 2020

Three recent todos have needed a way to introspect the matchers built from preferred content expressions and some command-line options, to determine what information they use. So implemented that today.

With that, it was possible to double the seeking speed of git-annex sync --all when include=/exclude= are not used in preferred content. And the seeking speed of commands like git-annex find --copies=2 and --in remote improved by around 20%.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Fri Sep 25 17:51:34 2020

A stressful thing about maintaining git-annex is that sometimes changes to git break it in some way. Since git has a high development velocity, it can be hard to keep up with all changes and catch such problems. The git devs are good about backwards compatibility, but can still make mistakes. Worse is when there's an assumption about how people will use git, when git-annex lets people use it in a rather different way. I've been dealing with one of those today.

CONFLICT (file location): x/foo added in refs/remotes/origin/master
inside a directory that was renamed in HEAD, suggesting it should
perhaps be moved to y/foo.

This was an interesting new git feature when it was added back in 2.18, especially since git doesn't really track directories, so is here somewhat guessing if a directory was renamed.

An example of a way git-annex is used that this does not play well with is managing media files for consumption, where you might have an incoming directory, and then rename files to somewhere else once they're processed. If you renamed the last file in your incoming directory, and then a new file was later added to it in some other clone of the repository, this git feature could result in that new file being moved to an unexpected location when you git-annex sync.

Normally it wouldn't matter much if git guessed wrong like that about a rename, since the merge conflict forces the user to look at it. But, git-annex sync and the assistant automatically resolve merge conflicts, so the user can easily not notice this happening.

If you're worried that might have happened to you, look for files in your repository with ".variant" in the name. If there are two with the same base name, that's a normal merge conflict, but if there's only a single variant file by itself, it could have been created by this rename conflict scenario.
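
A small sketch of that check, given a list of file paths (for example from git ls-files). The function name is made up, and it assumes that the two sides of a normal conflict share everything in the path before ".variant".

```python
from collections import defaultdict

def lone_variant_files(paths):
    """Return variant files that have no sibling variant with the same base.

    Assumption: both sides of a normal merge conflict share the part of
    the path that comes before ".variant".
    """
    groups = defaultdict(list)
    for path in paths:
        base, marker, _rest = path.partition(".variant")
        if marker:  # only paths that actually contain ".variant"
            groups[base].append(path)
    return sorted(files[0] for files in groups.values() if len(files) == 1)
```

Anything it returns is only a candidate for the rename conflict scenario and still needs a look by hand.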

git-annex will now avoid this problem, by setting merge.directoryRenames=false when running a merge (unless you've manually configured it yourself).

Today's work was sponsored by Martin D on Patreon

Posted Mon Sep 7 20:57:40 2020

After a release on Monday, I've spent the week working on an async extension to the external special remote protocol. This lets a single external special remote process handle multiple requests at the same time, when it's more efficient to use one process than for git-annex to run several processes.

It's a good thing I added support for extensions a couple of years back. I never imagined at the time using it for something like this, that radically changes the whole protocol! It could have just been protocol version 2, but then special remotes would be pushed towards using this by default, which I don't want. It's probably overkill for most of them.

      J 6 REMOVE Key5

The protocol extension went through a bunch of iterations, ending up with probably the simplest possible way to do it, a simple framing layer around the main protocol. I started with rather a lot of rather hairy code and it kind of all melted away as I refined the protocol down to that, which was nice, although I also kind of wish I had been able to jump right to the clean and simple end result.
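
To illustrate what a framing layer like that looks like, here is a minimal sketch based only on the "J 6 REMOVE Key5" example above: each protocol message is prefixed with "J" and a job number, so several requests can be in flight on one connection. The function names are hypothetical.

```python
def frame(job_id: int, message: str) -> str:
    """Wrap a protocol message so it can be matched to its job."""
    return f"J {job_id} {message}"

def unframe(line: str):
    """Split a framed line into (job number, original message)."""
    tag, job, message = line.split(" ", 2)
    if tag != "J":
        raise ValueError(f"not a framed message: {line!r}")
    return int(job), message

def route(lines):
    """Group interleaved framed lines by job number, keeping their order."""
    jobs = {}
    for line in lines:
        job, message = unframe(line)
        jobs.setdefault(job, []).append(message)
    return jobs
```

Because the inner messages are unchanged, the existing protocol handling can be reused as-is once a line has been unframed.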

Posted Fri Aug 14 20:01:50 2020

Today I implemented external backends for keys. So unusual new hashes can be used by writing a small program.

Probably lots of other uses for this too; I don't know if I'll like them all. It has the potential to warp git-annex in some directions I don't want to deal with. Still, it's good to have this feature.

I was able to reuse a lot of the external special remote code for this, and only had to write around 400 lines of new code. Dunno how that all happened in 8 hours, but it did!

Posted Wed Jul 29 21:28:49 2020

One more day working on performance, as I had a few known improvements I had not had time to get to. Managed to double the speed of move --to, copy --to, and drop when seeking files to act on, and a few percent more in general.

My laptop's keyboard is failing, with more and more keys not working -- luckily so far only ones in the number row -- so I'm stopping early and hoping the fix arrives quickly on Monday. At some point I know that this todo will be able to speed up using things like --in and --copies by a similar amount as the recent performance improvements.


Today's work was sponsored by Jake Vosloo on Patreon.

Posted Fri Jul 24 18:16:00 2020

I've spent all week working on performance. It started when Lukey found a way to use git cat-file --buffer to make --all faster. Once implemented, that turned out to be a 2x to 16x speedup in seek time.

I felt that same approach could probably also speed up other parts of git-annex that use git cat-file, so spent another 4 days finding ways to do that. Some of the ideas are not implemented yet, but I landed a 2x speedup today, to all git-annex commands that seek annexed files to work on.

Oh and also there used to be a git-annex branch read cache, but it got removed many years ago, and I forgot it had been removed. Which does not lead to writing the fastest code. Bringing the cache back makes some things another 20% faster.

This work was sponsored by Mark Reidenbach, Jake Vosloo, and Graham Spencer on Patreon.

Posted Fri Jul 10 20:09:36 2020

New feature today: Implemented import tree should honor annex.largefiles.

This only took an hour to implement, but I had to think for several hours first to get a solid understanding of it. Particularly, what happens if a file on a remote has a name that makes it be treated as non-large, but then in a later import, it's renamed to a name that would be treated as large? (Or vice-versa.) My conclusion is that is equivalent to git annex add of a file with the first name followed by git mv, so it's ok for annex.largefiles to not take effect in such a case.

Today's work was sponsored by Martin D on Patreon.

Posted Tue Jun 23 20:09:14 2020

Started out the day productively working through more async exception safety for timeouts. But then I realized there's a whole can of worms involving bracket not working like I expected it to.

Got a bit side tracked checking if other people expect bracket to work the way it actually does, and seem to have found a bug in the process library. Which is especially concerning since it's just the first place I looked, so what other libraries might have similar problems?

So the timeouts feature is seeming a lot less plausible than it did last week. I'll probably defer it until later. The work done on it so far is at least generally an improvement to the code.

Posted Tue Jun 9 20:21:02 2020

Working this week on a long missing capability of git-annex: The ability to time out, and perhaps retry, a transfer that has gotten stuck.

It's a lot harder than it sounds, because to get it right with no resource leaks, every process and child thread that git-annex runs has to be stopped by a timeout too, which the current code base was not designed for at all. (gory details here)

So far I have most processes being stopped, and that took 2 solid days. This may take a while to finish. I do think though, that once the basic operation of stopping a transfer is available, there will be other uses besides timeouts.

One I can think of already is, if a remote is being very slow, it might make sense to stop a transfer from it and switch to using a different remote. Another is that there could be a hotkey to skip the current transfer, moving on to the next file.

Posted Thu Jun 4 19:53:26 2020

Landed two behavior changes in the past two days, which I wanted to mention here.

First, special remotes configured with autoenable=yes will be auto-enabled by the automatic git-annex initialization that it does in a clone of a git-annex repo. Before, git annex init had to be run to "auto-enable" them. Probably few people will notice this, unless a special remote somehow takes too long to enable. May later have to add a timeout.

Second, a command like git annex get foo will complain if the specified file or directory is not known to git. Lots of users have gotten confused by why such a command would silently return without doing anything.

Commands like git annex get dir will not complain about files in the directory that are not under git's control, unless none of them are. (Same behavior as eg git commit dir.) Mostly this change affects using wildcards, or just being confused about a file not being checked into git.

Due to the potential to break some workflows, the new behavior will for now only be enabled when annex.skipunknown is set to false. I plan to make that the default in early 2022. (About a year delay seemed right generally but I added some time due to the pandemic.) So if you prefer the original behavior, you can just set annex.skipunknown true.

At least for now, git-annex will still skip over files that are checked into git but are not annexed files. May make sense to change that too, we'll see if users get confused by that like by the other skipping behavior.

Posted Fri May 29 17:09:20 2020

git-annex development has been more or less back to normal for the past several weeks, including getting on top of most of the recent backlog.

Today I'm finishing up a project that has taken half the week. The internal remote interface used Bool extensively, and avoided throwing exceptions, and so it was not uncommon for access to a remote to fail and no reason be given. There have been a number of bugs about one thing or another over the years, which have been fixed on an ad hoc basis without addressing the underlying problem. Now it's all been changed to throw exceptions, so the failure reason will always be displayed. Some tens of thousands of lines of diffs later, it's almost done.

Today's work was sponsored by Graham Spencer on Patreon.

Posted Fri May 15 19:13:01 2020

I'm only working on git-annex a day or two a week at present. Like everyone, dealing with the covid-19 crisis is taking up a lot of my time. Some days I can't concentrate, some days I am dealing with basic needs, and other days I am rushing to develop other software targeted at this crisis. (See my personal blog.)

I remotely attended the MONII conference a week ago, with lots of researchers doing things with software related to git-annex, some in the health field, and something that struck me was a mention that it's important that scientists continue their work, even if it's not directly related to the crisis. All kinds of fields are going to be important in the time ahead beyond saving lives.

So I am prioritizing anything scientists need to use git-annex, and anything those working on the crisis might need. If that's you and you need something, you can use the new "priority" tag on bugs and todos, and it will go right to the top of the roadmap. Do bear in mind that I have limited time/resources/attention right now, so only use it when you really need something urgently.

Posted Wed Mar 25 17:26:38 2020

A nasty bug that made git-annex store content on gcrypt and git-lfs without encrypting it led to a bugfix-only release, 7.20200226.

Since v8 was already close to release -- I was thinking probably Friday -- and the autobuilders are already building that version, it made sense to move up the v8 release as well, so that's also been released today.

That bug happened because of an oversight when I was doing the big remote config parsing change. I tested that a lot, but I didn't think to test that gcrypt actually stored content encrypted! I need to do something about a network test suite, so this kind of breakage in special remotes can be caught.

Posted Wed Feb 26 22:51:11 2020

v8 is now merged into master and so the next release will use v8.

The presumably last v7 release happened earlier today, with some accumulated changes, including a fix for a data loss bug in git-annex fsck --from remote -J.

Posted Wed Feb 19 19:16:36 2020

This has been a big change, I'm now 3 days and a 3000 line diff in and I finally got all the remote configuration settings converted to the new up-front parsing.

Seems like quite a lot of work, since the only user-visible improvement is these error messages:

# git annex initremote demo type=directory directory=../foo encryption=none foo=bar
initremote demo
git-annex: Unexpected fields: foo

# git annex initremote demo type=directory directory=../foo encryption=none exporttree=true
initremote demo
git-annex: Bad value for exporttree (expected yes or no)

But this involved paying down technical debt in a big code base, so of course it was expensive.

Anyway, it should now be relatively easy to implement git annex initremote --list-params-for=S3

Posted Wed Jan 15 18:16:42 2020

I'm in the middle of a big change to internals. Remotes have buried inside them a string-based configuration, and those settings are only parsed when they're used, so bad configuration is often ignored rather than being detected when the user inputs it. The parsing is moving to happen upfront.

This is something I could not have done when I first wrote git-annex, because the values that get parsed have many different types, so how can a single Remote data type contain those, whatever they are? Now I know how to use the Typeable class to do such things.
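
The difference between parsing settings when they are used and parsing them up front can be sketched in a few lines. The field table below is a hypothetical subset and the function names are made up; git-annex itself is written in Haskell and does this with typed parsers, so this only illustrates the behavior, not the implementation.

```python
def text(field, value):
    """Accept any string unchanged."""
    return value

def yes_no(field, value):
    """Accept only "yes" or "no", converting to a boolean."""
    if value in ("yes", "no"):
        return value == "yes"
    raise ValueError(f"Bad value for {field} (expected yes or no)")

# Hypothetical subset of the fields a directory remote might accept.
PARSERS = {"type": text, "directory": text,
           "encryption": text, "exporttree": yes_no}

def parse_remote_config(settings):
    """Validate every setting when the user supplies it, not on first use."""
    unexpected = sorted(set(settings) - set(PARSERS))
    if unexpected:
        raise ValueError("Unexpected fields: " + " ".join(unexpected))
    return {field: PARSERS[field](field, value)
            for field, value in settings.items()}
```

A typo such as foo=bar, or exporttree=true instead of exporttree=yes, is then rejected at initremote time rather than being silently ignored.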

Posted Mon Jan 13 17:17:29 2020

The release of git-annex with all the ByteString optimisations went out earlier this week. The Windows autobuilder was down and I didn't try to get it building on Windows, so fixed that today. Luckily, all those changes only broke a few bits of Windows-specific code.

Also today, I added git-annex add --force-annex/--force-git options. These do the same thing as -c annex.largefiles=anything etc, but are easier to type and may avoid some tricky git behavior in some edge cases.

I'd kind of like to get back to v8 this month and perhaps release it. There's a v8 branch now, which as well as the sqlite changes adds a new annex.dotfiles config setting, and cleans up the special cases around adding dotfiles. Anyone not using git-annex to manage large dotfiles (or files in dotdirs) won't be impacted, but those who do will need to enable annex.dotfiles and configure annex.largefiles to match the dotfiles they want annexed. There is a risk that someone who's in the habit of running git annex add .dotfile to add them to the annex will be surprised when the new version adds them to git because they've not done the necessary configuration. I'm still mulling over whether this is an acceptable risk to mostly de-uglify and de-special-case dotfiles.

Posted Wed Jan 1 19:16:35 2020

Cut the last release before the switch over to end-to-end ByteString. (Including fixing the rpm repo's index which had not been getting updated.)

I had left the bs branch last week with a known bug, so got that fixed. Also there were some encoding problems on windows with the ByteString filepaths, which needed a new release of filepath-bytestring to clean up. Now I think the bs branch is really in a mergeable state. (It's still not tested on Windows at all though.)

Took the last little while to do some more profiling. Mostly the remaining ByteString conversions barely seem worth doing (1% improvement at most), but ?optimise journal access seems like it could pay off well.

Also found time in there somewhere to implement git annex inprogress --key

Posted Wed Dec 18 21:11:59 2019

The bs branch has reached a milestone: git-annex find and git-annex get (when all files are present) process ByteStrings end-to-end with no String conversion. That sped it up by around 30% on top of the previous optimisations.

To get here, I spent a couple of days creating the filepath-bytestring library, which git-annex will depend on. Lots more git-annex internals were switched to ByteString, especially everything having to do with statting files.

Other commands, like git-annex whereis, still do some String conversions. Optimisation never ends.

But the bs branch is ready to merge as-is, and the diff is 10 thousand lines, so not a branch I want to maintain for long. Planning to merge it after the next release.

Posted Wed Dec 11 19:28:24 2019

I've gotten the bs branch to build everything again. Was not trivial, the diff is over 7000 lines.

Had hoped this was a mechanical enough conversion it would not introduce many bugs, but the test suite quickly found a lot of problems. So that branch is not ready for merging yet.

I'm considering making a library that's like filepath but for RawFilePath. That would probably speed git-annex up by another 5% or so, in places where it currently has to convert back to FilePath.

Posted Thu Dec 5 19:23:04 2019

Two entire days spent making a branch where git-annex uses ByteString instead of String, especially for filepaths. I commented out all the commands except for find, but it still took thousands of lines of patches to get it to compile.

The result: git-annex find is between 28% and 66% faster when using ByteString. The files just fly by!

It's going to be a long, long road to finish this, but it's good to have a start, and know it will be worth it. ?optimize by converting String to ByteString is the tracking page for this going forward.

Posted Tue Nov 26 20:13:23 2019

Today, sped up many git-annex commands by around 5%. Often git-annex traverses the work tree and deserializes keys to its Key data type, only to turn around and do something with a Key that needs it to be serialized again. So caching the original serialization of a key avoids that work. I had started on this in January but had to throw my first attempt away.
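A minimal sketch of the caching idea, in Python rather than Haskell. The Key class and the simplified "BACKEND--NAME" format are assumptions made up for illustration and do not match git-annex's real key format or data type.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Key:
    backend: str
    name: str
    # The original serialized form, kept when the key was parsed from one.
    _serialized: Optional[str] = field(default=None, compare=False, repr=False)

    def serialize(self) -> str:
        # Reuse the cached serialization when there is one.
        if self._serialized is not None:
            return self._serialized
        return f"{self.backend}--{self.name}"


def deserialize_key(s: str) -> Key:
    backend, sep, name = s.partition("--")
    if not sep or not backend or not name:
        raise ValueError(f"not a key: {s!r}")
    return Key(backend, name, _serialized=s)
```

A key that was read from the work tree hands back the very string it was parsed from, so the round trip costs nothing; only keys built from scratch pay for serialization.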

The big bytestring conversion in January only yielded a 5-15% speedup, so an extra 5% is a nice bonus for so relatively little work today. It also feels like this optimisation approach is nearly paid out though; only converting all filepath operations to bytestrings seems likely to yield a similar widespread improvement.

Posted Sat Nov 23 16:50:30 2019

The git-lfs support I added to git-annex had one small problem: People expect to be able to clone a git repo and get right to using it, but after cloning a git-annex repo that's on a server that uses git-lfs, there was an extra git annex enableremote step to be able to use it as a git-lfs special remote. And, you ended up with a "origin" git remote and a git-lfs special remote with some other name.

Now, it's this simple to set up a git-lfs repo on eg, github:

git annex initremote github type=git-lfs encryption=none url=
git annex sync github
git annex copy --to github ...

And then for others to clone and use it is even simpler:

git clone
cd lfstest
git annex get

The only gotcha is that git-annex has to know the url that's used for the remote. Cloning any other url any other way (eg http instead of https) will result in git-annex not using it. This is a consequence of git-lfs not having any equivalent of a git-annex repository UUID, so git-annex can't probe for the UUID and has to compare urls. This can be worked around using initremote --sameas to tell git-annex about other urls.

Posted Mon Nov 18 21:25:55 2019

Spent the past two weeks on the ?sqlite database improvements which will be git-annex v8.

That cleaned up a significant amount of technical debt. I had made some bad choices about encoding sqlite data early on, and the persistent library turns out to make a dubious choice about how String is stored, that prevents some unicode surrogate code points from roundtripping sometimes. On top of those problems, there were some missing indexes. And then to resolve the git add mess, I had to write a raw SQL query that used LIKE, which was super ugly, slow, and not indexed.

Really good to get all that resolved. And I have microbenchmarks that are good too; 10-25% speedup across the board for database operations.

The tricky thing was that, due to the encoding problem, both filenames and keys stored in the old sqlite databases can't be trusted to be valid. This ruled out a database migration because it could leave a repo with bad old data in it. Instead, the old databases have to be thrown away, and the upgrade has to somehow build new databases that contain all the necessary data. Seems a tall order, but luckily git-annex is a distributed system and so the databases are used as a local fast cache for information that can be looked up more slowly from git. Well, mostly. Sometimes the databases are used for data that has not yet been committed to git, or that is local to a single repo.

So I had to find solutions to a lot of hairy problems. In a couple cases, the solutions involve git-annex doing more work after the upgrade for a while, until it is able to fully regenerate the data that was stored in the old databases.

One nice thing about this approach is that, if I ever need to change the sqlite databases again, I can reuse the same code to delete the old and regenerate the new, rather than writing migration code specific to a given database change.
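The "delete the old and regenerate the new" approach can be sketched roughly as follows, in Python with the standard sqlite3 module. The table name, its columns, and the source_of_truth parameter are all hypothetical; in git-annex the authoritative data would be looked up from git, which this sketch does not attempt.

```python
import os
import sqlite3


def rebuild_cache(db_path, source_of_truth):
    """Throw away the old database and regenerate it from scratch.

    source_of_truth is an iterable of (key, filename) pairs. It stands in
    for the slower authoritative lookup (an assumption for this sketch).
    """
    if os.path.exists(db_path):
        os.remove(db_path)  # no migration: untrusted old data is discarded
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE associated (key TEXT NOT NULL, file TEXT NOT NULL)")
    db.execute("CREATE INDEX associated_key ON associated (key)")
    db.executemany("INSERT INTO associated VALUES (?, ?)", source_of_truth)
    db.commit()
    return db
```

Nothing in the function depends on what the old schema looked like, which is why the same code can serve any future schema change.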

Anyway, v8 is all ready to merge, but I'm inclined to sit on it for a month or two, to avoid upgrade fatigue. Also, I may find more ways to improve the database schema. Perhaps it would be worth it to do some normalization, and/or move everything into a single large database rather than the current smattering of unnormalized databases?

Posted Thu Nov 7 17:56:18 2019

I posted in my main blog.

Planning to start on ?sqlite database improvements tomorrow, since the recent changes have raised the urgency of that significantly. I'll be head down working on that, and not reading anything here.

Posted Mon Oct 28 16:24:38 2019

Not working on git-annex for a while, at least not under the direction of users on this website. Sorry.

Posted Wed Oct 23 19:33:39 2019

Dropped a release today, earlier than planned. There was a bug that broke the OSX git-annex.dmg, and another problem broke the Homebrew build, and it was worth moving up the release for those.

Now I'm going on a mini vacation to recover from some stuff..

Posted Thu Oct 17 22:20:18 2019

Gotten the sameas remote feature fully working today, except for the small problem that sameas remotes should use different per-remote state from one-another. I know how to approach that, but it's another big change, for another day.

Posted Fri Oct 11 20:09:15 2019

Plenty of stuff going in that has not made the blog. Now I'm working on a sort of major feature. ?support multiple special remotes with same uuid will fill in an odd little hole in git-annex's capabilities. I think it will turn out to be more useful than it appears. But it's major not so much in what it will allow, but in how many assumptions in the git-annex code base have to be worked around to implement it. After pondering lots of approaches, I have finally gotten stuck in to implementing it, and I've made some good progress today (on the sameas branch). I might finish it tomorrow.

Posted Thu Oct 10 20:18:58 2019

With git-annex 7.20190912 released this fine Friday the 13th, I've finally made v7 the default!

See upgrades for details about this major transition if you have not been keeping up with v7 stuff.

Based on some feedback that it would be good to have a way to avoid accidental upgrades of a repository in some circumstances, there's a new config option git config annex.autoupgraderepository false to prevent upgrades. Since the new git-annex doesn't support working in v5 repos, setting that will make every command except git annex upgrade fail.

Users of rpm based linux distros can now install a git-annex-standalone.rpm package that will work on a broad range of systems. It's based on the standalone tarball, just packaged as an rpm similar to the git-annex-standalone.deb provided by NeuroDebian.

Posted Fri Sep 13 17:05:03 2019

Finished yesterday with eliminating direct mode, removing another 200 lines of code. I'm close to a decision on ?updating repos to v7 by default. Been reviewing open bugs that could impact the decision, taking a final look at the code to make, and reaching out to people who might be affected.

Posted Wed Aug 28 19:18:53 2019

Wow, I did not plan to remove direct mode today! The original plan was to work on ?sqlite database improvements. But that seems to need a v8 repository format, to avoid confusing old git-annex with the new db schemas. And to get to v8, we must first get to v7..

Removing direct mode eliminated over 1000 lines of code. I may be able to remove a few hundred more yet.

Did find a bug with the upgrade process just as I was wrapping up for the day, it's minor (involving a deleted file in the work tree), so I'll deal with it tomorrow.

Posted Mon Aug 26 20:46:46 2019

Spent several days fixing test suite failures on Windows. This started out really annoying; I had to chase back a "NUL" -- the string not the pointer! -- to an indirect dependency that needed an update to work with recent ghc on Windows.

Then yesterday I fixed most of the other test suite failures on Windows. But, it became clear that the test suite was only testing adjusted unlocked branches on Windows, and was finding non-Windows-specific problems involving them. So, today I added a fifth pass to the test suite, so it will always test adjusted unlocked branches. And fixed all the problems with them that test suite turned up.

It turned out there was no good way to use git-annex import with an adjusted branch. Merging the imported branch into an adjusted branch is likely to result in spurious merge conflicts, and the merged files don't get adjusted. The solution was adding a new way to merge a single branch in the same way that git-annex sync handles merges: git-annex merge remote/master

Sadly, I think there are still a couple of test failures on Windows. (Can't win em all..)

Posted Fri Aug 9 19:41:27 2019

I've spent a week making git-annex be able to store files in a remote on a server using git-lfs.

That included writing a haskell implementation of the git-lfs protocol. That could be split out of git-annex into a library if someone wants to use it for something else.

Now git-lfs is just another special remote as far as git-annex is concerned. Albeit one that it can't drop data from, because the git-lfs protocol does not have a way to delete an object.

One nice thing about git-annex's support for git-lfs is it can be used along with git-remote-gcrypt, and the result is a remote where the annexed files and the git repo contents are both encrypted.

See storing data in git-lfs for details.

Posted Mon Aug 5 17:52:42 2019

I've been back from summer vacation for a couple of days. My contract to work on git-annex has expired, at least for now, but I have a lot of Patreon rewards to catch up on anyway. I've been pushing hard for months on that contract and made a lot of progress on long-term goals. Plan for the next little while is to cut back a little bit, and work on easier stuff.

Today I improved how git-annex uses Copy-On-Write when copying between two repositories on the same drive. It had relied on matching up device numbers, but it turns out that with eg BTRFS subvolumes, CoW is supported even when the device numbers don't match. Also, it was using cp even on filesystems that don't support CoW, which prevented resuming after an interruption. The new approach is to try to make a CoW copy once per remote, and if it fails, fall back to rsync.
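One way to read "try once per remote, then fall back" is sketched below in Python. The RemoteCopier class is hypothetical, and the two copy callables are placeholders: in practice they might wrap something like cp with a reflink option and rsync, but that wiring is an assumption and is left out so the sketch stays runnable anywhere.

```python
class RemoteCopier:
    """Try a CoW copy per remote; after one failure, always use the fallback."""

    def __init__(self, cow_copy, fallback_copy):
        # Both are callables (src, dst) -> bool, supplied by the caller.
        self.cow_copy = cow_copy
        self.fallback_copy = fallback_copy
        self.cow_failed = set()  # remotes where CoW has already failed

    def copy(self, remote, src, dst):
        if remote not in self.cow_failed:
            if self.cow_copy(src, dst):
                return True
            # Remember the failure so CoW is not retried for this remote.
            self.cow_failed.add(remote)
        return self.fallback_copy(src, dst)
```

Probing by attempting the copy, rather than by comparing device numbers, is what lets this work on setups like BTRFS subvolumes where the numbers differ but CoW still succeeds.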

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Wed Jul 17 18:40:23 2019

I mentioned the other day about "a bit of a hack" that I couldn't find a way to avoid. After sleeping on it, I did find a much cleaner way.

The problem involved classifying threads in a worker pool, so eg only a certain number of transfer threads and a certain number of checksum threads run at the same time.

I had been relying on the stages used internally by git-annex commands to classify the threads. And that is a reasonable default for random git-annex commands that might do anything, but for a specific command like git annex get that is all about transferring and checksumming, it would be better to mark the segments of code that do transfers and checksums, and have a way to specify which classifications matter for scheduling the actions of the command.

As well as cleaning up the design, that also fixed one bug in the thread classification. And, it would now be easy to classify threads in other ways specific to particular commands.

Then I spent too long fixing a STM deadlock. Same one I spent too long "fixing" the other day, but I really understood it and fixed it this time.

Posted Wed Jun 19 22:45:27 2019

Finally got checksum verification running in a separate job pool from downloads, to better keep bandwidth saturated.

I had to resort to what felt like a bit of a hack, but I can't see a better way to do it. Also, I got stuck for far too long on a STM deadlock bug.

Interestingly, this means that -J1 now has a purpose, it's not the same as no -J option. Instead, it lets one download and also one concurrent checksum of the previous download run at the same time.
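The effect of separate pools can be sketched in Python with two thread pools. This is only an illustration of the scheduling idea, not how git-annex (which uses Haskell and STM) implements it; get_all and its parameters are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def get_all(items, download, jobs=1):
    """items: list of (name, expected_sha256); download(name) -> bytes.

    Returns {name: bool} saying whether each checksum verified.
    """
    def verify(data, expected):
        return hashlib.sha256(data).hexdigest() == expected

    results = {}
    with ThreadPoolExecutor(jobs) as downloads, ThreadPoolExecutor(jobs) as checksums:
        def fetch(name, expected):
            data = download(name)
            # Hand verification to the other pool so this worker is free
            # to start the next download right away.
            return checksums.submit(verify, data, expected)

        pending = [(name, downloads.submit(fetch, name, exp)) for name, exp in items]
        for name, fut in pending:
            results[name] = fut.result().result()
    return results
```

With jobs=1 there is one download worker and one checksum worker, so the checksum of the previous file overlaps with the download of the next one.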

It would be nice if -J1 could be the default.. One problem with that is that it needs a unicode locale to work due to a limitation of concurrent-output. Changing the concurrency method based on the locale does not seem like a good idea.

Posted Mon Jun 17 19:27:10 2019

Finished the refactoring that I had started on Thursday. This was only a partial success, because it didn't result in the speedup to -J that I had hoped for. The slow start with -J turns out to not be caused by concurrency overhead at all, but a bug, ?rsync and gcrypt special remotes make -J slow.

What was successful is that I got rid of the oldest implementation wart in git-annex, the implicitMessages state. And, I made progress toward separately parallelizing checksum verification.

This refactoring is still cooking in the starting branch and will be merged after the next release.

Posted Wed Jun 12 19:11:04 2019

I have a 2500 line patch on the starting branch that refactors how start messages get displayed. Prerequisite for faster parallel starts. This touched every single command, and quite a few needed non-trivial changes, so it took all day to get it to even compile.

Posted Thu Jun 6 21:16:49 2019

A long day spent making CommandCleanup actions run in a separate job pool than CommandPerform actions. I don't think this will speed anything up much yet, but it's useful groundwork. Now expensive things that are not the main action of a command can be moved into CommandCleanup and won't delay git-annex moving on to the next file. The main thing I want to move is checksum verification after a transfer. But there are probably other things I have not thought of.

CommandCleanup was always not well distinguished from CommandPerform, and so there was little incentive to put things in it. Now that's changed.

I also noticed that with -J, git-annex takes significantly longer than without to get started, when the first file it needs to process is quite a way down the ls-tree. This must be concurrency overhead. But, when CommandStart is skipping over a file that it doesn't need to process, there is no need to do that bookkeeping. Planning to take some time tomorrow to see if I can refactor CommandStart to avoid that overhead.

Posted Thu Jun 6 00:19:39 2019

I've added an export and import appendix to the external special remote protocol which documents how the protocol might be extended to allow for importing from external special remotes.

Feel this needs more thought. It's complicated by there already being an interface that only supports export, and import needing all the same operations, but with more checks that the content has not been modified behind git-annex's back. Unifying them at the protocol level would be possible, but perhaps more confusing.

Posted Tue May 28 20:25:26 2019

Kind of surprised it all came together so well today, especially because I noticed another big problem with the design, but I was able to work around that and import/export with preferred content works great.

I did end up limiting import to supporting a subset of preferred content expressions. Downloading content that it doesn't yet know if it wants to import seemed too surprising and potentially very expensive.

Posted Tue May 21 19:06:24 2019

Made git annex export --to remote honor the preferred content of the remote. In a nice bit of code reuse, adjustTree was just what I needed to filter unwanted content out of the exported tree.

Then a hard problem: When a tree is exported with some non-preferred content filtered out, importing from the remote generates a tree that is lacking those files, but merging that tree would delete the files from the working tree. Solving that took the rest of the day.

Posted Mon May 20 20:46:38 2019

I've developed a plan for how to handle ?export preferred content. And today I'm working on making git annex import --from remote honor the preferred content of the remote. It doesn't make sense to support it for one and not the other, so this is on the preferred git branch for now.

One use case for this is to configure an import to exclude certain file extensions or directories. Such unwanted content will be left as-is in the remote's data store, but won't be imported, so from git-annex's POV, it won't be present on the remote.

The tricky thing is, when importing, the key is not known until the file is downloaded, but you don't want git-annex downloading content that is not preferred. I'm finessing that problem by checking the subset of preferred content expressions that are not dependent on the file's content, which will avoid downloads of unwanted content in probably most cases.
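A rough Python sketch of checking only the content-independent part of an expression before downloading. It assumes, unlike the real preferred content language, that an expression is just a list of terms ANDed together; Term and wanted_before_download are hypothetical names.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable


@dataclass
class Term:
    needs_content: bool  # True if the term can only be decided from the key
    test: Callable[[dict], bool]  # takes file info such as {"path": ..., "size": ...}


def wanted_before_download(terms, info):
    """AND together only the terms that can be decided without the content."""
    return all(t.test(info) for t in terms if not t.needs_content)


# Example terms: exclude a file extension, and cap the size.
exclude_mp3 = Term(False, lambda info: not fnmatch(info["path"], "*.mp3"))
small_enough = Term(False, lambda info: info["size"] < 1000)
depends_on_key = Term(True, lambda info: info["key"].startswith("SHA256"))
```

A file rejected here never needs to be downloaded; a file that passes may still be rejected later by a content-dependent term once its key is known.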

What should it do when the preferred content expression is dependent on the file's content? I'm undecided if it's better to warn and not import, or to download the content once in order to check the preferred content expression, and then throw unwanted content away.

Posted Tue May 14 19:25:31 2019

Finally got the remote tracking branch for import from S3 into a good shape.

Rather than coaxing git into generating the same commits each time for imports (which would have needed commit dates to be stable somehow), I made git-annex always preserve a reference to the last import commit.


Here's how it looks when a rename was exported to S3, which resulted in a history in S3 that diverged and reconverged with the git history.


And here's how the history develops as more changes get exported and imported.

With adb and S3 import done, the first phase of ?import tree is complete.

Posted Wed May 1 18:46:58 2019

I could not find a good solution to the S3 history matching problem, so I think that was the wrong approach. Now I have what seems to be a better approach implemented: When an import of history from S3 contains some trees that differ from the trees that were exported to S3, all git-annex needs to do is make git aware of that, and it can do so by making the remote tracking branch contain a merge between what was exported to S3 and what was imported from it.

That does mean that there can be some extra commits generated from an import, with the same trees as commits that the user made, but a different message. That seems acceptable. Less so is that repeated imports generate different commits each time; I need to make it generate stable commits. I should also add back detection of the simple fast-forward case which was working but got broken today.

So still not done with this, but the end is in sight!

Posted Tue Apr 30 20:34:49 2019

I've been working on matching up the git history with the history from a versioned S3 export. Got sidetracked for quite a while building an efficient way to get the git history up to a certain depth (including all sides of merge commits) without reading the entire git log output.

The history matching is mostly working now, but there's a problem when a rename is exported to S3, because it's non-atomic on S3 and atomic in git, and so the histories stop matching up. This is not fatal, just results in an ugly git history with the right tree at the top of it. It's not entirely wrong; the git repo and the S3 bucket did legitimately diverge for a while, so shouldn't the merged history reflect that? The problem is just that the divergence is not represented in the optimal way.
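The traversal itself, a depth-limited walk that follows every parent of a merge, can be sketched in Python. This only illustrates the graph walk over an already-loaded parent map (a hypothetical input); it says nothing about how git-annex reads commits from git.

```python
def history_to_depth(parents, start, depth):
    """Return the commits reachable from start within `depth` steps,
    following every parent of merge commits.

    parents maps a commit id to the list of its parent ids.
    """
    seen = {start}
    frontier = [start]
    for _ in range(depth):
        next_frontier = []
        for commit in frontier:
            for p in parents.get(commit, []):
                if p not in seen:
                    seen.add(p)
                    next_frontier.append(p)
        frontier = next_frontier
    return seen
```

Since the walk stops after the requested number of steps, commits deeper in the history are never looked at.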

I hate giving up at the final hurdle, but I feel I need to think about this some more, so merging import-from-s3 is postponed for another day, or likely until Monday.

Posted Wed Apr 24 22:37:33 2019

Got S3 import and export working fully for both versioned and unversioned buckets. This included developing a patch to the aws library; only versioned buckets are fully supported until that gets merged.

I'm left with one blocking problem before merging import-from-s3: The commit history when importing from a versioned bucket is too long. It needs to find the point in the versioned import that has already been committed and avoid committing it again. Have started on that, but didn't get all the way today.

Also, this S3 import feature should be able to be used with anonymous S3 access to a bucket, and indeed that might be more common than wanting to import from a bucket you own or have credentials to allow access to. But the S3 remote does not currently try to use anonymous S3 access, so supporting that will need some more changes.

(Keyboard is fixed, yay!)

Posted Tue Apr 23 20:37:38 2019

Despite struggling with a keyboard controller that's increasingly prone to flaking out and not registering some key presses while doubling others, I managed to finish implementing import from versioned S3 buckets. It's quite nice to see it download past versions of files and construct a git history.

Still enough unimplemented stuff and bugs to need to work on this for probably one more day.

(Imagine here me struggling for a full minute to :wq)

Posted Fri Apr 19 19:21:44 2019

Started today on git annex import from S3, in the "import-from-s3" branch.

It looks like I'm going to support both versioned and unversioned buckets; the latter will need --force to initialize since it can lose data.

One thought I had about that is: It's probably better for git-annex to be able to import data from an unversioned S3 bucket with caveats about avoiding unsafe operations (export) that could lose data, than it is for git-annex to not be able to import from the bucket at all, guaranteeing that past versions of modified files will be lost. (Rationalization is a powerful drug.)

To support unversioned buckets, some kind of stable content identifier is needed other than the S3 version id. Luckily, S3 has etags, which are md5sum of the content, so will work great. But, the aws haskell library needs one small change to return an etag, so this will be blocked on that change.

I've gotten listing importable contents from S3 working for unversioned buckets, including dealing with S3's 1000 item limit by paging. Listing importable contents from versioned buckets is harder, because it needs to synthesize a git version history from the information that S3 provides. I think I have a method for doing this that will generate the trees that users will expect to see, and also will generate the same past trees every time, avoiding a proliferation of git trees. Next step: Converting my prose description of how to do that into haskell.
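Paging past a per-request item limit generally looks like the loop below, sketched in Python. The fetch_page callable and its marker convention are assumptions for illustration and are not the aws library's API.

```python
def list_all(fetch_page):
    """Yield every item from a paged listing.

    fetch_page(marker) -> (items, next_marker); next_marker is None
    when there are no more pages. The first call gets marker=None.
    """
    marker = None
    while True:
        items, marker = fetch_page(marker)
        yield from items
        if marker is None:
            break
```

Written as a generator, the listing can be consumed as it arrives instead of waiting for every page to be fetched.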

Posted Thu Apr 18 20:38:56 2019

It was not very hard to get git annex import working with adb special remotes. This is a nice alternative to installing git-annex on an Android device for syncing with it. See android sync with adb.

I'm still thinking about supporting import from special remotes that can't avoid most race conditions. But for adb, the only race conditions that I couldn't avoid are reasonably narrow, nearly as narrow as git checkout's own race conditions, with only the added overhead of adb. So I let them slide.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Tue Apr 9 22:01:16 2019

Had not looked at bug reports in over a month, so did some triage. No particularly bad new bugs have shown up, but there were lots of interesting lesser bugs, and I fixed 6 of them today.

Posted Mon Mar 18 20:44:08 2019

Whew, I got the finishing touches on the import tree feature, and it's merged into master! Still work to do on that, particularly supporting more interesting special remotes than directory. It may be two or three weeks until I get back to working on it.

Even with only directory special remotes, some nice things can be done with this. Like bi-directional syncing between a removable drive with no git-annex repository on it, and a subdirectory of the git-annex repository:

$ git config master:subdir
$ date > /mnt/new-from-drive
$ date > subdir/new-from-repo
$ git annex add
add subdir/new-from-repo ok
$ git annex sync --content
 1 file changed, 1 insertion(+)
 create mode 120000 subdir/new-from-repo
list drive ok
import drive new-from-drive ok
update refs/remotes/drive/master ok

Merge made by the 'recursive' strategy.
 subdir/new-from-drive | 1 +
 1 file changed, 1 insertion(+)
 create mode 120000 subdir/new-from-drive
export drive new-from-repo ok

Posted Mon Mar 11 18:47:09 2019

Two more days into ?import tree and I'm running out of steam.

However.. Imports from the same special remote into several clones of a repository are now fully working!

The hard part should be behind me now. Hoping to merge it on Monday after improving the command's output a bit.

Posted Thu Mar 7 20:16:02 2019

Past two days have been spent making ?import tree interoperate safely with git annex export. This was more complicated and needed more methods to be added to the remote API than I had expected.

At this point, the directory special remote's implementation is no longer an unsafe prototype, but detects conflicting file modifications and avoids overwriting them when exporting to the directory.

Here it is in action:

joey@darkstar:/tmp/testrepo> git annex unlock foo
joey@darkstar:/tmp/testrepo> echo version from git > foo
joey@darkstar:/tmp/testrepo> echo version from special remote > ../dir/foo
joey@darkstar:/tmp/testrepo> git annex add foo
joey@darkstar:/tmp/testrepo> git commit -m add
joey@darkstar:/tmp/testrepo> git annex export master --to dir
unexport dir foo failed
export dir foo failed
(recording state in git...)
git-annex: export: 2 failed
joey@darkstar:/tmp/testrepo> git annex import master --from dir
import dir ok
update refs/remotes/dir/master ok
(recording state in git...)
joey@darkstar:/tmp/testrepo> git merge dir/master
Auto-merging foo
CONFLICT (content): Merge conflict in foo
Automatic merge failed; fix conflicts and then commit the result.
joey@darkstar:/tmp/testrepo> echo merged version > foo
joey@darkstar:/tmp/testrepo> git annex add foo
joey@darkstar:/tmp/testrepo> git commit -m resolved
joey@darkstar:/tmp/testrepo> git annex export master --to dir
unexport dir foo ok
export dir foo ok
(recording state in git...)

The feature is close to being mergeable to master now, but still needs some work on the progress display of git annex import, and on supporting imports from the same special remote to different git repos.

Posted Tue Mar 5 21:23:28 2019

Still working on ?import tree today. I decided to make the imported commit be on an unrelated history from the main branch, which avoids some very surprising merge behavior, but means users will need to pass --allow-unrelated-histories to git merge.

Also got export and sync commands updating the remote tracking branch. It was surprisingly complicated to do.

With that done, I've tested exporting to a directory remote, then making changes to the directory manually, and importing, and it all works together.

Posted Fri Mar 1 20:59:33 2019

Yesterday I implemented the command-line interface for git annex import branch --from remote and today I got a prototype of it working with the directory special remote. Still a whole lot to do before this feature is ready for release, but it's good to have the command line interface working to play with it. The workflow feels pretty good:

joey@darkstar:~/tmp/t> echo hello > ../dir/new_file
joey@darkstar:~/tmp/t> git annex import master --from dir
import dir ok
update refs/remotes/dir/master ok
(recording state in git...)
joey@darkstar:~/tmp/t> git merge dir/master
Updating d3277e2..410aa8e
 new_file | 1 +
 1 file changed, 1 insertion(+)
 create mode 120000 new_file
joey@darkstar:~/tmp/t> cat new_file

My laptop's keyboard is dying; the S and X keys often don't register. Is making programming feel very clumsy, and there are far too many S's in the code I've been working on. ;-)

Posted Wed Feb 27 20:50:44 2019

Yesterday, got the git tree generation done and working. The main import tree code is also implemented, though it may need some fine tuning.

Today I've been working on firming up user interface design and documentation. Turns out that import tree is going to lead to some changes to export tree. A remote tracking branch will be updated by both export tree and import tree, since those operations are similar to git push and git fetch. And git annex export --tracking will be deprecated in favor of a git config setting that configures both import and export.

Posted Sat Feb 23 20:04:26 2019

Not a lot of progress on ?import tree today I feel..

Started off by adding a QuickCheck test of the content identifier log, which did find one bug in that code.

Then started roughing out the core of the importing operation, which involves building up git trees for the files that are imported. But that needs a way to graft an imported tree into a subdirectory of another tree, and the only way I had available to do it needed to read in the entire recursive tree of the current branch, which would be slower and use more memory than I like.

So, got sidetracked building a git tree grafter. It turns out that the export tree code also needs to graft a tree (into the git-annex branch), and did so using the same inefficient method that I want to avoid, so it will also be able to be improved using the grafter.
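The idea behind a grafter can be sketched in Python, with nested dicts standing in for git trees. This is only an illustration under that assumption, not git-annex's Haskell code; the graft function and the dict layout are hypothetical. The point it shows is that only the trees along the path to the graft point get rebuilt, while sibling subtrees are reused without ever being read.

```python
def graft(tree, path, subtree):
    """Return a new tree with `subtree` placed at `path` (a list of names).

    Trees are modelled as dicts mapping names to either blobs (str) or
    subtrees (dict). Only the dicts along `path` are copied; every other
    subtree is shared by reference and never traversed.
    """
    if not path:
        return subtree
    head, rest = path[0], path[1:]
    child = tree.get(head)
    if not isinstance(child, dict):
        child = {}  # missing directory (or a file in the way): start fresh
    new_tree = dict(tree)  # shallow copy of this level only
    new_tree[head] = graft(child, rest, subtree)
    return new_tree


# Hypothetical example data: graft an imported tree under docs/imported.
branch = {"docs": {"README": "blob1"}, "big": {"a": "blob2", "b": "blob3"}}
imported = {"new_file": "blob4"}
result = graft(branch, ["docs", "imported"], imported)
```

Here the "big" subtree in result is the very same object as in branch, which is the memory and time saving compared with reading in the entire recursive tree.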

Unfortunately, I had to stop for the day with the grafter not quite working properly.

Posted Thu Feb 21 21:46:19 2019

Started building import tree (in the importtree branch). So far the content identifier storage in the git-annex branch is done. Since the API tells me it will need to both map from a key to content identifiers, and from content identifier to the key, I also added a sqlite database to handle the latter.
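A minimal sketch of the reverse lookup, using Python's sqlite3 module. The table name, column names and helper functions are all made up for illustration and are not git-annex's actual schema; the forward map (key to content identifiers) is assumed to live elsewhere, as described above.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the (hypothetical) content identifier database."""
    db = sqlite3.connect(path)
    # The UNIQUE constraint doubles as an index whose prefix (remote, cid)
    # serves the content identifier -> key lookup.
    db.execute(
        "CREATE TABLE IF NOT EXISTS content_identifiers ("
        " remote TEXT NOT NULL, cid TEXT NOT NULL, key TEXT NOT NULL,"
        " UNIQUE (remote, cid, key))"
    )
    return db


def record(db, remote, cid, key):
    """Remember that `cid` on `remote` corresponds to `key`."""
    db.execute(
        "INSERT OR IGNORE INTO content_identifiers VALUES (?, ?, ?)",
        (remote, cid, key),
    )


def keys_for_cid(db, remote, cid):
    """Return every key recorded for a content identifier, sorted."""
    rows = db.execute(
        "SELECT key FROM content_identifiers WHERE remote = ? AND cid = ?",
        (remote, cid),
    )
    return sorted(row[0] for row in rows)


db = open_db()
record(db, "dir", "cid-1", "SHA256E-s6--aaaa")
record(db, "dir", "cid-1", "SHA256E-s6--aaaa")  # duplicate is ignored
```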

While implementing that, I happened to notice a bug in storage of metadata that contains newlines; internals said that would be base64'd, but it was not. That bug turns out to have been introduced by the ByteString conversion in January, and it's the second bug caused by that conversion. The other one broke git-annex on Windows, which was fixed by a release yesterday.

Posted Wed Feb 20 21:36:22 2019

Not a lot of coding the past few days, but a lot of skull sweat!

I've been working through the design for the import tree feature, and I think I finally have a design that I'm happy with. There were some very challenging race conditions, and so import tree may only be safely implementable for a few remotes: S3 (with versioning enabled), directory, maybe webdav and I hope adb. Work on this included finding equivalent race conditions in git's update of the worktree, which do turn out to exist if you go looking, but have much narrower time windows there.

And I'll be running a tutorial for people who want to learn about git-annex internals at the code level, to start development or be better able to design their own features. That's in Montreal, March 26th-27th (8 hours total), hosted at McGill university. There may be one or two seats left, so if you are interested in attending, please get in touch with me by email. Haskell is not a prerequisite.

Posted Wed Feb 13 20:39:07 2019

The 2018 user's survey is closed, time for a look at the results. Several of the questions were also on the two past surveys, so we can start to look at historical trends as well.

Very similar numbers of people responded in 2018 as in 2015. The 2013 survey remains a high water mark in participation. My thoughts on the 2015 survey participation level mostly still stand, although there has been a consistent downwards trend in Debian popcon since 2015.

Also interesting that several people skipped the first question on the survey, perhaps because it was a fairly challenging question? And later questions saw much higher response rates this time than in either of the previous surveys, thanks to improvements in the survey interface.


v7 unlocked files are being used by 7% of users, pretty impressive uptake for a feature that has only been really finished for a couple of months. Direct mode is still used by 7% of users, while its v7 replacement of adjusted unlocked branches is only used by 1% so far. That's still some decent progress toward eliminating the need for direct mode.

command line vs assistant

Well that's plain enough isn't it? Although note that I myself have the assistant running in some repos all the time, but would of course vote "command line" since I interact with that much more.

Also notice that the group of people who apparently don't use git-annex but wanted to fill out the survey anyway was the same size for 2013-2015, but has now declined.

operating system

Android users have more or less gone away since I deprecated the app. I hope the termux integration brings some back.

how git-annex is installed

Good to see the increase in using git-annex packages from the OS or a third-party package manager.

missing/incomplete ports

Good improvement here since 2015 with 60% now satisfied with available ports.

Worth noting that in 2013, 6% wanted a way to use git-annex on Synology NAS. That is possible now via the standalone linux tarball. This year, 2% wanted "Synology NAS (app store package)".

Also honorable mention to the anonymous person who rewrote git-annex in another language. You should release the code!

number of repositories

Increasingly users seem to have just a couple repositories or a large number, with the middle ground shrinking. A few percent have 200+ repositories now. The sense is of a split between casual users who perhaps clone one repository to a few places, and power users who are adding new repositories over time.

data stored in git-annex

Increasing growth in the high end with many users storing dozens of terabytes of data in git-annex and a couple storing more than 64 terabytes. And a bit of growth in the low end storing under 100 gb.

The total data stored in git-annex looks to be around 650-1300 terabytes now. It was around 150-300 terabytes in 2013. That doesn't count redundant data. And it could be off slightly if shared repositories were reported by multiple users.

(Compare with the Internet Archive, which was 15000 terabytes in 2016 but I think they keep two copies of everything, so call it 7000 terabytes of unique data.)

git level

The same question was asked in the git surveys so I have included those in the graph for comparison.

git-annex users trend more experienced than git users, which is not surprising. You have to know some stuff about git to understand why you'd want to use git-annex.

Notice that git knowledge level is generally going up over time in both surveys.

happiness with the software

A similar question on the git survey included for comparison.

There's a bimodal distribution to git-annex users' happiness, with more unhappy with it than with git, but also more so happy they gravitate toward extreme praise.

There seem to be more unhappy users in 2018 than in 2015 though. The 2018 results are very close to the 2013 results.

blocking problems

Notably 15% of users now find git-annex too hard to use, up from 5% in 2015. Which seems to correlate with some users being more unhappy with it. I don't think git-annex has gotten any harder to use, so this must reflect a change in expectations and/or demographics. (2013 had similar numbers to 2018.)

Very few complain about the documentation now, down to 3% from 13% in 2015, but 12% want to see more tutorials showing how to tie the features together.

And a staggering 21% picked a write-in, "no issues personally, but people don't see (or realize they need) the immense benefits it provides". Need to find better ways to market git-annex, essentially.

size of group using git-annex together

A similar distribution to 2015. One person said they're using git-annex in a group of 50+, and 5 reported groups larger than 10 people.

scientific data

A new high of 11% of respondents are using git-annex to store scientific data. (Other kinds of data it's used for seem more or less the same.)

Part of that growth is because of the companion 2018 git-annex scientific data survey which was promoted in some scientific communities, and so brought more scientists to the main survey.

The use for neuroscience is no surprise, but so much use for astronomy and physics is. And "other" in that pie chart includes statistics, social sciences, mathematics, education, linguistics, biomedical engineering, EE, and physiology -- wow!

survey reach

All participants in the science survey did go on to answer at least part of the main survey. So 37% of respondents to the main survey are scientists.

A full 27% of survey respondents have their name on the thanks page, many for financial support. Which is really great, but also speaks to the fraction of the git-annex user base who saw the survey, because I really doubt that a quarter of the users of any free software are financially supporting it.

As with any online survey, the results are skewed by who bothers to answer it. Still, a lot of useful information to mull over.

Posted Fri Feb 1 18:49:13 2019

Started off the day with some more improvements and bug fixes for export remotes.

Then I noticed that there is no progress displayed for transfers to export remotes; it seems I forgot to wire that up. That really ought to be handled by the special remote setup code, the same way it is for non-export remotes. But it was not possible to do it there the way that export actions are structured.

I got sidetracked with how S3 prepares a handle to the server. That didn't work as well as it might have; most of the time each request to the remote actually prepared a new handle, rather than reusing a single handle. Though the http connection to the server did get reused, that still caused a lot of unnecessary work. I fixed that, and the fix also allowed me to restructure export actions in the way I need for progress bars.

I've run out of time to finish adding the missing progress bars today, so I'll do it tomorrow.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Jan 30 20:37:01 2019

Today's release fixes a data loss bug that affects S3 remotes configured with exporttree=yes that got versioning=yes turned on after some unversioned data was stored in them. If you use the new versioning=yes feature with S3, please upgrade.

Also, there are only two days left to fill out the git-annex user survey if you have not already.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Tue Jan 29 19:27:22 2019

After a long struggle with the test suite, the new git-annex release is finally out today.

A few last-minute changes in the release include removing from the webapp since their webdav gateway is EOL at the end of the month, supporting armv71 in the android installation script, allowing installation with 64 bit git on windows, and shortening the estimated time to completion display.

Today's work was supported by Trenton Cronholm on Patreon.

Posted Tue Jan 22 19:49:17 2019

Offline today due to weather, but there's lots of nice backlog to work on...

I've written down an external remote querying transition plan. If you maintain an external special remote that implements WHEREIS or GETINFO, please take a look as your code would need to be updated if this is done.

Ilya suggested making git annex testremote be able to test readonly remotes, and I implemented that.

There was a discussion in the forum about .git/annex/misctmp/ containing cruft left by an interrupted git-annex process. I was surprised to find half a gigabyte of old files on my own laptop due to this problem. I've put in a fix, so git-annex will clean up such temp files that were left behind by a previous interrupted git-annex process.

Posted Fri Jan 18 15:55:37 2019

I said I was going to stop with the ByteString conversion, but then I looked at profiling, and I knew I couldn't stop there -- conversion between String and ByteString had become a major cost center.

So today, converted all the code that reads and parses symlinks and pointer files to ByteString, now ByteString is used all the way from disk to Key. Also put in some caching, so git-annex does not need to re-serialize a Key that it's just deserialized from a ByteString.

There's still some ByteString to String conversion when generating FilePaths; to avoid that will need an equivalent of System.FilePath that operates on RawFilePath, and I don't think there is one yet? But the profiling does show improvement, it's more and more dominated by IO operations that can't be sped up, and less by slow code.

This really does feel like a stopping place now.

Updated benchmarks (compared to last git-annex release):

find on 10000 files, none present... 8% speedup
whereis on 1000 files............... 12% speedup
info on dir with 1000 files......... 7% speedup
local get ; drop of 1000 files...... 4% speedup
setting metadata in 1000 files...... 8% speedup
getting metadata from 1000 files.... 7% speedup
finding a single file out of 1000 that has a given metadata value... 8% speedup

Posted Mon Jan 14 23:01:07 2019

Today worked on converting the Key data type to use ByteString.

Microbenchmarks of Keys improved, especially parsing them got 700% faster. But key parsing is not enough of an overhead in any commands I benchmarked to be a real improvement.

The new key parser is much stricter than the old one, which helps the speed. Hopefully the oddly formatted edge cases that the old parser allowed are not really in use; they include keys with fields out of the usual order, and keys with multiple values for the same field.
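To make the strictness concrete, here is a rough Python sketch of a parser that rejects both kinds of edge case. It is not the real attoparsec parser. It assumes a key layout of BACKEND-field-field--NAME, and the field letters and their order (s, m, S, C) are an assumption made for this example.

```python
FIELD_ORDER = "smSC"  # assumed usual order of the optional fields


def parse_key(serialized):
    """Return (backend, fields, name), or None if the key is malformed."""
    head, sep, name = serialized.partition("--")
    if not sep or not name:
        return None
    backend, *raw_fields = head.split("-")
    if not backend:
        return None
    fields = {}
    last = -1
    for raw in raw_fields:
        # Each field is one known letter followed by ASCII digits.
        if len(raw) < 2 or raw[0] not in FIELD_ORDER:
            return None
        if not (raw[1:].isascii() and raw[1:].isdigit()):
            return None
        position = FIELD_ORDER.index(raw[0])
        if position <= last:  # out of order, or the same field repeated
            return None
        last = position
        fields[raw[0]] = int(raw[1:])
    return backend, fields, name
```

Because the position check only ever moves forward, the parser makes a single pass and never has to backtrack or merge duplicate values.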

The next step would probably be to convert the git interface to use ByteStrings, and that plus the current groundwork is likely to lead to some real performance improvements. But I'm going to stop here with the ByteString conversion for now.

Posted Fri Jan 11 21:26:10 2019

Spent two days converting all code that deals with git-annex branch log files to use attoparsec and bytestring builders.

For most of them, I'm not expecting much if any speed improvements, since often git-annex only ever parses a given log file once, and writes to many log files are only done rarely. The main candidates for speedup are chunk logs and remote state logs. Also Group was converted to a ByteString, which may speed up queries that involve groups. I have not benchmarked. It was still worth doing the conversion, for consistency and better code if not speed.

I found a few bugs in the old parsers for log files along the way. The uuid.log parser was not preserving whitespace in repository descriptions; the new one will. And the activity.log parser filtered out unknown values, not leaving room for expansion.

Posted Thu Jan 10 21:45:35 2019

Continuing with the ByteString conversion marathon, today worked on converting the metadata types. Actually, metadata field names made more sense to change to Text, since they're limited to a subset of utf-8.

I lost 2 hours to a puzzling quickcheck failure of metadata serialization. It turned out to involve unicode non-breaking spaces. Aaargh. Otherwise, fairly straightforward changes, but metadata is used all over git-annex, so the final patch was nearly 1000 lines.

Benchmark time:

setting metadata in 1000 files...... 1% speedup
getting metadata from 1000 files.... 0.5% speedup
finding a single file out of 1000 that has a given metadata value... 5% speedup

Posted Mon Jan 7 20:37:05 2019

I've been benchmarking whole calls to eg git annex whereis, and that's not ideal because git-annex has some startup overhead that I'm not interested in benchmarking (right now), and often that overhead swamped the things I wanted to benchmark, making it difficult to trust my results.

So, I've built a git annex benchmark command, that can benchmark any other git-annex commands, without starting a new git-annex process. It uses criterion to get statistically meaningful benchmark results. And operations as fast as 10 ms can be benchmarked now, without needing to write any special purpose benchmark code.

New results for this week's optimisations:

whereis on 1000 files........... 5% speedup
whereis on 1 file............... 14% speedup
info on dir with 1000 files..... 4% speedup
local get ; drop of 1000 files.. 3% speedup

Posted Fri Jan 4 19:12:12 2019

Converted git-annex branch access to use ByteStrings, with support also for writing to it using bytestring-builder, which is supposed to be faster. Finished both an attoparsec parser and a builder for the location logs. All the other logs just convert to and from String for now, so there is still a lot of work to do.

The git annex whereis benchmark looks to be around 6% total speedup now, so this only improved it by a few percent, but these little speedups are adding up.

Writing to the git-annex branch may also have sped up significantly; the builder is probably able to stream out to git without doing any internal copies. But there are not many cases where git-annex does a lot of writes to the branch without some other operation that is much more expensive, so I don't anticipate much speed improvement on that side.

Posted Thu Jan 3 20:12:43 2019

Built an attoparsec parser for timestamps, and unsurprisingly it's 15 times faster parsing ByteStrings with it than the old String parser. The surprising thing to me was that converting a String to a ByteString and using the new parser is 10 times as fast as the old parser despite the conversion overhead. A nice immediate speedup for many parts of git-annex!

Of course timestamp parsing is not a major cost center in git-annex, but benchmarking git annex whereis run on 1000 files, there is a real speedup already, approximately 4%.

Posted Wed Jan 2 20:18:45 2019

Starting the new year with new git-annex development funding for most of the things on the roadmap!

Today was spent converting the UUID data type to use a ByteString, rather than a String, and also converting repo descriptions to ByteString. That's groundwork for reading and writing log files on the git-annex branch using attoparsec and ByteString builders, which will hopefully improve performance.

Until that's complete, it will often convert a String to a ByteString and then back to a String, which could actually make performance slightly worse. Benchmarking git annex whereis doesn't find much of a change. It may have gotten slightly faster overall, due to the faster Eq and Ord instances making the map of repositories faster.

Posted Tue Jan 1 20:42:42 2019

Fixed several bugs involving upgrade to v7 when the git repository already contained v7 unlocked files. The worst of those involved direct mode and caused the whole file content to get checked into git. While that's a fairly unusual case, it's an ugly enough bug that I rushed out a release to fix it.

Also, LWN has posted a comparison of git-annex and git LFS.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Tue Dec 11 20:36:38 2018

Snowed in and without internet until now, I've been working on the backlog. This included adding git annex find --branch and adding support for combining options like --include, --largerthan etc with --branch.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Sun Dec 9 18:35:51 2018

The survey is now live.

I had said I was going to clear votes to the draft survey, but for technical reasons I decided not to. So if you already voted you don't need to vote again, but if you made any joke votes please go change your vote. (I removed the proposed port of git-annex to the TOPS-20.)

Posted Sat Dec 1 16:24:21 2018

It's been several years since the last git-annex user survey, and I've put together a new one. This is only a draft, any votes made at this point won't count. I wanted to get more eyes on it before it goes live to get whatever feedback and ideas you have on the content of the survey.

Posted Wed Nov 28 15:33:17 2018

I fixed two reversions yesterday (neither related to v7 repos) during a day of triage in preparation for the release of git-annex 7.

One of the reversions broke adding remotes in the webapp, and was filed all the way back in January with lots of confirmations. I feel bad I didn't get around to even looking at that bug report until now.

My backlog is kind of large, it hovers around 400 messages most of the time now, there needs to be a better way to make sure I notice such bad bugs. Would someone like to help with git-annex bug triage, picking out bugs that multiple users have confirmed, or that have good instructions to reproduce them, and helping me prioritize them? No coding required, massive contribution to git-annex. Please get in touch.

Anyway, after that full day's work, I took a look at the autobuilders, and it was bad; the test suite was failing everywhere testing v7. For quite a while I've been seeing intermittent test suite failures involving the new repo version, that mostly only happened on the autobuilders. But now they were more reproducible; a recent change made them happen much more frequently. That was good; it made it easier to track down the problem.

Which was that git-annex was getting mtime information with 1 second granularity. So when the test suite modified a file several times in the same second, git-annex could fail to notice some of the modifications. I think when I originally developed the inode cache module in 2013, for direct mode, there was no easy way to access high-precision mtimes from Haskell, but there is now, and git-annex will use them.
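The failure mode is easy to reproduce in miniature. The Python sketch below is only an illustration with made-up timestamps; looks_modified is a hypothetical helper, and it compares mtimes alone rather than everything a real inode cache would hold. In Python, os.stat(path).st_mtime_ns is one way to read a nanosecond mtime.

```python
NS_PER_SECOND = 1_000_000_000


def looks_modified(cached_mtime_ns, current_mtime_ns, granularity_ns=1):
    """Compare two mtimes after truncating both to the given granularity."""
    return (cached_mtime_ns // granularity_ns) != (current_mtime_ns // granularity_ns)


# Two writes 250 ms apart, both within the same second (made-up values).
first_write = 1_540_937_482 * NS_PER_SECOND + 100_000_000
second_write = first_write + 250_000_000

missed = not looks_modified(first_write, second_write, NS_PER_SECOND)
noticed = looks_modified(first_write, second_write)
```

With one-second granularity the second write is missed, while the full-precision comparison notices it.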

That left one other failure in the test suite, an intermittent crash of sqlite with ErrorIO on Linux. May be related to the known sqlite crashes in WSL. I've been trying various things today to try to fix it, but have to run the test suite in a loop for several hours to reproduce it reliably.

Posted Tue Oct 30 22:11:22 2018

In the delaysmudge branch, I've implemented the delayed worktree update in the post-merge/post-checkout hooks for v6. It works very well!

In particular, with annex.thin set, checking out a branch containing a huge unlocked file does a fast hard link to the file.

Remaining problem before merging that is, how to get the new hooks installed? Of course git annex init and git annex upgrade install them, but I know plenty of people have v6 repositories already, without those hooks.

So, would it be better to bump up to v7 and install the hooks on that upgrade, or stay on v6 and say that it was, after all, experimental up until now, and so the minor bother of needing to run git annex init in existing v6 repositories is acceptable? If the version is bumped to v7, that will cause some pain for users of older versions of git-annex that won't support it, but those old versions also have pretty big gaps in their support for v6. I'm undecided, but leaning toward v7, even though it will also mean a lot of work to update all the documentation, as well as needing changes to projects like datalad that use git-annex. Feedback on this decision is welcomed below...

Posted Thu Oct 25 20:52:26 2018

Dreadfully early this morning I developed a plan for a way to finish the last v6 blocker, that works around most of the problems with git's smudge interface. The only problem with the plan is that it would make both git stash and git reset --hard leave unlocked annexed files in an unpopulated state when their content is available. The user would have to run git-annex afterwards to fix up after them. All other git checkout, merge, etc commands would work though.

Not sure how I feel about this plan, but it seems to be the best one so far, other than going off and trying to improve git's smudge interface again. I also wrote up git smudge clean interface suboptiomal which explains the problems with git's interface in detail.

Posted Mon Oct 22 20:30:29 2018

Goal for today was to make git annex sync --content operate on files hidden by git annex adjust --hide-missing. However, this got into the weeds pretty quickly due to the problem of how to handle --content-of=path when either the whole path or some files within it may be hidden.

Eventually I discovered that git ls-files --with-tree can be used to get a combined list of files in the index plus files in another tree, which in git-annex's case is the original branch that got adjusted. It's not documented to work the way I'm using it (worrying), but it's perfect, because git-annex already uses git ls-files extensively and this could let lots of commands get support for operating on hidden files.

That said, I'm going to limit it to git annex sync for now, because it would be a lot of work to make lots of commands support them, and there could easily be commands where supporting them adds lots of complexity or room for confusion.

Demo time:

joey@darkstar:/tmp> git clone ~/lib/sound/
Cloning into 'sound'...
Checking out files: 100% (45727/45727), done.
joey@darkstar:/tmp> cd sound/
joey@darkstar:/tmp/sound> git annex init --version=6
init  (merging origin/git-annex origin/synced/git-annex into git-annex...)
(scanning for unlocked files...)
joey@darkstar:/tmp/sound> git annex adjust --hide-missing
Switched to branch 'adjusted/master(hidemissing)'
joey@darkstar:/tmp/sound#master(hidemissing)> ls
joey@darkstar:/tmp/sound#master(hidemissing)> ls podcasts
joey@darkstar:/tmp/sound#master(hidemissing)> git annex sync origin --no-push -C podcasts
joey@darkstar:/tmp/sound> time git annex adjust --hide-missing
15.03user 3.11system 0:14.95elapsed 121%CPU (0avgtext+0avgdata 93280maxresident)k
0inputs+88outputs (0major+12206minor)pagefaults 0swaps
joey@darkstar:/tmp/sound#master(hidemissing)> ls podcasts
Astronomy_Cast/                                     Hacking_Culture/
Benjamen_Walker_s_Theory_of_Everything/             In_Our_Time/
Clarkesworld_Magazine___Science_Fiction___Fantasy/  Lightspeed_MagazineLightspeed_Magazine___Science_Fiction___Fantasy/
DatCast/                                            Long_Now__Seminars_About_Long_term_Thinking/
Escape_Pod/                                         Love___Radio/
Gravy/                                              feeds

Close to being able to use this on my phone. ;-)

Posted Fri Oct 19 21:57:41 2018

At long last there's a way to hide annexed files whose content is missing from the working tree: git-annex adjust --hide-missing

And once you've run that command, git annex sync will update the tree to hide/unhide files whose content availability has changed. (So will running git annex adjust again with the same options.)

You can also combine --hide-missing with --unlock, which should prove useful in a lot of situations.

My implementation today is as simple as possible, which means that every time it updates the adjusted branch it does a full traversal of the original branch, checks content availability, and generates a new branch. So it may not be super fast in a large repo, but I was able to implement it in one day's work. It should be possible later to speed it up a lot, by maintaining more state.
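In outline, that full traversal amounts to filtering the original tree by content availability. The Python sketch below shows the shape of it, with nested dicts standing in for git trees; hide_missing, the example tree and the key names are hypothetical, and the real code works on git trees and annex keys.

```python
def hide_missing(tree, is_present):
    """Return a copy of `tree` keeping only files whose content is present.

    `tree` maps names to either a key (str) or a subtree (dict), and
    `is_present` is a callable taking a key. Directories left with no
    visible files are dropped, since git trees cannot be empty.
    """
    visible = {}
    for name, entry in tree.items():
        if isinstance(entry, dict):
            subtree = hide_missing(entry, is_present)
            if subtree:
                visible[name] = subtree
        elif is_present(entry):
            visible[name] = entry
    return visible


# Hypothetical repository where only one file's content is available.
original = {
    "podcasts": {"a.mp3": "key1", "b.mp3": "key2"},
    "music": {"c.ogg": "key3"},
}
present = {"key1"}
adjusted = hide_missing(original, present.__contains__)
```

Every file is visited on every update, which is why this approach is simple but may be slow in a large repository.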

Today's work was sponsored by Ethan Aubin.

Posted Thu Oct 18 19:37:01 2018

No time to blog yesterday, but I somehow found the time to fix the second to last known major issue with v6 mode, a database inconsistency problem involving touching annexed files.

The only remaining blocker for v6 not being experimental is that git checkout of large unlocked files can use a lot of memory (and doesn't honor annex.thin).

Also I finally have a rough plan for how to hide missing files: Have git annex sync update the working tree to only show visible files. Still details to work out, but it would be great to finally get this often-requested feature.

Posted Wed Oct 17 20:32:31 2018

Pulled the trigger on the old Android builds, and made a massive commit removing all the cruft that had built up to enable them. Running in Termux is just better. It's important to note this does not mean I've given up on more native git-annex Android stuff, indeed there are promising developments in ghc Android support that I'm keeping an eye on.

I'll kind of miss the EvilSplicer, that was 750 lines of crazy code to be proud of. But really, it's going to be great to not have hanging over me the prospect that any change could break the Android build and end up needing tons of work to resolve.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Sat Oct 13 19:30:47 2018

I've improved the termux installation, adding an installer script to make it easier, and fixing some issues that have been reported. And it supports arm64 and also should work on Intel android devices. This feels very close to being able to remove the old deprecated Android apps.

I'm temporarily running the arm64 builds on my phone, in a Debian chroot. But it overheats, so this is a stopgap and it won't autobuild daily, only manually at release time.

Released git-annex 6.20181011.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Thu Oct 11 17:57:05 2018

Been making some improvements to git-annex export over the past couple days, but took time off this afternoon to set up a new phone, and try git-annex in termux on it. Luckily, I was able to reproduce the signal 11 on arm64 problem that several users have reported earlier, and also found a fix, which is simply to build git-annex for arm64.

So I want to set up an arm64 autobuilder, and if someone has an arm64 server that could host it, that would be great. Otherwise, I could use Scaleway, but I'd rather avoid that ongoing expense.

Also fixed a recent reversion in the linux standalone runshell that broke git-annex in termux, and probably on some other systems.

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Wed Oct 10 22:24:12 2018

So I've been catching up on backlog for a couple of days. Including reading all the old todos, and closing a bunch of them that turned out to have been implemented already.

Today I added an setting, fixed annex.web-options which was broken in the semi-recent security update, and fixed a very tricky bug in rmurl.

(What happened to the I was working on earlier this week? When I looked at the details, it was much more complicated than I had thought. Back burnered.)

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Thu Oct 4 21:54:11 2018

Started work on It's going slow, I had to start with a large refactoring. So far, option parsing is working, and a few commands are almost working, but concurrency is not working right, and concurrency is the main reason to want to support this (along with remote groups).

Today's work was supported by Jake Vosloo on Patreon.

Posted Mon Oct 1 20:18:06 2018

Unix would be better if filenames could not contain newlines. But they can, and so today was spent dealing with some technical debt.

The main problem with using git-annex with filenames with newlines is that git cat-file --batch uses a line-based protocol. It would be nice if that were extended to support -z like most of the rest of git does, but I realized I could work around this by not using batch mode for the rare filename with a newline. Handling such files will be slower than other files, but at least it will work.

Then I realized that git-annex has its own problems with its --batch option and files with newlines. So I added support for -z to every batchable command in git-annex, including a couple of commands that did batch input without a --batch option.
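Why a NUL delimiter fixes this can be shown with a few lines of Python. This is a hedged sketch of the general technique, not git-annex's implementation; split_batch_input is a hypothetical helper.

```python
def split_batch_input(data, nul_delimited):
    """Split batch input into filenames, by NUL or by newline."""
    separator = b"\0" if nul_delimited else b"\n"
    items = data.split(separator)
    if items and items[-1] == b"":
        items.pop()  # drop the empty piece after a trailing separator
    return items


odd_name = b"odd\nname.txt"  # a single filename containing a newline

by_line = split_batch_input(b"a.txt\n" + odd_name + b"\n", nul_delimited=False)
by_nul = split_batch_input(b"a.txt\0" + odd_name + b"\0", nul_delimited=True)
```

Line-based splitting breaks the odd filename into two bogus names, while NUL-based splitting keeps it whole, because a filename can never contain a NUL byte.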

Now git-annex should fully support filenames containing newlines, as well as anything else. The best thing to do if you have such a file is to commit it and then git mv it to a better name.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Thu Sep 20 20:23:30 2018

Well it took the whole day to finish the release. Including fixing a deadlock when the new v6 code runs with an older git, and some build errors. And there's still an intermittent test suite failure involving v6 on one autobuilder, which will need to be dealt with later.

This is a big release, lots of bug fixes, lots of v6 improvements, and significant S3 improvements.

Today's work was sponsored by Paul Walmsley on Patreon.

Posted Thu Sep 13 19:51:01 2018

I'm in release prep mode now, fixing build problems and a few bugs, but the v6 sprint is well over, though v6 still has its issues. Might as well release all the last month's work.

Yesterday was taken up with dealing with some very ugly git interface stuff that changed between versions. A July workaround for a bug in git turns out to have caused reversions with older versions, and was not a complete fix either. Tuned it to hopefully work better.

This was sponsored by Jake Vosloo on Patreon.

Posted Wed Sep 12 18:29:13 2018

Got git-annex downloading versioned files from S3, without needing S3 credentials. This makes an S3 special remote as capable as a git-annex repository exported over http, other than of course not including the git objects.

An example of this new feature:

git annex initremote s3 type=S3 public=yes exporttree=yes versioning=yes
git annex export --tracking master --to s3
git tag 1.0
# modify some files here
git annex sync --content s3

And then in a clone without the credentials:

git annex enableremote s3
git checkout 1.0
git annex get somefile

This is nice; I only wish it were supported by other special remotes. It seems that any special remote could be made to support it, but ones not supporting some kind of versioning would need to store each file twice, and many would also need each file to be uploaded to them twice. But perhaps there are others that do have a form of versioning. WebDAV for one has a versioning extension in RFC 3253.

Also did a final review of a patch Antoine Beaupré is working on to backport the recent git-annex security fixes to Debian oldstable, git-annex 5.20141125. He described the backport in his blog:

This time again, Haskell was nice to work with: by changing type configurations and APIs, the compiler makes sure that everything works out and there are no inconsistencies. This logic is somewhat backwards to what we are used to: normally, in security updates, we avoid breaking APIs at all costs. But in Haskell, it's a fundamental way to make sure the system is still coherent.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Thu Sep 6 21:31:59 2018

Back to being only crowdfunded now.

Several little things today, including a git-annex.cabal patch from fftehnik that fixed building without the assistant, and supporting AWS_SESSION_TOKEN. The main work was on making git annex drop --dead prune obsolete per-remote metadata, and on fixing a bug in v6 mode that left git-annex object files writable.

Today's work was sponsored by Paul Walmsley in honor of Mark Phillips.

Posted Wed Sep 5 21:27:51 2018

Finished this feature, and I'm liking it quite a lot! Though I had to put off support for public versioned S3 access until some other time as I'm all out of time and energy.

The storage of S3 version IDs got rethought -- I was not comfortable with using per-remote state in the git-annex branch which would have caused problems if dropping from these remotes later gets supported.

So, I added per-remote metadata in about 1 hour! It's like git-annex's regular metadata, but scoped so only the remote that owns it can see it. This is perfect for storing things like S3 version IDs. It probably ought to be added to the external special remote interface since it could be used for lots of stuff.

Here's how that looks when S3 version IDs are stored in it:

1535737778.867692782s 31ea6c94-fba3-4952-99b5-285ae192d92a:V +woYHK59DD2VUkJfg527mEBBqtCaPlSXn#myfile

This work was supported by the NSF-funded DataLad project.

Posted Fri Aug 31 18:18:48 2018

Most of the way done with implementing support for export to S3 buckets with versioning enabled. This will make the files from the most recent git annex export be visible to users browsing the bucket, while letting git-annex download any of the content from previous exports too.

Still need to test it. And, deletion of old content from such a bucket is not supported, and my initial thoughts are that it might not be possible in a multi-writer situation. I need to think about it more.

This work is supported by the NSF-funded DataLad project.

Posted Thu Aug 30 19:53:33 2018

Looked over bugs filed about v6 mode and did some triage and analysis. The smudge page has the details.

This led to changing what's done by git add and git commit file when annex.largefiles is not configured. Rather than behaving like git annex add and always storing the file in the annex, it will store it in the annex if the old version was annexed, and in git if the old version was stored in git. This avoids accidental conversions.

It might make sense to have git annex add also do this, even in v5 repositories, but I want to concentrate on v6 for now, and also don't think that git add and git annex add necessarily need to behave identically in v6 mode. While using git commit -a doesn't imply anything about whether you want the file in git or the annex, using git annex add seems to imply that you want it in the annex, unless you've gone out of your way to configure otherwise.

Also did some design work on supporting versioned S3 buckets with git-annex export.

This work is supported by the NSF-funded DataLad project.

Posted Mon Aug 27 19:10:27 2018

That's a lot of races! Well, 4 of them were all related, but the fixes to them had to be made in two different places.

Hopefully that's all the v6 races fixed. I've been finding these races by inspection, who knows if I missed some. Anyway, I'm now down to one todo item left on the v6 sprint. Gonna take a break for a couple days before tackling it.

This work is supported by the NSF-funded DataLad project.

Posted Wed Aug 22 20:13:23 2018

More v6 work. Got most of the way to a solution to the problem of updating the associated files database for staged changes to unlocked files, eg a git mv.

While writing the test case, I was surprised to find that the problem is timing dependent. If a git mv is run less than a second after git add, git runs the smudge filter for whatever reason, which avoids the problem. With a longer delay, it doesn't run the smudge filter. Seems this could be the cause of intermittent glitches with v6 mode, and I've seen a few such glitches before.

Anyway, I developed an inexpensive way to find the relevant staged changes, using git diff with a full page of options to tweak its behavior just right. Still need to make that only run when the index has changed, not every time git-annex runs.

There's still a race between a command like git mv and git annex drop/get, that can result in the unlocked file's content not being updated. Don't have a solution to that yet.

This work is supported by the NSF-funded DataLad project.

Posted Tue Aug 21 21:08:20 2018

Sleeping on that race from yesterday, I realized there is a way to fix it, and have implemented the fix. It doubled the overhead of updating the index, but that's worth it to not have a race condition to worry about.

This work is supported by the NSF-funded DataLad project.

Posted Fri Aug 17 20:05:06 2018

Found a better way to update the index after get/drop in v6 repositories. I was able to close all the todos around that.

Only problem is there is a race where a modification that happens to a file soon after get/drop gets unexpectedly staged by the index update. I made this race's window as small as I reasonably can. Fully fixing it would involve improvements to the git update-index interface, or another way to update the index.

Only two todos remain in smudge that I want to fix in the remainder of this v6 sprint.

This work is supported by the NSF-funded DataLad project.

Posted Thu Aug 16 20:40:31 2018

I've now fixed the worst problem with v6 mode, which was that get/drop of unlocked files would cause git to think that the files were modified.

Since the clean filter now runs quite fast, I was able to fix that by, after git-annex updates the worktree, restaging the not-really-modified file in the index.

This approach is not optimal; index file updates have overhead; and only one process can update the index file at one time. smudge has a bunch of new todo items for cases where this change causes problems. Still, it seems a lot better than the old behavior, which made v6 mode nearly unusable IMHO.

This work is supported by the NSF-funded DataLad project.

Posted Tue Aug 14 20:24:23 2018

Working on a "filterdriver" branch, I've implemented support for the long-running smudge/clean process interface.

It works, but not really any better than the old smudge/clean interface. Unfortunately git leaks memory just as badly in the new interface as it did in the old interface when sending large data to the smudge filter. Also, the new interface requires that the clean filter read all the content of the file from git, even when it's just going to look at the file on disk, so that's worse performance.

So, I don't think I'll be merging that branch yet, but git's interface does support adding capabilities, and perhaps a capability could be added that avoids it schlepping the file content over the pipe. Same as my old git patches tried to do with the old smudge/clean interface.

This work is supported by the NSF-funded DataLad project.

Posted Mon Aug 13 20:19:36 2018

Spent today implementing the git pkt-line protocol. Git uses it for a bunch of internal stuff, but also to talk to long-running filter processes.

This was my first time using attoparsec, which I quite enjoyed aside from some difficulty in parsing a 4 byte hex number. Even though parsing to a Word16 should naturally only consume 4 bytes, attoparsec will actually consume subsequent bytes that look like hex. And it may parse fewer than 4 bytes too. So my parser had to take 4 bytes and feed them back into a call to attoparsec. Which seemed weird, but works. I also used bytestring-builder, and between the two libraries, this should be quite a fast implementation of the protocol.
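The same "take exactly 4 bytes, then parse them" approach can be sketched in Python. This is only an illustration of the pkt-line framing (a 4 hex digit length that counts itself, with 0000 as the flush packet), not the Haskell implementation described above; the function names are hypothetical.

```python
import io

HEX_DIGITS = b"0123456789abcdefABCDEF"
MAX_PAYLOAD = 65516  # 65520 byte packet limit minus the 4 length bytes


def encode_pkt_line(payload: bytes) -> bytes:
    """Prefix payload with its total length (payload + 4) as 4 hex digits."""
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload too large for a single pkt-line")
    return b"%04x" % (len(payload) + 4) + payload


def read_pkt_line(stream):
    """Read one packet; return its payload, or None for a flush packet."""
    # Take exactly 4 bytes first, so that hex-looking payload bytes
    # can never be swallowed into the length.
    header = stream.read(4)
    if len(header) != 4 or any(c not in HEX_DIGITS for c in header):
        raise ValueError("malformed pkt-line length")
    length = int(header.decode("ascii"), 16)
    if length == 0:
        return None  # "0000" is the flush packet
    if length < 4:
        raise ValueError("invalid pkt-line length")
    payload = stream.read(length - 4)
    if len(payload) != length - 4:
        raise ValueError("truncated pkt-line payload")
    return payload


# One data packet followed by a flush packet.
stream = io.BytesIO(encode_pkt_line(b"command=clean\n") + b"0000")
first = read_pkt_line(stream)
second = read_pkt_line(stream)
```

Reading the length as a fixed-size slice sidesteps the problem of a greedy hex parser consuming too many or too few bytes.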

With that 300 lines of code written, it should be easy to implement support for the rest of the long-running filter process protocol. Which will surely speed up v6 a bit, since at least git won't be running git-annex over and over again for each file in the worktree. I hope it will also avoid a memory leak in git. That'll be the rest of the low-hanging fruit, before v6 improvements get really interesting.

This work is supported by the NSF-funded DataLad project.

Posted Fri Aug 10 20:21:30 2018

Plan is to take some time this August and revisit v6, hoping to move it toward being production ready.

Today I studied the "Long Running Filter Process" documentation in gitattributes(5), as well as the supplemental documentation in git about the protocol they use. This interface was added to git after v6 mode was implemented, and hopefully some of v6's issues can be fixed by using it in some way. But I don't know how yet. It's not as simple as using this interface as-is (it was designed for something different); it will take finding a creative trick using it.

So far I have this idea to explore. It's promising, might fix the worst of the problems.

Also, reading over all the notes in smudge, I finally checked and yes, git doesn't require filters to consume all stdin anymore, and when they don't consume stdin, git doesn't leak memory anymore either. Which let me massively speed up git add in v6 repos. While before git add of a gigabyte file made git grow to a gigabyte in memory and copied a gigabyte through a pipe, it's now just as fast as git annex add in v5 mode is.

This work is supported by the NSF-funded DataLad project.

Posted Thu Aug 9 22:35:37 2018

After the big security fix push, I've had a bit of a vacation. Several new features have also landed in git-annex though.

git-worktree support is a feature I'm fairly excited by. It turned out to be possible to make git-annex just work in working trees set up by git worktree, and they share the same object files. So, if you need several checkouts of a repository for whatever reason, this makes it really efficient to do. It's much better than the old method of using git clone --shared.

A new --accessedwithin option matches files whose content was accessed within a given amount of time. (Using the atime.) Of course it can be combined with other options, for example git annex move --to archive --not --accessedwithin=30d
There are a few open requests for other new file matching options that I hope to get to soon.
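The idea of matching on access time can be sketched in a few lines of Python. This is a rough illustration of an atime comparison, not git-annex's matcher; accessed_within is a hypothetical helper.

```python
import os
import tempfile
import time
from typing import Optional

DAY = 24 * 60 * 60


def accessed_within(path: str, seconds: float, now: Optional[float] = None) -> bool:
    """True if the file's atime is no more than `seconds` before `now`."""
    if now is None:
        now = time.time()
    return now - os.stat(path).st_atime <= seconds


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "somefile")
    with open(path, "w") as f:
        f.write("content")
    now = time.time()
    os.utime(path, (now - 40 * DAY, now))  # pretend the last access was 40 days ago
    stale = accessed_within(path, 30 * DAY, now)
    os.utime(path, (now - 1 * DAY, now))   # pretend the last access was yesterday
    fresh = accessed_within(path, 30 * DAY, now)
```

Note that filesystems mounted with noatime or relatime do not update the atime on every read, so a match like this is only as accurate as the atime the filesystem records.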

A small configuration addition, which makes git-annex try to get content from a remote even if its records don't indicate the remote contains the content, allows setting up an interesting kind of local cache of annexed files which can even be shared between unrelated git-annex repositories, with inter-repository deduplication.

I suspect that may also have other uses. It warps git-annex's behavior in a small but fundamental way which could let it fit into new places. Will be interesting to see.

There's also an annex.commitmessage config, which I am much less excited by, but enough people have asked for it over the years.

Also fixed a howler of a bug today: In -J mode, remotes were sorted not by cost, but by UUID! How did that not get noticed for 2 years?

Much of this work was sponsored by the NSF-funded DataLad project at Dartmouth College, as has been the case for the past 4 years. All told they've funded over 1000 hours of work on git-annex. This is the last month of that funding.

Posted Fri Aug 3 19:04:24 2018

Just released git-annex 6.20180626 with important security fixes!

Please go upgrade now, read the release notes for details about some necessary behavior changes, and if you're curious about the details of the security holes, see the advisory.

I've been dealing with these security holes for the past week and a half, and decided to use a security embargo while fixes were being developed due to the complexity of addressing security holes that impact both git-annex and external special remote programs. For the full story see the past 5 posts in this devblog, which are being published all together now that the embargo is lifted.

Posted Tue Jun 26 12:00:00 2018

Was getting dangerously close to burnt out, or exhaustion leading to mistakes, so yesterday I took the day off, aside from spending the morning babysitting the android build every half hour. (It did finally succeed.)

Today, got back into it, and implemented a fix for CVE-2018-10859 and also the one case of CVE-2018-10857 that had not been dealt with before. This fix was really a lot easier than the previous fixes for CVE-2018-10857. Unfortunately this did mean not letting URL and WORM keys be downloaded from many special remotes by default, which is going to be painful for some.

Posted Wed Jun 20 17:00:00 2018

Started testing that the security fix will build everywhere on release day. This is being particularly painful for the android build, which has very old libraries and needed http-client updated, with many follow-on changes, and is not successfully building yet after 5 hours. I really need to finish deprecating the android build.

Pretty exhausted from all this, and thinking what to do about external special remotes, I elaborated on an idea that Daniel Dent had raised in discussions about vulnerability, and realized that git-annex has a second, worse vulnerability. This new one could be used to trick a git-annex user into decrypting gpg encrypted data that they had never stored in git-annex. The attacker needs to have control of both an encrypted special remote and a git remote, so it's not an easy exploit to pull off, but it's still super bad.

This week is going to be a lot longer than I thought, and it's already feeling kind of endless..

Posted Tue Jun 19 20:00:00 2018

Spent several hours dealing with the problem of http proxies, which bypassed the IP address checks added to prevent the security hole. Eventually got it filtering out http proxies located on private IP addresses.

Other than the question of what to do about external special remotes that may be vulnerable to related problems, it looks like the security hole is all closed off in git-annex now.

Added a new page security with details of this and past security holes in git-annex.

Several people I reached out to for help with special remotes have gotten back to me, and we're discussing how the security hole may affect them and what to do. Thanks especially to Robie Basak and Daniel Dent for their work on security analysis.

Also prepared a minimal backport of the security fixes for the git-annex in Debian stable, which will probably be more palatable to their security team than the full 2000+ lines of patches I've developed so far. The minimal fix is secure, but suboptimal; it prevents even safe urls from being downloaded from the web special remote by default.

Posted Mon Jun 18 16:00:00 2018

Got the IP address restrictions for http implemented. (Except for http proxies.)

Unfortunately as part of this, had to make youtube-dl and curl not be used by default. The config has to be opened up by the user in order to use those external commands, since they can follow arbitrary redirects.

Also thought some more about how external special remotes might be affected, and sent their authors a heads-up.

Posted Sun Jun 17 16:00:00 2018

Most of the day was spent staring at the http-client source code and trying to find a way to add the IP address checks to it that I need to fully close the security hole.

In the end, I did find a way, with the duplication of a couple dozen lines of code from http-client. It will let the security fix be used with libraries like aws and DAV that build on top of http-client, too.

While the code is in git-annex for now, it's fully disconnected and would also be useful if a web browser were implemented in Haskell, to implement same-origin restrictions while avoiding DNS rebinding attacks.

Looks like http proxies and curl will need to be disabled by default, since this fix can't support either of them securely. I wonder how web browsers deal with http proxies, DNS rebinding attacks and same-origin? I can't think of a secure way.

Next I need a function that checks if an IP address is a link-local address or a private network address. For both ipv4 and ipv6. Could not find anything handy on hackage, so I'm gonna have to stare at some RFCs. Perhaps this evening, for now, it's time to swim in the river.
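For comparison, such a check can be sketched in Python with the standard ipaddress module. This is not the Haskell function that ended up in git-annex; is_restricted_address is a hypothetical name, and treating loopback and IPv4-mapped IPv6 addresses as restricted is an added assumption beyond the link-local and private cases mentioned above.

```python
import ipaddress


def is_restricted_address(text: str) -> bool:
    """True for loopback, link-local and private network addresses (IPv4 or IPv6)."""
    addr = ipaddress.ip_address(text)
    # An IPv4-mapped IPv6 address like ::ffff:127.0.0.1 should be judged
    # by the IPv4 address embedded in it.
    mapped = getattr(addr, "ipv4_mapped", None)
    if mapped is not None:
        addr = mapped
    return addr.is_loopback or addr.is_link_local or addr.is_private


checks = {
    text: is_restricted_address(text)
    for text in ["127.0.0.1", "169.254.1.1", "10.0.0.1", "fe80::1",
                 "fc00::1", "::ffff:127.0.0.1", "8.8.8.8"]
}
```

To be safe against DNS rebinding, a check like this has to run on the address that is actually connected to, not on a separate earlier lookup of the hostname.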

Today's work was sponsored by Jake Vosloo on Patreon

Posted Sat Jun 16 16:00:00 2018

I'm writing this on a private branch, it won't be posted until a week from now when the security hole is disclosed.

Security is not compositional. You can have one good feature, and add another good feature, and the result is not two good features, but a new security hole. In this case, CVE-2018-10857 and CVE-2018-10859. And it can be hard to spot this kind of security hole, but then once it's known it seems blindingly obvious.

It came to me last night and by this morning I had decided the potential impact was large enough to do a coordinated disclosure. Spent the first half of the day thinking through ways to fix it that don't involve writing my own http library. Then started getting in touch with all the distributions' security teams. And then coded up a fairly complete fix for the worst part of the security hole, although a secondary part is going to need considerably more work.

It looks like the external special remotes are going to need at least some security review too, and I'm still thinking that part of the problem over.


Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Fri Jun 15 19:00:00 2018

I'm unexpectedly preparing for a release soon, because the last release turned out to have a crasher bug when using a bare repository or --all, and a bug that prevented the webapp starting on OSX.

As well as fixing those, the new release will have several smaller improvements and fixes all done today. It's been a rather productive day.

And, using git-annex in Termux is now working even on newer versions of Android, that use seccomp filtering to filter out system calls that the ghc runtime uses. The proot program on Termux worked around that nasty problem.

The old Android app is now deprecated, and I'll probably remove it entirely within a few months unless I find a reason not to. So, I also closed almost all the old Android-specific bug reports today. I don't normally do mass bug closures without followup, but it was warranted here; almost all of those bugs are specific to the old Android app.

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Tue May 8 20:37:54 2018

I've long been unsatisfied with the amount of effort needed to maintain the Android port in its current state. The hacky cross-compiler toolchain needs days of wasted work to update, and is constantly out of date and breaking in one way or other. This sucks up any time I might spare to actually improve the Android port.

So, it was quite a surprise yesterday when I downloaded the git-annex standalone Linux tarball into the Termux Android shell and unpacked it, and it more or less worked!

The result, after a few minor fixes, works just as well as the git-annex Android app, and probably better. Even the webapp works well, and with the Termux:Boot app, it can even autostart the assistant on boot as a daemon. If you want to give it a try, see install on Android in Termux.

So, I am leaning toward deprecating the android port for this, removing 14 thousand lines of patches and android-specific code. Not going to do it just yet, but I feel a weight lifting...

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Apr 25 21:58:34 2018

After talking it over in move violates numcopies, we found a nicer compromise for git annex move. Rather than strictly enforcing numcopies, it avoids making any bad situations worse. For example, when there's only one copy of a file, it can be moved even if numcopies is higher. But, when numcopies is 2 and the source and destination repos have a copy, move will not drop from the source repo, since that would make it worse.
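One way to model the "don't make a bad situation worse" rule is sketched below in Python. It is a simplified model under stated assumptions (it only counts copies and ignores things like trust levels), and move_may_drop_from_source is a hypothetical name, not the actual implementation.

```python
def move_may_drop_from_source(copies_before: int, dest_has_copy: bool,
                              numcopies: int) -> bool:
    """Decide whether move may drop the source copy after the transfer.

    copies_before counts every known copy before the move, including the
    source's copy and the destination's copy if it already has one.
    """
    # Once the transfer is done the destination holds a copy either way.
    # If it already had one, dropping from the source loses a copy;
    # otherwise the total is unchanged.
    copies_after = copies_before - 1 if dest_has_copy else copies_before
    # Fine if numcopies is still satisfied, or if things got no worse.
    return copies_after >= numcopies or copies_after >= copies_before


# Only one copy exists and numcopies is 2: the move is allowed.
single_copy = move_may_drop_from_source(1, dest_has_copy=False, numcopies=2)
# numcopies is 2 and both source and destination have a copy: no drop.
both_have = move_may_drop_from_source(2, dest_has_copy=True, numcopies=2)
# A third copy exists elsewhere, so dropping still leaves 2 copies.
spare_copy = move_may_drop_from_source(3, dest_has_copy=True, numcopies=2)
```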

Implemented that today. While doing so I got bit by the inverted Ord instance for TrustLevel, so spent a while cleaning that up.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Fri Apr 13 19:24:11 2018

New version released today with adb special remote, http connection caching, improved progress displays, annex.retry, and other changes.

I've been rethinking git annex move in the context of numcopies checking. Thanks to a user posting git-annex move does not appear to respect numcopies. Of course, move is known not to do that, but it's useful to get a perspective that this is surprising behavior and not wanted by that user, and poorly documented besides.

So, I added git annex move --safe which does honor numcopies, so it only does a copy when there are not enough copies to move.

I'm leaning toward making that the default behavior, and needing git annex move --unsafe to get the current behavior of moving without a net. Of course, lots of us probably use move and like the current behavior, and such a change can break workflows and scripts. There might be a transition period where move warns when run without --safe or --unsafe. Feedback welcomed on the bug report move violates numcopies.

Posted Tue Apr 10 00:16:08 2018

To make git-annex faster when it's dealing with a lot of urls, I decided to make it use the http-conduit library for all url access by default. That way, http pipelining will speed up repeated requests to the same web servers. This is kind of a follow-up to the recent elimination of rsync.

Some users rely on some annex.web-options or a .netrc file to configure how git-annex downloads urls. To keep that supported, when annex.web-options is set, git-annex will use curl. To use a .netrc file, curl needs an option, so you would configure:

git config annex.web-options --netrc

I get the feeling that nobody has implemented resuming interrupted downloads of files using http-conduit before, because it was unexpectedly kind of hard and http-types lacks support for some of the necessary range-related HTTP stuff.
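The range-related part of resuming a download can be sketched without any HTTP library. The Python below only shows the decision logic around the Range header and the response status; the function names are hypothetical and this is not how git-annex structures it.

```python
def resume_headers(bytes_on_disk: int) -> dict:
    """Request headers for resuming after bytes_on_disk bytes were already received."""
    if bytes_on_disk <= 0:
        return {}
    # An open-ended byte range: everything from this offset to the end.
    return {"Range": f"bytes={bytes_on_disk}-"}


def plan_resume(status: int, bytes_on_disk: int) -> str:
    """Decide what to do with the response body, based on the status code."""
    if status == 206 and bytes_on_disk > 0:
        return "append"   # Partial Content: the server honored the range
    if status == 200:
        return "restart"  # range ignored: the body is the whole file again
    raise ValueError(f"cannot resume with status {status}")


headers = resume_headers(1024)
honored = plan_resume(206, 1024)
ignored = plan_resume(200, 1024)
```

The "restart" case matters: a server is free to ignore the Range header, and appending a full body to a partial file would corrupt it.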

Today's work was supported by the NSF-funded DataLad project.

Stewart V. Wright announced recastex, a program that publishes podcasts and other files from git-annex to your phone.

Posted Fri Apr 6 21:39:03 2018

I've been traveling and at conferences.

In the meantime, Lykos has released git-annex-remote-googledrive, a replacement for an older, unmaintained Google Drive special remote.

Today I added a special remote that stores files on an Android device using adb. It supports git annex export, so the files stored on the Android device can have the same filenames as in the git-annex repository. I have plans for making git annex import support special remotes, and other features to make bi-directional sync with Android work well.

Of course, there is some overlap between that and the Android port, but they probably serve different use cases.

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Tue Mar 27 20:31:33 2018

In the past 24 hours, I've fixed two extremely hairy problems with git annex get -J. One was a locking problem. And the other involved thundering herds and ssh connection multiplexing and inherited file descriptors and races, and ... That took 4 hours of investigation to understand well enough to fix it.

Neither of those involved the ssh P2P changes, other than perhaps they exposed one of the issues more than it was exposed before, but on the plus side I've been testing that new code quite a lot as I worked on them..

Today's work was supported by the NSF-funded DataLad project.

Posted Thu Mar 15 20:17:12 2018

With fresh eyes I stopped being confused by P2P protocol free monad stuff, and got annex.verify=false supported when it's safe to skip verification.

And, I found some cases where resuming a download with annex.verify=false could let corrupt data into the repository. This is not a new problem; as well as with the P2P protocol, it could happen when downloading from the web, and possibly with some external special remotes that support resuming. So, it seemed best to override annex.verify configuration when resuming a download.

Also fixed up some progress bar stuff related to the P2P protocol. Including dealing with the case where the size of a key being downloaded is not known until the peer starts sending its data. The progress bar will now be updated with the size from the P2P protocol, so it can display a percentage even in this case.

I hope that's the end of the P2P protocol stuff for now.

Today's work was supported by the NSF-funded DataLad project.

Posted Tue Mar 13 20:23:55 2018

Working on getting the git-annex-shell P2P protocol into a releasable state. This was kind of annoying.

I started out wanting to make annex.verify=false disable verification when using the P2P protocol. But, that needed protocol changes, and unfortunately the protocol was not extensible. I thought it was supposed to reject unknown commands and keep the connection open, which would make extensions easy, but unfortunately it actually closed the connection after an unknown command.

So, I added a version negotiation to the P2P protocol, but it's not done for tor remotes yet, and will be turned on for them in some future flag day, once all of them get upgraded.

After all that, I got completely stuck on the annex.verify change. Multiple problems are preventing me from seeing a way to do it at all; see support disabling verification of transfer over p2p protocol. This must be why I didn't support it in the first place when building the P2P protocol two years ago.

Also fixed performance when an ssh remote is unavailable, where it was trying to connect twice to the remote for each action. And confirmed that the assistant will behave ok when moving between networks while it has P2P connections open. So, other than annex.verify not being supported, I feel fairly ready to release this new feature.

Today's work was supported by an anonymous bitcoin donor.

Posted Mon Mar 12 21:33:48 2018

Andrew Wringler has released git-annex-turtle which provides Apple Finder integration for git-annex on macOS, including custom badge icons, contextual menus and a Menubar icon. This looks really nice!

I've completed the P2P protocol with git-annex-shell. It turned out just as fast and good as I'd hoped. accellerate ssh remotes with git-annex-shell mass protocol has the benchmark details.

Even transferring of large files speeds up somewhat; git-annex is actually faster than rsync at shoving bytes down a pipe. (Though rsync still wins in lots of other benchmarks I'm sure.)

Surprisingly, in one benchmark, I found accessing a repository on localhost via ssh is now slightly faster than accessing that same repository by path. I think that this is because when git-annex is talking to git-annex-shell, the programs run on different CPU cores, so there's some extra concurrency.

There are still some implementation todos, some of which will make it faster yet, and others involving potential edge cases. This is a big change and will need some time to be considered stable.

Today's work was sponsored by Jake Vosloo on Patreon

Posted Fri Mar 9 18:23:45 2018

Spent most of the day laying groundwork for using git-annex-shell p2pstdio. Implemented pools of ssh connections to it, and added uuid verification. Then generalized code from the p2p remote so it can be reused in the git remote. The types got super hairy in there, but the code reuse level is excellent.

Finally it was time to convert the first ssh remote method to use the P2P protocol. I chose key removal, since benchmarking it doesn't involve the size of annexed objects.

Here's the P2P protocol in action over ssh:

[2018-03-08 17:02:47.688627136] chat: ssh ["localhost","-S",".git/annex/ssh/localhost","-o","ControlMaster=auto","-o","ControlPersist=yes","-T","git-annex-shell 'p2pstdio' '/~/tmp/bench/a' '--debug' 'da72c285-2615-4a67-828f-eaae4f42fc3d' --uuid db017fac-eb8f-42d9-9d09-2780b193cef1"]
[2018-03-08 17:02:47.901897195] P2P < AUTH-SUCCESS db017fac-eb8f-42d9-9d09-2780b193cef1
[2018-03-08 17:02:47.902025504] P2P > REMOVE SHA256E-s4--97b912eb4a61df5f806ca6239dde3e1a4f51ad20aced1642cbb83dc510a5fa6b
[2018-03-08 17:02:47.910074003] P2P < SUCCESS
[2018-03-08 17:02:47.914181701] P2P > REMOVE SHA256E-s4--6af2f5b785a8930f0bd3edc833e18fa191167ab0535ef359b19a1982a6984e96
[2018-03-08 17:02:47.918699806] P2P < SUCCESS

For a benchmark, I set up a repository with 1000 annexed files, and cloned it from localhost, then ran git annex drop --from origin.

before: 41 seconds
after: 10 seconds

A roughly 4x speedup for dropping is pretty great.. And when there's more latency than loopback has, the improvement should be more pronounced. Will test it this evening over my satellite internet. :)
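For reference, the ratio from the two timings above works out as follows (plain arithmetic on the 41 and 10 second figures, nothing more):

```python
before_seconds = 41
after_seconds = 10

speedup = before_seconds / after_seconds                          # how many times faster
time_saved = (before_seconds - after_seconds) / before_seconds    # fraction of time removed

print(f"{speedup:.1f}x faster, {time_saved:.0%} less time")
# prints "4.1x faster, 76% less time"
```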

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Thu Mar 8 21:05:20 2018

It was rather easy to implement git-annex-shell p2pstdio, the P2P protocol implementation done for tor is quite generic and easy to adapt for this.

The only complication was that git-annex-shell has a readonly mode, so the protocol server needed modifications to support that. Well, there's also some inefficiency around unnecessary verification of transferred content in some cases, which will probably need extensions to the P2P protocol later.

Also wrote up some documentation of what the P2P protocol looks like, for anyone who might want to communicate with git-annex-shell using it, for some reason, and doesn't understand Haskell and free monads. See P2P protocol.

While comparing the code of the P2P server and git-annex-shell commands, I noticed that the P2P server didn't check inAnnex when locking content, while git-annex-shell lockcontent did check inAnnex. This turned out to be an ugly data loss bug involving direct mode repositories, where a modified direct mode file was allowed when locking a key in the repository. Turned out that the bug happened when locking a file over tor to drop it locally, but also when locking a file locally in a direct mode repository to allow dropping it from any remote.

Very glad I noticed that, and I've changed the API to prevent that class of bug. I feel this is not a severe data loss bug, because when a direct mode repository is involved, dropping from somewhere and then modifying the file in the direct mode repository can have the same effect of losing the old copy. The bug just made data loss happen when running the same operations in the other order.

Next will be making git-annex use this new git-annex-shell feature when available.

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Wed Mar 7 19:41:44 2018

I'm excited by this new design: accellerate ssh remotes with git-annex-shell mass protocol.

git-annex's use of rsync got transfers over ssh working quickly early on, but other than resuming interrupted transfers, using rsync doesn't really gain git-annex much, since annexed objects don't change over time. And rsync has always involved a certian amount of overhead that a custom protocol would avoid.

It's especially handy that such a protocol was already developed for git-annex p2p when using tor. I've not heard of a lot of people using that feature (but maybe people who do have reason not to talk about it), but it's a good solid thing, implemented very generically with a free monad, and reusing it for git-annex-shell would be great.

Posted Tue Mar 6 19:34:17 2018

I've been recovering from some stuff over the past month, so progress lately has been slow, but still steadily progressing. Yesterday's release of git-annex had a month and a half of improvements, including JSON enhancements, adding extension support to the external special remote protocol, and making fsck warn about required content that's missing.

Today I've been working on git annex export to rsync special remotes.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Feb 28 18:39:19 2018

The external special remote protocol had extensibility built into it for messages git-annex sends, but not for messages that the remote sends back to git-annex. To fix this asymmetry, I've added a new EXTENSIONS message to the protocol, which can be used to find out about what new protocol extensions are supported.

There was the possibility that adding that might break some external special remote that hardcoded the initial protocol messages. So, I checked all of them that I know of, and all were ok, except for older versions of datalad, which we were able to deal with. If you have your own external special remote implementation, now would be a good time to check it.

Posted Wed Feb 7 20:24:29 2018

git-annex does a little bit of work at startup to learn about the git repository it's running in. That's been optimised some before, but not entirely eliminated; it's just too useful to have that information always available inside git-annex. But it turned out that it was doing more work than needed for many commands, by checking the git config of local remotes. That caused unnecessary spin up of removable drives, or automount timeouts, or generally more work than needed when running commands like git annex find and even tab completing git-annex. That's fixed now, so it avoids checking the git config of remotes except when running commands that access some remote.

There's also a new config setting, remote.<name>.annex-checkuuid that can be set to false to defer checking the uuid of local repositories until git-annex actually uses them. That can avoid even more spinup/automounts, but that config prevents git-annex from transparently handling the case where different removable drives get mounted to the same place at different times.
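
As a rough sketch of what setting this looks like: the remote name usbdrive below is made up, and the scratch repository is only there so the snippet can be run anywhere. In a real repository only the git config line would be needed.

```sh
# Scratch repository so the example is safe to run anywhere.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# "usbdrive" is a hypothetical remote name; substitute your own.
# Defer the uuid check until git-annex actually uses the remote.
git config remote.usbdrive.annex-checkuuid false

# Read back the value that was stored.
checkuuid=$(git config --get remote.usbdrive.annex-checkuuid)
echo "annex-checkuuid: $checkuuid"
```

As noted above, this only makes sense for a mount point that always holds the same drive.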

Speaking of speed, I benchmarked linux kernel mitigation for the meltdown attack making git status 5% slower from a warm cache. It did not slow down git annex find or git annex find --in remote enough to be measured by my benchmark. I expect that git-annex commands that transfer data are bottlenecked on IO and won't be slowed down appreciably by the meltdown mitigation either.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Jan 10 19:00:22 2018

I noticed a large drop in bug reports and comments on the git-annex website over the holiday period of December. At first I thought this was just due to the holidays, even though holidays are often busy times for free software projects since lots of people have more time. But, traffic is still down this week, and several people emailed me about problems logging into the website.

So, lacking much detail at all about what people were doing that didn't work, I've spent the past day and some trying to guess at and reproduce the problem. And I think I have, and once reproduced it was of course easily fixed.

If you tried to post something and got a login prompt instead of seeing it on the website, now would be a good time to post it again.

If you still have login problems with the website (other than openid which has a lot of broken providers and a badly specified protocol and stuff), please get in touch and try to provide enough detail to reproduce the problem, cuz my guessing muscles are feeling sprained after this experience.

In the meantime, there has still been git-annex development happening. I added a new git annex inprogress command over the holidays that allows doing things like streaming videos while git annex get is still downloading them. Several problems with the switch to youtube-dl are fixed, core.sharedRepository is handled better, and the cabal file's custom-setup stanza was added back after quite a lot of refactoring of library code.

Today's work was sponsored by an anonymous bitcoin donor.

Posted Fri Jan 5 17:20:49 2018

Finished up youtube-dl integration today, including all the edge cases in addurl and honoring annex.diskreserve.

I changed my mind about git annex addurl --relaxed; it seems better for it to be slower than before, but not have surprising behavior, than to be fast but potentially surprising. If it's too slow, add --raw to avoid using youtube-dl.

Posted Thu Nov 30 21:08:52 2017

It's mostly working now. Still need to fix --fast and --relaxed, and avoid youtube-dl running out of the annex.diskreserve.

The first hour or two was spent adding support for per-key temp directories. youtube-dl is run inside such a directory, to let it write whatever files it needs. Like the per-key temp files, these temp directories are not cleaned up when a download fails or is interrupted, so resuming can pick up where it left off. Taught git annex dropunused and everything else that cleans up per-key temp files to also clean up the temp directories.

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Wed Nov 29 21:37:23 2017

Working on switch from quvi to youtube-dl, because quvi is not being maintained and youtube-dl can download a lot more stuff.

Unfortunately, youtube-dl's interface is not a good fit for git-annex, compared with quvi's interface which was a near-perfect fit. Two things git-annex relied on quvi for are a way to check if a url has embedded media without downloading the url, and a way to get the url from which the embedded media can be downloaded. Youtube-dl supports neither. Also it has some other warts that make it unnecessarily hard to interface with, like not always storing the download in the location specified by --output, and sometimes crashing when downloading non-media urls (eg over my satellite internet).

I've found ways to avoid all these problems. For example, to make git annex addurl avoid the unnecessary overhead of running youtube-dl in the common case of downloading some non-web-page file, I'll have it download the url content, and check if it looks like an html page. Only then will it use youtube-dl. So addurl of html pages without embedded media will get slower, but addurl of everything else will be as fast as before.
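
To make the idea concrete, here is a minimal shell sketch of that decision. The heuristic (looking for an html tag or doctype near the start of the downloaded content) and the function names are assumptions for illustration only; this post does not say how git-annex actually decides that content looks like html.

```sh
# Hypothetical heuristic: treat a file as an html page if its first
# 4 KiB contain an <html tag or an html doctype.
looks_like_html() {
    head -c 4096 "$1" | grep -qiE '<html|<!doctype html'
}

# Decide what to do with content that has already been downloaded.
choose_downloader() {
    if looks_like_html "$1"; then
        echo "youtube-dl"   # may have embedded media, so hand the url over
    else
        echo "plain"        # ordinary file, keep what was downloaded
    fi
}

tmp=$(mktemp -d)
printf '<!DOCTYPE html><html><body>video here</body></html>\n' > "$tmp/page"
printf 'just some data\n' > "$tmp/data"

page_choice=$(choose_downloader "$tmp/page")
data_choice=$(choose_downloader "$tmp/data")
echo "page: $page_choice, data: $data_choice"
```

The point of ordering things this way is that the cheap check runs on content that had to be downloaded anyway, so only html pages pay for a youtube-dl run.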

But there's an unavoidable change to addurl --relaxed. It will not check for embedded media any more, because that would make it a lot slower, since it would have to hit the network. addurl --fast will have to be used for such urls instead. I hope this behavior change won't affect workflows badly.

Today was all coding groundwork, and I just got to the point that I'm ready to have it run youtube-dl. Hope to finish it tomorrow.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Tue Nov 28 21:44:29 2017

Good grief, I've spent another whole day on Windows porting issues.

Tested the new windows builds. Got win32 patched with terminateProcessId, and got the windows build using that. Unfortunately this made stack not incrementally rebuild very well, so Windows builds got very slow. (Bug report)

Found and fixed an ugly bug with annex link generation on Windows, which probably dates back to this spring. Code was comparing "C:\" with "C:/" and thinking the drives were different, argh. Still waiting on sloooow builds to test that.
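
For illustration, this class of bug goes away when both paths are normalised before they are compared. The sketch below is shell rather than the Haskell git-annex is written in, and the helper names are made up; it is not a description of the actual fix.

```sh
# Convert backslashes to forward slashes so both spellings compare equal.
normalize_path() {
    printf '%s\n' "$1" | tr '\\' '/'
}

# Succeeds when two paths are the same after normalisation.
same_path() {
    [ "$(normalize_path "$1")" = "$(normalize_path "$2")" ]
}

if same_path 'C:\' 'C:/'; then
    result=same
else
    result=different
fi
printf 'drive comparison: %s\n' "$result"
```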

There will probably be an almost Windows-dedicated release tomorrow. Only "almost" because Sean T Parsons sent in some patches.

Posted Wed Oct 25 23:47:22 2017

Got the Windows build fixed, with help from Yury. The toolchain had been broken for months. We switched to using stack, which should make the Windows build more reproducible and easy to manage.

Unfortunately, there was a link problem, and I had to disable some FFI code that was needed to terminate processes on Windows. Until that gets fixed, restarting and stopping the assistant won't work right on Windows.

Aaand: The EvilLinker is not needed any longer, so I was very happy to be able to delete that hack. \o/

Posted Wed Oct 25 04:23:53 2017

There's been a lot of little bug fixes and improvements going on in the ... oops ... almost a month since I last updated the devblog. Including a release of git-annex on the 3rd, and another release that's almost ready to go now. Just have not had the energy to blog about it all.

Anyway, today I spent way too long fixing a minor wart. When multiple annexed files have the same content, transferring them with concurrency enabled could make it complain that "transfer already in progress". Which is better than transferring the same content twice, but it did make there seem to be a failure.

I implemented two and a half different fixes for that. The first half a fix was too intrusive and I couldn't get it to work. Then came a fix that avoided the problem pretty cleanly, except it actually led to worse behavior, because it would sometimes transfer the same content twice, and needed non-obvious tweaks here and there to prevent that. Finally, around an hour ago, having actually given up unhappily for the day, I realized a much better way to fix it, that was minimally intrusive and works perfectly.

So it goes.. I'd say "concurrency is hard", but it's more that big complex code bases can make things that seem simple not really that simple. Yesterday I had a much easier time fixing a related problem with git annex add -J, which was really a lot hairier (involving a race condition and a lack of atomicity), but didn't cut across the code base in the same broad way.

Today's work was supported by the NSF-funded DataLad project.

Posted Tue Oct 17 23:07:54 2017

Got the git-annex assistant updating exports. The assistant is pretty complicated, so that took most of the day.

Exports are done!

Posted Wed Sep 20 19:36:48 2017

Built a way to make an export track changes to a branch.

git annex export --tracking master --to myexport

That ties in nicely with git annex sync:

joey@darkstar:~/tmp/bench/a> echo hello > foo
joey@darkstar:~/tmp/bench/a> git annex add
add foo ok
joey@darkstar:~/tmp/bench/a> git annex sync --content
[master 8edbc6f] git-annex in joey@darkstar:~/tmp/bench/a
 1 file changed, 1 insertion(+)
 create mode 120000 foo
export myexport foo 
joey@darkstar:~/tmp/bench/a> git mv foo bar
joey@darkstar:~/tmp/bench/a> git annex sync --content
[master 3ab6e73] git-annex in joey@darkstar:~/tmp/bench/a
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename foo => bar (100%)
rename myexport foo -> .git-annex-tmp-content-SHA256E-s6--5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03 ok
rename myexport .git-annex-tmp-content-SHA256E-s6--5891b5b522d5df086d0ff0b110fbd9d21bb4fc7163af34d08286a2e846f6be03 -> bar ok

Posted Tue Sep 19 20:24:19 2017

The tricky part of the git annex export feature has definitely been making it work in a distributed situation. The last details of that seem to have been worked out now.

I had to remove support for dropping individual files from export remotes. The design has a scenario where that makes distributed use of exports inconsistent.

But, what is working now is git annex export being run in one repository, and then another repository, after syncing, can get files from the export.

Most of export is done now. The only thing I'm thinking about adding is a way to make an export track a branch, so git annex sync can update the export.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Sep 18 23:23:06 2017

After doing some fine-tuning of webdav export on Wednesday, I noticed a problem: There seems to be no way in the webdav spec to delete a collection (directory) only when it's empty like rmdir(2) does. It would be possible to check the contents of the collection before deleting it, but that's complex (involving XML parsing) and race-prone.

So, I decided to add a remote method to delete a directory, and make git-annex keep track of when a directory in an export is empty, and delete it. While it does complicate the design some to need to do this, that seems better than complicating the implementation of remotes like webdav. And some remotes may not have a rmdir(2) equivalent or a way to check if a directory is empty.

Spent most of today implementing that, including some rather hairy eSQL to maintain a table of exported directories.

Still not quite done with export..

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Fri Sep 15 19:54:30 2017

Got git annex export working to webdav and rsync special remotes. Tested exporting to the Internet Archive via S3, and to via webdav. Both had little weirdnesses in their handling of the protocols, which were worked around, and it's quite nice to be able to export trees to those services, as well as Amazon S3.

Also added connection caching for exports, so S3 and webdav exports only make one http connection, instead of one per file.

Had to change the format of git-annex:export.log; the old format didn't take into account that a repository can export to several different remotes.

Today's work was supported by the NSF-funded DataLad project.

Posted Tue Sep 12 22:35:14 2017

Got git annex export working to external special remotes. Each external special remote will need some modifications to allow exporting. Exporting to some things doesn't make sense, but often there's a way to browse a tree of files stored on the special remote and so export is worth supporting. Now would be a good time to contact the author of your favorite special remote about supporting export..

Also had time to get git annex export working to S3. The tip publishing your files to the public had a clumsy method for publishing files via S3 before, and is now quite simple!

Today's work was supported by the NSF-funded DataLad project.

Posted Fri Sep 8 20:29:27 2017

I've merged the export branch, after fixing most of the remaining known warts, and testing clean-up from interrupted exports and export conflicts.

The main thing remaining to be done is adding the new commands to the external special remote interface, and adding export support to S3, webdav, and rsync special remotes.

Today's work was supported by the NSF-funded DataLad project.

Posted Thu Sep 7 20:42:33 2017

I knew that making git annex export handle renames efficiently would take a whole day somehow.

Indeed, thinking it over, it is a seriously super hairy thing. Renames can swap contents between two or more files, and so temp files are needed. It has to handle cleaning up temp files after interrupted exports, which may be resumed with the same or a different tree. It also has to recover from export conflicts, which could cause the wrong content to be renamed to a file.

I think I've thought through everything and found a way to deal with it all. Here's how it looks in operation swapping two files:

git annex export master --to dir
rename bar -> .git-annex-tmp-content-SHA256E-s30--472b01bf6234c98ce03d1386483ae578f6e58033974a1363da2606f9fa0e222a ok
rename foo -> .git-annex-tmp-content-SHA256E-s4--b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c ok
rename .git-annex-tmp-content-SHA256E-s4--b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c -> bar ok
rename .git-annex-tmp-content-SHA256E-s30--472b01bf6234c98ce03d1386483ae578f6e58033974a1363da2606f9fa0e222a -> foo ok
(recording state in git...)
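
The two-phase pattern visible in that output can be sketched with plain mv in a scratch directory. The temporary names here are simplified placeholders (the real ones are derived from the key, as shown above), and none of the cleanup or conflict recovery is modelled.

```sh
export_dir=$(mktemp -d)
cd "$export_dir"
printf 'old contents of foo\n' > foo
printf 'old contents of bar\n' > bar

# Phase 1: move each file to a temporary name tied to its content,
# so neither rename can overwrite the other file.
mv bar .tmp-content-of-bar
mv foo .tmp-content-of-foo

# Phase 2: move the temporary files to their final names.
mv .tmp-content-of-foo bar
mv .tmp-content-of-bar foo

foo_now=$(cat foo)
bar_now=$(cat bar)
printf 'foo: %s\nbar: %s\n' "$foo_now" "$bar_now"
```

Because no step ever renames over an existing file, stopping part way leaves every piece of content present under some name.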

The export todo list is only getting longer.. But the branch may be close to being merged.

Today's work was supported by the NSF-funded DataLad project.

Posted Wed Sep 6 21:22:34 2017

More work on git annex export. Made initremote exporttree=yes be required to enable exporting to a special remote. Added a sqlite database to keep track of what files have been exported. That let me fix the known problems with exporting multiple files that have the same content.

The same database lets git annex get (etc) download content from exports. Since an export is not a key/value store, git-annex has to do more verification of content downloaded from an export. Some types of keys, that are not based on checksums (eg WORM and URL), cannot be downloaded from an export. And, git-annex will never trust an export to retain the content of a key, since some other tree could be exported over it at any time.

With git annex get working from exports, it might be nice to also support git annex copy --to export for exporting specific files to them. However, that needs information that is not currently stored in the sqlite database until the export has already completed. One way it could work is for git annex export --fast treeish --to export to put all the filenames in the database but not export anything, and then git annex copy --to export (or even git annex sync --content to send the contents). I don't know if this complication is worth it.

Otherwise, the export feature is fairly close to being complete now. Still need to make renames be handled efficiently, and add support for exporting to more special remotes.

Today's work was supported by the NSF-funded DataLad project.

Posted Mon Sep 4 21:03:35 2017

Good progress on git annex export today. Changing the exported tree now works and is done efficiently. Resuming an export is working. Even detecting and resolving export conflicts should work (have not tested it). The necessary information about the export is recorded in the git-annex branch, including grafting in the exported tree there.

There are some known problems when the tree that is exported contains multiple files with the same content. And git-annex is not yet able to download exported files from a special remote. Handling both of those needs a way to get from keys to exported filenames. So, I plan to populate a sqlite database with that information next.

Posted Thu Aug 31 22:14:16 2017

Put together a prototype of git annex export in the "export" branch. Exporting to a directory special remote is basically working, but this is only the beginning.

Today's work was sponsored by Jake Vosloo on Patreon

Posted Tue Aug 29 21:27:50 2017

I'm back working on git-annex after a month away! It's good to be back, and I've made some decent improvements this week.

New features include a GIT_ANNEX_VECTOR_CLOCK environment variable that may be useful for those using git-annex in a HIPAA compliance setting, where timestamps are verboten (but verifying full compliance is up to you!), and configurations that let shell commands be run to vary what remotes are used depending on eg, what network it's on.

Also, I took a look at the external special remote protocol, and noticed two problems with it. First, keys with spaces in their names can't be used with it. This only affects the WORM backend, and it seems no one has ever run into the problem. Rather than complicate the implementation of external special remotes, I decided to deprecate having spaces in key names. Which is just asking for trouble anyway. So now there's a nice error message, and a migration path.

The other problem was that the external special remote documentation incorrectly said that a filename parameter never contained spaces. But in fact, there are situations where it can. This was not a problem with the protocol, but only with its documentation, and potentially with the implementation of some special remotes. So, I've spent some time today auditing every git-annex special remote that I know about. A few scripts that are bundled with git-annex were buggy and got fixed, and I filed bugs on 9 other external special remotes. A few did already get it right!

Today's work was sponsored by Trenton Cronholm on Patreon

Posted Thu Aug 17 21:13:34 2017

Have been working on a design for exporting trees to special remotes. As well as being handy for publishing scientific data sets out of git-annex repositories, that covers long-requested features like dumb, unsafe, human-readable backend.

I had not been optimistic about such requests, which seemed half-baked, but Yoh came up with the idea of exporting a git treeish, and remembering the last exported treeish so a subsequent export can be done incrementally, and can fully sync the exported tree.

Please take a look at the design if you've wanted to use git-annex for some sort of tree export before, and see if it meets your needs.

Posted Wed Jul 12 18:19:39 2017

A new version of optparse-applicative supports zsh and fish shell completions. Got that integrated into git-annex, although it will be a while until most builds are updated to that version of the library. Also, re-submitted my old patch from 2015 to make "git annex" always tab complete in bash.

Enough other small fixes and improvements have accumulated that a release is due soon..

Posted Fri Jun 9 21:01:07 2017

From the why-did-I-never-think-of-that-before department: git annex move --to=here implemented today. Useful if you want to move a file from whatever remotes might contain it to the local repository.

Posted Wed May 31 21:08:43 2017

Seems that the recent release of git 2.13.0 contained a regression that broke git-annex sync in an adjusted branch. After bisecting git, producing a minimal test case, and reporting that to the git developers, I was able to work around it in git-annex. The workaround is normally not expensive, but could be when a repository has thousands of unpacked refs. So I hope this will get fixed in git and I can remove the workaround.

I think I will hurry up the next git-annex release somewhat to get the workaround out. It's not a super bad bug, but it does make the test suite fail and I've already had 3 people report the problem.

Seems it would be good to have an integration test that runs git-annex's test suite against new commits to git. Ævar Arnfjörð Bjarmason has stepped up to add that to git's test suite.

Also, I dealt with some fallout from removing MissingH; an exponential speed blowup in a directory traversal function.

Getting back to the ssh password prompting with -J I was working on last week, dealt with the ssh prompt interfering with the regional display manager. The fix is not perfect, but good enough; before ssh prompts (and only if it prompts), git-annex temporarily clears the regional display. Then the display gets redrawn under the ssh output. That needed some changes to concurrent-output (which I did over the weekend), so will only be done when it's built with a new enough version. A better approach would be to save and restore the cursor position, but the ansi-terminal library does not yet support that.

Posted Tue May 16 19:52:10 2017

Wasn't planning to, but spent the day making git-annex not depend on the MissingH library. This has been a long-term goal, as MissingH pulls in several other libraries and is not modern or principled.

The first part was to use cryptonite for MD5 calculation. While converting to the form git-annex uses to make hash directories involved some math, this did make git-annex garbage-collect less, and probably made it faster.

Then I had to write my own progress meter display, since git-annex was using MissingH's display. That was fairly simple (73 LoC), and let me make it more efficient and tuned for the git-annex use case. As a bonus, it got progress displays when transferring files of unknown sizes, which wasn't done before.

MissingH was handy training wheels when I was coming over from perl, but it's been training wheels on some old cars in the middle of a 500 car train for a while, so glad that's over.

Posted Tue May 16 05:07:48 2017

After a month away building debug-me I'm back working on git-annex. I hope debug-me will be useful for debugging git-annex in some situations BTW.

Pushed a release yesterday that was mostly changes from back in March. It also updated the git bundled with git-annex to fix the recent git-shell security hole.

After work on Monday and today, I am caught up with all the recent month's backlog, but still have 230 old backlogged messages to get to.

The first consequential thing I got back to was improving ssh password prompting when git-annex is running concurrently with -J. It used to start up several ssh's at the same time, so connection caching didn't kick in, and there could be a bunch of ssh password prompts at the same time. Now there will never be more than one ssh password prompt at once, and only one prompt per host. (As long as connection caching is enabled.)

Posted Thu May 11 22:35:41 2017

Digging in to some of the meatier backlog today. Backlog down to 225.

A lot of fixes around using git annex enableremote to add new gpg keys to a gcrypt special remote.

Had to make git-annex's use of GIT_SSH/GIT_SSH_COMMAND contingent on GIT_ANNEX_USE_GIT_SSH=1 being set. Unfortunate, but a difference from git made at least one existing use of that environment variable break, and so it will need to be whitelisted in places where git-annex should use it.

Added support for git annex add --update

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Fri Apr 7 21:06:34 2017

Catching up on some recent backlog after a trip and post-trip flu.

Anarcat wrote up an analysis of semi-synchronized remotes, and based on that I implemented remote.<name>.annex-push and remote.<name>.annex-pull.

Also fixed the Windows build.

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Fri Apr 7 15:37:52 2017

Earlier this week I had the opportunity to sit in on a workshop at MIT where students were taught how to use git-annex as part of a stack of tools for reproducible scientific data research. That was great!

One thing we noticed there is, it can be hard to distribute files to such a class; downloading them individually wastes network bandwidth. Today, I added git annex multicast which uses uftp to multicast files to other clones of a repository on a LAN. An "easy" 500 lines of code and 7 hour job.

There is encryption and authentication, but the key management for this turned out to be simple, since the public key fingerprints can be stored on the git-annex branch, and easily synced around that way. So, I expect this should be not hard to use in a classroom setting such as the one I was in earlier this week.

Posted Thu Mar 30 23:39:07 2017

Preparing for a release tomorrow. Yury fixed the Windows autobuilder over the weekend. The OSX autobuilder was broken by my changes Friday, which turned out to have a simple bug that took quite a long time to chase down.

Also added git annex sync --content-of=path to sync the contents of files in a path, rather than in the whole work tree as --content does. I would have rather made this be --content=path but optparse-applicative does not support options that can be either boolean or have a string value. Really, I'd rather git annex sync path do it, but that would be ambiguous with the remote name parameter.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Mar 20 21:17:38 2017

Found a bug in git-annex-shell where verbose messages would sometimes make it output things git-annex didn't expect.

While fixing that, I wanted to add a test case, but the test suite actually does not test git-annex-shell at all. It would need to ssh, which test suites should not do. So, I took a detour..

Support for GIT_SSH and GIT_SSH_COMMAND has been requested before for various reasons. So I implemented that, which took 4 hours. (With one little possible compatibility caveat, since git-annex needs to pass the -n parameter to ssh sometimes, and git's interface doesn't allow for such a parameter.)

Now the test suite can use those environment variables to make mock ssh remotes be accessed using local sh instead of ssh.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Sat Mar 18 00:02:35 2017

The new annex.securehashesonly config setting prevents annexed content that does not use a cryptographically secure hash from being downloaded or otherwise added to a repository.

Using that and signed commits prevents SHA1 collisions from causing problems with annexed files. See using signed git commits for details about how to use it, and why I believe it makes git-annex safe despite git's vulnerability to SHA1 collisions in general.

If you are using git-annex to publish binary files in a repository, you should follow the instructions in using signed git commits.

If you're using git to publish binary files, you can improve the security of your repository by switching to git-annex and signed commits.

Today's work was sponsored by Riku Voipio.

Posted Mon Feb 27 20:12:00 2017

Yesterday I said that a git-annex repository using signed commits and SHA2 backend would be secure from SHA1 collision attacks. Then I noticed that there were two ways to embed the necessary collision generation data inside git-annex key names. I've fixed both of them today, and cannot find any other ways to embed collision generation data in between a signed commit and the annexed files.

I also have a design for a way to configure git-annex to expect to see only keys using secure hash backends, which will make it easier to work with repositories that want to use signed commits and SHA2. Planning to implement that tomorrow.

sha1 collision embedding in git-annex keys has the details.

Posted Sat Feb 25 00:06:43 2017

The first SHA1 collision was announced today, produced by an identical-prefix collision attack.

After looking into it all day, it does not appear to impact git's security immediately, except for targeted attacks against specific projects by very wealthy attackers. But we're well past the time when it seemed ok that git uses SHA1. If this gets improved into a chosen-prefix collision attack, git will start to be rather insecure.

Projects that store binary files in git, that might be worth $100k for an attacker to backdoor should be concerned by the SHA1 collisions. A good example of such a project is <git://>.

Using git-annex (with a suitable backend like SHA256) and signed commits together is a good way to secure such repositories.

Update 12:25 am: However, there are some ways to embed SHA1-colliding data in the names of git-annex keys. That makes git-annex with signed commits be no more secure than git with signed commits. I am working to fix git-annex to not use keys that have such problems.

Posted Thu Feb 23 20:44:24 2017

Today was all about writing the page "making a remote repo update when changes are pushed to it".

That's a fairly simple page, because I added workarounds for all the complexity of making it work in direct mode repos, adjusted branches, and repos on filesystems not supporting executable git hooks. Basically, the user should be able to set the standard receive.denyCurrentBranch=updateInstead configuration on a remote, and then git push or git annex sync should update that remote's working tree.
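
For the plain git case (no direct mode, no adjusted branch), the setup can be tried out with two scratch repositories. This is a sketch using only stock git commands; the paths and the file name are arbitrary, and it does not exercise any of the git-annex workarounds described above.

```sh
tmp=$(mktemp -d)

# The repository that will be pushed to, with a checked-out work tree.
git init -q "$tmp/remote"
git -C "$tmp/remote" config receive.denyCurrentBranch updateInstead
# Name of the branch checked out in the remote (master, main, ...).
branch=$(git -C "$tmp/remote" symbolic-ref --short HEAD)

# A second repository that makes a commit and pushes it.
git init -q "$tmp/local"
cd "$tmp/local"
git config user.name "Example User"
git config user.email "user@example.com"
echo hello > foo
git add foo
git commit -q -m "add foo"
git push -q "$tmp/remote" "HEAD:refs/heads/$branch"

# The push updated the remote's work tree, not just its branch.
remote_foo=$(cat "$tmp/remote/foo")
echo "remote work tree has foo containing: $remote_foo"
```

Without the config line, git refuses a push to the checked-out branch of a non-bare repository by default.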

There are a couple of unhandled cases; git push to a remote on a filesystem like FAT won't update it, and git annex sync will only update it if it's local, not accessed over ssh. Also, the emulation of git's updateInstead behavior is not perfect for direct mode repos and adjusted branches.

Still, it's good enough that most users should find it meets their needs, I hope. How to set this kind of thing up is a fairly common FAQ, and this makes it much simpler.

(Oh yeah, the first ancient kernel arm build is still running. May finish before tomorrow.)

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Fri Feb 17 19:56:56 2017

When you see a command like "ssh somehost rm -f file", you probably don't think that consumes stdin. After all, the rm -f doesn't. But, ssh can pass stdin over the network even if it's not being consumed, and it turns out git-annex was bitten by this.

That bug made git-annex-checkpresentkey --batch with a remote accessed over ssh not see all the batch-mode input that was passed into it, because ssh sometimes consumed some of it.

Shell scripts using git-annex could also be impacted by the bug, for example:

find . -type l -atime 100 | \
    while read file; do
        echo "gonna drop $file that has not been used in a while"
        git annex drop "$file"
    done

Depending on what remotes git annex drop talks to, it might consume parts of the output of find.

I've fixed this in git-annex now (using ssh -n when running commands that are not fed some stdin of their own), but this seems like a class of bug that could impact lots of programs that run ssh.
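
The effect is easy to reproduce without ssh at all. In this sketch, `greedy` is a made-up stand-in for any command that, like ssh without -n, reads whatever stdin it inherits. The function names and the three-line input are assumptions for illustration only.

```shell
# Stand-in for a command like "ssh somehost rm -f file": it does not need
# stdin, but it still drains whatever stdin it inherits.
greedy() { cat >/dev/null; }

# Buggy loop: greedy shares the loop's stdin and eats the remaining lines,
# so the loop body only runs for the first line.
buggy() {
    printf 'a\nb\nc\n' | while read -r line; do
        echo "processing $line"
        greedy
    done
}

# Fixed loop: greedy is given an empty stdin, similar in effect to ssh -n.
fixed() {
    printf 'a\nb\nc\n' | while read -r line; do
        echo "processing $line"
        greedy </dev/null
    done
}
```

Running `buggy` processes only the first line, while `fixed` processes all three. Redirecting from /dev/null is the general-purpose workaround for scripts that call programs which may read stdin.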

I've been thinking about simpler setup for remote worktree update on push.

One nice way to make a remote update its worktree on push is available in recent-ish gits, receive.denyCurrentBranch=updateInstead. That could already be used with git annex sync, but it hid any error messages when pushing the master branch to the remote (since that push fails with a large error message in default configurations). Found a way to make the error message be displayed when the remote's receive.denyCurrentBranch does not have the default configuration.

The remaining problem is that direct mode and adjusted branch remotes won't get their work trees updated even when configured that way. I am thinking about adding a post-update hook to support those.

Also continuing to bring up the ancient kernel arm autobuilder. It's running its first build now.

Today's work was sponsored by Riku Voipio.

Posted Wed Feb 15 20:44:28 2017

Last week I only had energy to work most of each day on git-annex, or to blog about it. I chose quiet work. The changelog did grow a good amount.

Today, fixed some autobuilder problems, and I am gearing up to add another autobuild, targeting arm boxes with older linux kernels, since I got a chance to upgrade the arm autobuilder's disk this weekend.

Also, some work on the S3 special remote, and worked around a bug in sqlite's handling of umask.

Backlog is down to 243 messages.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Mon Feb 13 21:42:21 2017

Finished the repository-clone-global configuration settings I started adding on Monday. Came up with a nice type-driven way to make sure that configuration is loaded when needed, and only loaded once. Then it was easy to make annex.autocommit be configurable by git-annex config. Also added a new annex.synccontent configuration, which can also be set by git-annex config.

Also resolved a tricky situation with providing an appid to magic wormhole. It will happen on a flag day in 2021. I've marked my calendar..

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Fri Feb 3 19:46:04 2017

Spent rather too long today tracking down a memory leak in git annex unused. Actually, it was three memory leaks; one of them was a reversion introduced while otherwise improving a function to not be partial. Another only happened in very rare circumstances. The third, which took several more hours staring at the code, turned out to simply be an unnecessary use of an accumulating list. Feel like I should have seen that one sooner, but then I am under the weather and was running profiles in a daze for several hours.. In the end, git-annex unused went from needing 1 gb of memory to 150 mb in my big repo.

One advantage to all the profiling though, was I noticed that the split function was allocating a lot of memory, and seemed generally inefficient. This has to do with it splitting on a string; splitting on a single character can run twice as fast and churn the GC quite a bit less, so I wrote up a specialized version of that, and it's used extensively in git-annex now, so it may run up to 50% faster in some cases. Seems like haskell libraries with a split function should perhaps use the more optimal version when splitting on a single character, and I'm going to file bugs to that effect.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Feb 1 00:02:26 2017

First day working on git-annex in over a month. I've been away preparing for and giving two talks at Linux Conf Australia and then recovering from conference flu, but am now raring to dive back into git-annex development!

The backlog stood at over 300 messages this morning, and is down to 274 now. So still lots of catching up to do. But nothing seems to have blown up badly in my absence. The antipatterns page was a nice development while I was away, listing some ways people sometimes find to shoot their feet. Read and responded to lots of questions, including one user who mentioned a scientific use case: "We are exploring use of git-annex to manage the large boundary conditions used within our weather model."

The main bit of coding today was adding a new git annex config command. This is fairly similar to git config, but it stores the settings in the git-annex branch, so they're visible in all clones of the repo (aka "global"). Not every setting will be configurable this way (that would be too expensive, and too foot-shooty), but starting with annex.autocommit I plan to enable selected settings that make sense to be able to set globally. If you've wanted to be able to configure some part of git-annex in all clones of a repository, suggestions are welcome in the todo item about this.

git annex vicfg can also be used to edit the global settings, and I also made it able to edit the global git annex numcopies setting which was omitted before. There's no real reason to have a separate git annex numcopies command now, since git annex config could configure global annex.numcopies.. but it's probably not worth changing that.

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Mon Jan 30 21:35:05 2017

The webapp's wormhole pairing almost worked perfectly on the first test. Turned out the remotedaemon was not noticing that the tor hidden service got enabled. After fixing that, it worked perfectly!

So, I've merged that feature, and removed XMPP support from the assistant at the same time. If all goes well, the autobuilds will be updated soon, and it'll be released in time for new year's.

Anyone who's been using XMPP to keep repositories in sync will need to either switch to Tor, or could add a remote on a ssh server to sync by instead. See for the pointy-clicky way to do it, and for the command-line way.

Posted Wed Dec 28 16:42:45 2016

Added the Magic Wormhole UI to the webapp for pairing Tor remotes. This replaces the XMPP pairing UI when using "Share with a friend" and "Share with your other devices" in the webapp.

I have not been able to fully test it yet, and it's part of the no-xmpp branch until I can.

It's been a while since I worked on the webapp. It was not as hard as I remembered to deal with Yesod. The inversion of control involved in coding for the web is as annoying as I remembered.

Today's work was sponsored by Riku Voipio.

Posted Tue Dec 27 21:18:44 2016

Have been working on some improvements to git annex enable-tor. Made it su to root, using any su-like program that's available. And made it test the hidden service it sets up, and wait until it's propagated to the Tor directory authorities. The webapp will need these features, so I thought I might as well add them at the command-line level.

Also some messing about with locale and encoding issues. About most of which the less said the better. One significant thing is that I've made the filesystem encoding be used for all IO by git-annex, rather than needing to explicitly enable it for each file and process. So, there should be much less bother with encoding problems going forward.

Posted Sat Dec 24 21:16:25 2016

git annex p2p --pair implemented, using Magic Wormhole codes that have to be exchanged between the repositories being paired.

It looks like this, with the same thing being done at the same time in the other repository.

joey@elephant:~/tmp/bench3/a>git annex p2p --pair
p2p pair peer1 (using Magic Wormhole) 

This repository's pairing code is: 1-select-bluebird

Enter the other repository's pairing code: (here I entered 8-fascinate-sawdust) 
Exchanging pairing data...
Successfully exchanged pairing data. Connecting to peer1...

And just that simply, the two repositories find one another, Tor onion addresses and authentication data are exchanged, and a git remote is set up connecting via Tor.

joey@elephant:~/tmp/bench3/a>git annex sync peer1
pull peer1 
warning: no common commits
remote: Counting objects: 5, done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 5 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (5/5), done.
From tor-annex::5vkpoyz723otbmzo.onion:61900
 * [new branch]      git-annex  -> peer1/git-annex

Very pleased with this, and also the whole thing worked on the very first try!

It might be slightly annoying to have to exchange two codes during pairing. It would be possible to make this work with only one code. I decided to go with two codes, even though it's only marginally more secure than one, mostly for UI reasons. The pairing interface and the instructions for using it are simplified by being symmetric.

(I also decided to revert the work I did on Friday to make p2p --link set up a bidirectional link. Better to keep --link the simplest possible primitive, and pairing makes bidirectional links more easily.)

Next: Some more testing of this and the Tor hidden services, a webapp UI for P2P peering, and then finally removing XMPP support. I hope to finish that by New Years.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Sun Dec 18 21:44:53 2016

Improved git annex p2p --link to create a bi-directional link automatically. Bi-directional links are desirable more often than not, so it's the default behavior.

Also continued thinking about using magic wormhole for communicating p2p addresses for pairing. And filed some more bugs on magic wormhole.

Posted Fri Dec 16 21:54:45 2016

Quite a backlog developed in the couple of weeks I was concentrating on tor support. I've taken a first pass through it and fixed the most pressing issues now.

Most important was an ugly memory corruption problem in the GHC runtime system that may have led to data corruption when using git-annex with Linux kernels older than 4.5. All the Linux standalone builds of git-annex have been updated to fix that issue.

Today dealt with several more things, including fixing a buggy timestamp issue with metadata --batch, reverting the ssh ServerAliveInterval setting (broke on too many systems with old ssh or complicated ssh configurations), making batch input not be rejected when it can't be decoded as UTF-8, and more.

Also, spent some time learning a little bit about Magic Wormhole and SPAKE, as a way to exchange tor remote addresses. Using Magic Wormhole for that seems like a reasonable plan. I did file a couple bugs on it which will need to get fixed, and then using it is mostly a question of whether it's easy enough to install that git-annex can rely on it.

Posted Tue Dec 13 19:49:18 2016

More improvements to tor support. Yesterday, debugged a reversion that broke push/pull over tor, and made actual useful error messages be displayed when there were problems. Also fixed a memory leak, although I fixed it by reorganizing code and could not figure out quite why it happened, other than that the ghc runtime was not managing to be as lazy as I would expect.

Today, added git ref change notification to the P2P protocol, and made the remotedaemon automatically fetch changes from tor remotes. So, it should work to use the assistant to keep repositories in sync over tor. I have not tried it yet, and linking over tor still needs to be done at the command line, so it's not really ready for webapp users yet.

Also fixed a denial of service attack in git-annex-shell and git-annex when talking to a remote git-annex-shell. It was possible to feed either one a large amount of data when it tried to read a line of data, and summon the OOM killer. Next release will be expedited some because of that.

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Fri Dec 9 20:46:53 2016

Git annex transfers over Tor worked correctly the first time I tried them today. I had been expecting protocol implementation bugs, so this was a nice surprise!

Of course there were some bugs to fix. I had forgotten to add UUID discovery to git annex p2p --link. And, resuming interrupted transfers was buggy.

Spent some time adding progress updates to the Tor remote. I was curious to see what speed transfers would run. Speed will of course vary depending on the Tor relays being used, but this example with a 100 mb file is not bad:

copy big4 (to peer1...) 
62%          1.5MB/s 24s

There are still a couple of known bugs, but I've merged the tor branch into master already.

Alpernebbi has built a GUI for editing git-annex metadata. Something I always wanted!
Read about it here

Today's work was sponsored by Ethan Aubin.

Posted Wed Dec 7 19:51:20 2016

Friday and today were spent implementing both sides of the P2P protocol for git-annex content transfers.

There were some tricky cases to deal with. For example, when a file is being sent from a direct mode repository, or v6 annex.thin repository, the content of the file can change as it's being transferred. Including being appended to or truncated. Had to find a way to deal with that, to avoid breaking the protocol by not sending the indicated number of bytes of data.

It all seems to be done now, but it's not been tested at all, and there are probably some bugs to find. (And progress info is not wired up yet.)

Today's work was sponsored by Trenton Cronholm on Patreon.

Posted Tue Dec 6 21:09:21 2016

Today I finished the second-to-last big missing piece for tor hidden service remotes. Networks of these remotes are P2P networks, and there needs to be a way for peers to find one-another, and to authenticate with one-another. The git annex p2p command sets up links between peers in such a network.

So far it has only a basic interface that sets up a one way link between two peers. In the first repository, run git annex p2p --gen-address. That outputs a long address. In the second repository, run git annex p2p --link peer1, and paste the address into it. That sets up a git remote named "peer1" that connects back to the first repository over tor.

That is a one-directional link, while a bi-directional link would be much more convenient to have between peers. Worse, the address can be reused by anyone who sees it, to link into the repository. And, the address is far too long to communicate in any way except for pasting it.

So I want to improve that later. What I'd really like to have is an interface that displays a one-time-use phrase of five to ten words, that can be read over the phone or across the room. Exchange phrases with a friend, and get your repositories securely linked together with tor.

But, git annex p2p is good enough for now. I can move on to the final keystone of the tor support, which is file transfer over tor. That should, fingers crossed, be relatively easy, and the tor branch is close to mergeable now.

Today's work was sponsored by Riku Voipio.

Posted Wed Nov 30 21:06:46 2016

Debian's tor daemon is very locked down in the directories it can read from, and so I've had a hard time finding a place to put the unix socket file for git-annex's tor hidden service. Painful details in At least for now, I'm putting it under /etc/tor/, which is probably a FHS violation, but seems to be the only option that doesn't involve a lot of added complexity.

The Windows autobuilder is moving, since NEST is shutting down the server it has been using. Yury Zaytsev has set up a new Windows autobuilder, hosted at Dartmouth College this time.

Posted Tue Nov 29 21:39:03 2016

The tor branch is coming along nicely.

This weekend, I continued working on the P2P protocol, implementing it for network sockets, and extending it to support connecting up git-send-pack/git-receive-pack.

There was a bit of a detour when I split the Free monad into two separate ones, one for Net operations and the other for Local filesystem operations.

This weekend's work was sponsored by Thomas Hochstein on Patreon.

Today, implemented a git-remote-tor-annex command that git will use for tor-annex:: urls, and made git annex remotedaemon serve the tor hidden service.

Now I have git push/pull working to the hidden service, for example:

git pull tor-annex::eeaytkuhaupbarfi.onion:47651

That works very well, but does not yet check that the user is authorized to use the repo, beyond knowing the onion address. And currently it only works in git-annex repos; with some tweaks it should also work in plain git repos.

Next, I need to teach git-annex how to access tor-annex remotes. And after that, an interface in the webapp for setting them up and connecting them together.

Today's work was sponsored by Josh Taylor on Patreon.

Posted Tue Nov 22 02:38:08 2016

For a Haskell programmer, any day where a big thing is implemented without the least scrap of code that touches the IO monad is a good day. And this was a good day for me!

Implemented the p2p protocol for tor hidden services. Its needs are somewhat similar to the external special remote protocol, but the two protocols are not fully overlapping with one-another. Rather than try to unify them, and so complicate both cases, I prefer to reuse as much code as possible between separate protocol implementations. The generating and parsing of messages is largely shared between them. I let the new p2p protocol otherwise develop in its own direction.

But, I do want to make this p2p protocol reusable for other types of p2p networks than tor hidden services. This was an opportunity to use the Free monad, which I'd never used before. It worked out great, letting me write monadic code to handle requests and responses in the protocol, that reads the content of files and resumes transfers and so on, all independent of any concrete implementation.

The whole implementation of the protocol only needed 74 lines of monadic code. It helped that I was able to factor out functions like this one, that is used both for handling a download, and by the remote when an upload is sent to it:

receiveContent :: Key -> Offset -> Len -> Proto Bool
receiveContent key offset len = do
        content <- receiveBytes len
        ok <- writeKeyFile key offset content
        sendMessage $ if ok then SUCCESS else FAILURE
        return ok

To get transcripts of the protocol in action, the Free monad can be evaluated purely, providing the other side of the conversation:

ghci> putStrLn $ protoDump $ runPure (put (fromJust $ file2key "WORM--foo")) [PUT_FROM (Offset 10), SUCCESS]
> PUT WORM--foo
> DATA 90
> bytes
result: True

ghci> putStrLn $ protoDump $ runPure (serve (toUUID "myuuid")) [GET (Offset 0) (fromJust $ file2key "WORM--foo")]
< GET 0 WORM--foo
> PROTO-ERROR must AUTH first
result: ()
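
The second transcript shows the server refusing a GET that arrives before AUTH. As a rough illustration of just that state check, here is a shell sketch. It is not the Haskell implementation and does not use a Free monad. The `serve` shell function, the token `secret`, and the `AUTH-FAILURE` and `accepted` replies are all hypothetical; only `AUTH-SUCCESS` and the `PROTO-ERROR must AUTH first` message appear in these posts.

```shell
# Hypothetical server loop: reads protocol lines on stdin, replies on stdout.
# Messages are assumed to look like "AUTH $uuid $token" and "GET $pos $key".
serve() {
    authed=no
    while read -r cmd arg1 arg2; do
        case "$cmd" in
            AUTH)
                # Placeholder check: a real server would look the token up.
                if [ "$arg2" = "secret" ]; then
                    authed=yes
                    echo "AUTH-SUCCESS myuuid"
                else
                    echo "AUTH-FAILURE"
                fi
                ;;
            GET|PUT)
                if [ "$authed" = yes ]; then
                    # A real server would go on to exchange DATA here.
                    echo "accepted $cmd"
                else
                    echo "PROTO-ERROR must AUTH first"
                fi
                ;;
            *)
                echo "PROTO-ERROR unknown command"
                ;;
        esac
    done
}
```

For example, `printf 'GET 0 WORM--foo\n' | serve` replies with the same error as the transcript, while sending a valid AUTH line first lets the GET through.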

Am very happy with all this pure code and that I'm finally using Free monads. Next I need to get down to the dirty business of wiring this up to actual IO actions, and an actual network connection.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Thu Nov 17 21:20:39 2016

Fixed one howler of a bug today. Turns out that git annex fsck --all --from remote didn't actually check the content of the remote, but checked the local repository. Only --all was buggy; git annex fsck --from remote was ok. Don't think this is crash priority enough to make a release for, since only --all is affected.

Somewhat uncomfortably made git annex sync pass --allow-unrelated-histories to git merge. While I do think that git's recent refusal to merge unrelated histories is good in general, the problem is that initializing a direct mode repository involves making an empty commit. So merging from a remote into such a direct mode repository means merging unrelated histories, while an indirect mode repository doesn't. Seems best to avoid such inconsistencies, and the only way I could see to do it is to always use --allow-unrelated-histories. May revisit this once direct mode is finally removed.

Using the git-annex arm standalone bundle on some WD NAS boxes used to work, and then it seems they changed their kernel to use a nonstandard page size, and broke it. This actually seems to be a bug in the gold linker, which defaults to an unnecessarily small page size on arm. The git-annex arm bundle is being adjusted to try to deal with this.

ghc 8 made error include some backtrace information. While it's really nice to have backtraces for unexpected exceptions in Haskell, it turns out that git-annex used error a lot with the intent of showing an error message to the user, and a backtrace clutters up such messages. So, bit the bullet and checked through every error in git-annex and made such ones not include a backtrace.

Also, I've been considering what protocol to use between git-annex nodes when communicating over tor. One way would be to make it very similar to git-annex-shell, using rsync etc, and possibly reusing code from git-annex-shell. However, it can take a while to make a connection across the tor network, and that method seems to need a new connection for each file transferred etc. Also thought about using a http based protocol. The servant library is great for that, you get both http client and server implementations almost for free. Resuming interrupted transfers might complicate it, and the hidden service side would need to listen on a unix socket, instead of the regular http port. It might be worth it to use http for tor, if it could be reused for git-annex http servers not on the tor network. But, then I'd have to make the http server support git pull and push over http in a way that's compatible with how git uses http, including authentication. Which is a whole nother ball of complexity. So, I'm leaning instead to using a simple custom protocol something like:

    > AUTH $localuuid $token
    < AUTH-SUCCESS $remoteuuid
    > SENDPACK $length
    > $gitdata
    < RECVPACK $length
    < $gitdata
    > GET $pos $key
    < DATA $length
    < $bytes
    > PUT $key
    < PUT-FROM $pos
    > DATA $length
    > $bytes

Today's work was sponsored by Riku Voipio.

Posted Wed Nov 16 20:18:30 2016

Have waited too long for some next-generation encrypted P2P network, like telehash to emerge. Time to stop waiting; tor hidden services are not as cutting edge, but should work. Updated the design and started implementation in the tor branch.

Unfortunately, Tor's default configuration does not enable the ControlPort. And, changing that in the configuration could be problematic. This makes it harder than it ought to be to register a tor hidden service. So, I implemented a git annex enable-tor command, which can be run as root to set it up. The webapp will probably use su-to-root or gksu to run it. There's some Linux-specific parts in there, and it uses a socket for communication between tor and the hidden service, which may cause problems for Windows porting later.

Next step will be to get git annex remotedaemon to run as a tor hidden service.

Also made a no-xmpp branch which removes xmpp support from the assistant. That will remove 3000 lines of code when it's merged. Will probably wait until after tor hidden services are working.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Nov 14 20:50:26 2016

Worked on several bug reports today, fixing some easy ones, and following up on others. And then there are the hard bugs.. Very pleased that I was able to eventually reproduce a bug based entirely on the information that git-annex's output did not include a filename. Didn't quite get that bug fixed though.

At the end of the day, got a bug report that git annex add of filenames containing spaces has broken. This is a recent reversion and I'm pushing out a release with a fix ASAP.

Posted Mon Oct 31 22:43:37 2016

Made a significant change today: Enabled automatic retrying of transfers that fail. It's only done if the previous try managed to advance the progress by some amount. The assistant has already had that retrying for many years, but now it will also be done when using git-annex at the command line.
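
The retry rule can be sketched in a few lines of shell. This is only an illustration of the "retry while progress advances" idea, not git-annex's actual code. The function `transfer_with_retry` is made up, and it assumes a hypothetical transfer command that prints the total number of bytes transferred so far and exits non-zero when the transfer fails.

```shell
# transfer_with_retry CMD [ARGS...]
# CMD is assumed to print the total bytes transferred so far and to exit
# non-zero on failure. A failed attempt is retried only if it got further
# than the attempt before it.
transfer_with_retry() {
    last=0
    while :; do
        if now=$("$@"); then
            echo "done after $now bytes"
            return 0
        fi
        if [ "$now" -le "$last" ]; then
            # No progress since the previous try, so stop retrying.
            echo "giving up at $now bytes"
            return 1
        fi
        last=$now
    done
}
```

Tying retries to progress means a transfer that keeps stalling at the same point is not retried forever, while a flaky connection that gets a bit further each time still completes.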

One good reason for a transfer to fail and need a retry is when the network connection stalls. You'd think that TCP keepalives would detect this kind of thing and kill the connection but I've had enough complaints, that I suppose that doesn't always work or gets disabled. Ssh has a ServerAliveInterval that detects such stalls nicely for the kind of batch transfers git-annex uses ssh for, but it's not enabled by default. So I found a way to make git-annex enable it, while still letting ~/.ssh/config settings override that.

Also got back to analyzing an old bug report about proliferating ".nfs*.lock" files when using git-annex on nfs; this was caused by the wacky NFS behavior of renaming deleted files, and I found a change to the ssh connection caching cleanup code that should avoid the problem.

Posted Wed Oct 26 20:48:52 2016

Several bug fixes involving v6 unlocked files today. Several related bugs were caused by relying on the inode cache information, without a fallback to handle the case where the inode cache had not gotten updated. While the inode cache is generally kept up-to-date well by the smudge/clean filtering, it is just a cache and can be out of date. Did some auditing for such problems and hopefully I've managed to find them all.

Also, there was a tricky upgrade case where a v5 repository contained a v6 unlocked file, and the annexed content got copied into it. This triggered the above-described bugs, and in this case the worktree needs to be updated on upgrade, to replace the pointer file with the content.

As I caught up with recent activity, it was nice to see some contributions from others. James MacMahon sent in a patch to improve the filenames generated by importfeed. And, xloem is writing workflow documentation for git-annex in Workflow guide.

Posted Mon Oct 17 20:50:11 2016

Finished up where I left off yesterday, writing test cases and fixing bugs with syncing in adjusted branches. While adjusted branches need v6 mode, and v6 mode is still considered experimental, this is still a rather nasty bug, since it can make files go missing (though still available in git history of course). So, planning to release a new version with these fixes as soon as the autobuilders build it.

Posted Tue Oct 11 20:03:25 2016

Over a month ago, I had some reports that syncing into adjusted branches was losing some files that had been committed. I couldn't reproduce it, but IIRC both felix and tbm reported problems in this area. And, felix kindly sent me enough of his git repo to hopefully reproduce the problem.

Finally got back to that today. Luckily, I was able to reproduce the bug using felix's repo. The bug only occurs when there's a change deep in a tree of an adjusted branch, and not always then. After staring at it for a couple of hours, I finally found the problem; a modification flag was not getting propagated in this case, and some changes made deep in the tree were not getting included into parent trees.

So, I think I've fixed it, but need to look at it some more to be sure, and develop a test case. And fixing that exposed another bug in the same code. Gotta run unfortunately, so will finish this tomorrow..

Today's work was sponsored by Riku Voipio.

Posted Mon Oct 10 19:05:19 2016

Several bug fixes today and got caught up on most recent messages. Backlog is 157.

The most significant one prevents git-annex from reading in the whole content of a large git object when it wants to check if it's an annex symlink. In several situations where large files were committed to git, or staged, git-annex could do a lot of work, and use a lot of memory and maybe crash. Fixed by checking the size of an object before asking git cat-file for its content.
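
The idea of checking the size first can be shown with plain git commands. This is a sketch of the general technique, not git-annex's code. The helper names and the size limit passed to them are assumptions for illustration; `git cat-file -s` is the standard way to ask git for an object's size without reading its content.

```shell
# small_enough OBJECT LIMIT
# Succeeds only if the object is at most LIMIT bytes. git cat-file -s
# reports the size without outputting the object's content.
small_enough() {
    size=$(git cat-file -s "$1") || return 1
    [ "$size" -le "$2" ]
}

# read_if_small OBJECT LIMIT
# Prints the object's content only when it is small enough to plausibly
# be a symlink target. LIMIT is an illustrative assumption.
read_if_small() {
    if small_enough "$1" "$2"; then
        git cat-file -p "$1"
    else
        return 1
    fi
}
```

Symlink targets are short, so anything far larger than a path can be skipped without loading it into memory.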

Also a couple of improvements around versions and upgrading. IIRC git-annex used to only support one repository version at a time, but this was changed to support V6 as an optional upgrade from V5, and so the supported versions became a list. Since V3 repositories are identical to V5 other than the version, I added it to the supported version list, and any V3 repos out there can be used without upgrading. Particularly useful if they're on read-only media.

And, there was a bug in the automatic upgrading of a remote that caused it to be upgraded all the way to V6. Now it will only be upgraded to V5.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Wed Oct 5 21:11:07 2016

Realized recently that despite all the nice concurrency support in git-annex, external special remotes were limited to handling one request at a time.

While the external special remote protocol could almost support concurrent requests, that would complicate implementing them, and probably need a version flag to enable to avoid breaking existing ones.

Instead, made git-annex start up multiple external special remote processes as needed to handle concurrency.

Today's work was sponsored by Josh Taylor on Patreon.

Posted Fri Sep 30 23:52:30 2016

Did most of the optimisations that recent profiling suggested. This sped up a git annex find from 3.53 seconds to 1.73 seconds. And, git annex find --not --in remote from 12.41 seconds to 5.24 seconds. One of the optimisations sped up git-annex branch querying by up to 50%, which should also speed up use of some preferred content expressions. All in all, a very nice little optimisation pass.

Posted Thu Sep 29 21:17:29 2016

Only had a couple hours today, which were spent doing some profiling of git-annex in situations where it has to look through a large working tree in order to find files to act on. The top five hot spots this found are responsible for between 50% and 80% of git-annex's total CPU use in these situations.

The first optimisation sped up git annex find by around 18%. More tomorrow..

Posted Mon Sep 26 20:54:34 2016

Catching up on backlog today. I hope to be back to a regular work schedule now. Unanswered messages down to 156. A lot of time today spent answering questions.

There were several problems involving git branches with slashes in their name, such as "foo/bar" (but not "origin/master" or "refs/heads/foo"). Some branch names based on such a branch would take only the "bar" part. In git annex sync, this led to perhaps merging "foo/bar" into "other/bar" or "bar". And the adjusted branch code was entirely broken for such branches. I've fixed it now.
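
One common way this class of bug arises is taking everything after the last slash as the branch name. The sketch below shows that mistake and a safer alternative using shell parameter expansion. It is an illustration of the failure mode only; I am not claiming this is what git-annex's Haskell code did, and both function names are made up.

```shell
# Buggy: keeps only the text after the last slash, so the branch
# "foo/bar" is reduced to "bar".
short_name_buggy() {
    echo "${1##*/}"
}

# Safer: strip only the known refs/heads/ prefix and keep the rest,
# slashes included.
short_name() {
    echo "${1#refs/heads/}"
}
```

With `refs/heads/foo/bar` as input, the first function prints `bar` and the second prints `foo/bar`. For `refs/heads/foo` both print `foo`, which is why the problem only shows up for branches with slashes in their names.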

Also made git annex addurl behave better when the file it wants to add is gitignored.

Thinking about implementing git annex copy --from A --to B. It does not seem too hard to do that, at least with a temp file used in between. See transitive transfers.

Today's work was sponsored by Thomas Hochstein on Patreon.

Posted Wed Sep 21 22:03:16 2016

Turned out to not be very hard at all to make git annex get -JN assign different threads to different remotes that have the same cost. Something like that was requested back in 2011, but it didn't really make sense until parallel get was implemented last year.
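
One simple way to spread jobs over equal-cost remotes is to cycle through them by job number. The sketch below shows that idea only. Whether git-annex assigns threads exactly this way is an assumption on my part, and the `pick_remote` helper and the remote names in the usage note are made up.

```shell
# pick_remote JOB REMOTE...
# Chooses a remote for job number JOB (counting from 0) by cycling
# through the list of remotes, which are assumed to all have the same cost.
pick_remote() {
    job=$1
    shift
    # Skip (JOB mod number-of-remotes) entries, then take the next one.
    shift $(( job % $# ))
    echo "$1"
}
```

For example, with remotes `a b c`, jobs 0, 1, 2 and 3 are assigned to `a`, `b`, `c` and `a` again.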

(Also spent too much time fixing up broken builds.)

Posted Tue Sep 6 19:20:07 2016

Back after taking most of August off and working on other projects.

Got the unanswered messages backlog down from 222 to 170. Still scary high.

Numerous little improvements today. Notable ones:

  • Windows: Handle shebang in external special remote program. This is needed for git-annex-remote-rclone to work on Windows. Nice to see that external special remotes are getting ported and apparently getting lots of use.
  • Make --json and --quiet suppress automatic init messages, and any other messages that might be output before a command starts. This was a regression introduced in the optparse-applicative changes over a year ago.

Also I'm developing a plan to improve parallel downloading when multiple remotes have the same cost. See get round robin.

Today's work was sponsored by Jake Vosloo on Patreon.

Posted Mon Sep 5 20:39:45 2016

A user suggested adding --failed to retry failed transfers. That was a great idea and I landed a patch for it 3 hours later. Love it when a user suggests something so clearly right and I am able to quickly make it happen!

Unfortunately, my funding from the DataLad project to work on git-annex is running out. It's been a very good two years funded that way, with an enormous amount of improvements and support and bug fixes, but all good things must end. I'll continue to get some funding from them for the next year, but only for half as much time as the past two years.

I need to decide if it makes sense to keep working on git-annex to the extent I have been. There are definitely a few (hundred) things I still want to do on git-annex, starting with getting the git patches landed to make v6 mode really shine. Past that, it's mostly up to the users. If they keep suggesting great ideas and finding git-annex useful, I'll want to work on it more.

What to do about funding? Maybe some git-annex users can contribute a small amount each month to fund development. I've set up a Patreon page for this.

Anyhoo... Back to today's (unfunded) work.

--failed can be used with get, move, copy, and mirror. Of course those commands can all be simply re-run if some of the transfers fail and will pick up where they left off. But using --failed is faster because it does not need to scan all files to find out which still need to be transferred. And accumulated failures from multiple commands can be retried with a single use of --failed.

It's even possible to do things like git annex get --from foo; git annex get --failed --from bar, which first downloads everything it can from the foo remote and falls back to using the bar remote for the rest. Although setting remote costs is probably a better approach most of the time.

Turns out that I had earlier disabled writing failure log files, except by the assistant, because only the assistant was using them. So, that had to be undone. There's some potential for failure log files to accumulate annoyingly, so perhaps some expiry mechanism will be needed. This is why --failed is documented as retrying "recent" transfers. Anyway, the failure log files are cleaned up after successful transfers.

Posted Wed Aug 3 18:55:31 2016

With yesterday's JSON groundwork in place, I quickly implemented git annex metadata --batch today in only 45 LoC. The interface is nicely elegant; the same JSON format that git-annex metadata outputs can be fed into it to get, set, delete, and modify metadata.

Posted Wed Jul 27 20:02:41 2016

I've had to change the output of git annex metadata --json. The old output looked like this:


That was not good, because it didn't separate the metadata fields from the rest of the JSON object. What if a metadata field is named "note" or "success"? It would collide with the other "note" and "success" in the JSON.

So, changed this to a new format, which moves the metadata fields into a "fields" object:


I don't like breaking backwards compatibility of JSON output, but in this case I could see no real alternative. I don't know if anyone is using metadata --json anyway. If you are and this will cause a problem, get in touch.

While making that change, I also improved the JSON output layer, so it can use Aeson. Update: And switched everything over to using Aeson, so git-annex no longer depends on two different JSON libraries.

This let me use Aeson to generate the "fields" object for metadata --json. And it was also easy enough to use Aeson to parse the output of that command (and some simplified forms of it).

So, I've laid the groundwork for git annex metadata --batch today.

Posted Tue Jul 26 19:55:20 2016

A common complaint is that git annex fsck in a bare repository complains about missing content of deleted files. That's because in a bare repository, git-annex operates on all versions of all files. Today I added a --branch option, so if you only want to check say, the master branch, you can: git annex fsck --branch master

The new option has other uses too. Want to get all the files in the v1.0 tag? git annex get --branch v1.0

It might be worth revisiting the implicit --all behavior for bare repositories. It could instead default to --branch HEAD or something like that. But I'd only want to change that if there was a strong consensus in favor.

Over 3/4 of the time spent implementing --branch was spent in adjusting the output of commands, to show that "branch:file" is being operated on. How annoying.

Posted Wed Jul 20 19:59:37 2016

First release in over a month. Before making this release, a few last minute fixes, including a partial workaround for the problem that Sqlite databases don't work on Lustre filesystems.

Backlog is now down to 140 messages, and only 3 of those are from this month. Still higher than I like.

Posted Tue Jul 19 20:19:43 2016

Noticed that in one of my git-annex repositories, git-annex was spending a full second at startup checking all the git-annex branches from remotes to see if they contained changes that needed to be merged in. So, I added a cache of recently merged branches to avoid that. I remember considering this optimisation years ago; don't know why I didn't do it then. Not every day that I can speed up git-annex so much!
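
The idea can be sketched roughly as follows. This is only an illustration under the assumption that the cache records (ref, sha) pairs that have already been merged; the post does not say what the real cache stores, and all names here are hypothetical.

```haskell
type Ref = String
type Sha = String

-- Remote branch tips that are not already in the cache of merged
-- (ref, sha) pairs. Only these would need the expensive check.
unmergedTips :: [(Ref, Sha)] -> [(Ref, Sha)] -> [(Ref, Sha)]
unmergedTips cache = filter (`notElem` cache)

-- After merging, remember the tips that were just handled.
-- (Hypothetical: a real cache would also need to be kept bounded.)
rememberMerged :: [(Ref, Sha)] -> [(Ref, Sha)] -> [(Ref, Sha)]
rememberMerged merged cache = merged ++ filter (`notElem` merged) cache
```

When no remote branch has moved since the last merge, unmergedTips returns an empty list and startup can skip the check entirely.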

Also, made git annex log --all show location log changes for all keys. This was tricky to get right and fast.

Posted Sun Jul 17 19:20:51 2016

Worked on recent bug reports. Two bugs fixed today were both regressions introduced when the v6 repository support was added. Backlog is down to 153.

Posted Tue Jul 12 20:47:20 2016

Revisited my enhanced smudge/clean patch set for git, updating it for code review and to deal with changes in git since I've been away. This took several hours unfortunately.

Posted Mon Jul 11 22:48:56 2016

Back from vacation, with a message backlog of 181. I'm concentrating first on low-hanging fruit of easily implemented todos, and well reproducible bugs, to get started again.

Implemented --batch mode for git annex get and git annex drop, and also enabled --json for those.

Investigated git-annex startup time. Turns out that cabal has a bug that causes many thousands of unnecessary syscalls when linking in the shared libraries. Working around it halved git-annex's startup time.

Fixed a bug that caused git annex testremote to crash when testing a freshly made external special remote.

Posted Wed Jul 6 19:14:24 2016

Continued working on the enhanced smudge/clean interface in git today. Sent in a third version of the patch set, which is now quite complete.

I'll be away for the next week and a half, on vacation.

Posted Wed Jun 22 20:25:17 2016

Continued working on the enhanced smudge/clean interface in git, incorporating feedback from the git developers.

In a spare half an hour, I made an improved-smudge-filters branch that teaches git-annex smudge to use the new interface.

Doing a quick benchmark, git checkout of a deleted 1 gb file took:

  • 19 seconds before
  • 11 seconds with the new interface
  • 0.1 seconds with the new interface and annex.thin set
    (while also saving 1 gb of disk space!)

So, this new interface is very much worthwhile.

Posted Fri Jun 17 20:56:27 2016

Working on git, not git-annex the past two days, I have implemented the smudge-to-file/clean-from-file extension to the smudge/clean filter interface. Patches have been sent to the git developers, and hopefully they'll like it and include it. This will make git-annex v6 work a lot faster and better.

Amazing how much harder it is to code on git than on git-annex! While I'm certainly not as familiar with the git code base, this is mostly because C requires so much more care about innumerable details and so much verbosity to do anything. I probably could have implemented this interface in git-annex in 2 hours, not 2 days.

Posted Thu Jun 16 20:36:35 2016

There was one more test suite failure when run on FAT, which I've investigated today. It turns out that a bug report was filed about the same problem, and at root it seems to be a bug in git merge. Luckily, it was not hard to work around the strange merge behavior.

It's been very worthwhile running the test suite on FAT; it's pointed me at several problems with adjusted branches over the past weeks. It would be good to add another test suite pass to test adjusted branches explicitly, but when I tried adding that, there were a lot of failures where the test suite is confused by adjusted branch behavior and would need to be taught about it.

I've released git-annex 6.20160613. If you're using v6 repositories and especially adjusted branches, you should upgrade since it has many fixes.

Posted Mon Jun 13 20:18:22 2016

Today I was indeed able to get to the bottom of and fix the bug that had stumped me the other day.

Rest of the day was taken up by catching up to some bug reports and suggestions for v6 mode. Like making unlock and lock work for files that are not locally present. And, improving the behavior of the clean filter so it remembers what backend was used for a file before and continues using that same backend.

About ready to make a release, but IIRC there's one remaining test suite failure on FAT.

Posted Thu Jun 9 20:39:47 2016

Been having a difficult time fixing the two remaining test suite failures when run on a FAT filesystem.

On Friday, I got quite lost trying to understand the first failure. At first I thought it had something to do with queued git staging commands not being run in the right git environment when git-annex is using a different index file or work tree. I did find and fix a potential bug in that area. It might be that some reports long ago of git-annex branch files getting written to the master branch were caused by that. But, fixing it did not help with the test suite failure at hand.

Today, I quickly found the actual cause of the first failure. Of course, it had nothing to do with queued git commands at all, and was a simple fix in the end.

But, I've been staring at the second failure for hours and am not much wiser. All I know is, an invalid tree object gets generated by the adjusted branch code that contains some files more than once. (git gets very confused when a repository contains such tree objects; if you wanted to break a git repository, getting such trees into it might be a good way. cough) This invalid tree object seems to be caused by the basis ref for the adjusted branch diverging somehow from the adjusted branch itself. I have not been able to determine why or how the basis ref can diverge like that.

Also, this failure is somewhat indeterminate; it doesn't always occur, and reordering the tests in the test suite can hide it. Weird.

Well, hopefully looking at it again later with fresh eyes will help.

Posted Tue Jun 7 20:01:27 2016

A productive day of small fixes. Including a change to deal with an incompatibility in git 2.9's commit.gpgsign, and a couple of fixes involving gcrypt repositories.

Also several improvements to cloning from repositories where an adjusted branch is checked out. The clone automatically ends up with the adjusted branch checked out too.

The test suite has 3 failures when run on a FAT repository, all involving adjusted branches. Managed to fix one of them today, hope to get to the others soon.

Posted Thu Jun 2 21:03:50 2016

Release today includes a last-minute fix to parsing lines from the git-annex branch that might have one or more carriage returns at the end. This comes from Windows of course, where since some things transparently add/remove \r before the end of lines, while other things don't, it could result in quite a mess. Luckily it was not hard or expensive to handle. If you are lucky enough not to use Windows, the release also has several more interesting improvements.
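
Handling this can be as simple as the following sketch, which is an illustration rather than the actual git-annex parser; both function names are made up. It drops any number of carriage returns from the end of each line, and leaves carriage returns elsewhere in a line alone.

```haskell
import Data.List (dropWhileEnd)

-- Drop any number of trailing carriage returns from a single line.
stripTrailingCR :: String -> String
stripTrailingCR = dropWhileEnd (== '\r')

-- Split file content into lines, tolerating lines that end in
-- "\n", "\r\n", or even "\r\r\n".
parseLines :: String -> [String]
parseLines = map stripTrailingCR . lines
```

For example, parseLines "a\r\r\nb\n" gives ["a","b"].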

Posted Fri May 27 20:51:03 2016

git-annex has always balanced implicit and explicit behavior. Enabling a git repository to be used with git-annex needs an explicit init, to avoid foot-shooting; but a clone of a repository that is already using git-annex will be implicitly initialized. Git remotes implicitly are checked to see if they use git-annex, so the user can immediately follow git remote add with git annex get to get files from it.

There's a fine line here, and implicit git remote enabling sometimes crosses it; sometimes the remote doesn't have git-annex-shell, and so there's an ugly error message and annex-ignore has to be set to avoid trying to enable that git remote again. Sometimes the probe of a remote can occur when the user doesn't really expect it to (and it can involve a ssh password prompt).

Part of the problem is, there's not an explicit way to enable a git remote to be used by git-annex. So, today, I made git annex enableremote do that, when the remote name passed to it is a git remote rather than a special remote. This way, you can avoid the implicit behavior if you want to.

I also made git annex enableremote un-set annex-ignore, so if a remote got that set due to a transient configuration problem, it can be explicitly enabled.

Posted Tue May 24 21:11:15 2016

Over the weekend, I noticed that a relative path to GIT_INDEX_FILE is interpreted in several different, inconsistent ways by git. git-annex mostly used absolute paths, but did use a relative path in git annex view. Now it will only use absolute paths to avoid git's wacky behavior.
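
A minimal sketch of that approach, assuming the directory package that ships with GHC is available: make the path absolute before exporting it, so git never sees a relative GIT_INDEX_FILE. The function name setIndexFile is hypothetical and this is not the code git-annex uses.

```haskell
import System.Directory (makeAbsolute)
import System.Environment (setEnv)

-- Make the index file path absolute (relative to the current working
-- directory) before setting GIT_INDEX_FILE, and return the path used.
setIndexFile :: FilePath -> IO FilePath
setIndexFile f = do
  absf <- makeAbsolute f
  setEnv "GIT_INDEX_FILE" absf
  return absf
```

Any git command started afterwards from the same process inherits the absolute path, whatever directory it runs in.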

Integrated some patches to support building with ghc 8.0.1, which was recently released.

The gnupg-options git configs were not always passed to gpg. Fixing this involved quite a lot of plumbing to get the options to the right functions, and consumed half of today.

Also did some design work on external special remote protocol to avoid backwards compatibility problems when adding new protocol features.

Posted Mon May 23 22:26:56 2016

Fixed several problems with v6 mode today. The assistant was doing some pretty wrong things when changes were synced into v6 repos, and that behavior is fixed. Also dealt with a race that caused updates made to the keys database by one process to not be seen by another process. And, made git annex add of an unlocked pointer file not annex the pointer file's content, but just add it to git as-is.

Also, Thowz pointed out that adjusted branches could be used to locally adjust where annex symlinks point to, when a repository's git directory is not in the usual location. I've added that, as git annex adjust --fix. It was quite easy to implement this, which makes me very happy with the adjusted branches code!

Posted Mon May 16 21:35:43 2016

Posted a proposal for extending git smudge/clean filters with raw file access. If git gets an interface like that, it will make it easy to deal with most of the remaining v6 todo list.

Posted Thu May 12 21:20:41 2016

It's not every day I add a new special remote encryption mode to git-annex! The new encryption=sharedpubkey mode lets anyone with a clone of the git repository (and access to the remote) store files in the remote, but then only the private key owner can access those files. Which opens up some interesting new use cases...

Posted Tue May 10 21:18:39 2016

Lots of little fixes and improvements here and there over the past couple days.

The main thing was fixing several bugs with adjusted branches and Windows. They seem to work now, and commits made on the adjusted branch are propagated back to master correctly.

It would be good to finish up the last todos for v6 mode this month. The sticking point is I need a way to update the file stat in the git index when git-annex gets/drops/etc an unlocked file. I have not decided yet if it makes the most sense to add a dependency on libgit2 for that, or extend git update-index, or even write a pure haskell library to manipulate index files. Each has its pluses and its minuses.

Posted Wed May 4 18:40:43 2016

git-annex 6.20160419 has a rare security fix. A bug made encrypted special remotes that are configured to use chunks accidentally expose the checksums of content that is uploaded to the remote. Such information is supposed to be hidden from the remote's view by the encryption. The same bug also made resuming interrupted uploads to such remotes start over from the beginning.

After releasing that, I've been occupied today with fixing the Android autobuilder, which somehow got its build environment broken (unsure how), and fixing some other dependency issues.

Posted Thu Apr 28 20:19:34 2016

I'm on a long weekend. This did not prevent git-annex from getting an impressive lot of features though, as Daniel Dent contributed a special remote which uses rclone to add support for a ton of additional cloud storage things, including:

Google Drive, Openstack Swift, Rackspace cloud files, Memset Memstore, Dropbox, Google Cloud Storage, Amazon Cloud Drive, Microsoft One Drive, Hubic, Backblaze B2, Yandex Disk

Wow! I hope that rclone will end up packaged in more distributions (eg Debian) so this will be easier to set up.

Posted Mon Apr 25 17:47:48 2016

Something that has come up repeatedly is that git annex reinject is too hard to use since you have to tell it which annexed file you're providing the content for. Now git-annex reinject --known can be passed a list of files and it will reinject any that hash to known annexed contents and ignore the rest. That works best when only one backend is used in a repository; otherwise it would need to be run repeatedly with different --backend values.
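
The matching step can be sketched like this. It is an illustration only; splitKnown is a hypothetical name, and the hash function is passed in as a stand-in for whichever backend the repository uses. Because only one hash function is applied, it also shows why a repository with several backends needs several runs.

```haskell
import Data.List (partition)

type Key = String

-- Split candidate files (path, content) into those whose content hashes
-- to a known key, paired with that key, and the paths to ignore.
-- `hash` stands in for a real backend's hashing; it is an assumption.
splitKnown :: (String -> Key) -> [Key] -> [(FilePath, String)]
           -> ([(FilePath, Key)], [FilePath])
splitKnown hash known files = (map toKey matched, map fst ignored)
  where
    (matched, ignored) = partition (\(_, c) -> hash c `elem` known) files
    toKey (f, c) = (f, hash c)
```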

Turns out that the GIT_COMMON_DIR feature used by adjusted branches is only a couple years old, so don't let adjusted branches be used with a too old git.

And, git merge is getting a new sanity check that prevents merging in a branch with a disconnected history. git annex sync will inherit that sanity check, but the assistant needs to let such merges happen when eg, pairing repositories, so more git version checking there.

Posted Fri Apr 22 20:10:51 2016

The past three days have felt kind of low activity days, but somehow a lot of stuff still got done, both bug fixes and small features, and I am feeling pretty well caught up with backlog for the first time in over a month. Although as always there is some left, 110 messages.

On Monday I fixed a bug that could cause a hang when dropping content, if git-annex had to verify the content was present on a ssh remote. That bug was bad enough to make an immediate release for, even though it was only a week since the last release.

Posted Wed Apr 20 20:16:16 2016

Seems I forgot about executable files entirely when implementing v6 unlocked files. Fixed that oversight today.

Posted Thu Apr 14 19:56:05 2016

Yesterday I released version 6.20160412, which is the first to support adjusted branches.

Today, some planning for ways to better support annex.thin, but that seems to be stuck on needing a way to update git's index file. Which is the main thing needed to fix various problems with v6 unlocked files.

Dove back into the backlog, got it down to 144 messages. Several bug fixes.

Posted Wed Apr 13 23:18:24 2016

Think I'm really finished with adjusted branches now. Fixed a bug in annex symlink calculation when merging into an adjusted branch. And, fixed a race condition involving a push of master from another repository.

While git annex adjust --unlock is reason enough to have adjusted branches, I do want to at some point look into implementing git annex adjust --hide-missing, and perhaps rewrite the view branches to use adjusted branches, which would allow for updating view branches when pulling from a remote.

Also, turns out Windows supports hard links, so I got annex.thin working on Windows, as well as a few other things that work better with hard links.

Posted Sat Apr 9 18:20:43 2016

Well, I had to rethink how merges into adjusted branches should be handled. The old method often led to unnecessary merge conflicts. My new approach should always avoid unnecessary merge conflicts, but it's quite a trick.

To merge origin/master into adjusted/master, it first merges origin/master into master. But, since adjusted/master is checked out, it has to do the merge in a temporary work tree. Luckily this can be done fairly inexpensively. To handle merge conflicts at this stage, git-annex's automatic merge conflict resolver is used. This approach wouldn't be feasible without a way to automatically resolve merge conflicts, because the user can't help with conflict resolution when the merge is not happening in their working tree.

Once that out-of-tree merge is done, the result is adjusted, and merged into the adjusted branch. Since we know the adjusted branch is a child of the old master branch, this merge can be forced to always be a fast-forward. This second merge will only ever have conflicts if the work tree has something uncommitted in it that causes a merge conflict.

Wow! That's super tricky, but it seems to work well. While I ended up throwing away everything I did last Thursday due to this new approach, the code is in some ways simpler than that old, busted approach.

Posted Wed Apr 6 23:32:23 2016

Feels like I've been working on adjusted branches too long.

Did make some excellent progress today. Upgrading a direct mode repo to v6 will now enter an adjusted branch where all files are unlocked. Using an adjusted branch like this avoids unlocking all files in the master branch of the repo, which means that different clones of a repo can be upgraded to v6 mode at different times. This should let me advance the timetable for enabling v6 by default, and getting rid of direct mode.

Also, cloning a repository that has an adjusted branch checked out will now work; the clone starts out in the same adjusted branch.

But, I realized today that the way merges from origin/master into adjusted/master are done will often lead to merge conflicts. I have come up with a better way to handle these merges that won't unnecessarily conflict, but didn't feel ready to implement that today.

Instead, I spent the latter half of the day getting caught up on some of the backlog. Got it down from some 200 messages to 150.

Posted Mon Apr 4 21:02:06 2016

Spent all day fixing sync in adjusted branches. I was lost in the weeds for a long time. Eventually, drawing this diagram helped me find my way to a solution:

origin/master    adjusted/master     master
A                                    A
|--------------->A'                  |
|                |                   |
|                C'- - - - - - - - > C
B                                    |
|                                    |

After implementing that, syncing in adjusted branches seems to work much better now. And I've finally merged support for them into master.

There are still several bugs and race conditions and upgrade things to sort out around adjusted branches. Probably another week's work all told.

Posted Thu Mar 31 23:07:49 2016

Back from Libreplanet and a week of spring break. Backlog is not too bad for two weeks mostly away; 143 messages.

Finally got the OSX app updated for the git security fix yesterday. Had to drop builds for old OSX releases.

Getting back into working on adjusted branches now. Polishing up the UI and docs today. Nearly ready to merge the feature; the only blocker is there seems to be something a little bit wrong with how pulled changes are merged into the adjusted branch that I noticed in testing.

Posted Tue Mar 29 20:47:33 2016

Pushed out a git-annex release this morning mostly because of the recent git security fix. Several git-annex builds bundle a copy of git and needed to be updated. Note that the OSX autobuilder is temporarily down and so it's not been updated yet -- hopefully soon.

Posted Fri Mar 18 15:46:07 2016

Caught up with a few last things today, before I leave for a week in Boston.

Converted several places that ran git hash-object repeatedly to feed data to a running process. This sped up git-annex add in direct mode and with v6 unlocked files, by up to 2x.

Posted Mon Mar 14 20:53:27 2016

After a real brain-bender of a day, I have commit propagation from the adjusted branch back to the original branch working, without needing to reverse adjust the whole tree. This is faster, but the really nice thing is that it makes individual adjustments simpler to write.

In fact, it's so simple that I took 10 minutes just now to implement a second adjustment!

adjustTreeItem HideMissingAdjustment h ti@(TreeItem _ _ s) = do
         mk <- catKey s
         case mk of
                 -- An annexed file: keep it only if its content is present.
                 Just k -> ifM (inAnnex k)
                         ( return (Just ti)
                         , return Nothing
                         )
                 -- Not an annexed file: always keep it.
                 Nothing -> return (Just ti)

Posted Fri Mar 11 23:55:45 2016

Over the weekend, I converted the linux "ancient" autobuilder to use stack. This makes it easier to get all the recent versions of all the haskell dependencies installed there.

Also, merged my no-ffi branch, removing some library code from git-annex and adding new dependencies. It's good to remove code.

Today, fixed the OSX dmg file -- its bundled gpg was broken. I pushed out a new version of the OSX dmg file with the fix.

With the recent incident in mind of malware inserted into the Transmission dmg, I've added a virus scan step to the release process for all the git-annex images. This way, we'll notice if an autobuilder gets a virus.

Also caught up on some backlog, although the remaining backlog is a little larger than I'd like at 135 messages.

Hope to work some more on adjusted branches this week. A few mornings ago, I had what may be a key insight about how to reverse adjustments when propagating changes back from the adjusted branch.

Posted Mon Mar 7 20:26:38 2016

Tuesday was spent dealing with lock files. Turned out there were some bugs in the annex.pidlock configuration that prevented it from working, and could even lead to data loss.

And then more lock files today, since I needed to lock git's index file the same way git does. This involved finding out how to emulate O_EXCL under Windows. Urgh.

Finally got back to working on adjusted branches today. And, I've just gotten syncing of commits from adjusted branches back to the original branch working! Time for a short demo of what I've been building for the past couple weeks:

joey@darkstar:~/tmp/demo>ls -l
total 4
lrwxrwxrwx 1 joey joey 190 Mar  3 17:09 bigfile -> .git/annex/objects/zx/X8/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f
joey@darkstar:~/tmp/demo>git annex adjust
Switched to branch 'adjusted/master(unlocked)'
joey@darkstar:~/tmp/demo#master(unlocked)>ls -l
total 4
-rw-r--r-- 1 joey joey 1048576 Mar  3 17:09 bigfile

Entering the adjusted branch unlocked all the files.

joey@darkstar:~/tmp/demo#master(unlocked)>git mv bigfile newname
joey@darkstar:~/tmp/demo#master(unlocked)>git commit -m rename
[adjusted/master(unlocked) 29e1bc8] rename
 1 file changed, 0 insertions(+), 0 deletions(-)
  rename bigfile => newname (100%)
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline
29e1bc835080298bbeeaa4a9faf42858c050cad5 rename
a195537dc5beeee73fc026246bd102bae9770389 git-annex adjusted branch
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline master
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add

The commit was made on top of the commit that generated the adjusted branch. It's not yet reached the master branch.

joey@darkstar:~/tmp/demo#master(unlocked)>git annex sync
commit  ok
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline
b60c5d6dfe55107431b80382596f14f4dcd259c9 git-annex adjusted branch
9c36848f078a2bb7a304010e962a2b7318c0877c rename
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add
joey@darkstar:~/tmp/demo#master(unlocked)>git log --pretty=oneline master
9c36848f078a2bb7a304010e962a2b7318c0877c rename
5dc1d94d40af4bf4a88b52805e2a3ae855122958 add

Now the commit has reached master. Notice how the history of the adjusted branch was rebased on top of the updated master branch as well.

joey@darkstar:~/tmp/demo#master(unlocked)>ls -l
total 1024
-rw-r--r-- 1 joey joey 1048576 Mar  3 17:09 newname
joey@darkstar:~/tmp/demo#master(unlocked)>git checkout master
Switched to branch 'master'
joey@darkstar:~/tmp/demo>ls -l
total 4
lrwxrwxrwx 1 joey joey 190 Mar  3 17:12 newname -> .git/annex/objects/zx/X8/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f/SHA256E-s1048576--44ee9fdd91d4bc567355f8b2becd5fe137b9e3aafdfe804341ce2bcc73b8013f

Just as we'd want, the file is locked in master, and unlocked in the adjusted branch.

(Not shown: git annex sync will also merge in and adjust changes from remotes.)

So, that all looks great! But, it's cheating a bit, because it locks all files when updating the master branch. I need to make it remember, somehow, when files were originally unlocked, and keep them unlocked. Also want to implement other adjustments, like hiding files whose content is not present.

Posted Thu Mar 3 21:20:53 2016

Pushed out a release today, could not resist the leap day in the version number, and also there were enough bug fixes accumulated to make it worth doing.

I now have git-annex sync working inside adjusted branches, so pulls get adjusted appropriately before being merged into the adjusted branch. Seems to mostly work well, although I did just find one bug in it. Only propagating adjusted commits remains to be done to finish my adjusted branches prototype.

Posted Mon Feb 29 21:40:37 2016

Now I have a proof of concept adjusted branches implementation, that creates a branch where all locked files are adjusted to be unlocked. It works!

Building the adjusted branch is pretty fast; around 2 thousand files per second. And, I have a trick in my back pocket that could double that speed. It's important this be quite fast, because it'll be done often.

Checking out the adjusted branch can be a bit slow though, since git runs git annex smudge once per unlocked file. So that might need to be optimised somehow. On the other hand, this should be done only rarely.

I like that it generates reproducible git commits so the same adjustments of the same branch will always have the same sha, no matter when and where it's done. Implementing that involved parsing git commit objects.

Next step will be merging pulled changes into the adjusted branch, while maintaining the desired adjustments.

Posted Thu Feb 25 21:13:30 2016

Getting started on adjusted branches, taking a top-down and bottom-up approach. Yesterday I worked on improving the design. Today, built a git mktree interface that supports recursive tree generation and filtering, which is the low-level core of what's needed to implement the adjusted branches.

To test that, wrote a fun program that generates a git tree with all the filenames reversed.

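import Git.Tree
import Git.CurrentRepo
import Git.FilePath
import Git.Types
import System.FilePath

-- Read the tree at HEAD, reverse every filename in it, record the
-- result as a new tree, and print what recordTree returns.
main = do
        r <- Git.CurrentRepo.get
        (Tree t, cleanup) <- getTree (Ref "HEAD") r
        print =<< recordTree r (Tree (map reverseTree t))

-- Reverse the name of a blob; for a subtree, reverse its name and
-- recurse into its contents.
reverseTree :: TreeContent -> TreeContent
reverseTree (TreeBlob f m s) = TreeBlob (reverseFile f) m s
reverseTree (RecordedSubTree f s l) = NewSubTree (reverseFile f) (map reverseTree l)

-- Reverse each piece of the path, as split by splitPath.
reverseFile :: TopFilePath -> TopFilePath
reverseFile = asTopFilePath . joinPath . map reverse . splitPath . getTopFilePath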

Also, fixed problems with the Android, Windows, and OSX builds today. Made a point release of the OSX dmg, because the last several releases of it will SIGILL on some hardware.

Posted Tue Feb 23 20:57:15 2016

Should mention that there was a release two days ago. The main reason for the timing of that release is that the Linux standalone builds include glibc, which recently had a nasty security hole and had to be updated.

Today, fixed a memory leak, and worked on getting caught up with backlog, which now stands at 112 messages.

Posted Fri Feb 19 21:09:44 2016

In a v6 repository on a filesystem not supporting symlinks, it makes sense for commands like git annex add and git annex import to add the files unlocked, since locked files are not usable there. After implementing that, I also added an annex.addunlocked config setting, so that the same behavior can be configured in other repositories.

Rest of the day was spent fixing up the test suite's v6 repository tests to work on FAT and Windows.

Posted Tue Feb 16 21:07:39 2016

Made a no-cbits branch that removes several things that use C code and the FFI. I moved one of them out to a new haskell library; others were replaced with other existing libraries. This will simplify git-annex's build process, and more library use is good. Planning to merge this branch in a week or two.

v6 unlocked files don't work on Windows. I had assumed that since the build was succeeding, the test suite was passing there. But, it turns out the test suite was failing and somehow not failing the build. Have now fixed several problems with v6 on Windows. Still a couple test suite problems to address.

Posted Mon Feb 15 20:56:10 2016

This was one of those days where I somehow end up dealing with tricky filename encoding problems all day.

First, worked around inability for concurrent-output to display unicode characters when in a non-unicode locale. The normal trick that git-annex uses doesn't work in this case. Since it only affected -J, I decided to make git-annex detect the problem and make -J behave as if it was not built with the concurrent-output feature. So, it just doesn't display concurrent output, which is better than crashing with an encoding error.

The other problem affects v6 repos only. Seems that not all Strings will round trip through a persistent sqlite database. In particular, unicode surrogate characters are replaced with garbage. This is really a bug in persistent. But, for git-annex's purposes, it was possible to work around it, by detecting such Strings and serializing them differently.
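A minimal Python sketch of the general workaround described above: detect strings containing surrogates and store them in a different, safe encoding. The tagging scheme (`s:` and `b:` prefixes) and the function names are assumptions made for illustration; they are not the serialization git-annex actually uses.

```python
import base64

def has_surrogates(s):
    """True if the string contains any UTF-16 surrogate code point."""
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

def serialize(s):
    """Produce a string that is safe to store in a text column."""
    if has_surrogates(s):
        # Hypothetical escape form: base64 of the surrogate-preserving bytes.
        raw = s.encode("utf-8", "surrogatepass")
        return "b:" + base64.b64encode(raw).decode("ascii")
    return "s:" + s

def deserialize(stored):
    """Invert serialize()."""
    tag, payload = stored[:2], stored[2:]
    if tag == "b:":
        return base64.b64decode(payload).decode("utf-8", "surrogatepass")
    return payload
```

The tag on every value keeps the two forms unambiguous, so an ordinary filename that happens to start with `b:` still round trips.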

Then I had to enhance git annex fsck to fix up repositories that were affected by that problem.

Posted Sun Feb 14 22:03:49 2016

Working on a design for adjusted branches. I've been kicking this idea around for a while to replace direct mode on crippled filesystems with v6 unlocked files. And the same thing would allow for hiding not present files. It's somewhat complicated, but the design I have seems like it would work.

Posted Tue Feb 9 19:37:34 2016

The 2015 git-annex user survey is over with, and I'm reading through it and comparing with the 2013 survey.

37% fewer users responded to the 2015 survey than in 2013. It's hard to tell if this has anything to do with the total number of git-annex users; Debian's popcon suggests the number of users has doubled since 2013, although its graph also suggests the number of users has flattened off since 2014. The difference may just be that I promoted the 2013 survey better than the 2015 survey, perhaps reaching kickstarter backers who I was in touch with back then.

25% use the assistant. Of those, 20% use XMPP, which is good to know as I'd like to get rid of it.

Android use has quadrupled, and Windows use has doubled; both are now at 4%. It's not surprising that Android and Windows users still think more porting work is needed for those OSes. iOS is the only unsupported OS that more than 1% of users want. Embedded and NAS systems were mentioned much less than in 2013; probably the arm tarball build met many such needs.

About the same percentage of users prefer direct mode in 2015 as did in 2013, and ditto for indirect mode. But, more users in 2015 only use direct mode on platforms that force its use. Correlating with the OS percentages suggests that many of these users are using removable media with the FAT filesystem, rather than an OS like Windows or Android. Hopefully v6 unlocked files will eventually better meet those users' needs.

The percent of users installing git-annex from source has halved since 2013, and it seems that builds from this website have taken up most of that slack; I would have expected more installs from Debian, Homebrew etc, but that seems not to have increased.

The number of repositories per user has gone up quite a lot since 2013, when only 7% of users had more than 10 repos. Now, 23% of users do. And, 2% of users have more than 100 repos! This probably involves both more repositories for different purposes, and cloning of repositories to more devices.

Similarly, the amount of data stored has gone up. 34% have more than 1 terabyte stored, up from 18% in 2013. 2% have more than 16 terabytes.

There are some indications of more users sharing repositories or otherwise using it in teams or larger groups, although most users still use it by themselves.

Users seem happier with git-annex now than in 2013. 16% call it "one of my favorite applications of all time". And, significantly fewer find it too hard to use than in 2013.

The main blocking problems are documentation, performance with many files (a general git problem), and various issues with the assistant. Respondents suggest more focus on making it easier for nontechnical users, and for use in larger groups/organizations.

Posted Fri Feb 5 22:30:14 2016

The same parser was used for both preferred content expressions and annex.largefiles. Reworked that today, splitting it into two distinct parsers. It doesn't make any sense to use terms like "standard" or "lackingcopies" in annex.largefiles, and such are now rejected.

That groundwork also let me add a feature that only makes sense for annex.largefiles, and not for preferred content expressions: Matching by mime type, such as mimetype=text/*
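As a rough illustration of what matching a glob like text/* against a mime type means, here is a small Python sketch. It takes the mime types as given (how git-annex determines them is not covered here), and the helper name `files_matching_mimetype` is hypothetical.

```python
from fnmatch import fnmatchcase

def files_matching_mimetype(mimetypes_by_file, pattern):
    """Return the files whose (already known) mime type matches the glob."""
    return sorted(
        name for name, mimetype in mimetypes_by_file.items()
        if fnmatchcase(mimetype, pattern)
    )
```

For example, with `{"notes.txt": "text/plain", "song.mp3": "audio/mpeg"}` and the pattern `text/*`, only `notes.txt` is selected.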

Posted Wed Feb 3 20:59:21 2016

For use cases that mix annexed files with files stored in git, the annex.largefiles config is more important in v6 repositories than before, since it configures the behavior of git add and even git commit -a. To make it possible to set annex.largefiles so it'll stick across clones of a repository, I have now made it be supported in .gitattributes files as well as git config.

Setting it in .gitattributes looks a little bit different, since the regular .gitattributes syntax can be used to match on the filename.

* annex.largefiles=(largerthan=100kb)
*.c annex.largefiles=nothing

It seems there's no way to make a git attribute value contain whitespace. So, more complicated annex.largefiles expressions need to use parens to break up the words.

* annex.largefiles=(largerthan=100kb)and(not(include=*.c))
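To spell out what that last expression selects, here is a hedged Python sketch of its logic for a single file. It is not git-annex's parser; it hard-codes this one expression, and it assumes "kb" means 1000 bytes, which should be checked against the git-annex documentation.

```python
from fnmatch import fnmatchcase

KB = 1000  # assumption: "kb" is treated as 1000 bytes

def is_large_file(name, size):
    """Evaluate (largerthan=100kb)and(not(include=*.c)) for one file."""
    larger_than = size > 100 * KB
    is_c_source = fnmatchcase(name, "*.c")
    return larger_than and not is_c_source
```

So a 5 MB video would be annexed, while a 5 MB .c file and a 10 byte text file would both be added to git as usual.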

Posted Tue Feb 2 19:35:35 2016

Bugfix release of git-annex today. The release earlier this month had a bug that caused git annex sync --content to drop files that should be preferred content. So I had to rush out a fix after that bug was reported. (Some of the builds for the new release are still updating as I post this.)

In the past week I've been dealing with a blizzard. Snowed in for 6 days and counting. That has slightly back-burnered working on git-annex, and I've mostly been making enhancements that the DataLad project needs, along the lines of more commands supporting --batch and better --json output.

Posted Tue Jan 26 19:45:23 2016

After finally releasing git-annex 6 yesterday, I did some catching up today, and got the message backlog back down from 120 to 100.

By the way, the first OSX release of git-annex 6 was broken; I had to fix an issue on the builder and update the build. If you upgraded at the wrong time, you might find that git-annex doesn't run; if so reinstall it. I now have an account on a separate OSX machine from the build machine, that automatically tests the daily build, to detect such problems.

Posted Fri Jan 15 20:56:19 2016

Added git annex benchmark which uses the excellent Criterion to benchmark parts of git-annex. What I'm interested in benchmarking right now is the sqlite database that is used to manage v6 unlocked files, but having a built-in benchmark will probably have other uses later.

The benchmark results were pretty good; queries from the database are quite fast (60 microseconds warm cache) and scale well as the size increases. I did find one scalability issue, which was fixed by adding another index to the database. The kind of schema change that it's easy to make now, but that would be a painful transition if it had to be done once this was in wide use.
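The effect of adding an index can be seen in miniature with Python's sqlite3 module. This is only a sketch: the table and index names (`associated`, `associated_key`) and the two-column schema are invented for the example and are not git-annex's actual schema.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return sqlite's textual query plan for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the last column of each row.
    return " ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE associated (key TEXT, file TEXT)")
conn.executemany(
    "INSERT INTO associated VALUES (?, ?)",
    [("KEY%d" % i, "file%d" % i) for i in range(1000)],
)

lookup = "SELECT file FROM associated WHERE key = ?"
plan_before = query_plan(conn, lookup, ("KEY42",))  # full table scan
conn.execute("CREATE INDEX associated_key ON associated (key)")
plan_after = query_plan(conn, lookup, ("KEY42",))   # search via the index
```

Without the index every lookup scans the whole table, so cost grows with the number of rows; with it, the plan switches to an index search.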

Posted Tue Jan 12 18:12:11 2016

Test suite is 100% green! Fixed one remaining bug it found, and solved the strange sqlite crash, which turned out to be caused by the test suite deleting its temporary repository before sqlite was done with the database inside it.

The only remaining blocker for using v6 unlocked files is a bad interaction with shared clones. That should be easy to fix, so release of git-annex version 6 is now not far away!

While I've only talked about v6/smudge stuff here lately, I have been fixing various other bugs along the way, and have accumulated a dozen bug fixes since the last release. Earlier this week I fixed a bug in git annex unused. Yesterday I noticed that git annex migrate didn't copy over metadata. Today, fixed a crash of git annex view in a non-unicode locale. Etc. So it'll be good not to have the release blocked any longer by v6 stuff.

Posted Fri Jan 8 20:35:32 2016

Been working hard on the last several test suite failures for v6 unlocked files. Now I've solved almost all of them, which is a big improvement to my confidence in its (almost) correctness.

Frustratingly, the test suite is still not green after all this work. There's some kind of intermittent failure related to the sqlite database. Only seems to happen when the test suite is running, and the error message is simply "Error" which is making it hard to track down..

Posted Thu Jan 7 22:03:50 2016

Got the test suite passing 100%, but then added a pass that uses v6 unlocked files and 30-some more failures appeared. Fixed a couple of the bugs today. After sprinting unexpectedly hard all December on v6, I need a change of pace, so I started digging into the website message backlog and fixed some bugs and posted some comments there.

Posted Fri Jan 1 21:51:00 2016

Automatic merge conflict resolver updated to work with unlocked files in v6 repos. Fairly tricky and painful; thank goodness the test suite tests a lot of edge cases in that code.

Posted Tue Dec 29 21:49:08 2015

If you've got some free holiday time, the v6 repository mode is now available in many of the daily builds, and there's documentation at unlocked files. It would be very useful now if you can give it a try. Use a clone or new repository for safety.

Yesterday I checked all parts of the code that special case direct mode, and found a few things that needed adjusting for v6 unlocked files. Today, I added the annex.thin config. Around 4 other major todo items need to be dealt with before this is ready for more than early adopters.

Posted Sun Dec 27 21:18:51 2015

Got unexpectedly far today on optimising the database that v6 repositories use to keep track of unlocked files. The database schema may still need optimization, but everything else to do with the database is optimised. Writes to the database are queued together. And reads to the database avoid creating the database if it doesn't exist yet. Which means v5 repos, and v6 repos with no unlocked files will avoid any database overhead.
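Both optimisations can be sketched in a few lines of Python. This is an illustration under assumptions, not the git-annex implementation: the class name `AssociatedFiles`, its schema, and the queue size are made up for the example.

```python
import os
import sqlite3

class AssociatedFiles:
    """Toy key-to-file database with queued writes and lazy creation."""

    def __init__(self, path, queue_size=1000):
        self.path = path
        self.queue_size = queue_size
        self.pending = []

    def add(self, key, file):
        # Writes are only queued; the database is touched on flush.
        self.pending.append((key, file))
        if len(self.pending) >= self.queue_size:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        conn = sqlite3.connect(self.path)
        with conn:  # commit all queued inserts together
            conn.execute(
                "CREATE TABLE IF NOT EXISTS associated (key TEXT, file TEXT)")
            conn.executemany(
                "INSERT INTO associated VALUES (?, ?)", self.pending)
        conn.close()
        self.pending = []

    def files_for(self, key):
        # Reads never create the database: no file means no entries.
        if not os.path.exists(self.path):
            return []
        conn = sqlite3.connect(self.path)
        try:
            rows = conn.execute(
                "SELECT file FROM associated WHERE key = ?", (key,)).fetchall()
        finally:
            conn.close()
        return [row[0] for row in rows]
```

A repository that never queues a write never gets a database file at all, which is the property that keeps the overhead away from v5 repos in the description above. Note that in this sketch reads do not see writes that are still queued.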

Posted Wed Dec 23 23:41:22 2015

Today was mostly spent making the assistant support v6 repositories. That was harder than expected, because I have not touched this part of the assistant's code much in a long time, and there are lots of tricky races and edge cases to deal with.

The smudge branch has a 4500 line diff from master now, not counting documentation changes (another 500 lines). The todo list for it is shrinking slowly now. May not get it done before the new year.

Posted Wed Dec 23 00:38:21 2015

Two more days working on v6 and the smudge branch is almost ready to be merged. The test suite is passing again for v5 repos, and is almost passing for v6 repos. Also I decided to make git annex init create v5 repos for now, so git annex init --version=6 or a git annex upgrade is needed to get a v6 repo. So while I still have plenty of todo items for v6 repos, they are working reasonably well and almost ready for early adopters.

The only real blocker to merging it is that the database stuff used by v6 is not optimised yet and probably slow, and even in v5 repos it will query the database. I hope to find an optimisation that avoids all database overhead unless unlocked files are used in a v6 repo.

I'll probably make one more release before that is merged though. Yesterday I fixed a small security hole in git annex repair, which could expose the contents of an otherwise not world-writable repository to local users.

BTW, the 2015 git-annex user survey closes in two weeks, please go fill it out if you haven't yet done so!

Posted Wed Dec 16 21:05:32 2015

New special remote alert! Chris Kastorff has made a special remote supporting Backblaze's B2 storage service.

And I'm still working on v6 unlocked files. After beating on it for 2 more days, all git-annex commands should support them. There is still plenty of work to do on testing, upgrading, optimisation, merge conflict resolution, and reconciling staged changes.

Posted Fri Dec 11 20:25:58 2015

Well, another day working on smudge filters, or unlocked files as the feature will be known when it's ready. Got both git annex get and git annex drop working for these files today.

Get was the easy part; it just has to hard link or copy the object to the work tree file(s) that point to it.

Handling dropping was hard. If the user drops a file, but it's unlocked and modified, it shouldn't reset it to the pointer file. For this, I reused the InodeCache stuff that was built for direct mode. So the sqlite database tracks the InodeCaches of unlocked files, and when a key is dropped it can check if the file is modified.

But that's not a complete solution, because when git uses a clean filter, it will write the file itself, and git-annex won't have an InodeCache for it. To handle this case, git-annex will fall back to verifying the content of the file when dropping it if its InodeCache isn't known. Bit of a shame to need an expensive checksum to drop an unlocked file; maybe the git clean filter interface will eventually be improved to let git-annex use it more efficiently.
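The check described in the last two paragraphs can be sketched in Python as follows. The fields stored in the cache tuple (inode, size, mtime) and the use of SHA256 for the fallback are assumptions made for illustration; the real InodeCache type lives in git-annex's Haskell code.

```python
import hashlib
import os
from collections import namedtuple

# Assumed contents of an inode cache; cheap to obtain with one stat call.
InodeCache = namedtuple("InodeCache", "inode size mtime")

def inode_cache(path):
    st = os.stat(path)
    return InodeCache(st.st_ino, st.st_size, st.st_mtime_ns)

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def is_unmodified(path, known_cache, expected_sha256):
    """Decide whether a work tree file still holds the annexed content."""
    if known_cache is not None:
        # Fast path: compare against the cache recorded earlier.
        return inode_cache(path) == known_cache
    # No cache known (e.g. git wrote the file itself): verify the content.
    return sha256_of(path) == expected_sha256
```

Only when `is_unmodified` returns True would it be safe to replace the work tree file with a pointer on drop; the fallback branch is the expensive checksum mentioned above.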

Anyway, smudged aka unlocked files are working now well enough to be a proof of concept. I have several missing safety checks that need to be added to get the implementation to be really correct, and quite a lot of polishing still to do, including making unlock, lock, fsck, and merge handle them, and finishing repository upgrade code.

Posted Wed Dec 9 22:14:31 2015

Made a lot of progress today. Implemented the database mapping a key to its associated files. As expected this database, when updated by the smudge/clean filters, is not always consistent with the current git work tree. In particular, commands like git mv don't update the database with the new filename. So queries of the database will need to do some additional work first to get it updated with any staged changes. But the database is good enough for a proof of concept, I hope.

Then I got git-annex commands treating smudged files as annexed files. So this works:

joey@darkstar:~/tmp/new>git annex init
init  ok
(recording state in git...)
joey@darkstar:~/tmp/new>cp ~/some.mp3 .
joey@darkstar:~/tmp/new>git add some.mp3
joey@darkstar:~/tmp/new>git diff --cached
diff --git a/some.mp3 b/some.mp3
new file mode 100644
index 0000000..2df8868
--- /dev/null
+++ b/some.mp3
@@ -0,0 +1 @@
joey@darkstar:~/tmp/new>git annex whereis some.mp3
whereis some.mp3 (1 copy) 
    7de17427-329a-46ec-afd0-0a088f0d0b1b -- joey@darkstar:~/tmp/new [here]

get/drop don't yet update the smudged files, and that's the next step.

Posted Mon Dec 7 21:25:32 2015

I've gotten git-annex working as a smudge/clean filter today in the smudge branch. It works ok in a local git repository. git add lets git-annex decide if it wants to annex a file's content, and checking out branches and other git commands involving those files works pretty well.

It can sometimes be slow; git's smudge interface necessarily needs to copy the content of files around, particularly when checking out files, and so it's never going to be as fast as the good old git-annex symlink approach. Most of the slow parts are things that can't be done in direct mode repos though, like switching branches, so that isn't a regression.

No git-annex commands to manage the annexed content work yet. That will need a key to worktree file mapping to be maintained, and implementing that mapping and ensuring it's always consistent is probably going to be the harder part of this.

Also there's the question of how to handle upgrades from direct mode repositories. This will be an upgrade from annex.version 5 to 6, and you won't want to do it until all computers that have clones of a repository have upgraded to git-annex 6.x, since older versions won't be able to work with the upgraded repository. So, the repository upgrade will need to be run manually initially, and it seems I'll need to keep supporting direct mode for v5 repos in a transition period, which will probably be measured in years.

Posted Fri Dec 4 22:05:32 2015

Spent a couple of days catching up on backlog, and my backlog is down to 80 messages now. Lowest in recent memory.

Made the annex.largefiles config be honored by git annex import, git annex addurl, and even git annex importfeed.

Planning to dive into smudge filters soon. The design seems ready to go, although there is some complication in needing to keep track of mappings between worktree files and annex keys.

Posted Wed Dec 2 20:02:25 2015

I'm considering ways to get rid of direct mode, replacing it with something better implemented using smudge filters.


I started by trying out git-lfs, to see what I can learn from it. My feeling is that git-lfs brings an admirable simplicity to using git with large files. For example, it uses a push-hook to automatically upload file contents before pushing a branch.

But its simplicity comes at the cost of being centralized. You can't make a git-lfs repository locally and clone it onto another drive and have the local repositories interoperate to pass file contents around. Everything has to go back through a centralized server. I'm willing to pay complexity costs for decentralization.

Its simplicity also means that the user doesn't have much control over what files are present in their checkout of a repository. git-lfs downloads all the files in the work tree. It doesn't have facilities for dropping files to free up space, or for configuring a repository to only want to get a subset of files in the first place. Some of this could be added to it I suppose.

I also noticed that git-lfs uses twice the disk space, at least when initially adding files. It keeps a copy of the file in .git/lfs/objects/, in addition to the copy in the working tree. That copy seems to be necessary due to the way git smudge filters work, to avoid data loss. Of course, git-annex manages to avoid that duplication when using symlinks, and its direct mode also avoids that duplication (at the cost of some robustness). I'd like to keep git-annex's single local copy feature if possible.

replacing direct mode

Anyway, as smudge/clean filters stand now, they can't be used to set up git-annex symlinks; their interface doesn't allow it. But, I was able to think up a design that uses smudge/clean filters to cover the same use cases that direct mode covers now.

Thanks to the clean filter, adding a file with git add would check in a small file that points to the git-annex object.
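A clean filter of this kind boils down to "content in, small pointer out". The Python sketch below shows the shape of that transformation. The pointer layout (`/annex/objects/` followed by a key) and the `SHA256-s<size>--<hash>` key form are assumptions for the example, since this design-stage description does not show the format, and a real filter would also store the content in the annex.

```python
import hashlib

POINTER_PREFIX = b"/annex/objects/"  # assumed pointer layout

def pointer_for(content):
    """Build the small pointer that gets checked into git for this content."""
    key = "SHA256-s%d--%s" % (len(content), hashlib.sha256(content).hexdigest())
    return POINTER_PREFIX + key.encode() + b"\n"

def is_pointer(data):
    return data.startswith(POINTER_PREFIX)

def clean(data):
    """What a clean filter would emit for a work tree file's data."""
    # Already a pointer (content not present here): pass it through as-is.
    if is_pointer(data):
        return data
    return pointer_for(data)
```

Passing pointers through unchanged keeps the filter idempotent, so a file whose content is not present locally does not get re-wrapped when it is added again.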

In the same repository, you could also use git annex add to check in a git-annex symlink, which would protect the object from modification, in the good old indirect mode way. git annex lock and git annex unlock could switch a file between those two modes.

So this allows mixing directly writable annexed files and locked down annexed files in the same repository. All regular git commands and all git-annex commands can be used on both sorts of files. Workflows could develop where a file starts out unlocked, but once it's done, is locked to prevent accidental edits and archived away or published.

That's much more flexible than the current direct mode, and I think it will be able to be implemented in a simpler, more scalable, and robust way too. I can lose the direct mode merge code, and remove hundreds of lines of other special cases for direct mode.

The downside, perhaps, is that for a repository to be usable on a crippled filesystem, all the files in it will need to be unlocked. A file can't easily be unlocked in one checkout and locked in another checkout.

Posted Mon Nov 23 20:56:38 2015

Monday: Some finishing touches on the pid locking support, and released 5.20151116. After the release I noticed that concurrent downloads didn't always include a progress meter, and made the necessary changes to fix that.

Wednesday: This was a day of minor bug fixing and responding to questions etc. Message backlog got down below 90, not bad.

Thursday: I've been distracted from coding today with an idea of making some new stickers. Hexagonal this time, and even better, composable... So they can show git-annex getting as big as you want. ;)


The design is done, see stickers, and seems to work well, and even better is easy to modify. May find time to get these printed at some point.

Posted Thu Nov 19 23:09:41 2015

Got the pid locks working pretty easily, as expected.

But then... Detoured into some truly insane behavior of the Lustre filesystem. It seems that Lustre is perfectly happy to let link() succeed even when there's a file there that it would overwrite. Rather than overwriting the file, Lustre picks an even more crazy way to violate POSIX.. It lets there be 2 files in a directory with the same name, but different contents. Has to be seen to be believed:

hess$ ls pidlock
-r--r--r--  1 hess root    70 Nov 13 15:07 pidlock
-r--r--r--  1 hess root    70 Nov 13 15:07 pidlock
hess$ rm pidlock; ls pidlock
-r--r--r--  1 hess root    74 Nov 13 14:35 pidlock  

git-annex's pid locking code now detects this and seems to work even on Lustre. Eep.

I'm clutching my "NO WARRANTY" disclaimer pretty hard though, if anyone wants to use git-annex on Lustre. When POSIX is being violated this badly, it's hard to anticipate what other strangeness might result.

Posted Fri Nov 13 20:35:05 2015

Been working today on getting git-annex to fall back from nice posix fcntl locks to pid locks when the former are not supported. There will be an annex.pidlock setting to control this. Mostly useful, I think, for networked file systems like NFS and Lustre. While these do support posix locks, I guess it can be hard sometimes to get some big server configured appropriately, especially when you don't admin it and just want to use git-annex there.

Of course, the fun part about pid locks is that it can be pretty hard to tell if one is stale or not. Especially when using a networked filesystem, because then the pid in question can be running on a different computer.

Even if you do figure out that a pid lock is stale, how do you then take over a stale pid lock, without racing with another process that also wants to take it over? This was the truly tricky question of the day.

I have a possibly slightly novel approach to solve that: Put a more modern lock file someplace else (eg, /dev/shm) and use that lock file to lock the pid lock file. Then you can tell if a local pid lock file is stale quickly locally, and take it over safely. Of course, if the pid is not locked by a local process, this still has to fall back to the inevitable retry-and-timeout-and-fail.
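Here is a minimal Python sketch of that idea for the local case only, assuming a Unix system. It uses flock on a side lock file to serialize inspection and takeover of the pid lock; the function names and file layout are hypothetical, it stores only a pid (no hostname), and it does not implement the retry-and-timeout fallback for locks held on other machines.

```python
import fcntl
import os

def pid_alive(pid):
    """Check whether a local process with this pid exists."""
    try:
        os.kill(pid, 0)  # signal 0: existence check only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True

def read_pid(pidlock):
    try:
        with open(pidlock) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None

def try_take_pid_lock(pidlock, sidelock):
    """Take the pid lock if it is free or stale; return True on success."""
    with open(sidelock, "a") as side:
        # Only one local process at a time may inspect or replace the pid lock.
        fcntl.flock(side, fcntl.LOCK_EX)
        holder = read_pid(pidlock)
        if holder is not None and holder != os.getpid() and pid_alive(holder):
            return False
        tmp = "%s.%d.tmp" % (pidlock, os.getpid())
        with open(tmp, "w") as f:
            f.write("%d\n" % os.getpid())
        os.replace(tmp, pidlock)
        return True
    # The side lock is released when the file is closed above.
```

Because the stale check and the takeover both happen while the side lock is held, two local processes cannot both conclude the lock is stale and both claim it.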

I hope the result will work pretty well, although git-annex will not support as fine-grained concurrency when using pid locks. Will find out tomorrow when I run today's code! ;)

Posted Thu Nov 12 22:25:14 2015

Some work today on improving the standalone linux builds and the git-annex-standalone.deb. Also, improved fsck's behavior when asked to fsck a dead repo, and fixed some places in the assistant where configured ssh-options were not used. Backlog is back down to 95.

Posted Tue Nov 10 21:11:55 2015

Finally concurrent progress bars are working! After all the groundwork, it was really easy to add, under a dozen lines of code.

I've found several bugs while testing commands in -Jn mode, and the rest of today was spent fixing them. Two of them affected concurrent git annex add; the worst narrowly avoided being a data loss bug.

Posted Fri Nov 6 20:03:15 2015

Spent my time today porting concurrent-output to Windows, fixing a tricky problem with error handling/thread joining with git-annex -J, and improving the concurrent state handling to support the git command queue. Got add/addurl working in concurrent mode. No concurrent progress bars yet.. maybe tomorrow?

Posted Thu Nov 5 22:57:52 2015

Got git-annex using concurrent-output today. It works beautifully. Since the library is new, git-annex has to be explicitly configured to use it, so it'll be a while until this is available in regular builds.

There are no progress bars yet in concurrent output mode, but that will change soon.. Probably tomorrow.

Posted Wed Nov 4 22:38:04 2015

Today started with getting a release of git-annex out, to deal with a new version of the aws library, which broke the build. That also added support to the S3 remotes for creating Google Nearline buckets, although only when git-annex is built with the newest version of the aws library.

Rest of the day (and most of the past weekend) I've been working on the concurrent-output library. Today I finished making it support multi-line regions, and color, and even fully optimised its console updates to use minimal bandwidth. So, it's got everything git-annex can possibly need to display those troublesome concurrent actions. Will be starting to make git-annex use it soon!

Posted Tue Nov 3 15:15:24 2015

Things have been relatively quiet on git-annex this week. I've been distracted with other projects. But, a library that I developed for propellor to help with concurrent console output has been rapidly developing into a kind of tiling region manager for the console, which may be just the thing git-annex needs on the concurrent download progress display front.

After seeing it could go that way, and working on it around the clock to add features git-annex will need, here's a teaser of its abilities.

Probably coming soonish to a git-annex -J near you!

Posted Sat Oct 31 01:47:24 2015

The first release of git-annex was 5 years ago.

There have been a total of 187 releases, growing to 50k lines of haskell code developed by 28 contributors (and another 10 or so external special remote contributors). Approximately 2000 people have posted questions, answers, bugs, todos, etc to this website, with 18900 posts in total.

I've been funded for 3 of the 5 years to work on git-annex, with support from 1451 individuals and 6 organizations.

Released a new version today with rather more significant changes than usual (see recent devblog entries).

The 2015 git-annex user survey is now live.

Posted Mon Oct 19 19:59:14 2015

Feeling kind of ready to cut the next release of git-annex, but am giving the recent large changes just a little time to soak in and make sure they're ok.

Yesterday, changed the order that git annex sync --content and the assistant do drops. When dropping from the local repo and also some remotes, it now makes more sense to drop from the remotes first, and only then the local repo. There are scenarios where that order lets content be dropped from all the places that it should be, while the reverse order doesn't.

Today, caught up on recent bug reports, including fixing a bad merge commit that was made when git merge failed due to filenames not supported by a crippled filesystem, and cleaning up a network transport warning that was displayed incorrectly. Also developed a patch to the aws library to support google nearline when creating buckets.

Posted Thu Oct 15 20:50:12 2015

Well, I've spent all week making git annex drop --from safe.

On Tuesday I got a sinking feeling in my stomach, as I realized that there was a hole in git-annex's armor to prevent concurrent drops from violating numcopies or even losing the last copy of a file. The bug involved an unlikely race condition, and for all I know it's never happened in real life, but still this is not good.

Since this is a potential data loss bug, expect a release pretty soon with the fix. And, there are 2 things to keep in mind about the fix:

  1. If a ssh remote is using an old version of git-annex, a drop may fail. Solution will be to just upgrade the git-annex on the remote to the fixed version.
  2. When a file is present in several special remotes, but not in any accessible git repositories, dropping it from one of the special remotes will now fail, where before it was allowed.

    Instead, the file has to be moved from one of the special remotes to the git repository, and can then safely be dropped from the git repository.

    This is a worrisome behavior change, but unavoidable.

Solving this clearly called for more locking, to prevent concurrency problems. But, at first I couldn't find a solution that would allow dropping content that was only located on special remotes. I didn't want to make special remotes need to involve locking; that would be a nightmare to implement, and probably some existing special remotes don't have any way to do locking anyway.

Happily, after thinking about it all through Wednesday, I found a solution, that while imperfect (see above) is probably the best one feasible. If my analysis is correct (and it seems so, although I'd like to write a more formal proof than the ad-hoc one I have so far), no locking is needed on special remotes, as long as the locking is done just right on the git repos and remotes. While this is not able to guarantee that numcopies is always preserved, it is able to guarantee that the last copy of a file is never removed. And, numcopies will always be preserved except for when this rare race condition occurs.

So, I've been implementing that all of yesterday and today. Getting it right involves building up 4 different kinds of evidence, which can be used to make sure that the last copy of a file can't possibly end up being dropped, no matter what other concurrent drops could be happening. I ended up with a very clean and robust implementation of this, and a 2,000 line diff.


Posted Fri Oct 9 22:06:16 2015

Lots of porting work ongoing recently:

  • I've been working with Goeke on building git-annex on Solaris/SmartOS. Who knows, this may lead to a binary distribution in some way, but to start with I got the disk free space code ported to Solaris, and have seen git-annex work there.
  • Jirib has also been working on that same disk free code, porting it to OpenBSD. Hope to land an updated patch for that.
  • Yury kindly updated the Windows autobuilder to a new Haskell Platform release, and I was able to land the winprocfix branch that fixes ssh password prompting in the webapp on Windows.
  • The arm autobuilder is fixed and back in its colo, and should be making daily builds again.

Posted Sun Oct 4 20:12:58 2015

While at the DerbyCon security conference, I got to thinking about verifying objects that git-annex downloads from remotes. This can be expensive for big files, so git-annex has never done it at download time, instead deferring it to fsck time. But, that is a divergence from git, which always verifies checksums of objects it receives. So, it violates least surprise for git-annex to not verify checksums too. And this could weaken security in some use cases.

So, today I changed that. Now whenever git-annex accepts an object into .git/annex/objects, it first verifies its checksum and size. I did add a setting to disable that and get back the old behavior: git config annex.verify false, and there's also a per-remote setting if you want to verify content from some remotes but not others.
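To make the check concrete, here is a minimal sketch in Python (not git-annex's actual Haskell code) of verifying size and checksum before accepting a download. The function name `verify_object` and the choice of SHA-256 are assumptions for illustration; the real backend depends on the key.

```python
import hashlib
import os

def verify_object(path, expected_size, expected_sha256, chunk_size=1 << 20):
    """Return True only if the file has the expected size and SHA-256 digest.

    Hypothetical helper; git-annex's real verification lives in Haskell.
    """
    # Cheap check first: a size mismatch means hashing can be skipped.
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in chunks so big files are never loaded into memory at once.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest() == expected_sha256
```

A caller would only move the downloaded file into the object store when this returns True.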

Posted Thu Oct 1 20:18:10 2015

I've mostly been chewing through old and new bug reports and support requests the past several days. The backlog is waaay low now -- only 82 messages! Just in time for me to go on another trip, to Louisville on Thursday.

Amazon S3 added an "Infrequent Access" storage class last week, and I got a patch into the haskell-aws library to support that, as well as partially supporting Google Nearline. That patch was accepted today, and git-annex is ready to use the new version of the library as soon as it's released.

At the end of today, I found myself rewriting git annex status to parse and adjust the output of git status --short. This new method makes it much more capable than before, including displaying Added files.
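As a rough illustration of the parsing side, the sketch below splits `git status --short` output into status codes and paths. It is Python rather than the Haskell used in git-annex, the function names are made up, and it ignores quoted paths and renames.

```python
def parse_status_short(output):
    """Parse `git status --short` output into (status, path) pairs.

    Each line is two status characters, a space, then the path.
    Simplification: quoted paths and "old -> new" renames are not handled.
    """
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # too short to hold a status and a path
        entries.append((line[:2], line[3:]))
    return entries

def added_files(entries):
    """Paths whose index status is A (added)."""
    return [path for status, path in entries if status[0] == "A"]
```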

Posted Tue Sep 22 21:47:44 2015

Made the release this morning, first one in 3 weeks. A fair lot of good stuff in there.

Just in time for the release, git-annex has support for Ceph. Thanks to Mesar Hameed for building the external special remote!

Posted Thu Sep 17 03:36:26 2015

Seems that Git for Windows was released a few weeks ago, replacing msysgit. There were a couple problems using git-annex with that package of git, which I fixed on Thursday. The next release of git-annex won't work with msysgit any longer though; only with Git for Windows.

On Friday, I improved the Windows package further, making it work even when git is not added to the system PATH. In such an installation, git-annex will now work inside the "git bash" window, and I even got the webapp starting from the menu working without git in PATH.

In other dependency fun, the daily builds for Linux got broken due to a glibc bug in Debian unstable/testing, which makes the bundled curl and ssh segfault. With some difficulty I tracked that down, and it turns out the bug has been fixed upstream for quite a while. The daily builds are now using the fixed glibc 2.21.

Today, got back to making useful improvements, rather than chasing dependencies. Improved the bash completion for remotes and backends, made annex.hardlink be used more, and made special remotes that are configured with autoenable=true get automatically enabled by git annex init.

Posted Mon Sep 14 19:10:05 2015

Today was a scramble to get caught up after weeks away. Got the message backlog down from over 160 to 123. Fixed two reversions, worked around a strange bug, and implemented support for the gpg.program configuration, and made several smaller improvements.

Posted Wed Sep 9 22:12:16 2015

Did some work on Friday and Monday to let external special remotes be used in a readonly mode. This lets files that are stored in the remote be downloaded by git-annex without the user needing to install the external special remote program. For this to work, the external special remote just has to tell git-annex the urls to use. This was developed in collaboration with Benjamin Gilbert, who is developing gcsannex, a Google Cloud Storage special remote.

Today, got caught up with recent traffic, including fixing a couple of bugs. The backlog remains in the low 90's, which is a good place to be as I prepare for my August vacation week in the SF Bay Area, followed by a week for ICFP and the Haskell Symposium in Vancouver.

Posted Wed Aug 19 19:10:56 2015

Been doing a little bit of optimisation work. Which meant, first improving the --debug output to show fractions of a second, and show when commands exit.

That let me measure what takes up time when downloading files from ssh remotes. Found one place I could spawn a thread to run a cleanup action, and this simple change reduced the non-data-transfer overhead to 1/6th of what it had been!
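The general shape of that change, sketched in Python with hypothetical names (the actual code is Haskell and the specific cleanup action is not shown here): return the result as soon as the transfer is done, and let the cleanup run in its own thread.

```python
import threading

def transfer_then_cleanup(transfer, cleanup):
    """Run transfer, then start cleanup in a background thread.

    Returns the transfer result immediately, plus the thread so the
    caller can join it before exiting.
    """
    result = transfer()
    worker = threading.Thread(target=cleanup)
    worker.start()
    return result, worker
```

The caller still has to join the worker at some point, or the cleanup may be cut off when the process exits.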

Posted Thu Aug 13 20:23:51 2015

Catching up on weekend's traffic, and preparing for a release tomorrow.

Found another place where the optparse-applicative conversion broke some command-line parsing; using git-annex metadata to dump metadata recursively got broken. This is the second known bug caused by that transition, which is not too surprising given how large it was.

Tracked down and fixed a very tricky encoding problem with metadata values.

The arm autobuilder broke so it won't boot; got a serial console hooked up to it and looks like a botched upgrade resulting in a udev/systemd/linux version mismatch.

Posted Tue Aug 11 23:24:16 2015

The SHA-3 specification was released yesterday; git-annex got support for using SHA-3 hashes today. I had to add support for building with the new cryptonite library, as cryptohash doesn't (correctly) implement SHA-3 yet. Of course, nobody is likely to find a use for this for years, since SHA-2 is still perfectly fine, but it's nice to get support for new hashes in early. :)

Posted Thu Aug 6 22:34:16 2015

Took a half day and worked on making it simpler to set up ssh remotes. The complexity I've gotten rid of is that there's no longer any need to take any action to get a ssh remote initialized as a git-annex repository. Where before, either git-annex init needed to be run on the remote, or a git-annex branch manually pushed to it, now the remote can simply be added and git annex sync will do the rest. This needed git-annex-shell changes, so will only work once servers are upgraded to use a newer version of git-annex.

Posted Wed Aug 5 18:41:44 2015

Ended up spending most of today working on git annex proxy. It had a lot of buggy edge cases, which are all cleaned up now.

Spent another couple hours catching up on recent traffic and fixing a couple other misc bugs.

Posted Tue Aug 4 20:28:05 2015

Work today has started in the git-annex bug tracker, but the real bugs were elsewhere. Got a patch into hinotify to fix its handling of filenames received from inotify events when used in a non-unicode locale. Tracked down why gitlab's git-annex-shell fails to initialize gcrypt repositories, and filed a bug on gitlab-shell.

Yesterday, I got the Android autobuilder fixed. I had started upgrading it to new versions of yesod etc, 2 months ago, and something in those new versions led to character encoding problems that broke the template haskell splicing. Had to throw away the work done for that upgrade, but at least it's building again, at last.

Posted Mon Aug 3 19:43:14 2015

Made a release this morning, mostly because the release earlier this week turns out to have accidentally removed several options from git annex copy.

Spent some time this afternoon improving how git-annex shuts down when --time-limit is used. This used to be a quick and dirty shutdown, similar to if git-annex were ctrl-c'd, but I reworked things so it does a clean shutdown, including running any buffered git commands. This made incremental fsck with --time-limit resume much better, since it saves the incremental fsck database on shutdown. Also tuned when the database gets checkpointed during an incremental fsck, to resume better after it's interrupted.

Posted Fri Jul 31 20:57:10 2015

Made a release today, with recent work, including the optparse-applicative transition and initial GitLab support in the webapp.

I had time before the release to work out most of the wrinkles in the GitLab support, but was not able to get gcrypt encrypted repos to work with gitlab, for reasons that remain murky. Their git-annex-shell seems to be misbehaving somehow. Will need to get some debugging assistance from the developers to figure that out.

Posted Mon Jul 27 20:29:55 2015

I've been working on adding GitLab support to the webapp for the past 3 days.

That's not the only thing I've been working on; I've continued to work on the older parts of the backlog, which is now shrunk to 91 messages, and made some minor improvements and bugfixes.

But, GitLab support in the webapp has certainly taken longer than I'd have expected. Only had to write 82 lines of GitLab specific code so far, but it went slowly. The user will need to cut and paste repository url and ssh public key back and forth between the webapp and GitLab for now. And the way GitLab repositories use git-annex makes it a bit tricky to set up; in one case the webapp has to do a forced push dry run to check if the repository on GitLab can be accessed by ssh.

I found a way to adapt the existing code for setting up a ssh server to also support GitLab, so beyond the repo url prompt and ssh key setup, everything will be reused. I have something that works now, but there are lots of cases to test (encrypted repositories, enabling existing repositories, etc), so will need to work on it a bit more before merging this feature.

Also took some time to split the centralized git repository tutorial into three parts, one for each of GitHub, GitLab, and self-administered servers.

The git-annex package in Debian unstable hasn't been updated for 8 months. This is about to change; Richard Hartmann has stepped up and is preparing an upload of a recent version. Yay!

Posted Wed Jul 22 22:04:21 2015

Worked on bash tab completion some more. Got "git annex" to also tab complete. However, for that to work perfectly when using bash-completion to demand-load completion scripts, a small improvement is needed in git's own completion script, to have it load git-annex's completion script. I sent a patch for that to the git developers, and hopefully it'll get accepted soon.

Then fixed a relatively long-standing bug that prevented uploads to chunked remotes from resuming after the last successfully uploaded chunk.
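The resume logic can be sketched like this, in Python with hypothetical names. It assumes fixed-size chunks numbered from 0 and a `has_chunk` callback that asks the remote whether a chunk is present; how git-annex really names and probes chunks is not shown here.

```python
def resume_point(total_size, chunk_size, has_chunk):
    """Find where an interrupted chunked upload should pick up again.

    Returns (index, offset) of the first chunk the remote lacks.
    An offset equal to total_size means nothing is left to upload.
    """
    index = 0
    offset = 0
    # Walk forward while the remote already has each chunk in turn.
    while offset < total_size and has_chunk(index):
        index += 1
        offset += chunk_size
    return index, min(offset, total_size)
```

With a 10 byte file, 4 byte chunks, and chunks 0 and 1 already stored, this resumes at chunk 2, offset 8.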

Posted Thu Jul 16 19:19:20 2015

Worked through the rest of the changes this weekend and morning, and the optparse-applicative branch has landed in master, including bash completion support.

Posted Mon Jul 13 20:57:53 2015

Day 3 of the optparse-applicative conversion.
116 files changed, 1607 insertions(+), 1135 deletions(-)
At this point, everything is done except for around 20 sub-commands. Probably takes 15 minutes work for each. Will finish plowing through it in the evenings.

Meanwhile, made the release of version 5.20150710. The Android build for this version is not available yet, since I broke the autobuilder last week and haven't fixed it yet.

Posted Fri Jul 10 21:58:14 2015

Now working on converting git-annex to use optparse-applicative for its command line parsing. I've wanted to do this for a long time, because the current code for options is generally horrible, runs in IO, and is not at all type safe, while optparse-applicative has wonderful composable parsers and lets each subcommand have its own data type representing all its options.

What pushed me over the edge is that optparse-applicative has automatic bash completion!

# source <(git-annex --bash-completion-script `which git-annex`)
# git-annex fsck -
--all                   --key                   -S
--from                  --more                  -U

Since nobody has managed to write a full bash completion for git-annex before, let alone keep it up-to-date with changes to the code, automating the problem away is a really nice win. :)

The conversion is a rather huge undertaking; the diff is already over 3000 lines large after 8 hours of work, and I'm maybe 1/3rd done, with the groundwork laid (except for global options still todo) and a few subcommands converted. This won't land for this week's release; it'll need a lot of testing before it'll be ready for any release.

Posted Thu Jul 9 00:25:19 2015

Mostly spent today getting to older messages in the backlog. This did result in a few fixes, but with 97 old messages left, I can feel the diminishing returns setting in, to try to understand old bug reports that are often unclear or lacking necessary info to reproduce them.

By the way, if you feel your bug report or question has gotten lost in my backlog, the best thing to do is post an update to it, and help me reproduce it, or clarify it.

Moved on to looking through todo, which was a more productive way to find useful things to work on.

Best change made today is that git annex unused can now be configured to look at the reflog. So, old versions of files are considered still used until the reflog expires. If you've wanted a way to only delete (or move away) unused files after they get to a certain age, this is a way to do that ...

Posted Tue Jul 7 21:38:16 2015

Now caught up on nearly all of my backlog of messages, and indeed am getting to some messages that have been waiting for months. Backlog is down to 113! Couple of bugfixes resulted, and many questions answered.

Think I'll spend a couple more days dealing with the older part of the backlog. Then, when that reaches diminishing returns, I'll move on to some big change. I have been thinking about caching database on and off..

Posted Mon Jul 6 20:46:38 2015

Back, and have spent all day focusing on new bug reports. All told, I fixed 4 bugs, followed up on all other bugs reported while I was away, and fixed the android autobuilder.

The message backlog started the day at 250 or something, and is down to 178 now. Looks like others have been following up to forum posts while I was away (thanks!) so those should clear quickly.

Posted Fri Jul 3 03:09:38 2015

Well, not the literal last push, but I've caught up on as much backlog as I can (142 messages remain) and spent today developing a few final features before tomorrow's release.

Some of the newer things displayed by git annex info were not included in the --json mode output. The json includes everything now.

git annex sync --all --content will make it consider all known annexed objects, not only those in the current work tree. By default that syncs all versions of all files, but of course preferred content can tune what repositories want.

To make that work well with preferred content settings like "include=*.mp3", it makes two passes. The first pass is over the work tree, so preferred content expressions that match files by name will work. The second pass is over all known keys, and preferred content expressions that don't care about the filename can match those keys.

Two passes feels a bit like a hack, but it's a lot better than --all making nothing be synced when a preferred content expression matches against filenames... I actually had to resort to bloom filters to make the two passes work.
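Here is a small Python sketch of how two passes and a bloom filter can fit together. It is a guess at the general shape, not the actual Haskell implementation: the names `BloomFilter`, `two_pass_sync` and `wants` are hypothetical, and the filter sizing is arbitrary.

```python
import hashlib

class BloomFilter:
    """Compact set membership with possible false positives, no false negatives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def two_pass_sync(worktree, all_keys, wants):
    """worktree: (filename, key) pairs; wants(filename_or_None, key) -> bool."""
    seen = BloomFilter()
    to_sync = []
    # Pass 1: work tree files, where filename-based expressions can match.
    for filename, key in worktree:
        seen.add(key)
        if wants(filename, key):
            to_sync.append(key)
    # Pass 2: every known key, skipping those already handled in pass 1.
    for key in all_keys:
        if key in seen:
            continue
        if wants(None, key):
            to_sync.append(key)
    return to_sync
```

The filter keeps memory use small when there are very many keys. The price is that a rare false positive makes pass 2 skip a key that was never seen in pass 1.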

This new feature led to some slightly tricky follow-on changes to the standard groups preferred content expressions.

Posted Tue Jun 16 22:56:34 2015

Ever since git annex fsck --all was added, people have complained that there's no way to stop it complaining about keys whose content is gone for good. Well, there is now: git annex dead --key can be used when you know that a key is no longer available and want fsck to stop complaining about it.

Running fsck on a directory will intentionally still complain about files in the directory with missing contents, even if the keys have been marked dead.

The crucial part was finding a good way to store the information; luckily location log files are parsed in a way that lets it be added there without breaking backwards compatibility. A bonus is that adding a key's content back to the annex will automatically bring it back from the dead.

I'm pondering making git annex drop --force automatically mark a key as dead when the last copy is dropped, but I don't know if it's too DWIM or worth the complication. Another approach would be to let fsck mark keys as dead, but that would certainly need an extra flag.

Posted Tue Jun 9 19:59:26 2015

Now git-annex can be used to set up a public S3 remote. If you've cloned a repository that knows about such a remote, you can use the S3 remote without needing any S3 credentials. Read-only of course.

This tip shows how to do it: public Amazon S3 remote

One rather neat way to use this is to configure the remote with encryption=shared. Then, the files stored in S3 will be encrypted, and anyone with access to the git repository can get and decrypt the files.

This feature will work for at least AWS S3, and for the Internet Archive's S3. It may work for other S3 services, that can be configured to publish their files over unauthenticated http. There's a publicurl configuration setting to allow specifying the url when using a service that git-annex doesn't know the url for.

Actually, there was a hack for the IA before, that added the public url to an item when it was uploaded to the IA. While that hack is now not necessary, I've left it in place for now, to avoid breaking anything that depended on it.

Posted Fri Jun 5 20:39:24 2015

Worked thru some backlog. Currently stands at 152 messages.

Merged work from Sebastian Reuße to teach the assistant to listen for systemd-networkd dbus events when the network connection changes.

Added git annex get --incomplete, which can be used to resume whatever it was you were downloading earlier and interrupted, that you've forgotten about. ;)

The Isuma Media Players project is using git-annex to "create a two-way, distributed content distribution network for communities with poor connexions to the internet". My understanding is this involves places waaay up North.

Reading over their design docs is quite interesting, both to see how they've leveraged things like git-annex metadata and preferred content expressions and the assistant, and areas where git-annex falls short.

Between DataLad, Isuma, Baobáxia, IA.BAK, and more, there are a lot of projects being built on top of git-annex now!

Posted Tue Jun 2 19:54:36 2015

On Friday I installed the CubieTruck that is the new autobuilder for arm. This autobuilder is hosted at WetKnee Books, so its physical security includes a swamp.

The hardware is not fast, but it's faster and far more stable than qemu arm emulation. By Saturday I got the build environment all installed nicely, including building libraries that use template haskell!

But, ghc crashed with an internal error building git-annex. I upgraded to ghc 7.10.1 (which took another day), but it also crashed. Was almost giving up, but I looked at the ghc parameters, and -j2 stuck out in them. Removed the -j2, and the build works w/o crashing! \o/ (Filed a bug report on ghc.)

Anarcat has been working on improving the man pages, including lots of linking to related commands.

The 2015 Haskell Communities and Activities Report is out, and includes an entry for git-annex for the first time!

Posted Mon Jun 1 22:53:42 2015

After a less active than usual week (dentist), I made a release last Friday. Unfortunately, it turns out that the Linux standalone builds in that release don't include the webapp. So, another release is planned tomorrow.

Yesterday and part of today I dug into the windows ssh webapp password entry broken reversion. Eventually cracked the problem; it seems that different versions of ssh for Windows do different things in an isatty check, and there's a flag that can be passed when starting ssh to make it not see a controlling tty. However, this needs changes to the process library, which db48x and I have now coded up. So a fix for this bug is waiting on a new release of that library. Oh well.

Rest of today was catching up on recent traffic, and improving the behavior of git annex fsck when there's a disk IO error while checksumming a file. Now it'll detect a hardware fault exception, and take that to mean the file is bad, and move it to the bad files directory, instead of just crashing.

I need better tooling to create disk IO errors on demand. Yanking disks out works, but is a blunt instrument. Anyone know of good tools for that?

Posted Wed May 27 21:06:58 2015

There's something rotten in POSIX fcntl locking. It's not composable, or thread-safe.

The most obvious problem with it is that if you have 2 threads, and they both try to take an exclusive lock of the same file (each opening it separately) ... They'll both succeed. Unlike 2 separate processes, where only one can take the lock.

Then the really crazy bit: If a process has a lock file open and fcntl locked, and then the same process opens the lock file again, for any reason, closing the new FD will release the lock that was set using the other FD.

So, that's a massive gotcha if you're writing complex multithreaded code. Or generally for composition of code. Of course, C programmers deal with this kind of thing all the time, but in the clean world of Haskell, this is a glaring problem. We don't expect to need to worry about this kind of unrelated side effect that breaks composition and thread safety.

After noticing this problem affected git-annex in at least one place, I have to assume there could be more. And I don't want to need to worry about this problem forever. So, I have been working today on a clean fix that I can cleanly switch all my lock-related code to use.

One reasonable approach would be to avoid fcntl locking, and use flock. But, flock works even less well on NFS than fcntl, and git-annex relies on some fcntl locking features. On Linux, there's an "open file description locks" feature that fixes POSIX fcntl locking to not have this horrible wart, but that's not portable.

Instead, my approach is to keep track of which files the process has locked. If it tries to do something with a lockfile that it already has locked, it avoids opening the same file again, instead implements its own in-process locking behavior. I use STM to do that in a thread-safe manner.
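A minimal sketch of that bookkeeping, in Python with a plain mutex standing in for STM. The class and method names are hypothetical, and the step of actually taking the fcntl lock on first use is left out; this only shows the in-process part.

```python
import threading

class LockPool:
    """Track which lock files this process already holds.

    Assumption for illustration: a count > 0 means that many shared
    holders, -1 means one exclusive holder, and absence means unlocked.
    """

    def __init__(self):
        self._guard = threading.Lock()
        self._held = {}

    def try_take(self, path, exclusive):
        """Try to take the in-process lock; never reopens the lock file."""
        with self._guard:
            state = self._held.get(path, 0)
            if exclusive:
                if state != 0:
                    return False  # someone in this process holds it
                self._held[path] = -1
            else:
                if state < 0:
                    return False  # held exclusively in this process
                self._held[path] = state + 1
            return True

    def release(self, path):
        with self._guard:
            state = self._held.get(path, 0)
            if state <= 1:
                # Exclusive holder or last shared holder: drop the entry.
                self._held.pop(path, None)
            else:
                self._held[path] = state - 1
```

In a fuller version, the first successful take would open the file and fcntl lock it, and the release that removes the entry would close it, so the file is only ever open once per process.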

I should probably break out git-annex's lock file handling code as a library. Eventually.. This was about as much fun as a root canal, and I'm having a real one tomorrow. :-/

git-annex is now included in Stackage!

Daniel Kahn Gillmor is doing some work on reproducible builds of git-annex.

Posted Tue May 19 20:01:18 2015

Today I added a feature to git annex unused that lets the user tune which refs they are interested in using. Annexed objects that are used by other refs then are considered unused.

Did a fairly complicated refspec format for this, with globs and include/exclude of refs. Example:


I think that, since Google dropped openid support, there seems to have been less activity on this website. Although possibly also a higher signal to noise ratio. :) I have been working on some ikiwiki changes to make it easier for users who don't have an openid to contribute. So git-annex's website should soon let you log in and make posts with just an email address.

People sometimes ask for a git-annex mailing list. I wouldn't mind having one, and would certainly subscribe, but don't see any reason that I should be involved in running it.

Posted Thu May 14 19:57:19 2015

Implemented git annex drop --all. This also added, for free, drop with --unused and --key, which overlap with git annex dropunused and git annex dropkey.

The concurrentprogress branch had gone too long without being merged, and had a lot of merge conflicts. I resolved those, and went ahead and merged it into master. However, since the ascii-progress library is not ready yet, I made it a build flag, and it will build without it by default. So, git annex get -J5 can be used now, but no progress bars will display yet.

When doing concurrent downloads, either with the new -J or by hand by running multiple processes, there was a bug in the diskreserve checking code. It didn't consider the disk space that was in the process of being used by other concurrent downloads, so would let more downloads start up than there was space for.

I was able to fix this pretty easily, thanks to the transfer log files. Those were originally added just to let the webapp display transfers, but proved very helpful here!
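The corrected check amounts to a bit of arithmetic, sketched here in Python. The function name and the shape of the `in_progress` data are assumptions; in git-annex the equivalent information comes from the transfer log files.

```python
def can_start_download(free_bytes, reserve_bytes, new_size, in_progress):
    """Decide whether a new download fits without eating into the reserve.

    in_progress: (total_size, bytes_so_far) for each other running download.
    """
    # Space the other downloads have not used yet, but will.
    pending = sum(max(total - done, 0) for total, done in in_progress)
    return free_bytes - pending - new_size >= reserve_bytes
```

For example, with 100 bytes free and a 10 byte reserve, a 50 byte download is fine on its own, but not while another download still has 50 bytes to go.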

Finally, made .git/annex/transfer/failed/ files stop accumulating when the assistant is not being used. Looked into also cleaning up stale .git/annex/transfer/{upload,download}/ files (from interrupted transfers). But, since those are used as lock files, it's difficult to remove them in a concurrency safe way.

Update: Unfortunately, I turned out to have stumbled over an apparent bug in haskell's implementation of file locking. Had to work around that.

Happily, the workaround also let me implement cleanup of stale transfer info files, left behind when a git-annex process was interrupted. So, .git/annex/transfer/ will entirely stop accumulating cruft!

Posted Tue May 12 20:36:53 2015

Lazy afternoon spent porting git-annex to build under ghc 7.10. Required rather a lot of changes to build, and even more to build cleanly after the AMP transition.

Unfortunately, ghc 7.10 has started warning about every line that uses tab for indentation. I had to add additional cruft to turn those warnings off everywhere, and cannot say I'm happy about this at all.

Posted Sun May 10 20:33:21 2015

Got the release out after more struggling with ssh on windows and a last minute fix to the quvi support.

The git annex repository had accumulated 6 gb of past builds that were not publicly available. I am publishing those on the Internet Archive now, so past builds can be downloaded using git-annex in that repository in the usual way. This worked great! :)

I have ordered a CubieTruck with 2 gb of ram to use for the new Arm builder. Hosting still TBD.

Looks like git-annex is almost ready to be included in stackage, which will make building it from source much less likely to fail due to broken libraries etc.

Posted Fri May 8 23:01:03 2015

I've not been blogging, but have been busy this week. Backlog is down to 113 messages.

Tuesday: I got a weird bug report where git annex get was deleting a file. This turned out to be a bug in wget ftp://... where it would delete a symlink that was not where it had been told to download the file to. I put a workaround in git-annex; wget is now run in a temp directory. But this was a legitimate wget bug, and it's now been reported to the wget developers and will hopefully get fixed there.

Wednesday: Added a --batch mode for several plumbing commands (contentlocation, examinekey, and lookupkey). This avoids startup overhead, and so lets a lot of queries be done much faster. The implementation should make it easy to add --batch to more plumbing commands as needed, and could probably extend to non-plumbing commands too.
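The idea behind a batch mode is a simple read-answer loop inside one long-running process. The Python sketch below shows that loop with hypothetical names; it is not the git-annex implementation, and the exact output format of the real commands is not reproduced here.

```python
import sys

def run_batch(handler, infile=sys.stdin, outfile=sys.stdout):
    """Answer one query per input line, one response per output line.

    handler is any function from a query string to a response string.
    """
    for line in infile:
        outfile.write(handler(line.rstrip("\n")) + "\n")
        # Flush after every answer so a caller on a pipe is never left waiting.
        outfile.flush()
```

A program that needs thousands of lookups starts the process once and feeds it lines, instead of paying the startup cost for every query.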

Today: The first 5 hours involved an incompatible mess of ssh and rsync versions on Windows. A Gordian knot of brokenness and dependency hell. I finally found a solution which involves downgrading the cygwin rsync to an older version, and using msysgit's ssh rather than cygwin's.

Finished up today with more post-Debian-release changes. Landed a patch to switch from dataenc to sandi that had been waiting since 2013, and got sandi installed on all the git-annex autobuilders. Finished up with some prep for a release tomorrow.

Finally, Debian has a new enough ghc that it can build template haskell on arm! So, whenever a new version of git-annex finally gets into Debian (I hope soon), the webapp will be available on arm for those arm laptops. Yay!

This also means I have the opportunity to make the standalone arm build be done much more simply. Currently it involves qemu and a separate companion native mode container that it has to ssh to and build stuff, that has to have the same versions of all libraries. It's just enormously complicated and touchy. With template haskell building support, all that complexity can fall away.

What I'd really like to do is get a fast-ish arm box with 2gb of ram hosted somewhere, and use that to do the builds, in native mode. Anyone want to help provide such a box for git-annex arm autobuilds?

Posted Thu May 7 23:39:39 2015

Reduced activity this week (didn't work on the assistant after all), but several things got done:

Monday: Fixed fsck --fast --from remote to not fail when the remote didn't support fast copy mode. And dealt with an incompatibility in S3 bucket names; the old hS3 library supported upper-case bucket names but the new one needs them all in lower case.

Wednesday: Caught up on most recent backlog, made some improvements to error handling in import, and improved integration with KDE's file manager to work with newer versions.

Today: Made import --deduplicate/--clean-duplicates actively verify that enough copies of a file exist before deleting it. And, thinking about some options for batch mode access to git-annex plumbing, to speed up things that use it a lot.

Posted Thu Apr 30 19:53:43 2015

Posted a design for balanced preferred content. This would let preferred content expressions assign each file to N repositories out of a group, selected using Math. Adding a repository could optionally be configured to automatically rebalance the files (not very bandwidth efficiently though). I think some have asked for a feature like this before, so read the design and see if it would be useful for you.
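The design's actual selection rule is not given in this post. As one plausible way to pick N repositories per file "using Math", here is a Python sketch that swaps in rendezvous (highest random weight) hashing; the function name and inputs are hypothetical.

```python
import hashlib

def assign_repos(key, repos, n):
    """Pick n repositories for a key, deterministically and evenly.

    Rendezvous hashing: score every (repo, key) pair and keep the
    n highest scores. Not necessarily what the design specifies.
    """
    def score(repo):
        return hashlib.sha256(f"{repo}\0{key}".encode()).digest()
    return sorted(repos, key=score, reverse=True)[:n]
```

A property of this scheme is that adding a repository only ever moves files onto the new repository, never between the old ones, which limits how much rebalancing has to be transferred.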

Spent a while debugging a problem with a S3 remote, which seems to have been a misconfiguration in the end. But several improvements came out of it to make it easier to debug S3 in the future etc.

Posted Thu Apr 23 20:34:05 2015

I hope that today's git-annex release will be landing in Debian unstable toward the end of the month. And I'm looking forward to some changes that have been blocked by wanting to keep git-annex buildable on Debian 7.

Yesterday I got rid of the SHA dependency, switching git-annex to use a newer version of cryptohash for HMAC generation (which its author Vincent Hanquez kindly added to it when I requested it, waay back in 2013). I'm considering using the LambdaCase extension to clean up a lot of the code next, and there are 500+ lines of old yesod compatibility code I can eventually remove.

These changes and others will prevent backporting to the soon to be Debian oldstable, but the standalone tarball will still work there. And, the git-annex-standalone.deb that can be installed on any version of Debian is now available from the NeuroDebian repository, and its build support has been merged into the source tree.

In the run up to the release today, I also dealt with getting the Windows build tested and working, now that it's been updated to newer versions of rsync, ssh, etc from Cygwin. Had to add several more dlls to the installer. That testing also turned up a case where git-annex init could fail, which got a last-minute fix.

PS, scroll down this 10 year of git timeline and see what you find!

Posted Mon Apr 20 22:56:12 2015

Recent work has included improving fsck --from remote (and fixing a reversion caused by the relative path changes in January), and making annex.diskreserve be checked in more cases. And added a git annex required command for setting required content.

Also, I want to thank several people for their work:

  • Roy sent a patch to enable http proxy support.. despite having only learned some haskell by "30 mins with YAHT". I investigated that more, and no patch is actually necessary, but just a newer version of the http-client library.
  • CandyAngel has been posting lots of helpful comments on the website, including this tip that significantly speeds up a large git repository.
  • Øyvind fixed a lot of typos throughout the git-annex documentation.
  • Yaroslav has created a git-annex-standalone.deb package that will work on any system where debian packages can be installed, no matter how out of date it is (within reason), using the same methods as the standalone tarball.

Posted Sat Apr 18 20:15:19 2015

Mostly working on Windows recently. Fixed handling of git repos on different drive letters. Fixed crazy start menu loop. Worked around strange msysgit version problem.

Also some more work on the concurrentprogress branch, making the progress display prettier.

Added one nice new feature yesterday: git annex info $dir now includes a table of repositories that are storing files in the directory, with their sizes.

repositories containing these files: 
    288.98 MB: ca9c5d52-f03a-11df-ac14-6b772ffe59f9 -- archive-5
    288.98 MB: f1c0ce8d-d848-4d21-988c-dd78eed172e8 -- archive-8
     10.48 MB: 587b9ccf-4548-4d6f-9765-27faecc4105f -- darkstar
     15.18 kB: 42d47daa-45fd-11e0-9827-9f142c1630b3 -- origin

Nice thing about this feature is it's done for free, with no extra work other than a little bit of addition. All the heavy location lookup work was already being done to get the numcopies stats.

Posted Tue Apr 14 20:44:28 2015

Back working on git annex get --jobs=N today. It was going very well, until I realized I had a hard problem on my hands.

The hard problem is that the AnnexState structure at the core of git-annex cannot be shared among multiple threads at all. There's too much complicated mutable state going on in there for that to be feasible.

In the git-annex assistant, which uses many threads, I long ago worked around this problem, by having a single shared AnnexState and when a thread needs to run an Annex action, it blocks until no other thread is using it. This worked ok for the assistant, with a little bit of thought to avoid long-duration Annex actions that could stall the rest of it.

That won't work for concurrent get etc. I spent a while investigating maybe making AnnexState thread safe, but it's just not built for it. Too many ways that can go wrong. For example, there's a CatFileHandle in the AnnexState. If two threads are running, they can both try to talk to the same git cat-file --batch command at once, with bad results. Worse yet, some parts of the code do things like modifying the AnnexState's Git repo to add environment variables to use when running git commands.

It's not all gloom and doom though. Only very isolated parts of the code change the working directory or set environment variables. And the assistant has surely smoked out other thread concurrency problems already. And, separate git-annex programs can be run concurrently with no problems at all; it uses file locking to avoid different processes getting in each-others' way. So AnnexState is the only remaining obstacle to concurrency.

So, here's how I've worked around it: When git annex get -J10 is run, it will start by allocating 10 job slots. A fresh AnnexState will be created, and copied into each slot. Each time a job runs, it uses its slot's own AnnexState. This means 10 git cat-file processes, and maybe some contention over lock files, but generally, a nice, easy, and hopefully trouble-free multithreaded mode.
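
Here is a rough sketch of the job slot idea, in Python rather than the Haskell that git-annex is written in. `FakeState` and `run_jobs` are made-up names for illustration, and the "download" is just appending to a list; the point is only that each job borrows a slot and touches nothing but that slot's own state.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class FakeState:
    """Hypothetical stand-in for per-slot mutable state (not the real AnnexState)."""

    def __init__(self, slot_id):
        self.slot_id = slot_id
        self.files_done = []  # mutated without locks; only one job holds a slot at a time


def run_jobs(files, njobs):
    # Allocate one independent state per job slot up front.
    states = [FakeState(i) for i in range(njobs)]
    slots = queue.Queue()
    for state in states:
        slots.put(state)

    def job(filename):
        state = slots.get()  # borrow a slot; blocks if every slot is busy
        try:
            state.files_done.append(filename)  # pretend to get the file
        finally:
            slots.put(state)  # hand the slot back for the next job

    with ThreadPoolExecutor(max_workers=njobs) as pool:
        list(pool.map(job, files))
    return states


states = run_jobs([f"file{i}" for i in range(100)], njobs=10)
total = sum(len(s.files_done) for s in states)
```

The cost of this design is duplication (one state, and in git-annex's case one git cat-file process, per slot), traded for not having to make the shared state thread safe.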

And indeed, I've gotten git annex get -J10 working robustly! And from there it was trivial to enable -J for move and copy and mirror too!

The only real blocker to merging the concurrentprogress branch is some bugs in the ascii-progress library that make it draw very scrambled progress bars the way git-annex uses it.

Posted Fri Apr 10 21:17:29 2015

I've had to release git-annex twice this week to fix reversions. On Monday, just after I made a planned release, I discovered a bug in it, and had to update it with a .1 release. Today's release fixes 2 other reversions introduced by recent changes, both only affecting the assistant.

Before making today's release, I did a bunch of other minor bugfixes and improvements, including adding a new contentlocation plumbing command. This release also changes git annex add when annex.largefiles is configured, so it will git add the non-large files. That is particularly useful in direct mode.

I feel that the assistant needs some TLC, so I might devote a week to it in the latter part of this month. My current funding doesn't cover work on the assistant, but I should have some spare time toward the end of the month.

Posted Thu Apr 9 20:40:55 2015

Rethought distributed fsck. It's not really a fsck, but an expiration of inactive repositories, where fscking is one kind of activity. That insight let me reimplement it much more efficiently. Rather than updating all the location logs to prove it was active, git annex fsck can simply and inexpensively update an activity log. It's so cheap it'll do it by default. The git annex expire command then reads the activity log and expires (or unexpires) repositories that have not been active in the desired time period. Expiring a repository simply marks it as dead.
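
A minimal sketch of the expiry decision, in Python. The real activity log format is not shown here, so this assumes a plain mapping from repository UUID to the time of its last recorded activity; `expire` and `DAY` are illustrative names only.

```python
import time

DAY = 24 * 60 * 60  # seconds


def expire(activity_log, max_age_days, now=None):
    """Map each repository to "dead" or "alive" based on its last activity.

    activity_log: {uuid: unix time of last recorded activity} (assumed format).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * DAY
    return {
        uuid: ("dead" if last_active < cutoff else "alive")
        for uuid, last_active in activity_log.items()
    }
```

Because the status is recomputed from the log each time, a repository that becomes active again simply comes out as "alive" on the next run, which is the unexpire case.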

Yesterday, finished making --quiet really be quiet. That sounds easy, but it took several hours. On the concurrentprogress branch, I have ascii-progress hooked up and working, but it's not quite ready for prime time.

Posted Sun Apr 5 16:59:20 2015

I've started work on parallel get. Today, laid the groundwork in two areas:

  1. Evaluated the ascii-progress haskell library. It can display multiple progress bars in the terminal, portably, and its author Pedro Tacla Yamada has kindly offered to improve it to meet git-annex's needs.

    I ended up filing 10 issues on it today, around 3 of them are blockers for git-annex using it.

  2. Worked on making --quiet more quiet. Commands like rsync and wget need to have their progress output disabled when run in parallel.

    Didn't quite finish this yet.

Yesterday I made some improvements to how git-annex behaves when it's passed a massive number of directories or files on the command line. Eg, when driven by xargs. There turned out to be some bugs in that scenario.

One problem I kind of had to paper over. While git-annex get normally is careful to get the files in the same order they were listed on the command line, it becomes very expensive to expand directories using git-ls-files, and reorder its output to preserve order, when a large number of files are passed on the command line. There was an O(N*M) time blowup.

I worked around it by making it only preserve the order of the first 100 files. Assumption being that if you're specifying so many files on the command line, you probably have less of an attachment to their ordering. :)

Posted Fri Apr 3 21:02:06 2015

Added two options to git annex fsck that allow for a form of distributed fsck. This is useful in situations where repositories cannot be trusted to continue to exist, and cannot be checked directly, but you'd still like to keep track of their status. iabackup is one use case for this.

By running a periodic fsck with the --distributed option, the repositories can verify that they still exist and that the information about their contents is still accurate. This is done by doing an extra update of the location log each time a file is verified by fsck to still be in the repository.

The other option looks like --expire="30d somerepo:60d". It checks that each specified repository has recorded a distributed fsck within the specified time period. If not, the repository is dropped from the location tracking log. Of course it can always update that later if it's really still around.
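
To make the shape of that option value concrete, here is a small Python sketch that parses a string like the one above into a default period plus per-repository overrides. It only understands the "d" (days) suffix seen in the example, and `parse_expire` is a made-up name; the real parser may well accept more.

```python
def parse_expire(spec):
    """Parse e.g. "30d somerepo:60d" into (default_days, {repo: days})."""
    default_days = None
    per_repo = {}
    for word in spec.split():
        # "somerepo:60d" -> ("somerepo", ":", "60d"); "30d" -> ("", "", "30d")
        repo, sep, period = word.rpartition(":")
        if not period.endswith("d"):
            raise ValueError(f"unsupported period: {period!r}")
        days = int(period[:-1])
        if sep:
            per_repo[repo] = days
        else:
            default_days = days
    return default_days, per_repo


parsed = parse_expire("30d somerepo:60d")
```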

Distributed fsck is not the default because those extra location log updates increase the size of the git-annex branch. I did one thing to keep the size increase small: An identical line is logged for each key, including the timestamp, so git's delta compression will work as well as is possible. But, there's still commit and tree update overhead.

Probably doesn't make sense to run distributed fscks too often for that and other reasons. If the git-annex branch does get too large, there's always git annex forget ...

(Update: This was later rethought and works much more efficiently now..)

Posted Wed Apr 1 21:54:00 2015

Turns out that git has a feature I didn't know about; it will expand wildcards and other stuff in filenames passed to many git commands. This is on top of the shell's expansion.

That led to some broken behavior by git annex add 'foo.*' and, it could lead to other probably unwanted behavior, like git annex drop 'foo[barred]' dropping a file named food in addition to foo[barred]
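
The surprise is easy to reproduce outside of git. This Python sketch uses the standard fnmatch module as a stand-in for git's own pattern matcher (an assumption; the two are not identical, but they agree on bracket expressions like this one):

```python
from fnmatch import fnmatchcase

pattern = "foo[barred]"
names = ["food", "foo[barred]", "foob", "fooz"]

# As a glob, [barred] means "one character out of b, a, r, e, d".
matched_as_glob = [n for n in names if fnmatchcase(n, pattern)]

# As a literal filename, only the exact name matches.
matched_literally = [n for n in names if n == pattern]
```

Read as a glob, the pattern picks up food and foob but not the file that is literally named foo[barred]; a command that tries both interpretations ends up acting on all of them.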

For now, I've disabled this git feature throughout git-annex. If you relied on it for something, let me know, I might think about adding it back in specific places where it makes sense.

Improved git annex importfeed to check the itemid of the feed and avoid re-downloading a file with the same itemid. Before, it would add duplicate files if a feed kept the itemid the same, but changed the url. This was easier than expected because annex.genmetadata already caused the itemid to be stored in the git-annex metadata. I just had to make it check the itemid metadata, and set itemid even when annex.genmetadata isn't set.
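
In sketch form, the check amounts to keying on the itemid instead of the url. This Python version assumes feed items are dicts with "itemid" and "url" fields and that the previously seen itemids are available as a set; `new_items` is an illustrative name, and the real lookup goes through git-annex metadata instead.

```python
def new_items(feed_items, known_itemids):
    """Return only the items whose itemid has not been seen before."""
    fresh = []
    for item in feed_items:
        if item["itemid"] in known_itemids:
            continue  # same item, even if its url has changed
        known_itemids.add(item["itemid"])
        fresh.append(item)
    return fresh


seen = {"episode-1"}
feed = [
    {"itemid": "episode-1", "url": "http://example.com/moved/ep1.mp3"},
    {"itemid": "episode-2", "url": "http://example.com/ep2.mp3"},
]
fresh = new_items(feed, seen)
```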

Also got 4 other bug reports fixed, even though I feel I'm taking it easy today. It's good to be relaxed again!

Posted Tue Mar 31 20:02:08 2015

While I plowed through a lot of backlog the past several days, I still have some 120 messages piled deep.

That work did result in a number of improvements, culminating in a rather rushed release of version 5.20150327 today, to fix a regression affecting git annex sync when using the standalone linux tarballs. Unfortunately, I then had to update those tarballs a second time after the release as the first fix was incomplete.

And, I'm feeling super stressed out. At this point, I think I should step away until the end of the month. Unfortunately, this will mean more backlog later. Including lots of noise and hand-holding that I just don't seem to have time for if I want to continue making forward progress.

Maybe I'll think of a way to deal with it while I'm away. Currently, all I have is that I may have to start ignoring irc and the forum, and de-prioritizing bug reports that don't have either a working reproduction recipe or multiple independent confirmations that it's a real bug.

Posted Fri Mar 27 23:12:59 2015

While traveling for several days, I filled dead time with a rather massive reorganization of the git-annex man page, and I finished that up this morning.

That man page had gotten rather massive, at around 3 thousand lines. I split out 87 man pages, one for each git-annex command. Many of these were expanded with additional details, and have become a lot better thanks to the added focus and space. See for example, git-annex-find, or any of the links on the new git-annex man page. (Which is still over 1 thousand lines long..)

Also, git annex help <command> can be used to pull up a command's man page now!

I'm taking the rest of the day off to R&R from the big trip north, and expect to get back into the backlog of 143 messages starting tomorrow.

Posted Wed Mar 25 16:18:56 2015

Spent a couple of days at Dartmouth hanging out in the neuroscience department with the Datalad developers. Added several new plumbing commands and a new post-update-annex hook, based on their feedback of how they're using git-annex.

Posted Sat Mar 21 13:33:30 2015

Caught up with most of the recent backlog today. Was not very bad.

Fixed remotedaemon to support gcrypt remotes, which was never quite working before.

Seem to be on track to making a release tomorrow with a whole month's changes.

Posted Mon Mar 16 20:13:43 2015

After an intense week away, I didn't mean to work on git-annex today, but I got sucked back in..

Worked on some plumbing commands for mass repository creation. Made fromkey be able to read a stream of files to create from stdin. Added a new registerurl plumbing command, that reads a stream of keys and urls from stdin.

Posted Sun Mar 15 20:50:25 2015

Did a deep dive into ipfs last night. It has great promise.

As a first step toward using it with git-annex, I built an experimental ipfs special remote. It has some nice abilities; any ipfs address can be downloaded to a file in the repository:

git annex addurl ipfs:QmYgXEfjsLbPvVKrrD4Hf6QvXYRPRjH5XFGajDqtxBnD4W --file somefile

And, any file in the git-annex repository can be published to the world via ipfs, by simply using git annex copy --to ipfs. The ipfs address for the file is then visible in git annex whereis.

Had to extend the external special remote protocol slightly for that, so that ipfs addresses can be recorded as uris in git-annex, and will show up in git annex whereis.

Posted Thu Mar 5 21:07:40 2015

Fixed a mojibake bug that affected metadata values that included both whitespace and unicode characters. This was very fiddly to get right.

Finished up Monday's work to support submodules, getting them working on filesystems that don't support symlinks.

Posted Wed Mar 4 20:16:10 2015

This month is going to be a bit more random than usual where git-annex development is concerned.

  • On Saturday, the Seven Day Roguelike competition begins, and I will be spending a week building a game in haskell, to the exclusion of almost all other work.
  • On March 18th, I'll be at the Boston Haskell User's group. (Attending, not presenting.)
  • March 19-20, I'll be at Dartmouth visiting with the DataLad developers and learning more about what it needs from git-annex.
  • March 21-22, I'll be at the FSF's LibrePlanet conference at MIT.

Got started on the randomness today with this design proposal for using git-annex to back up the entire Internet Archive. This is something the Archive Team is considering taking on, and I had several hours driving and hiking to think about it and came up with a workable design. (Assuming large enough crowd of volunteers.)

Don't know if it will happen, but it was a useful thought problem to see how git-annex works, and doesn't work in this unusual use case.

One interesting thing to come out of that is that git-annex fsck does not currently make any record of successful fscks. In a very large distributed system, it can be useful to have successful fscks of an object's content recorded, by updating the timestamp in the location log to say "this repository still had the content at this time".

Posted Tue Mar 3 23:00:01 2015

I had thought that git-annex and git submodules couldn't mix. However, looking at it again, it turned out to be possible to use git-annex quite sanely in a submodule, with just a little tweaking of how git normally configures the repository. Details of this still experimental feature are in submodules.

There is still some work to be done to make git-annex work with submodules in repositories on filesystems that don't support symlinks.

Posted Mon Mar 2 20:45:19 2015

I'm snowed in, but keeping busy..

Developed a complete workaround for the sqlite SELECT ErrorBusy bug. So after a week, I finally have sqlite working robustly. And, I merged in the branch that uses sqlite for incremental fsck.

Benchmarking an incremental fsck --fast run, checking 40 thousand files, it used to take 4m30s using sticky bits, and using sqlite slowed it down by 10s. So one added second per 4 thousand or so files. I think that's ok. Incremental fsck is intended to be used in big repos, which are probably not checked in --fast mode, so the checksumming of files will by far swamp that overhead.

Also got sqlite and persistent installed on all the autobuilders. This was easier than expected, because persistent bundles its own copy of sqlite.

That would have been a good stopping place for the day's work.. But then I got to spend 5 more hours getting the EvilSplicer to support Persistent. Urgh. :-/

Now I can look forward to using sqlite for something more interesting than incremental fsck, like metadata caching for views, or the direct mode mappings. But, given all the trouble I had with sqlite, I'm going to put that off for a little while, to make sure that I've really gotten sqlite to work robustly.

Posted Sun Feb 22 23:54:47 2015

Today's release doesn't have the database branch merged of course, but it still has a significant amount of changes.

Developed a test case for the sqlite problem, that reliably reproduces it, and sent it to the sqlite mailing list. It seems that under heavy write load, when a new connection is made to the database, SELECT can fail for a little while. Once one SELECT succeeds, that database connection becomes solid, and won't fail any more (apparently). This makes me think there might be some connection initialization steps that don't end up finishing before the SELECT goes through in this situation. I should be able to work around this problem by probing new connections for stability, and probably will have to, since it'll be years before any bug fixed sqlite is available everywhere.
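
The probing workaround could look something like this, sketched with Python's sqlite3 module rather than Persistent. The function name, retry count, and delay are all assumptions made for illustration; the idea is just to keep trying a cheap SELECT on a fresh connection until one succeeds, and only then hand the connection out.

```python
import sqlite3
import time


def open_probed(path, attempts=50, delay=0.1):
    """Open a connection and wait until a trivial SELECT succeeds on it."""
    conn = sqlite3.connect(path)
    for _ in range(attempts):
        try:
            # A cheap read; busy/locked errors surface as OperationalError.
            conn.execute("SELECT 1 FROM sqlite_master LIMIT 1").fetchall()
            return conn
        except sqlite3.OperationalError:
            time.sleep(delay)
    conn.close()
    raise RuntimeError("database connection never became usable")
```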

I also noticed that current git-annex incremental parallel fsck doesn't really parallelize well; eg the processes do duplicate work. So, the database branch is not really a regression in this area.

Posted Thu Feb 19 22:45:55 2015

Breaking news: repositories now support git-annex!

A very nice surprise! More git hosters should do this..

Back to sqlite concurrency, I thought I had it dealt with, but more testing today has turned up a lot more problems with sqlite and concurrent writers (and readers).

First, I noticed that a process can be happily writing changes to the database, but if a second process starts reading from the database, this will make the writer start failing with BUSY, and keep failing until the second process goes idle. It turns out the solution to this is to use WAL mode, which prevents readers from blocking writers.
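
For reference, turning on WAL mode is a single pragma at the sqlite level. This sketch uses Python's sqlite3 module and a throwaway database file (the file name is arbitrary); getting the same pragma through persistent is a separate matter.

```python
import os
import sqlite3
import tempfile

# WAL needs a real file; in-memory databases stay in "memory" journal mode.
path = os.path.join(tempfile.mkdtemp(), "example.db")
conn = sqlite3.connect(path)

# The pragma returns the journal mode that is now in effect.
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
```

The setting is stored in the database file, so later connections to the same file also use WAL.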

After several hours (persistent doesn't make it easy to enable WAL mode), it seemed pretty robust with concurrent fsck.

But then I saw SELECT fail with BUSY. I don't understand why a reader would fail in WAL mode; that's counter to the documentation. My best guess is that this happens when a checkpoint is being made.

This seems to be a real bug in sqlite. It may only affect the older versions bundled with persistent.

Posted Wed Feb 18 21:57:07 2015

Worked today on making incremental fsck's use of sqlite be safe with multiple concurrent fsck processes.

The first problem was that having fsck --incremental running and starting a new fsck --incremental caused it to crash. And with good reason, since starting a new incremental fsck deletes the old database, the old process was left writing to a database that had been deleted and recreated out from underneath it. Fixed with some locking.

Next problem is harder. Sqlite doesn't support multiple concurrent writers at all. One of them will fail to write. It's not even possible to have two processes building up separate transactions at the same time. Before using sqlite, incremental fsck could work perfectly well with multiple fsck processes running concurrently. I'd like to keep that working.

My partial solution, so far, is to make git-annex buffer writes, and every so often send them all to sqlite at once, in a transaction. So most of the time, nothing is writing to the database. (And if it gets unlucky and a write fails due to a collision with another writer, it can just wait and retry the write later.) This lets multiple processes write to the database successfully.
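
Roughly, the buffering looks like this. It's a Python sqlite3 sketch, not the actual Haskell; the `BufferedWriter` class, the table name, and the batch size, retry count, and delay are assumptions made for illustration.

```python
import sqlite3
import time


class BufferedWriter:
    """Collect writes in memory and send them to sqlite in occasional batches."""

    def __init__(self, conn, flush_every=1000):
        self.conn = conn
        self.flush_every = flush_every
        self.pending = []
        conn.execute("CREATE TABLE IF NOT EXISTS fscked (key TEXT PRIMARY KEY)")
        conn.commit()

    def add(self, key):
        self.pending.append((key,))
        if len(self.pending) >= self.flush_every:
            self.flush()

    def flush(self, retries=10, delay=0.5):
        if not self.pending:
            return
        for _ in range(retries):
            try:
                with self.conn:  # one transaction for the whole batch
                    self.conn.executemany(
                        "INSERT OR IGNORE INTO fscked (key) VALUES (?)", self.pending
                    )
                self.pending = []
                return
            except sqlite3.OperationalError:
                time.sleep(delay)  # collided with another writer; try again later
        raise RuntimeError("could not write batch to database")
```

Until a flush happens, other processes cannot see the buffered keys, which is exactly the redundant-work problem described next.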

But, for the purposes of concurrent, incremental fsck, it's not ideal. Each process doesn't immediately learn of files that another process has checked. So they'll tend to do redundant work. Only way I can see to improve this is to use some other mechanism for short-term IPC between the fsck processes.

Also, I made git annex fsck --from remote --incremental use a different database per remote. This is a real improvement over the sticky bits; multiple incremental fscks can be in progress at once, checking different remotes.

Posted Tue Feb 17 21:13:13 2015

Yesterday I did a little more investigation of key/value stores. I'd love a pure haskell key/value store that didn't buffer everything in memory, and that allowed concurrent readers, and was ACID, and production quality. But so far, I have not found anything that meets all those criteria. It seems that sqlite is the best choice for now.

Started working on the database branch today. The plan is to use sqlite for incremental fsck first, and if that works well, do the rest of what's planned in caching database.

At least for now, I'm going to use a dedicated database file for each different thing. (This may not be as space-efficient due to lacking normalization, but it keeps things simple.)

So, .git/annex/fsck.db will be used by incremental fsck, and it has a super simple Persistent database schema:

  key SKey
  UniqueKey key

It was pretty easy to implement this and make incremental fsck use it. The hard part is making it both fast and robust.

At first, I was doing everything inside a single runSqlite action. Including creating the table. But, it turns out that runs as a single transaction, and if it was interrupted, this left the database in a state where it exists, but has no tables. Hard to recover from.

So, I separated out creating the database, made that be done in a separate transaction and fully atomically. Now fsck --incremental could be ctrl-c'd and resumed with fsck --more, but it would lose the transaction and so not remember anything had been checked.

To fix that, I tried making a separate transaction per file fscked. That worked, and it resumes nicely where it left off, but all those transactions made it much slower.

To fix the speed, I made it commit just one transaction per minute. This seems like an ok balance. Having fsck re-do one minute's work when restarting an interrupted incremental fsck is perfectly reasonable, and now the speed, using the sqlite database, is nearly as fast as the old sticky bit hack was. (Specifically, 6m7s old vs 6m27s new, fscking 37000 files from cold cache in --fast mode.)
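
The once-a-minute commit can be sketched like this, using Python's sqlite3 module in place of Persistent. The class and table names are invented for the example, and the clock is injectable only so the behaviour is easy to exercise without waiting a minute.

```python
import sqlite3
import time


class TimedCommitter:
    """Record checked keys, committing at most once per interval."""

    def __init__(self, conn, interval=60.0, clock=time.monotonic):
        self.conn = conn
        self.interval = interval
        self.clock = clock
        # Create the table in its own, separately committed transaction.
        conn.execute("CREATE TABLE IF NOT EXISTS fscked (key TEXT PRIMARY KEY)")
        conn.commit()
        self.last_commit = clock()

    def record(self, key):
        # The insert joins the currently open transaction.
        self.conn.execute("INSERT OR IGNORE INTO fscked (key) VALUES (?)", (key,))
        if self.clock() - self.last_commit >= self.interval:
            self.conn.commit()
            self.last_commit = self.clock()
```

An interrupted run loses only whatever was recorded since the last commit, so at most about one interval's worth of work gets redone.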

There is still a problem with multiple concurrent fsck --more failing. Probably a concurrent writer problem? And, some porting will be required to get sqlite and persistent working on Windows and Android. So the branch isn't ready to merge yet, but it seems promising.

In retrospect, while incremental fsck has the simplest database schema, it might be one of the harder things listed in caching database, just because it involves so many writes to the database. The other use cases are more read heavy.

Posted Mon Feb 16 21:16:38 2015

Spent a couple hours to make the ssh-options git config setting be used in more places. Now it's used everywhere that git-annex supports ssh caching, including the git pull and git push done by sync and by the assistant. Also the remotedaemon and the gcrypt, rsync, and ddar special remotes.

Posted Thu Feb 12 20:23:07 2015

Many more little improvements made yesterday and part of today. While it's only been a week since the last release, it feels almost time to make another one, after so many recent bug fixes and small improvements.

I've updated the roadmap. I have been operating without a roadmap for half a year, and it would be nice to have some plans. Keeping up with bug reports and requests as they come in is a fine mode of work, but it can feel a little aimless. It's good to have a planned out course, or at least some longer term goals.

After the next release, I've penciled in the second half of this month to work on the caching database.

Posted Wed Feb 11 20:56:55 2015

Plowing through the backlog today, and fixing quite a few bugs! Got the backlog down to 87 messages from ~140. And some of the things I got to were old and/or hard.

About a third of the day was spent revisiting git-annex branch shows commit with looong commitlog. I still don't understand how that behavior can happen, but I have a donated repository where it did happen. Made several changes to try to make the problem less likely to occur, and not as annoying when it does occur, and maybe get me more info if it does happen to someone again.

Posted Mon Feb 9 22:47:51 2015

Made a release yesterday, and caught up on most recent messages earlier this week. Backlog stands at 128 messages.

Had to deal with an ugly problem with /usr/bin/glacier today. Seems that there are multiple programs all using that name, some of them shipping in some linux distributions, and the one from boto fails to fail when passed parameters it doesn't understand. Yugh! I had to make git-annex probe to make sure the right glacier program is installed.

I'm planning to deprecate the glacier special remote at some point. Instead, I'd like to make the S3 special remote support the S3-glacier lifecycle, so objects can be uploaded to S3, set to transition to glacier, and then if necessary pulled back from glacier to S3. That should be much simpler and less prone to break.

But not yet; haskell-aws needs glacier support added. Or I could use the new amazonka library, but I'd rather stick with haskell-aws.

Some other minor improvements today included adding git annex groupwanted, which makes for easier examples than using vicfg, and making git annex import support options like --include and --exclude.

Also I moved many file matching options to only be accepted by the commands that actually use them. Of the remaining common options, most of them make sense for every command to accept (eg, --force and --debug). It would make sense to move --backend, --notify-start/finish, and perhaps --user-agent. Eventually.

Posted Fri Feb 6 21:31:17 2015

Today I put together a lot of things I've been thinking about:

  • There's some evidence that git-annex needs tuning to handle some unusual repositories. In particular very big repositories might benefit from different object hashing.
  • It's really hard to handle upgrades that change the fundamentals of how git-annex repositories work. Such an upgrade would need every git-annex user to upgrade their repository, and would be very painful. It's hard to imagine a change that is worth that amount of pain.
  • There are other changes some would like to see (like lower-case object hash directory names) that are certainly not enough to warrant a flag day repo format upgrade.
  • It would be nice to let people who want to have some flexibility to play around with changes, in their own repos, as long as they don't a) make git-annex a lot more complicated, or b) negatively impact others. (Without having to fork git-annex.)

This is discussed in more depth in new repo versions.

The solution, which I've built today, is support for tuning settings, when a new repository is first created. The resulting repository will be different in some significant way from a default git-annex repository, but git-annex will support it just fine.

The main limitations are:

  • You can't change the tuning of an existing repository (unless a tool gets written to transition it).
  • You absolutely don't want to merge repo B, which has been tuned in nonstandard ways, into repo A which has not. Or A into B. (Unless you like watching slow motion car crashes.)

I built all the infrastructure for this today. Basically, the git-annex branch gets a record of all tunings that have been applied, and they're automatically propagated to new clones of a repository.

And I implemented the first tunable setting:

git -c annex.tune.objecthashlower=true annex init

This is definitely an experimental feature for now. git-annex merge and similar commands will detect attempts to merge between incompatibly tuned repositories, and error out. But, there are a lot of ways to shoot yourself in the foot if you use this feature:

  • Nothing stops git merge from merging two incompatible repositories.
  • Nothing stops any version of git-annex older than today's from merging either.

Now that the groundwork is laid, I can pretty easily, and inexpensively, add more tunable settings. The next two I plan to add are already documented, annex.tune.objecthashdirectories and annex.tune.branchhashdirectories. Most new tunables should take about 4 lines of code to add to git-annex.

Posted Tue Jan 27 21:39:18 2015

Today I got the pre-commit-annex hook working on Windows. It turns out that msysgit runs hook scripts even when they're not executable, and it parses the #! line itself. Now git-annex does too, on Windows.
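
Parsing the #! line by hand is not much code. Here is a Python sketch of the general idea (not a transcription of what git-annex does); `parse_shebang` is a made-up name, and it splits the interpreter arguments on whitespace, which is a simplification.

```python
def parse_shebang(first_line):
    """Return [interpreter, args...] from a "#!" line, or None if there is none."""
    if not first_line.startswith("#!"):
        return None
    parts = first_line[2:].strip().split()
    return parts or None


cmd = parse_shebang("#!/bin/sh -e\n")
```

The hook would then be run as the parsed command followed by the path to the script, instead of relying on the operating system to honour the executable bit.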

Also, added a new chapter to the walkthrough, using special remotes. They clearly needed to be mentioned, especially to show the workflow of running initremote in one repository, then syncing another repository and running enableremote to enable the same special remote there.

Then more fun Windows porting! Turns out git-annex on Windows didn't handle files > 2 gb correctly; the way it was getting file size uses a too small data type on Windows. Luckily git-annex itself treats all file sizes as unbounded Integers, so I was easily able to swap in a getFileSize that returns correct values for large files.
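
To see why a too-small type bites at exactly this size, here is what happens to a 3 GiB size when squeezed into a signed 32-bit integer. Using 32 bits is an assumption made for the illustration; the sketch uses Python's ctypes, which wraps silently just as a C type would.

```python
import ctypes

size = 3 * 1024**3  # a 3 GiB file, in bytes

# ctypes integer types do no overflow checking, so the value wraps around.
as_int32 = ctypes.c_int32(size).value

# Anything up to 2**31 - 1 bytes (just under 2 GiB) still fits.
largest_ok = ctypes.c_int32(2**31 - 1).value
```

Python's own integers, like Haskell's Integer, are unbounded, so the true size survives as long as it is never narrowed on the way in.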

While I haven't blogged since the 13th and have not been too active until today, there are still a number of little improvements that have been done here and there.

Including a fix for an interesting bug where the assistant would tell the remotedaemon that the network connection has been lost, twice in a row, and this would make the remotedaemon fail to reconnect to the remote when the network came up. I'm not sure what situation triggers this bug (Maybe machines with 2 interfaces? Or maybe a double disconnection event for 1 interface?), but I was able to reproduce it by sending messages to the remotedaemon, and so fixed it.

Backlog is down to 118 messages.

Posted Tue Jan 20 21:36:12 2015

Got a release out today.

I'm feeling a little under the weather, so wanted something easy to do in the rest of the day that would be nice and constructive. Ended up going over the todo list. Old todos come in three groups; hard problems, already solved, and easy changes that never got done. I left the first group alone, closed many todos in the second group, and implemented a few easy changes. Including git annex sync -m and adding some more info to git annex info remote.

Posted Tue Jan 13 22:36:29 2015

Worked more on the relativepaths branch last night, and I am actually fairly happy with it now, and plan to merge it after I've run it for a bit longer myself.

It seems that I did manage to get a git-annex executable that is built PIE so it will work on Android 5.0. But all the C programs like busybox included in the Android app also have to be built that way. Arranging for everything to get built twice and with the right options took up most of today.

Posted Wed Jan 7 21:27:44 2015

git-annex internally uses all absolute paths all the time. For a couple of reasons, I'd like it to use relative paths. The best reason is, it would let a repository be moved while git-annex was running, without breaking. A lesser reason is that Windows has some crazy small limit on the length of a path (260 bytes?!), and using relative paths would avoid hitting it so often.
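
Both reasons show up in a tiny example. This Python sketch uses posixpath so it behaves the same everywhere; the repository location and object path are invented for the illustration.

```python
import posixpath

repo = "/home/user/some/deeply/nested/repo"
obj = posixpath.join(repo, ".git/annex/objects/xx/yy/KEY/KEY")

# The same file, named relative to the top of the repository.
rel = posixpath.relpath(obj, start=repo)

# If the repository is moved, the relative name still points at the right place.
moved_repo = "/mnt/elsewhere/repo"
obj_after_move = posixpath.join(moved_repo, rel)
```

The relative name is shorter by the length of the repository's own path, and it stays valid after a move. It is only meaningful relative to the right directory, though, which is why code that changes the current directory is a worry.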

I tried to do this today, in a relativepaths branch. I eventually got the test suite to pass, but I am very unsure about this change. A lot of random assumptions broke, and the test suite won't catch them all. In a few places, git-annex commands do change the current directory, and that will break with relative paths.

A frustrating day.

Posted Tue Jan 6 22:00:56 2015

I've finally been clued into why git-annex isn't working on Android 5, and it seems fixing it is as easy as pie.. That is, passing -pie -FPIE to the linker. I've added a 5.0 build to the Android autobuilder. It is currently untested, so I hope to get feedback from someone with an Android 5 device; a test build is now available.

I've been working through the backlog of messages today, and gotten down from 170 to 128. Mostly answered a lot of interesting questions, such as "Where to start reading the source code?"

Also did some work to make git-annex check git versions at runtime more often, instead of assuming the git version it was built against. It turns out this could be done pretty inexpensively in 2 of 4 cases, and one of the 2 fixed was the git check-attr behavior change, which could lead to git-annex add hanging if used with an old version of git.
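
Checking at runtime boils down to asking the installed git for its version and comparing. Here is a Python sketch of that; the function names are illustrative and `MIN_VERSION` is a placeholder, not the real cutoff for the check-attr change.

```python
import re
import subprocess


def parse_git_version(output):
    """Turn "git version 2.39.2" into (2, 39, 2); ignores any trailing suffix."""
    m = re.search(r"git version (\d+(?:\.\d+)*)", output)
    if not m:
        raise ValueError(f"unrecognised version string: {output!r}")
    return tuple(int(part) for part in m.group(1).split("."))


def runtime_git_version():
    """Ask whichever git is on PATH right now, not the one present at build time."""
    result = subprocess.run(
        ["git", "--version"], capture_output=True, text=True, check=True
    )
    return parse_git_version(result.stdout)


MIN_VERSION = (1, 8, 0)  # placeholder threshold, for illustration only


def new_enough(version):
    # Tuples compare element by element, so (1, 10) correctly sorts after (1, 9).
    return version >= MIN_VERSION
```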

Posted Mon Jan 5 21:13:03 2015

Took a holiday week off from git-annex development, and started a new side project building shell-monad, which might eventually be used in some parts of git-annex that generate shell scripts.

Message backlog is 165 and I have not dove back into it, but I have started spinning back up the development engines in preparation for new year takeoff.

Yesterday, added some minor new features -- git annex sync now supports git remote groups, and I added a new plumbing command setpresentkey for those times when you really need to mess with git-annex's internal bookkeeping. Also cleaned up a lot of build warning messages on OSX and Windows.

Today, first some improvements to make addurl more robust. Then the rest of the day was spent on Windows. Fixed (again) the Windows port's problem with rsync hating DOS style filenames. Got the rsync special remote fully working on Windows for the first time.

Best of all, got the Windows autobuilder to run the test suite successfully, and fixed a couple test suite failures on Windows.

Posted Tue Dec 30 21:52:44 2014

Spent a couple days adding a bittorrent special remote to git-annex. This is better than the demo external torrent remote I made on Friday: It's built into git-annex; it supports magnet links; it even parses aria2c's output so the webapp can display progress bars.

Besides needing aria2 to download torrents, it also currently depends on the btshowmetainfo command from the original bittorrent client (or bittornado). I looked into using instead, but that package is out of date and doesn't currently build. I've got a patch fixing that, but am waiting to hear back from the library's author.

There is a bit of a behavior change here; while before git annex addurl of a torrent file would add the torrent file itself to the repository, it now will download and add the contents of the torrent. I think/hope this behavior change is ok..

Posted Wed Dec 17 19:45:48 2014

Some more work on the interface that lets remotes claim urls for git annex addurl. Added support for remotes suggesting a filename to use when adding an url. Also, added support for urls that result in multiple files when downloaded. The obvious use case for that is an url to a torrent that contains multiple files.
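
One way to picture what a remote can now report about an url is a small sum type: either a single file with an optional suggested name, or several files each with its own url. This is only a sketch; the names `UrlContents`, `UrlMulti` and `plannedFiles` are assumptions for illustration and may not match the real interface.

```haskell
import Data.Maybe (fromMaybe)

type URLString = String

-- | What a remote reports about an url it claims. (Hypothetical sketch.)
data UrlContents
  = UrlContents (Maybe Integer) (Maybe FilePath)
    -- ^ a single file: optional size, optional suggested filename
  | UrlMulti [(URLString, Maybe Integer, FilePath)]
    -- ^ several files: per-file url, optional size, filename

-- | Which files addurl would create, paired with the url each one gets.
-- The fallback name is used when the remote suggests none, and as the
-- directory that holds the files of a multi-file url.
plannedFiles :: FilePath -> URLString -> UrlContents -> [(FilePath, URLString)]
plannedFiles fallback url (UrlContents _ suggested) =
  [(fromMaybe fallback suggested, url)]
plannedFiles fallback _ (UrlMulti items) =
  [(fallback ++ "/" ++ f, u) | (u, _, f) <- items]
```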

Then, got git annex importfeed to also check if a remote claims an url.

Finally, I put together a quick demo external remote using this new interface. git-annex-remote-torrent adds support for torrent files to git-annex, using aria2c to download them. It supports multi-file torrents, but not magnet links. (I'll probably rewrite this more robustly and efficiently in haskell sometime soon.)

Here's a demo:

# git annex initremote torrent type=external encryption=none externaltype=torrent
initremote torrent ok
(Recording state in git...)
# ls
# git annex addurl  --fast file:///home/joey/my.torrent
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   198  100   198    0     0  3946k      0 --:--:-- --:--:-- --:--:-- 3946k
addurl _home_joey_my.torrent/bar (using torrent) ok
addurl _home_joey_my.torrent/baz (using torrent) ok
addurl _home_joey_my.torrent/foo (using torrent) ok
(Recording state in git...)
# ls _home_joey_my.torrent/
bar@  baz@  foo@
# git annex get _home_joey_my.torrent/baz
get _home_joey_my.torrent/baz (from torrent...) 
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:-100   198  100   198    0     0  3580k      0 --:--:-- --:--:-- --:--:-- 3580k

12/11 18:14:56 [NOTICE] IPv4 DHT: listening on UDP port 6946

12/11 18:14:56 [NOTICE] IPv4 BitTorrent: listening on TCP port 6961

12/11 18:14:56 [NOTICE] IPv6 BitTorrent: listening on TCP port 6961

12/11 18:14:56 [NOTICE] Seeding is over.
12/11 18:14:57 [NOTICE] Download complete: /home/joey/tmp/tmp.Le89hJSXyh/tor

12/11 18:14:57 [NOTICE] Your share ratio was 0.0, uploaded/downloaded=0B/0B
Download Results:
gid   |stat|avg speed  |path/URI
71f6b6|OK  |       0B/s|/home/joey/tmp/tmp.Le89hJSXyh/tor/baz

Status Legend:
(OK):download completed.
(Recording state in git...)
# git annex find
# git annex whereis _home_joey_my.torrent/baz
whereis _home_joey_my.torrent/baz (2 copies) 
    1878241d-ee49-446d-8cce-041c46442d94 -- [torrent]
    52412020-2bb3-4aa4-ae16-0da22ba48875 -- joey@darkstar:~/tmp/repo [here]

  torrent: file:///home/joey/my.torrent#2
Posted Thu Dec 11 22:20:57 2014

Worked on extensible addurl today. When git annex addurl is run, remotes will be asked if they claim the url, and whichever remote does will be used to download it, and location tracking will indicate that remote contains the object. This is a massive 1000 line patch touching 30 files, including follow-on changes in rmurl and whereis and even rekey.

It should now be possible to build an external special remote that handles *.torrent and magnet: urls and passes them off to a bittorrent client for download, for example.

Another use for this would be to make an external special remote that uses youtube-dl or some other program than quvi for downloading web videos. The builtin quvi support could probably be moved out of the web special remote, to a separate remote. I haven't tried to do that yet.
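
The claiming logic can be sketched roughly as follows: each remote is asked in turn whether it claims the url, and the web remote is the fallback when none does. The record fields and remote definitions here are assumptions made for illustration, not the actual git-annex types.

```haskell
import Data.List (find, isPrefixOf, isSuffixOf)
import Data.Maybe (fromMaybe)

type URLString = String

-- | Just enough of a remote for this sketch. (Hypothetical record.)
data Remote = Remote
  { remoteName :: String
  , claimsUrl :: URLString -> Bool
  }

-- | The web remote is the fallback and never needs to claim anything.
webRemote :: Remote
webRemote = Remote { remoteName = "web", claimsUrl = const False }

-- | An imagined torrent remote that claims magnet links and .torrent urls.
torrentRemote :: Remote
torrentRemote = Remote
  { remoteName = "torrent"
  , claimsUrl = \u -> "magnet:" `isPrefixOf` u || ".torrent" `isSuffixOf` u
  }

-- | Pick the first remote that claims the url, else fall back to the web.
remoteForUrl :: [Remote] -> URLString -> Remote
remoteForUrl remotes url =
  fromMaybe webRemote (find (\r -> claimsUrl r url) remotes)
```

Whichever remote is picked is then the one recorded by location tracking as containing the object.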

Posted Mon Dec 8 23:17:35 2014

Today's release has a month's accumulated changes, including several nice new features: git annex undo, git annex proxy, git annex diffdriver, and I was able to land the s3-aws branch in this release too, so lots of improvements to the S3 support.

Spent several hours getting the autobuilders updated, with the haskell aws library installed. Android and armel builds are still out of date.

Also fixed two Windows bugs related to the location of the bundled ssh program.

Posted Wed Dec 3 23:02:56 2014

Back from the holiday, catching up on traffic. Backlog stands at 113 messages.

Here's a nice tip that Giovanni added: publishing your files to the public (using a public S3 bucket)

Just before going on break, I added a new feature that I didn't mention here. git annex diffdriver integrates git-annex with git's external diff driver support. So if you have a smart diff program that can diff, say, genome sequences, or cat videos, or something in some useful way, it can be hooked up to git diff and will be able to see the content of annexed files.

Also today, I spent a couple hours updating the license file included in the standalone git-annex builds to include the licenses of all the haskell libraries git-annex depends on. Which I had for some reason not thought to include before, despite them getting built into the git-annex binary.

Posted Mon Dec 1 23:35:22 2014

Built the git annex undo command. This is intended to be a simple interface for users who have changed one file, and want to undo the change without the complexities of git revert or git annex proxy. It's simple enough that I added undo as an action in the file manager integration.

And yes, you can undo an undo. :)

Posted Fri Nov 14 22:19:20 2014

Ever since the direct mode guard was added a year ago, direct mode has been a lot safer to use, but very limited in the git commands that could be run in a direct mode repository.

The worst limitation was that there was no way to git revert unwanted changes. But also, there was no way to check out different branches, or run commands like git mv.

Today I made git annex proxy, which allows doing all of those things, and more. documentation here

It's so flexible that I'm not sure where the boundaries lie yet, but it seems it will work for any git command that updates both the work tree and the index. Some git commands only update one or the other and not both and won't work with the proxy. As an advanced user tool, I think this is a great solution. I still want to make a simpler undo command that can nicely integrate into file managers.

The implementation of git annex proxy is quite simple, because it reuses all the complicated work tree update code that was already written for git annex merge.

And here's the lede I buried: I've gotten two years of funding to work on git-annex part-time! Details in my personal blog.

Posted Wed Nov 12 21:06:59 2014

The OSX autobuilder has been updated to OSX 10.10 Yosemite. The resulting build might work on 10.9 Mavericks too, and I'd appreciate help testing that.

Went ahead and fixed the partial commit problem by making the pre-commit hook detect and block problematic partial commits.

Posted Tue Nov 11 21:02:53 2014

S3 multipart is finally completely working. I still don't understand the memory issue that stumped me yesterday, but rewrote the code to use a simpler approach, which avoids the problem. Various other issues, and testing it with large files, took all day.

This is now merged into the s3-aws branch, so when that branch lands, S3 support will massively improve, from the current situation of using a buggy library that buffers uploaded files in memory, and cannot support very large file uploads at all, to being able to support hopefully files of arbitrary hugeness (at least up to a few terabytes).

BTW, thanks to Aristid Breitkreuz and Junji Hashimoto for working on the multipart support in the aws library.

Posted Tue Nov 4 22:04:40 2014

More work on S3 multipart uploads, since the aws library got fixed today to return the ETAGs for the parts. I got multipart uploads fully working, including progress display.

The code takes care to stream each part in from the file and out the socket, so I'd hoped it would have good memory behavior. However, for reasons I have not tracked down, something in the aws library is causing each part to be buffered in memory. This is a problem, since I want to use 1 gb as the default part size.
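
For a sense of the arithmetic involved, this sketch splits a file of a given size into the byte ranges that would be uploaded as parts. The names are assumptions for illustration, "1 gb" is taken here to mean 1024^3 bytes, and the 10,000 part ceiling is S3's documented limit on parts per multipart upload.

```haskell
-- | One gibibyte, taken as the default part size for this sketch.
gib :: Integer
gib = 1024 ^ (3 :: Int)

-- | S3's documented maximum number of parts in one multipart upload.
maxParts :: Int
maxParts = 10000

-- | (offset, length) of each part of a file of the given size.
-- Every part is partSize bytes except possibly the last, which is shorter.
partRanges :: Integer -> Integer -> [(Integer, Integer)]
partRanges partSize fileSize
  | partSize <= 0 = []
  | otherwise =
      [ (off, min partSize (fileSize - off))
      | off <- [0, partSize .. fileSize - 1]
      ]

-- | Whether a file of this size fits within the part count limit.
fitsInMultipart :: Integer -> Integer -> Bool
fitsInMultipart partSize fileSize =
  length (partRanges partSize fileSize) <= maxParts
```

If each part ends up buffered in memory, the part size is also roughly the peak memory used per upload, which is why the buffering matters at this default.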

Posted Tue Nov 4 02:11:56 2014

Some progress on the S3 upload not using multipart bug. The aws library now includes the multipart API. However, when I dug into it, it looks like the API needs some changes to get the ETAG of each uploaded part. Once that's fixed, git-annex should be able to support S3 multipart uploads, although I think that git-annex's own chunking is better in most situations -- it supports resuming uploads and downloads better. The main use case for S3 multipart seems to be using git-annex to publish large files.

Also, managed to get the backlog down from 100 to just 65 messages, including catching up on quite old parts of backlog.

Posted Tue Oct 28 20:41:42 2014

New AWS region in Germany announced today. git-annex doesn't support it yet, unless you're using the s3-aws branch.

I cleaned up that branch, got it building again, and re-tested it with testremote, and then fixed a problem the test suite found that was caused by some changes in the haskell aws library.

Unfortunately, s3-aws is not ready to be merged because of some cabal dependency problems involving dbus and random. I did go ahead and update Debian's haskell-aws package to cherry-pick from a newer version the change needed for Internet Archive support, which allows building the s3-aws branch on Debian. Getting closer..

Posted Thu Oct 23 21:02:00 2014

Today, I've expanded git annex info to also be able to be used on annexed files and on remotes. Looking at the info for an individual remote is quite useful, especially for answering questions like: Does the remote have embedded creds? Are they encrypted? Does it use chunking? Is that old style chunking?

description: demo remote
uuid: 15b42f18-ebf2-11e1-bea1-f71f1515f9f1
cost: 250.0
type: rsync
encryption: encrypted (to gpg keys: 7321FC22AC211D23 C910D9222512E3C7)
chunking: 1 MB chunks
remote: ia3
description: test [ia3]
uuid: 12817311-a189-4de3-b806-5f339d304230
cost: 200.0
type: S3
creds: embedded in git repository (not encrypted)
bucket: joeyh-test-17oct-3
internet archive item:
encryption: not encrypted
chunking: none

Should be quite useful info for debugging too..

Yesterday, I fixed a bug that prevented retrieving files from Glacier.

Posted Tue Oct 21 19:51:54 2014

3 days spent redoing the Android autobuilder! The new version of yesod-routes generates TH splices that break the EvilSplicer. So after updating everything to new versions for the Nth time, I instead went back to older versions. The autobuilder now uses Debian jessie, instead of wheezy. And all haskell packages are pinned to use the same version as in jessie, rather than the newest versions. Since jessie is quite near to being frozen, this should make the autobuilder much less prone to getting broken by new versions of haskell packages that need patches for Android.

I happened to stumble over while doing that. This supports setting and unsetting environment variables on Windows, which I had not known a way to do from Haskell. Cleaned up several ugly corners of the Windows port using it.

Posted Thu Oct 16 19:37:28 2014

git commit $some_unlocked_file seems like a reasonably common thing for someone to do, so it's surprising to find that it's a little bit broken, leaving the file staged in the index after (correctly) committing the annexed symlink.

This is caused by either a bug in git and/or by git-annex abusing the git post-commit hook to do something it shouldn't do, although it's not unique in using the post-commit hook this way. I'm talking this over with Junio, and the fix will depend on the result of that conversation. It might involve git-annex detecting this case and canceling the commit, asking the user to git annex add the file first. Or it might involve a new git hook, although I have not had good luck getting hooks added to git before.

Meanwhile, today I did some other bug fixing. Fixed the Internet Archive support for embedcreds=yes. Made git annex map work for remote repos in a directory with an implicit ".git" prefix. And fixed a strange problem where the repository repair code caused a git gc to run and then tripped over its pid file.

I seem to have enough fixes to make another release pretty soon. Especially since the current release of git-annex doesn't build with yesod 1.4.

Backlog: 94 messages

Posted Sun Oct 12 20:13:27 2014

Made two releases of git-annex, yesterday and today, which turned out to contain only Debian changes. So no need for other users to upgrade.

This included fixing building on mips, and arm architectures. The mips build was running out of memory, and I was able to work around that. Then the arm builds broke today, because of a recent change to the version of llvm that has completely trashed ghc. Luckily, I was able to work around that too.

Hopefully that will get last week's security fix into Debian testing, and otherwise have git-annex in Debian in good shape for the upcoming freeze.

Posted Sat Sep 27 20:29:54 2014

Working through the forum posts and bugs. Backlog is down to 95.

Discovered the first known security hole in git-annex! Turns out that S3 and Glacier remotes that were configured with embedcreds=yes and encryption=pubkey or encryption=hybrid didn't actually encrypt the AWS credentials that get embedded into the git repo. This doesn't affect any repos set up by the assistant.

I've fixed the problem and am going to make a release soon. If your repo is affected, see insecure embedded creds for what to do about it.

Posted Thu Sep 18 22:24:43 2014

Made a release yesterday, which was all bugfixes.

Today, a few more bug fixes. Looked into making the webapp create non-bare repositories on removable drives, but before I got too far into the code, I noticed there's a big problem with that idea.

Rest of day was spent getting caught up on forum posts etc. I'm happy to read lots of good answers that have been posted while I've been away. Here's an excellent example:

That led to rewriting the docs for building git-annex from source. New page: fromsource.

Backlog is now down to 117.

Posted Tue Sep 16 20:18:33 2014

Yesterday and today were the first good solid days working on git-annex in a while. There's a big backlog, currently of 133 messages, so I have been concentrating on bug reports first. Happily, not many new bugs have been reported lately, and I've made good progress on them, fixing 5 bugs today, including a file descriptor leak.

catching up

In this end of summer rush, I've been too busy to blog for the past 20 days, but not entirely too busy to work on git-annex. Two releases have been made in that time, and a fair amount of improvements worked on.

Including a new feature: When a local git repository is cloned with git clone --shared, git-annex detects this and defaults to a special mode where file contents get hard linked into the clone. It also makes the cloned repository be untrusted, to avoid confusing numcopies counting with the hard links. This can be useful for temporary working repositories without the overhead of lots of copies of files.

looking back

I want to look back further, over the crowdfunded year of work covered by this devblog. There were a lot of things I wanted to accomplish this past year, and I managed to get to most of them. As well as a few surprises.

  • Windows support improved more than I guessed in my wildest dreams.
    git-annex went from working not too well on the command line to being pretty solid there, as well as having a working and almost polished webapp on Windows.
    There are still warts -- it's Windows after all!

  • Android didn't get many improvements. Most of the time I had budgeted to Android porting ended up being used on Windows porting instead. I did, however, get the Android build environment cleaned up a lot from the initial hacked together one, and generally kept it building and working on Android.

  • The direct mode guard was not planned, but the need for it became clear, and it's dramatically reduced the amount of command-line foot-shooting that goes on in direct mode.

  • Repository repair was planned, and I'm very proud of git-repair. Also pleased with the webapp's UI for scheduling repository consistency checks.
    Always room for improvement in this kind of thing, but this brings a new capability to both git and git-annex.

  • The external special remote interface came together beautifully. External special remotes are now just as well supported as built-in ones, except the webapp cannot be used to configure them.

  • Using git-remote-gcrypt for fully encrypted git repositories, including support in the webapp for setting them up (and gpg keys if necessary), happened. Still needs testing/more use/improvements. Avoided doing much in the area of gpg key management, which is probably good to avoid when possible, but is probably needed to make this a really suitable option for end users.

  • Telehash is still being built, and it's not clear if they've gotten it to work at all yet. The v2 telehash has recently been superseded by a new v3. So I am not pleased that I didn't get git-annex working with telehash, but it was outside my control. This is a problem that needs to get solved outside git-annex first, either by telehash or something else. The plan is to keep an eye on everything in this space, including for example, Maidsafe.

  • In the meantime, the new notifychanges support in git-annex-shell makes XMPP/telehash/whatever unnecessary in a lot of configurations. git-annex's remotedaemon architecture supports that and is designed to support other notification methods later. And the webapp has a lot of improvements in the area of setting up ssh remotes, so fewer users will be stuck with XMPP.

  • I didn't quite get to deltas, but the final month of work on chunking provides a lot of new features and hopefully a foundation that will get to deltas eventually. There is a new haskell library that's being developed with the goal of being used for git-annex deltas.

  • I hadn't planned to make git-annex be able to upgrade itself, when installed from this website. But there was a need for that, and so it happened. Even got a gpg key trust path for the distribution of git-annex.

  • Metadata driven views was an entirely unplanned feature. The current prototype is very exciting, it opens up entire new use cases. I had to hold myself back to not work on it too much, especially as it shaded into adding a caching database to git-annex. Had too much other stuff planned to do all I wanted. Clearly this is an area I want to spend more time on!

Those are most of the big features and changes, but probably half of my work on git-annex this past year was in smaller things, and general maintenance. Lots of others have contributed, some with code (like the large effort to switch to bootstrap3), and others with documentation, bug reports, etc.

Perhaps it's best to turn to git diff --stat to sum up the activity and see just how much both the crowdfunding campaign and the previous kickstarter have pushed git-annex into high gear:

   campaign: 5410 files changed, 124159 insertions(+), 79395 deletions(-)
kickstarter: 4411 files changed, 123262 insertions(+), 13935 deletions(-)
year before: 1281 files changed,   7263 insertions(+), 55831 deletions(-)

What's next? The hope is, no more crowdfunded campaigns where I have to promise the moon anytime soon. Instead, the goal is to move to a more mature and sustainable funding model, and continue to grow the git-annex community, and the spaces where it's useful.

Posted Fri Sep 12 16:27:01 2014

Plan is to be on vacation and/or low activity this week before DebConf. However, today I got involved in fixing a bug that caused the assistant to keep files open after syncing with repositories on removable media.

Part of that bug involved lock files not being opened close-on-exec, and while fixing that I noticed again that the locking code was scattered all around and rather repetitive. That led to a lot of refactoring, which is always fun when it involves scary locking code. Thank goodness for referential transparency.

Now there's a Utility.LockFile that works on both POSIX and Windows. However, that module actually exports very different functions for the two. While it might be tempting to try to do a portability layer, the two locking models are really very different, and there are lots of gotchas such a portability layer would face. The only API that's completely the same between the two is dropLock.

This refactoring process and the cleaner, more expressive code it led to helped me spot a couple of bugs involving locking. See e386e26ef207db742da6d406183ab851571047ff and 0a4d301051e4933661b7b0a0791afa95bfe9a1d3. Neither bug has ever seemed to cause a problem, but it's nice to be able to spot and fix such bugs before they do.

Posted Wed Aug 20 23:46:24 2014

Over the past couple days, got the arm autobuilder working again. It had been down since June with several problems. cabal install tended to crash; apparently this has something to do with threading in user-mode qemu, because -j1 avoids that. And strange invalid character problems were fixed by downgrading file-embed. Also, with Yury's help I got the Windows autobuilder upgraded to the new Haskell Platform and working again.

Today a last few finishing touches, including getting rid of the last dependency on the old haskell HTTP library, since http-conduit is being used now. Ready for the release!

Posted Fri Aug 15 22:05:01 2014

Working on getting caught up with backlog. 73 messages remain.

Several minor bugs were fixed today. All edge cases. The most edge case one of all, I could not fix: git-annex cannot add a file that has a newline in its filename, because git cat-file --batch's interface does not support such filenames.

Added a page documenting how to verify the signatures of git-annex releases.

Over the past couple days, all the autobuilders have been updated to new dependencies needed by the recent work. Except for Windows, which needs to be updated to the new Haskell Platform first, so hopefully soon.

Turns out that upgrading unix-compat means that inode(like) numbers are available even on Windows, which will make git-annex more robust there. Win win. ;)

Posted Tue Aug 12 20:54:33 2014

Yesterday, finished converting S3 to use the aws library. Very happy with the result (no memory leaks! connection caching!), but s3-aws is not merged into master yet. Waiting on a new release of the aws library so as to not break Internet Archive S3 support.

Today, spent a few hours adding more tests to testremote. The new tests take a remote, and construct a modified version that is intentionally unavailable. Then they make sure trying to use it fails in appropriate ways. This was a very good thing to test; two bugs were immediately found and fixed.
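
The shape of such a test can be sketched like this: wrap a remote so that every action behaves as if the remote cannot be reached, then check that each action fails in the way callers expect. Everything here, including the record fields and the in-memory remote, is an assumption made for illustration and not the actual testremote code.

```haskell
import Data.Either (isLeft)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- | A cut-down remote interface. (Hypothetical record for this sketch.)
data Remote = Remote
  { remoteName :: String
  , storeKey :: String -> String -> IO Bool
  , retrieveKey :: String -> IO (Maybe String)
  , checkPresent :: String -> IO (Either String Bool)
    -- ^ Left means "could not find out", which differs from Right False
  }

-- | A working remote that keeps its keys in memory.
memoryRemote :: IO Remote
memoryRemote = do
  ref <- newIORef []
  pure Remote
    { remoteName = "memory"
    , storeKey = \k v -> modifyIORef ref ((k, v) :) >> pure True
    , retrieveKey = \k -> lookup k <$> readIORef ref
    , checkPresent = \k -> Right . any ((== k) . fst) <$> readIORef ref
    }

-- | A modified version of a remote that is intentionally unavailable.
mkUnavailable :: Remote -> Remote
mkUnavailable r = r
  { remoteName = remoteName r ++ "-unavailable"
  , storeKey = \_ _ -> pure False
  , retrieveKey = \_ -> pure Nothing
  , checkPresent = \_ -> pure (Left "remote is not available")
  }

-- | Using the unavailable remote must fail, and a presence check must
-- report an error rather than claim that the key is absent.
unavailableFailsProperly :: Remote -> IO Bool
unavailableFailsProperly r = do
  let u = mkUnavailable r
  stored <- storeKey u "k" "content"
  got <- retrieveKey u "k"
  present <- checkPresent u "k"
  pure (not stored && got == Nothing && isLeft present)
```

The distinction in the presence check is the important one: reporting a key as absent when the remote simply could not be reached would mislead anything that counts copies.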

And that wraps up several weeks of hacking on the core of git-annex's remotes support, which started with reworking chunking and kind of took on a life of its own. I plan a release of this new stuff in a week. The next week will be spent catching up on 117 messages of backlog that accumulated while I was in deep coding mode.

Posted Sun Aug 10 19:21:59 2014

Finished up webdav, and after running testremote for a long time, I'm satisfied it's good. The newchunks branch has now been merged into master completely.

Spent the rest of the day beginning to rework the S3 special remote to use the aws library. This was pretty fiddly; I want to keep all the configuration exactly the same, so had to do a lot of mapping from hS3 configuration to aws configuration. Also there is some hairy stuff involving escaping from the ResourceT monad with responses and http connection managers intact.

Stopped once initremote worked. The rest should be pretty easy, although Internet Archive support is blocked by This is in the s3-aws branch until it gets usable.

Posted Sat Aug 9 03:28:00 2014

Today was spent reworking so much of the webdav special remote that it was essentially rewritten from scratch.

The main improvement is that it now keeps a http connection open and uses it to perform multiple actions. Before, one connection was made per action. This is even done for operations on chunks. So, now storing a chunked file in webdav makes only 2 http connections total. Before, it would take around 10 connections per chunk. So a big win for performance, although there is still room for improvement: It would be possible to reduce that down to just 1 connection, and indeed keep a persistent connection reused when acting on multiple files.

Finished up by making uploading a large (non-chunked) file to webdav not buffer the whole file in memory.

I still need to make downloading a file from webdav not buffer it, and test, and then I'll be done with webdav and can move on to making similar changes to S3.

Posted Fri Aug 8 00:00:31 2014

Converted the webdav special remote to the new API. All done with converting everything now!

I also updated the new API to support doing things like reusing the same http connection when removing and checking the presence of chunks.

I've been working on improving the haskell DAV library, in a number of ways that will let me improve the webdav special remote. Including making changes that will let me do connection caching, and improving its API to support streaming content without buffering a whole file in memory.

Posted Wed Aug 6 22:43:13 2014

Just finished converting both rsync and gcrypt to the new API, and testing them. Still need to fix 2 test suite failures for gcrypt. Otherwise, only WebDAV remains unconverted.

Earlier today, I investigated switching from hS3 to the aws library. Learned its API, which seemed a lot easier to comprehend than the other two times I looked at it. Wrote some test programs, which are in the s3-aws branch. I was able to stream in large files to S3, without ever buffering them in memory (which hS3's API precludes). And for chunking, it can reuse an http connection. This seems very promising. (Also, it might eventually get Glacier support..)

I have uploaded haskell-aws to Debian, and once it gets into testing and backports, I plan to switch git-annex over to it.

Posted Mon Aug 4 00:38:10 2014

Have started converting lots of special remotes to the new API. Today, S3 and hook got chunking support. I also converted several remotes to the new API without supporting chunking: bup, ddar, and glacier (which should support chunking, but there were complications).

This removed 110 lines of code while adding features! And, I seem to be able to convert them faster than testremote can test them. :)

Now that S3 supports chunks, they can be used to work around several problems with S3 remotes, including file size limits, and a memory leak in the underlying S3 library.

The S3 conversion included caching of the S3 connection when storing/retrieving chunks. [Update: Actually, it turns out it didn't; the hS3 library doesn't support persistent connections. Another reason I need to switch to a better S3 library!]

But the API doesn't yet support caching when removing or checking if chunks are present. I should probably expand the API, but got into some type checker messes when using generic enough data types to support everything. Should probably switch to ResourceT.

Also, I tried, but failed to make testremote check that storing a key is done atomically. The best I could come up with was a test that stored a key and had another thread repeatedly check if the object was present on the remote, logging the results and timestamps. It then becomes a statistical problem -- somewhere toward the end of the log it's ok if the key has become present -- but too early might indicate that it wasn't stored atomically. Perhaps it's my poor knowledge of statistics, but I could not find a way to analyze the log that reliably detected non-atomic storage. If someone would like to try to work on this, see the atomic-store-test branch.

Posted Sat Aug 2 23:13:16 2014

Built git annex testremote today.

That took a little bit longer than expected, because it actually found several fence post bugs in the chunking code.

It also found a bug in the sample external special remote script.

I am very pleased with this command. Being able to run 640 tests against any remote, without any possibility of damaging data already stored in the remote, is awesome. Should have written it a looong time ago!

Posted Fri Aug 1 21:59:23 2014

It took 9 hours, but I finally got to make c0dc134cded6078bb2e5fa2d4420b9cc09a292f7, which both removes 35 lines of code, and adds chunking support to all external special remotes!

The groundwork for that commit involved taking the type scheme I sketched out yesterday, completely failing to make it work with such high-ranked types, and falling back to a simpler set of types that both I and GHC seem better at getting our heads around.

Then I also had more fun with types, when it turned out I needed to run encryption in the Annex monad. So I had to go convert several parts of the utility libraries to use MonadIO and exception lifting. Yurk.

The final and most fun stumbling block caused git-annex to crash when retrieving a file from an external special remote that had neither encryption nor chunking. Amusingly it was because I had not put in an optimisation (namely, just renaming the file that was retrieved in this case, rather than unnecessarily reading it in and writing it back out). It's not often that a lack of an optimisation causes code to crash!

So, fun day, great result, and it should now be very simple to convert the bup, ddar, gcrypt, glacier, rsync, S3, and WebDAV special remotes to the new system. Fingers crossed.

But first, I will probably take half a day or so and write a git annex testremote that can be run in a repository and does live testing of a special remote including uploading and downloading files. There are quite a lot of cases to test now, and it seems best to get that in place before I start changing a lot of remotes without a way to test everything.

Today's work was sponsored by Daniel Callahan.

Posted Wed Jul 30 00:46:42 2014

Zap! ... My internet gateway was destroyed by lightning. Limping along regardless, and replacement ordered.

Got resuming of uploads to chunked remotes working. Easy!

Next I want to convert the external special remotes to have these nice new features. But there is a wrinkle: The new chunking interface works entirely on ByteStrings containing the content, but the external special remote interface passes content around in files.

I could just make it write the ByteString to a temp file, and pass the temp file to the external special remote to store. But then, when chunking is not being used, it would pointlessly read a file's content, only to write it back out to a temp file.

Similarly, when retrieving a key, the external special remote saves it to a file. But we want a ByteString. Except, when not doing chunking or encryption, letting the external special remote save the content directly to a file is optimal.

One approach would be to change the protocol for external special remotes, so that the content is sent over the protocol rather than in temp files. But I think this would not be ideal for some kinds of external special remotes, and it would probably be quite a lot slower and more complicated.

Instead, I am playing around with some type class trickery:

{-# LANGUAGE Rank2Types TypeSynonymInstances FlexibleInstances MultiParamTypeClasses #-}

type Storer p = Key -> p -> MeterUpdate -> IO Bool

-- For Storers that want to be provided with a file to store.
type FileStorer a = Storer (ContentPipe a FilePath)

-- For Storers that want to be provided with a ByteString to store
type ByteStringStorer a = Storer (ContentPipe a L.ByteString)

class ContentPipe src dest where
        contentPipe :: src -> (dest -> IO a) -> IO a

instance ContentPipe L.ByteString L.ByteString where
        contentPipe b a = a b

-- This feels a lot like I could perhaps use pipes or conduit...
instance ContentPipe FilePath FilePath where
        contentPipe f a = a f

instance ContentPipe L.ByteString FilePath where
        contentPipe b a = withTmpFile "tmpXXXXXX" $ \f h -> do
                L.hPut h b
                hClose h
                a f

instance ContentPipe FilePath L.ByteString where
        contentPipe f a = a =<< L.readFile f

The external special remote would be a FileStorer, so when a non-chunked, non-encrypted file is provided, it just runs on the FilePath with no extra work. While when a ByteString is provided, it's swapped out to a temp file and the temp file provided. And many other special remotes are ByteStringStorers, so they will just pass the provided ByteString through, or read in the content of a file.

I think that would work. Though it is not optimal for external special remotes that are chunked but not encrypted. For that case, it might be worth extending the special remote protocol with a way to say "store a chunk of this file from byte N to byte M".
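The same "file or ByteString, whichever is cheaper" idea can also be sketched without a multi-parameter type class, using a plain sum type. This is only an illustration under assumptions: `Content`, `withContentFile` and `withContentBytes` are hypothetical names rather than git-annex code, and the temp file comes from the standard `openBinaryTempFile` instead of git-annex's `withTmpFile`.

```haskell
import Control.Exception (bracket)
import qualified Data.ByteString.Lazy as L
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO (hClose, openBinaryTempFile)

-- Content handed to a storer: either already on disk, or in memory.
-- (Hypothetical type, for illustration only.)
data Content = InFile FilePath | InMemory L.ByteString

-- Run an action that wants a file. Content that is already in a file is
-- passed through untouched; in-memory content is spilled to a temp file,
-- which is removed once the action finishes.
withContentFile :: Content -> (FilePath -> IO a) -> IO a
withContentFile (InFile f) act = act f
withContentFile (InMemory b) act = do
    tmpdir <- getTemporaryDirectory
    bracket (openBinaryTempFile tmpdir "chunk.tmp") cleanup $ \(f, h) -> do
        L.hPut h b
        hClose h
        act f
  where
    cleanup (f, h) = hClose h >> removeFile f

-- Run an action that wants a ByteString. In-memory content is passed
-- through; a file is read in lazily.
withContentBytes :: Content -> (L.ByteString -> IO a) -> IO a
withContentBytes (InMemory b) act = act b
withContentBytes (InFile f) act = act =<< L.readFile f
```

A file-based storer would call `withContentFile` and a ByteString-based one `withContentBytes`, so the copy through a temp file only happens when the two sides disagree.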

Also, talked with ion about what would be involved in using rolling checksum based chunks. That would allow for rsync or zsync like behavior, where when a file changed, git-annex uploads only the chunks that changed, and the unchanged chunks are reused.
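To make the rolling checksum idea concrete, here is a minimal sketch of content-defined chunking. It is not a design for git-annex: the function names, the plain byte-sum checksum, and the `window` and `modulus` parameters are all assumptions chosen for brevity (real tools use stronger rolling hashes and enforce minimum chunk sizes). A chunk boundary is placed wherever the sum of the last `window` bytes is divisible by `modulus`, so boundaries depend only on nearby content and not on file offsets.

```haskell
import qualified Data.ByteString as B

-- Offsets at which a chunk ends. A boundary follows byte i whenever the
-- sum of the `window` bytes ending at i is divisible by `modulus`.
-- Both parameters are assumed to be at least 1.
boundaries :: Int -> Int -> B.ByteString -> [Int]
boundaries window modulus input =
    [ i + 1 | (i, s) <- zip [window - 1 ..] sums, s `mod` modulus == 0 ]
  where
    bytes = map fromIntegral (B.unpack input) :: [Int]
    initial = sum (take window bytes)
    -- Slide the window one byte at a time: add the byte that enters,
    -- subtract the byte that leaves.
    sums
        | length bytes < window = []
        | otherwise = scanl slide initial (zip (drop window bytes) bytes)
    slide s (incoming, outgoing) = s + incoming - outgoing

-- Split the input at the content-defined boundaries.
rollingChunks :: Int -> Int -> B.ByteString -> [B.ByteString]
rollingChunks window modulus input = go 0 (boundaries window modulus input)
  where
    go start [] = [B.drop start input | start < B.length input]
    go start (b : bs) = B.take (b - start) (B.drop start input) : go b bs
```

Because the boundaries are tied to content, inserting a byte near the start of a file changes only the chunks around the insertion; later chunks come out identical and would not need to be uploaded again.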

I am not ready to work on that yet, but I made some changes to the parsing of the chunk log, so that additional chunking schemes like this can be added to git-annex later without breaking backwards compatibility.

Posted Mon Jul 28 21:28:34 2014

Last night, went over the new chunking interface, tightened up exception handling, and improved the API so that things like WebDAV will be able to reuse a single connection while all of a key's chunks are being downloaded. I am pretty happy with the interface now, and expect to convert more special remotes to use it soon.

Just finished adding a killer feature: Automatic resuming of interrupted downloads from chunked remotes. Sort of a poor man's rsync, that while less efficient and awesome, is going to work on every remote that gets the new chunking interface, from S3 to WebDAV, to all of Tobias's external special remotes! Even allows for things like starting a download from one remote, interrupting, and resuming from another one, and so on.

I had forgotten about resuming while designing the chunking API. Luckily, I got the design right anyway. Implementation was almost trivial, and only took about 2 hours! (See 9d4a766cd7b8e8b0fc7cd27b08249e4161b5380a)
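The arithmetic behind resuming can be sketched in a few lines. This is a guess at the general shape, not the code from that commit: it assumes fixed-size chunks numbered from 1, and that any bytes past the last complete chunk in the partial file are simply fetched again. The function names are made up for the example.

```haskell
-- Which chunks still need to be fetched, given the chunk size, the total
-- size of the key, and how many bytes are already in the partial file.
chunksToFetch :: Integer -> Integer -> Integer -> [Integer]
chunksToFetch chunkSize totalSize have = [complete + 1 .. total]
  where
    -- Total number of chunks, rounding up for a short final chunk.
    total = (totalSize + chunkSize - 1) `div` chunkSize
    -- Number of chunks that are already completely on disk.
    complete
        | have >= totalSize = total
        | otherwise = have `div` chunkSize

-- Byte offset in the partial file at which writing resumes.
resumeOffset :: Integer -> Integer -> Integer
resumeOffset chunkSize have = (have `div` chunkSize) * chunkSize
```

For example, with 1000 byte chunks, a 4200 byte key, and 2500 bytes already downloaded, `chunksToFetch 1000 4200 2500` is `[3,4,5]` and `resumeOffset 1000 2500` is `2000`. Nothing here depends on which remote supplied the earlier chunks, which is why switching remotes mid-download can work.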

I'll later add resuming of interrupted uploads. It's not hard to detect such uploads with only one extra query of the remote, but in principle, it should be possible to do it with no extra overhead, since git-annex already checks if all the chunks are there before starting an upload.

Posted Sun Jul 27 23:14:55 2014

Remained frustratingly stuck until 3 pm on the same stuff that puzzled me yesterday. However, 6 hours later, I have the directory special remote 100% working with both new chunk= and legacy chunksize= configuration, both with and without encryption.

So, the root of why this was hard, since I thought about it a lot today in between beating my head into the wall: git-annex's internal API for remotes is really, really simple. It basically comes down to:

        { storeKey :: Key -> AssociatedFile -> MeterUpdate -> Annex Bool
        , retrieveKeyFile :: Key -> AssociatedFile -> FilePath -> MeterUpdate -> Annex Bool
        , removeKey :: Key -> Annex Bool
        , hasKey :: Key -> Annex (Either String Bool)

This simplicity is a Good Thing, because it maps very well to REST-type services. And it allows for quite a lot of variety in implementations of remotes. Ranging from regular git remotes, that rsync files around without git-annex ever loading them itself, to remotes like webdav that load and store files themselves, to remotes like tahoe that intentionally do not support git-annex's built-in encryption methods.

However, the simplicity of that API means that lots of complicated stuff, like handling chunking, encryption, etc, has to be handled on a per-remote basis. Or, more generally, by Remote -> Remote transformers that take a remote and add some useful feature to it.

One problem is that the API is so simple that a remote transformer that adds encryption is not feasible. In fact, every encryptable remote has had its own code that loads a file from local disk, encrypts it, and sends it to the remote. Because there's no way to make a remote transformer that converts a storeKey into an encrypted storeKey. (Ditto for retrieving keys.)
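A cut-down model shows both what a Remote -> Remote transformer can do and where it runs out of road. Everything here is a simplification for illustration: the record runs in `IO` rather than `Annex`, keys are plain strings, there is no `AssociatedFile` or `MeterUpdate`, and `memoryRemote` and `mapKeys` are hypothetical names.

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Set as S

type Key = String

-- A much simplified stand-in for the remote API.
data Remote = Remote
    { storeKey :: Key -> IO Bool
    , hasKey :: Key -> IO Bool
    , removeKey :: Key -> IO Bool
    }

-- A toy remote that only remembers which keys it has been given.
memoryRemote :: IORef (S.Set Key) -> Remote
memoryRemote ref = Remote
    { storeKey = \k -> modifyIORef' ref (S.insert k) >> return True
    , hasKey = \k -> S.member k <$> readIORef ref
    , removeKey = \k -> modifyIORef' ref (S.delete k) >> return True
    }

-- A Remote -> Remote transformer: rewrite every key before it reaches
-- the underlying remote.
mapKeys :: (Key -> Key) -> Remote -> Remote
mapKeys f r = Remote
    { storeKey = storeKey r . f
    , hasKey = hasKey r . f
    , removeKey = removeKey r . f
    }
```

`mapKeys` can rename keys, which is enough to hide key names, but `storeKey` only ever receives a `Key`. The transformer never sees the content that gets stored, so it has nothing to encrypt.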

I almost made the API more complicated today. Twice. But both times I ended up not, and I think that was the right choice, even though it meant I had to write some quite painful code.

In the end, I instead wrote a little module that pulls together supporting both encryption and chunking. I'm not completely happy because those two things should be independent, and so separate. But, 120 lines of code that don't keep them separate is not the end of the world.

That module also contains some more powerful, less general APIs, that will work well with the kinds of remotes that will use it.

The really nice result, is that the implementation of the directory special remote melts down from 267 lines of code to just 172! (Plus some legacy code for the old style chunking, refactored out into a file I can delete one day.) It's a lot cleaner too.

With all this done, I expect I can pretty easily add the new style chunking to most git-annex remotes, and remove code from them while doing it!

Today's work was sponsored by Mark Hepburn.

Posted Sun Jul 27 00:54:38 2014

A lil bit in the weeds on the chunking rewrite right now. I did succeed in writing the core chunk generation code, which can be used for every special remote. It was pretty hairy (needs to stream large files in constant memory, separating into chunks, and get the progress display right across operations on chunks, etc). That took most of the day.
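The splitting part of that, taken on its own, is small. The sketch below is not the git-annex code and leaves out the progress display and everything remote-specific; `splitChunks` is a name made up for the example. It relies on lazy ByteStrings, so when the input comes from a lazy read and each chunk is consumed before the next is demanded, only a bounded part of the file needs to be in memory at once.

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as L

-- Split a lazy ByteString into chunks of at most `size` bytes.
-- The result list is produced lazily, one chunk at a time.
splitChunks :: Int64 -> L.ByteString -> [L.ByteString]
splitChunks size b
    | size <= 0 = error "chunk size must be positive"
    | L.null b = []
    | otherwise =
        let (chunk, rest) = L.splitAt size b
        in chunk : splitChunks size rest
```

For instance, `map L.length (splitChunks 4 (L.pack [1 .. 10]))` gives `[4,4,2]`.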

Ended up getting stuck in integrating the encryptable remote code, and had to revert changes that could have led to rewriting (or perhaps eliminating?) most of the per-remote encryption specific code.

Up till now, this has supported both encrypted and non-encrypted remotes; it was simply passed encrypted keys for an encrypted remote:

remove :: Key -> Annex Bool

But with chunked encrypted keys, it seems it needs to be more complicated:

remove' :: Maybe (Key -> Key) -> ChunkConfig -> Key -> Annex Bool

So that when the remote is configured to use chunking, it can look up the chunk keys, and then encrypt them, in order to remove all the encrypted chunk keys.
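Roughly, the key handling inside such a `remove'` would look like the sketch below. The types are stand-ins invented for the example: keys are strings, `ChunkConfig` just carries a chunk count (in reality the count would have to be looked up), and the chunk key naming is made up.

```haskell
import Data.Maybe (fromMaybe)

type Key = String

-- Hypothetical, simplified chunk configuration.
data ChunkConfig = NoChunks | Chunks Int

-- The keys under which the content of a key is stored on the remote.
chunkKeys :: ChunkConfig -> Key -> [Key]
chunkKeys NoChunks k = [k]
chunkKeys (Chunks n) k = [k ++ "-chunk" ++ show i | i <- [1 .. n]]

-- First expand the key into its chunk keys, then encrypt each of them,
-- if the remote is encrypted. These are the keys that must be removed.
keysToRemove :: Maybe (Key -> Key) -> ChunkConfig -> Key -> [Key]
keysToRemove encryptor cfg k = map (fromMaybe id encryptor) (chunkKeys cfg k)
```

The order matters: the chunk keys are derived from the unencrypted key, so the encryption step cannot be applied before the remote code is called, which is what forces the extra parameters.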

I don't like that complication, so want to find a cleaner abstraction. Will sleep on it.

While I was looking at the encryptable remote generator, I realized the remote cost was being calculated wrongly for special remotes that are not encrypted. Fixed that bug.

Today's work was sponsored by bak.

Posted Sat Jul 26 01:00:04 2014

The design for new style chunks seems done, and I laid the groundwork for it today. Added chunk metadata to keys, reorganized the legacy chunking code for directory and webdav so it won't get (too badly) in the way, and implemented the chunk logs in the git-annex branch.

Today's work was sponsored by

Posted Thu Jul 24 20:51:58 2014

Working on designs for better chunking. Having a hard time finding a way to totally obscure file sizes, but otherwise a good design seems to be coming together. I particularly like that the new design puts the chunk count in the Key (which is then encrypted for special remotes), rather than having it be some special extension.

While thinking through chunking, I realized that the current chunking method can fail if two repositories have different chunksize settings for the same special remote and both upload the same key at the same time. Aren't races fun? The new design will eliminate this problem; in the meantime updated the docs to recommend never changing a remote's chunksize setting.

Posted Wed Jul 23 21:58:48 2014

Updated the Debian backport. (Also the git-remote-gcrypt backport.)

Made the assistant install a desktop file to integrate with Konqueror.

Improved git annex repair, fixing a bug that could cause it to leave broken branch refs and yet think that the repair was successful.

A bit surprised to see that it has now been a full year since I started doing development funded by my campaign. Not done yet!

Update on campaign rewards:

Today's work was sponsored by Douglas Butts.

Posted Mon Jul 21 23:22:44 2014

Spent hours today in a 10-minute build/test cycle, tracking down a bug that caused the assistant to crash on Windows after exactly 10 minutes uptime. Eventually found the cause; this is fallout from last month's work that got it logging to the debug.log on Windows.

There was more, but that was the interesting one..

Posted Wed Jul 16 22:28:00 2014

I have mostly been thinking about gcrypt today. This issue needs to be dealt with. The question is, does it really make sense to try to hide the people a git repository is encrypted for? I have posted some thoughts and am coming to the viewpoint that obscuring the identities of users of a repository is not a problem git-annex should try to solve itself, although it also shouldn't get in the way of someone who is able and wants to do that (by using tor, etc).

Finally, I decided to go ahead and add a gcrypt.publish-participants setting to git-remote-gcrypt, and make git-annex set that by default when setting up a gcrypt repository.

Some promising news from the ghc build on arm. I got a working ghc, and even ghci works. Which would make the template haskell in the webapp etc available on arm without the current horrible hacks. Have not managed to build the debian ghc package successfully yet though.

Also, fixed a bug that made git annex sync not pull/push with a local repository that had not yet been initialized for use with git-annex.

Today's work was sponsored by Stanley Yamane.

Posted Tue Jul 15 21:47:37 2014

Yay, the Linux autobuilder is back! Also fixed the Windows build.

Fixed a reversion that prevented the webapp from starting properly on Windows, which was introduced by some bad locking when I put in the hack that makes it log to the log file on that platform.

Various other minor fixes here and there. There are almost enough to do a release again soon.

I've also been trying to bootstrap ghc 7.8 on arm, for Debian. There's a script that's supposed to allow building 7.8 using 7.6.3, dealing with a linker problem by using the gold linker. Hopefully that will work since otherwise Debian could remain stuck with an old ghc or worse lose the arm ports. Neither would be great for git-annex..

Posted Tue Jul 15 04:42:51 2014

Spent past 2 days catching up on backlog and doing bug triage and some minor bug fixes and features. Backlog is 27, lowest in quite a while so I feel well on top of things.

I was saddened to find this bug where I almost managed to analyze the ugly bug's race condition, but not quite (and then went on vacation). BTW, I have not heard from anyone else who was hit by that bug so far.

The linux autobuilders are still down; their host server had a disk crash in an electrical outage. Might be down for a while. I would not mind setting up a redundant autobuilder if anyone else would like to donate a linux VM with 4+ gb of ram.

Posted Fri Jul 11 21:03:01 2014

Important: A bug caused the assistant to sometimes remove all files from the git repository. You should check if your repository is ok. If the bug hit you, it should be possible to revert the bad commit and recover your files with no data loss. See the bug report for details.

This affected git-annex versions since 5.20140613, and only when using the assistant in direct mode. It should be fixed in today's release, 5.20140709.

I'm available to help anyone hit by this unfortunate bug.

This is another bug in the direct mode merge code. I'm not happy about it. It's particularly annoying that I can't fix up after it automatically (because there's no way to know if any given commit in the git history that deletes all the files is the result of this bug, or a legitimate deletion of all files).

The only good thing is that the design of git-annex is pretty robust, and in this case, despite stupidly committing the deletion of all the files in the repository, git-annex did take care to preserve all their contents and so the problem should be able to be resolved without data loss.

Unfortunately, the main autobuilder is down and I've had to spin up autobuilders on a different machine (thank goodness that's very automated now!), and so I have not been able to build the fixed git-annex for android yet. I hope to get that done later this evening.

Yesterday, I fixed a few (much less bad) bugs, and did some thinking about plans for this month. The roadmap suggests working on some of chunks, deltas or gpgkeys. I don't know how to do deltas yet really. Chunks is pretty easily done. The gpg keys stuff is pretty open ended and needs some more work to define some use cases. But, after today, I am more inclined to want to spend time on better testing and other means of avoiding this kind of situation.

Posted Wed Jul 9 20:29:59 2014

Got the release out. Had to fix various autobuilder issues. The arm autobuilder is unfortunately not working currently.

Updated git-annex to build with a new version of the bloomfilter library.

Posted Mon Jul 7 20:13:28 2014

Got a bit distracted improving Haskell's directory listing code.

Only real git-annex work today was fixing Assistant merge loop, which was caused by changes in the last release (that made direct mode merging crash/interrupt-safe). This is a kind of ugly bug, that can result in the assistant making lots of empty commits in direct mode repositories. So, I plan to make a new release on Monday.

Posted Sat Jul 5 21:24:19 2014

Spent the morning improving behavior when commit.gpgsign is set. Now git-annex will let gpg sign commits that are made when eg, manually running git annex sync, but not commits implicitly made to the git-annex branch. And any commits made by the assistant are not gpg signed. This was slightly tricky, since lots of different places in git-annex ran git commit, git merge and similar.

Then got back to a test I left running over vacation, that added millions of files to a git annex repo. This was able to reproduce a problem where git annex add blew the stack and crashed at the end. There turned out to be two different memory issues, one was in git-annex and the other is in Haskell's core getDirectoryContents. Was able to entirely fix it, eventually.

Posted Fri Jul 4 22:14:33 2014

Finally back to work with a new laptop!

Did one fairly major feature today: When using git-annex to pull down podcasts, metadata from the feed is copied into git-annex's metadata store, if annex.genmetadata is set. Should be great for views etc!

Worked through a lot of the backlog, which is down to 47 messages now.

Only other bug fix of note is a fix on Android. A recent change to git made it try to chmod files, which tends to fail on the horrible /sdcard filesystem. Patched git to avoid that.

For some reason the autobuilder box rebooted while I was away, and somehow the docker containers didn't come back up -- so they got automatically rebuilt. But I have to manually finish up building the android and armel ones. Will be babysitting that build this evening.

Today's work was sponsored by Ævar Arnfjörð Bjarmason.

Posted Thu Jul 3 20:44:33 2014

I am back from the beach, but my dev laptop is dead. A replacement is being shipped, and I have spent today getting my old netbook into a usable state so I can perhaps do some work using it in the meantime.

(Backlog is 95 messages.)

Posted Mon Jun 30 22:36:21 2014

Last night, got logging to daemon.log working on Windows. Aside from XMPP not working (but it's near to being deprecated anyway), and some possible issues with unicode characters in filenames, the Windows port now seems in pretty good shape for a beta release.

Today, mostly worked on fixing the release process so the metadata accurately reflects the version from the autobuilder that is included in the release. Turns out there was version skew in the last release (now manually corrected). This should avoid that happening again, and also automates more of my release process.

Posted Thu Jun 19 02:57:57 2014

After despairing of ever solving this yesterday (and for the past 6 months really), I've got the webapp running on Windows with no visible DOS box. Also have the assistant starting up in the background on login.

It turns out a service was not the way to go. There is a way to write a VB Script that runs a "DOS" command in a hidden window, and this is what I used. Amazing how hard it was to work this out, probably partly because I don't have the Windows vocabulary to know what to look for.

Posted Tue Jun 17 18:31:11 2014

More work on windows git-annex service, but am stuck with a permissions problem.

Fixed a bug that prevented two assistants from syncing when there was only a uni-directional link between them. Only affected direct mode, and was introduced back when I added the direct mode guard.

Posted Mon Jun 16 23:56:30 2014

It's officially a Windows porting month. Now that I'm half way through it and with the last week of the month going to be a vacation, this makes sense.

Today, finished up dealing with the timezone/timestamp issues on Windows. This got stranger and stranger the closer I looked at it. After a timestamp change, a program that was already running will see one timestamp, while a program that is started after the change will see another one! My approach works pretty much no matter how Windows goes insane though, and always recovers a true timestamp. Yay.

Also fixed a regression test failure on Windows, which turned out to be rooted in a bug in the command queue runner, which neglected to pass along environment overrides on Windows.

Then I spent 5 hours tracking down a tricky test suite failure on Windows, which turned out to also affect FAT and be a recent reversion that has as its root cause a fun bug in git itself. Put in a not very good workaround. Thank goodness for test suites!

Also got the arm autobuilder unstuck. Release tomorrow.

Posted Fri Jun 13 02:09:02 2014

Spent all day on some horrible timestamp issues on legacy systems.

On FAT, timestamps have a 2s granularity, which is ok, but then Linux adds a temporary higher resolution cache, which is lost on unmount. This confused git-annex since the mtimes seemed to change and it had to re-checksum half the files to get unconfused, which was not good. I found a way to use the inode sentinel file to detect when on FAT and put in a workaround, without degrading git-annex everywhere else.

On Windows, time zones are an utter disaster; it changes the mtime it reports for files after the time zone has changed. Also there's a bug in the haskell time library which makes it return old time zone data after a time zone change. (I just finished developing a fix for that bug..)

Left with nothing but a few sticks, I rubbed them together, and actually found a way to deal with this problem too. Scary details in Windows file timestamp timezone madness. While I've implemented it, it's stuck on a branch until I find a way to make git-annex notice when the timezone changes while it's running.

Today's work was sponsored by Svenne Krap.

Posted Wed Jun 11 23:08:13 2014

Have for the first time gotten git-annex to run as a proper Windows service, using nssm. (details) Not quite ready yet though; doesn't run as the right user.

And a few other windows porting bits.

Posted Tue Jun 10 23:23:18 2014

Spent most of today improving behavior when a sync or merge is interrupted in direct mode. It was possible for an interrupt at the wrong time to leave the merge committed, but the work tree not yet updated. And then the next sync would make a commit that reverted the merged changes!

To fix this I had to avoid making any merge commit or indeed updating the index until after the work tree is updated. It looked intractable for a while; I'm still surprised I eventually succeeded.

Posted Tue Jun 10 00:16:27 2014

Did work on Windows porting today. First, fixed a reversion in the last release, that broke the git-annex branch pretty badly on Windows, causing \r to be written to files on that branch that should never have DOS line endings. Second, fixed a long-standing bug that prevented getting a file from a local bare repository on Windows.

Also refreshed all autobuilders to deal with the gnutls and openssl security holes-of-the-week. (git-annex uses gnutls only for XMPP, and does not use openssl itself, but a few programs bundled with it, like curl, do use openssl.)

A nice piece of news: OSX Homebrew now contains git-annex, so it can be easily installed with brew install git-annex

Posted Thu Jun 5 21:33:54 2014

Yesterday I recorded a new screencast, demoing using the assistant on a local network with a small server. git-annex assistant lan. That's the best screencast yet; having a real framing story was nice; recent improvements to git-annex are taken advantage of without being made a big deal; and audio and video are improved. (But there are some minor encoding glitches which I'd have to re-edit it to fix.)

The roadmap has this month dedicated to improving Android. But I think what I'd more like to do is whatever makes the assistant usable by the most people. This might mean doing more on Windows, since I hear from many who would benefit from that. Or maybe something not related to porting?

Posted Wed Jun 4 21:18:31 2014

After making a release yesterday, I've been fixing some bugs in the webapp, all to do with repository configuration stored on the git-annex branch. I was led into this by a strange little bug where the webapp stored configuration in the wrong repo in one situation. From there, I noticed that often when enabling an existing repository, the webapp would stomp on its group and preferred content and description, replacing them with defaults.

This was a systematic problem, it had to be fixed in several places. And some of the fixes were quite tricky. For example, when adding a ssh repository, and it turns out there's already a git-annex repository at the entered location, it needs to avoid changing its configuration. But also, the configuration of that repo won't be known until after the first git pull from it. So it doesn't make sense to show the repository edit form after enabling such a repository.

Also worked on a couple other bugs, and further cleaned up the bugs page. I think I am finally happy with how the bug list is displayed, with confirmed/moreinfo/etc tags.

Today's work was sponsored by François Deppierraz.

Posted Fri May 30 21:56:26 2014

Got a handle on the Android webapp static file problems (no, they were not really encoding problems!), and hopefully that's all fixed now. Also, only 3 modules use Char8 now. And updated the git-annex backport. That's all I did today.

Meanwhile, a complete ZSH completion has been contributed by Schnouki. And, Ben Gamari sent in a patch moving from the deprecated MonadCatchIO-transformers library to the exceptions library.

Posted Wed May 28 22:26:53 2014

These themed days are inadvertent, but it happened again: Nearly everything done today had to do with encoding issues.

The big news is that it turned out everything written to files in the git-annex branch had unicode characters truncated to 8 bits. Now fixed so you should always get out the same thing you put in, no matter what encoding you use (but please use utf8). This affected things like storing repository descriptions, but worse, it affected metadata. (Also preferred content expressions, I suppose.)

With that fixed, there are still 7 source files left that use Char8 libraries. There used to be more; nearly every use of those is a bug. I looked over the remaining uses of it, and there might be a problem with Creds using it. I should probably make a push to stamp out all remaining uses of Char8.
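The truncation is easy to reproduce outside git-annex. The sketch below only demonstrates the failure mode and one lossless alternative; the function names are made up, and the UTF-8 round trip through the text library is just the simplest way to show a correct encoding, not necessarily the fix that went into git-annex.

```haskell
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Char8.pack keeps only the low 8 bits of each character, so anything
-- outside Latin-1 comes back as a different character.
lossyRoundTrip :: String -> String
lossyRoundTrip = B8.unpack . B8.pack

-- Encoding to UTF-8 and decoding again preserves every character.
utf8RoundTrip :: String -> String
utf8RoundTrip = T.unpack . TE.decodeUtf8 . TE.encodeUtf8 . T.pack
```

With the single character U+03BB (Greek lambda, code point 955) as input, `lossyRoundTrip` returns the character with code 187, since 955 mod 256 is 187, while `utf8RoundTrip` returns the input unchanged.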

Other encoding bugs were less reproducible.

And just now, Sören made some progress on Bootstrap3 icons missing on Android ... and my current theory is this is actually caused by an encoding issue too.

Posted Tue May 27 20:37:23 2014

With some help from Sören, have been redoing the android build environment for git-annex. This included making propellor put it in a docker container, which was easy. But then much struggling with annoying stuff like getting the gnutls linking to work, and working around some dependency issues on hackage that make cabal's dependency resolver melt down. Finally succeeded after much more time than I had wanted to spend on this.

Posted Tue May 27 17:23:39 2014

Working on moving the android autobuilder to Docker & Propellor, which will finish containerizing all the autobuilds that I run. Updated ghc-android to use the released ghc 7.8.2, which will make it build more reliably.

Also did bug triage. Bugs are now divided into confirmed and unconfirmed categories.

Posted Sun May 25 00:38:40 2014

Keeping lots of things going these past few days..

  • Rebootstrapping the armel autobuilder with propellor. Some qemu instability and the need to update haskell library patches meant this took a lot of hand-holding. Finally got a working setup today.
  • Designing and ordering new git-annex stickers on clear vinyl backing; have put off sending those to campaign contributors for too long.
  • Added a new feature to the webapp: It now remembers the ssh remotes that it sets up, and makes it easy to enable them elsewhere, the same as other sorts of remotes. Had a very pleasant surprise building this, when I was able to reuse all the UI code for enabling rsync and gcrypt remotes. I think this will be a useful feature as we transition away from XMPP.

Posted Fri May 23 01:08:36 2014

Worked on triaging several bugs. Fixed an easy one, which involved the assistant choosing the wrong path to a repository that has multiple remotes. After today, backlog is down to 43, nearly pre-Brazil levels.

It seems that git-remote-gcrypt never quite worked on OSX. It looked like it did, but a bug prevented anything being pushed to the remote. Tracked down and fixed that bug.

This evening, getting back to working on the armel autobuilder setup using propellor. The autobuilder will use a pair of docker containers, one armel and a companion amd64, and their quite complex setup will be almost fully automated (except for the haskell library patching part).

Today's work was sponsored by Mica Semrick.

Posted Mon May 19 22:59:21 2014

Released git-annex 5.20140517 today. The changelog for this release is very unusual, because it's full of contributions from others! There are as many patches from others in this release as git-annex got in the first entire two years of its existence.

I'd like to keep that going. Also, I could really use help triaging bug reports right now. So I have updated the contribute page with more info about easy ways to contribute to git-annex. If you read this devblog, you're an ideal contributor, and you don't need to know how to write haskell either.. So take a look at the page and see if you can help out.

Posted Sun May 18 02:17:53 2014

Powered through the backlog today, and got it down to 67! Probably most of the rest is the hard ones though.

A theme today was: It's stupid hard to get git-annex-shell installed into PATH. While that should be the simplest thing in the world, I'm pinned between two problems:

  1. There's no single portable package format, so all the decades of development of nice ways to get things into PATH don't work for everybody.
  2. bash provides not a single dotfile that will work in all circumstances to configure PATH. In particular, "ssh $host git-annex-shell" causes bash to helpfully avoid looking at any dotfiles at all.

Today's flailing to work around that included:

  • Merged a patch from Fraser Tweedale to allow git config remote.origin.annex-shell /not/in/path/git-annex-shell
  • Merged a patch from Justin Lebar to allow symlinking the git-annex-shell etc from the standalone tarball to a directory that is in PATH. (Only on Linux, not OSX yet.)
  • Improved the warning message git-annex prints when a remote server does not have git-annex-shell in PATH, suggesting some things the user could do to try to fix it.

I've found out why OSX machines were retrying upgrades repeatedly. The version in the .info file did not match the actual git-annex version for OSX. I've fixed the info file version, but will need to come up with a system to avoid such mismatches.

Made a few other fixes. A notable one is that dragging and dropping repositories in the webapp to reorder the list (and configure costs) had been broken since November.

git-annex 5.20140421 finally got into Debian testing today, so I updated the backport. I recommend upgrading, especially if you're using the assistant with a ssh remote, since you'll get all of last month's nice features that make XMPP unnecessary in that configuration.

Today's work was sponsored by Geoffrey Irving.

Posted Fri May 16 21:23:47 2014

Spent the day testing the sshpasswd branch. A few interesting things:

  • I was able to get rid of 10 lines of Windows specific code, which had been necessary for console ssh password prompting to work. Yay!
  • git-remote-gcrypt turned out to be broken when there is no controlling tty. --no-tty has to be passed to gpg to avoid it falling over in this case, even when a gpg agent is available to be used. I fixed this with a new release of git-remote-gcrypt.

Mostly the new branch just worked! And is merged...

Merged a patch from Robie Basak that adds a new special remote that's sort of like bup but supports deletion: ddar

Backlog: 172

Today's work was sponsored by Andrew Cant.

Posted Thu May 15 20:39:48 2014

My backlog is massive -- 181 items to answer. Will probably take the rest of the month to get caught back up. Rather than digging into that yet, spent today working on the webapp's ssh password prompting.

I simplified it so the password is entered on the same form as the rest of the server's information. That made the UI easy to build, but means that when a user already has a ssh key they want to use, they need to select "existing ssh key"; the webapp no longer probes to automatically detect that case.

Got the ssh password prompting in the webapp basically working, and it's a really nice improvement! I even got it to work on Windows (eventually...). It's still only in the sshpassword branch, since I need to test it more and probably fix some bugs. In particular, when enabling a remote that already exists, I think it never prompts for the password yet.

Today's work was sponsored by Nicola Chiapolini.

Posted Wed May 14 22:22:20 2014

I have a preliminary design for requests routing. Won't be working on it immediately, but simulations show it can work well in a large ad-hoc network.

Posted Tue May 6 21:08:49 2014

Sören Brunk's massive bootstrap 3 patch has landed! This is a 43 thousand line diff, with 2 thousand lines after the javascript and CSS libraries are filtered out. Either way, the biggest patch contributed by anyone to git-annex so far, and excellent work.

Meanwhile, I built a haskell program to simulate a network of highly distributed git-annex nodes with ad-hoc connections and the selective file syncing algorithm now documented at the bottom of efficiency.

Currently around 33% of requested files never get to their destination in this simulation, but this is probably because its network is randomly generated, and so contains disconnected islands. So next, some data entry, from a map that involves an Amazon not in .com, dotted with names of people I have recently met... :)
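The effect of disconnected islands can be shown with a minimal reachability check. This is only a sketch and not the simulation described above: the `Node`, `Edge`, `reachable` and `canDeliver` names are made up for illustration, and the network is modelled as a plain list of undirected links.

```haskell
import Data.List (nub)

-- Hypothetical model: nodes are numbered, links are undirected pairs.
type Node = Int
type Edge = (Node, Node)

-- Nodes directly linked to the given node, in either direction.
neighbours :: [Edge] -> Node -> [Node]
neighbours edges n =
  [b | (a, b) <- edges, a == n] ++ [a | (a, b) <- edges, b == n]

-- All nodes reachable from a start node, found by repeatedly expanding
-- a frontier until no new nodes turn up.
reachable :: [Edge] -> Node -> [Node]
reachable edges start = go [start] [start]
  where
    go seen [] = seen
    go seen frontier =
      let new = nub [m | n <- frontier, m <- neighbours edges n, m `notElem` seen]
      in go (seen ++ new) new

-- A request can only ever be satisfied if both ends are on the same island.
canDeliver :: [Edge] -> Node -> Node -> Bool
canDeliver edges from to = to `elem` reachable edges from
```

With links `[(1,2),(2,3),(4,5)]`, node 1 can reach node 3 but never node 5, no matter how good the routing is.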

Posted Fri May 2 18:57:16 2014

I've moved out of implementation mode (unable to concentrate enough), and into high-level design mode.

Syncing efficiency has been an open TODO for years, to find a way to avoid flood filling the network, and find more efficient ways to ensure data only gets to the nodes that want it. Relatedly, Android devices often need a way to mark individual files they want to have. Had a very productive discussion with Vince and Fernao and I think we're heading toward a design that will address both these needs, as well as some more Brazil-specific use cases, about which more later.

Today's work was sponsored by Casa do Boneco.

Posted Thu May 1 17:23:47 2014

Reviewed Sören's updated bootstrap3 patch, which appeared while I was traveling. Sören kindly fixed it to work with Debian stable's old version of Yesod, which was quite a lot of work. The new bootstrap3 UI looks nice. I found a few minor issues, but expect to be able to merge it soon.

Started on sshpassword groundwork. Added a simple password cache to the assistant, with automatic expiration, and made git-annex be able to be run by ssh as the SSH_ASKPASS program.
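A password cache with automatic expiration could look roughly like the following. This is a sketch under assumptions, not the assistant's actual code: the `CachedPassword` type and both functions are hypothetical, and times are plain seconds to keep the example self-contained.

```haskell
-- Hypothetical sketch of a password cache with automatic expiration.
-- Times are plain seconds so the example needs only the Prelude.
type Seconds = Integer

data CachedPassword = CachedPassword
  { cachedPassword :: String
  , expiresAt      :: Seconds
  } deriving (Show, Eq)

-- Remember a password for a limited lifetime, starting now.
cachePassword :: Seconds -> Seconds -> String -> CachedPassword
cachePassword now lifetime pw = CachedPassword pw (now + lifetime)

-- Look up the password; an expired entry behaves as if nothing was cached,
-- which is the point where the user has to be prompted again.
lookupPassword :: Seconds -> Maybe CachedPassword -> Maybe String
lookupPassword _ Nothing = Nothing
lookupPassword now (Just c)
  | now < expiresAt c = Just (cachedPassword c)
  | otherwise         = Nothing
```

Because a lookup can return `Nothing` at any time, every caller that needs the password has to be prepared to prompt for it.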

The main difficulty will be changing the webapp's UI to prompt for the ssh password when one is needed. There are several code paths in ssh remote setup where a password might be needed. Since the cached password expires, it may need to be prompted for at any of those points. Since a new page is loading, it can't pop up a prompt on the current page; it needs to redirect to a password prompt page and then redirect back to the action that needed the password. ...At least, that's one way to do it. I'm going to sleep on it and hope I dream up a better way.

Posted Tue Apr 29 22:33:53 2014

Today was mostly spent driving across Brazil, but I had energy this evening for a little work on git-annex.

Made the assistant delete old temporary files on startup. I've had scattered reports of a few users whose .git/annex/tmp contained many files, apparently put there by the assistant when it locks down a file prior to annexing it. That seems like it could be a bug -- or it could just be unclean shutdowns interrupting the assistant. Anyway, this will deal with any source of tmp cruft, and I made sure to preserve tmp files for partially downloaded content.
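The selection of files to delete can be sketched as a pure filter. Everything here is an assumption made for illustration: the `TmpFile` record, the age cutoff, and the flag marking partially downloaded content are hypothetical and do not show how git-annex really tells the two kinds of file apart.

```haskell
type Seconds = Integer

-- Hypothetical description of one file found in the tmp directory.
data TmpFile = TmpFile
  { tmpName          :: FilePath
  , tmpModified      :: Seconds  -- modification time
  , tmpPartialObject :: Bool     -- True for a partially downloaded object
  } deriving (Show, Eq)

-- Files that are safe to delete at startup: older than the cutoff and
-- not a partial download that a later transfer could resume.
staleTmpFiles :: Seconds -> Seconds -> [TmpFile] -> [TmpFile]
staleTmpFiles now maxAge = filter stale
  where
    stale f = not (tmpPartialObject f) && now - tmpModified f > maxAge
```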

Posted Mon Apr 28 01:12:55 2014

Next month the roadmap has me working on sshpassword. That will be a nice UI improvement and I'd be very surprised if it takes more than a week, which is great.

Getting a jump on it today, investigating using SSH_ASKPASS. It seems this will even work on Windows! Preliminary design in sshpassword.

Time to get on a plane to a plane to a plane to Brasilia!

Posted Fri Apr 25 20:32:36 2014

Now git-annex's self-upgrade code will check the gpg signature of a new version before using it.

To do this I had to include the gpg public keys into the git-annex distribution, and that raised the question of which public keys to include. Currently I have both the dedicated git-annex distribution signing key, and my own gpg key as a backup in case I somehow misplace the former.

Also spent a while looking at the recent logs on the web server. There seem to be around 600 users of the assistant with upgrade checking enabled. That breaks down to 68% Linux amd64, 20% Linux i386, 11% OSX Mavericks, and 0.5% OSX Lion.

Most are upgrading successfully, but there are a few that seem to repeatedly fail for some reason. (Not counting the OSX Lion, which will probably never find an upgrade available.) I hope that someone who is experiencing an upgrade failure gets in touch with some debug logs.

In the same time period, around 450 unique hosts manually downloaded a git-annex distribution. Also compare with Debian popcon, which has 1200 reporting git-annex users.

Posted Wed Apr 23 21:10:28 2014

I hope this will be a really good release. Didn't get all the way to telehash this month, but the remotedaemon is pretty sweet. Updated roadmap pushes telehash back again.

The files in this release are now gpg signed, after recently moving the downloads site to a dedicated server, which has a dedicated gpg key. You can verify the detached signatures as an additional security check over trusting SSL. The automatic upgrade code doesn't check the gpg signatures yet.

Sören Brunk has ported the webapp to Bootstrap 3.
The branch is not ready for merging yet (it would break the Debian stable backports), but that was a nice surprise.

Posted Mon Apr 21 21:23:22 2014

Sometimes you don't notice something is missing for a long time until it suddenly demands attention. Like today.

Seems the webapp never had a way to stop using XMPP and delete the XMPP password. So I added one.

The new support for instantly noticing changes on a ssh remote forgot to start up a connection to a new remote after it was created. Fixed that.

(While doing some testing on Android for unrelated reasons, I noticed that my android tablet was pushing photos to a ssh server and my laptop immediately noticed and downloaded them from there, which is an excellent demo. I will deploy this on my trip in Brazil next week. Yes, I'm spending 2 weeks in Brazil with git-annex users; more on this later.)

Finally, it turns out that "installing" git-annex from the standalone tarball, or DMG, on a server didn't make it usable by the webapp. Because git-annex shell is not in PATH on the server, and indeed git and rsync may not be in PATH either if they were installed with the git-annex bundle. Fixed this by making the bundle install a ~/.ssh/git-annex-wrapper, which the webapp will detect and use.

Also, quite a lot of other bug chasing activity.

Today's work was sponsored by Thomas Koch.

Posted Sun Apr 20 22:53:13 2014

Worked through message backlog today. Got it down from around 70 to just 37. Was able to fix some bugs, including making the webapp start up more robustly in some misconfigurations.

Added a new findref command which may be useful in a git update hook to deny pushes of refs if the annexed content has not been sent first.

BTW, I also added a new reinit command a few days ago, which can be useful if you're cloning back a deleted repository.

Also a few days ago, I made uninit a lot faster.

Posted Thu Apr 17 22:48:44 2014

After fixing a few bugs in the remotecontrol branch, it's landed in master. Try a daily build today, and see if the assistant can keep in sync using nothing more than a remote ssh repository!

So, now all the groundwork for telehash is laid too. I only need a telehash library to start developing on top of. Development on telehash-c is continuing, but I'm more excited that htelehash has been revived and is being updated to the v2 protocol, seemingly quite quickly.

Posted Tue Apr 15 01:26:57 2014

Made ssh connection caching be used in several more places. git annex sync will use it when pushing/pulling to a remote, as will the assistant. And git-annex remotedaemon also uses connection caching. So, when a push lands on a ssh remote, the assistant will immediately notice it, and pull down the change over the same TCP connection used for the notifications.

This was a bit of a pain to do. Had to set GIT_SSH=git-annex and then when git invokes git-annex as ssh, it runs ssh with the connection caching parameters.
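The wrapper trick can be sketched as follows. The `-S`, `ControlMaster` and `ControlPersist` options are standard OpenSSH connection sharing options; the function names and the way the socket path is passed in are assumptions made for this example.

```haskell
-- Options asking ssh to share one cached connection through a control socket.
sshCachingParams :: FilePath -> [String]
sshCachingParams socketFile =
  [ "-S", socketFile
  , "-o", "ControlMaster=auto"
  , "-o", "ControlPersist=yes"
  ]

-- What a GIT_SSH wrapper could run: git passes the host and remote command
-- to the wrapper, and the wrapper puts the caching options in front of them.
wrappedSshCommand :: FilePath -> [String] -> [String]
wrappedSshCommand socketFile gitArgs =
  "ssh" : sshCachingParams socketFile ++ gitArgs
```

For example, `wrappedSshCommand "/tmp/sock" ["host", "git-upload-pack 'repo'"]` yields the ssh command line with the three caching options inserted before the host.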

Also, improved the network-manager and wicd code, so it detects when a connection has gone down. That propagates through to the remote-daemon, which closes all ssh connections. I need to also find out how to detect network connections/disconnections on OSX..

Otherwise, the remote-control branch seems ready to be merged. But I want to test it for a while first.

Followed up on yesterday's bug with writing some test cases for Utility.Scheduled, which led to some more bug fixes. Luckily nothing I need to rush out a release over. In the end, the code got a lot simpler and clearer.

-- Check if the new Day occurs one month or more past the old Day.
oneMonthPast :: Day -> Day -> Bool
new `oneMonthPast` old = fromGregorian y (m+1) d <= new
  where
        (y,m,d) = toGregorian old

Today's work was sponsored by Asbjørn Sloth Tønnesen.

Posted Sat Apr 12 22:45:29 2014

Pushed out a new release today, fixing two important bugs, followed by a second release which fixed the bugs harder.

Automatic upgrading was broken on OSX. The webapp will tell you upgrading failed, and you'll need to manually download the .dmg and install it.

With help from Maximiliano Curia, finally tracked down a bug I have been chasing for a while where the assistant would start using a lot of CPU while not seeming to be busy doing anything. Turned out to be triggered by a scheduled fsck that was configured to run once a month with no particular day specified.

That bug turned out to affect users who first scheduled such a fsck job after the 11th day of the month. So I expedited putting a release out to avoid anyone else running into it starting tomorrow.

(Oddly, the 11th day of this month also happens to be my birthday. I did not expect to have to cut 2 releases today..)

Posted Fri Apr 11 23:02:46 2014

The git-remote-daemon now robustly handles loss of signal, with reconnection backoffs. And it detects if the remote ssh server has too old a version of git-annex-shell and the webapp will display a warning message.
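Reconnection backoff is commonly done by doubling the delay after each failed attempt, up to a cap. The sketch below shows that general technique; the actual schedule used by the daemon is not stated here, so the base delay, the cap and the function names are all assumptions.

```haskell
-- Delay in seconds before the nth reconnection attempt (counting from 0):
-- doubles each time, starting at a base delay, capped at a maximum.
-- The exponent is clamped so the arithmetic cannot overflow or go negative.
backoffDelay :: Int -> Int -> Int -> Int
backoffDelay base cap attempt = min cap (base * 2 ^ max 0 (min attempt 30))

-- The whole (infinite) schedule of delays for successive attempts.
reconnectDelays :: Int -> Int -> [Int]
reconnectDelays base cap = map (backoffDelay base cap) [0 ..]
```

With a base of 1 second and a cap of 60, the first delays are 1, 2, 4, 8, 16, 32 and then 60 from there on.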

Also, made the webapp show a network signal bars icon next to both ssh and xmpp remotes that it's currently connected with. And, updated the webapp's nudging to set up XMPP to now suggest either an XMPP or a ssh remote.

I think that the remotecontrol branch is nearly ready for merging!

Today's work was sponsored by Paul Tagliamonte.

Posted Wed Apr 9 20:34:46 2014

git-remote-daemon is tied into the assistant, and working! Since it's not really ready yet, this is in the remotecontrol branch.

My test case for this is two client repositories, both running the assistant. Both have a bare git repository, accessed over ssh, set up as their only remote, and no other way to keep in touch with one another. When I change a file in one repository, the other one instantly notices the change and syncs.

This is gonna be awesome. Much less need for XMPP. Windows will be fully usable even without XMPP. Also, most of the work I did today will be fully reused when the telehash backend gets built. The telehash-c developer is making noises about it being almost ready for use, too!

Today's work was sponsored by Frédéric Schütz.

Posted Tue Apr 8 22:27:21 2014

Various bug triage today. Was not good for much after shuffling paper for the whole first part of the day, but did get a few little things done.

Re, git-annex does not use OpenSSL itself, but when using XMPP, the remote server's key could have been intercepted using this new technique. Also, the git-annex autobuilds and this website are served over https -- working on generating new https certificates now. Be safe out there..

Posted Mon Apr 7 23:16:10 2014

Built git-annex remotedaemon command today. It's buggy, but it already works! If you have a new enough git-annex-shell on a remote server, you can run "git annex remotedaemon" in a git-annex repository, and it will notice any pushes that get made to that remote from any other clone, and pull down the changes.

Posted Sun Apr 6 23:16:31 2014

Added git-annex-shell notifychanges command, which uses inotify (etc) to detect when git refs have changed, and informs the caller about the changes. This was relatively easy to write; I reused the existing inotify code, and factored out code for simple line-based protocols from the external special remote protocol. Also implemented the git-remote-daemon protocol. 200 lines of code total.

Meanwhile, Johan Kiviniemi improved the dbus notifications, making them work on Ubuntu and adding icons. Awesome!

There's going to be some fun to get git-annex-shell upgraded so that the assistant can use this new notify feature. While I have not started working on the assistant side of this, you can get a jump by installing today's upcoming release of git-annex. I had to push this out early because there was a bug that prevented the webapp from running on non-gnome systems. Since all changes in this release only affected Linux, today's release will be a Linux-only release.

Posted Sat Apr 5 20:58:08 2014

I have a plan for this month. While waiting for telehash, I am going to build git-remote-daemon, which is the infrastructure git-annex will need, to use telehash. Since it's generalized to support other protocols, I'll be able to start using it before telehash is ready.

In fact, I plan to first make it work with ssh:// remotes, where it will talk with git-annex-shell on the remote server. This will let the assistant immediately know when the server has received a commit, and that will simplify using the assistant with a ssh server -- no more need for XMPP in this case! It should also work with git-remote-gcrypt encrypted repositories, so also covers the case of an untrusted ssh server where everything is end-to-end encrypted.

Building the git-annex-shell part of this should be pretty easy, and building enough of the git-remote-daemon design to support it also not hard.

Posted Thu Apr 3 23:04:11 2014

Got caught up on all recent bugs and questions, although I still have a backlog of 27 older things that I really should find time for.

Fixed a couple of bugs. One was that the assistant set up ssh authorized_keys that didn't work with the fish shell.

Also got caught up on the current state of telehash-c. Have not quite gotten it to work, but it seems pretty close to being able to see it do something useful for the first time.

Pushing out a release this evening with a good number of changes left over from March.

Posted Wed Apr 2 21:14:45 2014

Last week's trip was productive, but I came home more tired than I realized. Found myself being snappy & stressed, so I have been on break.

I did do a little git-annex dev in the past 5 days. On Saturday I implemented preferred content (although without the active checks I think it probably ought to have.) Yesterday I had a long conversation with the Tahoe developers about improving git-annex's tahoe integration.

Today, I have been wrapping up building propellor. To test its docker support, I used propellor to build and deploy a container that is a git-annex autobuilder. I'll be replacing the old autobuilder setup with this shortly, and expect to also publish docker images for git-annex autobuilders, so anyone who wants to can run their own autobuilder really easily.

I have April penciled in on the roadmap as the month to do telehash. I don't know if telehash-c is ready for me yet, but it has had a lot of activity lately, so this schedule may still work out!

Posted Wed Apr 2 01:17:39 2014

Catching up on conference backlog. 36 messages backlog remains.

Fixed git-annex-shell configlist to automatically initialize a git remote when a git-annex branch had been pushed to it. This is necessary for gitolite to be easy to use, and I'm sure it used to work.

Updated the Debian backport and made a Debian package of the fdo-notify haskell library used for notifications.

Applied a patch from Alberto Berti to fix support for tahoe-lafs 1.10.

And various other bug fixes and small improvements.

Posted Wed Mar 26 21:04:47 2014

Attended the f-droid sprint at LibrePlanet, and have been getting a handle on how their build server works with an eye toward adding git-annex to it. Not entirely successful getting vagrant to build an image yet.

Posted Sun Mar 23 22:17:55 2014

Yesterday coded up one nice improvement on the plane -- git annex unannex (and uninit) is now tons faster. Before it did a git commit after every file processed, now there's just 1 commit at the end. This required using some locking to prevent the pre-commit hook from running in a confusing state.

Today, LibrePlanet and a surprising amount of development. I've added file manager integration, only for Nautilus so far. The main part of this was adding --notify-start and --notify-finish, which use dbus desktop notifications to provide feedback.

(Made possible thanks to Max Rabkin for updating fdo-notify to use the new dbus library, and ion for developing the initial Nautilus integration scripts.)

Today's work and LibrePlanet visit was sponsored by Jürgen Lüters.

Posted Sat Mar 22 20:21:46 2014

Yesterday, worked on cleaning up the todo list. Fixed Windows slash problem with rsync remotes. Today, more Windows work; it turns out to have been quite buggy in its handling of non-ASCII characters in filenames. Encoding stuff is never easy for me, but I eventually managed to find a way to fix that, although I think there are other filename encoding problems lurking in git-annex on Windows still to be dealt with.

Implemented an interesting metadata feature yesterday. It turns out that metadata can have metadata. Particularly, it can be useful to know when a field was last set. That was already being tracked internally (to make union merging work), so I was able to quite cheaply expose it as "$field-lastchanged" metadata that can be used like any other metadata.

I've been thinking about how to implement required content expressions, and think I have a reasonably good handle on it.

Posted Wed Mar 19 20:56:12 2014

The website broke and I spent several hours fixing it, changing the configuration to not let it break like this again, cleaning up after it, etc.

Did manage to make a few minor bugfixes and improvements, but nothing stunning.

I'll be attending LibrePlanet at MIT this weekend.

Posted Mon Mar 17 23:25:10 2014

Added some power and convenience to preferred content expressions.

Before, "standard" was a special case. Now it's a first-class keyword, so you can do things like "standard or present" to use the standard preferred content expression, modified to also want any file that happens to be present.

Also added a way to write your own reusable preferred content expressions, tied to groups. To make a repository use them, set its preferred content to "groupwanted". Of course, "groupwanted" is also a first-class keyword, so "not groupwanted" or something can also be done.

While I was at it, I made vicfg show the built-in standard preferred content expressions, for reference. This little IDE should be pretty self-explanatory, I hope.

So, preferred content is almost its own little programming language now. Except I was careful to not allow recursion. ;)
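As a rough model of such a language, the sketch below treats "standard" and "groupwanted" as keywords that expand to other expressions, and refuses to expand a keyword found inside an expansion. That rule is just one way to rule out recursion and is an assumption of this example, as are the `Expr` and `Env` types; it is not git-annex's parser or matcher.

```haskell
-- Hypothetical model of a tiny preferred content expression language.
data Expr
  = Present          -- the file is already in this repository
  | Standard         -- the built-in expression for the repository's group
  | GroupWanted      -- the user-defined expression for the repository's group
  | Not Expr
  | And Expr Expr
  | Or Expr Expr
  deriving (Show, Eq)

data Env = Env
  { isPresent   :: Bool
  , standardFor :: Expr   -- what "standard" stands for
  , groupWanted :: Expr   -- what "groupwanted" stands for
  }

-- Keywords are expanded at most once. A keyword inside an expansion makes
-- the whole expression invalid (Nothing), so evaluation always terminates.
eval :: Env -> Expr -> Maybe Bool
eval env = go True
  where
    go _ Present          = Just (isPresent env)
    go expand Standard    = expandOnce expand (standardFor env)
    go expand GroupWanted = expandOnce expand (groupWanted env)
    go expand (Not e)     = not <$> go expand e
    go expand (And a b)   = (&&) <$> go expand a <*> go expand b
    go expand (Or a b)    = (||) <$> go expand a <*> go expand b
    expandOnce True e  = go False e
    expandOnce False _ = Nothing
```

In this model, "standard or present" is `Or Standard Present`, and "not groupwanted" is `Not GroupWanted`.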

Posted Sat Mar 15 21:46:46 2014

Did some more exploration and perf tuning and thinking on caching databases, and am pretty sure I know how I want to implement it. Will be several stages, starting with using it for generating views, and ending(?) with using it for direct mode file mappings.

Not sure I'm ready to dive into that yet, so instead spent the rest of the day working on small bugfixes and improvements. Only two significant ones..

Made the webapp use a constant time string comparison (from securemem) to check if its auth token is valid. This could help avoid a potential timing attack to guess the auth token, although that is theoretical. Just best practice to do this.
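To illustrate what a constant time comparison means, here is a minimal sketch that looks at every character instead of returning at the first mismatch. It is not the securemem implementation. Laziness and compiler optimisations make real timing guarantees hard to get in plain Haskell, so treat it as an explanation of the idea only. It also reveals whether the lengths differ.

```haskell
import Data.Bits (xor, (.|.))
import Data.Char (ord)
import Data.List (foldl')

-- Compare two strings without stopping at the first difference: every
-- character pair is combined into an accumulator, which is zero only if
-- all pairs matched.
constantTimeEq :: String -> String -> Bool
constantTimeEq a b = length a == length b && foldl' (.|.) 0 diffs == 0
  where
    diffs = zipWith (\x y -> ord x `xor` ord y) a b
```

An ordinary `==` on strings can return as soon as one character differs, which is what lets an attacker measure how much of a guessed token was right.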

Seems that openssh 6.5p1 had another hidden surprise (in addition to its now-fixed bug in handling hostnames in .ssh/config) -- it broke the method git-annex was using for stopping a cached ssh connection, which led to some timeouts for failing DNS lookups. If git-annex seems to stall for a few seconds at startup/shutdown, that may be why (--debug will say for sure). I seem to have found a workaround that avoids this problem.

Posted Thu Mar 13 23:45:48 2014

Updated the Debian stable backport to the last release. Also it seems that the last release unexpectedly fixed XMPP SIGILL on some OSX machines. Apparently when I rebuilt all the libraries recently, it somehow fixed that old unsolved bug.

RichiH suggested "wrt ballooning memory on repair: can you read in broken stuff and simply stop reading once you reach a certain threshold, then start repairing, re-run fsck, etc?" .. I had considered that but was not sure it would work. I think I've gotten it to work.
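RichiH's suggestion amounts to a loop like the one sketched below. The `scan` and `repairBatch` parameters are hypothetical stand-ins for running git fsck and for the repair step; this is a model of the control flow, not git-repair's code.

```haskell
-- Repair in bounded batches: read at most 'threshold' broken objects,
-- repair those, then scan again, until a scan finds nothing.
-- Returns the final state and the number of passes that were needed.
-- Assumes each repair pass makes progress; otherwise this would loop.
repairInBatches
  :: Int                     -- threshold: most objects held at once
  -> (repo -> [obj])         -- scan: lists objects that are still broken
  -> ([obj] -> repo -> repo) -- repairBatch: fixes one batch
  -> repo
  -> (repo, Int)
repairInBatches threshold scan repairBatch = go 0
  where
    go passes repo =
      case take threshold (scan repo) of
        []    -> (repo, passes)
        batch -> go (passes + 1) (repairBatch batch repo)
```

Since `take` only demands the first `threshold` elements of a lazily produced list, the rest of the scan output never has to be held in memory at once.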

Now working on a design for using a caching database for some parts of git-annex. My initial benchmarks using SQLite indicate it would slow down associated file lookups by nearly an order of magnitude compared with the current ".map files" implementation. (But would scale better in edge cases). OTOH, using a SQLite database to index metadata for use in views looks very promising.

Posted Wed Mar 12 22:20:32 2014

Squashed three or four more bugs today. Unanswered message backlog is down to 27.

The most interesting problem today is that the git-repair code was using too much memory when git-fsck output a lot of problems (300 thousand!). I managed to halve the memory use in the worst case (and reduced it much more in more likely cases). But, don't really feel I can close that bug yet, since really big, really badly broken repositories can still run it out of memory. It would be good to find a way to reorganize the code so that the broken objects list streams through git-repair and never has to all be buffered in memory at once. But this is not easy.

Posted Mon Mar 10 21:41:18 2014

Release made yesterday, but only finished up the armel build today. And it turns out the OSX build was missing the webapp, so it's also been updated today.

Post release bug triage including:

  • Added a nice piece of UI to the webapp on user request: A "Sync now" menu item in the repository for each repo. (The one for the current repo syncs with all its remotes.)
  • Copying files to a git repository on the same computer turns out to have had a resource leak issue, that caused 1 zombie process per file. With some tricky monad state caching, fixed that, and also eliminated 8% of the work done by git-annex in this case.
  • Fixed git annex unused in direct mode to not think that files that were deleted out of the work tree by the user still existed and were unused.

Posted Fri Mar 7 20:27:58 2014

Preparing for a release (probably tomorrow or Friday).

Part of that was updating the autobuilders. Had to deal with the gnutls security hole fix, and upgrading that on the OSX autobuilder turned out to be quite complicated due to library version skew. Also, I switched the linux autobuilders over to building from Debian unstable, rather than stable. That should be ok to do now that the standalone build bundles all the libraries it needs... And the arm build has always used unstable, and has been reported working on a lot of systems. So I think this will be safe, but have backed up the old autobuilder chroots just in case.

Also been catching up on bug reports and traffic and dealt with quite a lot of things today. Smarter log file rotation for the assistant, better webapp behavior when git is not installed, and a fix for the webdav 5 second timeout problem.

Perhaps the most interesting change is a new annex.startupscan setting, which can be disabled to prevent the assistant from doing the expensive startup scan. This means it misses noticing any files that changed since it last ran, but this should be useful for those really big repositories.

(Last night, did more work on the test suite, including even more checking of merge conflict resolution.)

Today's work was sponsored by Michael Alan Dorman.

Posted Wed Mar 5 22:45:22 2014

Yesterday I learned of a nasty bug in handling of merges in direct mode. It turns out that if the remote repository has added a file, and there is a conflicting file in the local work tree, which has not been added to git, the local file was overwritten when git-annex did a merge. That's really bad, I'm very unhappy this bug lurked undetected for so long.

Understanding the bug was easy. Fixing it turned out to be hard, because the automatic merge conflict resolution code was quite a mess. In particular, it wrote files to the work tree, which made it difficult for a later stage to detect and handle the abovementioned case. Also, the automatic merge resolution code had weird asymmetric structure that I never fully understood, and generally needed to be stared at for an hour to begin to understand it.

In the process of cleaning that up, I wrote several more tests, to ensure that every case was handled correctly. Coverage was about 50% of the cases, and should now be 100%.

To add to the fun, a while ago I had dealt with a bug on FAT/Windows where it sometimes lost the symlink bit during automatic merge resolution. Except it turned out my test case for it had a heisenbug, and I had not actually fixed it (I think). In any case, my old fix for it was a large part of the ugliness I was cleaning up, and had to be rewritten. Fully tracking down and dealing with that took a large part of today.

Finally this evening, I added support for automatically handling merge conflicts where one side is an annexed file, and the other side has the same filename committed to git in the normal way. This is not an important case, but it's worth it for completeness. There was an unexpected benefit to doing it; it turned out that the weird asymmetric part of the code went away.

The final core of the automatic merge conflict resolver has morphed from a mess I'd not want to paste here to a quite concise and easy to follow bit of code.

        case (kus, kthem) of
                -- Both sides of conflict are annexed files
                (Just keyUs, Just keyThem) -> resolveby $
                        if keyUs == keyThem
                                then makelink keyUs
                                else do
                                        makelink keyUs
                                        makelink keyThem
                -- Our side is annexed file, other side is not.
                (Just keyUs, Nothing) -> resolveby $ do
                        graftin them file
                        makelink keyUs
                -- Our side is not annexed file, other side is.
                (Nothing, Just keyThem) -> resolveby $ do
                        graftin us file
                        makelink keyThem
                -- Neither side is annexed file; cannot resolve.
                (Nothing, Nothing) -> return Nothing

Since the bug that started all this is so bad, I want to make a release pretty soon.. But I will probably let it soak and whale on the test suite a bit more first. (The fix for this bug is also probably worth backporting to old versions of git-annex in eg Debian stable.)

Posted Wed Mar 5 00:06:55 2014

Worked on metadata and views. Besides bugfixes, two features of note:

Made git-annex run a hook script, pre-commit-annex. And I wrote a sample script that extracts metadata from lots of kinds of files, including photos and sound files, using extract(1) to do the heavy lifting. See automatically adding metadata.

Views can be filtered to not include a tag or a field. For example, git annex view tag=* !old year!=2013

Today's work was sponsored by Stephan Schulz

Posted Mon Mar 3 00:22:04 2014

Did not plan to work on git-annex today..

Unexpectedly ended up making the webapp support HTTPS. Not by default, but if a key and certificate are provided, it'll use them. Great for using the webapp remotely! See the new tip: remote webapp setup.

Also removed support for --listen with a port, which was buggy and not necessary with HTTPS.

Also fixed several webapp/assistant bugs, including one that let it be run in a bare git repository.

And, made the quvi version be probed at runtime, rather than compile time.

Posted Sat Mar 1 03:07:54 2014

Pushed a release today. Rest of day spent beating head against Windows XMPP brick wall.

Actually made a lot of progress -- Finally found the right approach, and got a clean build of the XMPP haskell libraries. But.. ghc fails to load the libraries when running Template Haskell. "Misaligned section: 18206e5b". Filed a bug report, and I'm sure this alignment problem can be fixed, but I'm not hopeful about fixing it myself.

One workaround would be to use the EvilSplicer, building once without the XMPP library linked in, to get the TH splices expanded, and then a second time with the XMPP library and no TH. Made a winsplicehack branch with tons of ifdefs that allows doing this. However, several dozen haskell libraries would need to be patched to get it to work. I have the patches from Android, but would rather avoid doing all that again on Windows.

Another workaround would be to move XMPP into a separate process from the webapp. This is not very appealing either, the IPC between them would be fairly complicated since the webapp does stuff like show lists of XMPP buddies, etc. But, one thing this idea has to recommend it is I am already considering using a separate helper daemon like this for Telehash.

So there could be synergies between XMPP and Telehash support, possibly leading to some kind of plugin interface in git-annex for this sort of thing. But then, once Telehash or something like it is available and working well, I plan to deprecate XMPP entirely. It's been a flakey pain from the start, so that can't come too soon.

Posted Thu Feb 27 21:43:59 2014

Not a lot accomplished today. Some release prep, followed up to a few bug reports.

Split git-annex's .git/annex/tmp into two directories. .git/annex/tmp will now be used only for partially transferred objects, while .git/annex/misctmp will be used for everything else. In particular this allows symlinking .git/annex/tmp to a ram disk, if you want to do that. (It's not possible for .git/annex/misctmp to be on a different filesystem from the rest of the repository for various reasons.)

Beat on Windows XMPP for several more painful hours. Got all the haskell bindings installed, except for gnuidn. And patched network-client-xmpp to build without gnuidn. Have not managed to get it to link.

Posted Thu Feb 27 01:11:08 2014

More Windows porting. Made the build completely -Wall safe on Windows. Fixed some DOS path separator bugs that were preventing WebDav from working. Have now tested both and Amazon S3 to be completely working in the webapp on Windows.

Posted Tue Feb 25 21:54:58 2014

Turns out that in the last release I broke making, Amazon S3 and Glacier remotes from the webapp. Fixed that.

Also, dealt with changes in the haskell DAV library that broke support for, and worked around an exception handling bug in the library.

I think I should try to enhance the test suite so it can run live tests on special remotes, which would at least have caught some of these recent problems...

Since metadata is tied to a particular key, editing an annexed file, which causes the key to change, made the metadata seem to get lost.

I've now fixed this; it copies the metadata from the old version to the new one. (Taking care to copy the log file identically, so git can reuse its blob.)

That meant that git annex add has to check every file it adds to see if there's an old version. Happily, that check is fairly fast; I benchmarked my laptop running 2500 such checks a second. So it's not going to slow things down appreciably.

Posted Mon Feb 24 23:36:58 2014

When generating a view, there's now a way to reuse part of the directory hierarchy of the parent branch. For example, git annex view tag=* podcasts/=* makes a view where the first level is the tags, and the second level is whatever podcasts/* directories the files were in.

Also, year and month metadata can be automatically recorded when adding files to the annex. I made this only be done when annex.genmetadata is turned on, to avoid polluting repositories that don't want to use metadata.

It would be nice if there was a way to add a hook script that's run when files are added, to collect their metadata. I am not sure yet if I am going to add that to git-annex though. It's already possible to do via the regular git post-commit hook. Just make it look at the commit to see what files were added, and then run git annex metadata to set their metadata appropriately. It would be good to at least have an example of such a script to eg, extract EXIF or ID3 metadata. Perhaps someone can contribute one?

Posted Sun Feb 23 04:25:46 2014

Spent the day catching up on the last week or so's traffic. Ended up making numerous small bug fixes and improvements. Message backlog stands at 44.

Here's the screencast demoing views!

Added to the design today the idea of automatically deriving metadata from the location of files in the master branch's directory tree. Eg, git annex view tag=* podcasts/=* in a repository that has a podcasts/ directory would make a tree like "$tag/$podcast". Seems promising.

So much still to do with views.. I have belatedly added them to the roadmap for this month; doing Windows and Android in the same month was too much to expect.

Posted Thu Feb 20 20:39:57 2014

Still working on views. The most important addition today is that git annex precommit notices when files have been moved/copied/deleted in a view, and updates the metadata to reflect the changes.

Also wrote some walkthrough documentation: metadata driven views.
And, recorded a screencast demoing views, which I will upload next time I have bandwidth.

Posted Thu Feb 20 01:56:42 2014

Today I built git annex view, and git annex vadd and a few related commands. A quick demo:

Chaos_Communication_Congress/  FOSDEM/       Linux_Conference_Australia/
Debian/                        LibrePlanet/
joey@darkstar:~/lib/talks>git annex view tag=*
view  (searching...)
Switched to branch 'views/_'
joey@darkstar:~/lib/talks#_>tree -d
|-- Debian
|-- android
|-- bigpicture
|-- debhelper
|-- git
|-- git-annex
`-- seen

7 directories
joey@darkstar:~/lib/talks#_>git annex vadd author=*
Switched to branch 'views/author=_;_'
joey@darkstar:~/lib/talks#author=_;_>tree -d
|-- Benjamin Mako Hill
|   `-- bigpicture
|-- Denis Carikli
|   `-- android
|-- Joey Hess
|   |-- Debian
|   |-- bigpicture
|   |-- debhelper
|   |-- git
|   `-- git-annex
|-- Richard Hartmann
|   |-- git
|   `-- git-annex
`-- Stefano Zacchiroli
    `-- Debian

15 directories
joey@darkstar:~/lib/talks#author=_;_>git annex vpop
vpop 1
Switched to branch 'views/_'
joey@darkstar:~/lib/talks#_>git annex vadd tag=git-annex
Switched to branch 'views/(git-annex)'
joey@darkstar:~/lib/talks#_>git annex vpop 2
vpop 2
Switched to branch 'master'

Not 100% happy with the speed -- the generation of the view branch is close to optimal, and fast enough (unless the branch has very many matching files). And vadd can be quite fast if the view has already limited the total number of files to a smallish amount. But view has to look at every file's metadata, and this can take a while in a large repository. Needs indexes.

It also needs integration with git annex sync, so the view branches update when files are added to the master branch, and moving files around inside a view and committing them does not yet update their metadata.

Today's work was sponsored by Daniel Atlas.

Posted Wed Feb 19 01:58:19 2014

Working on building metadata filtered branches.

Spent most of the day on types and pure code. Finally at the end I wrote down two actions that I still need to implement to make it all work:

applyView' :: MkFileView -> View -> Annex Git.Branch
updateView :: View -> Git.Ref -> Git.Ref -> Annex Git.Branch

I know how to implement these, more or less. And in most cases they will be pretty fast.

The more interesting part is already done. That was the issue of how to generate filenames in the filter branches. That depends on the View being used to filter and organize the branch, but also on the original filename used in the reference branch. Each filter branch has a reference branch (such as "master"), and displays a filtered and metadata-driven reorganized tree of files from its reference branch.

fileViews :: View -> (FilePath -> FileView) -> FilePath -> MetaData -> Maybe [FileView]

So, a view that matches files tagged "haskell" or "git-annex" and with an author of "J*" will generate filenames like "haskell/Joachim/interesting_theoretical_talk.ogg" and "git-annex/Joey/mytalk.ogg".

It can also work backwards from these filenames to derive the MetaData that is encoded in them.

fromView :: View -> FileView -> MetaData

So, copying a file to "haskell/Joey/mytalk.ogg" lets it know that it's gained a "haskell" tag. I knew I was on the right track when fromView turned out to be only 6 lines of code!
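To make the round trip concrete, here is a minimal sketch of the idea, not the actual git-annex code. The View and MetaData types below are simplified stand-ins (assumptions for illustration): a view is just an ordered list of metadata fields, and glob matching such as "J*" is left out.

```haskell
import qualified Data.Map as M
import System.FilePath.Posix (joinPath, splitDirectories, takeFileName)

-- Hypothetical simplified types; the real View and MetaData are richer.
type Field = String
type MetaData = M.Map Field [String]
newtype View = View [Field]

-- Every path a file appears at in the view: one per combination of
-- values of the view's fields. Nothing when the file has no value for
-- one of the fields, so it is not part of the view at all.
fileViews :: View -> FilePath -> MetaData -> Maybe [FilePath]
fileViews (View fields) file md = do
    valueLists <- mapM (\f -> nonEmpty =<< M.lookup f md) fields
    pure [ joinPath (vs ++ [takeFileName file]) | vs <- sequence valueLists ]
  where
    nonEmpty [] = Nothing
    nonEmpty vs = Just vs

-- Work backwards from a path in the view to the metadata it encodes:
-- the Nth directory component is a value of the view's Nth field.
fromView :: View -> FilePath -> MetaData
fromView (View fields) path =
    M.fromListWith (++) [ (f, [v]) | (f, v) <- zip fields (splitDirectories path) ]
```

With a view over tag and author, a file tagged "haskell" and "git-annex" by author "Joey" shows up at both "haskell/Joey/mytalk.ogg" and "git-annex/Joey/mytalk.ogg", and fromView on either path recovers the tag and author for that copy.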

The trickiest part of all this, which I spent most of yesterday thinking about, is what to do if the master branch has files in subdirectories. It probably does not make sense to retain that hierarchical directory structure in the filtered branch, because we instead have a non-hierarchical metadata structure to express. (And there would probably be a lot of deep directory structures containing only one file.) But throwing away the subdirectory information entirely means that two files with the same basename and same metadata would have colliding names.

I eventually decided to embed the subdirectory information into the filenames used on the filter branch. Currently that is done by converting dir/subdir/file.foo to file(dir)(subdir).foo. We'll see how this works out in practice..
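A minimal sketch of that filename mangling, assuming POSIX-style paths. The name flattenPath is made up for illustration, and this version makes no attempt to escape directory names that themselves contain parentheses.

```haskell
import System.FilePath.Posix
    (splitDirectories, splitExtension, takeDirectory, takeFileName)

-- Fold the directory components of a path into its basename, just
-- before the extension, so files in different subdirectories cannot
-- collide once the hierarchy is thrown away.
flattenPath :: FilePath -> FilePath
flattenPath path = base ++ concatMap wrap dirs ++ ext
  where
    (base, ext) = splitExtension (takeFileName path)
    -- takeDirectory yields "." for a file at the top level; drop it.
    dirs = filter (/= ".") (splitDirectories (takeDirectory path))
    wrap d = "(" ++ d ++ ")"
```

A file at the top of the tree keeps its name unchanged, so only files that actually live in subdirectories get the decorated form.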

Posted Mon Feb 17 02:44:28 2014

More Windows porting.. Seem to be getting near an end of the easy stuff, and also the webapp is getting pretty usable on Windows now, the only really important thing lacking is XMPP support.

Made git-annex on Windows set HOME when it's not already set. Several of the bundled cygwin tools only look at HOME. This was made a lot harder and uglier due to there not being any way to modify the environment of the running process.. git-annex has to re-run itself with the fixed environment.

Got working in the webapp. Although with an extra password prompt on Windows, which I cannot find a way to avoid.

While testing that, I discovered that openssh 6.5p1 has broken support for ~/.ssh/config Host lines that contain upper case letters! I have filed a bug about this and put a quick fix in git-annex, which sometimes generated such lines.

Posted Fri Feb 14 21:02:52 2014

Windows porting all day. Fixed a lot of issues with the webapp, so quite productive. Except for the 2 hours wasted finding a way to kill a process by PID from Haskell on Windows.

Last night, made git annex metadata able to set metadata on a whole directory or list of files if desired. And added a --metadata field=value switch (and corresponding preferred content terminal) which limits git-annex to acting on files with the specified metadata.

Posted Thu Feb 13 21:37:49 2014

Built the core data types, and log for metadata storage. Making metadata union merge well is tricky, but I have a design I'm happy with, that will allow distributed changes to metadata.

Finished up the day with a git annex metadata command to get/set metadata for a file.

This is all the groundwork needed to begin experimenting with generating git branches that display different metadata-driven views of annexed files.

Posted Thu Feb 13 03:24:04 2014

There's a new design document for letting git-annex store arbitrary metadata. The really neat thing about this is the user can check out only files matching the tags or values they care about, and get an automatically structured file tree layout that can be dynamically filtered. It's going to be awesome!

In the meantime, spent most of today working on Windows. Very good progress, possibly motivated by wanting to get it over with so I can spend some time this month on the above. ;)

  • webapp can make and S3 remotes. This just involved fixing a hack where the webapp set environment variables to communicate creds to initremote. Can't change environment on Windows (or I don't know how to).
  • webapp can make repos on removable drives.
  • git annex assistant --stop works, although this is not likely to really be useful
  • The source tree now has 0 func = error "Windows TODO" type stubbed out functions to trip over.

Posted Tue Feb 11 20:21:42 2014

Pushed out the new release. This is the first one where I consider the git-annex command line beta quality on Windows.

Did some testing of the webapp on Windows, trying out every part of the UI. I now have eleven todo items involving the webapp listed in windows support. Most of them don't look too bad to fix.

Posted Mon Feb 10 22:48:12 2014

Last night I tracked down and fixed a bug in the DAV library that has been affecting WebDAV remotes. I've been deploying the fix for that today, including to the android and arm autobuilders. While I finished a clean reinstall of the android autobuilder, I ran into problems getting a clean reinstall of the arm autobuilder (some type mismatch error building yesod-core), so manually fixed its DAV for now.

The WebDAV fix and other recent fixes makes me want to make a release soon, probably Monday.

ObWindows: Fixed git-annex to not crash when run on Windows in a git repository that has a remote with a unix-style path like "/foo/bar". Seems that not everything agrees on whether such a path is absolute; even sometimes different parts of the same library disagree!

import System.FilePath.Windows

prop_windows_is_sane :: Bool
prop_windows_is_sane = isAbsolute upath || ("C:\\STUFF" </> upath /= upath)
  where upath = "/foo/bar"

Perhaps more interestingly, I've been helping dxtrish port git-annex to OpenBSD and it seems most of the way there.

Posted Sat Feb 8 21:27:35 2014

git-annex has been using MissingH's absNormPath forever, but that's not very maintained and doesn't work on Windows. I've been wanting to get rid of it for some time, and finally did today, writing a simplifyPath that does the things git-annex needs and will work with all the Windows filename craziness, and takes advantage of the more modern System.FilePath to be quite a simple piece of code. A QuickCheck test found no important divergences from absNormPath. A good first step to making git-annex not depend on MissingH at all.
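For a feel of how small such a function can be, here is a sketch of a purely lexical path simplifier built on System.FilePath. It is an illustration under assumptions, not the code that went into git-annex: it uses the Posix variant of the module for predictable results, and it never touches the filesystem, so symlinks are not resolved.

```haskell
import System.FilePath.Posix
    (dropTrailingPathSeparator, joinDrive, joinPath, splitDrive, splitPath)

-- Remove "." components and cancel "dir/.." pairs, leaving any
-- leading ".." components alone since they cannot be resolved lexically.
simplifyPath :: FilePath -> FilePath
simplifyPath path =
    dropTrailingPathSeparator (joinDrive drive (joinPath (go [] (splitPath rest))))
  where
    (drive, rest) = splitDrive path
    -- acc holds the components kept so far, most recent first.
    go acc [] = reverse acc
    go acc (p:ps)
        | name == "." = go acc ps
        | name == "..", (prev:acc') <- acc, dropTrailingPathSeparator prev /= ".." =
            go acc' ps
        | otherwise = go (p:acc) ps
      where
        name = dropTrailingPathSeparator p
```

Splitting off the drive first is what keeps the root (or a Windows drive letter, with the Windows variant of the module) from being eaten by a "..".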

That fixed one last Windows bug that was disabled in the test suite: git annex add ..\subdir\file will now work.

I am re-installing the Android autobuilder for 2 reasons: I noticed I had accidentally lost a patch to make a library use the Android SSL cert directory, and also a new version of GHC is very near to release and so it makes sense to update.

Down to 38 messages in the backlog.

Posted Fri Feb 7 22:08:20 2014

Added a new feature that started out with me wanting a way to undo a git-annex drop, but turned into something rather more powerful. The --in option can now be told to match files that were in a repository at some point in the past. For example, git annex get --in=here@{yesterday} will get any files that have been dropped over the past day.

While git-annex's location tracking info is stored in git and thus versioned, very little of it makes use of past versions of the location tracking info (only git annex log). I'm happy to have finally found a use for it!

OB Windows porting: Fixed a bug in the symlink calculation code. Sounds simple; took 2 hours!

Also various bug triage; updated git version on OSX; forwarded bug about DAV-0.6 being broken upstream; fixed a bug with initremote in encryption=pubkey mode. Backlog is 65 messages.

Today's work was sponsored by Brock Spratlen.

Posted Fri Feb 7 01:08:56 2014

A more test driven day than usual. Yesterday I noticed a test case was failing on Windows in a way not related to what it was intended to test, and fixed the test case to not fail.. But knew I'd need to get to the bottom of what broke it eventually.

Digging into that today, I eventually (after rather a long time stuck) determined the bug involved automatic conflict resolution, but only happened on systems without symlink support. This let me reproduce it on FAT outside Windows and do some fast TDD iterations in a much less unwieldy environment and fix the bug.

Posted Tue Feb 4 21:34:34 2014

While I've not been blogging over what amounted to a long weekend, looking over the changelog, there were quite a few things done. Mostly various improvements and fixes to git annex sync --content.

Today, got the test suite to pass on Windows 100% again.

Posted Tue Feb 4 01:26:04 2014

With yesterday's release, I'm pretty much done with the month's work. Since there was no particular goal this month, it's been a grab bag of features and bugfixes. Quite a lot of them in this last release.

I'll be away the next couple of days.. But got a start today on the next part of the roadmap, which is planned to be all about Windows and Android porting. Today, it was all about lock files, mostly on Windows.

Lock files on Windows are horrific. I especially like that programs that want to open a file, for any reason, are encouraged in the official documentation to retry repeatedly if it fails, because some other random program, like a virus checker, might have opened the file first.

Turns out Windows does support a shared file read mode. This was just barely enough for me to implement both shared and exclusive file locking a-la-flock.

Couldn't avoid a busy wait in a few places that block on a lock. Luckily, these are few, and the chance that the lock will be held for a long time is small. (I did think about trying to watch the file for close events and detect when the lock was released that way, but it seemed much too complicated and hard to avoid races.)

Also, Windows only seems to support mandatory locks, while all locking in git-annex needs to be advisory locks. Ie, git-annex's locking shouldn't prevent a program from opening an annexed file! To work around that, I am using dedicated lock files on Windows.

Also switched direct mode's annexed object locking to use dedicated lock files. AFAICS, this was pretty well broken in direct mode before.

Posted Tue Jan 28 20:52:01 2014

Built the UI to manage unused files.

Testing yesterday's work, I found several problems that prevented the assistant from moving unused files around, and fixed them. It seems to be working pretty well now.

Posted Thu Jan 23 20:57:49 2014

A big missing piece of the assistant is doing something about the content of old versions of files, and deleted files. In direct mode, editing or deleting a file necessarily loses its content from the local repository, but the content can still hang around in other repositories. So, the assistant needs to do something about that to avoid eating up disk space unnecessarily.

I built on recent work that lets preferred content expressions be matched against keys with no associated file. This means that I can run unused keys through all the machinery in the assistant that handles file transfers, and they'll end up being moved to whatever repository wants them. To control which repositories do want to retain unused files, and which not, I added an unused keyword to preferred content expressions. Client repositories and transfer repositories do not want to retain unused files, but backup etc repos do.

One nice thing about this unused preferred content implementation is that it doesn't slow down normal matching of preferred content expressions at all. Can you guess why not? See 4b55afe9e92c045d72b78747021e15e8dfc16416

So, the assistant will run git annex unused on a daily basis, and cause unused files to flow to repositories that want them. But what if no repositories do? To guard against filling up the local disk, there's an annex.expireunused configuration setting, that can cause old unused files to be deleted by the assistant after a number of days.

I made the assistant check if there seem to be a lot of unused files piling up. (1000+, or 10% of disk used by them, or more space taken by unused files than is free.) If so, it'll pop up an alert to nudge the user to configure annex.expireunused.
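The three thresholds can be written down as a single predicate. This is a sketch with made-up names (UnusedInfo, shouldNudge), and it assumes "10% of disk" means 10% of the total disk size, which is one possible reading of the description.

```haskell
-- Hypothetical summary of unused content; field names are illustrative.
data UnusedInfo = UnusedInfo
    { unusedCount :: Integer  -- number of unused files
    , unusedBytes :: Integer  -- disk space they occupy
    , diskBytes   :: Integer  -- total size of the disk (assumed meaning)
    , freeBytes   :: Integer  -- free space left on the disk
    }

-- True when any one of the three "piling up" conditions holds.
shouldNudge :: UnusedInfo -> Bool
shouldNudge u = or
    [ unusedCount u >= 1000
    , unusedBytes u * 10 >= diskBytes u
    , unusedBytes u > freeBytes u
    ]
```

Multiplying by 10 instead of dividing the disk size keeps the comparison in exact integer arithmetic.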

Still need to build the UI to configure that, and test all of this.

Today's work was sponsored by Samuel Tardieu.

Posted Thu Jan 23 03:11:20 2014

Worked on cleaning up and reorganizing all the code that handles numcopies settings. Much nicer now. Fixed some bugs.

As expected, making the preferred content numcopies check look at .gitattributes slows it down significantly. So, exposed both the slow and accurate check and a faster version that ignores .gitattributes.

Also worked on the test suite, removing dependencies between tests. This will let tasty-rerun be used later to run only previously failing tests.

Posted Tue Jan 21 23:21:48 2014

In order to remove some hackishness in git annex sync --content, I finally fixed a bad design decision I made back at the very beginning (before I really knew haskell) when I built the command seek code, which had led to a kind of inversion of control. This took most of a night, but it made a lot of code in git-annex clearer, and it makes the command seeking code much more flexible in what it can do. Some of the oldest, and worst code in git-annex was removed in the process.

Also, I've been reworking the numcopies configuration, to allow for a preferred content numcopies check. That will let the assistant, as well as git annex sync --content, proactively make copies when needed in order to satisfy numcopies.

As part of this, git config annex.numcopies is deprecated, and there's a new git annex numcopies N command that sets the numcopies value that will be used by any clone of a repository.

I got the preferred content checking of numcopies working too. However, I am unsure if checking for per-file .gitattributes annex.numcopies settings will make preferred content expressions be too slow, so I have left that out for now.

Today's work was sponsored by Josh Taylor.

Posted Mon Jan 20 21:47:17 2014

Spent the day building this new feature, which makes git annex sync --content do the same synchronization of file contents (to satisfy preferred content settings) that the assistant does. The result has not been tested a lot yet, but seems to work well.

Posted Sun Jan 19 22:12:24 2014

Activity has been a bit low again this week. It seems to make sense to do weekly releases currently (rather than bi-monthly), and Thursday's release had only one new feature (Tahoe LAFS) and a bunch of bug fixes.

Looks like git-annex will get back into Debian testing soon, after various fixes to make it build on all architectures again, and then the backport can be updated again too.

I have been struggling with a problem with the OSX builds, which fail with a SIGKILL on some machines. It seems that homebrew likes to aggressively optimise things it builds, and while I have had some success with its --build-bottle option, something in the gnutls stack used for XMPP is still over-optimised. Waiting to hear back from Kevin on cleaning up some optimised system libraries on the OSX host I use. (Is there some way to make a clean chroot on OSX that can be accessed by a non-root user?)

Today I did some minor work involving the --json switch, and also a small change (well, under 300 line diff) allowing --all to be mixed with options like --copies and --in.

Posted Sat Jan 18 21:26:34 2014

Fixed a bug that one or two people had mentioned years ago, but I was never able to reproduce myself or get anyone to reproduce in a useful way. It caused log files that were supposed to be committed to the git-annex branch to end up in master. Turned out to involve weird stuff when the environment contains two different settings for a single variable. So was easily fixed at last. (I'm pretty sure the code would have never had this bug if Data.AssocList was not buried inside an xml library, which rather discourages using it when dealing with the environment.)
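The general shape of this kind of bug is easy to show with a plain association list. The sketch below is an illustration only, with hypothetical function and variable names; it is not the code from the fix. Appending a new setting leaves the old one in place, and which of the two wins then depends on whoever reads the environment.

```haskell
type Env = [(String, String)]

-- Naive: leaves any existing setting in place, so the variable can end
-- up in the environment twice with different values.
addEntryNaive :: String -> String -> Env -> Env
addEntryNaive k v env = env ++ [(k, v)]

-- Safer: drop every existing setting of the variable before adding it.
addEntry :: String -> String -> Env -> Env
addEntry k v env = (k, v) : filter ((/= k) . fst) env
```

With the naive version, a lookup that takes the first match still sees the old value, while a consumer that takes the last match sees the new one; the safer version leaves exactly one setting for both to find.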

Also worked on, and hopefully fixed, another OSX cpu optimisations problem. This one involving shared libraries that git-annex uses for XMPP.

Also made the assistant detect corrupt .git/annex/index files on startup and remove them. It was already able to recover from corrupt .git/index files.

Today's work was sponsored by David Wagner.

Posted Tue Jan 14 21:23:27 2014

If you've been keeping an eye on the roadmap, you'll have seen that xmpp security keeps being pushed back. This was because it's a hard and annoying problem requiring custom crypto and with an ugly key validation problem built into it too. I've now removed it from the roadmap entirely, replacing it with a telehash design.

I'm excited by the possibilities of using telehash with git-annex. It seems it would be quite easy to make it significantly more peer-to-peer and flexible. The only issue is that telehash is still under heavy development and the C implementation is not even usable yet.. (I'll probably end up writing Haskell bindings to that.) So I've pushed it down the roadmap to at least March.

Spent the rest of the day making some minor improvements to external special remote protocol and doing some other minor bug fixes and backlog catch up. My backlog has exploded to nearly 50 messages remaining.

Today's work was sponsored by Chad Horohoe.

Posted Mon Jan 13 22:10:45 2014

Been on reduced activity the past several days. I did spend a full day somewhere in there building the Tahoe LAFS special remote. Also, Tobias has finished updating his full suite of external special remotes to use the new interface!

Worked on closing up the fundraising campaign today (long overdue). This included adding a new wall-o-names to thanks.

Posted Fri Jan 10 19:40:11 2014

Taught the assistant to stop reusing an existing git annex transferkeys process after it detects a network connection change. I don't think this is a complete solution to what to do about long-duration network connections in remotes. For one thing a remote could take a long time to time out when the network is disconnected, and block other transfers (eg to local drives) in the meantime. But at least if a remote loses its network connection and does not try to reconnect on its own, and so is continually failing, this will get it back into a working state eventually.

Also, fixed a problem with the OSX Mavericks build, it seems that the versions of wget and coreutils stuff that I was including in it were built by homebrew with full optimisations turned on, so didn't work on some CPUs. Replaced those with portable builds.

Posted Mon Jan 6 21:16:08 2014

Spent ages tracking down a memory leak in the assistant that showed up when a lot of files were added. Turned out to be a standard haskell laziness induced problem, fixed by adding strictness annotations. Actually there were several of them, that leaked at different rates. Eventually, I seem to have gotten them all fixed:

[Images: leakbefore.png (before), leakafter.png (after)]
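For readers who have not run into this class of leak, here is a generic sketch of it and of the usual fix. It is not the assistant's code; the Stats type and its fields are invented for the example. Without the strictness annotations, each field would grow into a long chain of unevaluated additions as files are processed.

```haskell
import Data.List (foldl')

-- The "!" strictness annotations force each field whenever a Stats
-- value is constructed, so no chain of suspended (+) thunks builds up.
data Stats = Stats
    { filesSeen :: !Int
    , bytesSeen :: !Integer
    }

addFile :: Stats -> Integer -> Stats
addFile (Stats n b) size = Stats (n + 1) (b + size)

-- foldl' forces the accumulator at every step; together with the
-- strict fields, the running totals stay in constant space.
summarize :: [Integer] -> Stats
summarize = foldl' addFile (Stats 0 0)
```

Both pieces matter: foldl' only evaluates the accumulator to its outermost constructor, and the strict fields make that evaluation reach the numbers inside.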

Also fixed a bug in git annex add when the disk was completely full. In that situation, it could sometimes move the file from the work tree to .git/annex/objects and fail to put the symlink in place.

Posted Mon Jan 6 01:38:44 2014

Yesterday, added per-remote, per-key state storage. This is exported via the external special remote protocol, and I expect to use it at least for Tahoe Lafs.

Also, made the assistant write ssh config files with better permissions, so ssh won't refuse to use them. (The only case I know of where that happened was on Windows.)

Today, made addurl and importfeed honor annex.diskreserve. Found out about this the hard way, when an importfeed cron job filled up my server with youtube videos. I should probably also make import honor annex.diskreserve.

I've been working, so far inconclusively, on making the assistant deal with remotes that might open a long duration network connection. Problem being that if the connection is lost, and the remote is not smart enough to reconnect, all further use of it could fail.

In a restarttransferrer branch, I have made the assistant start separate transferkeys processes for each remote. So if a remote starts to fail, the assistant can stop its transferkeys process, and restart it, solving the problem.

But, if a resource needed for a remote is not available, this degrades to every transfer attempt to that remote restarting it. So I don't know if this is the right approach.

Other approaches being considered include asking that implementors of external special remotes deal with reconnection themselves (Tobias, do you deal with this in your remotes?), or making the assistant only restart failing remotes after it detects there's been a network connection change.

Posted Sat Jan 4 22:15:06 2014

Implemented read-only remotes. This may not cover every use case around wanting to clone a repository and use git-annex without leaking the existence of your clone back to it, but I think it hits most of them in a quite easy way, and allows for some potentially interesting stuff like partitioned networks of git-annex repositories.

Zooko and I have been talking things over (for rather too long), and I think have now agreed on how a more advanced git-annex Tahoe-LAFS special remote should work. This includes storing the tahoe file-caps in the git-annex branch. So, I really need to add that per-special-remote data storage feature I've been thinking about.

Posted Fri Jan 3 00:29:54 2014

Various work on Debian, OSX, and Windows stuff. Mostly uninteresting, but took most of the day.

Made git annex mirror --all work. I can see why I left it out; when the mirroring wants to drop an object, in --all mode it doesn't have an associated file in the tree, so it cannot look at the annex.numcopies in gitattributes. Same reason why git annex drop --all is not implemented. But decided to go ahead and only use other numcopies configuration for mirroring.

Added GETWANTED and SETWANTED to the external special remote protocol, and that is as far as I want to go on adding git-annex plumbing stuff to the protocol. I expect Tobias will release a boatload of special remotes updated to the new protocol soon, which seems to prove it has everything that could reasonably be needed.

This is a nice public git-annex repository containing a growing collection of tech conference videos.

Did some design work on untracked remotes, which I think will turn out to be read-only remotes. Being able to clone a repository and use git-annex in the clone without anything leaking back upstream is often desirable when using a public repository, or a repository with many users.

Posted Thu Jan 2 00:37:06 2014

Worked on bug report and forum backlog (24 messages left), and made a few bug fixes. The main one was a fix for a Windows-specific direct mode merge bug.

This month didn't go entirely to plan. I had not expected to work on the Windows assistant and webapp and get it so close to fully working. Nor had I expected to spend time and make significant progress on porting git-annex to Linux -- particularly to embedded NAS devices! I had hoped to encourage some others to develop git-annex, but only had one bite from a student and it didn't work out. Meanwhile, automatically rewarding committers with bitcoin is an interesting alternative approach to possibly motivating contributors, and I would like to set that up, but the software is new and I haven't had time yet. The only thing that went exactly as planned was the external special remote implementation.

A special surprise this month is that I have started hearing privately from several institutions that are starting to use git-annex in interesting ways. Hope I can share details of some of that in 2014!

Posted Tue Dec 31 21:45:36 2013

Fixed a bug that could leave a direct mode repository stuck at annex.version 3. As part of that, v3 indirect mode repositories will be automatically updated to v5. There's no actual change in that upgrade, it just simplifies things to have only one supported annex.version.

Added youtube playlist support to git-annex. Seems I had almost all the pieces needed, and didn't know it. Only about a dozen lines of code!

Added PREPARE-FAILURE support to the external special remote interface.

After I found the cable my kitten stole (her apport level is high), fixed file transfers to/from Android. This broke because git-annex assistant tries to use ionice, if it's in PATH, and Android's ionice is not suitable. It could probably include ionice in the busybox build and use that one, but I wanted a quick fix for this before the upcoming release.

Posted Sun Dec 29 21:33:20 2013

The external special remote interface is now done, and tested working great! Now we just need all the old hook special remotes to be converted to use it..

I punted on per-special-remote, per-key state storage in the git-annex branch for now. If I find an example of a remote that needs it (Tahoe-LAFS may, but still TBD), I'll add it. Added support for using the same credential storage that git-annex uses for S3 and WebDAV credentials.

The main improvement I'd like to make is to add an interface for transferring files where the file is streamed to/from the external special remote, rather than using temp files as it does now. This would be more efficient (sometimes) and make the progress bars better. But it needs to either use a named pipe, which is complicated and non-portable, or serialize the file's contents over a currently line-based protocol, which would be a pain. Anyway, this can be added later, the protocol is extensible.

Posted Fri Dec 27 20:37:12 2013

Built most of the external special remote today. While I've written 600 lines of code for this, and think it's probably working, and complete (except for a couple of features), all I know is that it compiles.

I've also written an example external special remote program in shell script, so the next step is to put the two together and see how it works. I also hope that some people who have built hook special remotes in the past will update them to the new external special remote interface, which is quite a lot better.

Today's work was sponsored by Justine Lam.

Posted Thu Dec 26 22:42:26 2013

Only did a few hours today, getting started on implementing the external special remote protocol.

Mostly this involved writing down types for the various messages, and code to parse them. I'm very happy with how the parsing turned out; nearly all the work is handled by the data types and type classes, and so only one line of very simple code is needed to parse each message:

instance Receivable Response where
       parseCommand "PREPARE-SUCCESS" = parse0 PREPARE_SUCCESS
       parseCommand "TRANSFER-SUCCESS" = parse2 TRANSFER_SUCCESS
       parseCommand "TRANSFER-FAILURE" = parse3 TRANSFER_FAILURE

An especially nice part of this implementation is that it knows exactly how many parameters each message should have (and their types of course), and so can both reject invalid messages, and avoid ambiguity in tokenizing the parameters. For example, the 3rd parameter of TRANSFER-FAILURE is an error message, and as it's the last parameter, it can contain multiple words.

*Remote.External> parseMessage "TRANSFER-FAILURE STORE SHA1--foo doesn't work on Christmas" :: Maybe Response
Just (TRANSFER_FAILURE Upload (Key {keyName = "foo", keyBackendName = "SHA1", keySize = Nothing, keyMtime = Nothing}) "doesn't work on Christmas")
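
The "last parameter takes the rest of the line" rule can be sketched on its own. This is a simplified illustration, not the actual Remote.External code: keys are plain strings here, the constructor names and the splitParams helper are made up for the example, and mapping RETRIEVE to Download is an assumption (only STORE appears above).

```haskell
-- Simplified stand-ins for the real types; not git-annex's Remote.External.
data Direction = Upload | Download
    deriving (Show, Eq)

data Response
    = PrepareSuccess
    | TransferSuccess Direction String
    | TransferFailure Direction String String
    deriving (Show, Eq)

-- Split a string into exactly n space-separated parameters. The last
-- parameter takes the rest of the line, so it may contain spaces.
-- (Hypothetical helper, named for this example only.)
splitParams :: Int -> String -> Maybe [String]
splitParams n s
    | n <= 0 = if null s then Just [] else Nothing
    | null s = Nothing
    | n == 1 = Just [s]
    | otherwise = case break (== ' ') s of
        (w, ' ' : rest) | not (null w) -> (w :) <$> splitParams (n - 1) rest
        _ -> Nothing

-- STORE maps to Upload as in the example above; RETRIEVE is assumed.
parseDirection :: String -> Maybe Direction
parseDirection "STORE" = Just Upload
parseDirection "RETRIEVE" = Just Download
parseDirection _ = Nothing

-- Each message knows how many parameters it expects, so a message with
-- too few parameters is rejected rather than guessed at.
parseResponse :: String -> Maybe Response
parseResponse line = case break (== ' ') line of
    ("PREPARE-SUCCESS", "") -> Just PrepareSuccess
    ("TRANSFER-SUCCESS", ' ' : ps) -> case splitParams 2 ps of
        Just [d, k] -> TransferSuccess <$> parseDirection d <*> pure k
        _ -> Nothing
    ("TRANSFER-FAILURE", ' ' : ps) -> case splitParams 3 ps of
        Just [d, k, msg] ->
            TransferFailure <$> parseDirection d <*> pure k <*> pure msg
        _ -> Nothing
    _ -> Nothing
```

In this sketch, parsing "TRANSFER-FAILURE STORE SHA1--foo doesn't work on Christmas" keeps the whole error message as the third parameter, while "TRANSFER-FAILURE STORE" is rejected because two parameters are missing.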

That's the easy groundwork for external special remotes, done.

Posted Wed Dec 25 22:27:12 2013

Resurfaced today to fix some problems with the Linux standalone builds in the Solstice release. The worst of these prevented the amd64 build from running on some systems, and that build has been updated. The other problems all involved the binary shimming, and were less serious.

As part of that work, replaced the hacky shell script that handled the linux library copying and binary shimming with a haskell program.

Also worked on some Windows bugs, and fixed a typo in the test suite. Got my own little present: haskell-tasty finally got out of Incoming, so the next Debian package build will once again include the test suite.

Posted Tue Dec 24 21:48:20 2013

Got the arm webapp to build! (I have not tried to run it.) The build process for this is quite elaborate; 2 chroots, one amd64 and one armel, with the same versions of everything installed in each, and git-annex is built in the first to get the info the EvilSplicer needs to build it in the second.

Fixed a nasty bug in the assistant on OSX, where at startup it would follow symlinks in the repository that pointed to directories outside the repository, and add the files found there. Didn't cause data loss itself (in direct mode the assistant doesn't touch the files), but certainly confusingly breaks things and makes it easy to shoot your foot off. I will be moving up the next scheduled release because of this bug, probably to Saturday.

Looped the git developers in on a problem with git failing on some kernels due to RLIMIT_NOFILE not working. Looks like git will get more robust and this should make the armel build work on even more embedded devices.

Today's work was sponsored by Johan Herland.

Posted Wed Dec 18 21:53:43 2013

Fixed a few problems in the armel build, and it's been confirmed to work on Raspberry Pi and Synology NAS. Since none of the fixes were specific to those platforms, it will probably work anywhere the kernel is new enough. That covers 9+% of the missing ports in the user survey!

Thought through the possible issues with the assistant on Windows not being able to use lsof. I've convinced myself it's probably safe. (In fact, it might be safe to stop checking with lsof when using the assistant in direct mode entirely.) Also did some testing of some specific interesting circumstances (including 2 concurrent writers to a single file).

I've been working on adding the webapp to the armel build. This can mostly reuse the patches and EvilSplicer developed for Android, but it's taking some babysitting of the build to get yesod etc installed for various reasons. Will be surprised if I don't get there tomorrow.

One other thing.. I notice that is up and running. This was set up by Subito, who offered me the domain, but I suggested he keep it and set up a pretty start page that points new users at the relevant parts of the wiki. I think he's done a good job with that!

Posted Tue Dec 17 23:40:18 2013

Made the Linux standalone builds more self-contained, now they include their own linker and glibc, and ugly hacks to make them be used when running the included programs. This should make them more portable to older systems.

Set up an arm autobuilder. This autobuilder runs in a Debian armel chroot, using qemu-user-static (with a patch to make it support some syscalls ghc uses). No webapp yet; waiting on feedback of how well it works. I hope this build will be usable on eg, Synology NAS and Raspberry Pi.

Also worked on improving the assistant's batching of commits during the startup scan. And some other followups and bug triage.

Today's work was sponsored by Hamish Coleman.

Posted Mon Dec 16 20:33:17 2013

Made some improvements to git-annex's plumbing level commands today. Added new lookupkey and examinekey commands. Also expanded the things that git annex find can report about files. Among other things, the elusive hash directory locations can now be looked up, which IIRC a few people have asked for a way to do.

Also did some work on the linux standalone tarball and OSX app. Both now include man pages, and it's also now possible to just unpack it and symlink git-annex into ~/bin or similar to add it to PATH.

Posted Sun Dec 15 23:16:44 2013

Spent most of today catching up with a week's worth of traffic.

Fixed 2 bugs. Message backlog is 23 messages.

Posted Thu Dec 12 20:40:28 2013

I've switched over to mostly working on Windows porting in the evenings when bored, with days spent on other git-annex stuff. So, getting back to the planned roadmap for this month..

Set up a tip4commit for git-annex. Anyone who gets a commit merged in will receive a currently small amount of bitcoin. This would almost be a good way to encourage more committers other than me, by putting say, half the money I have earmarked for that into the tip jar. The problem is, I make too many commits myself, so most of the money would be quickly tipped back out to me! I have gotten in touch with the tip4commit people, and hope they will give me a way to blacklist myself from being tipped.

Designed an external special remote protocol that seems pretty good for first-class special remotes implemented outside git-annex. It's moderately complicated on the git-annex side to make it simple and flexible on the special remote side, but I estimate only a few days to build it once I have the design finalized.


Tested the autobuilt windows webapp. It works! Sorted out some issues with the bundled libraries.

Reworked how git annex transferkeys communicates, to make it easier to port it to Windows. Narrowly managed to avoid needing to write Haskell bindings to Windows's equivalent of pipe(2). I think the Windows assistant can transfer keys now, and the webapp UI may even be able to be used to stop transfers. Needs testing.

Investigated what I'll need to get XMPP working on Windows. Most of the libs are available in cygwin, but gsasl would need to be built from source. Also some kind of space-in-path problem is preventing cabal installing some of the necessary dependencies.

Posted Wed Dec 11 22:04:40 2013

Got the Windows autobuilder building the webapp. Have not tried that build yet myself, but I have high hopes it will work.

Made other Windows improvements, including making the installer write a start menu entry file, and adding free disk space checking.

Spent rest of the day improving git repair code on a real-world corrupted repository.

Posted Tue Dec 10 22:30:18 2013

Fixed up a few problems with the Windows webapp, and it's now completely usable, using any browser other than MSIE. While there are missing features in the windows port, all the UI for the features it does have seems to just work in the webapp.

Fixed an ugly problem with Firefox, which turned out to have been introduced a while ago by a workaround for an ugly problem in Chrome. Web browsers are so wonderful, until they're crap.

Think I've fixed the bug in the EvilLinker that was causing it to hang on the autobuilder, but still don't have a Windows autobuild with the webapp just yet.

Also improved git annex import some more, and worked on a bug in git repository repair, which I will need to spend some more time on tomorrow.

Posted Mon Dec 9 22:08:08 2013

I have seen the glory of the webapp running on Windows.

One of the warp developers pointed me in the right direction and I developed a fix for the recv bug.

My Windows and MSIE are old and fall over on some of the javascript, so it's not glorious enough for a screenshot. But large chunks of it do seem to work.

Posted Sun Dec 8 20:27:06 2013

Windows webapp now starts, opens a web browser, and ... crashes.

This is a bug in warp or a deep level of the stack. I know that yesod apps have run on Windows before, so apparently something has changed and introduced this problem.

Also have a problem with the autobuilder; the EvilSplicer or something it runs is locking up on that system for reasons not yet determined.

Looks like I will need to wait a bit longer for the windows webapp, but I could keep working on porting the assistant in the meantime.

The most important thing that I need to port is how to check if a file is being written to at the same time the assistant adds it to the repository. No real lsof equivalent on Windows. I might be able to do something with exclusive locking to detect if there's a writer (but this would also block using the file while it was being added). Or I may be able to avoid the need for this check, at least in direct mode.

Posted Sat Dec 7 21:18:39 2013

Android has the EvilSplicer, now Windows gets the EvilLinker. Fully automated, and truly horrible solution to the too long command line problem.

Now when I run git annex webapp on windows, it almost manages to open the web browser.

At the same time, I worked with Yuri to upgrade the Windows autobuilder to a newer Haskell platform, which can install Yesod. I have not quite achieved a successful webapp build on the autobuilder, but it seems close.

Here's a nice Haskell exercise for someone. I wrote this quick and dirty function in the EvilSplicer, but it's crying out for a generalized solution.

{- Input contains something like
 - c:/program files/haskell platform/foo -LC:/Program Files/Haskell Platform/ -L...
 - and the *right* spaces must be escaped with \
 - Argh.
 -}
escapeDosPaths :: String -> String
escapeDosPaths = replace "Program Files" "Program\\ Files"
        . replace "program files" "program\\ files"
        . replace "Haskell Platform" "Haskell\\ Platform"
        . replace "haskell platform" "haskell\\ platform"
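
One possible direction for a more general version, as a sketch only: take the list of space-containing names as a parameter and match them case-insensitively, so each name is listed once. The escapePhrases name and its behaviour are assumptions made for this example; it still relies on knowing the names in advance, so it is not a full answer to the exercise.

```haskell
import Data.Char (toLower)
import Data.List (isPrefixOf)

-- Escape the spaces inside every case-insensitive occurrence of one of
-- the given phrases, leaving all other spaces alone.
-- (Hypothetical generalisation; not code from the EvilSplicer.)
escapePhrases :: [String] -> String -> String
escapePhrases phrases = go
  where
    go [] = []
    go s@(c : cs) = case filter (`matchesStartOf` s) phrases of
        (p : _) ->
            let (hit, rest) = splitAt (length p) s
            in concatMap escapeSpace hit ++ go rest
        [] -> c : go cs
    -- Empty phrases are ignored so that the scan always makes progress.
    matchesStartOf p s =
        not (null p) && map toLower p `isPrefixOf` map toLower s
    escapeSpace ' ' = "\\ "
    escapeSpace ch = [ch]
```

With this sketch, escapePhrases ["Program Files", "Haskell Platform"] covers all four replace calls above, and the spaces that separate the -L options are left untouched.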

Posted Sat Dec 7 01:12:47 2013

Got the entire webapp to build on Windows.

Compiling was easy. One line of code had to be #ifdefed out, and the whole rest of the webapp UI just built!

Linking was epic. It seems that I really am running into a 32kb command line length limit, which causes the link command to fail on Windows. git-annex with all its bells and whistles enabled is just too big. Filed a ghc bug report, and got back a helpful response about using to work around.

6 hours of slogging through compiling dependencies and fighting with toolchain later, I have managed to link git-annex with the webapp!

The process is not automated yet. While I was able to automate passing gcc a @file with its parameters, gcc then calls collect2, which calls ld, and both are passed too many parameters. I have not found a way to get gcc to generate a response file. So I did it manually. Urgh.
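
The @file trick for the gcc step can be sketched as a small pure function. This is an illustration under assumptions, not the code used in the build: the function names and the length limit parameter are hypothetical, and the quoting follows gcc's documented response file rules (whitespace-separated options, quotes around an option, backslash to escape a character).

```haskell
-- Quote one argument for a gcc-style response file: wrap it in double
-- quotes and backslash-escape any embedded backslash or double quote.
quoteArg :: String -> String
quoteArg arg = "\"" ++ concatMap escape arg ++ "\""
  where
    escape c
        | c == '\\' || c == '"' = ['\\', c]
        | otherwise = [c]

-- One quoted argument per line.
responseFileContents :: [String] -> String
responseFileContents = unlines . map quoteArg

-- If the command line would be longer than the limit, return a single
-- @file parameter plus the contents to write to that file; otherwise
-- return the parameters unchanged. (Names and limit are hypothetical.)
viaResponseFile :: Int -> FilePath -> [String] -> ([String], Maybe String)
viaResponseFile limit file args
    | commandLength > limit = (["@" ++ file], Just (responseFileContents args))
    | otherwise = (args, Nothing)
  where
    -- Length of the arguments plus one separator for each.
    commandLength = sum (map length args) + length args
```

The caller would write the returned contents to the named file before running the command. This only covers the first hop; as noted above, collect2 and ld are still handed the full parameter list.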

Also, it crashes on startup with getAddrInfo failure. But some more porting is to be expected, now that the windows webapp links.. ;)

Posted Fri Dec 6 03:50:03 2013

Had planned to spend all day not working on git-annex and instead getting caught up on conference videos. However, got a little bit multitasky while watching those, and started investigating why, last time I worked on Windows port, git-annex was failing to link.

A good thing to do while watching conference videos since it involved lots of test builds with different flags. Eventually solved it. Building w/o WebDAV avoids crashing the compiler anyhow.

Thought I'd try the resulting binary and see if perhaps I had forgotten to use the threaded RTS when I was running ghc by hand to link it last time, and perhaps that was why threads seemed to have hung back then.

It was. This became clear when I saw a "deadlocked indefinitely in MVar" error message, which tells me that it's at least using the threaded RTS. So, I fixed that, and a few other minor things, and ran this command in a DOS prompt box:

git annex watch --force --foreground --debug

And I've been making changes to files in that repository, and amazingly, the watcher is noticing them, and committing them!

So, I was almost entirely there to a windows port of the watcher a month ago, and didn't know. It has some rough edges, including not doing anything to check if a newly created file is open for write when adding it, and getting the full assistant ported will be more work, and the full webapp may be a whole other set of problems, but this is a quite nice milestone for the Windows port.

Posted Wed Dec 4 21:56:29 2013

The 2013 git-annex user survey has been running for several weeks and around 375 people have answered at least the first question. While I am going to leave it up through the end of the year, I went over the data today to see what interesting preliminary conclusions I can draw.

  • 11% build git-annex from source. More than I would have guessed.

  • 20% use the prebuilt versions from the git-annex website.

    This is a number to keep in mind later, when more people have upgraded to the last release, which checks for upgrades. I can run some stats on the number of upgrade checks I receive, and multiplying that by 5 would give a good approximation of the total number of computers running git-annex.

  • I'm surprised to see so many more Linux (79%) than OSX (15%) users. Also surprising is there are more Windows (2%) than Android (1%) users. (Android numbers may be artificially low since many users will use it in addition to one of the other OSes.)

  • Android and Windows unsurprisingly lead in ports requested, but the Synology NAS is a surprise runner up, with 5% (more than IOS).

    In theory it would not be too hard to make a standalone arm tarball, which could be used on such a device, although IIRC the Synology had problems with a too old linker and libc. It would help if I could make the standalone tarball not depend on the system linker at all.

    A surprising number (3%) want some kind of port to the Raspberry Pi, which is weird because I'd think they'd just be using Raspbian on it.. but a standalone arm tarball would also cover that use case.

  • A minimum of 1664 (probably closer to 2000) git annex repositories are being used by the 248 people who answered that question. Around 7 repositories per person average, which is either one repository checked out on 7 different machines or two repositories on 3 machines, etc.

  • At least 143 terabytes of data are being stored in git-annex. This does not count redundant data. (It also excludes several hundred terabytes from one institution that I don't think are fully online yet.) Average user has more than half a terabyte of data.

  • 8% of users store scientific data in git-annex! :) A couple of users are using it for game development assets, and 5% of users are using it for some form of business data.

  • Only 10% of users are sharing a git-annex repository with at least one other person. 27% use it by themselves, but want to get others using their repositories. This probably points to it needing to be easier for nontechnical users.

  • 61% of git-annex users have good or very good knowledge of git. This question intentionally used the same wording as the general git user survey, so the results can be compared. The curves have somewhat different shapes, with git-annex's users being biased more toward the higher knowledge levels than git's users.

  • The question about how happy users are also used the same wording. While 74% of users are happy with git-annex, 94% are similarly happy with git, and while the median git-annex user is happy, the median git user is very happy.

    The 10% who wrote in "very enthusiastic, but still often bitten by quirks (so not very happy yet, but with lots of confidence in the potential" might have thrown off this comparison some, but they certainly made their point!

  • 3% of respondents say that a bug is preventing them from using git-annex, but that they have not reported the bug yet. Frustrating! 1% say that a bug that's been reported already is blocking them.

  • 18% wrote in that they need the webapp to support using github (etc) as a central server. I've been moving in that direction with the encryption and some other changes, so it's probably time to make a UI for that.

  • 12% want more control over which files are stored locally when using the assistant.

  • A really surprising thing happened when someone wrote in that I should work on "not needing twice disk space of repo in direct mode", and 5% of people then picked this choice. This is some kind of documentation problem, because of course git-annex never needs 2x disk space, whether using direct mode or not. That's one of its advantages over git!

  • Somewhere between 59 and 161 of the survey respondents use Debian. I can compare this with Debian popularity contest data which has 400 active installations and 1000 total installations, and make guesses about what fraction of all git-annex users have answered the survey. By making different assumptions I got guesses that varied by 2 orders of magnitude, so not worth bothering with. Explicitly asking how many people use each Linux distribution would be a good idea in next year's survey.

Main work today was fixing Android DNS lookups, which was trying to use /etc/resolv.conf to look up SRV records for XMPP, and had to be changed to use a getprop command instead. Since I can't remember dealing with this before (not impossible I made some quick fix to the dns library before and lost it though), I'm wondering if XMPP was ever usable on Android before. Cannot remember. May work now, anyway...

Posted Tue Dec 3 22:04:12 2013

Still working through thanksgiving backlog. Around 55 messages to go.

Wrote hairy code to automatically fix up bad bare repositories created by recent versions of git-annex. Managed to do it with only 1 stat call overhead (per local repository). Will probably keep that code in git-annex for a year or so, despite the bug only being present for a few weeks, because the repositories that need to be fixed might be on removable drives that are rarely used.

Various other small bug fixes, including dealing with having changed their WebDAV endpoint url.

Spent a while evaluating various key/value storage possibilities. The "incremental fsck should not use sticky bit" page has the details.

Posted Tue Dec 3 00:33:30 2013

Made a release yesterday to fix a bug that made git-annex init in a bare repository set core.bare=false. This bug only affected git-annex 5, it was introduced when building the direct mode guard. Currently recovering from it is a manual (pretty easy) process. Perhaps I should automate that, but I mostly wanted to get a fix out before too many people encountered the bug.

Today, I made the assistant run batch jobs with ionice and nocache, when those commands are available. Also, when the assistant transfers files, that also runs as a batch job.
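
The "when those commands are available" part can be sketched as a pure function that wraps a command in whichever helpers exist. This is an illustration, not the assistant's actual code: the names are made up for the example, the availability check is passed in as a function (a real version might look the programs up in PATH), and using ionice's -c3 idle class is an assumption about which option fits.

```haskell
-- A command and its parameters.
type Command = (String, [String])

-- Wrapper programs to try, outermost first. The -c3 option asks ionice
-- for the idle class; that choice is an assumption of this sketch.
batchWrappers :: [Command]
batchWrappers = [("ionice", ["-c3"]), ("nocache", [])]

-- Prefix the command with each wrapper that the supplied check reports
-- as available, and skip the rest.
toBatchCommand :: (String -> Bool) -> Command -> Command
toBatchCommand available cmd = foldr wrap cmd batchWrappers
  where
    wrap (wrapper, opts) (c, ps)
        | available wrapper = (wrapper, opts ++ c : ps)
        | otherwise = (c, ps)
```

With both helpers available, a transfer command such as ("git-annex", ["transferkeys"]) becomes ionice -c3 nocache git-annex transferkeys; with neither, it runs unchanged.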

Changed how git-annex does commits, avoiding using git commit in direct mode, since in some situations git commit (not with -a!) wants to read the contents of files in the work tree, which can be very slow.

Posted Sun Dec 1 20:18:32 2013

My last day before thanksgiving, getting caught up with some recent bug reports, and in quite a rush to get a lot of fixes in. Adding to the fun, wintery weather means very limited power today.

It was a very productive day, especially for Android, which hopefully has XMPP working again (at least it builds..), halved the size of the package, etc.

Fixed a stupid bug in the automatic v5 upgrade code; annex.version was not being set to 5, and so every git annex command was actually re-running the upgrade.

Fixed another bug I introduced last Friday, which the test suite luckily caught, that broke using some local remotes in direct mode.

Tracked down a behavior that makes git annex sync quite slow on filesystems that don't support symlinks. I need to switch direct mode to not using git commit at all, and use plumbing to make commits there. Will probably work on this over the holiday.

Posted Wed Nov 27 00:08:03 2013

Upgrades should be working on OSX Mavericks, Linux, and sort of on Android. This needs more testing, so I have temporarily made the daily builds think they are an older version than the last git-annex release. So when you install a daily build, and start the webapp, it should try to upgrade (really downgrade) to the last release. Tests appreciated.

Looking over the whole upgrade code base, it took 700 lines of code to build the whole thing, of which 75 are platform specific (and mostly come down to just 3 or 4 shell commands). Not bad..

Last night, added support for quvi 0.9, which has a completely changed command line interface from the 0.4 version.

Plan to spend tomorrow catching up on bug reports etc and then low activity for rest of the week.

Posted Mon Nov 25 19:42:52 2013

Upgrades are fully working on Linux. OSX code is written but untested and I thought of one bug it certainly has on my evening walk. Probably another hour's work left later this evening to finish it off.

Posted Sun Nov 24 22:34:28 2013

Completely finished up with making the assistant detect when git-annex's binary has changed and handling the restart.

It's a bit tricky because during an upgrade there can be two assistant daemons running at the same time, in the same repository. Although I disable the watcher of the old one first. Luckily, git-annex has long supported running multiple concurrent git-annex processes in the same repository.

The surprisingly annoying part turned out to be how to make the webapp redirect the browser to the new url when it's upgraded. Particularly needed when automatic upgrades are enabled, since the user will not then be taking any action in the webapp that could result in a redirect. My solution to this feels like overkill; the webapp does ajax long polling until it gets an url, and then redirects to it. Had to write javascript code and ugh.

But, that turned out to also be useful when manually restarting the webapp (removed some horrible old code that ran a shell script to do it before), and also when shutting the webapp down.

[Screenshot: assistant downloading an upgrade to itself]

Getting back to upgrades, I have the assistant downloading the upgrade, and running a hook action once the key is transferred. Now all I need is some platform-specific code to install it. Will probably be hairy, especially on OSX where I need to somehow unmount the old git-annex dmg and mount the new one, from within a program running on the old dmg.

Today's work was sponsored by Evan Deaubl.

Posted Sat Nov 23 21:28:42 2013

The difference picking the right type can make! Last night, I realized that where I had a distributionSha256sum :: String, I should instead use distributionKey :: Key. This means that when git-annex is eventually downloading an upgrade, it can treat it as just another Key being downloaded from the web. So the webapp will show that transfer along with all the rest, and I can leverage tons of code for a new purpose. For example, it can simply fsck the key once it's downloaded to verify its checksum.
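
A rough sketch of what that type change looks like, together with a version check of the kind the upgrade alert needs. Everything here is illustrative: only distributionKey and the Key field names come from this blog, while the Distribution record, its other fields, and the parseVersion and upgradeAvailable helpers are assumptions made for the example.

```haskell
-- A cut-down Key, using the field names shown by the parser output
-- earlier in this blog; the field types are simplified.
data Key = Key
    { keyName :: String
    , keyBackendName :: String
    , keySize :: Maybe Integer
    , keyMtime :: Maybe Integer
    } deriving (Show, Eq)

-- Hypothetical info-file record. Holding a Key rather than a bare
-- checksum string lets the download be handled like any other key.
data Distribution = Distribution
    { distributionUrl :: String
    , distributionKey :: Key
    , distributionVersion :: String
    } deriving (Show, Eq)

-- Parse a dotted version such as "5.20131127" into its numeric parts.
parseVersion :: String -> Maybe [Integer]
parseVersion = mapM readNumber . splitOn '.'
  where
    readNumber s = case reads s of
        [(n, "")] -> Just n
        _ -> Nothing
    splitOn sep s = case break (== sep) s of
        (w, []) -> [w]
        (w, _ : rest) -> w : splitOn sep rest

-- True when the distribution is newer than the running version.
-- Versions that fail to parse never trigger an upgrade.
upgradeAvailable :: String -> Distribution -> Bool
upgradeAvailable current dist =
    case (parseVersion current, parseVersion (distributionVersion dist)) of
        (Just cur, Just new) -> new > cur
        _ -> False
```

The point of the sketch is the second record field: once the checksum lives inside a Key, verifying the download is the same operation as verifying any other key.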

Also, built a DistributionUpdate program, which I'll run to generate the info files for a new version. And since I keep git-annex releases in a git-annex repo, this too leverages a lot of git-annex modules, and ended up being just 60 easy lines of code. The upgrade notification code is tested and working now.

And, I made the assistant detect when the git-annex program binary is replaced or modified. Used my existing DirWatcher code for that. The plan is to restart the assistant on upgrade, although I need to add some sanity checks (eg, reuse the lsof code) first. And yes, this will work even for apt-get upgrade!

Today's work was sponsored by Paul Tötterman

Posted Fri Nov 22 23:03:22 2013

Still working on the git repair code. Improved the test suite, which found some more bugs, and so I've been running tests all day and occasionally going and fixing a bug in the repair code. The hardest part of repairing a git repo has turned out to be reliably determining which objects in it are broken. Bugs in git don't help (but the git devs are going to fix the one I reported).

But the interesting new thing today is that I added some upgrade alert code to the webapp. Ideally everyone would get git-annex and other software as part of an OS distribution, which would include its own upgrade system -- But the survey tells me that a quarter of installs are from the prebuilt binaries I distribute.

So, those builds are going to be built with knowledge of an upgrade url, and will periodically download a small info file (over https) to see if a newer version is available, and show an alert.

I think all that's working, though I have not yet put the info files in place and tested it. The actual upgrade process will be a manual download and reinstall, to start with, and then perhaps I'll automate it further, depending on how hard that is on the different platforms.

Posted Fri Nov 22 04:26:24 2013

Pushed out a minor release of git-annex today, mostly to fix build problems on Debian. No strong reason to upgrade to it otherwise.

Continued where I left off with the Git.Destroyer. Fixed quite a lot of edge cases where git repair failed due to things like a corrupted .git/HEAD file (this makes git think it's not in a git repository), corrupt git objects that have an unknown object type and so crash git hard, and an interesting failure mode where git fsck wants to allocate 116 GB of memory due to a corrupted object size header. Reported that last to the git list, as well as working around it.

At the end of the day, I ran a test creating 10000 corrupt git repositories, and all of them were recovered! Any improvements will probably involve finding new ways to corrupt git repositories that my code can't think of. ;)

Posted Wed Nov 20 23:34:30 2013

Wrote some evil code you don't want to run today. Git.Destroyer randomly generates Damage, and applies it to a git repository, in a way that is reproducible -- applying the same Damage to clones of the same git repo will always yield the same result.

This let me build a test harness for git-repair, which repeatedly clones, damages, and repairs a repository. And when it fails, I can just ask it to retry after fixing the bug and it'll re-run every attempt it's logged.

This is already yielding improvements to the git-repair code. The first randomly constructed Damage that it failed to recover turned out to be a truncated index file that hid some other corrupted object files from being repaired.

[Damage Empty (FileSelector 1),
 Damage Empty (FileSelector 2),
 Damage Empty (FileSelector 3),
 Damage Reverse (FileSelector 3),
 Damage (ScrambleFileMode 3) (FileSelector 5),
 Damage Delete (FileSelector 9),
 Damage (PrependGarbage "\SOH\STX\ENQ\f\a\ACK\b\DLE\n") (FileSelector 9),
 Damage Empty (FileSelector 12),
 Damage (CorruptByte 11 25) (FileSelector 6),
 Damage Empty (FileSelector 5),
 Damage (ScrambleFileMode 4294967281) (FileSelector 14)]

I need to improve the ranges of files that it damages -- currently QuickCheck seems to only be selecting one of the first 20 or so files. Also, it's quite common that it will damage .git/config so badly that git thinks it's not a git repository anymore. I am not sure if that is something git-repair should try to deal with.
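
The reproducibility property can be illustrated with a pure model. This is a sketch under loud assumptions, not Git.Destroyer itself: the repository is modelled as a list of file names and contents instead of files on disk, ScrambleFileMode is left out, and wrapping the FileSelector number around the file count is just one possible way to pick a file.

```haskell
import Data.Char (chr)

-- A toy repository: file names paired with their contents.
type Repo = [(FilePath, String)]

data DamageAction
    = Empty
    | Delete
    | Reverse
    | PrependGarbage String
    | CorruptByte Int Int -- offset and replacement byte value
    deriving (Show, Eq)

newtype FileSelector = FileSelector Int
    deriving (Show, Eq)

data Damage = Damage DamageAction FileSelector
    deriving (Show, Eq)

-- Apply one Damage. The selector is wrapped around the number of files,
-- so the same Damage always hits the same file of the same repository.
applyDamage :: Damage -> Repo -> Repo
applyDamage _ [] = []
applyDamage (Damage action (FileSelector n)) repo =
    case splitAt (n `mod` length repo) repo of
        (before, (path, content) : after) ->
            before ++ damaged path content ++ after
        (before, []) -> before
  where
    damaged path content = case action of
        Delete -> []
        Empty -> [(path, "")]
        Reverse -> [(path, reverse content)]
        PrependGarbage g -> [(path, g ++ content)]
        CorruptByte off b -> [(path, corrupt off b content)]
    corrupt _ _ [] = []
    corrupt off b s =
        let (pre, rest) = splitAt (off `mod` length s) s
        in pre ++ chr (b `mod` 256) : drop 1 rest

-- Apply a logged list of Damage in order. Being a pure function, it
-- gives the same result every time for the same input.
applyAll :: [Damage] -> Repo -> Repo
applyAll damages repo = foldl (flip applyDamage) repo damages
```

Because a logged Damage list is plain data, replaying a failed attempt after a fix is just a matter of calling applyAll on a fresh clone with the same list.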

Today's work was sponsored by the WikiMedia Foundation.

Posted Tue Nov 19 21:35:16 2013

Release today, right on bi-weekly schedule. Rather startled at the size of the changelog for this one; along with the direct mode guard, it adds support for OS X Mavericks, Android 4.3/4.4, and fixes numerous bugs.

Posted another question in the survey,

Spun off git-repair as an independent package from git-annex. Of course, most of the source code is shared with git-annex. I need to do something with libraries eventually..

Posted Mon Nov 18 22:27:47 2013

Fixed two difficult bugs with direct mode. One happened (sometimes) when a file was deleted and replaced with a directory by the same name and then those changes were merged into a direct mode repository.

The other problem was that direct mode did not prevent writes to .git/annex/objects the way that indirect mode does, so when a file in the repository was not currently present, writing to the dangling symlink would follow it and write into the object directory.

Hmm, I was going to say that it's a pity that direct mode still has so many bugs being found and fixed, but the last real bug fix to direct mode was made last May! Instead, I probably have to thank Tim for being a very thorough tester.

Finished switching the test suite to use the tasty framework, and prepared tasty packages for Debian.

Posted Fri Nov 15 20:31:36 2013

The user survey is producing some interesting and useful results!
Added two more polls: using with and blocking problems
(There were some load issues so if you were unable to vote yesterday, try again..)

Worked on getting the autobuilder for OS X Mavericks set up. Eventually succeeded, after patching a few packages to work around a cpp that thinks it should parse haskell files as if they're C code. Also, Jimmy has resuscitated the OS X Lion autobuilder.

A not too bad bug in automatic merge conflict resolution has been reported, so I will need to dig into that tomorrow. Didn't feel up to it today, so instead have been spending the remaining time finishing up a branch that switches the test suite to use the tasty test framework.

Posted Fri Nov 15 00:09:47 2013

One of my goals for this month is to get a better sense of how git-annex is being used, how it's working out for people, and what areas need to be concentrated on. To start on that, I am doing the 2013 git-annex user survey, similar to the git user surveys. I will be adding some less general polls later (suggestions for topics appreciated!), but you can go vote in any or all of 10 polls now.

Found a workaround for yesterday's Windows build problem. Seems that only cabal runs gcc in a way that fails, so ghc --make builds it successfully. However, the watcher doesn't quite work on Windows. It does get events when files are created, but it seems to then hang before it can add the file to git, or indeed finish printing out a debug log message about the event. This looks like it could be a problem with the threaded ghc runtime on Windows, or something like that.

Main work today was improving the git repository repair to handle corrupt index files. The assistant can now start up, detect that the index file is corrupt, and regenerate it all automatically.

Posted Wed Nov 13 20:56:41 2013

Annoyingly, the Android 4.3 fix breaks git-annex on Android 4.0 (probably through 4.2), so I now have two separate builds of the Android app.

Worked on Windows porting today. I've managed to get the assistant and watcher (but not yet webapp) to build on Windows. The git annex transferrer interface needs POSIX stuff, and seems to be the main thing that will need porting for Windows for the assistant to work, besides of course file change detection. For that, I've hooked up Win32-notify.

So the watcher might work on Windows. At least in theory. Problem is, while all the code builds ok, it fails to link:

ghc.exe: could not execute: C:\Program Files (x86)\Haskell Platform\2012.4.0.0\lib/../mingw/bin/gcc.exe

I wonder if this is a case of too many parameters being passed?

This happens both on the autobuilder and on my laptop, so I'm stuck here. Oh well, I was not planning to work on this anyway until February...

Posted Wed Nov 13 01:05:04 2013

Finally found the root cause of the Android 4.3/4.4 trouble, and a fix is now in place!

As a bonus, it looks like I've fixed a problem accessing the environment on Android that had been worked around in an ugly way before.

Big thanks to my remote hands Michael Alan, Sören, and subito. All told they ran 19 separate tests to help me narrow down this tricky problem, often repeating long command lines on software keyboards.

Posted Tue Nov 12 06:54:19 2013

Been chipping away at my backlog of messages, and it's down to 23 items.

Finally managed to get ghc to build with a newer version of the NDK. This might mean a solution to git-annex on Android 4.2. I need help with testing.

Posted Sun Nov 10 20:14:20 2013

Finished the direct mode guard, including the new git annex status command.

Spent the rest of the day working on various bug fixes. One of them turned into rather a lot of work to make the webapp's UI better for git remotes that do not have an annex.uuid.

Posted Thu Nov 7 22:03:32 2013

Started by tracking down a strange bug that was apparently Ubuntu-specific and caused git-annex branch changes to get committed to master. Root cause turned out to be a failure to recover from an exception. I'm kicking myself about that, because I remember looking at the code where the bug was at least twice before and thinking "hmm, should add exception handling here? nah..". Exceptions are horrible.

Made a release with a fix for that and a few other minor changes accumulated since last Friday's release. The main point of this release is to fix building without the webapp (so it will propagate to Debian testing, etc). This release does not include the direct mode guard, so I'll have a few weeks until the next release to get that tested.

Fixed the test suite in directguard. This branch is now nearly ready to merge to master, but one command that is badly needed in guarded direct mode is "git status". So I am planning to rename "git annex status" to "git annex info", and make "git annex status" display something similar to "git status".

Also took half an hour and added optional EKG support to git-annex. This is a Haskell library that can add a terrific monitoring console web UI to any program in 2 lines of code. Here we can see the git-annex webapp using resources at startup, followed in a few seconds by the assistant's startup scan of the repository.


BTW, Kevin tells me that the machine used to build git-annex for OSX is going to be upgraded to 10.9 soon. So, hopefully I'll be making autobuilds of that. I may have to stop the 10.8.2 autobuilds though.

Today's work was sponsored by Protonet.

Posted Wed Nov 6 20:39:24 2013

Long, long day coding up the direct mode guard today. About 90% of the fun is dealing with receive.denyCurrentBranch not preventing pushes that change the current branch, now that core.bare is set in direct mode. My current solution to this involves using a special branch when using direct mode, which nothing will ever push to (hopefully). A much nicer solution would be to use an update hook to deny pushes of the current branch -- but there are filesystems where repos cannot have git hooks.

The test suite is falling over, but the directguard branch otherwise seems usable.

Today's work was sponsored by Carlo Matteo Capocasa.

Posted Wed Nov 6 01:26:06 2013

I've been investigating ways to implement a direct mode guard. Preventing a stray git commit -a or git add doing bad things in a direct mode repository seems increasingly important.

First, considered moving .git, so git won't know it's a git repository. This doesn't seem too hard to do, but there will certainly be unexpected places that assume .git is the directory name.

I dislike it more and more as I think about it though, because it moves direct mode git-annex toward being entirely separate from git, and I don't want to write my own version control system. Nor do I want to complicate the git ecosystem with tools needing to know about git-annex to work in such a repository.

So, I'm happy that one of the other ideas I tried today seems quite promising. Just set core.bare=true in a direct mode repository. This nicely blocks all git commands that operate on the working tree from doing anything, which is just what's needed in direct mode, since they don't know how to handle the direct mode files. But it lets all git commands and other tools that don't touch the working tree continue to be used. You can even run git log file in such a repository (surprisingly!)

It also gives an easy out for anyone who really wants to use git commands that operate on the work tree of their direct mode repository, by just passing -c core.bare=false. And it's really easy to implement in git-annex too -- it can just notice if a repo has core.bare set and is in direct mode, and pass that parameter to every git command it runs. I should be able to get by with only modifying 2 functions to implement this.
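The guard described above can be sketched roughly as follows. This is a minimal illustration in Python, not git-annex's actual Haskell code: the function name `git_command` and the dict representation of the repository config are hypothetical, and `annex.direct` is assumed here to be the setting that marks direct mode. The `-c core.bare=false` override itself is the standard git option mentioned in the text.

```python
def git_command(args, config):
    """Build the argv for a git command that the tool itself runs.

    `config` is a dict of git config values for the repository (a
    hypothetical stand-in for reading the real config). When the repo is
    guarded, meaning core.bare is set and it is in direct mode, the
    override is inserted so the tool's own git commands still see the
    work tree.
    """
    argv = ["git"]
    guarded = (config.get("core.bare") == "true"
               and config.get("annex.direct") == "true")  # assumed key
    if guarded:
        argv += ["-c", "core.bare=false"]
    return argv + list(args)
```

A user running plain `git commit -a` in such a repository gets refused by git, while anything built through a wrapper like this keeps working.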

Posted Mon Nov 4 21:32:03 2013

Low activity the past couple of days. Released a new version of git-annex yesterday. Today fixed three bugs (including a local pairing one that was pretty complicated) and worked on getting caught up with traffic.

Posted Sat Nov 2 21:07:19 2013

Spent today reviewing my plans for the month and filling in a couple of missing pieces.

Noticed that I had forgotten to make repository repair clean up any stale git locks, despite writing that code at the beginning of the month, and added that in.

Made the webapp notice when a repository that is being used does not have any consistency checks configured, and encourage the user to set up checks. This happens when the assistant is started (for the local repository), and when removable drives containing repositories are plugged in. If the reminders are annoying, they can be disabled with a couple clicks.

And I think that just about wraps up the month. (If I get a chance, I would still like to add recovery of git-remote-gcrypt encrypted git repositories.)

My roadmap has next month dedicated to user-driven features and polishing and bugfixing.

Posted Tue Oct 29 20:59:51 2013

All command line stuff today..

Added --want-get and --want-drop, which can be used to test preferred content settings of a repository. For example git annex find --in . --want-drop will list the same files that git annex drop --auto would try to drop. (Also renamed git annex content to git annex wanted.)

Finally laid to rest problems with git annex unannex when multiple files point to the same key. It's a lot slower, but I'll stop getting bug reports about that.

Posted Mon Oct 28 22:19:19 2013

Finally got the assistant to repair git repositories on removable drives, or other local repos. Mostly this happens entirely automatically, whatever data in the git repo on the drive has been corrupted can just be copied to it from ~/annex/.git.

And, the assistant will launch a git fsck of such a repo whenever it fails to sync with it, so the user does not even need to schedule periodic fscks. Although it's still a good idea, since some git repository problems don't prevent syncing from happening.

Watching git annex heal problems like this is quite cool!

One thing I had to defer till later is repairing corrupted gcrypt repositories. I don't see a way to do it without deleting all the objects in the gcrypt repository, and re-pushing everything. And even doing that is tricky, since the gcrypt-id needs to stay the same.

Posted Sun Oct 27 20:58:10 2013

Got well caught up on bug fixes and traffic. Backlog is down to 40.

Made the assistant wait for a few seconds before doing the startup scan when it's autostarted, since the desktop is often busy starting up at that same time.

Fixed an ugly bug with chunked webdav and directory special remotes that caused it to not write a "chunkcount" file when storing data, so it didn't think the data was present later. I was able to make it recover nicely from that mistake, by probing for what chunks are actually present.
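Probing for chunks when the "chunkcount" file is missing could look something like this sketch. It is written in Python for illustration only; the `<key>.chunk<N>` naming and the `chunk_present` callback are assumptions made for the example, not the names used by the real special remotes.

```python
def probe_chunk_count(key, chunk_present, limit=10000):
    """Count consecutive chunks that exist for `key`, starting at chunk 1.

    `chunk_present(name)` is a caller-supplied check against the remote.
    The "<key>.chunk<N>" naming is an assumption made for this sketch.
    `limit` only guards against probing forever on a misbehaving remote.
    """
    count = 0
    while count < limit and chunk_present(f"{key}.chunk{count + 1}"):
        count += 1
    return count
```

A result of 0 means no chunks were found, so the content is treated as not present.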

Several people turn out to have had problems with git annex sync not working because receive.denyNonFastForwards is enabled. I made the webapp not enable it when setting up a ssh repository, and I made git annex sync print out a hint about this when it's failed to push. (I don't think this problem affects the assistant's own syncing.)

Made the assistant try to repair a damaged git repository without prompting. It will only prompt when it fails to fetch all the lost objects from remotes.

Glad to see that others have managed to get git-annex to build on Mac OS X 10.9. Now I just need someone to offer up a ssh account on that OS, and I could set up an autobuilder for it.

Posted Sat Oct 26 21:17:47 2013

The webapp now fully handles repairing damage to the repository.

Along with all the git repository repair stuff already built, I added additional repairs of the git-annex branch and git-annex's index file. That was pretty easy actually, since git-annex already handles merging git-annex branches that can sometimes be quite out of date. So when git repo repair has to throw away recent changes to the git-annex branch, it just effectively becomes out of date. Added a git annex fsck --fast run to ensure that the git-annex branch reflects the current state of the repository.

When the webapp runs a repair, it first stops the assistant from committing new files. Once the repair is done, that's started back up, and it runs a startup scan, which is just what is needed in this situation; it will add any new files, as well as any old files that the git repository damage caused to be removed from the index.

Also made git annex repair run the git repository repair code, for those with a more command-line bent. It can be used in non-git-annex repos too!

So, I'm nearly ready to wrap up working on disaster recovery. Lots has been accomplished this month. And I have put off making a release for entirely too long!

The big missing piece is repair of git remotes located on removable drives. I may make a release before adding that, but removable drives are probably where git repository corruption is most likely to occur, so I certainly need to add that.

Today's work was sponsored by Scott Robinson.

Posted Wed Oct 23 19:16:15 2013

I think that git-recover-repository is ready now. Made it deal with the index file referencing corrupt objects. The best approach I could think of for that is to just remove those objects from the index, so the user can re-add files from their work tree after recovery.

Now to integrate this git repository repair capability into the git-annex assistant. I decided to run git fsck as part of a scheduled repository consistency check. It may also make sense for the assistant to notice when things are going wrong, and suggest an immediate check. I've started on the webapp UI to run a repository repair when fsck detects problems.

Posted Tue Oct 22 20:30:39 2013

Solid day of working on repository recovery. Got git recover-repository --force working, which involves fixing up branches that refer to missing objects. Mostly straightforward traversal of git commits, trees, blobs, to find when a branch has a problem, and identify an old version of it that predates the missing object. (Can also find them in the reflog.)

The main complication turned out to be that git branch -D and git show-ref don't behave very well when the commit objects pointed to by refs are themselves missing. And git has no low-level plumbing that avoids falling over these problems, so I had to write it myself.

Testing has turned up one unexpected problem: Git's index can itself refer to missing objects, and that will break future commits, etc. So I need to find a way to validate the index, and when it's got problems, either throw it out, or possibly recover some of the staged data from it.

Posted Mon Oct 21 21:44:05 2013

Built a git-recover-repository command today. So far it only does the detection and deletion of corrupt objects, and retrieves them from remotes when possible. No handling yet of missing objects that cannot be recovered from remotes.

Here's a couple of sample runs where I do bad things to the git repository and it fixes them:

joey@darkstar:~/tmp/git-annex>chmod 644 .git/objects/pack/*
joey@darkstar:~/tmp/git-annex>echo > .git/objects/pack/pack-a1a770c1569ac6e2746f85573adc59477b96ebc5.pack 
Running git fsck ...
git fsck found a problem but no specific broken objects. Perhaps a corrupt pack file? Unpacking all pack files.
fatal: early EOF
Unpacking objects: 100% (148/148), done.
Unpacking objects: 100% (354/354), done.
Re-running git fsck to see if it finds more problems.
Re-running git fsck to see if it finds more problems.
Initialized empty Git repository in /home/joey/tmp/tmprepo.0/.git/
Trying to recover missing objects from remote origin
Successfully recovered repository!
You should run "git fsck" to make sure, but it looks like
everything was recovered ok.

joey@darkstar:~/tmp/git-annex>chmod 644 .git/objects/00/0800742987b9f9c34caea512b413e627dd718e
joey@darkstar:~/tmp/git-annex>echo > .git/objects/00/0800742987b9f9c34caea512b413e627dd718e
Running git fsck ...
error: unable to unpack 000800742987b9f9c34caea512b413e627dd718e header
error: inflateEnd: stream consistency error (no message)
error: unable to unpack 000800742987b9f9c34caea512b413e627dd718e header
error: inflateEnd: stream consistency error (no message)
git fsck found 1 broken objects. Unpacking all pack files.
removing 1 corrupt loose objects
Re-running git fsck to see if it finds more problems.
Re-running git fsck to see if it finds more problems.
Initialized empty Git repository in /home/joey/tmp/tmprepo.0/.git/
Trying to recover missing objects from remote origin
Successfully recovered repository!
You should run "git fsck" to make sure, but it looks like
everything was recovered ok.

Works great! I need to move this and git-union-merge out of the git-annex source tree sometime.

Today's work was sponsored by Francois Marier.

Posted Sun Oct 20 21:56:25 2013

Goal for the rest of the month is to build automatic recovery from git repository corruption. Spent today investigating how to do it and came up with a fairly detailed design. It will have two parts, first to handle repository problems that can be fixed by fetching objects from remotes, and secondly to recover from problems where data never got sent to a remote, and has been lost.

In either case, the assistant should be able to detect the problem and automatically recover well enough to keep running. Since this also affects non-git-annex repositories, it will also be available in a standalone git-recover-repository command.

Posted Fri Oct 18 20:19:22 2013

A long day of bugfixing. Split into two major parts. First I got back to a bug I filed in August to do with the assistant misbehaving when run in a subdirectory of a git repository, and did a nice type-driven fix of the underlying problem (that also found and fixed some other related bugs that would not normally occur). Then, spent 4 hours in Windows purgatory working around crazy path separator issues.

Posted Fri Oct 18 02:39:56 2013

Productive day, but I'm wiped out. Backlog down to 51.

Posted Wed Oct 16 20:58:53 2013

While I said I was done with fsck scheduling yesterday, I ended up adding one more feature to it today: Full anacron style scheduling. So a fsck can be scheduled to run once per week, or month, or year, and it'll run the fsck the next time it's available after that much time has passed. The nice thing about this is I didn't have to change Cronner at all to add this, just improved the Recurrence data type and the code that calculates when to run events.
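The anacron-style rule boils down to a check on elapsed time rather than on the calendar. A minimal sketch in Python (the function name and the `None` convention for "never run" are assumptions, not the Recurrence data type itself):

```python
from datetime import datetime, timedelta

def anacron_due(last_run, interval, now):
    """True when at least `interval` has passed since `last_run`.

    A job that has never run (last_run is None) is due immediately.
    """
    return last_run is None or now - last_run >= interval
```

For a weekly fsck, `interval` would be `timedelta(weeks=1)`; if the machine was off when the week elapsed, the check simply succeeds the next time it is evaluated.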

Rest of the day I've been catching up on some bug reports. The main bug I fixed caused git-annex on Android to hang when adding files. This turns out to be because it's using a new (unreleased) version of git, and git check-attr -z output format has changed in an incompatible way.

I am currently 70 messages behind, which includes some ugly looking bug reports, so I will probably continue with this over the next couple days.

Posted Tue Oct 15 20:16:31 2013

Fixed a lot of bugs in the assistant's fsck handling today, and merged it into master. There are some enhancements that could be added to it, including fscking ssh remotes via git-annex-shell and adding the ability to schedule events to run every 30 days instead of on a specific day of the month. But enough on this feature for now.

Today's work was sponsored by Daniel Brockman.

Posted Mon Oct 14 20:33:23 2013

Built everything needed to run a fsck when a remote gets connected. Have not tested it; only testing is blocking merging the incrementalfsck branch now.

Also updated the OSX and Android builds to use a new gpg release (denial of service security fix), and updated the Debian backport, and did a small amount of bug fixing. I need to do several more days of bug fixing once I get this incremental fsck feature wrapped up before moving on to recovery of corrupt git repositories.

Posted Sun Oct 13 21:22:31 2013

Last night, built this nice user interface for configuring periodic fscks:

Rather happy that that whole UI needed only 140 lines of code to build. Though rather more work behind it, as seen in this blog..

Today I added some support to git-annex for smart fscking of remotes. So far only git repos on local drives, but this should get extended to git-annex-shell for ssh remotes. The assistant can also run periodic fscks of these.

Still need to test that, and find a way to make a removable drive's fsck job run when the drive gets plugged in. That's where picking "any time" will be useful; it'll let you configure fscking of removable drives when they're available, as long as they have not been fscked too recently.

Today's work was sponsored by Georg Bauer.

Posted Fri Oct 11 21:35:54 2013

Some neat stuff is coming up, but today was a pretty blah day for me. I did get the Cronner tested and working (only had a few little bugs). But I got stuck for quite a while making the Cronner stop git-annex fsck processes it was running when their jobs get removed. I had some code to do this that worked when run standalone, but not when run from git-annex.

After considerable head-scratching, I found out this was due to forkProcess masking async exceptions, which seems to be a bug. Luckily I was able to work around it. Async exceptions continue to strike me as the worst part of the worst part of Haskell (the worst part being exceptions in general).

Was more productive after that.. Got the assistant to automatically queue re-downloads of any files that fsck throws out due to having bad contents, and made the webapp display an alert while fscking is running, which will go to the page to configure fsck schedules. Now all I need to do is build the UI of that page.

Posted Fri Oct 11 04:45:46 2013

Lots of progress from yesterday's modest start of building data types for scheduling. Last night I wrote the hairy calendar code to calculate when next to run a scheduled event. (This is actually quite superior to cron, which checks every minute to see if it should run each event!) Today I built a "Cronner" thread that handles spawning threads to handle each scheduled event. It even notices when changes have been made to its schedule and stops/starts event threads appropriately.

Everything is hooked up, building, and there's a good chance it works without too many bugs, but while I've tested all the pure code (mostly automatically with quickcheck properties), I have not run the Cronner thread at all. And there is some tricky stuff in there, like noticing that the machine was asleep past when it expected to wake up, and deciding if it should still run a scheduled event, or should wait until next time. So tomorrow we'll see..
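To make the two tricky pieces concrete, here is a small Python sketch of computing the next run time for a daily event and of deciding what to do after waking up late. The function names are hypothetical, and the late-wakeup policy (run only if still inside the event's time window) is an assumption for illustration; the text above does not say which rule the Cronner uses.

```python
from datetime import datetime, timedelta

def next_daily_run(now, hour, minute):
    """Next time at hour:minute that is strictly after `now`."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate

def should_run_after_wake(scheduled, woke_at, duration):
    """After sleeping past `scheduled`, run only if the event's window
    (scheduled .. scheduled + duration) has not closed yet. This policy
    is an assumption made for the sketch."""
    return scheduled <= woke_at < scheduled + duration
```

Computing the next run time once and sleeping until then is what avoids having to poll the schedule repeatedly.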

Today's work was sponsored by Ethan Aubin.

Posted Tue Oct 8 22:15:36 2013

Spent most of the day building some generic types for scheduling recurring events. Not sure if rolling my own was a good idea, but that's what I did.

In the incrementalfsck branch, I have hooked this up in git-annex vicfg, which now accepts and parses scheduled events like "fsck self every day at any time for 60 minutes" and "fsck self on day 1 of weeks divisible by 2 at 3:45 for 120 minutes", and stores them in the git-annex branch. The exact syntax is of course subject to change, but also doesn't matter a whole lot since the webapp will have a better interface.
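As an illustration of what parsing such a line involves, here is a rough Python sketch that handles the two example strings quoted above. It is not the vicfg parser: the regular expression, the function name, and the returned dict layout are all assumptions, and the recurrence part is kept as an unparsed string.

```python
import re

# Matches "fsck self <recurrence> at <time> for <N> minutes".
SCHEDULE = re.compile(
    r"^fsck self (?P<recurrence>.+) at (?P<time>any time|\d{1,2}:\d{2})"
    r" for (?P<minutes>\d+) minutes$"
)

def parse_schedule(text):
    """Split a schedule line into recurrence, time of day, and duration.

    `time` is None for "any time", otherwise an (hour, minute) tuple.
    """
    m = SCHEDULE.match(text)
    if m is None:
        raise ValueError(f"cannot parse schedule: {text!r}")
    time = None
    if m.group("time") != "any time":
        hour, minute = m.group("time").split(":")
        time = (int(hour), int(minute))
    return {
        "recurrence": m.group("recurrence"),
        "time": time,
        "minutes": int(m.group("minutes")),
    }
```

Since the exact syntax is subject to change, a sketch like this is only useful for seeing the three parts each event carries.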

Posted Tue Oct 8 03:58:26 2013

Finished up the automatic recovery from stale lock files. Turns out git has quite a few lock files; the assistant handles them all.

Improved URL and WORM keys so the filenames used for them will always work on FAT (which has a crazy assortment of illegal characters). This is a tricky thing to deal with without breaking backwards compatibility, so it's only dealt with when creating new URL or WORM keys.
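One way to make a name safe for FAT is to escape the characters it rejects. The following Python sketch uses a percent-style escape purely as an example; the escaping scheme, the character set constant, and the function name are assumptions and not the encoding git-annex uses for its keys.

```python
# Characters that FAT filenames cannot contain (control characters are
# handled separately below).
FAT_ILLEGAL = set('"*/:<>?\\|')

def fat_safe(name):
    """Escape characters that are illegal on FAT as %XX (hex code point).

    '%' itself is escaped too, so the mapping stays reversible.
    """
    out = []
    for ch in name:
        if ch in FAT_ILLEGAL or ch == "%" or ord(ch) < 32:
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)
```

Applying such a change only to newly created keys, as described above, leaves the names of existing keys untouched.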

I think my next step in this disaster recovery themed month will be adding periodic incremental fsck to the assistant. git annex fsck can already do an incremental fsck, so this should mostly involve adding a user interface to the webapp to configure when it should fsck. For example, you might choose to run it for up to 1 hour every night, with a goal of checking all your files once per month. Also will need to make the assistant do something useful when fsck finds a bad file (ie, queue a re-download).

Posted Sat Oct 5 21:26:17 2013

Started the day by getting the builds updated for yesterday's release. This included making it possible to build git-annex with Debian stable's version of cryptohash. Also updated the Debian stable backport to the previous release.

The roadmap has this month devoted to improving git-annex's support for recovering from disasters, broken repos, and so on. Today I've been working on the first thing on the list, stale git index lock files.

It's unfortunate that git uses simple files for locking, and does not use fcntl or flock to prevent the stale lock file problem. Perhaps they want it to work on broken NFS systems? The problem with that line of thinking is that it means all non-broken systems end up broken by stale lock files. Not a good tradeoff IMHO.

There are actually two lock files that can end up stale when using git-annex; both .git/index.lock and .git/annex/index.lock. Today I concentrated on the latter, because I saw a way to prevent it from ever being a problem. All updates to that index file are done by git-annex when committing to the git-annex branch. git-annex already uses fcntl locking when manipulating its journal. So, that can be extended to also cover committing to the git-annex branch, and then the git index.lock file is irrelevant, and can just be removed if it exists when a commit is started.
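The idea can be sketched as follows, in Python for illustration (Unix only). An fcntl lock is released by the kernel when the holding process exits, so it cannot go stale the way a plain lock file can; once it is held, any leftover index.lock must be stale. The lock file name `journal.lck`, the function name, and the `do_commit` callback are hypothetical.

```python
import fcntl
import os

def commit_with_journal_lock(annex_dir, do_commit):
    """Run `do_commit` while holding an exclusive fcntl lock.

    Holding the lock means no other cooperating process is committing,
    so an existing index.lock can only be a stale leftover and is removed.
    """
    lock_path = os.path.join(annex_dir, "journal.lck")   # assumed name
    index_lock = os.path.join(annex_dir, "index.lock")
    with open(lock_path, "w") as lock_file:
        fcntl.lockf(lock_file, fcntl.LOCK_EX)  # blocks until the lock is ours
        try:
            if os.path.exists(index_lock):
                os.remove(index_lock)
            return do_commit()
        finally:
            fcntl.lockf(lock_file, fcntl.LOCK_UN)
```

This only stays safe if every writer of that index takes the same lock first, which is the property the next paragraph describes checking with the type system.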

To ensure this makes sense, I used the type system to prove that the journal locking was in effect everywhere I need it to be. Very happy I was able to do that, although I am very much a novice at using the type system for interesting proofs. But doing so made it very easy to build up to a point where I could unlink the .git/annex/index.lock and be sure it was safe to do that.

What about stale .git/index.lock files? I don't think it's appropriate for git-annex to generally recover from those, because it would change regular git command line behavior, and risks breaking something. However, I do want the assistant to be able to recover if such a file exists when it is starting up, since that would prevent it from running. Implemented that also today, although I am less happy with the way the assistant detects when this lock file is stale, which is somewhat heuristic (but should work even on networked filesystems with multiple writing machines).

Today's work was sponsored by Torbjørn Thorsen.

Posted Thu Oct 3 20:58:40 2013

Did I say it would be easy to make the webapp detect when a gcrypt repository already existed and enable it? Well, it wasn't exactly hard, but it took over 300 lines of code and 3 hours..

So, gcrypt support is done for now. The glaring omission is gpg key management for sharing gcrypt repositories between machines and/or people. But despite that, I think it's solid, and easy to use, and covers some great use cases.

Pushed out a release.

Now I really need to start thinking about disaster recovery.

Today's work was sponsored by Dominik Wagenknecht.

Posted Wed Oct 2 20:13:45 2013

Long day, but I did finally finish up with gcrypt support. More or less.

Got both creating and enabling existing gcrypt repositories on ssh servers working in the webapp. (But I ran out of time to make it detect when the user is manually entering a gcrypt repo that already exists. Should be easy so maybe tomorrow.)

Fixed several bugs in git-annex's gcrypt support that turned up in testing. Made git-annex ensure that a gcrypt repository does not have receive.denyNonFastForwards set, because gcrypt relies on always forcing the push of the branch it stores its manifest on. Fixed a bug in git-annex-shell recvkey when it was receiving a file from an annex in direct mode.

Also had to add a new git annex shell gcryptsetup command, which is needed to make setting up a gcrypt repository work when the assistant has set up a locked-down ssh key that can only run git-annex-shell. Painted myself into a bit of a corner there.

And tested, tested, tested. So many possibilities and edge cases in this part of the code..

Today's work was sponsored by Hendrik Müller Hofstede.

Posted Tue Oct 1 23:21:47 2013

So close to being done with gcrypt support.. But still not quite there.

Today I made the UI changes to support gcrypt when setting up a repository on a ssh server, and improved the probing and data types so it can tell which options the server supports. Fairly happy with how that is turning out.

Have not yet hooked up the new buttons to make gcrypt repos. While I was testing that my changes didn't break other stuff, I found a bug in the webapp that caused it to sometimes fail to transfer one file to/from a remote that was just added, because the transferrer process didn't know about the new remote yet, and crashed (and was restarted knowing about it, so successfully sent any other files). So got sidetracked on fixing that.

Also did some work to make the gpg bundled with git-annex on OSX be compatible with the config files written by MacGPG. At first I was going to hack it to not crash on the options it didn't support, but it turned out that upgrading to version 1.4.14 actually fixed the problem that was making it build without support for DNS.

Today's work was sponsored by Thomas Hochstein.

Posted Sun Sep 29 20:35:34 2013

Worked on making the assistant able to merge in existing encrypted git repositories from

This had two parts. First, making the webapp UI where you click to enable a known special remote work with these encrypted repos. Secondly, handling the case where a user knows they have an encrypted repository on, so enters in its hostname and path, but git-annex doesn't know about that special remote. The second case is important, for example, when the encrypted repository is a backup and you're restoring from it. It wouldn't do for the assistant, in that case, to make a new encrypted repo and push it over top of your backup!

Handling that was a neat trick. It has to do quite a lot of probing, including downloading the whole encrypted git repo so it can decrypt it and merge it, to find out about the special remote configuration used for it. This all works with just 2 ssh connections, and only 1 ssh password prompt max.

Next, on to generalizing this specific code to work with arbitrary ssh servers!

Today's work was made possible by RMS's vision 30 years ago.

Posted Fri Sep 27 20:36:58 2013

Being still a little unsure of the UI and complexity for configuring gcrypt on ssh servers, I thought I'd start today with the special case of gcrypt on Since allows running some git commands, gcrypt can be used to make encrypted git repositories on it.

Here's the UI I came up with. It's complicated a bit by needing to explain the tradeoffs between the rsync and gcrypt special remotes.

This works fine, but I did not get a chance to add support for enabling existing gcrypt repos on Anyway, most of the changes to make this work will also make it easier to add general support for gcrypt on ssh servers.

Also spent a while fixing a bug in git-remote-gcrypt. Oddly gpg --list-keys --fast-list --fingerprint does not show the fingerprints of some keys.

Today's work was sponsored by Cloudier - Thomas Djärv.

Posted Fri Sep 27 05:03:50 2013

Did various bug fixes and followup today. Amazing how a day can vanish that way. Made 4 actual improvements.

I still have 46 messages in unanswered backlog, although only 8 of them are from this month.

Posted Wed Sep 25 20:13:08 2013

Added support for gcrypt remotes to git-annex-shell. Now gcrypt special remotes probe when they are set up to see if the remote system has a suitable git-annex-shell, and if so all commands are sent to it. Kept the direct rsync mode working as a fallback.

It turns out I made a bad decision when first adding gcrypt support to git-annex. To make implementation marginally easier, I decided to not put objects inside the usual annex/objects directory in a gcrypt remote. But that lack of consistency would have made adding support to git-annex-shell a lot harder. So, I decided to change this. Which means that anyone already using gcrypt with git-annex will need to manually move files around.

Today's work was sponsored by Tobias Nix.

Posted Tue Sep 24 21:51:12 2013

Finished moving the Android autobuilder over to the new clean build environment. Tested the Android app, and it still works. Whew!

There's a small chance that the issue with the Android app not working on Android 4.3 has been fixed by this rebuild. I doubt it, but perhaps someone can download the daily build and give it another try..

I have 7 days left in which I'd like to get remote gcrypt repositories working in the assistant. I think that should be fairly easy, but a prerequisite for it is making git-annex-shell support being run on a gcrypt repository. That's needed so that the assistant's normal locked down ssh key setup can also be used for gcrypt repositories.

At the same time, not all gcrypt endpoints will have git-annex-shell installed, and it seems to make sense to leave in the existing support for running raw rsync and git push commands against such a repository. So that's going to add some complication.

It will also complicate git-annex-shell to support gcrypt repos. Basically, everything it does in git-annex repos will need to be reimplemented in gcrypt repositories. Generally in a more simple form; for example it doesn't need to (and can't) update location logs in a gcrypt repo.

I also need to find a good UI to present the three available choices (unencrypted git, encrypted git, encrypted rsync) when setting up a repo on a ssh server. I don't want to just remove the encrypted rsync option, because it's useful when using xmpp to sync the git repo, and is simpler to set up since it uses shared encryption rather than gpg public keys.

My current thought is to offer just 2 choices, encrypted and non-encrypted. If they choose encrypted, offer a choice of shared encryption or encrypting to a specific key. I think I can word this so it's pretty clear what the tradeoffs are.

Posted Mon Sep 23 20:32:11 2013

Made a release on Friday. But I had to rebuild the OSX and Linux standalone builds today to fix a bug in them.

Spent the past three days redoing the whole Android build environment. I've been progressively moving from my first hacked up Android build env to something more reproducible and sane. Finally I am at the point where I can run a shell script (well, actually, 3 shell scripts) and get an Android build chroot. It's still not immune to breaking when new versions of haskell libs are uploaded, but this is much better, and should be maintainable going forward.

This is a good starting point for getting git-annex into the F-Droid app store, or for trying to build with a newer version of the Android SDK and NDK, to perhaps get it working on Android 4.3. (Eventually. I am so sick of building Android stuff right now..)

Friday was all spent struggling to get ghc-android to build. I had not built it successfully since February. I finally did, on Saturday, and I have made my own fork of it which builds using a known-good snapshot of the current development version of ghc. Building this in a Debian stable chroot means that there should be no possibility that upstream changes will break the build again.

With ghc built, I moved on to building all the haskell libs git-annex needs. Unfortunately my build script for these also has stopped working since I made it in April. I failed to pin every package at a defined version, and things broke.

So, I redid the build script, and updated all the haskell libs to the newest versions while I was at it. I have decided not to pin the library versions (at least until I find a foolproof way to do it), so this new script will break in the future, but it should break in a way I can fix up easily by just refreshing a patch.

The new ghc-android build has a nice feature of at least being able to compile Template Haskell code (though still not run it at compile time). This made the patching needed in the Haskell libs quite a lot less. Offset somewhat by me needing to make general fixes to lots of libs to build with ghc head. Including some fun with ==# changing its type from Bool to Int#. In all, I think I removed around 2.5 thousand lines of patches! (Only 6 thousand lines to go...)

Today I improved ghc-android some more so it cross builds several C libraries that are needed to build several haskell libraries needed for XMPP. I had only ever built those once, and done it by hand, and very hackishly. Now they all build automatically too.

And, I put together a script that builds the debian stable chroot and installs ghc-android.

And, I hacked on the EvilSplicer (which is sadly still needed) to work with the new ghc/yesod/etc.

At this point, I have git-annex successfully building, including the APK!

In a bored hour waiting for a compile, I also sped up git annex add on OSX by I think a factor of 10. Using cryptohash for hash calculation now, when external hash programs are not available. It's still a few percentage points slower than external hash programs, or I'd use it by default.

This period of important drudgery was sponsored by an unknown bitcoin user, and by Bradley Unterrheiner and Andreas Olsson.

Posted Mon Sep 23 02:55:40 2013

Spent a few hours improving gcrypt in some minor ways, including adding a --check option that the assistant can use to find out if a given repo is encrypted with gcrypt, and also tell if the necessary gpg key is available to decrypt it. Also merged in a fix to support subkeys, developed by a git-annex user who is the first person I've heard from who is using gcrypt. I don't want to maintain gcrypt, so I am glad its author has shown up again today.

Got mostly caught up on backlog. The main bug I was able to track down today is git-annex using a lot of memory in certain repositories. This turns out to have happened when a really large file was committed right into the git repository (by mistake or on purpose). Some parts of git-annex buffer file contents in memory while trying to work out if they're git-annex keys. Fixed by making it first check if a file in git is marked as a symlink. Which was really hard to do!
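As a rough illustration of the fix described above (not git-annex's actual code, which is Haskell), the check can be sketched in Python against `git ls-tree` output, where symlinks carry the mode 120000. The function names here are made up for the example.

```python
# Mode that git records for symlink entries in a tree.
SYMLINK_MODE = "120000"


def parse_ls_tree_line(line):
    """Split one line of `git ls-tree` output into (mode, type, sha, path).

    The format is "<mode> <type> <sha>\t<path>".
    """
    meta, path = line.split("\t", 1)
    mode, objtype, sha = meta.split(" ")
    return mode, objtype, sha, path


def may_be_annexed(line):
    """Return True only for symlink blobs.

    Regular files committed straight into git are skipped without ever
    reading their (possibly huge) contents into memory.
    """
    mode, objtype, _sha, _path = parse_ls_tree_line(line)
    return objtype == "blob" and mode == SYMLINK_MODE
```

With a check like this, only the small symlink blobs need to be read to see whether they point at a git-annex key.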

At least 4 people ran into this bug, which makes me suspect that lots of people are messing up when using direct mode (probably due to not reading the documentation, or having git commit -a hardwired into their fingers), and forcing git to commit large files into their repos, rather than having git-annex manage them. Implementing the direct mode guard seems more urgent now.

Today's work was sponsored by Amitai Schlair.

Posted Thu Sep 19 21:10:49 2013

Spent basically all of today getting the assistant to be able to handle gcrypt special remotes that already exist when it's told to add a USB drive. This was quite tricky! And I did have to skip handling gcrypt repos that are not git-annex special remotes.

Anyway, it's now almost easy to set up an encrypted sneakernet using a USB drive and some computers running the webapp. The only part that the assistant doesn't help with is gpg key management.

Plan is to make a release on Friday, and then try to also add support for encrypted git repositories on remote servers. Tomorrow I will try to get through some of the communications backlog that has been piling up while I was head down working on gcrypt.

Posted Wed Sep 18 20:11:59 2013

I decided to keep gpg key generation very simple for now. So it generates a special-purpose key that is only intended to be used by git-annex. It hardcodes some key parameters, like RSA and 4096 bits (maximum recommended by gpg at this time). And there is no password on the key, although you can of course edit it and set one. This is because anyone who can access the computer to get the key can also look at the files in your git-annex repository. Also because I can't rely on gpg-agent being installed everywhere. All these simplifying assumptions may be revisited later, but are enough for now for someone who doesn't know about gpg (so doesn't have a key already) and just wants an encrypted repo on a removable drive.
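To make the hardcoded parameters concrete, here is a hedged sketch of a parameter file for gpg's unattended key generation (`gpg --batch --gen-key FILE`). The post does not say how git-annex actually invokes gpg, so the function name and the exact set of lines are assumptions. `%no-protection` is only understood by GnuPG 2.1 and later.

```python
def gpg_batch_parameters(name):
    """Build a parameter file for `gpg --batch --gen-key`.

    Mirrors the choices described above: RSA, 4096 bits, and no
    passphrase. `name` is the user ID to attach to the key.
    """
    lines = [
        "Key-Type: RSA",
        "Key-Length: 4096",
        "Name-Real: " + name,
        "Expire-Date: 0",   # the key never expires
        "%no-protection",   # no passphrase (GnuPG 2.1+)
        "%commit",          # generate the key described so far
    ]
    return "\n".join(lines) + "\n"
```

The resulting text would be written to a file (or piped to gpg on stdin); a passphrase can still be set afterwards by editing the key.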

Put together a simple UI to deal with gpg taking quite a while to generate a key ...



Then I had to patch git-remote-gcrypt again, to have a per-remote signingkey setting, so that these special-purpose keys get used for signing their repo.

Next, need to add support for adding an existing gcrypt repo as a remote (assuming it's encrypted to an available key). Then, gcrypt repos on ssh servers..

Also dealt with build breakage caused by a new version of the Haskell DNS library.

Today's work was sponsored by Joseph Liu.

Posted Wed Sep 18 00:08:57 2013

Now the webapp can set up encrypted repositories on removable drives.


This UI needs some work, and the button to create a new key is not wired up. Also if you have no gpg agent installed, there will be lots of password prompts at the console.

Forked git-remote-gcrypt to fix a bug. Hopefully my patch will be merged; for now I recommend installing my forked version.

Today's work was sponsored by Romain Lenglet.

Posted Mon Sep 16 21:00:18 2013

Fixed a typo that broke automatic youtube video support in addurl.

Now there's an easy way to get an overview of how close your repository is to meeting the configured numcopies settings (or when it exceeds them).

# time git annex info . 
numcopies stats: 
    numcopies +0: 6686
    numcopies +1: 3793
    numcopies +3: 3156
    numcopies +2: 2743
    numcopies -1: 1242
    numcopies -4: 1098
    numcopies -3: 1009
    numcopies +4: 372

This does make git annex info slow when run on a large directory tree, so --fast disables that.
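The stats above boil down to counting, for each file, how far its number of copies is from the configured numcopies. A minimal sketch of that tally, with hypothetical function names and assuming the per-file copy counts have already been looked up:

```python
from collections import Counter


def numcopies_stats(copies_per_file, numcopies):
    """Tally how far each file is above or below numcopies.

    Returns (difference, number of files) pairs, most common first.
    """
    variance = Counter(found - numcopies for found in copies_per_file)
    return variance.most_common()


def format_stats(stats):
    """Render the pairs in the style of the output shown above."""
    return ["numcopies %+d: %d" % (delta, count) for delta, count in stats]
```

Looking up the copy count for every file is the expensive part, which is why skipping it with --fast makes sense on a large tree.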

Posted Sun Sep 15 23:36:52 2013

Worked to get git-remote-gcrypt included in every git-annex autobuild bundle. (Except Windows; running a shell script there may need some work later..)

Next I want to work on making the assistant easily able to create encrypted git repositories on removable drives. Which will involve a UI to select which gpg key to use, or creating (and backing up!) a gpg key.

But, I got distracted chasing down some bugs on Windows. These were quite ugly; more direct mode mapping breakage which resulted in files not being accessible. Also fsck on Windows failed to detect and fix the problem. All fixed now. (If you use git-annex on Windows, you should certainly upgrade and run git annex fsck.)

As with most bugs in the Windows port, the underlying cause turned out to be stupid: isSymlink always returned False on Windows. Which makes sense from the perspective of Windows not quite having anything entirely like symlinks. But failed when that was being used to detect when files in the git tree being merged into the repository had the symlink bit set..

Did bug triage. Backlog down to 32 (mostly messages from August).

Posted Fri Sep 13 20:09:20 2013

I've been out sick. However, some things kept happening. Mesar contributed a build host, and the linux and android builds are now happening, hourly, there. (Thanks as well to the two other people who also offered hosting.) And I made a minor release to fix a bug in the test suite that I was pleased three different people reported.

Today, my main work was getting git-annex to notice when a gcrypt remote located on some removable drive mount point is not the same gcrypt remote that was mounted there before. I was able to finesse this so it re-configures things to use the new gcrypt remote, as long as it's a special remote it knows about. (Otherwise it has to ignore the remote.) So, encrypted repos on removable drives will work just as well as non-encrypted repos!
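The decision described above can be sketched as a small function. This is only an illustration under assumed names: it takes the gcrypt-id the remote was configured with, the gcrypt-id actually found at the mount point, and the set of ids belonging to special remotes git-annex knows about.

```python
def check_mounted_remote(configured_id, found_id, known_special_remotes):
    """Decide what to do with the gcrypt repo found at a mount point.

    Returns one of:
      ("unchanged", id)    same repo as before
      ("reconfigure", id)  a different, but known, special remote
      ("ignore", None)     an unknown gcrypt repo
    """
    if found_id == configured_id:
        return ("unchanged", configured_id)
    if found_id in known_special_remotes:
        return ("reconfigure", found_id)
    return ("ignore", None)
```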

Also spent a while with tech support trying to work out why someone's git-annex apparently opened a lot of concurrent ssh connections. Have not been able to reproduce the problem though.

Also, a lot of catching up on traffic. Still 63 messages backlogged however, and still not entirely well..

Posted Thu Sep 12 21:58:43 2013

Got git annex sync working with gcrypt. So went ahead and made a release today. Lots of nice new features!

Unfortunately the linux 64 bit daily build is failing, because my build host only has 2 gb of memory and it is no longer enough. I am looking for a new build host, ideally one that doesn't cost me $40/month for 3 gb of ram and 15 gb of disk. (Extra special ideally one that I can run multiple builds per day on, rather than the current situation of only building overnight to avoid loading the machine during the day.) Until this is sorted out, no new 64 bit linux builds..

Posted Mon Sep 9 19:37:10 2013

gcrypt is fully working now. Most of the examples in fully encrypted git repositories with gcrypt should work.

A few known problems:

  • git annex sync refuses to sync with gcrypt remotes. some url parsing issue.
  • Swapping two drives with gcrypt repositories on the same mount point doesn't work yet.
  • http urls are not supported

Posted Sun Sep 8 19:57:57 2013

About half way done with a gcrypt special remote. I can initremote it (the hard part to get working), and can send files to it. Can't yet get files back, or remove files, and only local repositories work so far, but this is enough to know it's going to be pretty nice!

Did find one issue in gcrypt that I may need to develop a patch for:

Posted Sat Sep 7 23:10:26 2013

Woke up with a pretty solid plan for gcrypt. It will be structured as a separate special remote, so initremote will be needed, with a gitrepo= parameter (unless the remote already exists). git-annex will then set up the git remote, including pushing to it (needed to get a gcrypt-id).

Didn't feel up to implementing that today. Instead I unexpectedly spent the day doing mostly Windows work, including setting up a VM on my new laptop for development. Including a ssh server in Windows, so I can script local builds and tests on Windows without ever having to touch the desktop. Much better!

Posted Fri Sep 6 22:54:24 2013

Started work on gcrypt support.

The first question is, should git-annex leave it up to gcrypt to transport the data to the encrypted repository on a push/pull? gcrypt hooks into git nicely to make that just work. However, if I go this route, it limits the places the encrypted git repositories can be stored to regular git remotes (and rsync). The alternative is to somehow use gcrypt to generate/consume the data, but use the git-annex special remotes to store individual files. Which would allow for a git repo stored on S3, etc. For now, I am going with the simple option, but I have not ruled out trying to make the latter work. It seems it would need changes to gcrypt though.

Next question: Given a remote that uses gcrypt, how do I determine the annex.uuid of that repository? I found a nice solution to this. gcrypt has its own gcrypt-id, and I convert it to a UUID in a reproducible, and even standards-compliant way. So the same encrypted remote will automatically get the same annex.uuid wherever it's used. Nice. Does mean that git-annex cannot find a uuid until git pull or git push has been used, to let gcrypt get the gcrypt-id. Implemented that.
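One standards-compliant way to turn an arbitrary string into a reproducible UUID is the name-based (version 5) scheme from RFC 4122. The post does not say which scheme or namespace git-annex uses, so the sketch below is an assumption throughout, and the namespace is invented for the example.

```python
import uuid

# Hypothetical namespace, made up for this sketch. Any fixed UUID works,
# as long as every clone uses the same one.
EXAMPLE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "http://example.com/gcrypt-id")


def annex_uuid_from_gcrypt_id(gcrypt_id):
    """Derive a UUID from a gcrypt-id.

    The same gcrypt-id always maps to the same UUID, with no state
    shared between the machines doing the conversion.
    """
    return str(uuid.uuid5(EXAMPLE_NAMESPACE, gcrypt_id))
```

Because the mapping is a pure function of the gcrypt-id, two computers that have never talked to each other still agree on the uuid of the same encrypted remote.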

The next step is actually making git-annex store data on gcrypt remotes. And it needs to store it encrypted of course. It seems best to avoid needing a git annex initremote for these gcrypt remotes, and just have git-annex automatically encrypt data stored on them. But I don't know. Without initializing them like a special remote is, I'm limited to using the gpg keys that gcrypt is configured to encrypt to, and cannot use the regular git-annex hybrid encryption scheme. Also, I need to generate and store a nonce anyway to HMAC encrypt keys. (Or modify gcrypt to put enough entropy in gcrypt-id that I can use it?)
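For readers unfamiliar with the HMAC step mentioned above, here is a minimal sketch of the idea: key names are run through an HMAC with a random secret, so the names stored on the remote reveal nothing about the original keys. The function names, the example key, and the choice of SHA1 as the digest are assumptions of this sketch.

```python
import hashlib
import hmac
import os


def new_secret(nbytes=32):
    """Generate the random secret that must be stored for later use."""
    return os.urandom(nbytes)


def obscured_key_name(secret, key):
    """Map a key name to the name used on the encrypted remote.

    Without the secret, the output cannot be linked back to the key,
    but anyone holding the secret can recompute it.
    """
    return hmac.new(secret, key.encode("utf-8"), hashlib.sha1).hexdigest()
```

This is why some secret with enough entropy has to exist somewhere: a gcrypt-id that is too short or too predictable would let an observer guess key names by brute force.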

Another concern I have is that gcrypt's own encryption scheme is simply to use a list of public keys to encrypt to. It would be nicer if the full set of git-annex encryption schemes could be used. Then the webapp could use shared encryption to avoid needing to make the user set up a gpg key, or hybrid encryption could be used to add keys later, etc.

But I see why gcrypt works the way it does. Otherwise, you can't make an encrypted repo with a friend set as one of the participants and have them be able to git clone it. Both hybrid and shared encryption store a secret inside the repo, which is not accessible if it's encrypted using that secret. There are use cases where not being able to blindly clone a gcrypt repo would be ok. For example, you use the assistant to pair with a friend and then set up an encrypted repo in the cloud for both of you to use.

Anyway, for now, I will need to deal with setting up gpg keys etc in the assistant. I don't want to tackle full gpgkeys yet. Instead, I think I will start by adding some simple stuff to the assistant:

  • When adding a USB drive, offer to encrypt the repository on the drive so that only you can see it.
  • When adding a ssh remote make a similar offer.
  • Add a UI to add an arbitrary git remote with encryption. Let the user paste in the url to an empty remote they have, which could be to eg github. (In most cases this won't be used for annexed content..)
  • When the user has no gpg key, prompt to set one up. (Securely!)
  • Maybe have an interface to add another gpg key that can access the gcrypt repo. Note that this will need to re-encrypt and re-push the whole git history.

Posted Thu Sep 5 21:19:04 2013

Now I can build git-annex twice as fast! And a typical incremental build is down to 10 seconds, from 51 seconds.

Spent a productive evening working with Guilhem to get his encryption patches reviewed and merged. Now there is a way to remove revoked gpg keys, and there is a new encryption scheme available that uses public key encryption by default rather than git-annex's usual approach. That's not for everyone, but it is a good option to have available.

Posted Thu Sep 5 04:10:43 2013

I try hard to keep this devblog about git-annex development and not me. However, it is a shame that what I wanted to be the beginning of my first real month of work funded by the new campaign has been marred by my home's internet connection being taken out by a lightning strike, and by illness. Nearly back on my feet after that, and waiting for my new laptop to finally get here.

Today's work: Finished up the git annex forget feature and merged it in. Fixed the bug that was causing the commit race detection code to incorrectly fire on the commit made by the transition code. Few other bits and pieces.

Posted Tue Sep 3 20:59:33 2013

Implemented git annex forget --drop-dead, which is finally a way to remove all references to old repositories that you've marked as dead.

I've still not merged in the forget branch, because I developed this while slightly ill, and have not tested it very well yet.

Posted Sat Aug 31 22:25:23 2013

John Millikin came through and fixed that haskell-gnutls segfault on OSX that I developed a reproducible test case for the other day. It's a bit hard to test, since the bug doesn't always happen, but the fix is already deployed for the Mountain Lion autobuilder.

However, I then found another way to make haskell-gnutls segfault, more reliably on OSX, and even sometimes on Linux. Just entering the wrong XMPP password in the assistant can trigger this crash. Hopefully John will work his magic again.

Meanwhile, I fixed the sync-after-forget problem. Now sync always forces its push of the git-annex branch (as does the assistant). I considered but rejected having sync do the kind of uuid-tagged branch push that the assistant sometimes falls back to if it's failing to do a normal sync. It's ugly, but worse, it wouldn't work in the workflow where multiple clients are syncing to a central bare repository, because they'd not pull down the hidden uuid-tagged branches, and without the assistant running on the repository, nothing would ever merge their data into the git-annex branch. Forcing the push of synced/git-annex was easy, once I satisfied myself that it was always ok to do so.
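In plain git terms, forcing a push comes down to prefixing the refspec with "+", which tells git to update the remote ref even when the new branch is not a fast-forward of the old one. A small sketch of building such a command line follows; the function is hypothetical and not how git-annex assembles its git calls.

```python
def sync_push_command(remote, force=True):
    """Build the argv for pushing the git-annex branch to a remote.

    A leading "+" on the refspec allows a non-fast-forward update,
    which is what a rewritten (forgotten) history needs.
    """
    refspec = "git-annex:synced/git-annex"
    if force:
        refspec = "+" + refspec
    return ["git", "push", remote, refspec]
```

For example, `sync_push_command("origin")` produces the arguments for `git push origin +git-annex:synced/git-annex`.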

Also factored out a module that knows about all the different log files stored on the git-annex branch, which is all the support infrastructure that will be needed to make git annex forget --drop-dead work. Since this is basically a routing module, perhaps I'll get around to making it use a nice bidirectional routing library like Zwaluw one day.

Posted Fri Aug 30 00:28:45 2013

Yesterday I spent making a release, and shopping for a new laptop, since this one is dying. (Soon I'll be able to compile git-annex fast-ish! Yay!) And thinking about wishlist: dropping git-annex history.

Today, I added the git annex forget command. It's currently been lightly tested, seems to work, and is living in the forget branch until I gain confidence with it. It should be perfectly safe to use, even if it's buggy, because you can use git reflog git-annex to pull out and revert to an old version of your git-annex branch. So if you've been wanting this feature, please beta test!

I actually implemented something more generic than just forgetting git history. There's now a whole mechanism for git-annex doing distributed transitions of whatever sort is needed.

There were several subtleties involved in distributed transitions:

First is how to tell when a given transition has already been done on a branch. At first I was thinking that the transition log should include the sha of the first commit on the old branch that got rewritten. However, that would mean that after a single transition had been done, every git-annex branch merge would need to look up the first commit of the current branch, to see if it's done the transition yet. That's slow! Instead, transitions are logged with a timestamp, and as long as a branch contains a transition with the same timestamp, it's been done.
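The timestamp check can be sketched as a simple comparison between the transitions known overall and those already logged on a branch. The representation used here, (name, timestamp) pairs, is an assumption for the example and not the real log format.

```python
def transitions_to_run(known, done_on_branch):
    """Return the transitions a branch has not performed yet, oldest first.

    Both arguments are collections of (name, timestamp) pairs. A
    transition counts as done when the branch's log contains an entry
    with the same name and timestamp.
    """
    done = set(done_on_branch)
    pending = [t for t in known if t not in done]
    return sorted(pending, key=lambda t: t[1])
```

The point of the design is that the check only reads a small log, instead of walking back to the first commit of the branch on every merge.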

A really tricky problem is what to do if the local repository has transitioned, but a remote has not, and changes keep being made to the remote. What it does so far is incorporate the changes from the remote into the index, and re-run the transition code over the whole thing to yield a single new commit. This might not be very efficient (once I write the more full-featured transition code), but it lets the local repo keep up with what's going on in the remote, without directly merging with it (which would revert the transition). And once the remote repository has its git-annex upgraded to one that knows about transitions, it will finish up the transition on its side automatically, and the two branches will once again merge.

Related to the previous problem, we don't want to keep trying to merge from a remote branch when it's not yet transitioned. So a blacklist is used, of untransitioned commits that have already been integrated.
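A rough sketch of how such a blacklist avoids repeated work, under the assumption that it is a set of remote branch commit shas: before integrating an untransitioned remote branch, check whether its current tip has been handled already. All names here are invented for the illustration.

```python
def integrate_remote(remote_tip, integrated, rerun_transition):
    """Integrate an untransitioned remote branch at most once per commit.

    `integrated` is the set of commit shas already folded in, and
    `rerun_transition` is a callback that merges the remote's changes
    into the index and re-runs the transition. Returns True if any
    work was done.
    """
    if remote_tip in integrated:
        return False
    rerun_transition(remote_tip)
    integrated.add(remote_tip)
    return True
```

When the remote gains new commits its tip changes, so it falls outside the set and gets integrated again.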

One really subtle thing is that when the user does a transition more complicated than git annex forget, like the git annex forget --dead that I need to implement to forget dead remotes, they're not just telling git-annex to forget whatever dead remotes it knows right now. They're actually telling git-annex to perform the transition one time on every existing clone of the repository, at some point in the future. Repositories with unfinished transitions could hang around for years, and at some future point when git-annex runs in the repository again, it would merge in the current state of the world, and re-do the transition. So you might tell it to forget dead remotes today, and then the very repository you ran that in later becomes dead, and a long-slumbering repo wakes up and forgets about the repo that started the whole process! I hope users don't find this massively confusing, but that's how the implementation works right now.

I think I have at least two more days of work to do to finish up this feature.

  • I still need to add some extra features like forgetting about dead remotes, and forgetting about keys that are no longer present on any remote.

  • After git annex forget, git annex sync will fail to push the synced/git-annex branch to remotes, since the branch is no longer a fast-forward of the old one. I will probably fix this by making git annex sync do a fallback push of a unique branch in this case, like the assistant already does. Although I may need to adjust that code to handle this case, too..

  • For some reason the automatic transitioning code triggers a "(recovery from race)" commit. This is certainly a bug somewhere, because you can't have a race with only 1 participant.

Today's work was sponsored by Richard Hartmann.

Posted Wed Aug 28 21:41:55 2013

I've started a new page for my devblog, since I'm not focusing exclusively on the assistant and so keeping the blog here increasingly felt wrong. Also, my new year of crowdfunded development formally starts in September, so a new blog seemed good.

Posted Wed Aug 28 21:40:09 2013