git-annex uses FilePath (String) extensively. That's a slow data type. According to profiling, converting to ByteString and RawFilePath should speed it up significantly.

I've made a test branch, bs, to see what kind of performance improvement to expect.

Benchmarking git-annex find, speedups range from 28% to 66%. The files fly by much more snappily. Other commands likely also speed up, but they do more work than find, so the improvement is not as large.

The bs branch is in a mergeable state now, but still needs work:

  • Eliminate all the fromRawFilePath, toRawFilePath, encodeBS, decodeBS conversions. Or at least most of them. There are likely quite a few places where a value is converted back and forth several times.

    As a first step, profile and look for the hot spots. Known hot spots:

    • keyFile uses fromRawFilePath and that adds around 3% overhead in git-annex find. Converting it to a RawFilePath needs a version of </> for RawFilePaths.
    • getJournalFileStale uses fromRawFilePath, and adds 3-5% overhead in git-annex whereis. Converting it to RawFilePath needs a version of </> for RawFilePaths. It also needs a ByteString.readFile for RawFilePath.
  • System.FilePath is not available for RawFilePath, and many of the conversions are to get a FilePath in order to use that library.

    It should be straightforward to make a version of System.FilePath that operates on RawFilePath, although Windows could cause some complications.

  • Use versions of IO actions like getFileStatus that take a RawFilePath, avoiding a conversion. Note that these are only available on unix, not windows, so a compatibility shim will be needed. (I can't seem to find any library that provides one.)
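
To give an idea of what a version of </> for RawFilePaths might look like, here is a minimal sketch. It is not code from the bs branch. It assumes RawFilePath is a plain ByteString (redefined locally so that only the bytestring package is needed), it handles only POSIX-style paths with a '/' separator, and it ignores Windows drives and separators entirely. The guards are meant to mirror how the POSIX </> behaves on Strings.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Word (Word8)

-- Assumption: RawFilePath is just a ByteString. It is redefined here so
-- the sketch depends only on the bytestring package.
type RawFilePath = B.ByteString

-- The POSIX path separator '/' as a byte. Windows is not handled.
pathSeparator :: Word8
pathSeparator = 47

-- Hypothetical ByteString analogue of (</>) for POSIX paths.
infixr 5 </>
(</>) :: RawFilePath -> RawFilePath -> RawFilePath
a </> b
    | B.null a = b
    | B.null b = a
    -- An absolute right-hand side replaces the left-hand side.
    | B.head b == pathSeparator = b
    -- Avoid doubling the separator when the left side already ends in one.
    | B.last a == pathSeparator = a <> b
    | otherwise = B.concat [a, B.singleton pathSeparator, b]

main :: IO ()
main = do
    B8.putStrLn ("annex" </> "objects" </> "key") -- annex/objects/key
    B8.putStrLn ("annex/" </> "objects")          -- annex/objects
    B8.putStrLn ("annex" </> "/tmp/key")          -- /tmp/key
```

Since this only concatenates ByteStrings, no String is ever built, which is the point of converting callers like keyFile. A ByteString.readFile for RawFilePath is a separate problem and is not covered by this sketch.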