Hi
Is there any (built-in or otherwise) way to search git-annex metadata and file content? Ideally I think something that knows about git-annex would be helpful because of files moving around / going away due to metadata driven views, dangling symlinks etc.
I'm imagining:
- Something based on Lucene (solr/elasticsearch) or Xapian for fast searches
- Ideally using git-annex metadata to track which files move where on disk
- Maybe use of inotify to let it know when git annex has moved file content around or added/removed it from a working tree
Thanks everybody, and Joey for making git annex and being an inspiration
I personally use recoll to do that. It is not perfect but works well.
To extract the metadata, I use a script called git-annex_dump_tags.sh whose content is:
    set -eu
    FILE="${1}"
    DIR="$(dirname "${FILE}")"
    FILE_NO_DIR="$(basename "${FILE}")"
    cd "${DIR}"
    git annex metadata "${FILE_NO_DIR}"|\
    tail -n+2|\
    head -n-1|\
    sed -r 's/^ +//'|\
    sed -r 's/^([^=]+)=(.+)$/\1 = \2/'
Then in the recoll configuration, I added:
    [~/perso]
    metadatacmds = ; rclmulti_gitannex = git-annex_dump_tags.sh %f
More information can be found in the recoll manual.
Then you'll have to indicate which keys to use in the indexing by updating the "fields" file. For instance, you could add:
    [prefixes]
    ack = XYACK
    year = XYYEAR
    month = XYMONTH
    day = XYDAY
I generally use the ack metadata in my bibliography to indicate whether I have read a paper or not (ack=no or ack=yes). I can get all the papers that I have not read yet, that were added in August 2015, and that deal with Gaussian processes with the query:
    recoll -t -q ack:no year:2015 month:8 gaussian processes
It was not easy to set up, but it does the job.
Hope that helps.
The previous comment did not display the files correctly. See the linked script to dump the tags and the linked fields file.
Nice script for using recoll, and I imagine similar approaches could be used to integrate with other search engines.
Note that git annex metadata --json yields a format better suited to machine parsing, especially because metadata values can contain arbitrary content, potentially even newlines.
There are plans to eventually make git-annex use a caching database for things, including metadata. This would automatically get updated whenever there are changes to the metadata, and SQL queries could then be run against it. Or, git annex metadata queries could boil down to SQL queries and so run a lot faster than they do now.
I see there is an Ubuntu Unity search lens for recoll: http://www.webupd8.org/2012/03/recoll-lens-full-text-search-unity-lens.html
It should be possible to integrate git-annex metadata with that.
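For anyone who wants to try the --json route with the recoll setup described earlier in the thread, here is a rough sketch of what a replacement for the dump script might look like. It assumes each line of JSON output carries a "fields" object mapping field names to lists of values. The function names, and the choices to skip the lastchanged bookkeeping fields and to flatten newlines inside values, are my own and not part of the original script.

```python
import json
import os
import subprocess
import sys


def fields_to_lines(json_line):
    """Convert one JSON line from `git annex metadata --json` to 'key = value' lines."""
    record = json.loads(json_line)
    lines = []
    for name, values in sorted(record.get("fields", {}).items()):
        # Assumption: the lastchanged bookkeeping fields are not worth indexing.
        if name == "lastchanged" or name.endswith("-lastchanged"):
            continue
        for value in values:
            # Collapse runs of whitespace (including newlines) so that
            # every value stays on a single output line.
            lines.append(f"{name} = {' '.join(value.split())}")
    return lines


def dump_tags(path):
    """Run git-annex on one file, from that file's directory, and collect the lines."""
    result = subprocess.run(
        ["git", "annex", "metadata", "--json", os.path.basename(path)],
        cwd=os.path.dirname(os.path.abspath(path)),
        capture_output=True,
        text=True,
        check=True,
    )
    lines = []
    for json_line in result.stdout.splitlines():
        if json_line.strip():
            lines.extend(fields_to_lines(json_line))
    return lines


if __name__ == "__main__" and len(sys.argv) > 1:
    print("\n".join(dump_tags(sys.argv[1])))
```

Saved as an executable script, it could stand in for git-annex_dump_tags.sh in the metadatacmds line, with the file path passed as the only argument.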