metadata
Attach an arbitrary set of metadata to a key. This consists of any number of fields. Each field has an unordered set of values. The special field "tag" has as its values any tags that are set for the key.
Store in git-annex branch, next to location log files.
Storage needs to support union merging, including removing an old value of a field, and adding a new value of a field.
filtered branches
The reason to use specially named filtered branches is that the branch name then documents how the repository is currently filtered.
unmatched files in filtered branches
TODO Files not matching the view should be able to be included in the filtered branch, in a special location, an "other" directory.
For example, it could make an "other" directory containing files without a tag when viewing by tag.
It might be nice, in a two level view, for the other directories to nest. For example, other/2014/file. However, that leads to a performance problem: When adding a level to a view, it has to look at each file in the "other" directory and generate a view for it too. With a lot of files, that'd be slow.
Instead, why not replicate the parent branch's directory structure inside the "other" directory? Then the directory tree only has to be constructed once, and can be left alone when refining a view.
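A rough sketch of that layout, under stated assumptions: a single-level view by tag, where the helper name `view_paths` and the `%` flattening of matched filenames are hypothetical and not necessarily what git-annex does. Untagged files keep their original path under "other", so that part of the tree would not need regenerating when the view is refined.

```python
def view_paths(original_path, tags):
    """Return the paths one file would occupy in a hypothetical tag view."""
    if not tags:
        # Unmatched: replicate the parent branch's directory structure as-is.
        return ["other/" + original_path]
    # Matched: one entry per tag. Flattening "/" to "%" is an illustrative
    # assumption that keeps files from different subdirs from colliding.
    flattened = original_path.replace("/", "%")
    return sorted(tag + "/" + flattened for tag in tags)


print(view_paths("talks/2014/intro.ogg", {"video", "draft"}))
print(view_paths("talks/2014/notes.txt", set()))
```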
operations while on filtered branch
- If files are removed and git commit called, git-annex should remove the relevant metadata from the files. done
  (Currently, only metadata used for visible subdirs is added and removed this way.)
  (Also, this is not usable in direct mode because deleting the file.. actually deletes it...)
- If a file is moved into a new subdirectory while in a view branch, a tag is added with the subdir name. This allows on the fly tagging. done
- git annex sync should avoid pushing out the view branch, but it should check if there are changes to the metadata pulled in, and update the branch to reflect them.
automatically added metadata
When annex.genmetadata is set, git annex add automatically attaches some metadata to a file. Currently year and month fields, from its mtime.
There's also a post-commit-annex hook script.
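As an illustration only, the mtime-derived fields could be computed along these lines. The function name `mtime_fields`, the use of UTC, and the unpadded month value are assumptions made for this sketch; the text above does not specify the exact format git-annex records.

```python
from datetime import datetime, timezone


def mtime_fields(mtime_seconds):
    """Derive year and month metadata fields from a file's mtime.

    Assumes the mtime is seconds since the Unix epoch, interpreted in UTC.
    """
    moment = datetime.fromtimestamp(mtime_seconds, tz=timezone.utc)
    return {"year": str(moment.year), "month": str(moment.month)}


print(mtime_fields(1287290776))
```

For a real file, the input would come from something like `os.stat(path).st_mtime`.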
directory hierarchy metadata
From the original filename used in the master branch, when constructing a view, generate fields. For example foo/bar/baz.mp3 would get /=foo, foo/=bar, foo/bar/=baz, and .=mp3.
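The mapping in that example can be sketched as follows. This is a minimal illustration, not the actual implementation; the function name `hierarchy_fields` is hypothetical, and the handling of a file at the top level (it only gets the `/` and `.` fields) is an assumption, since the text gives just the one nested example.

```python
import posixpath


def hierarchy_fields(path):
    """Generate (field, value) pairs from a path like foo/bar/baz.mp3."""
    directory, filename = posixpath.split(path)
    base, ext = posixpath.splitext(filename)
    components = [c for c in directory.split("/") if c] + [base]
    fields = []
    prefix = ""
    for component in components:
        # The field name is the path walked so far; "/" stands for the top level.
        fields.append((prefix or "/", component))
        prefix += component + "/"
    if ext:
        # The extension, without its leading dot, goes in the "." field.
        fields.append((".", ext[1:]))
    return fields


for field, value in hierarchy_fields("foo/bar/baz.mp3"):
    print(field + "=" + value)
```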
Note that dir/=subdir allows a view to use dir/=* and only match one level of subdirs with the glob. So it is better than dir=foo/bar as the metadata. (Alternatively, could do special glob matching.)
This allows using whatever directory hierarchy exists to inform the view, without locking the view into using it.
Complication: When refining a view, it only looks at the filenames in the view, so it has to map from those filenames to derive the same metadata, unless there is persistent storage. Luckily, the filenames used in the views currently include the subdirs.
other uses for metadata
Uses are not limited to view branches.
git annex checkoutmeta year=2014 talk in a subdir of master could create the same tree of files filter would. The user can then commit that if desired. Or, they could run additional commands like git annex fadd to refine the tree of files in the subdir.
Metadata can be used for configuring numcopies. One way would be a numcopies=n value attached to a file. But perhaps better would be to make the numcopies.log allow configuring numcopies based on which files have other metadata.
Other programs could query git-annex for the metadata of files in the work tree, and do whatever they want with it.
filenames
The hard part of this is actually getting a useful filename to put in the view branch, since git-annex only has a key which the user will not want to see.
- Could use filename metadata for the key, recorded by git-annex add (which may not correspond to filenames being used in regular git branches like master for the key).
- Could use the Keys database's associated files. Currently only works for v6 unlocked files, and not for locked files.
- Current approach: Have a reference branch (eg master) and walk it to find filenames and keys. Fine as long as it can be done efficiently. Also allows including the subdirectory a file is in, potentially. cwebber points out that this is essentially a form of tracking branch. Which implies it will need to be updatable when the reference branch changes. Should be doable via diff-tree.
Note that we have to take care to avoid generating conflicting filenames. The current approach is to embed the full directory structure inside the filename in the view branch.
union merge properties
While the storage could just list all the current values of a field on a line with a timestamp, that's not good enough. Two disconnected repositories can make changes to the values of a field (setting and unsetting tags for example) and when this is union merged back together, the changes need to be able to be replayed in order to determine which values we end up with.
To make that work, we log not only when a field is set to a value, but when a value is unset as well.
For example, here two different remotes added tags, and then later a tag was removed:
1287290776.765152s tag +foo +bar
1287290991.152124s tag +baz
1291237510.141453s tag -bar
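To illustrate why logging both sets and unsets makes union merging work, here is a small sketch that replays such a log. It assumes the line layout shown in the example above (timestamp, field, then `+value` or `-value` entries) and values without spaces; the function name `replay_metadata_log` is hypothetical. Because entries are sorted by timestamp before replay, the order in which a union merge concatenated the lines does not matter.

```python
def replay_metadata_log(lines):
    """Replay set/unset log lines in timestamp order.

    Returns a dict mapping each field to its current set of values.
    """
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        timestamp = float(parts[0].rstrip("s"))
        entries.append((timestamp, parts[1], parts[2:]))

    current = {}
    for _, field, changes in sorted(entries, key=lambda entry: entry[0]):
        values = current.setdefault(field, set())
        for change in changes:
            if change.startswith("+"):
                values.add(change[1:])
            elif change.startswith("-"):
                values.discard(change[1:])
    return current


# The same three lines as above, deliberately out of order.
log = [
    "1291237510.141453s tag -bar",
    "1287290776.765152s tag +foo +bar",
    "1287290991.152124s tag +baz",
]
print(sorted(replay_metadata_log(log)["tag"]))
```

Replaying the example leaves the tags foo and baz set, since bar was removed after it was added.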
efficient metadata lookup
Looking up metadata for view generation so far requires traversing all keys in the git-annex branch. This is slow. A fast cache is needed.
TODO
unlocked file issues
View branches can't be used in direct mode repositories.
But, view branches do work with unlocked files in v6 repositories. The resulting view branch has all its files locked, although you can unlock them again after entering the branch.
gotchas
Checking out a view branch can remove the current subdir. May be worth detecting when this happens and helping the user. done
Git has a complex set of rules for what is legal in a ref name. View branch names will need to filter out any illegal stuff. done
Metadata should be copied to the new key when adding a modified version of a file. done
Filesystems that are not case sensitive (including case preserving OSX) will cause problems if view branches try to use different cases for 2 directories representing a metadata field.
Solution might be to compare field names case-insensitively, and pick one representation consistently. done
Assistant needs to know about views, so it can update metadata when files are moved around inside them. TODO
What happens if git annex add or the assistant adds a new file while on a view? If the file is not also added to the master branch, it will be lost when exiting the view. TODO
The filename mangling can result in a filename in a view that is too long for its containing filesystem. Should detect this and do something reasonable to avoid it. TODO
Hi,
I love the idea behind storing metadata.
I suggest to exchange ideas (and maybe code) with projects already implementing metadata systems.
I have tried several implementations and particularly noticed tmsu (http://tmsu.org/). This tool stores tags in a sqlite database and also uses a SHA-256 fingerprint of the file to be aware of file moves. It provides a fuse view of the tags with the ability to change tags by moving files (like in the git annex metadata view).
Paul Ruane is particularly responsive on the mailing list and he already supports git annexed files (with SHA256E fingerprint) (see the end of the thread https://groups.google.com/forum/#!topic/tmsu/A5EGpnCcJ2w).
Even if you cannot reuse the project, there are interesting ideas that might be worth looking at, like the implications of tags: a file tagged "film" being automatically tagged "video".
Tagsistant (http://www.tagsistant.net/) may also be a good source of inspiration. I just don't like the fact that it uses a backstore of tagged files.
Thanks for reading.
Some additional ideas for metadata...
Instead of having a simplistic scheme like 'field=value' it might be advantageous to consider a scheme like 'attribute=XXX, value=YYY, unit=ZZZ'. That way you could do interesting things with the metadata like adding counters to things, and allow for interesting queries like: give me all 'things' tagged with a unit of "audio_file". This assumes one had trawled through an entire annex and tagged all files based on type with the unix file tool or something like that.
The above idea is already in use in irods and it's a really nice and powerful way to let users add meta-data and to build up more interesting use cases and tools.
btw, I plan on taking a look at seeing if I can map some of the meta that we have in work into this new git-annex feature to see how well/bad it works. Either way this feature looks cool! +1!!!
actually in your mp3 example you could have ....
ATTRIBUTE=sample_rate, VALUE=22100, UNIT=Hertz
Another example use case: if you are always consistent with the AVU order, you could stick in ntriples from RDF to do other cool things by looking up various linked data sources; see http://www.w3.org/2001/sw/RDFCore/ntriples/ and http://www.freebase.com/. Actually, it would be quite cool if git-annex examined the mp3's id3 tag, then created an ntriple styled entry that can be automatically parsed by the web-based annex gui to pull in additional meta-data from the likes of freebase. I guess the list of ideas can only get bigger with this potential metadata capability.
Hi,
apologies if I am missing something, but from what I understand, git-annex will automatically add the year and the month from a file's mtime to its metadata if instructed to do so.
So... What about the day (or the time, for that matter?)? What is the reasoning behind the decision not to add those bits automatically? And, is there a way to get git-annex to add those bits of information automatically as well (besides the obvious way of creating a pre-commit-hook script to that effect)?
THX & Cheers, Toby.
Sorry for the noise, I see that tags can be used for preferred content, excellent!
But it seems metadata is tied to a key, not to a specific file/path. If I have 10 different files all with the same content (for some reason, say a simple txt file, Gemspec, or something), and I want to tag one of them as important, it doesn't mean they all are.
Hi everybody, is it possible to use a metadata field for the filename in a metadata driven view?
I am thinking of the following use case:
git annex metadata --set artist=Led\ Zeppelin --set album=Led\ Zeppelin\ IV --set title=04\ Stairway\ to\ heaven some/weird/filename.mp3
git annex view --filename-from title artist= album=
result: Led Zeppelin/Led Zeppelin IV/04 Stairway to heaven.mp3
@Sunke, the reason that views make up their own filenames is to avoid the problem of having 2 files in a view that have the same name.
In your example, that could happen if you used --set title with the same title for 2 separate files.
So, I don't think this can be supported reasonably.