Dr. MonkeyIQ: metadata

Friday, July 15, 2011

Calligra & RDF Smaak

As some readers will already know, the ODF document standard has support for including one or more RDF/XML files. I have made a few such files, and will grow that collection over time. The weekend hike document cites a few people, links some to a location and also cites a time and place. While one could just say "Dom Plein" or "11 am Wednesday" these purely text references are subjective and require a human to read them.

On the other hand, the location could have an exact bounding polygon or point with digital longitude and latitude information. A computer is all to happy to use that precise description to offer maps and "how to get there from here" type information. And the 11am is of course dubious because it doesn't link to a timezone, a human will know we are not talking UK time or the ACDT timezone. But that requires inference from the cited location and knowledge of which Plein that is, or rather, which timezone that Plein is in.

I did a little hacking to freshen up some of the RDF code in Calligra and bring back optional support for using Marble to show and edit location information. The below is the weekend hike example from the github above. James, Joyce, and Mark all have contact RDF associated, with Mark also giving his location. The "next weekend" at the end of the first paragraph has both a time and place associated. In the screenshot I've opened up the place to edit it. Rather simple to drag a map around and click OK than to know the digital coordinates right of the top of your head ;)

I was hacking on fixing the same bug as boemann on #calligra, which is strange for me as I normally don't overlap on things. It seems along the way setCanvas() was called again so I removed my explicit view passing stuff in the update I just git pushed. It looks like the docker was fixed correctly by somebody else in the end. My SPARQL updates should still help. It has been a while since I hacked on this codebase, and I have to thank the Calligra guys for being so welcoming and having such a fun contribution environment! I'm fortunate to be able to hack on two projects, Abiword and Calligra, which are both so welcoming :)

Wednesday, September 8, 2010

Metadata extraction and segv

This is a post about metadata extraction from files for desktop search. While it applies specifically to libferris, the same considerations apply to anything which trawls an entire filesystem looking to create a rich index of metadata to allow desktop search to run queries in a reasonable timeframe. Ferris runs fine on KDE and the n810 (soon n900?) so its applicable to both syndications in a way.

Back in January 2008 I released a version of libferris which allowed metadata extraction to be performed out of process. While this obviously means some context switching is added, it is not necessarily slower for it. It will likely be slower on a single core ARM chip, but on a four or more core desktop the processes can run in parallel and so might actually be faster for it.

I recently tinkered with this, moving to using QT for DBus inter process communications and activation. The design uses a worker interface which has simple get/set methods which take the URL and name of the metadata to fetch and return the data. The set also takes the new value to set.

There is a broker which sits between the client and the worker and provides an async interface and also allows more than one worker to be spawned and controlled without the client needing to worry or care about this. The broker has asyncGet which takes the same URL and metadata name, and returns a numeric ID for this request. A signal is then fired either for success (with value) or failure (without). A failure might occur when the file at the URL is corrupted and the library used to inspect it for the metadata crashed. This is of course recorded so it doesn't constantly happen for that file.

The upshot is that I can use libferris to view a directory and if the xine metadata plugin crashes, the file manager stays around and only the background dbus worker process crashes. This is not to pick on Xine, which is a great project, but if hacking on libferris has told me anything it is that there is always a file on the drive which has some bytes in a strange way that is unexpected.

Perhaps more interesting than the file manager, the file indexer can trawl through everything and will not itself crash when a strange invalid file causes a segv in metadata extraction. So partially downloaded (and failed) content in my ~/FromInternet folder will not halt the indexing process with a segv.

Using DBus with a simple interface like that also allows new metadata extractors to be shared more easily. It doesn't matter if Strigi, Tracker, Beagle, or Indexer Foo use C++, Qt, mono, perl or whatever, as long as they can offer two methods on DBus. Though some factory/activation stuff is also needed to let the indexers and projects know that metadata these plugins can handle.

For those who haven't hit the pagedown by now, the interfaces are in
DBusGlue/ferris_internal_metadata_broker_introspect.xml
and worker_introspect.xml in the upcoming release of libferris (ie, the one that will happen in the next week or so when I get a moment). The DBus processes live in the apps/metadataserver of the libferris tarball.

The broker interface looks like this:


    <method name="asyncGet">
      <arg type="s" name="earl" direction="in"/>
      <arg type="s" name="name" direction="in"/>
      <arg type="i" name="reqid" direction="out"/>
    </method>

    <signal name="asyncGetResult">
      <arg type="i" name="reqid" />
      <arg type="s" name="earl"  />
      <arg type="s" name="name"  />
      <arg type="ay" name="value" />
    </signal>

    <signal name="asyncGetFailed">
      <arg type="i" name="reqid" />
      <arg type="s" name="earl"  />
      <arg type="i" name="eno" />
      <arg type="s" name="ename"  />
      <arg type="s" name="edesc"  />
    </signal>