Wednesday, September 8, 2010

Metadata extraction and segv

This is a post about metadata extraction from files for desktop search. While it applies specifically to libferris, the same considerations apply to anything which trawls an entire filesystem looking to create a rich index of metadata to allow desktop search to run queries in a reasonable timeframe. Ferris runs fine on KDE and the n810 (soon n900?) so its applicable to both syndications in a way.

Back in January 2008 I released a version of libferris which allowed metadata extraction to be performed out of process. While this obviously means some context switching is added, it is not necessarily slower for it. It will likely be slower on a single core ARM chip, but on a four or more core desktop the processes can run in parallel and so might actually be faster for it.

I recently tinkered with this, moving to using QT for DBus inter process communications and activation. The design uses a worker interface which has simple get/set methods which take the URL and name of the metadata to fetch and return the data. The set also takes the new value to set.

There is a broker which sits between the client and the worker and provides an async interface and also allows more than one worker to be spawned and controlled without the client needing to worry or care about this. The broker has asyncGet which takes the same URL and metadata name, and returns a numeric ID for this request. A signal is then fired either for success (with value) or failure (without). A failure might occur when the file at the URL is corrupted and the library used to inspect it for the metadata crashed. This is of course recorded so it doesn't constantly happen for that file.

The upshot is that I can use libferris to view a directory and if the xine metadata plugin crashes, the file manager stays around and only the background dbus worker process crashes. This is not to pick on Xine, which is a great project, but if hacking on libferris has told me anything it is that there is always a file on the drive which has some bytes in a strange way that is unexpected.

Perhaps more interesting than the file manager, the file indexer can trawl through everything and will not itself crash when a strange invalid file causes a segv in metadata extraction. So partially downloaded (and failed) content in my ~/FromInternet folder will not halt the indexing process with a segv.

Using DBus with a simple interface like that also allows new metadata extractors to be shared more easily. It doesn't matter if Strigi, Tracker, Beagle, or Indexer Foo use C++, Qt, mono, perl or whatever, as long as they can offer two methods on DBus. Though some factory/activation stuff is also needed to let the indexers and projects know that metadata these plugins can handle.

For those who haven't hit the pagedown by now, the interfaces are in
DBusGlue/ferris_internal_metadata_broker_introspect.xml
and worker_introspect.xml in the upcoming release of libferris (ie, the one that will happen in the next week or so when I get a moment). The DBus processes live in the apps/metadataserver of the libferris tarball.

The broker interface looks like this:

<method name="asyncGet">
<arg type="s" name="earl" direction="in"/>
<arg type="s" name="name" direction="in"/>
<arg type="i" name="reqid" direction="out"/>
</method>

<signal name="asyncGetResult">
<arg type="i" name="reqid" />
<arg type="s" name="earl" />
<arg type="s" name="name" />
<arg type="ay" name="value" />
</signal>

<signal name="asyncGetFailed">
<arg type="i" name="reqid" />
<arg type="s" name="earl" />
<arg type="i" name="eno" />
<arg type="s" name="ename" />
<arg type="s" name="edesc" />
</signal>

1 comment:

pvanhoof said...

For large amounts of metadata, it still matters that you're using D-Bus: you'll be hammering the bus while indexing, and the UTF8 validation of strings and marshalling will kill your performance. You should look into D-Bus 1.3.x's FD passing. The tracker-extract to tracker-miner-fs and tracker-miner-fs to tracker-store communication uses FD passing for this reason. For integration of this stuff ideally you either make a miner or you integrate it as plugins for tracker-extract, and then we can on Harmattan look into using it.