The problem statement: you want to search for a file using any arbitrary substring contained in the file url. This is deceptively hard, if you think in a relational db way, you want something "like %foo%" where the leading percent will force indexes to be considered redundant and a sequential scan to ensue. If you try instead to break up the file's URL into words and use a full text indexing solution you will find your old friend the leading percent again, like here for Lucene.
One easy way to use full text indexing style and allow substrings is to cheat. First disable word stemming (ie, dont turn diving and dive into a stemmed "dive" word). Then for each word to be added to the full text index, shift it left from 1..10 times. So wonderful, onderful, nderful, derful, erful etc are all added as words for the same URL. This is most simply accomplished using a postgresql function to do the shifting.
Then when you search for a substring foo you want to find to_tsquery('simple,'foo:*') as a fulltext query on your url column. As prefix searches are allowed in fulltext index engines like Lucene and postgresql's TSearch2 engine, this evaluates quite quickly.
For 200,000 URLs when using the raw regex match "~" in postgresql you might see 400ms evaluation times, the same data using a gin index on to_tsvector('simple',fnshiftstring(url)) might return in 2-3ms. Of course the index can't always handle your regular expression, but if you can tease out substrings which must be present from your regular expression, a shifted gin index could drop evaluation times for you. eg, .*[Ff]oob([0-9]... could use a shifted search for "oob" as a prefilter to full regular expression evaluation.
Getting down to a few ms evaluation allows GUIs to return some sample results as you type in your regular expression. This speed up will be available in 1.5.6+ of libferris.
A while ago I released an engine for maemo with statistical spatial indexing for regular expression evaluation. It's an interesting problem IMHO, and its also a very common one.
C++, Linux, libferris and embedded development. Yet another blog from yet another NARG.
Showing posts with label desktop search. Show all posts
Showing posts with label desktop search. Show all posts
Monday, May 2, 2011
Wednesday, September 8, 2010
Metadata extraction and segv
This is a post about metadata extraction from files for desktop search. While it applies specifically to libferris, the same considerations apply to anything which trawls an entire filesystem looking to create a rich index of metadata to allow desktop search to run queries in a reasonable timeframe. Ferris runs fine on KDE and the n810 (soon n900?) so its applicable to both syndications in a way.
Back in January 2008 I released a version of libferris which allowed metadata extraction to be performed out of process. While this obviously means some context switching is added, it is not necessarily slower for it. It will likely be slower on a single core ARM chip, but on a four or more core desktop the processes can run in parallel and so might actually be faster for it.
I recently tinkered with this, moving to using QT for DBus inter process communications and activation. The design uses a worker interface which has simple get/set methods which take the URL and name of the metadata to fetch and return the data. The set also takes the new value to set.
There is a broker which sits between the client and the worker and provides an async interface and also allows more than one worker to be spawned and controlled without the client needing to worry or care about this. The broker has asyncGet which takes the same URL and metadata name, and returns a numeric ID for this request. A signal is then fired either for success (with value) or failure (without). A failure might occur when the file at the URL is corrupted and the library used to inspect it for the metadata crashed. This is of course recorded so it doesn't constantly happen for that file.
The upshot is that I can use libferris to view a directory and if the xine metadata plugin crashes, the file manager stays around and only the background dbus worker process crashes. This is not to pick on Xine, which is a great project, but if hacking on libferris has told me anything it is that there is always a file on the drive which has some bytes in a strange way that is unexpected.
Perhaps more interesting than the file manager, the file indexer can trawl through everything and will not itself crash when a strange invalid file causes a segv in metadata extraction. So partially downloaded (and failed) content in my ~/FromInternet folder will not halt the indexing process with a segv.
Using DBus with a simple interface like that also allows new metadata extractors to be shared more easily. It doesn't matter if Strigi, Tracker, Beagle, or Indexer Foo use C++, Qt, mono, perl or whatever, as long as they can offer two methods on DBus. Though some factory/activation stuff is also needed to let the indexers and projects know that metadata these plugins can handle.
For those who haven't hit the pagedown by now, the interfaces are in
DBusGlue/ferris_internal_metadata_broker_introspect.xml
and worker_introspect.xml in the upcoming release of libferris (ie, the one that will happen in the next week or so when I get a moment). The DBus processes live in the apps/metadataserver of the libferris tarball.
The broker interface looks like this:
Back in January 2008 I released a version of libferris which allowed metadata extraction to be performed out of process. While this obviously means some context switching is added, it is not necessarily slower for it. It will likely be slower on a single core ARM chip, but on a four or more core desktop the processes can run in parallel and so might actually be faster for it.
I recently tinkered with this, moving to using QT for DBus inter process communications and activation. The design uses a worker interface which has simple get/set methods which take the URL and name of the metadata to fetch and return the data. The set also takes the new value to set.
There is a broker which sits between the client and the worker and provides an async interface and also allows more than one worker to be spawned and controlled without the client needing to worry or care about this. The broker has asyncGet which takes the same URL and metadata name, and returns a numeric ID for this request. A signal is then fired either for success (with value) or failure (without). A failure might occur when the file at the URL is corrupted and the library used to inspect it for the metadata crashed. This is of course recorded so it doesn't constantly happen for that file.
The upshot is that I can use libferris to view a directory and if the xine metadata plugin crashes, the file manager stays around and only the background dbus worker process crashes. This is not to pick on Xine, which is a great project, but if hacking on libferris has told me anything it is that there is always a file on the drive which has some bytes in a strange way that is unexpected.
Perhaps more interesting than the file manager, the file indexer can trawl through everything and will not itself crash when a strange invalid file causes a segv in metadata extraction. So partially downloaded (and failed) content in my ~/FromInternet folder will not halt the indexing process with a segv.
Using DBus with a simple interface like that also allows new metadata extractors to be shared more easily. It doesn't matter if Strigi, Tracker, Beagle, or Indexer Foo use C++, Qt, mono, perl or whatever, as long as they can offer two methods on DBus. Though some factory/activation stuff is also needed to let the indexers and projects know that metadata these plugins can handle.
For those who haven't hit the pagedown by now, the interfaces are in
DBusGlue/ferris_internal_metadata_broker_introspect.xml
and worker_introspect.xml in the upcoming release of libferris (ie, the one that will happen in the next week or so when I get a moment). The DBus processes live in the apps/metadataserver of the libferris tarball.
The broker interface looks like this:
<method name="asyncGet">
<arg type="s" name="earl" direction="in"/>
<arg type="s" name="name" direction="in"/>
<arg type="i" name="reqid" direction="out"/>
</method>
<signal name="asyncGetResult">
<arg type="i" name="reqid" />
<arg type="s" name="earl" />
<arg type="s" name="name" />
<arg type="ay" name="value" />
</signal>
<signal name="asyncGetFailed">
<arg type="i" name="reqid" />
<arg type="s" name="earl" />
<arg type="i" name="eno" />
<arg type="s" name="ename" />
<arg type="s" name="edesc" />
</signal>
Friday, April 23, 2010
Looking for some coding work...
If you are on the lookout for a C++, GTK+2, Qt, Linux developer with a PhD and experience in persistent storage and indexing then I'm looking for you!
For those who don't know me, I am "the libferris guy". Which means I have played with virtual filesystems for many many years and implemented a good handful of ways to index and search diverse filesystems. Some of the articles I did for Linux Journal and linux.com will give an impression of the sorts of things libferris can do, and in the next months the Linux Format includes information on Web services as a filesystem, eg, mounting flickr, youtube, and facebook.
After hacking RDF support into KOffice earlier this year, I would very much like to work more with Soprano and RDF technologies. But anything involving indexing and storage is also of huge interest. In fact one of the things I like about RDF at the low level is that it overlaps these desires too, for example, my custom model partial implementation for Soprano which is targeted at great performance on embedded devices.
I have implemented some GIST trees for PostgreSQL and worked with C++ and development on a Linux platform for over 10 years. If you are interested in these things or just want a Qt hacker for a while then please drop me a line using the nick for this blog at gmail or sourceforge as email address.
For those who don't know me, I am "the libferris guy". Which means I have played with virtual filesystems for many many years and implemented a good handful of ways to index and search diverse filesystems. Some of the articles I did for Linux Journal and linux.com will give an impression of the sorts of things libferris can do, and in the next months the Linux Format includes information on Web services as a filesystem, eg, mounting flickr, youtube, and facebook.
After hacking RDF support into KOffice earlier this year, I would very much like to work more with Soprano and RDF technologies. But anything involving indexing and storage is also of huge interest. In fact one of the things I like about RDF at the low level is that it overlaps these desires too, for example, my custom model partial implementation for Soprano which is targeted at great performance on embedded devices.
I have implemented some GIST trees for PostgreSQL and worked with C++ and development on a Linux platform for over 10 years. If you are interested in these things or just want a Qt hacker for a while then please drop me a line using the nick for this blog at gmail or sourceforge as email address.
Subscribe to:
Comments (Atom)