I've started a little light hacking to allow metadata extraction to work out-of-process in libferris. I really want metadata extraction to be able to scale to 4-8 threads of concurrent extraction and caching for the 4-8 core machines of today and tomorrow. Doing metadata extraction out-of-process also means that an app wanting metadata will not segfault just because it was given a strange file that crashes the extraction path. So folks will not blame libferris when the libY.so that libferris uses to extract data from Y.foo files has a segfault-causing bug in it.
I'm planning on using dbus for this at the moment. This should finally make metadata extraction code on the desktop somewhat more sharable. Hopefully I can just drop in libferris, strigi and other metadata extraction services and have dbus automatically pick them up and use them. This part isn't much of a gain to me personally, because I tend to just add native support in libferris for metadata that I'm interested in (or use other libs like strigi from ferris ;)
Below are my current design thoughts:
The plan is to have an object broker and many worker objects. The API should be async by default, and clients can fairly easily block for a metadata reply if they are designed to work that way (like console apps). Other apps can just issue a bunch of metadata requests and update the GUI as the results come in.
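To make the "async by default, block if you want" idea concrete, here is a minimal sketch of how a console app could block on an async request by wrapping the callback in a future. Note that MetadataBroker, its request() signature, and the fake worker thread are all illustrative stand-ins for the eventual dbus API, not real libferris code.

```python
from concurrent.futures import Future
import threading

class MetadataBroker:
    """Toy in-process broker: answers each request on a worker thread."""
    def request(self, url, attribute, callback):
        # A real broker would dispatch over dbus to a worker process;
        # here we just fabricate a value to show the callback flow.
        def work():
            callback(url, attribute, "fake-value-of-" + attribute)
        threading.Thread(target=work).start()

def request_blocking(broker, url, attribute):
    """Adapter: turn the callback style into a blocking call."""
    fut = Future()
    broker.request(url, attribute, lambda u, a, value: fut.set_result(value))
    return fut.result(timeout=5)

broker = MetadataBroker()
print(request_blocking(broker, "file:///tmp/x.jpg", "width"))
```

A GUI app would skip the adapter and pass a callback that updates the display as each result arrives.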
I plan to have two APIs: one quick "get me the value of X from url Y" API, and a bulk API to get it all.
The broker API might be something like this:
void registerClient( string callbackname )
long request( string url, string attribute )
long put( string url, string attribute, string value )
registerClient() tells the broker which object to call back with metadata, so a request() call will find the metadata and call the registered dbus object to tell it the value. The other option is to just use dbus signals to reply to request() and put() from the broker.
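A hedged sketch of that callback flow: the broker records which object to call back, and request() returns a request id that is echoed in the reply so the client can match results to requests. All names and the inline reply are illustrative, not the final dbus interface.

```python
class EchoCallback:
    """Client-side object the broker calls back into with results."""
    def __init__(self):
        self.replies = {}
    def reply(self, reqid, url, attribute, value):
        self.replies[reqid] = value

class Broker:
    def __init__(self):
        self._clients = {}   # client name -> registered callback object
        self._next_id = 1
    def registerClient(self, name, callback_obj):
        self._clients[name] = callback_obj
    def request(self, client_name, url, attribute):
        reqid = self._next_id
        self._next_id += 1
        # A real broker would hand this to a worker process and reply
        # later; we reply inline with a fabricated value.
        value = "fake-" + attribute
        self._clients[client_name].reply(reqid, url, attribute, value)
        return reqid

b = Broker()
cb = EchoCallback()
b.registerClient("client1", cb)
rid = b.request("client1", "file:///tmp/a.png", "width")
print(cb.replies[rid])
```

The signal-based alternative would drop registerClient() entirely and have clients match replies by request id on a broadcast signal.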
The abstract API for workers is synchronous in nature; the broker handles managing many workers and remembers if any of them segfault, and under what conditions, so it doesn't invoke those cases again. Of course the broker will have to use threading or async dispatch of dbus calls to remain responsive while the workers are doing their, um, "work".
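The "remember what segfaulted" part could be as simple as a set of poisoned (worker, url, attribute) combinations that the broker refuses to retry. A minimal sketch, with made-up names:

```python
class CrashTracker:
    """Broker-side record of worker invocations that ended in a crash."""
    def __init__(self):
        self._bad = set()

    def record_crash(self, worker_name, url, attribute):
        # Called when the broker notices a worker process died
        # while handling this request.
        self._bad.add((worker_name, url, attribute))

    def is_safe(self, worker_name, url, attribute):
        return (worker_name, url, attribute) not in self._bad

tracker = CrashTracker()
tracker.record_crash("libY-worker", "file:///tmp/evil.foo", "width")
print(tracker.is_safe("libY-worker", "file:///tmp/evil.foo", "width"))
```

A fancier version might persist this across runs, or poison the whole url for that worker rather than a single attribute.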
Single attribute API:
string get( string url, string attribute )
void put( string url, string attribute, string value )
Bulk API:
map<string,string> getbulk( string url )
void setbulk( string url, map<string,string> values )
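Putting the two worker APIs together, here is a sketch of the synchronous worker-side interface, with the bulk calls defaulting to repeated single get()/put() calls. ExampleWorker, attributes(), and the in-memory store are hypothetical illustrations.

```python
class Worker:
    """Abstract synchronous worker, mirroring the API lists above."""
    def get(self, url, attribute):
        raise NotImplementedError
    def put(self, url, attribute, value):
        raise NotImplementedError
    def attributes(self, url):
        """Which attributes this worker can extract for url."""
        raise NotImplementedError
    def getbulk(self, url):
        # Default bulk get: one single-attribute get per attribute.
        return {a: self.get(url, a) for a in self.attributes(url)}
    def setbulk(self, url, values):
        for attribute, value in values.items():
            self.put(url, attribute, value)

class ExampleWorker(Worker):
    """Trivial worker backed by an in-memory dict instead of real files."""
    def __init__(self):
        self._store = {}
    def attributes(self, url):
        return ["width", "height"]
    def get(self, url, attribute):
        return self._store.get((url, attribute), "unknown")
    def put(self, url, attribute, value):
        self._store[(url, attribute)] = value
```

A real worker would override getbulk() when it can extract everything in one pass over the file.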
The values might well become byte arrays or something else more streamy. But overall the use of streams here doesn't gain much unless you add the complexity of a streambuf API over dbus to be able to do partial reads and seeks on metadata. Overall, using strings for values should be dandy fine at the sizes metadata tends to be.