Dr. MonkeyIQ

Friday, July 15, 2011

Calligra & RDF Smaak

As some readers will already know, the ODF document standard has support for including one or more RDF/XML files. I have made a few such files, and will grow that collection over time. The weekend hike document cites a few people, links some to a location and also cites a time and place. While one could just say "Dom Plein" or "11 am Wednesday" these purely text references are subjective and require a human to read them.

On the other hand, the location could have an exact bounding polygon or point with digital longitude and latitude information. A computer is all to happy to use that precise description to offer maps and "how to get there from here" type information. And the 11am is of course dubious because it doesn't link to a timezone, a human will know we are not talking UK time or the ACDT timezone. But that requires inference from the cited location and knowledge of which Plein that is, or rather, which timezone that Plein is in.

I did a little hacking to freshen up some of the RDF code in Calligra and bring back optional support for using Marble to show and edit location information. The below is the weekend hike example from the github above. James, Joyce, and Mark all have contact RDF associated, with Mark also giving his location. The "next weekend" at the end of the first paragraph has both a time and place associated. In the screenshot I've opened up the place to edit it. Rather simple to drag a map around and click OK than to know the digital coordinates right of the top of your head ;)

I was hacking on fixing the same bug as boemann on #calligra, which is strange for me as I normally don't overlap on things. It seems along the way setCanvas() was called again so I removed my explicit view passing stuff in the update I just git pushed. It looks like the docker was fixed correctly by somebody else in the end. My SPARQL updates should still help. It has been a while since I hacked on this codebase, and I have to thank the Calligra guys for being so welcoming and having such a fun contribution environment! I'm fortunate to be able to hack on two projects, Abiword and Calligra, which are both so welcoming :)

Monday, July 11, 2011

RDF in ODF: Abiword & Calligra

RDF has been slowly making it's way into Office applications. The ODF standard includes support for shipping RDF/XML file(s) inside the zip file that is an odt file. This RDF can also be linked to particular part(s) of the document text so that you and your computer both know where the RDF is most relevant. For example, if "Fred" in the document has his phone number, location, and cake preference in RDF, that can all be linked just to the four characters "Fred" so that it all makes sense. Strange as it might be, not everybody likes Baumkuchen, and it is fairly likely not to be relevant to a stock quote in another part of the document.

RDF has spread to OpenOffice, abiword, KOffice, and Calligra. All of these applications can read and write RDF in text documents. The later two also include a GUI to allow you to query, inspect, and update the RDF. Since I'm hacking on abiword, I've been throwing around how to best expose RDF to the person using abiword for document editing...

First, this is what Calligra does. The main document window includes an RDF docker which shows you the high level "Semantic objects". These are things which make use of many RDF triples to present a single object type such as a contact, calendar event, location, or explicit train trip. Note that the RDF docker only shows you the semantic objects for the RDF which is relevant to the current document cursor position.

The Document Information window also lets you get at all of the RDF which ships with an ODF file. The Semantic tab is very similar to the RDF docker but shows all the Semantic Objects regardless of where they are relevant in the document (if at all). As you can see below, editing the "Dan" person semantic object you can set their name, nickname, phone number, and homepage. Of course, more information is relevant to people and this whole section should be expanded to cater for that. And yes, for Calligra having a good hookup to Akonadi would be of great use for all.

Contacts use the FOAF RDF schema in Calligra. This allows not only contact information but also the relations between contacts to be expressed. FOAF is about Friends of a Friend after all. Looking at the above you might think name, phone etc are each going to be a triple in the RDF from the document. The triples tab lets you get at that lower level RDF goodness as shown below. A few things to note; while RDF is triples, each object has a type (is it a chunk of text or a link to another subject), and since there are many possible "files" the RDF/XML came from that is tracked for each triple so they can go back there too on save.

Notice in the above that prefixes are included in the subject, predicate, and object columns. This is an attempt to make the raw RDF less verbose and somewhat simpler to handle. The namespaces tab lets you set these up. Any namespaces that are used in the RDF/XML from the ODF file are automatically added and used for you.

The stylesheets tab I'll cover at another time. The SPARQL tab lets you run a query against all the RDF for the document. The one I've run here is the default one that Calligra shows you, which will select all the triples from the document without restriction. The subject, predicate, and object resulting are shown in the bottom half of the window.

I was thinking about all of this recently because I'm now looking to add GUI stuff to abiword to allow RDF interaction. The first idea was to simply add a "edit RDF" context menu item to allow you to associate one or more triples with the cursor position or current selection. The ability to define and reuse namespaces would also help to make such a dialog less painful to use. This brings the design close to the combined "Triples" and "Namespaces" tabs of the Calligra Document Information window. This might be OK for determined users who already really, really know they want to do these things. But I tend to think there are more folks who could take advantage of using RDF but not necessarily care about it.

Simplicity for users was the driving force behind the design of Semantic Objects and the use of Drag and Drop to and from other applications to create and harvest RDF data. I think it is much simpler to grab the "Fred" contact from Evolution and drop it into the document than to work out that you want to use FOAF and the exact predicates to create a well defined RDF graph for the Fred contact and then copy and paste each of those pieces of data individually.

One might like to consider the Triples+Namespaces as a special type of Semantic Object, a "raw" object if you will. This brings together the design of the advanced and user friendly interaction into a single dialog. As the namespaces are likely to have whole document scope they can be setup and edited elsewhere. Unfortunately I had a bit of trouble working out how to populate a tree or list in glade-2 or glade-3 for mock ups, so these are gimped a bit too.

The dialog below is a semantic object editor with the advanced tab allowing raw interaction. As there can be zero or more semantic objects of a given type in scope at any point there is a list on the left side allowing you to choose which object of a type to view. Perhaps that should be a drop down list at the top of the tab to save screen space.

The email and VoIP links should start a new message or request a phone call with the person respectively. Such actions should also be available without getting to the editor itself. My current plan is to have the advanced tab allow interaction with the raw triples. Remember though that triples carry type information, possibly extra context, and/or perhaps a range of the revisions in a change tracked document that the triple is valid for. So its by no means just a list with three columns as the name triples might at first imply.

A somewhat problematic first blush at this gives the below. I'm thinking that the subj, pred, and object strings can be namespace:foo strings, possibly with some completion for known namespaces like foaf, et al. The type is fairly OK as these are fixed and mainly URI or Object.

The revision range selection is a real challenge. This might become some sort of date range bar line the timeline or timeplot from the simile widgets. The trick as is usual is extrapolating the extra dimension from what is in it's vanilla sense a linear one dimensional data set ( time, revision ). Though having the revisions and their descriptions in the top half of the timeline and the ability to pan and zoom seeing a density plot in the lower half would work for starters.

I'm thinking that as well as showing you all the triples that maybe allowing simple one or two line SPARQL to be run to find the triples to edit would be preferable. Perhaps it doesn't add much for a small document range with only 20 triples associated, but to use the dialog on the whole document too, you might want to limit triples to "current revision" and foaf related only for example. Using a triple list allows you to sort by column and search, but such a search could also be performed with relatively simple SPARQL. And normally, and extremely unfortunately, one normally doesn't get to stable sort lists by 2+ columns. A limitation I try to avoid inflicting.

So in summary, raw triple editing can be just an advanced semantic object. The list of semantic objects should be able to be found from a document position (cursor) or arbitrary begin-end range. The later catering for whole document RDF editing as a special case. For contacts there might be one or more semantic objects for any doc position or range, but there will only be one raw-triple semantic object for any range.

Though I'm still chucking around how to make the query/edit part most convenient for users for the raw triples semantic object.

Saturday, June 18, 2011

Libferris for Fedora 15

My openSUSE Build Service project now has rpm files for Fedora 15 for the libferris chain including gevas and the ego file manager. This includes the recently mentioned updates to postgresql indexing allowing for indexed resolution of some regex queries (2ms query time on 200+k files).

During the updates, I found the normal little things such as gcc being more selective about what it likes to compile and treated these cases. The main glitch I found was in the @F15/../gtk/gtkmain.h header there is a prototype which takes a "GModule *module" argument. This means I had to include gmodule.h before gtk.h to get that type in scope otherwise things failed to compile.

The Fedora 15 packages have support for printing with Qt, so one can echo foo > printer://default/ or "cp" an image to the printer. Scanning support will pop up in a future build once the few tiny changes to ksane I sent in trickle through to the shipped KDE release in Fedora.

Perhaps soon I'll make Phonon available as a filesystem too so one can cp an ogg file to sound://default to play it. Seeking will be interesting, as it's not normally a common case to start copying a file at a given offset. But that smells like a seek during playback to me. I'm looking forward to making a QML audio player, with index-search and playback already filesystems the heavy lifting will have already been done :)

Monday, May 2, 2011

Searching again (400ms -> 2ms)

The problem statement: you want to search for a file using any arbitrary substring contained in the file url. This is deceptively hard, if you think in a relational db way, you want something "like %foo%" where the leading percent will force indexes to be considered redundant and a sequential scan to ensue. If you try instead to break up the file's URL into words and use a full text indexing solution you will find your old friend the leading percent again, like here for Lucene.

One easy way to use full text indexing style and allow substrings is to cheat. First disable word stemming (ie, dont turn diving and dive into a stemmed "dive" word). Then for each word to be added to the full text index, shift it left from 1..10 times. So wonderful, onderful, nderful, derful, erful etc are all added as words for the same URL. This is most simply accomplished using a postgresql function to do the shifting.

Then when you search for a substring foo you want to find to_tsquery('simple,'foo:*') as a fulltext query on your url column. As prefix searches are allowed in fulltext index engines like Lucene and postgresql's TSearch2 engine, this evaluates quite quickly.

For 200,000 URLs when using the raw regex match "~" in postgresql you might see 400ms evaluation times, the same data using a gin index on to_tsvector('simple',fnshiftstring(url)) might return in 2-3ms. Of course the index can't always handle your regular expression, but if you can tease out substrings which must be present from your regular expression, a shifted gin index could drop evaluation times for you. eg, .*[Ff]oob([0-9]... could use a shifted search for "oob" as a prefilter to full regular expression evaluation.

Getting down to a few ms evaluation allows GUIs to return some sample results as you type in your regular expression. This speed up will be available in 1.5.6+ of libferris.

A while ago I released an engine for maemo with statistical spatial indexing for regular expression evaluation. It's an interesting problem IMHO, and its also a very common one.

Friday, April 29, 2011

Everything is a file? Spitting it out again...

After recently adding support for scanning through a virtual filesystem interface in libferris I thought I'd complete the cycle and add printer-as-filesystem too. This of course allows the neat trick to "copy" a document from your scanner and dumping it to a printer:

$ ferriscp scanner:///my-scanner/color/scan.png printer:///my-friends-printer/

Naturally the printer and scanner don't have to be in the same room etc. As you can see from the above, writing a picture to a file inside the printer's virtual directory prints that image to paper. By default images are smooth stretched to fit the entire page, not a huge issue if you scan and print at the same DPI.

Images can come from anywhere, to hard copy a webcam
$ ferriscp gstreamer://capture/webcam.jpg printer:///Ich-bin-ein-drucker/

If on the other hand you write a plain text file to the printer's directory it will be printed in an acceptable manner too. So something like this works

$ date | ferris-redirect -T printer://Cups-PDF/foo

In this case, you can expect a ~/Desktop/foo.pdf file which is a PDF document with the current date in it. One can think if ferris-redirect as the shell "|" but which natively knows about all the libferris filesystems. On the other hand, you could use FUSE to mount printer:// at some normal kernel location and just do
$ date >| ~/printer/Cups-PDF/foo

Of course, none of this is meant to replace programs which let you tweak how images are printed, format how text will appear, or select from the myriad of options your printer offers you. But if you want a piece of paper from a scanner to a printer, such a "copy" might be just the ticket.

Tuesday, April 26, 2011

Blocking a nostril

I have been meaning to add sane support to libferris since I upgraded my scanner to something modern. I now also have an Automatic Document Feeder (ADF) so naturally that needs to be a filesystem too. A quick example of how this might be useful, to turn a document into something on screen quickly:

$ fcat sane:///my-scanner/color/scan.jpg | okular -

The trees under sane:// allow preconfigured scanning units and discovery. If you have setup my-scanner then no device listing is performed with sane, a good thing because discovery can be slow. When you configure a scanner you create one or more directories with specific configurations of how you want the scan to proceed. Incidentally, configuration is of course done with a filesystem. Inside ~/.ferris/sane you create a directory for your scanner, plomp the device name into scanner/device and create directories with scanning options for that scanner.

As well as scan.jpg there is a directory "adf" which chomps new things from the ADF for each file you read. Having this directory allows you to grab 20 documents by a normal looking posix like command such as:

$ ferriscp -av sane:///adf/gray-100/adf /tmp/

in this case, I made two trees, color-N and gray-N which scan in color or grayscale at a given DPI (100,300,600 etc). So I can get most of what I want by picking the right directory. "adf" is a softlink to my scanner (a local link in ~/.ferris/sane from adf to "my-scanner"). Using links like this means I can replace the hardware and scripts can still just ask for "the adf" scanner and get something.

One might wonder what happens for a scanner which is not known to the system. The ".discovery" configuration inside ~/.ferris/sane provides defaults for such devices, so you can have your color/gray DPI directories magically appear for discovered scanners too. Of course, most of the gray directories are similar, so softlinks are your friend except for the "resolution" file which you want to change per directory.

Discovery isn't performed for the above commands, because libferris uses short cut directory loading as it knows explicitly how to make the virtual filesystems for preconfigured scanners.

Some hints on configuration will be in dot-ferris/sane/dot-discovered which will be in libferris 1.5.6+ along with the code to actually make it happen.

Sunday, April 24, 2011

Change Tracking: Why?

As some of my past posts have mentioned, the OASIS group is currently working out how it wishes to extend support for change tracking in ODF. The change tracking feature allows you to have an office application remember what changes you have made and associate them with one or more revisions. There are many examples where governments might want such traceability, but small businesses are likely to find this functionality valuable too.

Late last year there was an ODF plugfest held in Brussels where it was decided that an Advanced Document Collaboration subcommittee should be formed to work out how to serialize tracked changes into ODF.

There are currently two proposals. One which is generic and tracks how the XML of the ODF is modified over time, and an extension to the limited change tracking that already exists in ODF 1.2.

As my previous posts have mentioned, I'm involved in making abiword able to produce and consume ODF files which contain tracked changes in the format of the generic proposal.
See my github for abiword and its change tracking test suite. Ganesh Paramasivam has been working to make Calligra and KOffice support the generic proposal too, and Jos van den Oever has done some hacking to enrich WebODF for change tracking.

As I have been going along implementing things in abiword, I've been clarifying some points on the oasis office-collab mailing list, and participating in some of the general conversation there too.

Some of the differences between the two proposals can have a large impact to the complexity of implementing change tracking in applications. Folks who are involved in office applications which use ODF might like to read up on the proposals and have their say on the future of this feature before a decision is made for you.

I think it is important for ODT to include a good, complete change tracking specification because it would offer all people the ability to collaborate on documents and businesses the change to know the exact extent of modifications made to documents by each party.