Saturday, October 31, 2009

White lightning in triplicate

Recently I started hacking on a memory-mapped, multi_index Soprano backend. Adding triples and using listStatements() should work fine, but implementing SPARQL is making for interesting times.

I started out allowing a single triple match with a filter(regex()) to restrict results. This worked rather well, making the first one free, as they say. So, noticing the little white rabbit that seemed to disappear into the SPARQL bushes, I decided to join in the high tea and mercury sniffing that so induces sanity. Over the course of versions 0.0.1 to 0.0.5 the SPARQL code has been getting better, little by little. The code is up at my sf.net page. But don't blame me if your SPARQL is not implemented yet or your triples somehow disappear.
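To make "a single triple match with a filter(regex())" concrete, here is a minimal Python sketch of the idea. It is not the boostmmap code: the in-memory list, the shortened `ex:` identifiers, and the function names are all assumptions for illustration, and it scans linearly rather than using an index.

```python
import re

# Hypothetical in-memory triples; the real backend keeps them in a memory-mapped file.
TRIPLES = [
    ("ex:feature295", "rdfs:label", "yawned excites deflower"),
    ("ex:feature3276", "rdfs:label", "goofs excites enigmata"),
    ("ex:feature1", "rdfs:label", "plain label"),
    ("ex:feature1", "rdf:type", "ex:ProductFeature"),
]

def match_pattern(triples, s=None, p=None, o=None):
    """Yield triples matching one pattern; None plays the role of a variable."""
    for t in triples:
        if all(want is None or want == got for want, got in zip((s, p, o), t)):
            yield t

def query_label_regex(triples, pattern):
    """Roughly: select ?what ?lab where { ?what rdfs:label ?lab .
    filter(regex(str(?lab), pattern)) }"""
    rx = re.compile(pattern)
    return [(s, o) for s, _, o in match_pattern(triples, p="rdfs:label")
            if rx.search(o)]

results = query_label_regex(TRIPLES, "excites")
for what, lab in results:
    print(what, "->", lab)
```

The filter runs only on the rows the triple pattern has already selected, which is why this simple case is cheap when the pattern itself is answered from an index.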

Anyway, here is a little benchmark session. I'm using the data set generator and queries found here. To make the data I use


$ cd /usr/local/java/bsbmtools
$ cat run.sh
#!/bin/bash
java -cp bin:lib/ssj.jar benchmark.generator.Generator "$@"
$ ./run.sh -fc -pc 1000 -s nt
$ mv dataset.nt thousand-prods.nt
$ mkdir -p /tmp/RDFBENCH
$ cd /tmp/RDFBENCH
$ mkdir mmap redland


Queries are run multiple times to ensure a hot disk cache. This is on a 3-disk RAID-5 and an Intel Q6600 with 8 GB of RAM.
The last query is not optimized properly in boostmmap yet, so it's far slower than it rightly should be. For benchmarking the boostmmap backend...


$ cd /tmp/RDFBENCH/mmap
$ time sopranocmd --backend boostmmap \
--serialization ntriples \
import /usr/local/java/bsbmtools/thousand-prods.nt >|out 2>&1

real 1m49.642s
210M triples.mmap*

$ time sopranocmd \
--backend boostmmap \
list "" '<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>' \
'<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/product>' \
>| /tmp/out 2>&1

real 0m0.103s
$ grep Product /tmp/out | wc -l
1001

## based on Query 6
$ time sopranocmd \
--backend boostmmap query \
"
select ?what ?lab
where
{
?what <http://www.w3.org/2000/01/rdf-schema#label> ?lab .
filter( regex( str( ?lab ), 'excites' ))
}"
?lab -> <yawned%20excites%20deflower>;
?what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature295>
?lab -> <goofs%20excites%20enigmata>;
?what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature3276>

real 0m0.091s


$ time sopranocmd --backend boostmmap query \
"
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
select ?offer ?price
where {
?offer bsbm:product <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer1/Product5> .
?offer bsbm:vendor ?vendor .
?vendor bsbm:country <http://downlode.org/rdf/iso-3166/countries#ES> .
?offer dc:publisher ?vendor .
?offer bsbm:price ?price .
}"
0.93sec


Note that this 0.9 seconds is shameful and needs to be optimized back to <0.1 sec.

For redland,

$ cd /tmp/RDFBENCH/redland
$ time sopranocmd --backend redland \
--serialization ntriples \
import /usr/local/java/bsbmtools/thousand-prods.nt \
>|/tmp/out 2>&1

real 38m34.735s
480mb

$ time sopranocmd --backend redland \
list "" \
'<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>' \
'<http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/product>' \
>| /tmp/out 2>&1

real 0m0.096s
$ grep Product /tmp/out | wc -l
1000


So for plain listStatements() calls, redland and mmap are fairly equal in performance, which you might expect for a single indexed lookup. In libferris I had restricted RDF usage to raw triple probes like this, because prior to version 1.4.x libferris used redland directly.
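For readers wondering what "a single indexed lookup" means here, the sketch below shows the general shape in Python. It is only an illustration: the `TripleIndex` class, its hash indexes, and the shortened identifiers are assumptions, and neither backend is claimed to be laid out this way.

```python
from collections import defaultdict

class TripleIndex:
    """Toy multi-index triple store: the same triples, reachable via several keys."""

    def __init__(self):
        self.all = []                     # every triple, in insertion order
        self.by_s = defaultdict(list)     # subject -> triples
        self.by_po = defaultdict(list)    # (predicate, object) -> triples

    def add(self, s, p, o):
        t = (s, p, o)
        self.all.append(t)
        self.by_s[s].append(t)
        self.by_po[(p, o)].append(t)

    def list_statements(self, s=None, p=None, o=None):
        """Pick the most specific index available, then check the remaining terms."""
        if p is not None and o is not None:
            candidates = self.by_po.get((p, o), [])
        elif s is not None:
            candidates = self.by_s.get(s, [])
        else:
            candidates = self.all         # no usable index: full scan
        return [t for t in candidates
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

idx = TripleIndex()
idx.add("ex:Product1", "rdf:type", "bsbm:Product")
idx.add("ex:Product2", "rdf:type", "bsbm:Product")
idx.add("ex:Product1", "rdfs:label", "some label")
products = idx.list_statements(p="rdf:type", o="bsbm:Product")
print(len(products))
```

A probe with predicate and object bound goes straight to one index bucket, so its cost depends on the number of matches rather than the size of the store.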

So for SPARQL,

## based on Query 6
$ time sopranocmd --backend redland query \
"
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?what ?lab
where
{
?what rdfs:label ?lab .
filter( regex( str( ?lab ), 'excites' ))
}"
what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature295>;
lab -> "yawned excites deflower"
what -> <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/productfeature3276>;
lab -> "goofs excites enigmata"
real 0m3.855s


Gah, and no, I didn't slip up: the 3 really is on the left side of the decimal point there. We are talking about roughly 0.09 seconds for boostmmap against 3.86 seconds for redland.


$ time sopranocmd --backend redland query \
"
prefix bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix dc: <http://purl.org/dc/elements/1.1/>
select ?offer ?price
where {
?offer bsbm:product <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/instances/dataFromProducer1/Product5> .
?offer bsbm:price ?price .
?offer bsbm:vendor ?vendor .
?offer dc:publisher ?vendor .
?vendor bsbm:country <http://downlode.org/rdf/iso-3166/countries#ES> .
}"
real 0m7.134s


Since this query isn't optimized on boostmmap yet, the gap here is only about 1 second against 7. But I think I can resolve it in much less than 1 second. This is not meant to make redland look bad; its SPARQL implementation is much more complete than boostmmap's will likely be any time soon. Creating an optimal query plan for the full SPARQL language will be an interesting challenge.
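As a rough illustration of what query planning involves for a multi-pattern query like the one above, here is a Python sketch of one common greedy heuristic: evaluate next whichever triple pattern has the most constants or already-bound variables. This is an assumption about one possible approach, not how boostmmap or redland actually plan queries, and the function names are made up.

```python
def is_var(term):
    """SPARQL variables start with '?'; everything else is treated as a constant."""
    return term.startswith("?")

def order_patterns(patterns):
    """Greedy join order: always pick the pattern with the most bound terms next."""
    remaining = list(patterns)
    bound = set()   # variables that earlier patterns in the plan will have bound
    plan = []
    while remaining:
        def score(pat):
            return sum(1 for term in pat if not is_var(term) or term in bound)
        best = max(remaining, key=score)   # ties go to the earliest pattern
        remaining.remove(best)
        plan.append(best)
        bound.update(term for term in best if is_var(term))
    return plan

# The offer/price query, written with its patterns deliberately shuffled.
patterns = [
    ("?offer", "bsbm:price", "?price"),
    ("?vendor", "bsbm:country", "<countries#ES>"),
    ("?offer", "bsbm:vendor", "?vendor"),
    ("?offer", "dc:publisher", "?vendor"),
    ("?offer", "bsbm:product", "<Product5>"),
]
plan = order_patterns(patterns)
for step in plan:
    print(step)
```

Counting bound terms ignores how selective each constant really is (one product versus every vendor in a country), which is part of why a genuinely optimal plan needs statistics about the data as well.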

Development might be bursty as I don't know what time I can spare for improving the SPARQL completeness in the short term.

4 comments:

monkeyiq said...

Version 0.0.6 will move the 1 second boostmmap query to about 0.1 seconds. It was in fact not hitting an index when it should have...

HPC said...

Interesting, although I understand only half of it. :)
How do you make it persistent?

monkeyiq said...

See mmap() and msync(). I would only recommend using it when:
data set size < RAM size
and preferably where you are performing many queries in succession. For embedded devices or "the maemo" it matters a bit less because disk seek time is low.
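For anyone unfamiliar with the mmap()/msync() pattern mentioned above, a minimal Python sketch follows. It uses Python's standard mmap module, whose flush() calls msync() on Unix. The file size, the single counter stored at offset 0, and the temporary path are assumptions for illustration and say nothing about the actual boostmmap file layout.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical backing file in a scratch directory.
path = os.path.join(tempfile.mkdtemp(), "triples.mmap")
SIZE = 4096

# Create the file at its full size before mapping it.
with open(path, "wb") as f:
    f.truncate(SIZE)

# Map it, write through the mapping, then flush the dirty pages to disk.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as m:
        struct.pack_into("<I", m, 0, 42)   # e.g. a triple count at offset 0
        m.flush()                          # msync() on Unix

# A later run maps the same file again and sees the stored value.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as m:
        (count,) = struct.unpack_from("<I", m, 0)

print(count)
```

Persistence comes from the file itself: the data structures live in the mapped pages, so reopening the mapping brings them back without a separate load step.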

Marek said...

Hi I have something absolutely offtopic regarding http://www.linux.com/news/hardware/servers/8222-benchmarking-hardware-raid-vs-linux-kernel-software-raid

I'm building 4x1TB RAID 5 now and want to format it as XFS. I ran into your benchmark, which btw is very nice, and saw that xfs with aligned strides and chunks provides better performance. Could you please tell me how to do that? Is it just setup raid with eg 4KB and then xfs with same size? Thanks