Thursday, March 8, 2012

Libferris on both arms

I now have libferris on a 512mb ARM5t embedded device and a 1gb arm7 one (the Nokia n9). As part of the fun and games I updated the clucene and boostmmap index modules for libferris and created a QtSQL module which uses SQLite by default. Hey, when you can't choose which cake to have why not have a slice of them all... or a slice to fit the moment or device at hand. eg, desktop or server works nicely with the soprano or postgresql indexing modules, embedded is nicer on mmap or clucene.

I find it ironic that I never really thought much of "embedded" devices when hacking libferris. But devices with 512/1024mb of RAM are really not so much embedded I guess.

A few things I've tried to do in the design of libferris index+search is to design for many machines and also federations of them. It is possible to search a router, phone, and desktop's individual indexes as a federation from the laptop. Another thing that helps is that indexing is routed through the "findexadd" commands. So you can use find and split to break up indexing activity, and have it done from cron when you want.
The new --total-files-to-index-per-run option works in combination, causing findexadd to exit when it has indexed a given number of files. Note that if a file has not changed since it was last indexed it is not indexed again (no need), so that file does not count toward the
total-files-to-index-per-run tally.

The below is a little script to incrementally index just selected metadata from /usr and your home directory using clucene. The WHITELIST environment variable stops libferris from trying to sniff up metadata for files and has it only look for and add what metadata you want. If you have md5 in there then libferris will store the checksum for each file, at a commensurate cost in IO. Splitting into batches of 5000 prevents the process running too long and wanting too much RAM.

$ mkdir -p ~/.ferris/ea-index
$ cd ~/.ferris/ea-index
$ fcreate --create-type eaindexclucene .
$ vi update.sh
#!/bin/bash

TMPDIR=~/tmp
EAIDXPATH=~/.ferris/ea-index

cd $TMPDIR
find /usr | split -l 5000 - usr.split.
find ~ | split -l 5000 - home.split.

export LIBFERRIS_EAINDEX_EXPLICIT_WHITELIST=name,size,mtime,mtime-display,atime,ctime,user-owner-name,group-owner-name,user-owner-number,group-owner-number,inode
echo "whitelist: $LIBFERRIS_EAINDEX_EXPLICIT_WHITELIST"
cd $EAIDXPATH
rm -f write.lock
for if in $TMPDIR/*split.*
do
echo "Processing $if"
cat $if | feaindexadd -P `pwd` -1
done



Ironically the arm5 has given me much less trouble overall. One issue seems to be with gcc-4.4x on the n9. Charming little errors like my old friend the undefined __sync_val_compare_and_swap_4 which stops memory mapped boost data structures from working properly and also leaves the clucene-core-2.3.3.4 build laying on the side of the road bleeding. I've hacked the clucene code to get around the atomic errors, but then seem to have found that search results are not accurate. I guess my quick hack there was just bad^TM. Especially since the arm5 produces the right results using the virgin clucene codebase.

I've been trying to convince gcc 4.5 and 4.6 to build for me so I can use the updated compiler to generate a proper and working clucene for the device. I seem to run into little build issues after time consuming rebuilds. (uses VFP register arguments, yay). Once I stop compiling compilers then maybe I can get my favourite indexing code on the n9.

No comments: