Thursday, May 2, 2013

Filesystem Indexing: Taking the reins

To index data on a small ARM CPU without much RAM, you might like to break the indexing run down into many parts and get more explicit control over what is happening. The script below indexes all files under /DATA-PATH in batches of 5000 files at a time with libferris. It uses whatever index plugin you have set up for ~/.ferris/ea-index (the default metadata index), be that PostgreSQL, SQLite, boost memory mapped files, clucene or whatever.

I'm currently racing the boost memory mapped index against the SQLite backed index on simple URL queries against the filesystem. This is being done on roughly 2 GHz ARM machines with either 512 MB or 2048 MB of RAM. The boostmmap plugin is of my own design and contains some smarts for executing regular expression matches against unanchored strings (.*foo.*). Unfortunately the boostmmap plugin is not as smart as it could be regarding scattered updates, transactions, and journaling, which slows it down a bit in the index creation phase relative to the SQLite plugin.

Below is a skeleton bash script to get started adding files. Another option is to ssh into the remote host and run find(1) there, which can be much faster over network filesystems. The whitelist environment variable overrides which metadata libferris will index. If your index indicates it wants sha and md5 digests, calculating those can dominate indexing time. An explicit whitelist keeps index adding times down, with the obvious side effect of limiting what you can use in your queries. A metadata list as limited as the one below brings the index closer to what locate provides.


rm -rf /tmp/fidxtmp
mkdir -p /tmp/fidxtmp
cd /tmp/fidxtmp

find /DATA-PATH | split -l 5000


for batch in x*
do
   echo "adding $batch..."
   cat "$batch" | feaindexadd -v -1 >>/tmp/ferris-index-progress.txt
done
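
On a slow machine a long run may well get interrupted, so here is a minimal sketch of a resumable variant of the loop above. The function name index_batches and the done/ subdirectory are my own inventions for illustration, not anything libferris provides. It assumes the batch files were made by split as above, and that the indexer exits non-zero when it fails. The indexer command can be swapped out as extra arguments, which is handy for a dry run.

```shell
# Hypothetical helper: feed each split batch in a directory to an indexer,
# moving finished batches into done/ so a re-run only sees pending ones.
# Usage: index_batches DIR [INDEXER-COMMAND ...]
index_batches() {
    local dir=$1 batch
    shift
    # With no explicit command, fall back to the indexer used in the post.
    [ "$#" -gt 0 ] || set -- feaindexadd -v -1
    mkdir -p "$dir/done" || return 1
    for batch in "$dir"/x*; do
        [ -f "$batch" ] || continue          # glob matched nothing
        echo "adding $batch..."
        # Stop at the first failure; the failed batch stays pending.
        "$@" <"$batch" || return 1
        mv "$batch" "$dir/done/" || return 1
    done
}
```

Something like index_batches /tmp/fidxtmp >>/tmp/ferris-index-progress.txt would then stand in for the loop, and running it again after an interruption carries on with whatever batches are still sitting in /tmp/fidxtmp.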

Then you can find all your PDF files, for example, using the following:

feaindexquery -Z '(url=~pdf)'

The -Z tells libferris not to lstat() or resolve URLs to check whether they currently exist. Results come back much faster, at the cost of not weeding out things which might have moved since they were last indexed.

And all the files which have been written this year:

feaindexquery -Z '(mtime>=begin this year)'

It is unfortunate that the quotes are needed, but bash wants to do things with bare parentheses.
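
To see what bash does with the unquoted form, here is a small check that runs without libferris installed. show_args is a made-up stand-in for feaindexquery that just prints each argument it receives in brackets. Besides the parentheses, note that an unquoted > in a query would be read as an output redirection.

```shell
# Hypothetical stand-in for feaindexquery: print each argument in brackets.
show_args() { printf '[%s]\n' "$@"; }

# Single quotes hand the whole query over as one untouched argument.
show_args '(mtime>=begin this year)'

# Unquoted, bash rejects the line as a syntax error before running anything.
# A child shell is used so the error does not abort this script.
bash -c 'show_args (mtime>=begin this year)' 2>/dev/null ||
    echo "unquoted form rejected by bash"
```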

Save Ferris! Or just donate to an open source project or organization of your choice if you like the ferris posts.
