Dr. MonkeyIQ: Keeping an index up to date... quickly?

Wednesday, February 11, 2009

Keeping an index up to date... quickly?

Most of the index engine implementations in libferris detect when you try to add a file that is already indexed, and skip reindexing it if it hasn't changed in a meaningful way. This makes it really easy to update the index for a specific filesystem with the command below.

$ find /Data | feaindexadd --filelist-stdin

The catch comes when /Data is an NFS share with 400,000 files on it that you are accessing from a Nokia N810, or when /Data is on a file server that you are indexing from your laptop over wifi or another sluggish, high-latency network.

feaindexadd can be told to traverse one or more directories directly, so you don't have to use find in the above command. But separating the find from the indexing has a really big advantage: you can update indexes of extremely large but infrequently changing NFS shares very quickly, even over slow networks.

The trick is to do the filesystem traversal on the server side, and pump only the interesting URLs (here, the recently modified files) to the client machine:

ssh lowaccess@server 'find /Data -mtime -10' \
| feaindexadd --filelist-stdin

Of course, this relies on /Data being the same filesystem, at the same path, on both the server and the client. Otherwise you're in for some fun with sed or awk to mangle the paths into what the client expects.
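
As a minimal sketch of that path mangling, suppose the server sees the tree as /export/Data while the client mounts it at /Data. Both paths and the helper name remap_prefix are hypothetical, made up for illustration; only the ssh, find and feaindexadd parts come from the command above.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a leading server-side path prefix into the
# prefix the client uses. Reads paths on stdin, writes them to stdout.
#   $1 = prefix as seen on the server (example: /export/Data)
#   $2 = prefix as seen on the client (example: /Data)
# Assumes neither prefix contains '|' or regex metacharacters.
remap_prefix() {
    # '|' is the sed delimiter so the slashes in the paths need no escaping.
    # First expression handles entries below the prefix, second the prefix itself.
    sed -e "s|^$1/|$2/|" -e "s|^$1\$|$2|"
}

# Intended use (not run here):
#   ssh lowaccess@server 'find /export/Data -mtime -10' \
#     | remap_prefix /export/Data /Data \
#     | feaindexadd --filelist-stdin
```

Paths that don't start with the server prefix pass through unchanged, so a wrong prefix shows up as unexpected paths on the client side instead of silently dropped lines.
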
The 10 in -mtime -10 limits find to files modified within the last 10 days, so you'll have to run the command from cron at least once every 10 days to maintain a complete index. You might also find that doing a "time find /Data" on the client and on the server shows significant performance differences, particularly if the filesystem has many files.
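
For the cron side, a wrapper along these lines might do. It is only a sketch: the function name update_index, the script path and the nightly schedule in the comment are assumptions for illustration, and the only real pieces are the ssh and feaindexadd invocations from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the ssh + feaindexadd pipeline.
#   $1 = look-back window in days (defaults to 10, as in the command above)
update_index() {
    days=${1:-10}
    # The server walks /Data; only paths modified inside the window
    # are sent across the network to the local indexer.
    ssh lowaccess@server "find /Data -mtime -$days" \
        | feaindexadd --filelist-stdin
}

# To use from cron, save this as a script whose last line calls
#   update_index 10
# and add a crontab entry such as (assumed path and schedule):
#   0 3 * * * /home/user/bin/update-index.sh
```

Running it nightly with a 10-day window sends the same paths more than once, but that overlap is cheap when the index engine skips files that haven't changed, and it leaves some slack if the device misses a few nights.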

You can always store and search the index on the file server, but for disconnected searching you really need to have the index itself stored on the Maemo device.
