Saturday, February 28, 2009

Tracker: Reading it both ways :(

Normally I refrain from criticising other projects that are "competing" with my little hobby -- libferris. I put competing in quotes because hobbies don't really compete in a commercial sense. But having recently read this post on Planet Maemo about Tracker's progress, I was a little overwhelmed, wondering: was the Tracker code really so bad a year ago?

"In this last year, we refactor (well, almost rewrote) the daemon"

The post also talks about replacing the crawler code wholesale. Sure, progress sometimes includes churn of functionality: reimplementing stuff with the benefit of 20/20 hindsight. But this seemed a little dramatic...

Slow as a tree

Hmm, Nepomuk integration as the core of Amarok's collection manager... now if only FLACs without seek tables were not so inconvenient in Amarok, it would be damn cool.

http://lists.kde.org/?l=amarok&m=120671838014374&q=p3
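
If you control the files yourself, one workaround is to retrofit a seek table. A minimal sketch, assuming the metaflac tool from the flac package is installed; song.flac is just a placeholder:

$ metaflac --add-seekpoint=100x song.flac
$ metaflac --list --block-type=SEEKTABLE song.flac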

Finally, it seems RDF is making its merry way into the core of (meta)data sharing on the desktop. A good time to be a semantic hacker.

Wednesday, February 11, 2009

Keeping an index up to date... quickly?

Most of the index engine implementations in libferris will detect when you try to add the same file again and, if it hasn't changed in a meaningful way, will just skip reindexing it. This makes it really easy to use the below command to update the index for a specific filesystem.

$ find /Data | feaindexadd --filelist-stdin

The trick comes in when /Data is an NFS share with 400,000 files on it that you are accessing from a Nokia N810. Or when /Data is a file server that you are indexing from your laptop over wifi or another sluggish, higher-latency network.

feaindexadd can be told to directly traverse one or more directories, so you don't have to use find in the above command. But separating the find from the indexing has a really big advantage: you can update indexes of extremely large but infrequently changing NFS shares very quickly -- even over slow networks.

The trick is to do the filesystem traversal on the server side, and just pump the URLs that are interesting to the client machine:

$ ssh lowaccess@server 'find /Data -mtime -10' \
    | feaindexadd --filelist-stdin

Of course, this relies on /Data being mounted at the same path on both the server and the client. Otherwise you're in for some fun with sed or awk to mangle the paths into what the client expects.
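
For example, say the server sees the share as /srv/data while the client mounts it as /Data (hypothetical paths); a sed rewrite in the pipeline sorts it out:

$ ssh lowaccess@server 'find /srv/data -mtime -10' \
    | sed 's|^/srv/data|/Data|' \
    | feaindexadd --filelist-stdin
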
And the -mtime -10 in the above means that you'll have to run the command from cron at least once every 10 days to maintain a complete index. You might find that running "time find /Data" on the client and on the server shows a significant performance difference, particularly if the filesystem has many files.
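
A weekly cron job keeps you comfortably inside that 10 day window. A sketch of a crontab entry, assuming passwordless ssh keys are set up for the cron user; the time and paths are just illustrative:

# m h dom mon dow  command
0 3 * * 0  ssh lowaccess@server 'find /Data -mtime -10' | feaindexadd --filelist-stdin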

You can always store and search the index from the file server, but for disconnected searching, you really need to have the index itself stored on the Maemo device.
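
One way to get it there is to build the index on a beefier machine and copy it across before you disconnect. A sketch using rsync, assuming the index lives under ~/.ferris and that n810 is an ssh alias for the device -- both are assumptions, so adjust for wherever your index engine actually keeps its files:

$ rsync -az --delete ~/.ferris/ n810:.ferris/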