C++, Linux, libferris and embedded development. Yet another blog from yet another NARG.
Friday, December 14, 2012
Cross platform package building..?
I've had great success with OBS, but that was mainly for Linux packages. It seems OBS can do mingw too, but I've not walked the valley of darkness into building for the more closed platforms on OBS before.
Sauce Labs looks pretty cool, but it seems targeted at web code, if I am reading it correctly.
The initial plan is to get 24hr rolling packages for all platforms and have feedback as to which day a github commit has broken the package build. It might be nice to have it for each github commit, but I think it would be easy enough to bisect a break given a 24 hour window unless an armada of contributors rushes at the ship.
A separate build issue I've been tinkering with in my mind for a while is grabbing from a github repo and creating Android packages. This is for a different code base though, mainly some of my N9 apps, so preferably for a mixed C++/QML app. But I think for that project I'll wind up taking my chisel and hammer and coming back with a cron job.
Sunday, November 25, 2012
KDE, Qt, and the segv
The crashes offered up g_iterator type things in the backtrace. Nothing juicy that I could try to work backward from. Digging around seemed to show theme stuff, pointing at the nvidia drivers (which always come up in searches), and there was a bug I found along the way mentioning Qt 4.8.4, so I reverted to 4.8.3 from a mirror that still had the old packages around.
I decided to update the laptop before a trip on Thursday, and this was of course when I discovered this little bundle of joy. I'm not sure which part of my config was causing this, but at least I can now use the laptop with my desktop of choice while away!
Wednesday, November 21, 2012
Libferris as a KIO Slave
The following will create a Berkeley db4 file from the command line and show it to you. To dig into such files with libferris you can just read the file directly. So in this case I just grab the "base" directory in this db4 file with konq. You should see the sample and file2 files in that directory view and be able to load and save those "files" into kwrite. Sorry about the video being a tad jumpy, I have to work out what part of my desktop is causing that :/
Using libferris through the KIO interface from Ben Martin on Vimeo.
The commands as plain text:
$ cd /tmp
$ db45_load -T -t btree foo.db
base/sample
value here
base/file2
contents
$ db_dump -p foo.db
$ konqueror ferris:/tmp/foo.db/base
A few little tricks I found during the hacking: your listDir() method might call listEntry( e, false ) for each entry, with a final call using ( e, true ). The last call is impotent with regard to "e" and is just a finalizer call. The get() method uses data() to deliver bytes to the KIO user (application). The put() method uses dataReq() in a loop to grab chunks from the KIO user. Currently I have lazy methods; for example, the put() just grabs everything from the KIO user and then operates on the data once it has everything. Really bad for 4GB files, but for smallish files, to get a feel for things, it works quite well.
Also, if you are using library init (in my case, using gmodule to dynamically load some plugins from libferris itself) you might be in for a world of fun and games. Currently I spawn processes and interact with them from the KIO slave to get around this issue. I imagine a more efficient implementation would be able to tell KDE to load my KIO Slave as an application with a normal full init leading to a main(), but I haven't looked at that little technicality yet.
I've mostly been tinkering with kioclient, konq, and kwrite on libferris files at the moment. Things are turning out well, though there are still many glitches to this early days integration. This will be released in the next libferris version once I clean it up a bit more.
Monday, November 19, 2012
Mounting KIO with libferris
The groundwork is very similar to what I'm thinking of using to allow kio to mount libferris. Top level URL schemes will appear first, allowing you to dig into each URL scheme. For example, in libferris kio appears under the kio: filesystem. The first directory in that filesystem is the KIO URL scheme to use. So something like the following will work:
$ date >| /tmp/df.txt
$ fcat kio:file:/tmp/df.txt
Some of the more fun KIO slaves to access through this are the man and fonts URLs. The following two commands produce the same man page:
$ fcat kio:man:/man
$ kioclient cat man:/man
Support is preliminary and only allows reading the files, not writing to them, through the kio: filesystem as yet. Already though, the kio: filesystem can be exposed to XQuery, SQLite, and the libferris REST interface. Yay for cooperating!
I notice some really juicy digikam kio slaves but haven't dug into them enough to use them as yet. Although you can already upload to various web services from digikam, once I get access to digikamalbums through ferris kio mounting I can then 'cp' the images directly from the command line to other things that libferris can already mount such as flickr API sites, printers etc.
Sunday, October 14, 2012
Mini Xplus Hard Float
For an initial test I ran the following
time cjpeg -quality 90 img_01_l.pnm >| out.jpg
on a PNM conversion of a sample image from Nikon at
http://imaging.nikon.com/lineup/dslr/d3s/sample.htm
Three machines for cross comparison: an Intel 2600K, a 2GHz Kirkwood running Debian armel, and the MiniX running Linaro hard float on a 1GHz A10.
0m0.097s 2600K
0m1.236s 1ghz A10 MiniX HF
0m3.250s 2ghz kirkwood armel
So you can see a clear advantage for hard float; a chip at twice the clock rate needs 2.6x the time to perform the same task.
Friday, October 12, 2012
Mini Xplus
Under Android this thing ranges down to 2 watts with wifi up, around 3 under moderate use, and 4 watts with a reasonable browser load put on it (mjpeg streaming to Firefox).
My growing little "openssl speed" comparison now has another value point. For ciphers this A10 does about as well as an n9. For md5 shown below, it is the best cat in the camp by a good way:
The next tests will be the floating point and SIMD stuff, which is why I'm interested in the chip. One little discovery: the wifi seems to be very short range. The Nexus 7 is happy to connect to one of my APs, but the MiniX keeps dropping the connection to that same AP from the same room. Don't let the cute aerial fool you, it seems.
Monday, July 30, 2012
Off the Record Messaging: A Tutorial
A commonly used method for sending messages to others when you need authentication and privacy is to use an OpenPGP tool such as GNU Privacy Guard (GnuPG). For real time communications such as instant messaging, IRC, and socket IO, using Off The Record (OTR) messaging provides Perfect Forward Secrecy and secure identification of the remote party without the need for a web of trust.
In order to operate without a web of trust, libotr implements the Socialist Millionaires' Protocol (SMP). The SMP allows two parties to verify that they both know the same secret. The secret might be a passphrase or the answer to a private joke that the two people will easily know. The SMP operates fine in the presence of eavesdroppers (who don't get to learn the secret). Active communications tampering is not a problem either, though of course it might cause the protocol not to complete successfully.
Because the SMP doesn't rely on the fingerprint of the user's private key for authentication, the private key becomes almost an implementation detail. Once generated, the user generally doesn't need to know about the key or its fingerprint. The only time a user really cares to know is when a key is created, because a bit of entropy has to go into that process. Of course, an application should avoid regenerating keys for no reason, because each time the key is replaced the user has to use the SMP again to allow remote parties to authenticate them.
In this article I'll show you how to use the current release, libotr 3.2.0+, to provide OTR messaging. I'll present two examples, both in C++ and using the boost library for socket IO. I have gone this way so we can focus on the OTR action and not the details of sockets.
The first example does not use the Socialist Millionaires' Protocol (SMP). So the new_fingerprint() callback is essential to establishing a secure session. When not using the SMP, authentication is performed by comparing the fingerprints sent by those you wish to communicate with against known good values. These known values must be exchanged beforehand through a secure secondary channel, such as a face to face meeting. Once fingerprints have been accepted, subsequent OTR communications with the same party can be performed without explicit fingerprint verification.
The second example makes things simpler for the user by using the SMP for authentication of the remote party. This way, the information exchanged beforehand becomes shared experiences you and the other party have had such that a question can be raised that only you and they can easily answer.
A central abstraction in using the libotr library is the struct s_OtrlMessageAppOps vtable. This is used by libotr to call back into your code when something happens, such as a cryptographic fingerprint being received, or libotr wanting to send a message to the other end. The latter happens frequently during OTR session establishment.
If a program monitors its socket IO using select() or some other mainloop abstraction, then having these internal protocol messages sent is not so much of an issue. Alas, for the simple echo server I present, one must remember that there might be one or more internal OTR protocol messages sent from what seems like outside of the normal program flow. I'll get back to this point while describing the relevant section of the first example.
Many of the callback functions in s_OtrlMessageAppOps might be simple stubs, but you should be aware of inject_message(), which will be called when libotr itself wants to send something. The notify() and display_otr_message() callbacks can both provide feedback to the user. The new_fingerprint() method is called when a remote key is discovered, in order to allow you to inform the user and possibly abort the session. The gone_secure() method is called to allow you to inform the user that they are off the record. When you call libotr functions you supply both a pointer to a s_OtrlMessageAppOps structure uiops and a void* opdata. When libotr calls a method in uiops it will pass opdata back to you.
Another common three parameters you will pass to libotr functions are the accountname, protocol and sender or receiver name. The protocol string can be anything as long as both ends of the system use the same protocol string. The state data that libotr uses is stored in an OtrlUserState object which is created with otrl_userstate_create() and passed to many of the libotr functions along the way.
The code below loads a private key or creates a new one if none already exists. Because creating a new key is an entropy heavy operation, the setupKey() function warns the user that some erratic mouse and keyboard activity might move the process along a bit quicker. Note that the uiops has a create_privkey callback to generate a key if needed; I just prefer to make this codepath explicit and out of the main callback logic.
bool ok( gcry_error_t et )
{
    return gcry_err_code(et) == GPG_ERR_NO_ERROR;
}

void setupKey( const std::string& filename )
{
    gcry_error_t et;
    et = otrl_privkey_read( userstate, filename.c_str() );
    if( !ok(et) )
    {
        cerr << "can't find existing key, generating a new one!" << endl;
        cerr << "this needs a bunch of entropy from your machine... so please" << endl;
        cerr << "move the mouse around and slap some keys mindlessly for a while" << endl;
        cerr << "a message will be printed when keys have been made..." << endl;
        et = otrl_privkey_generate( userstate, filename.c_str(), accountname, protocol );
        if( !ok(et) )
        {
            cerr << "failed to write new key file at:" << filename << endl;
        }
        cerr << "Have keys!" << endl;
    }
}
The main.cpp program implements both the client and server. The server mode is selected by passing -s at startup. Firstly, a userstate is created, some variables set depending on if we are a client or server, and the correct private key is loaded or created.
OTRL_INIT;
userstate = otrl_userstate_create();
keyfile       = "client.key";
accountname   = "client";
recipientname = "server";
if( ServerMode )
{
    keyfile       = "server.key";
    accountname   = "server";
    recipientname = "client";
}
setupKey( keyfile );
The core logic for the echo client is to read a string from the user, send it to the server, grab a reply from the server and show it to the user. The start of the client code connects to a given port on localhost and reads a string from the user.
VMSG << "client mode..." << endl;
stringstream portss;
portss << Port;
iosockstream stream( "127.0.0.1", portss.str() );
if (!stream)
{
    cerr << "can't connect to server!" << endl;
    exit(1);
}
string s;
while( true )
{
    getline(cin,s);
    cerr << "your raw message:" << s << endl;
    cerr << "send plaintext:" << colorsend(s) << endl;
We certainly do not want to send the raw string s over the wire to the server though. That would very much be "on the record". So the next fragment of the client gets libotr to encrypt the string s so we can send it off the record to the server. The userstate is the value created during program initialization using otrl_userstate_create(). The ui_ops is the vtable s_OtrlMessageAppOps structure described above, and opdata is the value we want libotr to pass back to our methods in ui_ops when it uses them. In this case, we use the address of the iostream for the socket as the opdata so callbacks can send and receive data on the socket if they so desire. The newmessage will point to an off-the-record message that the server can decrypt to read the string s. The tests on the return value for message_sending() ensure that we have a new, encrypted off the record message to send instead of the plaintext s.
    void* opdata = &stream;
    OtrlTLV* tlvs = 0;
    gcry_error_t et;
    char* newmessage;

    et = otrl_message_sending( userstate, &ui_ops, opdata,
                               accountname, protocol, recipientname,
                               s.c_str(), tlvs, &newmessage,
                               myotr_add_appdata, &ui_ops );
    cerr << "encoded... ok:" << ok(et) << endl;
    if( !ok(et) )
    {
        cerr << "OTR message_sending() failed!" << endl;
    }
    if( ok(et) && !newmessage )
    {
        cerr << "There was no error, but an OTR message could not be made." << endl;
        cerr << "perhaps you need to run some key authentication first..." << endl;
    }
    if( newmessage )
    {
        VMSG << "have new OTR message:" << newmessage << endl;
        s = newmessage;
    }
Since we have replaced the plaintext s with the off the record version, we send that to the server using the socket iostream and then wait a moment before reading a response. The while loop is slightly hairy in that it will block for new messages if we are not secure. As I mentioned above, libotr can call the inject_message() callback to write a new off the record message to the socket. Outgoing messages will be generated and injected during session establishment. There is no incoming version of inject_message(), so the client needs to keep reading these injected messages before it tries to send another off the record message. One will find that many messages are exchanged between the libotr instances at each end when the string s is first written to the socket. This only happens the first time through, to set up the OTR protocol.
When reading messages from the server, the encrypted string is read and passed to otrl_message_receiving(). If the received message was an OTR message that was sent from the other end by libotr using inject_message() then otrl_message_receiving() will indicate to the client that it should simply ignore this message. Otherwise a real message was encrypted and sent by the server and so the client will show the user the decrypted newmessage.
    cerr << "WRITE:" << s << endl;
    stream << s << endl;
    usleep( 200 * 1000 );

    while( !secure && stream.peek() != std::iostream::traits_type::eof()
           || secure && stream.rdbuf()->available() )
    {
        s = "junk";
        VMSG << "reading data from server" << endl;
        getline(stream,s);
        VMSG << "READ:" << s << endl;

        int ignore_message = otrl_message_receiving(
            userstate, &ui_ops, opdata,
            accountname, protocol, recipientname,
            s.c_str(), &newmessage, &tlvs,
            myotr_add_appdata, &ui_ops );
        VMSG << "ignore:" << ignore_message << " newmsg:" << maybenull(newmessage) << endl;
        if( ignore_message )
        {
            VMSG << "libotr told us to ignore this message..." << endl;
            VMSG << "available:" << stream.rdbuf()->available() << endl;
            VMSG << " in_avail:" << stream.rdbuf()->in_avail() << endl;
            continue;
        }
        if( newmessage )
            s = newmessage;
        otrl_message_free( newmessage );
        cout << color( s ) << endl;
    }
Server mode is handled by a thread which executes server_session() using the std::iostream for the new socket.
if( ServerMode )
{
    VMSG << "server mode..." << endl;
    boost::asio::io_service io_service;
    tcp::acceptor a( io_service, tcp::endpoint( tcp::v4(), Port ));
    for (;;)
    {
        h_iosockstream stream(new iosockstream());
        a.accept( *(stream->rdbuf()) );
        boost::thread t(boost::bind(server_session, stream));
    }
}
The server implementation would look like the following if OTR messaging were not being used.
void server_session( h_iosockstream streamptr )
{
    iosockstream& stream = *(streamptr.get());
    while( stream )
    {
        std::string s;
        getline( stream, s );
        cout << "server got:" << s << endl;
        stream << s << endl;
    }
}
The OTR server implementation starts out the same way, reading a string from the socket. Then our old friend otrl_message_receiving() is called to decrypt that message. If ignore_message is set then there is nothing to be done and we simply continue to the top of the loop to read another string from the client. Also, if we are not yet secure, there is no point in trying to send a new OTR message back to the client, so we simply continue at the top of the while loop again. This way we avoid writing replies to the client when session establishment messages are sent by libotr on the client side.
This might seem a little strange at first: how will we ever become secure and start replying to the client if all we do is read from them and throw away the messages? The thing to keep in mind is that messages sent with inject_message() on the client will be seen by libotr when we call otrl_message_receiving(), which in turn might cause libotr on the server to inject_message() a reply to this session establishment message. Eventually libotr will call the gone_secure() OtrlMessageAppOps callback, in which we set the global variable secure to true, allowing the server to start replying to the client as it normally would.
void server_session( h_iosockstream streamptr )
{
    iosockstream& stream = *(streamptr.get());
    while( stream )
    {
        gcry_error_t et;
        std::string s;
        VMSG << "getting more data from the client..." << endl;
        getline( stream, s );
        VMSG << "READ:" << s << endl;

        void* opdata = &stream;
        OtrlTLV* tlvs = 0;
        char* newmessage = NULL;
        int ignore_message = otrl_message_receiving(
            userstate, &ui_ops, opdata,
            accountname, protocol, recipientname,
            s.c_str(), &newmessage, &tlvs,
            myotr_add_appdata, &ui_ops );
        VMSG << "ignore:" << ignore_message << " newmsg:" << maybenull(newmessage) << endl;
        if( newmessage )
            s = newmessage;
        otrl_message_free( newmessage );
        if( ignore_message )
        {
            VMSG << "libotr told us to ignore this message..." << endl;
            continue;
        }
        cout << "ignore:" << ignore_message << " server got:" << s << endl;
        cout << "message from client:" << color(s) << endl;

        // do not echo back messages when we are establishing the session
        if( !secure )
            continue;
The remainder of server_session() creates the echo reply message, encrypts it with otrl_message_sending() and sends the OTR message over the socket.
        static int count = 0;
        stringstream zz;
        zz << "back to you s:" << s << " count:" << count++;
        s = zz.str();
        cout << "writing...s:" << s << endl;
        cerr << "send plaintext:" << colorsend(s) << endl;

        et = otrl_message_sending( userstate, &ui_ops, opdata,
                                   accountname, protocol, recipientname,
                                   s.c_str(), tlvs, &newmessage,
                                   myotr_add_appdata, &ui_ops );
        if( !ok(et) )
        {
            cerr << "OTR message_sending() failed!" << endl;
        }
        if( ok(et) && !newmessage )
        {
            cerr << "There was no error, but an OTR message could not be made." << endl;
            cerr << "perhaps you need to run some key authentication first..." << endl;
        }
        if( newmessage )
        {
            VMSG << "have new OTR message:" << newmessage << endl;
            s = newmessage;
        }
        VMSG << "writing otr...s:" << s << endl;
        stream << s << endl;
As the security of the OTR messaging relies on fingerprints in the first example, the new_fingerprint callback presents our fingerprint and the remote fingerprint and asks the user if they want to continue to establish the session or not. Unfortunately this means the user has to compare the remote fingerprint by eye against an expected value they obtained from the remote party at some other time over a secure channel.
static void
myotr_new_fingerprint( void *opdata, OtrlUserState us,
                       const char *accountname, const char *protocol,
                       const char *username, unsigned char fingerprint[20])
{
    cerr << "myotr_new_fingerprint(top)" << endl;
    char our_fingerprint[45];
    if( otrl_privkey_fingerprint( us, our_fingerprint, accountname, protocol) )
    {
        cerr << "myotr_new_fingerprint() our human fingerprint:"
             << embold( our_fingerprint ) << endl;
    }
    cerr << "myotr_new_fingerprint() their human fingerprint:"
         << embold( fingerprint_hash_to_human( fingerprint )) << endl;
    cerr << "do the fingerprints match at the remote end (enter YES to proceed)" << endl;
    std::string reply;
    getline( cin, reply );
    if( reply != "YES" )
    {
        cerr << "You have chosen not to continue to talk to these people... good bye." << endl;
        exit(0);
    }
}
Simpler authentication with SMP
The second example uses the SMP to avoid having to verify fingerprints. For good measure, the fingerprints established are saved and loaded to/from disk so that subsequent conversations do not need any SMP or user fingerprint verification. During process startup, fingerprints are read from file if they exist:
std::stringstream fn;
fn << "fingerprints-" << accountname;
gcry_error_t e = otrl_privkey_read_fingerprints(
    userstate, fn.str().c_str(), 0, 0 );
The otrl_message_sending() and otrl_message_receiving() functions both have a parameter OtrlTLV *tlvs. The tlvs allow data to be sent and received as sideband information that does not affect what you send with libotr. The SMP uses the tlvs to communicate the information that it needs in order to authenticate.
In server_session() the main change is a check on the tlvs variable after calling otrl_message_receiving().
if( tlvs ) { handle_smp( stream, tlvs, userstate, &extended_ui_ops, opdata ); }
The client initiates the SMP and has heavier changes to its code. After creating an iosockstream to localhost, the client calls run_smp_client() to set up the OTR session and run the SMP to authenticate. Apart from the call to run_smp_client(), the client mainloop while(true) doesn't need to change. This makes sense because the SMP is normally only used at session establishment when we do not know about the remote key (fingerprint) already.
In the run_smp_client function, the first while( !secure... loop will establish an OTR session using fingerprints just like the first example. This time we do not stop to ask the user to verify the fingerprints, we simply record that a new fingerprint was seen. This is done by setting runSMP=true to force the SMP if we are using a fingerprint that we didn't already have on disk.
If runSMP is set then we read a secret from the user and call otrl_message_initiate_smp() to get the SMP ball rolling with libotr. This leads to the second while( !secure loop which will stop when we are secure again.
void run_smp_client( iosockstream& stream )
{
    void* opdata = &stream;
    OtrlTLV* tlvs = 0;

    // establish session using fingerprints
    stream << "?OTR?v2?" << endl;
    usleep( 200 * 1000 );
    while( !secure && stream.peek() != std::iostream::traits_type::eof() )
        client_read_msg_from_server( stream );

    if( !runSMP )
    {
        return;
    }

    VMSG << "Starting the Socialist Millionaires' Protocol " << endl
         << " to work out who the other guy is..." << endl << endl;
    VMSG << "please give me a secret that only you and the other guy know..." << endl;
    std::string s;
    getline( cin, s );

    int add_if_missing = true;
    int addedp = 0;
    ConnContext* smpcontext = otrl_context_find(
        userstate, recipientname, accountname, protocol,
        add_if_missing, &addedp, myotr_add_appdata, &ui_ops );
    cerr << "addedp:" << addedp << " smpcontext:" << smpcontext << endl;
    if( !smpcontext )
        return;

    otrl_message_initiate_smp( userstate, &ui_ops, opdata, smpcontext,
                               (const unsigned char*)s.c_str(), s.length() );

    // we are only secure if the SMP succeeds
    secure = 0;
    while( !secure && stream.peek() != std::iostream::traits_type::eof() )
        client_read_msg_from_server( stream );
    cerr << "secure:" << secure << endl;
    if( secure == SMP_BAD )
    {
        cerr << "couldn't authenticate server, exiting..." << endl;
        exit(1);
    }
}
The client_read_msg_from_server() function calls otrl_message_receiving() and checks if tlvs is set and if so calls handle_smp() with that tlvs value.
As you see from the above, whenever a tlvs is set in the client or server, handle_smp() is called. If you look at the UPGRADING file in libotr 3.2.0+ you will see skeleton code in "3.3.4. Control Flow and Errors" on which handle_smp() is based. The handle_smp() function uses otrl_tlv_find() on tlvs to check for internal OTR messages sent from libotr itself which describe a stage in the SMP. handle_smp() is like a primitive state machine working through from SMP1 (the server asking for the secret to respond to the client's initial request), through to SMP3 and SMP4, which are seen when the protocol completes with either success or failure (same or different secrets).
if( (tlv = otrl_tlv_find(tlvs, OTRL_TLV_SMP2)) )
{
    if (nextMsg != OTRL_SMP_EXPECT2)
    {
        cerr << "smp: spurious SMP2 received, aborting" << endl;
        otrl_message_abort_smp( userstate, ui_ops, opdata, smpcontext);
        otrl_sm_state_free(smpcontext->smstate);
    }
    else
    {
        cerr << embold("SMP2 received, otrl_message_receiving will have sent SMP3") << endl;
        smpcontext->smstate->nextExpected = OTRL_SMP_EXPECT4;
    }
}
If the secrets are proven to be the same when the SMP is used, it is advantageous to save the fingerprints to disk so that future communications do not require user fingerprint verification or the SMP.
if( (tlv = otrl_tlv_find(tlvs, OTRL_TLV_SMP4))
    || (tlv = otrl_tlv_find(tlvs, OTRL_TLV_SMP3)) )
{
    if( smpcontext->smstate->sm_prog_state == OTRL_SMP_PROG_SUCCEEDED )
    {
        std::stringstream fn;
        fn << "fingerprints-" << accountname;
        gcry_error_t e = otrl_privkey_write_fingerprints(
            userstate, fn.str().c_str() );
    }
}
Hopefully you are now in a better position to add libotr support to your real time network programs. The full source code to these programs, as well as the HTML for this post itself, is up on my github page. Remember, using off the record messaging doesn't necessarily mean you have anything to hide, just that you have nothing to show.
Monday, July 23, 2012
RDF: Abiword and Calligra
There are some things Abiword can do with RDF that Calligra currently can't. These include highlighting the parts of the document that have RDF associated with them, and initial support for capturing relations between things in RDF. Such relationships from RDF can give something back to the user right off the bat. For example, they can be useful for navigating the document by semantics, moving from a child to their parent, or from the cat to its owner. RDF also opens up document computing to other possibilities, like navigating by time rather than by the linear sequence of pages.
I am currently looking to do some more development around ODF. It would be wonderful to do some more in-depth hacking on either project, as they both have a very nice, welcoming, open source team behind them. If anybody knows of some sponsored hacking floating around for these tools, please get in touch :)
I recently committed a small change to Calligra to move back to using intrusive reference counting for some non graphical RDF objects. While Calligra can use Marble to allow editing locations from RDF, I used Marble at a level in the Calligra stack that isn't optimal, so I have to shift that to a plugin to allow it to be easily accessible to distribution builds.
Monday, June 18, 2012
More ARMses my precious!
Moving along to the Digest speeds below, it is interesting how well the sha1 speeds hold up for the tp-link device. And conversely how strange the numbers are for the Xoom running Tegra2/ICS. The suspicion is that the compiler plays a major role there too...
Hmm, a Beagle Bone or the Marvell 2GHz ARM from a device like the QNAP TS-419PII would also be nice on the charts. Hardware also accepted ;-p~~
Thursday, June 14, 2012
Libferris and SQLite -- a powerful combination
The libferris virtual filesystem allows you to mount things that are not normally thought of as filesystems. Examples include XML files, Berkeley db4 ISAM files, relational databases, system log files, as well as applications like Firefox, Evolution, XWindow, emacs and Amarok.
Version 1.3.0 of libferris has enhanced support for SQLite databases. This includes both mounting an SQLite database as a filesystem and exposing any libferris virtual filesystem to SQLite as a virtual table. You can have as many libferris backed virtual tables as you like in SQLite, so you can join information from multiple sources (XML, syslog, db4 files) using SQL queries.
In this article series I will explain how to mount SQLite as a filesystem so you can manipulate it with the familiar command line tools (ls, cat, redirection), and how to go the other way and expose libferris filesystems like mounted XML to SQLite for manipulation. I'll finish up with a discussion of how libferris lets you mount semi structured plain text files, like log files, as a filesystem and how you can then expose those to SQLite for analysis.
libferris tries to perform what are traditionally seen as filesystem mounting operations for you automatically. For example, if you list an XML file using ferrisls without the -d option (which lists the entry itself rather than its contents), then libferris will try to mount that XML file and show it to you as a filesystem. This allows clients to access and store information into XML and db4 files without needing to know any of the details. A URL could be for a normal directory on an ext3 filesystem or for part of an XML file, it doesn't matter.
Mounting SQLite databases is now done automatically too. Consider the simple SQLite database shown below:
$ touch /tmp/test.sqlite
$ sqlite3 /tmp/test.sqlite
SQLite version 3.5.9
Enter ".help" for instructions
sqlite> create table foo ( id int primary key, msg varchar );
sqlite> insert into foo values (4,'hello there');
sqlite> insert into foo values (5,'foo bar');
Without doing anything special, you can get at the data in this SQLite database through any libferris client. If you prefer you can also use FerrisFUSE to expose it through the Linux kernel to any program. For those unfamiliar, FUSE allows a filesystem that appears through the Linux kernel to be served by an application running as a normal application process. This means FerrisFUSE runs like a daemon and can use libferris to store information while making it available to the Linux kernel like any other traditional (ext3) filesystem. All of a sudden you can run vi or emacs to edit a tuple in an SQLite database. The libferris distribution contains clones of the traditional ls, cat, cp, etc. tools which use libferris to perform their work. This way you don't have to setup any FUSE mounts or the like if you just want to ls, cat or redirect information into a libferris filesystem.
When you mount a table or SQL query result from a relational database with libferris, each column from a tuple is exposed using the standard filesystem Extended Attribute (EA) interface. The EA interface allows key=value pairs to be associated with a file. You can think of EA simply as metadata for files and directories; the familiar lstat(2) information of size, mtime, owner and group of a file is exposed as EA by libferris. In fact, you can get and set the byte contents of a file through the content EA of a file. Making everything available through the EA interface allows you to filter and sort a directory based on lstat metadata as well as user supplied metadata -- it's all just EA.
Shown below, I first list the table from the SQLite database and then view a tuple with the fcat command. When you read a tuple as a file, libferris exposes it as an XML document. In places that do not have a direct mapping, libferris normally tries to do the most sane thing it can. For a tuple, you have to expose all the fields in some delimited format because libferris can't know what fields are the most important. So using XML makes the most sense because it offers explicit key=value serialization and is easily digestible by other applications. You can also use the
-a
option to fcat to select a single
column (EA) which avoids the XML.
$ ferrisls -lh /tmp/test.sqlite/foo
40 4
36 5
$ fcat /tmp/test.sqlite/foo/5
<context id="5" msg="foo bar" />
$ fcat -a msg /tmp/test.sqlite/foo/4
hello there
As you can see above, the long listing format from ferrisls only shows the size and name of each file. All of the information shown by ls is exposed as EA by libferris, but the SQLite database stores no modification time or file ownership for a tuple, so those EA have no value. This makes the ferrisls long listing seem fairly spartan.
You can also ask a libferris directory what it thinks is the most interesting EA (the recommended-ea) for the user. The ferrisls -0 option uses this feature to let you dig into strange filesystems. Shown below, you can see the
msg
and
id
column from the database are shown when using -0
to see the recommended-ea. Because the columns that are shown
by recommended-ea can be arbitrary EA, you might like to use the
--show-headings option to have ferrisls tell you the name of
each EA.
$ ferrisls -0 /tmp/test.sqlite/foo
4 hello there 4 id
5 foo bar 5 id
$ ferrisls -0 --show-headings /tmp/test.sqlite/foo
id msg name primary-key
4 hello there 4 id
5 foo bar 5 id
ferrisls also lets you see its output in XML or RDF/XML using the --xml or --rdf options respectively. Apart from being able to easily parse the output, the XML output lets you see the names of the EA for each value.
$ ferrisls --xml /tmp/test.sqlite/foo
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
<ferrisls id="" msg="" name="foo" primary-key="id" url="file:///tmp/test.sqlite/foo">
  <context id="4" msg="hello there" name="4" primary-key="id"/>
  <context id="5" msg="foo bar" name="5" primary-key="id"/>
</ferrisls>
</ferrisls>
While such read access is a huge step forward for unlocking the data in an SQLite file, you might like to update data as well. Because the redirection commands in bash only know about kernel filesystems, there is the ferris-redirect command which performs a similar role to the shell
>
,
>>
and >|
family of
primitives. The ferris-redirect command reads from standard
input and writes that data to a nominated URL. The
-T
option to ferris-redirect truncates the file
before writing to it, and the --ea
command is used
to select an EA to write to instead of writing to the contents
of the file.
$ echo -n "welcome to libferris" \
  | ferris-redirect -T --ea msg /tmp/test.sqlite/foo/5
$ ferrisls --xml /tmp/test.sqlite/foo
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
<ferrisls id="" msg="" name="foo" primary-key="id" url="file:///tmp/test.sqlite/foo">
  <context id="4" msg="hello there" name="4" primary-key="id"/>
  <context id="5" msg="welcome to libferris" name="5" primary-key="id"/>
</ferrisls>
</ferrisls>
I added support for mounting PostgreSQL databases to libferris years ago. There are still some features unique to the PostgreSQL support in libferris, such as calling database functions through a virtual filesystem. You can also mount other databases that QtSQL can access using libferris.
Going the other way
Let's now turn our attention to doing the opposite: exposing libferris as a relation to SQLite. You might have some data in an XML file that you need to join with data stored in Berkeley db4 or an existing SQLite database. Your first thought might be to whip up a Perl script to bring the data together. With libferris you don't have to write any of this glue code. As SQLite includes support for virtual tables, you can expose any libferris virtual filesystem as a table to SQLite. Because XML is just another virtual filesystem to libferris, you can easily expose it to SQLite and join the data with a normal SQL query.
To use libferris from SQLite, you'll need an sqlite client binary that can load shared libraries. Allowing binary libraries to be loaded into your SQLite session is a bit of a security issue: the code could potentially do anything. As such, the packaged client for Fedora does not allow loading shared libraries. It is very easy to download the sqlite source and recompile it with shared library support though, and you only have to install the single
sqlite3
binary somewhere
in order to get at libferris virtual tables.
Firstly you'll need to load the libferris support using the
.load
directive at an sqlite prompt. Data is
exposed into SQLite as a virtual table using the create
virtual table
statement. The format for this statement
varies depending on the code implementing the virtual table. For
libferris, the parameters are: The URL to expose, a string
containing key=value pairs to effect this exposure, and the EA
you want exposed, expressed in the normal create
table
format. Note that you can only specify the name and
datatype for each EA (column) you want, you can't put
constraints on them like foreign or primary keys. To avoid
confusion, I'll call the URL supplied as the first parameter to
create virtual table the "base URL".
Issuing the below two statements at an sqlite prompt will bring libferris support into the current session and create a virtual table which shows the /tmp/test directory with various lstat(2) information as well as the MD5 checksum and file contents as columns in the new SQLite table. Note that there is an implicit first column in the created virtual table which is always the file URL (primary key) so you don't have to specify that manually.
$ sqlite3
.load libferrissqlitevtable.so
create virtual table fs using libferris(
  'file:///tmp/test', '',
  name text, size int, md5 text, path text,
  mtime int, atime int, ctime int, is-dir int,
  mtime-display text, content text );
There is a short list of common EA descriptions that are available in libferris. If you are unsure what an EA looks like, use the
--show-ea
option to ferrisls to select which EA you want to see for a
URL. --show-ea
takes a comma separated list of
which EA you want to see. The special ea-names
EA
will list the names of all EA that are available for a URL as
a comma separated string. The second command below generates a
newline separated version in a temporary file.
$ ferrisls --show-ea=name,foo URL
$ fls --show-ea=ea-names URL | tr ',' '\n' >| /tmp/o
$ vi /tmp/o
Let's turn our attention back to the
create virtual table
SQLite statement.
There are currently two options which can be specified in the second parameter as comma-separated
key=value pairs. If
you specify recursive=true
then the table exposes
the entire filesystem recursively from the base URL, otherwise
only the directory of the base URL is exposed. This lets you
choose if you want the directory or entire filesystem tree from
the base URL to be exposed. If you specify
add-rowid-to-base-filesystem=true
then stable 64bit
row-ids are added as needed to files in the exposed table. You
need stable row-ids to allow the SQL update
.. where
and insert into
statements to work
as expected.
There are a few things to keep in mind when exposing a filesystem as a relational table. For example, EA names that contain a '-' in ferrisls and native libferris clients are translated to use '_' instead because SQLite doesn't like the dash. Likewise, ':' is translated into '_colon_'. So in the above create virtual table, the
is-dir
and
mtime-display
EA will be available in the SQL
columns is_dir
and mtime_display
respectively.
To get at the above virtual fs table, you can use the normal SQL select command:
select is_dir,name,url from fs order by url desc;
...
0|df2|file:///tmp/test/df2
0|df1|file:///tmp/test/df1
...

Because some calculated EA like the MD5 checksum are exposed as columns, you can grab those from a query too. The libferris virtual table implementation includes code to allow you to pick files in a where clause using their path or URL, and libferris will optimize the lookup in a manner similar to (though not quite as efficient as) a relational database index on those columns. Note that if you use the URL in the where clause, you need to specify the
file://
part
too, and the URL must exactly match what you are looking for. If
you are not getting the result you expect when searching for
yourURL, use ferrisls --show-ea=url yourURL
and
make sure the result exactly equals yourURL.
select md5,size,mtime,mtime_display,name,url from fs where path = '/tmp/test/df1';
Exposing a normal filesystem directory to SQLite is not very exciting, but keep in mind that anything you can see through libferris you can see through an SQLite virtual table backed by libferris. So if you have an XML file:
$ cat customers.xml
<customers>
  <customer id="3" givenname="Foo" familyname="Bar"/>
  <customer id="15" givenname="Bobby" familyname="McGee"/>
  <customer id="131" givenname="Ziggy" familyname="Stardust" />
</customers>
You can expose it to SQLite and query based on any attributes in the XML file as shown below.
$ cat test-xml.sql
sqlite> .load libferrissqlitevtable.so
sqlite> create virtual table fs using libferris(
  'customers.xml/customers', '',
  name text, size int, id int,
  givenname text, familyname text, content text );
sqlite> select * from fs order by givenname desc;
file:///ferris/exposures/sqlitevtab/customers.xml/customers/131|131|0|131|Ziggy|Stardust
file:///ferris/exposures/sqlitevtab/customers.xml/customers/3|3|0|3|Foo|Bar
file:///ferris/exposures/sqlitevtab/customers.xml/customers/15|15|0|15|Bobby|McGee
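If you don't have libferris handy, you can still get a feel for what the virtual table exposes. The rough Python sketch below does by hand the parsing step that libferris would do transparently: it reads the customer attributes out of the XML with the standard library and loads them into a plain SQLite table so the same query can run.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A rough stand-in for the libferris virtual table: parse customers.xml
# ourselves and load the XML attributes into a plain SQLite table.
xml_text = """<customers>
  <customer id="3" givenname="Foo" familyname="Bar"/>
  <customer id="15" givenname="Bobby" familyname="McGee"/>
  <customer id="131" givenname="Ziggy" familyname="Stardust" />
</customers>"""

db = sqlite3.connect(":memory:")
db.execute("create table fs ( id int, givenname text, familyname text )")
for c in ET.fromstring(xml_text):
    db.execute("insert into fs values (?,?,?)",
               (int(c.get("id")), c.get("givenname"), c.get("familyname")))

for row in db.execute("select * from fs order by givenname desc"):
    print(row)
```

The output is the same three customers ordered Ziggy, Foo, Bobby; the difference is that with libferris you never write the ElementTree glue at all.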
You can also create a filesystem index with libferris and expose desktop search results as a virtual sqlite table. For example, the following searches an image collection for a particular bridge using the EA Query (eaq://) libferris virtual filesystem. For more information on creating and using the desktop search functionality in libferris see the Linux Journal article.
$ cat query.sql
.load libferrissqlitevtable.so
create virtual table fs using libferris(
  'eaq://(url=~Akashi-Kaikyo-Bridge)', '',
  name text, size int, md5 text, path text,
  mtime int, atime int, ctime int, is-dir int,
  mtime-display text, width int, height int, content text );
select is_dir,width,height,url from fs order by url desc;
...
0|1840|1232|file:///digital_pictures/2004-japan/Akashi-Kaikyo-Bridge-main-section2.png
0|1840|1232|file:///digital_pictures/2004-japan/Akashi-Kaikyo-Bridge-main-section.png
...
Updating XML through SQL commands
Imagine that you have the customers.xml file shown below. You have exposed it to SQLite and used it to join with information from tables in the database to resolve queries. All is well, but you would like to add a new customer to the XML file. While you might consider opening up a text or XML editor on the XML file itself, you would really like to just keep using SQL.

$ cat customers.xml
<customers>
  <customer id="3" givenname="Foo" familyname="Bar"/>
  <customer id="15" givenname="Bobby" familyname="McGee"/>
  <customer id="131" givenname="Ziggy" familyname="Stardust" />
</customers>
Currently you have to specify everything, including the URL of the new file you wish to create, in the SQL INSERT statement as shown below. Note that the ID does not need to be part of the URL; you only need to tell libferris you want to create a new
customer
element. A planned
addition here is to allow libferris to understand the
autoincrement keyword during the creation of the virtual SQLite
table. This would let you be able to supply a NULL for the ID
column and have one automatically generated for you, even though
this is backed by an XML file.
.load libferrissqlitevtable.so
create virtual table fs using libferris(
  'file:///tmp/customers.xml/customers',
  'add-rowid-to-base-filesystem=true',
  id int, givenname text, familyname text );
insert into fs values (
  'file:///tmp/customers.xml/customers/customer',
  300, 'this is the new given name', 'new sirname here' );
At this stage your customers.xml file will include a new element for the entry you INSERTed above:
$ cat customers.xml
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<customers ferris-sqlite-next-rowid="1">
  <customer familyname="Bar" givenname="Foo" id="3"/>
  ...
  <customer familyname="new sirname here" ferris-sqlite-rowid="1" givenname="this is the new given name" id="300"/>
</customers>
The little differences that surfaced for the SQL INSERT statement completely disappear when using SQL UPDATE as shown below:
update fs set familyname = 'Anderson' where id = 300;

As you can see above, setting add-rowid-to-base-filesystem=true in the create virtual table statement causes libferris to save ferris-sqlite-rowid data for files in the virtual filesystem. If these attributes can be saved into the file itself, either directly as in XML, or to filesystem Extended Attributes, then they are stored there. If neither of these is possible then libferris stores the ferris-sqlite-rowid data in a personal RDF store.
Plain text log files and SQLite
Those familiar with awk and Perl don't really think too much about splitting a log file into its various columns. But a log file stored as plain text on a filesystem is hard to query directly from a relational database. One might wish to know the number of connections from each IP address and the total number of bytes sent to each of those IP addresses. For such a query the SQL is fairly quick to conjure up, but the data is not in an agreeable format for SQL.

Version 1.3.0 of libferris also includes support for mounting text files with a regular structure as a virtual filesystem. In its most basic form this extends to splitting up each line of a file using a nominated separator character and making a virtual file for each line, with the individual fields exposed as EA. For example, using the comma as the delimiter you could mount a comma separated file as a libferris filesystem. Once it's a libferris filesystem you can then turn around and use SQLite to query that filesystem.
In an awk script, the fields that are generated by splitting up a line (record) of input are available through the $1, $2, etc. variables. By default the fields are exposed using similar names in libferris but you can also specify a comma separated list of identifiers for the fields. These identifiers give the name of an extended attribute (EA) which will have the field value. Perhaps an example would best clarify things.
The test2.csv file shown below contains three fields separated by commas. The first ferris-redirect command sets the ferris-recordfile-field-names EA for the file, which tells libferris the names you would like to give to the columns. The second ferris-redirect tells libferris what separates the fields in the file. This is like the FS variable from awk and can be a regular expression.
Finally the ferris-type is set to recordfile to tell libferris that it should try to mount the file by parsing it if the user tries to list it like a directory. Because recordfiles are just plain text files, libferris has no quick way of knowing if it can or should try to parse the file without you explicitly setting the ferris-type to recordfile.
Notice in the ferrisls command that the fields have the names which you specified using the ferris-recordfile-field-names attribute.
$ cat test2.csv
cola,colb,colc
seca,secb,secc
third1,third2,third3,third4
$ echo -n 'colA,secondColumn,lastColumn' \
  | ferris-redirect -T --ea=ferris-recordfile-field-names test2.csv
$ echo -n ',' \
  | ferris-redirect -T --ea=ferris-recordfile-fs test2.csv
$ echo -n recordfile \
  | ferris-redirect -T --ea=ferris-type test2.csv
$ fls --xml test2.csv
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
<ferrisls name="/tmp/testrecordfile/test2.csv" url="file:///tmp/testrecordfile/test2.csv">
  <context colA="cola" lastColumn="colc" name="0" secondColumn="colb"/>
  <context colA="seca" lastColumn="secc" name="1" secondColumn="secb"/>
  <context colA="third1" lastColumn="third3" name="2" secondColumn="third2"/>
</ferrisls>
</ferrisls>
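The record-splitting model itself is easy to sketch outside libferris. The Python fragment below (an illustration, not libferris code) mimics how a field-separator regex plus a list of field names turns each line of a recordfile into a set of named attributes:

```python
import re

def split_records(text, field_names, fs=","):
    """Split each line on the separator regex and name the fields,
    mimicking the ferris-recordfile-fs / ferris-recordfile-field-names EA."""
    records = []
    for line in text.strip().splitlines():
        fields = re.split(fs, line)
        records.append(dict(zip(field_names, fields)))
    return records

csv_text = "cola,colb,colc\nseca,secb,secc\n"
rows = split_records(csv_text, ["colA", "secondColumn", "lastColumn"])
print(rows[0]["secondColumn"])  # → colb
```

In libferris the same mapping happens transparently: each line becomes a virtual file and each named field becomes an EA on that file.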
Since it's annoying and tedious to set these field separators for each file, you can set the ferris-recordfile-metadata-url EA to an absolute or relative name where the metadata about field names and separators is kept. In the case shown below, libferris will look in ~/.ferris/recordfile-types.db/apache_access_log for the info about how to split up the file.
$ echo -n apache_access_log \
  | ferris-redirect -T -x --ea=ferris-recordfile-metadata-url /var/log/httpd/access_log
$ echo -n recordfile \
  | ferris-redirect -T --ea=ferris-type /var/log/httpd/access_log
The ferris-recordfile-metadata-url is simply a level of indirection. Instead of libferris looking at the ferris-recordfile-fs EA on the /var/log/httpd/access_log file it will instead look for the ferris-recordfile-fs EA on ~/.ferris/recordfile-types.db/apache_access_log. This way you can setup the information for field names and the regular expressions used to split up various file types in one place and simply refer files of that format to the correct recordfile-types.db URL.
Trying to parse apache access logs using a fixed regular expression for the field separator is not as simple as with comma separated files. This is because the separator changes as you move across a line in the access log. To enable parsing of less regular files, libferris supports the ferris-recordfile-fsplit EA, which defines a single regular expression with capture groups that explicitly pick out the fields from the input. The fsplit regular expression (regex) is normally much longer than a field separator (FS) regex, but you can easily change what the regex searches for after each matched field.
Shown below is a partial implementation of an fsplit capturing regex for parsing apache access log files. The recordfile-types.db comes with libferris as part of the /etc/skel files for
~/.ferris
. Note that you can easily
explore and update the data stored in a Berkeley db file because
libferris can mount those too.
$ fls --xml ~/.ferris/recordfile-types.db
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
<ferrisls name="/ferris/dot-ferris/recordfile-types.db" url="file:///ferris/dot-ferris/recordfile-types.db">
  <context ferris-recordfile-field-names="ip,f2,f3,date,req,response,sz,f8" ferris-recordfile-fsplit='^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) ([^ ]+) ([^ ]+) \[(.+)\] "(.*)" ([0-9]+) ([0-9]+|\-).*' name="apache_access_log"/>
  ...
</ferrisls>
</ferrisls>
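You can exercise that same capturing regex in plain Python to see how each capture group lines up with one of the named fields (the log line here is a made-up sample in standard apache common log format):

```python
import re

# The fsplit regex shown above: each capture group becomes one named
# field, in the order given by ferris-recordfile-field-names.
FSPLIT = (r'^([0-9]+\.[0-9]+\.[0-9]+\.[0-9]+) ([^ ]+) ([^ ]+) '
          r'\[(.+)\] "(.*)" ([0-9]+) ([0-9]+|\-).*')
FIELDS = ["ip", "f2", "f3", "date", "req", "response", "sz"]

line = '192.168.100.200 - - [14/Mar/2009:14:37:06 +1000] "GET / HTTP/1.1" 200 585'
m = re.match(FSPLIT, line)
record = dict(zip(FIELDS, m.groups()))
print(record["ip"], record["response"], record["sz"])
```

Each matched group turns into one EA on the virtual file for that line, which is exactly what the fls output below shows.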
Once you have set the ferris-type to recordfile and set the metadata URL as shown above, then you can dig into the log file as a filesystem as shown below. Note that if you wanted to import the log somewhere else you can use the --xml or --rdf options to ferrisls to expose this filesystem as XML or RDF/XML respectively.
$ fls -0 access_log
0 - - 14/Mar/2009:14:37:06 +1000 585 192.168.100.200 GET / HTTP/1.1 200
1 - - 14/Mar/2009:14:37:06 +1000 287 192.168.100.200 GET /favicon.ico HTTP/1.1 404
But rather than exporting the data as XML, why not just use the SQLite virtual table interface to query it directly, as shown below? First I make a new virtual table from the log file and then perform the most basic query on it. The second query shows how many requests each IP address made and how many bytes were sent to each IP address.
sqlite> .load libferrissqlitevtable.so
sqlite> create virtual table logfile using libferris(
  'file:///tmp/access_log', '',
  ip text, req text, response text, date text, sz int );
sqlite> select * from logfile order by ip desc;
file:///tmp/access_log/0|192.168.100.200|GET / HTTP/1.1|200|14/Mar/2009:14:37:06 +1000|585
file:///tmp/access_log/1|192.168.100.200|GET /favicon.ico HTTP/1.1|404|14/Mar/2009:14:37:06 +1000|287
...
sqlite> select count(*),max(ip),sum(sz) from logfile group by ip;
15|192.168.100.200|1798021
3|192.168.110.200|42722
1|192.168.130.200|16927
2|192.168.133.200|271977
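The aggregate query is ordinary SQLite, so you can sanity-check it with Python's built-in sqlite3 module and a few made-up rows, no libferris required. The sketch below builds an ordinary in-memory table with the same columns as the virtual table:

```python
import sqlite3

# An in-memory stand-in for the libferris-backed logfile table
# (the IP addresses and sizes here are made-up sample data).
db = sqlite3.connect(":memory:")
db.execute("create table logfile ( ip text, req text, response text, date text, sz int )")
hits = [
    ("192.168.100.200", "GET / HTTP/1.1", "200", "14/Mar/2009:14:37:06 +1000", 585),
    ("192.168.100.200", "GET /favicon.ico HTTP/1.1", "404", "14/Mar/2009:14:37:06 +1000", 287),
    ("192.168.110.200", "GET / HTTP/1.1", "200", "14/Mar/2009:14:40:00 +1000", 1024),
]
db.executemany("insert into logfile values (?,?,?,?,?)", hits)

# Hits and total bytes per IP address.
for count, ip, total in db.execute(
        "select count(*), max(ip), sum(sz) from logfile group by ip"):
    print(count, ip, total)
```

With the virtual table, the only difference is that the rows come straight from the log file instead of from insert statements.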
Most log files contain a timestamp containing the date and time that a request or event occurred. In the above query, we declare the
date
column as part of the virtual
table but don't use it in the query. There is one major glitch
-- having the date field as plain text makes performing date
comparisons or grouping by date almost impossible. You are at
the mercy of the string to date functions of your SQL
environment.
To get around this, support was added to libferris to have it parse arbitrary date and time EA and generate a corresponding new EA with an -epoch suffix exposing the UNIX epoch value (seconds since 1970). The epoch time is the lowest common denominator: it requires no parsing for sqlite to interpret it properly, and you can perform date and interval comparisons directly on the epoch.
The below command tells libferris to try to parse the
date
EA as a time value using the given strftime(3)
time format string. The format fully matches the apache
access_log time format and so libferris will generate a new
date-epoch EA containing the seconds since 1970 for the date EA.
Note that recordfile-types.db that ships with libferris has all
of these modifications already in place for you but knowing how
things are setup lets you parse files in other formats too.
As you can see, I now use the date-epoch column when creating the virtual table because that value is much more useful to SQLite. Selecting just the date_epoch column shows you the number of seconds as an integer. The date_epoch can be made more digestible with the datetime() SQLite function. If you have more than one EA that you want to parse into an epoch time, separate them with the pipe character, for example,
a=%b/%Y|b=%H:%M:%S
echo -n 'date=%d/%b/%Y:%H:%M:%S %z' \
  | ferris-redirect --ea ferris-recordfile-field-timeparsers \
    recordfile-types.db/apache_access_log
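That strftime-style format string does the same job datetime.strptime would do in Python. As a quick check (plain Python, no libferris), parsing one of the sample log timestamps with the identical format yields its epoch value:

```python
from datetime import datetime

# Parse an apache access_log timestamp with the same strftime-style
# format string used above, then take the UNIX epoch value from it.
FMT = "%d/%b/%Y:%H:%M:%S %z"
stamp = "14/Mar/2009:14:37:06 +1000"
epoch = int(datetime.strptime(stamp, FMT).timestamp())
print(epoch)  # → 1237005426
```

The %z directive makes the parsed datetime timezone-aware, so the +1000 offset is folded into the epoch value automatically.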
sqlite> .load libferrissqlitevtable.so
sqlite> create virtual table logfile using libferris(
  'file:///tmp/access_log', '',
  ip text, req text, response text, date-epoch int, sz int );
sqlite> select url,ip,date_epoch,datetime(date_epoch,'unixepoch') from logfile order by ip desc;
file:///tmp/access_log/19|192.168.133.200|1237005517|2009-03-14 04:38:37
file:///tmp/access_log/20|192.168.133.200|1237005517|2009-03-14 04:38:37
...

This leads to more interesting queries such as the one shown below, which gives a day-by-day breakdown of bandwidth usage for the last 60 days. The first part of the where clause converts the current time into a UNIX epoch value so that the time from the log file can be subtracted and we can exclude events that are older than 60 days.
select strftime('%Y-%m-%d',datetime(date_epoch,'unixepoch')) as Date,
       sum(sz) as Bytes
from logfile
where strftime('%s',datetime('now')) - date_epoch <= 60*60*24*60
group by Date;
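The date functions in that query are plain SQLite, so you can try the grouping part on synthetic rows with Python's sqlite3 module. In this sketch the 60-day where clause is dropped so the result doesn't depend on when you run it, and the epochs are invented values corresponding to 2009-03-14 and 2009-03-15 UTC:

```python
import sqlite3

# Synthetic date_epoch/sz rows; 1237005426 is 2009-03-14 UTC,
# 1237091826 is exactly one day later.
db = sqlite3.connect(":memory:")
db.execute("create table logfile ( date_epoch int, sz int )")
db.executemany("insert into logfile values (?,?)",
               [(1237005426, 585), (1237005426, 287), (1237091826, 1024)])

query = """
select strftime('%Y-%m-%d', datetime(date_epoch,'unixepoch')) as Date,
       sum(sz) as Bytes
from logfile
group by Date
"""
for date, total in db.execute(query):
    print(date, total)
```

Each day collapses to a single row with its total byte count, which is exactly the per-day bandwidth breakdown the full query produces.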
Wrap up
If you are more familiar with XML technologies, libferris can also expose itself as a virtual document to XQuery. Currently libferris can expose itself to FUSE, SQLite, XQuery and as a virtual Xerces-C DOM. A core ideal of libferris is liberation: you should be able to see your data as a virtual filesystem, be it from an XML or db4 file, from a relational database, or from an application like Firefox or emacs. Once information is available to libferris, you should be able to get at it from whatever tool best suits your current task: a filesystem, a table in a relational database, or a document in an XQuery.

Tuesday, June 12, 2012
Putting your Web hits on the map with libferris and XQuery
Learn how to take the IP addresses and other information from an apache access.log file and see those IP addresses as place marks in Google Earth and Google Maps. We'll use XQuery to do this instead of Perl or your favourite scripting language. We won't have to worry about breaking up the lines in access.log or opening connections to relational databases to look up the location of IP addresses in a table -- with the recent advancements in libferris, both access.log and relational databases are already at your fingertips from an XQuery.
Because this article touches on many technologies, I'll omit an introduction to XQuery. You should be able to get the gist of the XQuery code shown at the end of the article, which I'll also describe. But the details of XQuery are left for you to explore in other tutorials.
Because we want to deal with geographic data, specifically IP addresses placed at locations and cities, using KML format XML files makes sense. KML files can be loaded and saved by Google Earth and Google Maps, making them a good format for graphically showing on a map where IP addresses are located. This is another reason that XQuery works well: you can easily generate an XML document directly from XQuery.
The libferris project started about 10 years ago with the goal of creating a powerful virtual filesystem. Along the way it has gained the ability to mount XML, relational databases, Berkeley db and other ISAM files, and anything else that makes sense. While having everything available as a virtual filesystem is nice, at times you might prefer to use a different interface than a filesystem. So you can now access any libferris filesystem as a virtual XML Document Object Model (DOM), an XQilla document, an SQLite virtual table, or through XSLT.
The traditional "mounting" filesystem operation is mostly handled for you by libferris. If you access a directory and libferris knows what to do to mount it in a meaningful way then it does so. For example, listing
postgresql://localhost/mydb
will list all of the
tables in the mydb database on localhost. No mounting is needed,
just grab the data you want. This works well when using
libferris with XQuery because you can just peek into any
libferris filesystem directly.
I'll talk about how to map an IP address to city and a latitude, longitude pair first, then turn attention to getting at the data contained in apache's access.log files from libferris, and finally turn to XQuery to bring it all together.
IP to Location
To resolve an IP address to the city, country and decimal longitude and latitude values, I'll use the free IP address geolocation SQL database. The free IP address database download is targeted at a MySQL database. I used the mysql2pgsql tool to convert the SQL file and import it into a PostgreSQL database. The main reasons for this are that I prefer PostgreSQL and that libferris has better support for mounting PostgreSQL as a filesystem than other relational database servers. Since I am the author of libferris, these two reasons are not unrelated. Of particular interest for this article, you can call a PostgreSQL database function through the libferris filesystem interface, which is not currently implemented for other databases.
The IP address database uses the MySQL
inet_aton()
function. A replacement for PostgreSQL can be found here.
The final touch is to wrap up the IP to geographic information
into a custom PostgreSQL function as shown below. First the
return type of the function is defined, cc is the country code,
reg is the region, ipn is the numerical version of the dotted IP
address and ip is the dotted IP address. The first select
statement illustrates this, 69.90.118.0 is the dotted IP address
and 1163556352 would be the numerical version (ipn).
The
ipgeo
function takes a dotted IP address and
returns the location information if available as shown in the
final example query.
select inet_aton('69.90.118.0') as ipn
------------
1163556352

create type ipgeoresult as (
  cc varchar, reg varchar, city varchar,
  lat double precision, long double precision,
  ipn bigint, ip text );

CREATE OR REPLACE FUNCTION ipgeo( ip varchar ) RETURNS ipgeoresult AS '
  select country_code as cc, region_code as reg,
         city, latitude as lat, longitude as long,
         ip_start as ipn, inet_ntoa(ip_start) as ip
  from ip_group_city
  where ip_start <= inet_aton($1)
  order by ip_start desc limit 1;
' LANGUAGE 'SQL';

select city,lat,long from ipgeo('69.90.118.0');
   city   |  lat  |  long
----------+-------+--------
 Edmonton | 53.55 | -113.5
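The inet_aton() conversion itself is just base-256 arithmetic on the four octets. A small Python equivalent (an illustration, not part of libferris) makes the dotted-to-integer mapping concrete:

```python
# A dotted IP address maps to its 32-bit integer form by treating the
# four octets as base-256 digits; this mirrors MySQL's inet_aton().
def inet_aton(dotted):
    a, b, c, d = (int(p) for p in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

print(inet_aton("69.90.118.0"))  # → 1163556352
```

The ip_group_city table stores the start of each range in this integer form, which is why the lookup can use a simple ip_start <= inet_aton($1) comparison.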
The ferrisls command should now be able to find geoinfo from an IP address by calling the
ipgeo()
PostgreSQL
function through the filesystem interface as shown below. The
ferrisls command works like the normal ls(1) command with some
extensions. As you can see, using --xml tells ferrisls to print
its output as an XML document.
$ ferrisls --xml "pg://localhost/ipinfodb/ipgeo('69.90.111.0')"
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
<ferrisls cc="" city="" ip="" ipn="" lat="" long="" name="ipgeo('69.90.111.0')" primary-key="cc-reg-city-lat-long-ipn-ip" reg="" url="pg:///localhost/ipinfodb/ipgeo('69.90.111.0')">
  <context cc="US" city="New York" ip="69.90.111.0" ipn="1163554560" lat="40.6888" long="-74.0203" name="US-36-New York-40.6888--74.0203-1163554560-69.90.111.0" primary-key="cc-reg-city-lat-long-ipn-ip" reg="36"/>
</ferrisls>
</ferrisls>
Breaking access.log
Now that we have IP address to geoinfo stuff sorted out, we need to be able to get at the information in an access.log file. XQuery is not well suited to breaking up plain text files though. Luckily we can shift that work onto libferris, getting a much more XML agreeable version of the information in access.log for free.
As detailed in a previous article on libferris and SQLite, libferris can mount log files as filesystems. And as mentioned above, once you have a libferris filesystem, you also have a virtual document that can be accessed from an XQuery. The below commands tell libferris that the access_log file in the current directory has the structure of an apache access.log file and that it should try to mount it that way if you treat the file like it was a directory.
$ echo -n apache_access_log \
  | ferris-redirect -T -x --ea=ferris-recordfile-metadata-url access_log
$ echo -n recordfile \
  | ferris-redirect -T --ea=ferris-type access_log

The ferrisls command can be used to peek into the log file to make sure your libferris is set up correctly and mounting the log file as expected. This is shown in the command below. Notice that the pieces of information from a line in access.log are represented as XML attributes and form a single XML element. The XML attributes all have useful names, courtesy of libferris using the metadata associated with the apache_access_log file type.
Also notice that there is a date-epoch field in the output which is not explicitly stored in the access.log. As part of the apache_access_log settings, libferris recognizes the date field as a timestamp and creates date-epoch which contains the value of date as the number of seconds since 1970. Having the epoch as an integer can be convenient when you are importing the data into another tool.
$ ferrisls --xml access_log
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<ferrisls>
...
  <context date="17/Mar/2009:12:24:15 +0100" date-epoch="1237256655" ip="78.111.111.111" logname="-" name="4" referer="-" req="GET /.../" response="400" sz="385" user="-" user-agent="-"/>
...

For more details on mounting log files, see the previous article on libferris and SQLite on linux.com.
XQuery -- Bringing it all Together
To access a document from an XQuery, you use the doc()
function. If you are resolving your XQuery
with libferris, there is the ferris-doc()
function
which makes the named libferris filesystem available as though
it was XML.
The entire XQuery to generate a KML file from an access_log is shown below. The core of the query is the first 4 lines, the rest just picks out data from the results that were found in the first lines and creates the correct KML file structure.
The first for loop iterates over each line in the log file. For each of these lines, $ip is set to just the value of the dotted IP address from the log file. Using the data() function casts away the fact that $res/@ip is an XML attribute in the source; after data() is applied, $ip is just a string containing the IP address. The next line uses the PostgreSQL database to look up the geoinfo for the IP address. Notice that we are actually grabbing an entire tuple into $ipinfo from the relational database here, including the city, country, longitude and latitude. The compare line silently ignores IP addresses where we do not have valid geoinfo. You might like to flag these IP addresses with another XQuery, because if they are unknown they might be worth investigating.
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
  {
    for $res in ferris-doc( "/.../access_log" )/*
    let $ip := data($res/@ip)
    let $ipinfo := ferris-doc( concat( "pg://localhost/ipinfodb/ipgeo('", $ip, "')" ) )/*
    where compare($ipinfo/@city,"") != 0
    return
      <Placemark>
        <name>{ $ip }</name>
        <description>
          A hit from {data($ipinfo/@city)} <br/>
          { $ip } <br/>
          At: { data($res/@date) } <br/>
          user-agent: { data($res/@user-agent) } <br/>
          {
            let $r := data($res/@referer)
            where $r != "-"
            return <a href="{ $r }">{ $r }</a>
          }
        </description>
        <Point>
          <coordinates>
            {concat(data($ipinfo/@long),",",data($ipinfo/@lat))}
          </coordinates>
        </Point>
      </Placemark>
  }
  </Document>
</kml>
You might like to improve these examples to take the path of the access_log file as a parameter instead of hard coding it. To transform the access_log file referenced in the above XQuery into a KML file simply issue the below command.
$ ferris-xqilla apache-access-log-to-kml.xq >| access_log.kml

The above XQuery will generate a valid KML file when used against an apache access.log file. When you load the KML file into Google Earth, after exploring around and clicking on place markers a few times you might notice the flaw in the above query: each individual access by an IP address causes a new place marker in the KML file. Thinking about this with our SQL hat on, what we would like to do is "group by" IP address first.
This SQL "group by" is expressed in the XQuery below. Firstly we pick out all the unique IP addresses and perform a single geoinfo lookup for each of those IP addresses. As you can see in the creation of the description element, the query searches over the log again to make the click bubble for a place marker show information about how often and when each IP address has accessed our site.
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
  {
    for $ip in distinct-values( ferris-doc( "/.../access_log" )/*/@ip )
    let $ipinfo := ferris-doc( concat( "pg://localhost/ipinfodb/ipgeo('", $ip, "')" ) )/*
    where compare($ipinfo/@city,"") != 0
    return
      <Placemark>
        <name>{ $ip }</name>
        <description>
          A hit from {data($ipinfo/@city)} <br/>
          { $ip } <br/>
          {
            for $res in ferris-doc( "/.../access_log" )/*[@ip = $ip]
            return
              <p>At: { data($res/@date) }
                {
                  let $r := data($res/@referer)
                  where $r != "-"
                  return <a href="{ $r }">{ $r }</a>
                }
              </p>
          }
        </description>
        <Point>
          <coordinates>
            {concat(data($ipinfo/@long),",",data($ipinfo/@lat))}
          </coordinates>
        </Point>
      </Placemark>
  }
  </Document>
</kml>
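The "group by" step the XQuery performs can be sketched in plain Python. The dicts below stand in for the XML elements libferris produces (attribute names taken from the ferrisls output earlier); the log entries themselves are made up for illustration.

```python
# Sketch of the "group by IP" step: collect all hits per IP first, then
# emit one place marker per IP instead of one per hit. The sample hits
# are invented; attribute names mirror the ferrisls --xml output.
from collections import defaultdict

hits = [
    {"ip": "78.111.111.111", "date": "17/Mar/2009:12:24:15 +0100", "referer": "-"},
    {"ip": "78.111.111.111", "date": "17/Mar/2009:12:30:02 +0100", "referer": "http://example.com/"},
    {"ip": "10.0.0.7",       "date": "18/Mar/2009:09:01:44 +0100", "referer": "-"},
]

# Equivalent of distinct-values(.../@ip) plus the inner loop over
# .../*[@ip = $ip] in the XQuery above.
by_ip = defaultdict(list)
for hit in hits:
    by_ip[hit["ip"]].append(hit)

for ip, visits in by_ip.items():
    print(ip, len(visits))  # one "place marker" per IP, not per hit
```

The XQuery version re-scans the log for each distinct IP, which this dict-based grouping avoids; for a large access.log that single pass is the cheaper approach.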
Wrap up
In the past, the focus has been on giving libferris the ability to see and interact with more data sources as filesystems. Recently that focus has also extended to making any libferris filesystem available to other systems, be it SQLite, XQuery, FUSE, or as a massive, virtual, Xerces-C document. The ability of libferris to get its hands on data from a wide array of places frees you from having to pick things apart in a script, perform temporary storage management for the data, and bring the data together yourself in that script. Just use SQL or XQuery and let libferris do the heavy lifting of data access for you.

Other possibilities include extending the scripts to include details of user logins and session times, and possibly combining the Timeline widget with Google maps to investigate changes over time, for example using timemap. Work on the XQilla integration for resolving XQueries in libferris is still under development. In particular, query times are likely to improve as bottlenecks are found and fixed.
Wednesday, June 6, 2012
An ARMs race, with a core i7 in there too for relativity
The above image shows the cipher performance of "openssl speed" across many machines. The 2600K result is single threaded only, so it could be many times faster in real world use by taking advantage of all the cores. One interesting point right off the bat is that the 1.2GHz Kirkwood in the Synology NAS is bested by the 1.0GHz CPU of the Nokia n9. ARM is not ARM.
Removing the towering Intel i7 2600K results from the graph, we see that the Dreamplug is very close to the ds212j in terms of performance.
On the other hand, the digests show a distinct advantage for the Dreamplug setup. Again the n9 has a nice little lead on the others. Since a mobile phone can perform this much useful work, one should perhaps demand that a NAS offer handy features as well as just serving data blocks.
The RSA sign and verify graphs both show the same large slump for the ds212j unit. So for connection-heavy workloads there would seem to be a large difference in the throughput you might get depending on which ARM you choose. On the other hand, the Dreamplug and ds212j have similar performance on stream ciphers, so if connections are longer lived the difference will be smaller.
I would love to add benchmarks for the CuBox and the 2GHz QNAP (TS-219PII) NAS units. It would also be interesting to run aftermarket software on the ds212j and see the difference.
Friday, May 18, 2012
Diving into the Socialist Millionaire Protocol
Alice:
- Picks random exponents a, b, and s
- Sends Bob g^a and g^b

Bob:
- Picks random exponents c, d, and r
- Computes j = (g^a)^c = g^(ac) and k = (g^b)^d = g^(bd)
- Computes M = k^r and N = g^r * j^y
- Sends Alice g^c, g^d, M and N

Alice:
- Computes j = (g^c)^a = g^(ac) and k = (g^d)^b = g^(bd)
- Computes P = k^s and Q = g^s * j^x
- Computes R = (Q / N)^b
- Sends Bob P, Q and R

Bob:
- Computes T = (Q / N)^d
- Computes Z = R^d
- Checks whether Z == (P / M)
- Sends Alice T

Alice:
- Computes Z = T^b
- Checks whether Z == (P / M)
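The exchange above can be sketched in Python with a toy group. The parameters here (a Mersenne prime modulus and g = 3) are illustrative only, not the group a real OTR implementation would use, and both roles run in one function rather than over a network. Note that Z == P/M reduces to g^(bd(s-r)) * j^(bd(x-y)) == g^(bd(s-r)), which holds exactly when x == y.

```python
# Toy sketch of the Socialist Millionaire Protocol steps above.
# x and y are Alice's and Bob's secrets; only x == y is revealed.
# Parameters are NOT cryptographically sound -- illustration only.
import random

p = 2 ** 127 - 1   # a Mersenne prime, used as a toy modulus
g = 3              # assumed generator for this sketch

def smp(x: int, y: int) -> bool:
    a, b, s = (random.randrange(2, p - 1) for _ in range(3))  # Alice
    c, d, r = (random.randrange(2, p - 1) for _ in range(3))  # Bob

    # Alice -> Bob: g^a, g^b ; Bob -> Alice: g^c, g^d
    ga, gb = pow(g, a, p), pow(g, b, p)
    gc, gd = pow(g, c, p), pow(g, d, p)

    # Both sides derive j = g^(ac) and k = g^(bd)
    j_bob, k_bob = pow(ga, c, p), pow(gb, d, p)
    j_ali, k_ali = pow(gc, a, p), pow(gd, b, p)

    # Bob: M = k^r, N = g^r * j^y
    M = pow(k_bob, r, p)
    N = (pow(g, r, p) * pow(j_bob, y, p)) % p
    # Alice: P = k^s, Q = g^s * j^x, R = (Q/N)^b
    P = pow(k_ali, s, p)
    Q = (pow(g, s, p) * pow(j_ali, x, p)) % p
    q_over_n = (Q * pow(N, -1, p)) % p   # modular inverse, Python 3.8+
    R = pow(q_over_n, b, p)

    # Bob: Z = R^d, check against P/M (Alice's T^b check is symmetric)
    Z = pow(R, d, p)
    p_over_m = (P * pow(M, -1, p)) % p
    return Z == p_over_m

print(smp(42, 42))  # True  -- secrets match
print(smp(42, 43))  # False -- secrets differ
```

Working through the algebra is a nice way to convince yourself that a mismatch leaves Z differing from P/M by j^(bd(x-y)), which neither party can cancel without knowing the other's secret.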
Thursday, April 26, 2012
Editing XML and PostgreSQL with ferris REST.
Notice that the read link is only offered on the XML files. This is because libferris tells the client that those are not natively "directories" but can be seen that way if you like.
I then simply click down to the "barry" XML element in the mounted XML file. Editing the barry entry writes the data back to the server asynchronously. The terminal is used to verify that the changes made it back.
The second browser tab shows a mounted PostgreSQL table which has ID and Message columns. If I select edit on the message column I get to update the tuple as desired. Notice that the data grid in the browser is updated when the data is saved back, as can be seen by watching the "message" column in the browser. I then use psql to verify the update from the command line.
Other interesting possibilities include mounted log files being split into columns in the web view, or grabbing some data from plasma on the server, streaming data from gstreamer or zoneminder, or anything else that ferris can do as a VFS. And as I tend to want ferris to mount it all™, the sky should be the limit! Heh, so I got the catchphrase in there, emes.
Monday, April 23, 2012
VFS in the Cloud? Libferris Web Interface...
The initial web interface is still fairly basic; the back and forward buttons are handled by the browser, leaving only the parent button in the app's toolbar. Home and Heart are your home dir and bookmarks respectively.
Clicking on a row allows arbitrary annotation of that file. The annotations are stored in either native kernel level Extended Attributes or RDF. A feature I find very useful is that all metadata is presented via the same interface. As you can see the "Annotation" column in the listview is showing your own description of each file. You can filter or sort on annotation just as you can on the file name.
The search page allows you to find files by their text content (full text search) and/or their metadata. As I've mentioned in the past, the metadata indexing modules include many optimizations over using the native index APIs directly. This includes indexed lookup for certain classes of regular expressions. Naive evaluation of a regular expression query results in linear time complexity, a full scan testing every entry. And unless you have written explicit code to handle it, you are likely to suffer that poor performance even with very advanced indexing libraries and databases.
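To illustrate the class of optimization being described (this is a hedged sketch, not libferris code): a regular expression with a literal prefix, such as report-2012.*\.pdf, can be answered with an indexed range scan over sorted names rather than testing every entry, because every possible match must share the literal prefix.

```python
# Sketch of indexed lookup for a prefix-anchored regular expression:
# use a sorted index to narrow the candidates to one contiguous range,
# then run the regex only over that range. The file names are made up.
import bisect
import re

names = sorted([
    "notes.txt",
    "report-2011-q4.pdf",
    "report-2012-q1.pdf",
    "report-2012-q2.pdf",
    "report-2013-q1.pdf",
])

def prefix_range_match(sorted_names, prefix, pattern):
    # Only names in [prefix, prefix + "\xff") can possibly match,
    # so the regex is never run against entries outside this slice.
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_left(sorted_names, prefix + "\xff")
    rx = re.compile(pattern)
    return [n for n in sorted_names[lo:hi] if rx.fullmatch(n)]

print(prefix_range_match(names, "report-2012", r"report-2012.*\.pdf"))
# Only 2 of the 5 names are ever tested against the regex
```

A regex without any literal prefix (e.g. .*\.pdf) cannot be narrowed this way, which is why only certain classes of regular expressions benefit from the indexed path.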
Tags or "emblems" etc are also handled through the same metadata interface. The tagging sidepanel offers suggestions for existing tags as you type and allows you to create new tags as you attach them to files. Removal of a tag is just a click away.
Of course, clicking a filename shows you the file itself over REST. This allows you to stream video files over REST to the Nokia n9, for example. There is partial IO support and write support, so I could include a fancy text editor or image editor component in there... Unfortunately YUI 3.5.0 doesn't seem to support "selections" in the datagrid, so that nice visual feedback will have to wait for YUI 3.6.0.