I found myself analyzing some apache log files the other day, and found myself wanting a SQL interface to the logs.
Now, it's possible to send the logs directly into a database, but that wasn't how the machine was configured.
This got me thinking. We have many different log formats, and many different sources of data. All of our tools seem to have different interfaces.
For example, why should I write regular expressions and shell pipelines to get at my logs?
Why do I have to learn a custom syntax ("rpm -qa --queryformat='various things'") for looking at the software I have installed? Why does e.g. the audit subsystem have its own query format?
Why can't I just use SQL, and write SELECT statements to drill down into all of this data?
So I started writing a SELECT statement for the command line.
I didn't want to use SELECT as caps are a recipe for RSI, and annoyingly, "select" is a bash builtin.
So I've picked "show" as the command; it doesn't seem to be taken by anything on my system. (it's a SQL command, but hopefully that's not going to be too confusing)
The idea is that it looks at the FROM part of the query, and looks up a data source based on this. For example, if you write
it figures out that you're looking at apache logs, looks up an appropriate backend, and makes a table "on the fly", so that it can run the query and give you the results:
(You have to either use quotes or escaping to deal with parentheses and * characters from the shell)
You can use filters using "WHERE":
If you specify more than one filename it adds a "filename" column:
Naturally, this isn't just for apache logs.
Here I'm querying the yum logs. The backend code deals with the changes that happened in the logging format, without me having to worry about this in my query:
I also wrote "rpm" and "proc" backends:
Looking in RPM database by vendor:
It got tiresome typing "*", and "from" all the time, so you can omit these:
There's a --format option, which can currently be used to emit tables in HTML format. Other formats could be written e.g. xml, json, yaml, odf spreadsheets, etc.
For bonus points, I wrote an ncurses UI for browsing tabular results. The command detects if stdout is connected to a tty, and if so, goes into the UI, otherwise it sends text, so you can use it for shell pipelines etc.
Here's a screenshot from

I'm already finding this useful.
I'm hoping to host this as a Fedora project. For now, you can grab the code from my Red Hat page here:
http://people.redhat.com/dmalcolm/s how-0.2-1.fc10.src.rpm
There's plenty of scope for writing new table backends for other data sources/file formats, improving the UI, writing new output formats etc Any ideas?
Now, it's possible to send the logs directly into a database, but that wasn't how the machine was configured.
This got me thinking. We have many different log formats, and many different sources of data. All of our tools seem to have different interfaces.
For example, why should I write regular expressions and shell pipelines to get at my logs?
Why do I have to learn a custom syntax ("rpm -qa --queryformat='various things'") for looking at the software I have installed? Why does e.g. the audit subsystem have its own query format?
Why can't I just use SQL, and write SELECT statements to drill down into all of this data?
So I started writing a SELECT statement for the command line.
I didn't want to use SELECT as caps are a recipe for RSI, and annoyingly, "select" is a bash builtin.
So I've picked "show" as the command; it doesn't seem to be taken by anything on my system. (it's a SQL command, but hopefully that's not going to be too confusing)
The idea is that it looks at the FROM part of the query, and looks up a data source based on this. For example, if you write
$ show host, "count(*)", "total(size)" from /var/log/httpd/*access_log* group by host;
it figures out that you're looking at apache logs, looks up an appropriate backend, and makes a table "on the fly", so that it can run the query and give you the results:
host|count(*)|total(size)|
---------+--------+-----------+
127.0.0.1| 10| 27679.0|
(You have to either use quotes or escaping to deal with parentheses and * characters from the shell)
You can use filters using "WHERE":
$ show distinct request from /var/log/httpd/*access_log* where status = 404 request | -------------------------+ GET /favicon.ico HTTP/1.1|
If you specify more than one filename it adds a "filename" column:
# show filename, "count(*)", "total(size)" from /var/log/httpd/*access_log* group by filename order by "total(size)" desc
filename|count(*)|total(size)|
-------------------------------+--------+-----------+
/var/log/httpd/ssl_access_log.4| 1921| 12824849.0|
/var/log/httpd/access_log.3| 222| 6207367.0|
/var/log/httpd/ssl_access_log.3| 741| 2210799.0|
/var/log/httpd/access_log.4| 268| 626711.0|
/var/log/httpd/ssl_access_log.1| 8| 13351.0|
/var/log/httpd/ssl_access_log.2| 5| 7305.0|
/var/log/httpd/access_log.2| 4| 6995.0|
/var/log/httpd/access_log.1| 2| 288.0|
Naturally, this isn't just for apache logs.
Here I'm querying the yum logs. The backend code deals with the changes that happened in the logging format, without me having to worry about this in my query:
[root@brick select]# show from /var/log/yum.log* where 'name like "kernel%"' limit 5
time| event| name| arch|epoch| version| release| filename|
---------------+---------+---------------+------+-----+---------+------------+----------------+
Feb 14 20:00:03|Installed|kernel-firmware|noarch| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 20:00:28| Updated| kernel-headers| i386| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 20:15:11|Installed| kernel-devel| i686| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 21:05:53|Installed| kernel| i686| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 21:12:41|Installed| kernel-PAE| i686| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
I also wrote "rpm" and "proc" backends:
[david@brick ~]$ show name, "count(*)" from rpm group by name having "count(*)>1"
name|count(*)|
-------------------------+--------+
gpg-pubkey| 4|
jakarta-commons-validator| 2|
java-1.6.0-openjdk-devel| 2|
kernel| 3|
kernel-devel| 2|
kernel-xen| 3|
libgnat| 2|
openssl| 2|
Looking in RPM database by vendor:
[david@brick select]$ show vendor, "count(*)" from rpm group by vendor
vendor|count(*)|
------------------+--------+
None| 12|
Fedora Project| 2042|
Koji| 2|
It got tiresome typing "*", and "from" all the time, so you can omit these:
# show rpm where release not like \'%fc%\' order by name limit 4
name|epoch|version|release| arch| vendor|
--------+-----+-------+-------+------+--------------+
MAKEDEV| None| 3.24| 1| i386|Fedora Project|
PyXML| None| 0.8.4| 10| i386|Fedora Project|
autofs| 1| 5.0.3| 36| i386|Fedora Project|
automake| None| 1.10.1| 2|noarch|Fedora Project|
There's a --format option, which can currently be used to emit tables in HTML format. Other formats could be written e.g. xml, json, yaml, odf spreadsheets, etc.
For bonus points, I wrote an ncurses UI for browsing tabular results. The command detects if stdout is connected to a tty, and if so, goes into the UI, otherwise it sends text, so you can use it for shell pipelines etc.
Here's a screenshot from
show rpm order by name
I'm already finding this useful.
I'm hoping to host this as a Fedora project. For now, you can grab the code from my Red Hat page here:
http://people.redhat.com/dmalcolm/s
There's plenty of scope for writing new table backends for other data sources/file formats, improving the UI, writing new output formats etc Any ideas?
23 comments | Leave a comment