22 March 2009 @ 07:31 pm
SQL for the command line: "show"  
I was analyzing some Apache log files the other day, and found myself wanting a SQL interface to the logs.

Now, it's possible to send the logs directly into a database, but that wasn't how the machine was configured.

This got me thinking. We have many different log formats, and many different sources of data. All of our tools seem to have different interfaces.

For example, why should I write regular expressions and shell pipelines to get at my logs?
Why do I have to learn a custom syntax ("rpm -qa --queryformat='various things'") for looking at the software I have installed? Why does e.g. the audit subsystem have its own query format?

Why can't I just use SQL, and write SELECT statements to drill down into all of this data?

So I started writing a SELECT statement for the command line.

I didn't want to use SELECT as caps are a recipe for RSI, and annoyingly, "select" is a bash builtin.

So I've picked "show" as the command; it doesn't seem to be taken by anything on my system. (it's a SQL command, but hopefully that's not going to be too confusing)

The idea is that it looks at the FROM part of the query, and looks up a data source based on this. For example, if you write

$ show host, "count(*)", "total(size)" from /var/log/httpd/*access_log* group by host;

it figures out that you're looking at Apache logs, looks up an appropriate backend, and makes a table "on the fly", so that it can run the query and give you the results:

     host|count(*)|total(size)|
---------+--------+-----------+
127.0.0.1|      10|    27679.0|


(You have to use either quotes or escaping to protect parentheses and * characters from the shell.)
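
Under the hood, the approach is straightforward: parse the files into rows, stuff the rows into a throwaway in-memory sqlite table built on the fly, and run the query against that. Here's a stripped-down sketch of the idea in Python; it is not the actual code, and all of the names in it are made up:

import re
import sqlite3
from glob import glob

# Sketch only: parse Apache common-log-format lines into rows, load
# them into an in-memory sqlite table, and run the query against it.
APACHE_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)"'
    r' (?P<status>\d{3}) (?P<size>\d+|-)')

def parse_access_log(path):
    with open(path) as f:
        for line in f:
            m = APACHE_LINE.match(line)
            if m:
                size = 0 if m.group('size') == '-' else int(m.group('size'))
                yield (m.group('host'), m.group('request'),
                       int(m.group('status')), size)

def query_access_logs(pattern, sql):
    db = sqlite3.connect(':memory:')
    db.execute('CREATE TABLE log '
               '(host TEXT, request TEXT, status INTEGER, size INTEGER)')
    for path in glob(pattern):
        db.executemany('INSERT INTO log VALUES (?, ?, ?, ?)',
                       parse_access_log(path))
    return db.execute(sql).fetchall()

# Roughly what the example above boils down to:
# query_access_logs('/var/log/httpd/*access_log*',
#                   'SELECT host, count(*), total(size) '
#                   'FROM log GROUP BY host')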

You can filter using "WHERE":

$ show distinct request from /var/log/httpd/*access_log* where status = 404
request                  |
-------------------------+
GET /favicon.ico HTTP/1.1|


If you specify more than one filename, it adds a "filename" column:

# show filename, "count(*)", "total(size)" from /var/log/httpd/*access_log* group by filename order by "total(size)" desc
                       filename|count(*)|total(size)|
-------------------------------+--------+-----------+
/var/log/httpd/ssl_access_log.4|    1921| 12824849.0|
    /var/log/httpd/access_log.3|     222|  6207367.0|
/var/log/httpd/ssl_access_log.3|     741|  2210799.0|
    /var/log/httpd/access_log.4|     268|   626711.0|
/var/log/httpd/ssl_access_log.1|       8|    13351.0|
/var/log/httpd/ssl_access_log.2|       5|     7305.0|
    /var/log/httpd/access_log.2|       4|     6995.0|
    /var/log/httpd/access_log.1|       2|      288.0|

Naturally, this isn't just for Apache logs.

Here I'm querying the yum logs. The backend code deals with the changes in the logging format over time, so I don't have to worry about them in my query:

[root@brick select]# show from /var/log/yum.log* where 'name like "kernel%"' limit 5
           time|    event|           name|  arch|epoch|  version|     release|        filename|
---------------+---------+---------------+------+-----+---------+------------+----------------+
Feb 14 20:00:03|Installed|kernel-firmware|noarch| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 20:00:28|  Updated| kernel-headers|  i386| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 20:15:11|Installed|   kernel-devel|  i686| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 21:05:53|Installed|         kernel|  i686| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
Feb 14 21:12:41|Installed|     kernel-PAE|  i686| None|2.6.27.12|170.2.5.fc10|/var/log/yum.log|
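
The trick is simply to try each known line format in turn, and map them all onto the same columns. A rough sketch of that idea; the patterns below are illustrative guesses, not a faithful record of how yum.log's format actually changed:

import re

# Sketch only: hide log-format changes behind one table schema by
# trying each known line format in turn.
YUM_FORMATS = [
    # e.g. "Feb 14 20:00:28 Updated: kernel-headers.i386 2.6.27.12-170.2.5.fc10"
    re.compile(r'(?P<time>\w{3} [ \d]\d [\d:]{8}) (?P<event>\w+): '
               r'(?P<name>.+)\.(?P<arch>\S+) (?P<version>[^-]+)-(?P<release>\S+)$'),
    # e.g. a hypothetical older style: "Feb 14 20:00:28 Erased: kernel-headers"
    re.compile(r'(?P<time>\w{3} [ \d]\d [\d:]{8}) (?P<event>\w+): (?P<name>\S+)$'),
]

def parse_yum_line(line):
    """Return a dict of columns from whichever format matches, else None."""
    for fmt in YUM_FORMATS:
        m = fmt.match(line.strip())
        if m:
            return m.groupdict()
    return None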

I also wrote "rpm" and "proc" backends:

[david@brick ~]$ show name, "count(*)" from rpm group by name having "count(*)>1"
                     name|count(*)|
-------------------------+--------+
               gpg-pubkey|       4|
jakarta-commons-validator|       2|
 java-1.6.0-openjdk-devel|       2|
                   kernel|       3|
             kernel-devel|       2|
               kernel-xen|       3|
                  libgnat|       2|
                  openssl|       2|


Looking in the RPM database by vendor:

[david@brick select]$ show vendor, "count(*)" from rpm group by vendor
            vendor|count(*)|
------------------+--------+
              None|      12|
    Fedora Project|    2042|
              Koji|       2|


It got tiresome typing "*" and "from" all the time, so you can omit them:

# show rpm where release not like \'%fc%\' order by name limit 4
    name|epoch|version|release|  arch|        vendor|
--------+-----+-------+-------+------+--------------+
 MAKEDEV| None|   3.24|      1|  i386|Fedora Project|
   PyXML| None|  0.8.4|     10|  i386|Fedora Project|
  autofs|    1|  5.0.3|     36|  i386|Fedora Project|
automake| None| 1.10.1|      2|noarch|Fedora Project|


There's a --format option, which can currently be used to emit tables in HTML format. Other formats could be added, e.g. XML, JSON, YAML, ODF spreadsheets, etc.
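
A new output format is essentially just a function from headers and rows to text. For instance, a hypothetical JSON formatter might look like this (made-up names, not the tool's real internals):

import json

# Sketch of a pluggable output format: a formatter maps
# (headers, rows) to a string.
def format_json(headers, rows):
    return json.dumps([dict(zip(headers, row)) for row in rows], indent=2)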

For bonus points, I wrote an ncurses UI for browsing tabular results. The command detects whether stdout is connected to a tty: if so, it goes into the UI; otherwise it emits plain text, so you can use it in shell pipelines etc.
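
The check itself is the standard isatty() test; a minimal sketch of the dispatch, with a trivial stand-in for the real browser:

import curses
import sys

def write_plain_text(rows):
    # Plain text on stdout: safe for pipelines and redirection.
    for row in rows:
        print('|'.join(str(col) for col in row))

def browse(stdscr, rows):
    # Stand-in for the real browser: paint one screenful and wait
    # for a key; the real UI scrolls, handles input, etc.
    for y, row in enumerate(rows[:curses.LINES - 1]):
        stdscr.addstr(y, 0, '|'.join(str(col) for col in row)[:curses.COLS - 1])
    stdscr.getch()

def emit(rows):
    if sys.stdout.isatty():
        curses.wrapper(browse, rows)   # interactive ncurses viewer
    else:
        write_plain_text(rows)
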
Here's a screenshot of the ncurses UI browsing the results of "show rpm order by name":

[screenshot of the ncurses table browser]


I'm already finding this useful.

I'm hoping to host this as a Fedora project. For now, you can grab the code from my Red Hat page here:
http://people.redhat.com/dmalcolm/show-0.2-1.fc10.src.rpm

There's plenty of scope for writing new table backends for other data sources/file formats, improving the UI, writing new output formats, etc. Any ideas?
 
 
( 23 comments )
He Who Burbles (necaris) on March 23rd, 2009 10:09 am (UTC)
That is one of the most amazingly cool things I have ever seen!

Would there be much of a point in having a kernel or X11 log backend? So that one could, for instance, look at the boot logs for all the machines on the network with a view to finding failing components / optimizing?
dmalcolm on March 23rd, 2009 03:02 pm (UTC)
Thanks!

Generalizing it to multiple machines would be interesting. "func" may be the best way to do this, as another commenter suggested, with a "host" column.

Any ideas on how you'd want the generated tables to look?
What might one of your queries look like?

I'm thinking that most logs might simply be two columns, "timestamp" and "message", which would let you merge different logs on one machine into one timeline view quite easily.
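
(Hypothetical syntax, but something like

$ show time, message from /var/log/messages /var/log/secure order by time

could then interleave both logs into a single timeline.)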
ctyler on March 23rd, 2009 11:38 am (UTC)
Very cool -- would be nice to use with func
This is a great idea!

It would be cool to run this on top of func, so you could query across a number of machines (i.e., have a 'host' column automatically appended).
dmalcolm on March 23rd, 2009 03:11 pm (UTC)
Re: Very cool -- would be nice to use with func
Thanks; good idea.

I've only briefly played with Func, but it sounds promising.

What should it look like?

This could be done in several different ways:
(a) "show" could call func to get its data
(b) func could call "show"
(c) a bit of both

(a) might look like:
$ show hardware, host from func \*.example.org hardware.info order by hardware

(b) use "show" to implement a func module:
$ func "*.example.org" call show.logs

(c) $ show time, message from "func(*.example.org, show.logs)" order by time
to show a timeline of all logs on all machines

Obviously I'm just making up syntax at this point. Ideas?
ctyler on March 25th, 2009 01:35 pm (UTC)
Re: Very cool -- would be nice to use with func
What if func was invoked behind-the-scenes automatically if a host (or perhaps "funchost", to avoid collisions) column was accessed, with the default func host selection being "*"?

show vendor, funchost, "count(*)" from rpm group by vendor, funchost

show distinct request from /var/log/httpd/*access_log* where status = 404 and funchost = "*.webfarm.example.com"

(Yes, I realize that func host globs and sql wildcards don't really mix, but this is probably a practical middle-ground).
Flewellyn (flewellyn) on March 23rd, 2009 04:31 pm (UTC)
Interesting idea!

Have you considered how you might handle complex queries, subqueries, joins, and the like?
dmalcolm on March 23rd, 2009 04:51 pm (UTC)
Not yet... I wanted to get something useful up and running before tackling more advanced SQL.

Right now, both the parser and the internal query generators are messy proof-of-concept hacks, and would need cleaning up a lot before the tool could handle the stuff you mention.

What kind of ideas do you have? Joining the result from a func query against a text file perhaps?
Flewellyn (flewellyn) on March 23rd, 2009 05:36 pm (UTC)
Or one text file against another, given certain predicates. Perhaps then taking that "table" and joining it against a third. Or, subqueries to build up result tables.

That sort of thing.
bazzargh123 on April 8th, 2009 10:14 am (UTC)
Years ago, I built a command-line tool to query/join across multiple files, but it was a trivial wrapper for Perl's DBD::CSV; I was just throwing files into CSV if I needed to do more than a one-liner on them. Not as neat as your tool by a long shot!

However, DBD::AnyData is much closer to what you're doing (i.e. it supports SQL against any kind of data file), and supports joins etc. Maybe worth taking a look at it?

http://search.cpan.org/~jzucker/DBD-AnyData-0.09/AnyData.pm
rathann (rathann) on March 23rd, 2009 05:37 pm (UTC)
small spec file comment
You don't have to manually install the README

install -m 644 README $RPM_BUILD_ROOT/%{_docdir}/%{name}-%{version}


It's enough to list it in %doc:

%files
%defattr(-,root,root,-)
%doc README

and rpmbuild will take care of it automatically.
dmalcolm on March 25th, 2009 01:50 am (UTC)
Re: small spec file comment
Thanks; I've followed your suggestion (see https://fedorahosted.org/show/changeset/fba0a93ba40975d8bd072f04b7b6afb41a4621e3#file1 for the gory details).
Erling Jacobsen (linuxcub) on March 24th, 2009 09:05 am (UTC)
It would perhaps be even more powerful to have ODBC drivers (or whatever they're called) for the different backends (logfiles, rpm etc.) ...
dmalcolm on March 25th, 2009 01:55 am (UTC)
ODBC drivers suggest some kind of daemon to me; at this stage, this is simply a command-line tool. I guess if you're expressing data sources as URLs you could have a "show" method.

One approach is to dump the resulting data as a sqlite db or as SQL statements; you could then run further queries on it; throw it into a real database server etc.
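
With a recent Python's sqlite3 module, both are easy; a minimal sketch, assuming the results already live in an in-memory connection (the function name is made up):

import sqlite3

def dump_results(db, path):
    # Write an in-memory database out either as a stream of SQL
    # statements or as a standalone sqlite file, depending on the
    # target filename.
    if path.endswith('.sql'):
        with open(path, 'w') as f:
            f.write('\n'.join(db.iterdump()))
    else:
        disk = sqlite3.connect(path)
        disk.executescript('\n'.join(db.iterdump()))
        disk.close()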
Steve Kemp (skx) on March 24th, 2009 09:39 pm (UTC)
I like the way that you manage different log formats.

My own tool only copes with Apache's common logfile format, and it hadn't occurred to me to be more flexible than that.

In terms of other targets I think two useful formats to query would be syslog and dmesg. Syslog would give you:

* Priority (int/char)
* Daemon (char)
* PID (int)
* Message (char)
dmalcolm on March 25th, 2009 02:04 am (UTC)
Thanks for the link. I hadn't seen your tool. Some similar ideas, done in Perl. Looks nice, though I've only read the sources/docs rather than used it.

The two approaches are similar. There are several phases for both tools:
- the user specifies some data sources
- the program grabs the data and stuffs it into an in-RAM sqlite db
- a query is executed on the db
- the results are rendered

I try to have all data specified on the command line, to enable the program to be used for one-liners at the terminal. Your approach keeps a SQL-like UI around for repeated queries. I've had vague ideas of doing that in the ncurses UI. The other thing to do is to have an option that dumps the sqlite file, and then have the tool load sqlite files, so that you've got roundtripping.

I took the syslog idea and implemented it:
https://fedorahosted.org/show/changeset/eab0782b57a637e440e2f8746afe9e5e51454b58

BTW, I hope that my code is broadly usable on Debian systems. Have I introduced any Fedora-isms?
Steve Kemp (skx) on March 25th, 2009 09:38 am (UTC)
Like you, I read the code more than used the tool, but nothing jumped out at me as being non-portable.

My tool does allow a limited amount of scripting via command-line queries, e.g.:

asql -e "SELECT ...."

But yes it is much more of an interactive one than yours.

Nice to see the syslog support!
piavlo on March 25th, 2009 10:27 am (UTC)
Is there a public svn repository for the utility?
dmalcolm on March 25th, 2009 11:50 pm (UTC)
Public git repo
mab99 on January 4th, 2010 02:13 pm (UTC)
Re: Public git repo
Dear Malcolm,
Is this repository, http://git.fedorahosted.org/git/show.git, still alive? In fact I tried to clone it, without success. The show.git repo does not exist.

Thanks a lot for your help, indeed.

KR,

MAB
dmalcolm on February 22nd, 2010 04:45 pm (UTC)
Re: Public git repo
Sorry for the belated response.

"show" was regarded by some as too generic, so I renamed the project to "squeal". The new homepage is here:
https://fedorahosted.org/squeal/

Info on the git repository here:
https://fedorahosted.org/squeal/#Gettingthecode
piavlo on March 25th, 2009 11:29 am (UTC)
I've installed show as a usual Python module but I get:

Traceback (most recent call last):
  File "/usr/bin/show.py", line 98, in <module>
    import show.table
  File "/usr/bin/show.py", line 98, in <module>
    import show.table
ImportError: No module named table
corydodt on March 26th, 2009 06:50 pm (UTC)
"describe"
How about something like "describe", i.e. introspection/inspection of logs? Just tell me what columns are available for selection.
rhillegas on April 2nd, 2009 06:41 pm (UTC)
SQL queries against XML log files
This is pretty cool. Another approach for running SQL queries against log files is to use Derby table functions. Table functions give you the full expressive power of SQL, including aggregates and subqueries. An example of how to run SQL against Apache log files can be found in the Saucer Separation talk given at ApacheCon in 2007: http://people.apache.org/~rhillegas/vtiDemo/doc/saucerSeparation.html To get to the relevant example, click on the third icon in the bottom left panel (Log Files), then click on the feather icon (XML Log Files). For further information about table functions, see the white paper posted here: http://developers.sun.com/javadb/reference/whitepapers/index.jsp

Cheers!