You are viewing dmalcolm

 
 
23 June 2011 @ 05:53 pm
Static analysis of CPython extensions, using a new GCC plugin  
I've been looking at ways to improve the quality of Python extensions written in C.

CPython provides a great C API that makes it relatively easy to integrate C and C++ libraries with Python code. We use it extensively within Fedora - for example, Fedora's installation program is written in Python.

But you do have to write such code carefully:

  • you have to correctly keep track of reference counts in your objects. If you get this wrong, you can segfault the interpreter, or introduce a memory leak.

  • some APIs use a format string, with C variable-length arguments (see e.g. PyArg_ParseTuple and its variants). If the C compiler doesn't know the rules, it can't enforce type-safety. This can lead to people accidentally writing architecture-specific code (more on this below).

  • like any API, function calls can fail. This seems to be a universal rule of computer programming: it's tricky to correctly handle all the errors that can occur - bugs tend to lurk in the error-handling cases
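
The first and third pitfalls often combine: an early return on an error path that forgets to release a reference. Here is a minimal sketch of that pattern. It uses a toy reference-counted struct rather than the real CPython API, and `toy_object`, `toy_new`, `toy_decref`, `live_objects` and both `make_pair_sum` variants are hypothetical names invented for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-in for a reference-counted object; NOT CPython's PyObject. */
typedef struct {
    int refcount;
    int value;
} toy_object;

int live_objects = 0;  /* how many toy objects are currently allocated */

toy_object *toy_new(int value)
{
    toy_object *obj = malloc(sizeof *obj);
    if (obj == NULL)
        return NULL;       /* like any API call, allocation can fail */
    obj->refcount = 1;     /* the caller owns the initial reference */
    obj->value = value;
    live_objects++;
    return obj;
}

void toy_decref(toy_object *obj)
{
    assert(obj != NULL && obj->refcount > 0);
    if (--obj->refcount == 0) {
        live_objects--;
        free(obj);
    }
}

/* Buggy variant: when the second step fails, the reference to `a` leaks.
   `fail_second` simulates a failing API call. */
toy_object *make_pair_sum_leaky(int x, int y, int fail_second)
{
    toy_object *a = toy_new(x);
    if (a == NULL)
        return NULL;
    toy_object *b = fail_second ? NULL : toy_new(y);
    if (b == NULL)
        return NULL;       /* BUG: `a` is never released */
    toy_object *sum = toy_new(a->value + b->value);
    toy_decref(a);
    toy_decref(b);
    return sum;
}

/* Fixed variant: every exit path releases what it still owns. */
toy_object *make_pair_sum(int x, int y, int fail_second)
{
    toy_object *a = toy_new(x);
    if (a == NULL)
        return NULL;
    toy_object *b = fail_second ? NULL : toy_new(y);
    if (b == NULL) {
        toy_decref(a);     /* release `a` before bailing out */
        return NULL;
    }
    toy_object *sum = toy_new(a->value + b->value);
    toy_decref(a);
    toy_decref(b);
    return sum;
}
```

Calling `make_pair_sum_leaky(1, 2, 1)` returns NULL but leaves `live_objects` one higher than before, while `make_pair_sum(1, 2, 1)` leaves it unchanged. The two functions differ by a single line, which is why this kind of bug is easy to miss in review.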



I want to make it easier for people to write correct Python extension code, so I've been looking at static analysis.

None of the existing tools seemed to do exactly what I wanted, and given that all of my work is done with GCC, I wanted a solution that was well integrated with GCC. I also wanted to be able to use Python itself to work on the tool. (I attempted some of this a while back with Coccinelle, but I use GCC, so I wanted to embed the checking directly into GCC).

So I've written a GCC plugin that embeds Python within that compiler. This means that it's now possible to write new C and C++ compilation passes in Python, and use Python packages for things like syntax-highlighting, visualization, and so on.

That's the theory, anyway. The code is still fairly new, so I've only wrapped a small subset of GCC's types and APIs.

I've started using this to write a static analyser for CPython extension code.

Here's an example of what it can do so far...

Given this fragment of C code:
    24	
    25	PyObject *
    26	socket_htons(PyObject *self, PyObject *args)
    27	{
    28	    unsigned long x1, x2;
    29	
    30	    if (!PyArg_ParseTuple(args, "i:htons", &x1)) {
    31	        return NULL;
    32	    }
    33	    x2 = (int)htons((short)x1);
    34	    return PyInt_FromLong(x2);
    35	}
    36	


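there's a bug: at line 30, the "i" code to PyArg_ParseTuple signifies an "int", but it's being passed an "unsigned long" from line 28 (via a pointer) to write back its result to. Only 4 of the variable's 8 bytes get written. On a big-endian 64-bit CPU those 4 bytes are the most significant half, so the (short) cast at line 33 reads bytes that were never initialized, and this will break badly.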
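
The effect can be reproduced without Python at all. The sketch below is an illustration, not code from the plugin or from CPython: `store_int32` is a hypothetical stand-in for what a parser does with the "i" code (it writes exactly 4 bytes), `uint64_t` stands in for a 64-bit `unsigned long`, and the `garbage` parameter plays the role of whatever happened to be on the stack.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the "i" format code: writes exactly 4 bytes
   at the destination, whatever the destination's real type is. */
static void store_int32(void *dest, int32_t value)
{
    assert(dest != NULL);
    memcpy(dest, &value, sizeof value);
}

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int is_little_endian(void)
{
    uint16_t probe = 1;
    unsigned char first_byte;
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Mimics lines 28-33 of the fragment: a 64-bit variable that starts out
   holding `garbage` receives a 4-byte write and is then truncated to
   16 bits, as the (short) cast does. */
uint16_t low16_after_narrow_write(int32_t value, uint64_t garbage)
{
    uint64_t x1 = garbage;     /* uninitialized in the original code */
    store_int32(&x1, value);   /* only half of x1 is overwritten */
    return (uint16_t)x1;       /* low 16 bits, like (short)x1 */
}
```

On a little-endian host the 4 written bytes are the low half of `x1`, so `low16_after_narrow_write(80, garbage)` gives 80 whatever the garbage is, and the bug stays hidden. On a big-endian host the write lands in the high half, and the result is the low 16 bits of the garbage instead.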

First of all, we can use the Python support in the compiler to visualize the code:
[david@fedora-15 gcc-python]$ ./gcc-with-python show-ssa.py -I/usr/include/python2.7 demo.c


Here's the output. This visualization shows the basic blocks of code, with source code on the left, interleaved with GCC's internal representation on the right:
[Image: SVG rendering of the control-flow graph of the given function]

(If you're wondering what the "PHI<>" functions mean in the above, this is actually showing the SSA representation after some of GCC's analysis and optimization passes have already happened.)

Given that this is Python, it's really easy to write new visualizations.

I've also written the first new compiler warnings using the Python plugin.

Here's the output from compiling that C code using my "cpychecker.py" script to add new warnings:
[david@fedora-15 gcc-python]$ ./gcc-with-python cpychecker.py $(python-config --cflags) demo.c 
demo.c: In function ‘socket_htons’:
demo.c:30:26: error: Mismatching type in call to PyArg_ParseTuple with format code "i:htons" [-fpermissive]
  argument 3 ("&x1") had type "long unsigned int *" (pointing to 64 bits)
  but was expecting "int *" (pointing to 32 bits) for format code "i"
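
The comparison behind a diagnostic like this can be sketched in a few lines of C. This is a toy under stated assumptions, not how cpychecker.py is implemented (that is a Python script running inside GCC): it knows only a handful of format codes, it compares pointee sizes rather than full types, and the names `expected_size`, `first_mismatch`, `FMT_OK` and `FMT_UNSUPPORTED` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { FMT_OK = -1, FMT_UNSUPPORTED = -2 };

/* Size in bytes that a few PyArg_ParseTuple format codes write through
   their pointer argument; 0 for codes this toy does not know. */
static size_t expected_size(char code)
{
    switch (code) {
    case 'h': return sizeof(short);
    case 'H': return sizeof(unsigned short);
    case 'i': return sizeof(int);
    case 'I': return sizeof(unsigned int);
    case 'l': return sizeof(long);
    case 'k': return sizeof(unsigned long);
    case 'f': return sizeof(float);
    case 'd': return sizeof(double);
    default:  return 0;
    }
}

/* Walks the format string and compares each code against the size of the
   type the caller actually passed a pointer to.
   Returns the 0-based index of the first mismatching output argument,
   FMT_OK if everything matches, or FMT_UNSUPPORTED if the format uses a
   code outside the table or the argument count does not line up. */
int first_mismatch(const char *fmt, const size_t *actual_sizes, size_t count)
{
    assert(fmt != NULL);
    size_t arg = 0;
    /* ':' and ';' end the list of codes (function name / error message) */
    for (const char *p = fmt; *p != '\0' && *p != ':' && *p != ';'; p++) {
        if (*p == '|')
            continue;          /* marks later arguments as optional */
        size_t want = expected_size(*p);
        if (want == 0 || arg >= count)
            return FMT_UNSUPPORTED;
        if (actual_sizes[arg] != want)
            return (int)arg;
        arg++;
    }
    return arg == count ? FMT_OK : FMT_UNSUPPORTED;
}
```

For the fragment above, the format is "i:htons" and the one output argument points to an 8-byte variable on a typical 64-bit Linux system, so `first_mismatch` would report index 0. A size-only check like this misses cases such as "i" with an `unsigned int *`, which is one reason a real checker wants the compiler's full type information.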


I've tried to make the new error message readable, containing as much information as possible.

Any ideas on how to improve this?

I'm now working frantically on implementing reference-count checking :)

I hope that I'll be able to get this into a working state in time for Fedora 16: I'd like to run all of the C Python extension code in the Fedora distribution through a checker, but I need to do a lot of polishing before it's ready!

The code is free software (GPLv3 or later), and you can grab it from this git repository:

http://git.fedorahosted.org/git/?p=gcc-python-plugin.git;a=summary

I'm using this Trac instance for bug tracking:

https://fedorahosted.org/gcc-python-plugin/

Anyone got ideas for other uses for this? Visualizations of code? New compiler warnings? Remember, this thing's built on top of GCC, so (in theory) it can handle anything that GCC can handle e.g. C++ templates, Java, Fortran, and so on.

If you want to get involved, or want more information, there's a mailing list here:

https://fedorahosted.org/mailman/listinfo/gcc-python-plugin

Thanks to Red Hat for supporting the development of this software! (and for general awesomeness); thanks also to Read the Docs for providing a nifty hosting service for free software API documentation.
( 5 comments — Leave a comment )
ncoghlan on June 24th, 2011 01:15 pm (UTC)
How does it fare when compiling CPython?
Sounds quite nifty - my first thought was to wonder whether or not it could pick up any lingering argument processing bugs in CPython itself.
dmalcolm on June 24th, 2011 03:39 pm (UTC)
Re: How does it fare when compiling CPython?
Good idea. I'm having a look at that now.
Jeff Darcy (Obdurodon) on June 24th, 2011 01:34 pm (UTC)
That is just so awesome
A long time ago, I wrote a tool to do C code transformations using the output from gcc -fdump-translation-unit. It was very tedious to parse that output (even though I was doing it in Python) which also turned out to be slightly ambiguous and incomplete. I was never very happy with the result. This looks like a much more robust and powerful way of doing similar things.
dmalcolm on June 24th, 2011 02:16 pm (UTC)
Re: That is just so awesome
Thanks! Yeah, it's much easier to go in and use GCC's own data structures, rather than have to try to parse things and reconstruct them.
crazyhackerdude on June 30th, 2011 08:13 pm (UTC)
Hey,
if you are ever in Portland (perhaps for OSCON?), ping me. I did something similar with Treehydra for Mozilla, would love to share experiences.

Taras Glek