Autodetecting reference-counting bugs: the video!

I gave a talk at PyCon 2012 about my Python plugin for GCC - how this lowers the barrier to entry for potential GCC hackers, and how I've been using it to find reference-counting errors in Python C extension modules.

A video of the talk can be seen here:
http://pyvideo.org/video/648/static-analysis-of-python-extension-modules-using
or on YouTube here.

The slides are here.

We had a mini-sprint after the talk, and another later on in the main sprints, covering these topics:
  • getting the plugin to build on OS X (using MacPorts' build of gcc-4.6.1): this works, but needed some compatibility patching for differences between glibc and OS X's libc (and for case-sensitivity of filenames).
  • improving the static analysis engine: currently it takes the simplistic approach of trying to generate all traces of execution through the function as one big tree, which suffers from exponential explosion.  I'm hoping the internals can be reworked into an iterative solver that can handle loops more robustly (data-flow equations)
  • improving the HTML error reports that the analyzer generates.  More on this to follow!
I've also finally succumbed to the inevitable and joined github: you can fork the plugin here (this is just a clone I keep synchronized with the Fedora Hosted repository)

Adding a spellchecking pass to GCC, using Python

LWN recently posted an interesting article on writing GCC plugins, showing how to add a spellchecking pass to the compiler.

I thought this was a neat example, so I had a go at porting it from C to use my Python plugin for GCC.

The result is about 35 lines of Python code (including comments), which you can see here.

I hope the code is easy to read.  One other thing this demonstrates is that by extending the compiler in Python you gain access to the whole of the Python packaging ecosystem - I was able to use the "enchant" spellchecking library (originally from the AbiWord project) with about 4 lines of code.

Automatically detecting reference-count bugs in Python extension modules

[ For the tl;dr version, scroll down to see the pretty screenshots :) ]

I've been working on a static analysis tool to automatically detect reference-count bugs in C Python extension code.

(see my earlier posts on verifying calls to the PyArg_ParseTuple API here and here)

Mismanaging reference counts can lead to the Python process leaking memory (and other resources) when an object's count stays too high and it becomes immortal, or to segfaults when an object is cleaned up while other things still refer to it.

My "cpychecker" code is still an early prototype (don't expect to use it on arbitrary C code yet), but here's an example of some of the things it's already capable of:

Can you see the reference-counting error in this (contrived) code fragment?
    22	PyObject *
    23	refcount_demo(PyObject *self, PyObject *args)
    24	{
    25	    PyObject *list;
    26	    PyObject *item;
    27	    list = PyList_New(1);
    28	    if (!list)
    29	        return NULL;
    30	
    31	    item = PyLong_FromLong(42);
    32	    if (!item)
    33	        return NULL;
    34	
    35	    PyList_SetItem(list, 0, item);
    36	    return list;
    37	}
    38	
    39	static PyMethodDef test_methods[] = {
    40	    {"refcount_demo",  refcount_demo, METH_VARARGS, NULL},
    41	    {NULL, NULL, 0, NULL} /* Sentinel */
    42	};


Compiling like this:
  [david@fedora-15 gcc-plugin]$ ./gcc-with-python cpychecker.py -I/usr/include/python2.7 refcount-demo.c


the checker adds this output to gcc's:
refcount-demo.c: In function ‘refcount_demo’:
refcount-demo.c:37:1: error: ob_refcnt of PyListObject is 1 too high
refcount-demo.c:27:10: note: PyListObject allocated at:     list = PyList_New(1);
refcount-demo.c:27:10: note: when PyList_New() succeeds at:     list = PyList_New(1);
refcount-demo.c:28:8: note: when taking False path at:     if (!list)
refcount-demo.c:31:10: note: reaching:     item = PyLong_FromLong(42);
refcount-demo.c:31:10: note: when PyLong_FromLong() fails at:     item = PyLong_FromLong(42);
refcount-demo.c:32:8: note: when taking True path at:     if (!item)
refcount-demo.c:33:9: note: reaching:         return NULL;
refcount-demo.c:37:1: note: when returning


which can be navigated in any IDE that can parse GCC's output messages (works for me in emacs).

This demonstrates a particular path of execution that has a bug.
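The trace is telling us that on the path where PyLong_FromLong() fails, the function returns NULL while still holding the reference from PyList_New(), so the list leaks. Here's a minimal sketch of one possible fix (the function name is mine, for illustration):

    PyObject *
    refcount_demo_fixed(PyObject *self, PyObject *args)
    {
        PyObject *list;
        PyObject *item;

        list = PyList_New(1);
        if (!list)
            return NULL;

        item = PyLong_FromLong(42);
        if (!item) {
            /* On this error path we must drop our reference to the new
               list, otherwise it becomes immortal and leaks: */
            Py_DECREF(list);
            return NULL;
        }

        /* PyList_SetItem steals the reference to "item", so nothing more
           is needed on the success path: */
        PyList_SetItem(list, 0, item);
        return list;
    }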

I found the textual output a bit heavy on the eye, so I've hacked up the plugin script so it can render graphical HTML visualizations of the errors that it finds.

Here's that same report, in HTML form:



The report shows the control flow through the function: lines that get executed are written in bold and outlined in blue, with arrows connecting them, and additional annotations in italics. (I'm not so good at HTML/CSS, so help here would be most welcome!).

I used the jsplumb JavaScript library to add lines to the HTML to link together elements. This uses the newish <canvas> element, so the control-flow lines may only appear in recent browsers. It works for me in Chromium 12 and Firefox 4. You can see the HTML report itself here:
http://fedorapeople.org/~dmalcolm/blog/2011-07-15/refcount_demo-refcount-errors.html

(Currently it's hardcoded to generate the reports, but I'll probably add something like a -fdump-html command-line option to the gcc-with-cpychecker harness).

Here are some more examples:

Detecting the all-too-common: "return Py_None;" bug:

As HTML: http://fedorapeople.org/~dmalcolm/blog/2011-07-15/losing_refcnt_of_none-refcount-errors.html
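For reference, the underlying bug pattern (my own minimal example here, not necessarily the exact code in the linked report) is returning Py_None without taking a reference to it:

    /* Buggy: returns a borrowed reference to None, so None's ob_refcnt
       ends up one too low each time this is called: */
    static PyObject *
    do_nothing_buggy(PyObject *self, PyObject *args)
    {
        return Py_None;
    }

    /* Fixed: take a new reference before returning; Py_RETURN_NONE is
       shorthand for Py_INCREF(Py_None); return Py_None; */
    static PyObject *
    do_nothing_fixed(PyObject *self, PyObject *args)
    {
        Py_RETURN_NONE;
    }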

Another (very contrived) reference leak:

As HTML: http://fedorapeople.org/~dmalcolm/blog/2011-07-15/object_leak-refcount-errors.html

Detecting a stray Py_INCREF that makes the reference count too high, or segfaults python, depending on what happened earlier:

As HTML: http://fedorapeople.org/~dmalcolm/blog/2011-07-15/too_many_increfs-refcount-errors.html

This is still an experimental prototype, so it's not yet ready for general purpose use, but I'm frantically working on it, and I hope it will be ready in time for inclusion in Fedora 16.

The checker is Free Software (licensed under GPLv3 or later), and if you want to get involved, go to https://fedorahosted.org/pipermail/gcc-python-plugin/ (as I said above, I could really use some help with the HTML and CSS! The checker itself is written in Python, if you're interested in hacking on the code).

(Thanks to Red Hat for allowing me to spend a substantial proportion of my $DAYJOB on this)

Verifying the more awkward parts of the CPython API

I've been running my Python extension module static analyser over CPython itself (the latest in the 2.7 hg branch, specifically).

I'm pleased to say that the project's mailing list received the first patch to the checker from someone other than me (thanks Tom!) - he's been running it over the sources of gdb (which embeds Python).

These make for good torture tests for the analyser, and I'm pleased with how well it has survived. Some bugs do remain - in the checker, that is.

There are quite a few places where CPython calls the PyArg_ APIs with a format code expecting a "const char*" but passes in a "char*". I think the ones in CPython are all false positives, so I think we're going to need to make that check configurable.

Perhaps the most fiddly part of checking this API is the "O&" conversion code - I wasn't able to handle this in my previous Coccinelle-based approach to this problem.

Here's an example:
    68	
    69	extern int convert_to_ssize(PyObject *, Py_ssize_t *);
    70	
    71	PyObject *
    72	buggy_converter(PyObject *self, PyObject *args)
    73	{
    74	    int i;
    75	
    76	    if (!PyArg_ParseTuple(args, "O&", convert_to_ssize, &i)) {
    77	        return NULL;
    78	    }
    79	
    80	    Py_RETURN_NONE;
    81	}
    82	


The idea is that you supply a conversion callback, which extracts a value from the Python object and writes it back through the next argument.
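For example, a conversion callback matching the convert_to_ssize declaration in the snippet above might look something like this (a hypothetical body, just to illustrate the convention: return non-zero on success, 0 on failure with an exception set):

    int
    convert_to_ssize(PyObject *obj, Py_ssize_t *result)
    {
        /* PyArg_ParseTuple passes us the PyObject* from the argument
           tuple, plus the address that followed "O&" in the call: */
        Py_ssize_t value = PyNumber_AsSsize_t(obj, PyExc_OverflowError);
        if (value == -1 && PyErr_Occurred())
            return 0;    /* failure: exception already set */
        *result = value;
        return 1;        /* success */
    }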

The above example has a bug (can you see it?)

After a fair amount of coding today, the checker is now able to detect it:

[david@fedora-15 gcc-python]$ ./gcc-with-python cpychecker.py $(python-config --cflags) demo.c
demo.c: In function ‘buggy_converter’:
demo.c:76:26: error: Mismatching type in call to PyArg_ParseTuple with format code "O&" [-fpermissive]
  argument 4 ("&i") had type
    "int *" (pointing to 32 bits)
  but was expecting
    "Py_ssize_t *" (pointing to 64 bits) (from second argument of "int (*fn) (struct PyObject *, Py_ssize_t *)")
  for format code "O&"


Notice how it used the type of the callback to figure out what the type of the next argument must be.
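The fix is simply to declare the variable with the type that the converter writes through; something like:

    PyObject *
    fixed_converter(PyObject *self, PyObject *args)
    {
        Py_ssize_t i;    /* matches the Py_ssize_t* expected by convert_to_ssize */

        if (!PyArg_ParseTuple(args, "O&", convert_to_ssize, &i)) {
            return NULL;
        }

        Py_RETURN_NONE;
    }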

I've also reformatted the error messages slightly, adding newlines and indentation to try to make them easier to grok.

Hopefully we'll shake out the rest of the bugs soon, and then on to reference-count checking...

It's still rather rough around the edges, but if you want to try running it on your extension module, then come and join us.

Static analysis of CPython extensions, using a new GCC plugin

I've been looking at ways to improve the quality of Python extensions written in C.

CPython provides a great C API that makes it relatively easy to integrate C and C++ libraries with Python code. We use it extensively within Fedora - for example, Fedora's installation program is written in Python.

But you do have to write such code carefully:

  • you have to correctly keep track of reference counts in your objects. If you get this wrong, you can segfault the interpreter, or introduce a memory leak.

  • some APIs use a format string, with C variable-length arguments (see e.g. PyArg_ParseTuple and its variants). If the C compiler doesn't know the rules, it can't enforce type-safety. This can lead to people accidentally writing architecture-specific code (more on this below).

  • like any API, function calls can fail. This seems to be a universal rule of computer programming: it's tricky to correctly handle all the errors that can occur - bugs tend to lurk in the error-handling cases



I want to make it easier for people to write correct Python extension code, so I've been looking at static analysis.

None of the existing tools seemed to do exactly what I wanted, and given that all of my work is done with GCC, I wanted a solution that was well integrated with it. I also wanted to be able to use Python itself to work on the tool. (I attempted some of this a while back with Coccinelle, but this time I wanted to embed the checking directly into the compiler.)

So I've written a GCC plugin that embeds Python within that compiler. This means that it's now possible to write new C and C++ compilation passes in Python, and use Python packages for things like syntax-highlighting, visualization, and so on.

That's the theory, anyway. The code is still fairly new, so I've only wrapped a small subset of GCC's types and APIs.

I've started using this to write a static analyser for CPython extension code.

Here's an example of what it can do so far...

Given this fragment of C code:
    24	
    25	PyObject *
    26	socket_htons(PyObject *self, PyObject *args)
    27	{
    28	    unsigned long x1, x2;
    29	
    30	    if (!PyArg_ParseTuple(args, "i:htons", &x1)) {
    31	        return NULL;
    32	    }
    33	    x2 = (int)htons((short)x1);
    34	    return PyInt_FromLong(x2);
    35	}
    36	


there's a bug: at line 30, the "i" code to PyArg_ParseTuple signifies an "int", but it's given a pointer to the "unsigned long" declared at line 28 to write its result back into. This will break badly on a big-endian 64-bit CPU.
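The fix itself is simple - make the variable's type match the format code, e.g. something like the following sketch (the real socket module may address it differently) - but the interesting part is getting the compiler to spot the mismatch for us:

    PyObject *
    socket_htons_fixed(PyObject *self, PyObject *args)
    {
        int x1;              /* "i" writes back an int on every architecture */
        unsigned long x2;

        if (!PyArg_ParseTuple(args, "i:htons", &x1)) {
            return NULL;
        }
        x2 = htons((unsigned short)x1);
        return PyInt_FromLong(x2);
    }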

First of all, we can use the Python support in the compiler to visualize the code:
[david@fedora-15 gcc-python]$ ./gcc-with-python show-ssa.py -I/usr/include/python2.7 demo.c


Here's the output. This visualization shows the basic blocks of code, with source code on the left, interleaved with GCC's internal representation on the right:
SVG rendering of the control-flow graph of the given function

(If you're wondering what the "PHI<>" functions mean in the above, this is actually showing the SSA representation after some of GCC's analysis and optimization passes have already happened).

Given that this is Python, it's really easy to write new visualizations.

I've also written the first new compiler warnings using the Python plugin.

Here's the output from compiling that C code using my "cpychecker.py" script to add new warnings:
[david@fedora-15 gcc-python]$ ./gcc-with-python cpychecker.py $(python-config --cflags) demo.c 
demo.c: In function ‘socket_htons’:
demo.c:30:26: error: Mismatching type in call to PyArg_ParseTuple with format code "i:htons" [-fpermissive]
  argument 3 ("&x1") had type "long unsigned int *" (pointing to 64 bits)
  but was expecting "int *" (pointing to 32 bits) for format code "i"


I've tried to make the new error message readable, containing as much information as possible.

Any ideas on how to improve this?

I'm now working frantically on implementing reference-count checking :)

I hope that I'll be able to get this into a working state in time for Fedora 16: I'd like to run all of the C Python extension code in the Fedora distribution through a checker, but I need to do a lot of polishing before it's ready!

The code is free software (GPLv3 or later), and you can grab it from this git repository:

http://git.fedorahosted.org/git/?p=gcc-python-plugin.git;a=summary

I'm using this Trac instance for bug tracking:

https://fedorahosted.org/gcc-python-plugin/

Anyone got ideas for other uses for this? Visualizations of code? New compiler warnings? Remember, this thing's built on top of GCC, so (in theory) it can handle anything that GCC can handle e.g. C++ templates, Java, Fortran, and so on.

If you want to get involved, or want more information, there's a mailing list here:

https://fedorahosted.org/mailman/listinfo/gcc-python-plugin

Thanks to Red Hat for supporting the development of this software! (and for general awesomeness); thanks also to Read the Docs for providing a nifty hosting service for free software API documentation.

PyCon US talks on memory usage and gdb

Various people have been asking for the slides to my PyCon US talks, so here goes.

"Dude, Where's My RAM?" - A deep dive into how Python uses memory

Slides can be found here in ODP format (2.7M) and here as a PDF (2.5M)

To answer some of the questions asked about the gdb-heap tool:

  • I've used it with both Python 2.6 and Python 2.7 as the Python running inside gdb
  • I've used it to analyze the memory usage of Python 2.4, 2.6, 2.7, 3.1 and 3.2
  • If anyone wants to work on extending this (e.g. adding PyPy support), I'm attending the sprints until Thursday

Using Python to debug C and C++ code (using gdb)
The slides can be seen here

(BTW, these were built using slippy, which I like a lot.  I first tried using rst and rst2s5 but ran into lots of aspect ratio issues when trying it out on a projector, so I fixed up the s5 html into slippy's html.  Thanks to Ned Batchelder for showing me slippy)


FUDcon talk: Different Species of Python

I'm at FUDCon in Tempe, Arizona.  Today I gave a talk entitled "Different Species of Python", comparing the various implementations of the Python language, and how we might better support them within Fedora.

I hope this will be of interest to both the folks on the Planet Fedora and the Planet Python aggregators.

I ran overtime a bit: lots of interesting questions and discussion on the internals of the two CPython implementations, so I didn't get to cover Jython, IronPython and PyPy as much as I might have done, or the packaging issues, but hopefully a useful time was had by all.

I've uploaded a copy of my slides to my Fedora People page, as PDF here and as ODF here - enjoy!

UPDATE:
Tim Flink heroically transcribed the talk and questions for those on IRC.  A transcript can be found here

Python 3 support for GTK via PyGI: proof-of-concept

Screenshot showing GTK window created by Python 3 via PyGI


I spent much of last week taking part in a GTK/Python hackfest.

My particular interest is in Python 3 support, so I spent some time helping John Ehresman clean up pygobject, and some time with John Palmieri on PyGI.

I'm somewhat new to gobject-introspection, so I created an ASCII-art diagram to try to help figure out how it works:
http://live.gnome.org/GObjectIntrospection/Architecture

In the old GTK approach to binding native libraries for use by other language runtimes (such as Python's), a .defs file provided metadata on the API, which had to be kept in-sync with the code. An example can be seen here. For Python, this was used to generate .c code which could then be compiled into an extension module. A problem with this approach is that you typically need to write numerous .overrides files to handle the awkward cases, and these are specific to Python.

This means that with N libraries (gtk, atk, gstreamer, dbus, telepathy, etc) and M runtimes (python 2, python 3, javascript, java, etc) you potentially need to handle N*M bindings, and each one could involve special-casing.

In the new approach, "gobject-introspection" defines a simple textual format for source-code comments, containing similar information to a .defs file, but (I hope) rich enough to handle more of the special cases. This is scraped from the source into an XML file (e.g. Foo.gir), then compiled into an efficient binary format (e.g. Foo.typelib) which can be mapped into memory at runtime using a library (libgirepository.so).

The idea is that each language runtime can define a bridge, which calls into libgirepository, mapping between the language and the various libraries via libffi, lazily creating the glue code at runtime.

My hope is that we now need to handle only N + M cases, and so hopefully we have less of a combinatorial explosion, and potentially, faster start-up times and less RAM usage.

I've now got a py3k branch of PyGI in a Fedora git repository here.

(the tracking bug is https://bugzilla.gnome.org/show_bug.cgi?id=615872)

This code can be compiled against both Python 2.6 and Python 3.1 (in each case using the py3k branch of pygobject), and all tests pass with both versions of Python.

I can now write something like this:
from gi.repository import Gtk
w = Gtk.Window()
w.set_title('\u6587\u5b57\u5316\u3051') # 3 Kanji and a hiragana: "Mojibake"
b = Gtk.Button()
b.set_label('I am a GtkButton created by Python 3 via PyGI')
w.add(b)
w.show_all()
Gtk.main()


and have Python 3 dynamically load the Gtk typelib, dynamically load the underlying C libraries, and generate the machine-code glue to call into the GTK library and create a GtkWindow. In this case a Unicode string is correctly marshalled from Python 3, converted to UTF-8 internally within PyGI, and used to invoke gtk_window_set_title dynamically.

Early days yet (still seeing far too many errors when spelunking around inside the API), but at least it works well enough for a screenshot!

Thanks to the other hackfest attendees for answering my many dumb questions about PyGI, to Red Hat for paying me to work on Python 3 porting, and for sponsoring our food at the event, to One Laptop per Child for letting us use a room in their office as hacking space, and to Canonical for buying us coffee.

What variability exists within proposed updates to the Fedora package collection?

There's been a lot of discussion, and alas, some bad feeling, I think, about trying to balance updates versus testing in Fedora.

I believe there are many areas where we can mitigate risk for the users of Fedora without imposing extra work on package maintainers.

I don't think "one size fits all" - I believe that one of the problems we face is that no package or update is alike, and that discussion tends to lump things together without recognizing those differences.

In the hope that it's helpful, I've tried to gather some of the variables that I think are meaningful in the context of "how likely is it that a proposed update might break something?" (and there's some of my opinion in here too)

Built-in test suite
There's great variability here between different src rpms:
  • Does the upstream code have a test suite?
  • Do we run it during %check ?
  • If something started failing, do we actually know about it? (e.g. does an error here kill the build? is there a list of known good/known bad tests?)

I think that a package that's passed hundreds of selftests during the build and not failed any is doing better than one that has no built-in test suite, and should be in some way privileged in our update system. (It's possible to auto-detect the presence of a %check section as well)


External test suite
How much testing does the package get via autoqa? How much testing has this proposed update had via an automated system?


Manual testing
Yes, having a human actually try the software is often good, and it finds different types of problem from those that can be found via automated testing. Having said that, I feel we have far too little automated testing in Fedora today, and that the current way we do manual testing has flaws: people test for the bugs they wanted to see fixed, and report whether they are fixed or not. From a formal coverage perspective, we've no idea whether they're hitting the important use-cases for other users of the package. But presumably we do hit a lot of coverage with the current approach.


Can multiple versions be installed?
The kernel gets to install multiple copies of itself, and this is the ultimate escape hatch for when a kernel update hoses your system: you still have a known-good version installed (I hope) and get to reboot with that.

To what extent are other packages allowed to do that? (or would the maintainers want to?) Would extending it be a way to mitigate risk on those packages that want to rebase more often?


Meaningful test coverage
With my former professional QA hat on I think the ideal for software testing is:
  • The "functional" side: to have a set of personas describing who will be using the software, what they will be using it to do, and to use that to come up with a set of test cases that cover that functionality
  • The "non-functional" side: to know about the types of flaw expected based on the technology (e.g. I expect buffer overruns in C), and to use this to come up with appropriate test cases to prevent these
This should give an idea of what test coverage is appropriate, and you can _then_ think about automating them.

So I think that a package that has some test cases on the wiki is "in better shape" for doing updates than one that doesn't, and I hope that's a lightweight way of guiding testing. I hope there's a way of streamlining this within our processes so that we do smarter testing without needing extra work for package maintainers. (I don't expect anyone wants to adopt IEEE 829 in Fedora QA; see p133-136 of "Lessons Learned in Software Testing", Kaner et al (2002) for excellent arguments for not using it; a great book, BTW).


Lines of code overall
Some packages are small, some are huge. I did some stats on this for RHEL when I worked on RHEL QA, using "sloccount". I believe the largest by SLOC was openoffice, closely followed by the kernel (in the millions of SLOC), then a big dropoff to the 100k SLOC packages, then a long tail.


Amount of code touched
What is the build-time difference between old and new versions of the src.rpm? This isn't the whole story (a one-line bug can still kill you), but it's part of the story. A rebase might contain a fix for bugs you care about, but might also touch 50 other subsystems.


Amount of testing elsewhere
One advantage of a rebase is that you are sharing source code changes with other people, and so if there is a problem, someone else might have already run into it. This isn't a panacea: yes, there are plenty of ways in which we can have Fedora-specific bugs, but it is one difference between a tarball rebase versus cherry-picking patches.

(random thought: could Bodhi have integration with other distributions update systems and warn us about analogous updates that are breaking other people? or is Fedora always the first into the minefield, finding the bugs for other distributions?)


Noarch vs architecture-specific
The former are typically much simpler than the latter. The latter have specific risks (e.g. word-size assumptions). To what extent can we mitigate these risks with automated testing?


Programming Language
Each programming language exhibits its own sets of emergent behavior. For example (and this is grossly oversimplifying):
  • C code tends to exhibit buffer-overflow bugs, poor unit testing, poor error-handling
  • C++ code can be more prone to compiler/static linker/dynamic linkage bugs than C code
etc. I don't want to populate this list too much as this kind of thing is prone to unhelpful programming language flamewars.

Problems inherent to packaging
Each software delivery system exhibits its own set of flaws, and our RPM/yum is no exception. To what extent does, say, rpmlint cover the types of thing that go wrong, and to what extent can we extend it to help us?


Build system
Although the Fedora packaging guidelines manage to impose some sanity on this, there are many ways in which packages get configured and built.

Some examples:
  • the GNU autotools: configure.in, Makefile.am, leading to a "configure" used during the build to generate a Makefile. This can be prone to "silently" dropping functionality when the buildroot changes. It's sometimes possible to detect such breakage by looking at the "Requires" metadata of the built packages (can we automate this in Bodhi?)
  • a hand-written, one-of-a-kind Makefile from upstream. Given that each is unique, each will have unique problems
  • python setup.py, using distutils/setuptools.
  • cmake
etc

Security fixes
Security fixes probably should be treated differently from non-security fixes: many people expect that the former should happen as a matter of course, that if someone has distributed software, they should also promptly distribute security fixes. This seems to be regarded as some kind of natural entitlement within software in a way that other kinds of update aren't, and so our update process probably should reflect this special quality ascribed to security flaws (I suspect I'm getting grumpy and middle-aged in my attitudes here, sorry)


Critical versus Speciality packages
Is this a package that needs to work for essential Fedora functionality, or is it more of a "leaf" within the dependency graph? For example, if this package breaks, could it prevent a user from running yum, or running a graphical browser to search for information on the bug?

I like our "critical path" approach: some packages are definitely more critical than others. The exact details might need tuning, of course.


Paid versus Volunteer
I'm in the very fortunate position that I'm being paid to work on Fedora, and thus I'm professionally responsible for doing some tasks that aren't fun. Others volunteer their time and effort on Fedora, and I think it's important that their time should be fun, or, at least satisfying for some of the higher levels of Maslow's hierarchy of needs. (I happen to enjoy most of what I do in Fedora, and I do spend evenings and weekends on it too).


I hope this is a constructive addition to the debate.  What other variability is meaningful in the context of "candidate updates"? I probably missed some.

Update on 2to3c

My 2to3c tool now has a website: https://fedorahosted.org/2to3c/

I haven't yet sanely packaged it (I'm working on that) but I have done some work since my last blog post:
- It's somewhat more robust at handling errors from spatch
- Beginnings of a selftest suite (ultimately I want to be able to take C code, run the tool, compile it against multiple runtimes, then execute it in each one, verifying that the module still works)
- Some (possibly crazy) hacks to try to better handle preprocessor macros when reworking module initialization.

John Palmieri has used the tool to help with porting the DBus Python bindings to Python 3. He's had to do a lot of manual work, but apparently 2to3c did save him some time.

This is more of a shovel than a silver bullet, perhaps.

Help with this would be most welcome. I'm at PyCon, BTW, and will be around for a few of the sprint days, if anyone's interested in working on this.