Debugging the innards of Python... with Python

I've been trying a new approach for debugging the internals of Python.

One of the strengths of the C implementation of Python ("CPython") is how easy it is to wrap libraries written in C/C++ so that they're callable from Python code.

The drawback of this is that you're running "native" machine code, and that code can crash the python process, or worse, corrupt the internals of the python process so that the crash seems to be coming from somewhere else.

Time to break out the debugger...

Unfortunately, practically everything inside CPython is a pointer to a PyObject structure, with some additional type-specific data lurking after it in memory. A typical debugger by default just shows you the addresses of these structures, or shows you the two values they contain ("ob_refcnt" and "ob_type", which aren't necessarily enlightening).


Python memory usage: is it worth sharing constant data?

I had some interesting conversations at FUDcon Toronto [1] about python's memory footprint, and about ways of reducing the whole-system memory pressure of a collection of Python processes, by sharing more pages of memory between them.

We discussed sharing memory between python processes.

One approach to doing this would be to write some kind of support for starting python as a "zygote process", where we keep a started-up python process around, forking from it, so that the freshly forked process (in theory) uses only memory for the pages that become different (copy-on-write). As I understand it, this is how Chromium implements its tabs.
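A minimal sketch of the zygote idea (illustrative only, not OLPC's or Chromium's actual code): pre-import the expensive modules once in a long-lived parent, then fork a fresh child per task, relying on copy-on-write to share the parent's pages.

```python
import os

def spawn_from_zygote(task):
    """Fork from the warmed-up parent process; the child inherits every
    already-imported module through copy-on-write pages (POSIX only)."""
    pid = os.fork()
    if pid == 0:
        # Child: modules imported before the fork are usable immediately,
        # with no re-parsing and (in theory) little extra physical memory.
        try:
            task()
        finally:
            os._exit(0)  # skip atexit handlers inherited from the parent
    return pid
```

A launcher would pre-import the heavy modules (gtk, dbus, ...) once in the parent, then call spawn_from_zygote for each new activity.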

Bernie Innocenti apparently tried this for OLPC's activities (each a PyGTK process) but only got 10-20% sharing. The problem is that every python object embeds a reference count, and constantly increments/decrements those ob_refcnt fields, leading to unsharing of the pages.

One way of improving this might be to invent a magic refcount value (e.g. 0) that means "don't refcount this", so that pages can be shared (stuff that's been created at vm startup will probably be long-lived). (do we need the ability to "seal" an object, to optimize for the common case where nothing's been monkeypatched?)

(Or we could completely scrap refcounting and go to a full gc model, but I'm looking for easy wins here)

A similar approach that doesn't require us to have a zygote process is to use KSM (Kernel SamePage Merging). This is a feature added in recent versions of the Linux kernel, where a program can mark regions of its address space. The kernel can try to hash these pages, and pages that are bit-for-bit identical will be shared in RAM across the various processes.
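From user space, opting a region into KSM is just an madvise(MADV_MERGEABLE) call on it. A sketch using the mmap module (madvise() and the MADV_MERGEABLE constant are Linux-only and were exposed to Python in 3.8; at the C level this is simply madvise(addr, len, MADV_MERGEABLE)):

```python
import mmap

def mergeable_buffer(size=1 << 20):
    """Allocate an anonymous mapping and, where supported, ask the
    kernel to consider its pages for KSM same-page merging."""
    buf = mmap.mmap(-1, size)
    if hasattr(mmap, "MADV_MERGEABLE"):  # Linux-only constant
        buf.madvise(mmap.MADV_MERGEABLE)
    return buf
```

The kernel then scans and merges any pages of such regions that turn out to be bit-for-bit identical across processes.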

KSM was developed by Qumranet hackers (now at Red Hat) for use in KVM. Perhaps Python could use it?

Unfortunately I don't think this approach will work either: all of the various PyObject* pointers cross-referencing the data in memory will be non-equal, and that will kill the sharing; the pages are unlikely to be bit-for-bit identical.

One idea for achieving equality of pages was to mimic how we build emacs: as I understand it, we build emacs as a process, then take a core-dump of it. On starting up emacs we bring the coredump back to life. (I may be horribly mischaracterizing this - it seems very fragile to me).

An approach that sounded promising was to try to consolidate the representation of the immutable blobs of data in the loaded python modules: the docstrings, the constant strings and unicode representations; the bytecode blobs (actually these are PyStringObjects).

The idea was a new variant of .pyc/.pyo. A .pyc file is a hierarchical representation of a parsed python module, containing everything needed at runtime (e.g. optimized bytecodes blobs rather than source strings, with links to the constants needed in the code) serialized to disk using the "marshal" module. It's analogous to pickle, except that the marshal format only caters to strict hierarchies of objects, whereas pickle supports cross-references, and thus the marshal code can be simpler (and hopefully more efficient).

So in our proposed variant of .pyc, we would split the data into two streams:
- control data for building the hierarchy of objects, to be thrown away after the module is loaded
- "large" data to be mmap-ed, to persist in the process' address space after the module is loaded, with the kernel sharing all instances of this data in RAM between all python processes.
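A toy serializer shows the shape of the idea (an illustration of the proposed format, not real marshal code): strings go into the blob stream, and the control stream records only fixed-size (offset, length) records pointing into it.

```python
import struct

def split_streams(strings):
    """Serialize strings as a throwaway control stream plus a blob
    stream intended to be mmap-ed and shared between processes."""
    control, blob = bytearray(), bytearray()
    for s in strings:
        data = s.encode()
        # Control stream: fixed-size (offset, length) records.
        control += struct.pack("<II", len(blob), len(data))
        blob += data + b"\0"  # nul-terminate each entry
    return bytes(control), bytes(blob)

def read_strings(control, blob):
    """Rebuild the strings by pointing into the (shared) blob."""
    out = []
    for i in range(0, len(control), 8):
        off, length = struct.unpack_from("<II", control, i)
        out.append(blob[off:off + length].decode())
    return out
```

Note the nul terminator appended to each blob entry: that addresses PyStringObject's assumption that ob_sval is nul-terminated, discussed below.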

This would require hooks in PyStringObject (and the same in PyBytesObject for py3k), e.g. a new ob_sstate value, SSTATE_INTERNED_MMAPPED, indicating that the bytes live at a pointer elsewhere in the address space.

Some approaches to doing this:
- use the .pyc format as is. Unfortunately I don't think this works: currently they're written to disk as (size, bytes) without a nul terminator, whereas PyStringObject assumes that ob_sval is nul-terminated.
- come up with a simple variant of .pyc that splits the marshalled data into two streams (as above), storing offsets into the second stream within the first whenever writing out e.g. a PyStringObject
- use the ELF format directly: ELF is a binary format supporting multiple chunks/streams of data, with room for expansion, and has a library and command-line tools for manipulating it. We could invent some new types of section. However, I suspect that tools for dealing with ELF files are only readily available on Linux (where it happens to be the native format for executables and shared libraries). (We came up with the name ".pye" to refer to these ELF-based bytecode files.)

Another idea was linkage: to try to link together all of the .pyc files per RPM into one big file, linking together the large sections as far as possible, or perhaps a big sqlite db mapping dotted path to .pyc files for standard modules. The idea here was to try to reduce the number of syscalls needed to locate the .py files.

As it turned out, this seems to have been a classic case of optimizing without measuring.

Looking at the "stat" case, starting up the python interpreter 100 times:
[david@brick ~]$ time (for i in $(seq 1 100) ; do python -c "pass" ; done)

real	0m3.129s
user	0m2.328s
sys	0m0.652s

The bulk of the time taken is in user-space, rather than waiting on the kernel. (We tried this on the OLPC "boot animation" workload, and I believe the real slowdown is an accidental synchronous call that should have been asynchronous, stalling while waiting for a socket to close.)

On my return from the conference I spent some time trying to capture real measurements to justify a possible pyc rewrite.

To look at the memory usage of all of those shared docstrings, I wrote some systemtap scripts.

You can see the latest version here:

I tried various approaches to instrumentation.

The one I've settled on is to instrument returns from the r_object() call in Python/marshal.c, and to record PyObject* instances returned from that function that are of ob_type "str" (i.e. are PyStringObject instances) and have ob_refcnt == 1 (i.e. they aren't shared with anything, and haven't been interned).
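A rough pure-Python cross-check of the same measurement (an approximation for illustration, not the systemtap script): compile a module and recursively total the sizes of the string constants and bytecode blobs in its code objects.

```python
import types

def const_bytes(code):
    """Total the bytes of string/bytes constants and bytecode blobs,
    recursing into nested code objects (functions, classes, etc.)."""
    total = len(code.co_code)  # the bytecode blob itself
    for c in code.co_consts:
        if isinstance(c, (str, bytes)):
            total += len(c)  # docstrings and constant strings
        elif isinstance(c, types.CodeType):
            total += const_bytes(c)
    return total
```

For a whole module: const_bytes(compile(open("module.py").read(), "module.py", "exec")).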

Assuming I've instrumented things correctly, a simple startup of the interpreter under systemtap:
$ stap -v marshal.stp -c"python -c'pass'"
has this usage of unique strings (the "value" is the length of the string); note that this includes docstrings, constant strings, and bytecode blobs:
(snipped, full output here: )
Total cumulative size of r_object() calls returning strings with refcnt==1:  192K
value |-------------------------------------------------- count
    0 |                                                     0
    1 |@@                                                  54
    2 |@@@@@@@@@@@@@@                                     291
    4 |@@@@@@@@@@@                                        238
    8 |@@@@@@@@@@@@@@                                     281
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    959
   32 |@@@@@@@@@@@@@@@@@@@@@                              432
   64 |@@@@@@@@@                                          196
  128 |@@@@@@                                             138
  256 |@@@@                                                97
  512 |@@                                                  45
 1024 |                                                    11
 2048 |                                                     5
 4096 |                                                     1
 8192 |                                                     1
16384 |                                                     0
32768 |                                                     0

so (assuming my method is correct) we'd save 192K of mmap-ed data per python process.

For the OLPC case, each "activity" on the laptop is a python process that typically imports the GTK and DBus modules.

This shows a larger saving: 431K per python process:
$ stap -v marshal.stp -c"python -c'import gtk; import dbus'"

(output snipped; full output here: )
Total cumulative size of r_object() calls returning strings with refcnt==1:  431K
value |-------------------------------------------------- count
    0 |                                                      0
    1 |@                                                    65
    2 |@@@@@@@@@@@@@                                       534
    4 |@@@@@@@@@@@@@@                                      565
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   1302
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1958
   32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      1173
   64 |@@@@@@@@@@@                                         478
  128 |@@@@@@@@                                            336
  256 |@@@@@                                               216
  512 |@@                                                   87
 1024 |                                                     22
 2048 |                                                     10
 4096 |                                                      2
 8192 |                                                      1
16384 |                                                      0
32768 |                                                      0

Similarly, I suspect that there may be savings if you have numerous python web apps on one box (mod_wsgi daemon mode?), or via KSM savings as above if dealing with multiple guest VMs running python on one host.

Worth pursuing?

I also started looking at "yum"'s memory usage; see . I wrote a systemtap script to try to instrument the various levels of memory allocation inside the python runtime; see ; unfortunately this script doesn't work yet, owing to a systemtap bug. Hopefully when that's fixed we can get some real insight into this.

[1] with Bernie Innocenti, Colin Walters, Luke Macken, Adam Jackson and others; please forgive me if I've forgotten you.

2to3c: an implementation of Python's 2to3 for C code

I'm hoping that we'll package python 3 versions of as many modules as possible in Fedora 13, so the easier it is to port them, the better.

To that end, I've written a tool to help people port their C python extensions from Python 2 to Python 3.

It uses the Coccinelle tool to apply a series of "semantic patches" to .c files. I also had to code one of the refactorings in python with regular expressions (due to the need to manipulate preprocessor macros containing commas).
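The regex-driven part can be sketched like this (a toy renamer for illustration; the real tool handles the trickier cases involving macros):

```python
import re

# A small subset of py2 C-API names and their py3k equivalents.
RENAMES = {
    "PyInt_FromLong": "PyLong_FromLong",
    "PyInt_AsLong": "PyLong_AsLong",
    "PyInt_Check": "PyLong_Check",
}

# \b ensures we only rewrite whole identifiers, not substrings.
_PATTERN = re.compile(r"\b(%s)\b" % "|".join(map(re.escape, RENAMES)))

def port_source(c_source):
    """Apply the simple whole-identifier renamings to a C source string."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], c_source)
```

This covers only the mechanical one-to-one renamings; anything needing real C parsing is left to the Coccinelle semantic patches.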

Sample session, running on a tarball of dbus-python:
[david@brick 2to3]$ ./2to3c --help
Usage: 2to3c [options] filenames...

  -h, --help   show this help message and exit
  -w, --write  Write back modified files
[david@brick 2to3]$ ./2to3c ../../python3/packaging/modules/by-hand/dbus-python/devel/dbus-python-0.83.0/_dbus_bindings/*.c > dbus-python.patch 

[david@brick 2to3]$ diffstat dbus-python.patch
 abstract.c       |   28 ++----
 bus.c            |    4 
 bytes.c          |   16 +--
 conn.c           |    7 -
 containers.c     |   21 ++--
 float.c          |    6 -
 generic.c        |    4 
 int.c            |   31 ++-----
 libdbusconn.c    |    5 -
 mainloop.c       |    3 
 message-append.c |    4 
 message.c        |   17 +--
 module.c         |  243 ++++++++++++++++++++++++++++++++++++++++++++-----------
 pending-call.c   |    3 
 server.c         |    7 -
 signature.c      |    6 -
 string.c         |    9 --
 17 files changed, 267 insertions(+), 147 deletions(-)

[david@brick 2to3]$ head -n 30 dbus-python.patch
--- ../../python3/packaging/modules/by-hand/dbus-python/devel/dbus-python-0.83.0/_dbus_bindings/abstract.c.orig 
+++ ../../python3/packaging/modules/by-hand/dbus-python/devel/dbus-python-0.83.0/_dbus_bindings/abstract.c 
@@ -54,7 +54,7 @@
     if (!vl_obj)
         return 0;
-    return PyInt_AsLong(vl_obj);
+    return PyLong_AsLong(vl_obj);
@@ -76,7 +76,7 @@
     else {
-        PyObject *vl_obj = PyInt_FromLong(variant_level);
+        PyObject *vl_obj = PyLong_FromLong(variant_level);
         if (!vl_obj) {
             return FALSE;
@@ -127,7 +127,7 @@
     if (!value)
-        return PyInt_FromLong(0);
+        return PyLong_FromLong(0);
     return value;

You can see the full patch it generated here:

It hasn't done all of the work: there are some places involving the preprocessor where it didn't quite generate correct code, and there are some remaining issues - for example, a human is going to have to decide whether the strings are bytes or unicode.

However, I think this ought to save a lot of time: it takes care of a lot of the tedious parts of such patches.

The public git repo can be seen here:;a=tree

You should be able to download it by cloning it thus:
git clone git://

Patches most welcome! (send them to me). I intend to license this under LGPLv2.1, but am happy to relicense as the upstream Python community sees fit.

Static analysis of CPython's .c code

I've been hearing good things about Coccinelle for a while now, a tool for working with C code.

For example, it's been used for automating tedious (and error-prone) work on the Linux kernel.

I decided it was time to take it for a test-drive on CPython code.

I occasionally run into problems with the PyArg_ParseTuple API. It's a convenient API - it makes it very easy to marshal the objects passed as parameters to Python function calls into their C equivalents using a mini-language. The downside of this approach is that the compiler can't check such code for type safety, and so it's an area where bugs can lurk.
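The core of such a check is a table mapping format units to the C types their output pointers must have. A minimal Python version of that table-driven lookup (a sketch of the idea, not the Coccinelle script itself, and covering only a handful of format units, ignoring modifiers like "s#"):

```python
# Expected pointee type for some common PyArg_ParseTuple format units.
FORMAT_UNITS = {
    "i": "int",
    "l": "long",
    "s": "const char *",
    "d": "double",
    "O": "PyObject *",
}

def expected_types(fmt):
    """Return the C types that the varargs after fmt must point to.
    Anything after ':' or ';' is the function name / error message."""
    types = []
    for ch in fmt:
        if ch in ":;":
            break
        if ch in FORMAT_UNITS:
            types.append(FORMAT_UNITS[ch])
    return types
```

A checker then compares this list against the actual types of the &-ed arguments at each call site, which is where Coccinelle's C parsing comes in.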

So I've written a tool which can detect such problems. You can download it from my Fedora people page using this command:
git clone git://

or simply read the code online here:;a=tree

To run it you'll need Coccinelle (which includes the "spatch" tool). On Fedora you can install it using this command:
yum install coccinelle

You should then be able to run it thus:
spatch -sp_file pyarg-parsetuple.cocci buggy.c

init_defs_builtins: /usr/share/coccinelle/standard.h
HANDLING: buggy.c
buggy.c:13:socket_htons:Mismatching type of argument 1 in ""i:htons"": expected "int *" but got "unsigned long *"

thus correctly finding the bug (an old one, fixed in )

Early days yet, but this seems promising. Does anyone know of any other non-proprietary tools that can do this kind of thing?

(I've posted more info to python-dev list here)

librpm and python 3(000)

I've had a go at hacking up librpm's python bindings so that they work with both python 2 and python 3.

Here's my progress so far, trying out some sample rpm python code with py3k. Some things appear to be working, but I really need a unit test suite to do this properly, I think.

[david@brick rpm]$ PYTHONPATH=/home/david/coding/python3/rpm-python-bindings/install-prefix/lib/python3.1/site-packages python3
Python 3.1.1 (r311:74480, Oct 1 2009, 12:20:21)
[GCC 4.4.1 20090725 (Red Hat 4.4.1-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rpm
/home/david/coding/python3/rpm-python-bindings/install-prefix/lib/python3.1/site-packages/rpm/ DeprecationWarning: Type rpm.hdr defines tp_reserved (formerly tp_compare) but not tp_richcompare. Comparisons may not behave as intended.
from rpm._rpm import *
# (clearly I need to fix this)
>>> rpm.addMacro("_dbpath", "/var/lib/rpm")
>>> ts = rpm.TransactionSet()
>>> mi = ts.dbMatch()
>>> for h in mi:
...     print("%s-%s-%s" % (h['name'], h['version'], h['release']))

etc. (note how print has become a function, and how the data is coming back as "bytes", not as strings)

For the gory details, I sent my patches here:

Does anyone have a good test suite for the python rpm API?

Whence came "codebase"?

The word "codebase" has been bugging me lately.  I've noticed myself using it, and it's starting to grate on me.

It appears to mean a collection of computer code, from an implementation perspective.  Users of the word might refer to the "foo-1.0 codebase" as opposed to the "foo-2.0 codebase, where we totally rewrote everything".

Assuming that this is what the word means, it's a useful one.  However, where did this word come from, and when?   Does it have a more precise meaning, or has it spread across the internet with the rough (but useful) meaning above?

The only definitions I can find are within Wikipedia or Wiktionary (neither of which I'm a fan of).  I don't see it in the Jargon File either (but ditto, frankly).  It also seems to have a (different) specific meaning for Java applets, and is the name of a proprietary database product.

Does anyone have an etymology for this word?

squeal 0.4

I've put together a tarball release of my SQL/command-line collision, formerly "show", now "squeal"

Tarball here:
SRPM for Fedora/RHEL 5 here:

Here's what happened other than the name change:

New Data Sources

Text files and streams

Squeal can now deal with arbitrary text files, and with stdin (using "-").

It uses the first line of the input to determine the number of columns, giving you an input source of string columns named "col0", "col1", ..., up to N-1.

By default it splits columns on whitespace, but you can use -F "awk" to specify a field separator, e.g.:

   cat some-file.txt | squeal -F: --format=html -

to generate an HTML table from a colon-separated file.

You can also specify a regular expression containing groups, which will become the columns, e.g.:

   squeal col0, count from -r "(\<[^\>]+\>) (.*)" irc-log.txt group by col0 order by count desc

to figure out who's the chattiest in an IRC log.

Of course, if you need more complicated parsing, it's probably worth writing a dedicated data-source backend.
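The column-naming scheme described above is easy to sketch (illustrative only, not squeal's actual source): split the first line either on a separator or with a grouped regex, and name the resulting fields col0 through colN-1.

```python
import re

def infer_columns(first_line, sep=None, regex=None):
    """Derive column names from the first input line, either by
    splitting on sep (None = any whitespace) or from regex groups."""
    if regex is not None:
        m = re.match(regex, first_line)
        fields = m.groups() if m else ()
    else:
        fields = first_line.split(sep)
    return ["col%d" % i for i in range(len(fields))]
```

For example, the IRC-log regex above would yield two columns: col0 for the bracketed nick and col1 for the message text.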


Archive files

You can now issue SQL-like queries upon the contents of various kinds of archive files: .zip, .tar, .tar.bz2, .tar.gz, and .rpm. For example, here's a query on the payload of an RPM file:

    $ squeal "total(size)" from ~/rpmbuild/RPMS/noarch/squeal-0.4-1.fc10.noarch.rpm

tcpdump/Wireshark files

I wrote a proof-of-concept backend for querying tcpdump files, e.g.:

$ squeal "count(*)", "total(length)", src_mac, dst_mac from test.pcap group by src_mac, dst_mac

to analyse the quantity of network traffic between pairs of machines.

Internally it's merely invoking tcpdump to turn the file back into textual form, then carving it up with regexps, so it's not at all robust yet.

/var/log/maillog* (sendmail and spamd)

I wrote an experimental parser for /var/log/maillog

Query Syntax Improvements

Squeal can split its own arguments, to minimize the amount of escaping that you have to do within the shell. You can now pass in mixed queries like:

      $ squeal "count(*), total(size) host from" /var/log/httpd/*error_log* \
      "order by total(size) desc"

where some of the arguments are split by the shell, and some by squeal's parser.

You can now type "count" rather than "count(*)", provided "count" isn't a column name (typing "count(*)", and escaping it from the shell, is a pain). So the above query can be simplified further to:

      $ squeal "count, total(size) host from" /var/log/httpd/*error_log* \
      "order by total(size) desc"


Output Formats

I added two new formatting options:

  • "text" : outputs as lines of space-separated fields
  • "table" : outputs an ascii-art table


Other Improvements

  • The httpd log backend now supports parsing logs containing usernames
  • The syslog backend can now deal with single-digit dates within a month(!)
  • Deal with absolute and relative paths when path-matching input filenames; only check for Augeas below /etc
  • Various internal cleanups
  • Started a unit test suite.
  • Support Python < 2.5 by using earlier versions of sqlite; runs on RHEL 5
  • Work around an issue seen sometimes with the RPM backend
  • Detect exceptions in execution of the internal sqlite queries, and log them

How not to render 3D graphics: 40 ways to get a blank black screen

I've learned most of these the hard way:
  • The monitor is in power-saving mode
  • Your video card has two outputs, and the monitor is plugged into the wrong one.
  • The screensaver kicked in while you were talking to your boss, but your program has grabbed input focus.
  • You've forgotten to flip the display buffers. You're drawing everything OK, but not displaying it.
  • There's a bug in your memory management. You're rendering to a different area of video RAM than the one the monitor is reading from.
  • Your scene is too complex; either the CPU or GPU crashed before the first frame finished rendering.
  • The "near" clip plane is further away than the "far" clip plane
  • You're rejecting the wrong side of each line as you walk your BSP tree; your universe is hiding just beyond the corner of your eye.
  • You forgot to set up the bounding-boxes for your objects, and all of them are being rejected as too small
  • You have a bug in your matrix library. The entire universe has collapsed to (0,0,0).
  • You have a bug in your quaternion library. All of your rotated objects are collapsing to points.
  • You have a bug in your back-face culling. Every face is being culled.
  • Your camera is too narrow angle; the universe has collapsed to a single point directly ahead of you
  • Your camera's so wide-angle... (insert mother joke)
  • You've confused the origins of screen space and of the frame buffer, and all of the scene is being rendered off the side of the framebuffer (and culled per-fragment)
  • You're using index-colour, and you forgot to set up a palette: all colours are black.
  • Your scale is wildly wrong: the scene is vastly larger than you expect. You are viewing a single texel on a single triangle in the scene.
  • Your scale is wildly wrong: the scene is vastly smaller than you expect. The entire world has collapsed to a tiny point in the centre of the screen.
  • The camera is outside the scene and pointing the wrong way. Did you try looking behind you? Above you? etc
  • You forgot to add any lights to the scene. Darkness remains over the surface of the deep.
  • You forgot to set the textures. All of the scene is being rendered with a blank texture.
  • You forgot to set any UV coordinates. The entire scene is being rendered with the colour at (0,0) in their textures.
  • You're doing integer multiplication of fractions and everything is coming out as zero.
  • You've forgotten to offset the objects in the scene in world space and relative to the camera. You are viewing all of the visible objects in the scene from inside, and back-face culling ensures that nothing is visible. (Insert bad pun about "Lost In Translation")
  • You've forgotten to pop matrices from the transformation stack, and you overflowed the stack after a few frames. (You should check error codes at least once per frame)
  • Your collision algorithms aren't good enough. The object representing the camera has fallen through the floor and is hurtling at great speed downwards into nothingness whilst the world disappears high above.
  • Your geometry shader has a bug, and all geometry in the scene is appearing behind the camera
  • Your fragment shader failed to compile, and everything is coming out as black.
  • You forgot to clear the z-buffer, and all of your fragments are being rejected.
  • Your scene is foggier than a lazy TV remake of Sherlock Holmes.
  • You forgot to set up alpha values, and the entire world is fully-transparent.
  • You're standing in front of a black wall.
  • The game logic has decided that we're fading out the screen, in preparation for the next level.
  • You set up all of your data correctly, but a stray pointer is trashing one/all of the above.
  • The in-game menus have appeared, overlaid in front of your scene. Unfortunately, due to poor state management, they're far too big, and a single corner of one letter "A" is obscuring the entire screen with black.
  • Your radiosity renderer doesn't have enough photons.
  • Your volume shader is too opaque.
  • Your bidirectional reflectance distribution functions aren't reflective enough
  • Time is running at a different rate than it should be, and your simulation code has either destroyed all of the objects in the scene, or none have been created yet.
  • There's a FIXME in your code that really needs fixing...