18 December 2009 @ 08:02 pm
Python memory usage: is it worth sharing constant data?  
I had some interesting conversations at FUDcon Toronto [1] about python's memory footprint, and about ways of reducing the whole-system memory pressure of a collection of Python processes, by sharing more pages of memory between them.

We discussed sharing memory between python processes.

One approach to doing this would be to write some kind of support for starting python as a "zygote process": we keep a started-up python process around and fork from it, so that a freshly forked process (in theory) only needs new memory for the pages that subsequently diverge (copy-on-write). As I understand it, this is how Chromium implements its tabs.
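To make the idea concrete, here's a rough sketch of the shape such a zygote might take using the embedding API (this isn't what Chromium or OLPC actually do, and "gtk" below is just a stand-in for whatever expensive modules the children would need):

#include <Python.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    int i;

    Py_Initialize();
    /* Pay the expensive import cost once, in the zygote */
    PyRun_SimpleString("import gtk\n");

    for (i = 0; i < 4; i++) {
        if (fork() == 0) {
            /* Child: its pages start out shared with the zygote and only
               become private when written to (copy-on-write); a real
               zygote would need care about threads, signals and GTK
               state surviving the fork */
            PyRun_SimpleString("print 'hello from a forked child'\n");
            Py_Finalize();
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;
    Py_Finalize();
    return 0;
}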

Bernie Innocenti apparently tried this for OLPC's activities (each a PyGTK process) but only got 10-20% sharing. The problem is that every python object embeds a reference count, and constantly increments/decrements those ob_refcnt fields, leading to unsharing of the pages.

One way of improving this might be to invent a magic refcount value (e.g. 0) that means "don't refcount this", so that pages can be shared (stuff that's been created at vm startup will probably be long-lived). (do we need the ability to "seal" an object, to optimize for the common case where nothing's been monkeypatched?)
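A very rough sketch of what that might look like (Py_REFCNT_IMMORTAL and the *_SHARED macros are invented names; a real patch would fold this logic into Py_INCREF/Py_DECREF in Include/object.h):

#include <Python.h>

/* Invented sentinel: objects whose ob_refcnt equals this are never
   counted, so touching them never dirties their page */
#define Py_REFCNT_IMMORTAL 0

#define Py_INCREF_SHARED(op)                                    \
    do {                                                        \
        PyObject *_obj = (PyObject *)(op);                      \
        if (_obj->ob_refcnt != Py_REFCNT_IMMORTAL)              \
            _obj->ob_refcnt++;                                  \
    } while (0)

#define Py_DECREF_SHARED(op)                                    \
    do {                                                        \
        PyObject *_obj = (PyObject *)(op);                      \
        if (_obj->ob_refcnt != Py_REFCNT_IMMORTAL &&            \
            --_obj->ob_refcnt == 0)                             \
            _Py_Dealloc(_obj);                                  \
    } while (0)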

(Or we could completely scrap refcounting and go to a full gc model, but I'm looking for easy wins here)

A similar approach that doesn't require us to have a zygote process is to use KSM (Kernel SamePage Merging). This is a feature added in recent versions of the Linux kernel, where a program can mark regions of its address space as mergeable. The kernel can try to hash these pages, and pages that are bit-for-bit identical will be shared in RAM across the various processes.
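For reference, the marking is done with madvise(); here's a minimal sketch, where the anonymous arena stands in for wherever Python would keep its long-lived constant data (it needs a KSM-enabled kernel with ksmd running):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

#ifndef MADV_MERGEABLE
#define MADV_MERGEABLE 12   /* may be missing from older headers */
#endif

int
main(void)
{
    size_t len = 16 * 1024 * 1024;
    void *arena = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Ask ksmd to scan these pages and merge any that turn out to be
       bit-for-bit identical with pages in other processes */
    if (madvise(arena, len, MADV_MERGEABLE) != 0)
        perror("madvise(MADV_MERGEABLE)");

    /* ... fill the arena with the constant data we hope to share ... */
    return 0;
}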

KSM was developed by Qumranet hackers (now at Red Hat) for use in KVM. Perhaps Python could use it?

Unfortunately I don't think this approach will work either: all of the various PyObject* pointers cross-referencing the data in memory will be non-equal, and that will kill the sharing; the pages are unlikely to be bit-for-bit identical.

One idea for achieving equality of pages was to mimic how we build emacs: as I understand it, we build emacs as a process, then take a core-dump of it. On starting up emacs we bring the coredump back to life. (I may be horribly mischaracterizing this - it seems very fragile to me).

An approach that sounded promising was to try to consolidate the representation of the immutable blobs of data in the loaded python modules: the docstrings, the constant strings and unicode representations; the bytecode blobs (actually these are PyStringObjects).

The idea was a new variant of .pyc/.pyo. A .pyc file is a hierarchical representation of a parsed python module, containing everything needed at runtime (e.g. optimized bytecode blobs rather than source strings, with links to the constants needed in the code), serialized to disk using the "marshal" module. It's analogous to pickle, except that the marshal format only caters to strict hierarchies of objects, whereas pickle supports cross-references, and thus the marshal code can be simpler (and hopefully more efficient).
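For illustration (this is stock CPython, not the proposed change), the C-level marshal API round-trips such a strict hierarchy like this, with the constant strings embedded directly in the resulting byte stream:

#include <Python.h>
#include <marshal.h>

int
main(void)
{
    Py_Initialize();

    /* A stand-in for a module's constants: a docstring-like string
       plus a number, in a tuple */
    PyObject *consts = Py_BuildValue("(si)", "This is a docstring.", 42);

    /* Serialize to a flat byte string... */
    PyObject *blob = PyMarshal_WriteObjectToString(consts, Py_MARSHAL_VERSION);

    /* ...and bring it back; this is essentially what importing a .pyc does */
    PyObject *roundtrip = PyMarshal_ReadObjectFromString(
        PyString_AS_STRING(blob), PyString_GET_SIZE(blob));

    printf("marshalled %ld bytes\n", (long)PyString_GET_SIZE(blob));

    Py_XDECREF(roundtrip);
    Py_XDECREF(blob);
    Py_XDECREF(consts);
    Py_Finalize();
    return 0;
}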

So in our proposed variant of .pyc, we would split the data into two streams:
- control data for building the hierarchy of objects, to be thrown away after the module is loaded
- "large" data to be mmap-ed, to persist in the process' address space after the module is loaded, with the kernel sharing all instances of this data in RAM between all python processes.

This would require hooks in PyStringObject (it would need to be PyBytesObject for py3k), e.g. a new ob_sstate value, SSTATE_INTERNED_MMAPPED, for strings whose bytes live elsewhere in the address space and are referenced via a pointer.
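Purely as a sketch of the shape of that hook (none of these names or fields exist in CPython today; a real patch would modify PyStringObject itself rather than add a parallel struct):

#include <Python.h>

#define SSTATE_INTERNED_MMAPPED 3   /* invented ob_sstate value */

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int  ob_sstate;                 /* would gain SSTATE_INTERNED_MMAPPED */
    union {
        char ob_sval[1];            /* normal case: bytes stored inline */
        const char *ob_mmapped;     /* mmapped case: bytes live in the shared,
                                       read-only constant segment */
    } u;
} PyStringObject_sketch;

/* Accessor hiding where the bytes actually live: */
#define PyString_AS_STRING_SKETCH(op)                            \
    (((PyStringObject_sketch *)(op))->ob_sstate ==               \
         SSTATE_INTERNED_MMAPPED                                 \
         ? ((PyStringObject_sketch *)(op))->u.ob_mmapped         \
         : ((PyStringObject_sketch *)(op))->u.ob_sval)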

Some approaches to doing this:
- use the .pyc format as is. Unfortunately I don't think this works: currently strings are written to disk as (size, bytes) without a nul terminator, whereas PyStringObject assumes that ob_sval is nul-terminated.
- come up with a simple variant of .pyc that splits the marshalled data into two streams (as above), storing offsets into the second stream within the first whenever writing out e.g. a PyStringObject (a rough sketch of this follows the list)
- use the ELF format directly: ELF is a binary format supporting multiple chunks/streams of data, with room for expansion, plus a library and command-line tools for manipulating them. We could invent some new types of section. However, I suspect that tools for dealing with ELF files are only readily available on Linux, where it happens to be the native format for executables and shared libraries. (We came up with the name ".pye" to refer to these ELF-based bytecode files.)
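Here's the rough sketch of the writer side of the second approach (all names are invented; the real change would live in w_object() in Python/marshal.c, and at import time the data stream would be mmap-ed read-only so that string objects can point into it instead of copying):

#include <stdio.h>
#include <stdint.h>

typedef struct {
    FILE *control;      /* hierarchy/bookkeeping, discarded after load */
    FILE *data;         /* large constant blobs, mmap-ed at import time */
    uint32_t data_off;  /* current offset into the data stream */
} TwoStreamWriter;

static void
write_string_record(TwoStreamWriter *w, const char *bytes, uint32_t len)
{
    /* control stream: a tag plus where to find the bytes later */
    fputc('s', w->control);
    fwrite(&w->data_off, sizeof w->data_off, 1, w->control);
    fwrite(&len, sizeof len, 1, w->control);

    /* data stream: the bytes themselves, nul-terminated so that a
       PyStringObject-style consumer can point straight at them */
    fwrite(bytes, 1, len, w->data);
    fputc('\0', w->data);
    w->data_off += len + 1;
}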

Another idea was linkage: to try to link together all of the .pyc files per RPM into one big file, linking together the large sections as far as possible, or perhaps to use a big sqlite db mapping dotted module paths to .pyc data for the standard modules. The idea here was to try to reduce the number of syscalls needed to locate module files at import time.

As it turned out, this seems to have been a classic case of optimizing without measuring.

Looking at the "stat" case, starting up the python interpreter 100 times:
[david@brick ~]$ time (for i in $(seq 1 100) ; do python -c "pass" ; done)

real	0m3.129s
user	0m2.328s
sys	0m0.652s


...so the bulk of the time taken is in user-space, rather than waiting on the kernel. (We tried this on the OLPC "boot animation" workload, and I believe the real slowdown there is an accidental synchronous call that should have been asynchronous, stalling while waiting for a socket to close.)

On my return from the conference I spent some time trying to capture real measurements to justify a possible pyc rewrite.

To look at the memory usage of all of those potentially shareable docstrings, I wrote some systemtap scripts.

You can see the latest version here: http://dmalcolm.fedorapeople.org/systemtap/marshal.stp

I tried various approaches to instrumentation.

The one I've settled on is to instrument returns from the r_object() call in Python/marshal.c, and to record PyObject* instances returned from that function that are of ob_type "str" (i.e. are PyStringObject instances) and have ob_refcnt == 1 (i.e. they aren't shared with anything, and haven't been interned).

Assuming I've instrumented things correctly, a simple startup of the interpreter under systemtap:
$ stap -v marshal.stp -c"python -c'pass'"
has this usage of unique strings (the "value" is the length of the string); note that this includes docstrings, constant strings, and bytecode blobs:
(snipped, full output here: http://dmalcolm.fedorapeople.org/systemtap/output-of-marshal.stp-on-python-pass.txt )
Total cumulative size of r_object() calls returning strings with refcnt==1:  192K
value |-------------------------------------------------- count
    0 |                                                     0
    1 |@@                                                  54
    2 |@@@@@@@@@@@@@@                                     291
    4 |@@@@@@@@@@@                                        238
    8 |@@@@@@@@@@@@@@                                     281
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    959
   32 |@@@@@@@@@@@@@@@@@@@@@                              432
   64 |@@@@@@@@@                                          196
  128 |@@@@@@                                             138
  256 |@@@@                                                97
  512 |@@                                                  45
 1024 |                                                    11
 2048 |                                                     5
 4096 |                                                     1
 8192 |                                                     1
16384 |                                                     0
32768 |                                                     0


so (assuming my method is correct) roughly 192K per python process could become shared, mmap-ed data.

For the OLPC case, each "activity" on the laptop is a python process that typically imports the GTK and DBus modules.

This shows a larger saving: 431K per python process:
$ stap -v marshal.stp -c"python -c'import gtk; import dbus'"

(output snipped; full output here: http://dmalcolm.fedorapeople.org/systemtap/output-of-marshal.stp-on-python-import-gtk-and-dbus.txt )
Total cumulative size of r_object() calls returning strings with refcnt==1:  431K
value |-------------------------------------------------- count
    0 |                                                      0
    1 |@                                                    65
    2 |@@@@@@@@@@@@@                                       534
    4 |@@@@@@@@@@@@@@                                      565
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   1302
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1958
   32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      1173
   64 |@@@@@@@@@@@                                         478
  128 |@@@@@@@@                                            336
  256 |@@@@@                                               216
  512 |@@                                                   87
 1024 |                                                     22
 2048 |                                                     10
 4096 |                                                      2
 8192 |                                                      1
16384 |                                                      0
32768 |                                                      0


Similarly, I suspect that there may be savings if you have numerous python web apps on one box (mod_wsgi daemon mode?), or, via KSM as above, when running python in multiple guest VMs on one host.

Worth pursuing?

I also started looking at "yum"'s memory usage; see http://dmalcolm.fedorapeople.org/systemtap/output-of-marshal.stp-on-yum-whatprovides-python.txt . I wrote a systemtap script to try to instrument the various levels of memory allocation inside the python runtime; see http://dmalcolm.fedorapeople.org/systemtap/python-mem.stp ; unfortunately this script doesn't work yet, owing to a systemtap bug. Hopefully when that's fixed we can get some real insight into this.


[1] with Bernie Innocenti, Colin Walters, Luke Macken, Adam Jackson and others; please forgive me if I've forgotten you.
dmalcolm on December 21st, 2009 02:29 am (UTC)
Re: split object storage
(replying to self, sorry)
Brainstorming, perhaps something like this:
typedef struct PyObjectRef PyObjectRef;

typedef int PyObjectId; /* index into Py_LiveObjects array: */
PyObjectRef *Py_LiveObjects; /* needs to grow when new objects are created */
int Py_NumIds;

struct PyObjectRef {
  PyObject *or_objptr;
  int or_refcnt;
  /* we could have a debug mode where we track all of the references to each referent */
};

PyObjectId *Py_FreeIds;
int Py_NumFreeIds;
int Py_AllocedFreeIds;

/* Unchecked macro: */
#define Py_OBJREF_FROM_ID(id) (&Py_LiveObjects[id])

/* Function */
PyObjectRef* Py_ObjRef_From_Id(PyObjectId id)
{
  assert(id >= 0);
  assert(id < Py_NumIds);

  return &Py_LiveObjects[id];
}

/* Unchecked macro: */
#define Py_OBJPTR_FROM_ID(id) (Py_OBJREF_FROM_ID(id)->or_objptr)

/* Function */
PyObject* Py_ObjPtr_From_Id(PyObjectId id, int incref)
{
  PyObjectRef* or = Py_ObjRef_From_Id(id);
  if (incref) {
    or->or_refcnt++;
  }
  return or->or_objptr;
}

#define Py_INCREF_ID(id) \
  (Py_OBJREF_FROM_ID(id)->or_refcnt++)

#define Py_DECREF_ID(id) \
   do {						\
       PyObjectRef* or = Py_OBJREF_FROM_ID(id); \
       if (0==--or->or_refcnt) {                \
           _Py_Dealloc(or->or_objptr);          \
       }                                        \
   } while (0)

/*
  ...etc; need to manage all of these IDs, free IDs, etc
*/

One idea is to register the various typeobjects with standardized IDs and use those for demarshalling; this would allow the ob_type field to become a PyObjectId, and thus be sharable in mmapped data. One side effect is that marshalled objects would all have the same widths on 32-bit and 64-bit architectures, and thus a .pyc could contain mmappable PyObject structs; to do this requires the cross-references to be the same as well (maybe we need offsets instead of indices?)
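Continuing the brainstorm, the registration might look something like this (again, every name below is invented):

#include <Python.h>

typedef enum {
    PY_TYPEID_TYPE = 0,
    PY_TYPEID_STR,
    PY_TYPEID_INT,
    PY_TYPEID_TUPLE,
    PY_TYPEID_DICT,
    PY_NUM_WELL_KNOWN_TYPEIDS
} PyWellKnownTypeId;

static PyTypeObject *Py_WellKnownTypes[PY_NUM_WELL_KNOWN_TYPEIDS];

static void
Py_RegisterWellKnownTypes(void)
{
    Py_WellKnownTypes[PY_TYPEID_TYPE]  = &PyType_Type;
    Py_WellKnownTypes[PY_TYPEID_STR]   = &PyString_Type;
    Py_WellKnownTypes[PY_TYPEID_INT]   = &PyInt_Type;
    Py_WellKnownTypes[PY_TYPEID_TUPLE] = &PyTuple_Type;
    Py_WellKnownTypes[PY_TYPEID_DICT]  = &PyDict_Type;
}

/* A demarshalled, mmap-able object header could then look like: */
typedef struct {
    int ob_typeid;      /* index into Py_WellKnownTypes: the same bytes on
                           32-bit and 64-bit, unlike a PyTypeObject* */
    int ob_size;
    /* payload follows, using offsets rather than pointers */
} PyMMappedObjectHeader;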