18 December 2009 @ 08:02 pm
Python memory usage: is it worth sharing constant data?  
I had some interesting conversations at FUDcon Toronto [1] about python's memory footprint, and about ways of reducing the whole-system memory pressure of a collection of Python processes, by sharing more pages of memory between them.

We discussed sharing memory between python processes.

One approach to doing this would be to write some kind of support for starting python as a "zygote process", where we keep a started-up python process around, forking from it, so that the freshly forked process (in theory) uses only memory for the pages that become different (copy-on-write). As I understand it, this is how Chromium implements its tabs.
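
To make the mechanism concrete, here's a minimal sketch in plain C (nothing CPython-specific; expensive_startup() is an invented stand-in for interpreter startup and importing common modules): the parent pays the startup cost once, and each forked child shares all of the parent's pages copy-on-write until it writes to them.

#include <sys/wait.h>
#include <unistd.h>

static void expensive_startup(void)
{
    /* stand-in for interpreter startup, importing common modules, etc. */
}

int main(void)
{
    expensive_startup();           /* paid once, in the zygote */
    for (int i = 0; i < 4; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            /* child: initially shares every page with the zygote;
               only the pages it writes to get copied */
            _exit(0);              /* the real workload would run here */
        }
    }
    while (wait(NULL) > 0)         /* reap the children */
        ;
    return 0;
}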

Bernie Innocenti apparently tried this for OLPC's activities (each a PyGTK process) but only got 10-20% sharing. The problem is that every python object embeds a reference count, and constantly increments/decrements those ob_refcnt fields, leading to unsharing of the pages.

One way of improving this might be to invent a magic refcount value (e.g. 0) that means "don't refcount this", so that pages can be shared (stuff that's been created at vm startup will probably be long-lived). (do we need the ability to "seal" an object, to optimize for the common case where nothing's been monkeypatched?)

(Or we could completely scrap refcounting and go to a full gc model, but I'm looking for easy wins here)
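
To make the magic-refcount idea concrete, a rough sketch (the sentinel and macro names are invented; these are not CPython's actual Py_INCREF/Py_DECREF):

#include <Python.h>

#define Py_REFCNT_MAGIC 0   /* sentinel meaning "don't refcount this" */

#define Py_INCREF_SKETCH(op)                                  \
    do {                                                      \
        if (((PyObject *)(op))->ob_refcnt != Py_REFCNT_MAGIC) \
            ((PyObject *)(op))->ob_refcnt++;                  \
    } while (0)

#define Py_DECREF_SKETCH(op)                                  \
    do {                                                      \
        PyObject *_py_op = (PyObject *)(op);                  \
        if (_py_op->ob_refcnt != Py_REFCNT_MAGIC &&           \
            --_py_op->ob_refcnt == 0)                         \
            _Py_Dealloc(_py_op);                              \
    } while (0)

Objects "sealed" at vm startup would have their ob_refcnt set to the sentinel once, and never be written to again.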

A similar approach that doesn't require us to have a zygote process is to use KSM (Kernel SamePage Merging). This is a feature added in recent versions of the Linux kernel, whereby a program can mark regions of its address space as mergeable; the kernel hashes those pages, and pages that are bit-for-bit identical get shared in RAM across the various processes.

KSM was developed by Qumranet hackers (now at Red Hat) for use in KVM. Perhaps Python could use it?
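
The userspace side of KSM is small: a process opts pages in with madvise(MADV_MERGEABLE) (Linux 2.6.32+, with ksmd enabled via /sys/kernel/mm/ksm/run). A sketch:

#include <stddef.h>
#include <sys/mman.h>

/* allocate an anonymous region and offer its pages to ksmd for merging */
void *alloc_mergeable(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    /* advisory: on kernels without CONFIG_KSM this just fails
       and the pages won't be merged */
    madvise(p, size, MADV_MERGEABLE);
    return p;
}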

Unfortunately I don't think this approach will work either: all of the various PyObject* pointers cross-referencing the data in memory will be non-equal, and that will kill the sharing; the pages are unlikely to be bit-for-bit identical.

One idea for achieving equality of pages was to mimic how we build emacs: as I understand it, we build emacs as a process, then take a core-dump of it. On starting up emacs we bring the coredump back to life. (I may be horribly mischaracterizing this - it seems very fragile to me).

An approach that sounded promising was to try to consolidate the representation of the immutable blobs of data in the loaded python modules: the docstrings, the constant strings and unicode representations; the bytecode blobs (actually these are PyStringObjects).

The idea was a new variant of .pyc/.pyo. A .pyc file is a hierarchical representation of a parsed python module, containing everything needed at runtime (e.g. optimized bytecode blobs rather than source strings, with links to the constants needed in the code), serialized to disk using the "marshal" module. It's analogous to pickle, except that the marshal format only caters to strict hierarchies of objects, whereas pickle supports cross-references, and thus the marshal code can be simpler (and hopefully more efficient).
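
(For reference, a simplified sketch of how marshal writes a string record today; see w_object() in Python/marshal.c for the real thing. Note the absence of a nul terminator, which matters below:)

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* simplified: a one-byte type code, a 32-bit size, then the raw bytes */
static void w_string_record(const char *s, FILE *fp)
{
    uint32_t n = (uint32_t)strlen(s);
    fputc('s', fp);           /* TYPE_STRING type code */
    fwrite(&n, 4, 1, fp);     /* size (the real code writes little-endian) */
    fwrite(s, 1, n, fp);      /* the bytes, with no nul terminator */
}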

So in our proposed variant of .pyc, we would split the data into two streams:
- control data for building the hierarchy of objects, to be thrown away after the module is loaded
- "large" data to be mmap-ed, to persist in the process' address space after the module is loaded, with the kernel sharing all instances of this data in RAM between all python processes.

This would require hooks in PyStringObject (we'd need PyBytesObject to do it for py3k), e.g. a new ob_sstate value, SSTATE_INTERNED_MMAPPED, indicating that the bytes live at a pointer elsewhere in the address space.
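
Something like this, say (an entirely hypothetical layout; the real PyStringObject stores its bytes inline in ob_sval, so this would need a variant whose payload can live in the mapped region):

#include <Python.h>

#define SSTATE_INTERNED_MMAPPED 3   /* hypothetical new ob_sstate value */

/* hypothetical string variant whose bytes live in the mmap-ed
   region rather than inline after the object header */
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    const char *ob_sval_ptr;   /* points into the shared mapping */
} PyMmappedStringObject;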

Some approaches to doing this:
- use the .pyc format as is. Unfortunately I don't think this works: currently strings are written to disk as (size, bytes) without a nul terminator, whereas PyStringObject assumes that ob_sval is nul-terminated.
- come up with a simple variant of .pyc that splits the marshalled data into two streams (as above), storing offsets into the second stream within the first whenever writing out e.g. a PyStringObject
- use the ELF format directly: ELF is a binary format supporting multiple chunks/streams of data, with room for expansion, and a library and command-line tools for manipulating them. We could invent some new types of section. However I suspect that tools for dealing with ELF files are only readily-available on Linux (it happens to be the native format for executables and shared libraries) (we came up with the name ".pye" to refer to these ELF-based bytecode files)

Another idea was linkage: to try to link together all of the .pyc files per RPM into one big file, linking together the large sections as far as possible, or perhaps a big sqlite db mapping dotted path to .pyc files for standard modules. The idea here was to try to reduce the number of syscalls needed to locate the .py files.

As it turned out, this seems to have been a classic case of optimizing without measuring.

Looking at the "stat" case, starting up the python interpreter 100 times:
[david@brick ~]$ time (for i in $(seq 1 100) ; do python -c "pass" ; done)

real	0m3.129s
user	0m2.328s
sys	0m0.652s


...so the bulk of the time taken is in user-space, rather than waiting on the kernel. (We tried this on the OLPC "boot animation" workload, and I believe the real slowdown is an accidental synchronous call that should have been asynchronous, stalling while waiting for a socket to close.)

On my return from the conference I spent some time trying to capture real measurements to justify a possible pyc rewrite.

To look at the memory usage of all of those shared docstrings, I wrote some systemtap scripts.

You can see the latest version here: http://dmalcolm.fedorapeople.org/systemtap/marshal.stp

I tried various approaches to instrumentation.

The one I've settled on is to instrument returns from the r_object() call in Python/marshal.c, and to record PyObject* instances returned from that function that are of ob_type "str" (i.e. are PyStringObject instances) and have ob_refcnt == 1 (i.e. they aren't shared with anything, and haven't been interned).

Assuming I've instrumented things correctly, a simple startup of the interpreter under systemtap:
$ stap -v marshal.stp -c"python -c'pass'"
has this usage of unique strings (the "value" is the length of the string); note that this includes docstrings, constant strings, and bytecode blobs:
(snipped, full output here: http://dmalcolm.fedorapeople.org/systemtap/output-of-marshal.stp-on-python-pass.txt )
Total cumulative size of r_object() calls returning strings with refcnt==1:  192K
value |-------------------------------------------------- count
    0 |                                                     0
    1 |@@                                                  54
    2 |@@@@@@@@@@@@@@                                     291
    4 |@@@@@@@@@@@                                        238
    8 |@@@@@@@@@@@@@@                                     281
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    959
   32 |@@@@@@@@@@@@@@@@@@@@@                              432
   64 |@@@@@@@@@                                          196
  128 |@@@@@@                                             138
  256 |@@@@                                                97
  512 |@@                                                  45
 1024 |                                                    11
 2048 |                                                     5
 4096 |                                                     1
 8192 |                                                     1
16384 |                                                     0
32768 |                                                     0


so (assuming my method is correct) we'd save 192K of mmap-ed data per python process.

For the OLPC case, each "activity" on the laptop is a python process that typically imports the GTK and DBus modules.

This shows a larger saving: 431K per python process:
$ stap -v marshal.stp -c"python -c'import gtk; import dbus'"

(output snipped; full output here: http://dmalcolm.fedorapeople.org/systemtap/output-of-marshal.stp-on-python-import-gtk-and-dbus.txt )
Total cumulative size of r_object() calls returning strings with refcnt==1:  431K
value |-------------------------------------------------- count
    0 |                                                      0
    1 |@                                                    65
    2 |@@@@@@@@@@@@@                                       534
    4 |@@@@@@@@@@@@@@                                      565
    8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                   1302
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1958
   32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      1173
   64 |@@@@@@@@@@@                                         478
  128 |@@@@@@@@                                            336
  256 |@@@@@                                               216
  512 |@@                                                   87
 1024 |                                                     22
 2048 |                                                     10
 4096 |                                                      2
 8192 |                                                      1
16384 |                                                      0
32768 |                                                      0


Similarly, I suspect that there may be savings if you have numerous python web apps on one box (mod_wsgi daemon mode?), or via KSM savings as above if dealing with multiple guest VMs running python on one host.

Worth pursuing?

I also started looking at "yum"'s memory usage; see http://dmalcolm.fedorapeople.org/systemtap/output-of-marshal.stp-on-yum-whatprovides-python.txt . I wrote a systemtap script to try to instrument the various levels of memory allocation inside the python runtime; see http://dmalcolm.fedorapeople.org/systemtap/python-mem.stp ; unfortunately this script doesn't work yet, owing to a systemtap bug. Hopefully when that's fixed we can get some real insight into this.


[1] with Bernie Innocenti, Colin Walters, Luke Macken, Adam Jackson and others; please forgive me if I've forgotten you.
 
Frank Ch. Eigler (fuhchee) on December 19th, 2009 02:05 am (UTC)
why only strings?
David, some of those string lengths seem small enough to
make me wonder whether non-string data structures may be
worth analyzing for shareability. What about the arrays?

Do you have a good profile as to what happens during those
30 ms of trivial "python -c pass" startup? (gcov?)
dmalcolm on December 21st, 2009 12:33 am (UTC)
Re: why only strings?
> David, some of those string lengths seem small enough to
> make me wonder whether non-string data structures may be
> worth analyzing for shareability. What about the arrays?

Lists aren't sharable as they are mutable. Tuples are immutable and thus sharable, and looking at the logs, there appear to be a lot of tuples being demarshalled; 2690 for the "pass" case.

I added some instrumentation of the sizes of these; there were about 1000 zero-length tuples, but these are all shared internally as a single PyTupleObject instance.

Here are the sizes of the remaining PyTupleObjects coming from .pyc files, for the minimal "pass" case:

Marshalled tuple lengths with refcnt==1:
value |-------------------------------------------------- count
    0 |                                                     0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 546
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@    762
    4 |@@@@@@@@@@@@@@@@@@@@@@@                            371
    8 |@@@@@@@@@@@@@                                      213
   16 |@@@@                                                73
   32 |@                                                   25
   64 |                                                    11
  128 |                                                     0
  256 |                                                     1
  512 |                                                     0
 1024 |                                                     0

sizeof(PyTupleObject) for marshalled tuples with refcnt==1: total size: 63.9K
value |-------------------------------------------------- count
    4 |                                                      0
    8 |                                                      0
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1458
   32 |@@@@@@@@@@@@@                                       394
   64 |@@@                                                 113
  128 |                                                     24
  256 |                                                     12
  512 |                                                      0
 1024 |                                                      1
 2048 |                                                      0
 4096 |                                                      0
...assuming 4-byte pointers i.e.
sizeof(PyTupleObject) = 4 * (3+length)
dmalcolm on December 21st, 2009 12:34 am (UTC)
Re: why only strings?
(had to split the comment due to max-comment-length restriction).

For the "import gtk; import dbus" case:
Marshalled tuple lengths with refcnt==1:
value |-------------------------------------------------- count
    0 |                                                      0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 1140
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1602
    4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        928
    8 |@@@@@@@@@@@@@@                                      467
   16 |@@@@                                                158
   32 |@                                                    55
   64 |                                                     17
  128 |                                                      0
  256 |                                                      1
  512 |                                                      0
 1024 |                                                      1
 2048 |                                                      0
 4096 |                                                      0

sizeof(PyTupleObject) for marshalled tuples with refcnt==1: total size:  143K
value |-------------------------------------------------- count
    4 |                                                      0
    8 |                                                      0
   16 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  3107
   32 |@@@@@@@@@@@@@@                                      937
   64 |@@@                                                 240
  128 |@                                                    63
  256 |                                                     20
  512 |                                                      0
 1024 |                                                      1
 2048 |                                                      0
 4096 |                                                      1
 8192 |                                                      0
16384 |                                                      0


So not much saving, and actually implementing this kind of optimization would be way too hard:

  • somehow need to avoid twiddling the ob_refcnt

  • the address of the global "PyTuple_Type" would need to be the same for all of the different Python processes, due to the (PyTypeObject *)ob_type at the top of each PyTupleObject

  • to share the pages for these objects between processes we'd have to lay out actual structures on-disk - they'd vary between 32-bit and 64-bit



So I think the cost is very high, for little benefit.



I think there are bigger issues here with overall heap behavior; I'm looking forward to putting together a visualization of heap activity once that user-space probing bug is fixed :-)


wavetossed on December 19th, 2009 11:50 am (UTC)
split object storage
What would be the impact of splitting object storage so that the refcounts are all stored together in one memory pool, while the rest of the object is in a separate memory pool which is more likely to benefit from COW?

In fact, why not make the split based on static values and dynamic values, then write some code to profile the application and identify dynamic values which could be stored in the same pool as the refcounts? This would require some way to modify the .pyc files to tag which object attributes are static. All of this could live in standard Python, though it would require a profile-and-tag step.
dmalcolm on December 21st, 2009 12:54 am (UTC)
Re: split object storage
Hi! Thanks; yes, that's an approach that some people have experimented with; see: http://groups.google.com/group/unladen-swallow/browse_thread/thread/21d7248e8279b328/46dbe07915072506?lnk=raot
their patch is here: http://www.deeplayer.com/claudio/misc/Python-2.6.1-refcount.patch
(is that you?)

If (and that's a big "if") I tried to do this, I think it might be worth doing pointer compression at the same time.

This is kind-of thinking aloud here, but one idea would be to replace every PyObject* throughout CPython with a PyObjectRef* (or somesuch), and have that be a 32-bit type, perhaps an index into a table of
  struct PyObjectRef {
     PyObject *ob_ptr;
     int ob_refcnt;
  };

then replace the
  Py_ssize_t ob_refcnt;

in struct PyObject
with a
  PyObjectRef *ob_selfref;


That way:

  • the reference counts could be manipulated separately from the objects themselves, permitting page sharing across processes

  • we'd prevent CPython's heap usage doubling on 64-bit architectures (due to the sizeof() of the various PyObject structs doubling, due to sizeof(PyObject*) doubling)

  • downside: add an extra dereference throughout the code, and lots of rewriting...



(Not that I'm volunteering to implement this; I'm going to try to do a good visualization of heap usage in CPython for one of the workloads I care about next, I think, to see if there are easier wins)
dmalcolm on December 21st, 2009 02:29 am (UTC)
Re: split object storage
(replying to self, sorry)
Brainstorming, perhaps something like this:
typedef struct PyObjectRef PyObjectRef;

typedef int PyObjectId; /* index into Py_LiveObjects array: */
PyObjectRef *Py_LiveObjects; /* needs to grow when new objects are created */
int Py_NumIds;

struct PyObjectRef {
  PyObject *or_objptr;
  int or_refcnt;
  /* we could have a debug mode where we track all of the references to each referent */
};

PyObjectId *Py_FreeIds;
int Py_NumFreeIds;
int Py_AllocedFreeIds;

/* Unchecked macro: */
#define Py_OBJREF_FROM_ID(id) (&Py_LiveObjects[id])

/* Function */
PyObjectRef* Py_ObjRef_From_Id(PyObjectId id)
{
  assert(id >= 0);
  assert(id < Py_NumIds);

  return &Py_LiveObjects[id];
}

/* Unchecked macro: */
#define Py_OBJPTR_FROM_ID(id) (Py_OBJREF_FROM_ID(id)->or_objptr)

/* Function */
PyObject* Py_ObjPtr_From_Id(PyObjectId id, int incref)
{
  PyObjectRef* or = Py_ObjRef_From_Id(id);
  if (incref) {
    or->or_refcnt++;
  }
  return or->or_objptr;
}

#define Py_INCREF_ID(id) \
  (Py_OBJREF_FROM_ID(id)->or_refcnt++)

#define Py_DECREF_ID(id) \
   do {						\
       PyObjectRef* or = Py_OBJREF_FROM_ID(id); \
       if (0==--or->or_refcnt) {                \
           _Py_Dealloc(or->or_objptr);          \
       }                                        \
   } while (0)

/*
  ...etc; need to manage all of these IDs, free IDs, etc
*/

One idea is to register the various typeobjects with standardized IDs and use these when demarshalling; this would allow the ob_type field to become a PyObjectId, and thus be sharable in mmapped data. One side-effect is that marshalled objects would all have the same widths on 32-bit and 64-bit architectures, and thus .pyc files could contain mmappable PyObject structs; to do this requires the cross-references to be the same (maybe we need offsets instead of indices?)

(Deleted comment)
dmalcolm on December 21st, 2009 01:00 am (UTC)
Yes - interesting ideas here.

One thing we could do is to store the precomputed hash of each object (or just strings/unicode?) within the .pyc file, saving the CPU time of doing it when reading back the strings.
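
e.g. a sketch (r_string_with_hash is an invented helper, but PyString_FromStringAndSize and the ob_shash field are the real 2.x API):

#include <Python.h>

/* demarshal a string whose hash was precomputed at compile time,
   skipping the lazy rehash that would otherwise happen on first use */
static PyObject *
r_string_with_hash(const char *bytes, Py_ssize_t n, long precomputed_hash)
{
    PyObject *s = PyString_FromStringAndSize(bytes, n);
    if (s != NULL)
        ((PyStringObject *)s)->ob_shash = precomputed_hash;
    return s;
}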

More measurement needed, I think. We thought that all of those "stat" system calls were a problem, but timings show they aren't, or at least not with a warm cache. The only workload we came up with where minimizing the startup time of "python" was critical was OLPC's boot animation, but if my measurements are right this turned out to be a mistake: it was a synchronous wait on a socket that was causing the delay. (If you've got a good workload where the startup time is an issue, let me know!)
scooby509 on December 19th, 2009 06:27 pm (UTC)
Weighted refcounts
> The problem is that every python object embeds a reference count, and constantly increments/decrements those ob_refcnt fields, leading to unsharing of the pages.
>
> One way of improving this might be to invent a magic refcount value (e.g. 0) that means "don't refcount this", so that pages can be shared (stuff that's been created at vm startup will probably be long-lived).

I think you can avoid this problem with weighted refcounts. Wikipedia's got a good summary of it, but in a nutshell: the object stores a very large weight to start with, and the first reference has that same weight. When you copy a reference, you split its weight between the two references. Thus the referent object's total weight doesn't change, so no updates.

It gets tricky when the weight drops too low to split, but I'm definitely thinking it's a great method for any kind of distributed system or shared system using COW.
dmalcolm on December 21st, 2009 01:53 am (UTC)
Re: Weighted refcounts
Thanks - yes, the Wikipedia summary is: http://en.wikipedia.org/wiki/Reference_counting#Weighted_reference_counting

This would optimize the case where you gain new references on an object (by splitting the weight), but doesn't help when you drop the references, I think.
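
To make that concrete, toy code (invented names, nothing to do with CPython's actual macros): copying splits the weight without touching the referent, but dropping still writes to the referent's page:

#include <assert.h>

#define INITIAL_WEIGHT (1L << 16)   /* a power of two, so splits stay exact */

typedef struct { long total_weight; /* ...payload... */ } Referent;
typedef struct { Referent *obj; long weight; } WeightedRef;

/* copying a reference: no write to the referent at all */
WeightedRef ref_copy(WeightedRef *r)
{
    assert(r->weight > 1);   /* the tricky low-weight case is ignored here */
    r->weight /= 2;
    return (WeightedRef){ r->obj, r->weight };
}

/* dropping a reference: this *does* write to the referent */
void ref_drop(WeightedRef *r)
{
    r->obj->total_weight -= r->weight;
    if (r->obj->total_weight == 0) {
        /* last reference gone: free the referent here */
    }
}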

(Sounds like an extremely complex patch, alas)
ex_cgwalter on December 21st, 2009 04:54 pm (UTC)
431k sounds about right for core+gtk+dbus. That's actually a LOT of string data if you think about it.

The next obvious question is: how many Python processes are running in a typical OLPC system?