[Yum-devel] Dispelling rpm callback myths

Panu Matilainen pmatilai at laiskiainen.org
Tue Feb 22 11:42:59 UTC 2011


I promised Seth on IRC last Friday that I'd explain the rpm transaction 
(python) callback and what the deal is with avoiding headers to save 
memory, so here goes... This is long, so go grab a coffee first.

I'll start with a little bit of history first. Please remember this is not 
the "absolute truth of what really happened" but just my interpretation 
of things, based on bits and pieces of information from commit logs and 
other public archives.

Our story starts at the birth of Anaconda in 1999, in this early commit:
http://git.fedorahosted.org/git/?p=anaconda.git;a=commitdiff;h=c6ca9181446e3ad83a46fe031a4688a92f9f0f98

+ts = rpm.TransactionSet(rootPath, db)
+
+for p in comps.selected():
+    ts.add(p.h, (p.h, p.h[1000000]))

...

+def cb(what, amount, total, key, data):
+    if (what == rpm.RPMCALLBACK_INST_OPEN_FILE):
+       (h, key) = key
+       data.setPackage(h[rpm.RPMTAG_NAME])
+       d = os.open("/mnt/redhat/test/6.0/i386/RedHat/RPMS/" + key, os.O_RDONLY)
+       return d

There we go: (header, pkg_path) tuples being passed as the "key" argument 
to ts.add(), which is what ts.addInstall() was called back then (the 
h[1000000] thing was a custom tag added by the genhdlist tool that wrote 
out the headerlist used by anaconda, containing the package path). This 
convention has since been copied to, or carried on in, nearly every 
single user of rpm-python (including at least anaconda, up2date and yum).

Ewt wrote this part of the original rpm bindings (anaconda commit 
f1da6a4807d44c670453978b53d4b6d18b406ec1), so one would assume he knew 
what he was doing... and in fact back then, it was a "clever trick" to 
actually /save/ memory in anaconda, due to how rpm worked at that time. 
Unfortunately he + others (I dunno exact details of who wrote what) 
botched up various other aspects of the python callback design pretty 
badly - more on that later.

Fast-forward a few years and rpm had internally started saving memory by 
scraping just the information it needs for the transaction calculations 
(dependency checks, ordering, file conflicts etc) from the header passed 
in the /first/ argument of ts.addInstall(), instead of keeping the 
entire header around. Which turned the "clever trick" of saving memory 
into a huge waste of memory. But the anaconda habit of using (h, path) 
tuples for "package keys" stuck around, maybe because nobody clearly 
explained what was suddenly so wrong with it. There's a remark about a 
scaling issue related to header use by rpm-python users here: 
https://lists.dulug.duke.edu/pipermail/rpm-python-list/2003-October/000012.html, 
but if (note: if) that's the only explanation rpm-python users ever got, 
it's no wonder it was never understood. I remember boggling at the 
"headers are deprecated" comments myself back then and completely missing 
the point.

So what follows is the long, long overdue explanation.

Part of the long-standing confusion has to do with such a silly thing as 
argument naming, again copied around from anaconda to several places. If 
you look carefully at the early anaconda commit snippet above, you'll see 
the callback arguments are named "what, amount, total, key, data" - no 
headers in there. This is how it should be (except I'd replace "what" 
with "event", and "data" with "userdata"). At some point somebody 
replaced the "key" with "h" as in header, because that's pretty much what 
they got there in the callback, so it makes sense to call it a header and 
not some obscure "key", right?

But the "key" argument to ts.addInstall() is the key (pun intended) to 
this whole thing. The first and third arguments, "header" and "how", are 
for rpm's consumption. But the second argument, the "key", is for 
/yourself/. What you pass here as the key is the very same object that 
you get back in the callback in the "key" argument for the packages to 
be installed/updated, so that you can open a file descriptor to a 
package file and return it to rpm. A couple of trivial examples to 
demonstrate this (pass paths to local package(s) on the cli to install 
them):
http://laiskiainen.org/rpm/examples/python/minirpm-1.py
http://laiskiainen.org/rpm/examples/python/minirpm-2.py

See - no headers in the callback, and it still works. The sole purpose 
of the "key" argument is that you can open and close a file, and nothing 
more. Also, there are no "keys" for erased elements at all - rpm doesn't 
need the help of the callback to locate headers of installed packages.
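
To make the key round trip concrete, here is a minimal sketch (not one of 
the linked examples) of a callback that gets by with a plain path as the 
key. The class name OpenCloseCallback is made up for illustration, and the 
event constants are passed in as parameters so the sketch runs without 
rpm installed; with rpm-python they would presumably be 
rpm.RPMCALLBACK_INST_OPEN_FILE and rpm.RPMCALLBACK_INST_CLOSE_FILE.

```python
import os


class OpenCloseCallback:
    """Transaction callback whose key is nothing but a package path."""

    def __init__(self, open_event, close_event):
        # Event constants are injected so this sketch does not depend on
        # rpm being importable (an assumption made for illustration only).
        self.open_event = open_event
        self.close_event = close_event
        self.fds = {}  # path -> open file descriptor

    def __call__(self, what, amount, total, key, data):
        if what == self.open_event:
            # key is exactly what was handed to ts.addInstall(): a path.
            fd = os.open(key, os.O_RDONLY)
            self.fds[key] = fd
            return fd  # rpm reads the package from this descriptor
        if what == self.close_event:
            # Same key again, so the matching descriptor can be closed.
            os.close(self.fds.pop(key))
        return None
```

Assuming the usual rpm-python calls, this would be wired up roughly as 
ts.addInstall(h, path, 'u') for each package, followed by 
ts.run(OpenCloseCallback(...), None). No header is referenced from the key.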

Now, for a real-world callback, you'll want to be able to show things 
like name/nevra, size, summary etc of the package(s) being installed and 
removed. And this is where we get to the rather horrible misdesign of 
the python callback: rpm obviously has more information available, but 
in the python bindings, apart from the amount/total counters the only 
information you get is what you passed in as the "key" to 
ts.addInstall(). Since there are no keys for erased packages at all, rpm 
"helpfully" passes the name of the package as the key so you have at 
least some clue of what's going on. Which just isn't enough, especially 
in the multilib era, where several packages of the same name but 
different architecture can be installed side by side.

So how do you show more information then? These are conveniently 
available in the header, so why not pass that along here? Well, in order 
to return the object back to you, rpm needs to hold a reference to it 
someplace. So what happens behind the scenes of ts.addInstall() is quite 
literally:

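class TransactionSet:
    def __init__(self, *args):
        # every key ever passed to addInstall() is referenced from here
        self.keys = []
        ...

    def addInstall(self, header, key, how='u'):
        # the key stays referenced until the transaction itself goes away
        self.keys.append(key)
        ...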

Rpm itself never looks at the keys beyond passing around a pointer to 
them - the key is entirely the caller's business and rpm has no use for 
it (and could not use it even if it wanted to, for that matter). When 
you pass a header as (part of) the key, it gets pushed onto that list 
and never freed until the end of the transaction. To get an idea of the 
effect, try these two small examples which only differ in the key used:
http://laiskiainen.org/rpm/examples/python/memuse-1.py
http://laiskiainen.org/rpm/examples/python/memuse-2.py

With the Fedora 14 DVD contents, I get this:
[pmatilai at localhost pyex]$ ./memuse-1.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 140820 kB
[pmatilai at localhost pyex]$ ./memuse-2.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 252752 kB

That's ~110MB worth of extra baggage that neither rpm nor you have any 
use for, when all you want is to show a few tidbits of information like 
name, size etc in the callback. It's not entirely unlike lugging your 
entire personal library (of books, CDs, DVDs or such) around when 
shopping in order to avoid buying duplicates, when all you'd really need 
is a list of titles and authors. It doesn't make much of a difference 
when you have, say, half a dozen of them to carry around, but with 
hundreds and thousands...
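
The retention effect can be reproduced without rpm at all. The following 
is a toy model, assuming only what is described above: the transaction 
keeps a reference to every key. FakeHeader, FakeTransaction and 
header_survives are hypothetical names invented for this sketch, and 
nothing here measures real rpm memory use.

```python
import gc
import weakref


class FakeHeader:
    """Stand-in for an rpm header; imagine lots of tag data in here."""

    def __init__(self, name):
        self.name = name
        self.blob = bytearray(64 * 1024)


class FakeTransaction:
    """Hypothetical model of the key bookkeeping only, not of rpm."""

    def __init__(self):
        self.keys = []

    def addInstall(self, header, key, how='u'):
        # The header argument is only scraped for data, then dropped.
        # The key, however, is kept for as long as the transaction lives.
        self.keys.append(key)


def header_survives(use_header_as_key):
    """Report whether the header is still alive after addInstall()."""
    ts = FakeTransaction()
    h = FakeHeader("foo")
    probe = weakref.ref(h)  # lets us observe h without keeping it alive
    key = (h, "foo.rpm") if use_header_as_key else "foo.rpm"
    ts.addInstall(h, key)
    del h, key  # the caller lets go of its own references
    gc.collect()
    return probe() is not None
```

With the (header, path) tuple as the key, the header stays alive because 
the key list still refers to it; with a plain path as the key, it is freed 
as soon as the caller drops it.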

The difference is even more dramatic in reality because rpm goes out of 
its way (especially in versions >= 4.7.0, but to some extent in older 
versions too) to free up memory for the actual transaction run: all dependency 
information and very nearly all file data is thrown out, keeping only a 
couple of integer arrays per package to remember the actions calculated 
for each file.

Here's a version of the same memory use example, with just the data that 
an average callback might want for showing a bit of information to the 
user (i.e. the "title and author" from the analogy above):
http://laiskiainen.org/rpm/examples/python/memuse-3.py

With the same F14 package set I get:
[pmatilai at localhost pyex]$ ./memuse-3.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 142624 kB

...which is not that much more than the minimum of what rpm itself needs 
(case 1), even with the information you want to show in the callback 
added. Contrast that with case 2, which is what yum is currently doing. 
Here's a third version of the minirpm example, with a custom object used 
as the callback key to provide a bit of data about the package to the 
callback:
http://laiskiainen.org/rpm/examples/python/minirpm-3.py

The memory use would be similar to that of memuse-3, which uses dicts; 
the point is just to further demonstrate that the key can be any damn 
thing that is convenient /to you/.
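
As a rough sketch of such a custom key (not the code from minirpm-3.py), 
the class below copies only a handful of fields out of a header-like 
mapping. PkgInfo and label are hypothetical names, and the choice of 
fields is an assumption. The hdr argument only needs to support lookup by 
tag name; rpm headers should be usable in the same way, but a plain dict 
works for trying this out.

```python
class PkgInfo:
    """Small callback key: only what the user interface wants to show."""

    # __slots__ avoids a per-instance __dict__, keeping each key small.
    __slots__ = ("path", "name", "size", "summary")

    def __init__(self, path, hdr):
        # Copy the few values needed; the header itself is not retained.
        self.path = path
        self.name = hdr["name"]
        self.size = hdr["size"]
        self.summary = hdr["summary"]


def label(key):
    """Build a progress line from whatever key the callback receives."""
    if isinstance(key, PkgInfo):
        return "Installing %s (%d bytes): %s" % (
            key.name, key.size, key.summary)
    # For erasures there is no caller-supplied key, only the package name.
    return "Erasing %s" % key
```

The callback can then open key.path when asked for a file descriptor and 
call label(key) for display, without a full header ever being referenced 
from the key.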

For yum, the most convenient item to pass there would be a txmbr, as I 
suggested here: 
http://lists.baseurl.org/pipermail/yum-devel/2011-February/007964.html. 
Besides being convenient, passing txmbr or txmbr.po as the key would use 
even less memory than a partial copy of the header data in a dict, as 
they'd only be references to data that's already in memory. As it's yum 
that calls ts.addInstall(), it's yum that defines the callback 
convention for its own API users, so it can't all be changed "just like 
that" while API compatibility is needed. In any case, even the opt-in 
partial copy of the header is a HUGE step towards stopping the ancient 
waste of memory.

I hope this helps in understanding why I want to change the yum callback 
convention so badly :) And if something here is not clear to you, please 
DO ASK. I want to get this straightened out for good, finally.

	- Panu -

