[Yum-devel] Dispelling rpm callback myths
Panu Matilainen
pmatilai at laiskiainen.org
Tue Feb 22 11:42:59 UTC 2011
I promised Seth on IRC last Friday to explain the rpm transaction
(python) callback - what's the deal with avoiding headers to save memory
etc, so here goes... This is long, so go grab a coffee first.
I'll start with a little bit of history first. Please remember this is not
the "absolute truth of what really happened" but just my interpretation
of things, based on bits and pieces of information from commit logs and
other public archives.
Our story starts at the birth of Anaconda in 1999, in this early commit:
http://git.fedorahosted.org/git/?p=anaconda.git;a=commitdiff;h=c6ca9181446e3ad83a46fe031a4688a92f9f0f98
+ts = rpm.TransactionSet(rootPath, db)
+
+for p in comps.selected():
+    ts.add(p.h, (p.h, p.h[1000000]))
...
+def cb(what, amount, total, key, data):
+    if (what == rpm.RPMCALLBACK_INST_OPEN_FILE):
+        (h, key) = key
+        data.setPackage(h[rpm.RPMTAG_NAME])
+        d = os.open("/mnt/redhat/test/6.0/i386/RedHat/RPMS/" + key,
+                    os.O_RDONLY)
+        return d
There we go, adding (header, pkg_path) tuples as the "key" argument to
what was ts.addInstall() back then (the h[1000000] thing was a custom
tag added by the genhdlist thing that wrote out the headerlist used by
anaconda, containing the package path). This convention has since then
been copied to/carried on in nearly every single user of rpm-python
(including anaconda, up2date and yum at least).
Ewt wrote this part of the original rpm bindings (anaconda commit
f1da6a4807d44c670453978b53d4b6d18b406ec1), so one would assume he knew
what he was doing... and in fact back then, it was a "clever trick" to
actually /save/ memory in anaconda, due to how rpm worked at that time.
Unfortunately he + others (I dunno exact details of who wrote what)
botched up various other aspects of the python callback design pretty
badly - more on that later.
Fast-forward a few years and rpm had internally started saving memory by
scraping just the information it needs for the transaction calculations
(dependency checks, ordering, file conflicts etc) from the header passed
in the /first/ argument of ts.addInstall(), instead of keeping the
entire header around. Which turned the "clever trick" of saving memory
into a huge waste of memory. But the anaconda habit of using (h, path)
tuples for "package keys" stuck around, maybe because nobody clearly
explained what's suddenly so wrong with that. There's a remark about a
scaling issue related to header use of rpm-python users here:
https://lists.dulug.duke.edu/pipermail/rpm-python-list/2003-October/000012.html,
but if (note if) that's the only explanation given to rpm-python users,
no wonder it never was understood. I remember boggling at the "headers
are deprecated" comments myself back then and completely missing the point.
So what follows is the long, long overdue explanation.
Part of the long-standing confusion has to do with such a silly thing as
argument naming, again copied around from anaconda to several places. If
you look carefully at the early anaconda commit snippet above, you'll
see the callback arguments are named "what, amount, total, key, data" -
no headers in there. This is how it should be (except I'd replace "what"
with "event", and "data" with
"userdata"). At some point somebody replaced the "key" with "h" as in
header, because that's pretty much what they got there in the callback,
so it makes sense to call it a header and not some obscure "key", right?
But the "key" argument to ts.addInstall() is the key (pun intended) to
this whole thing. The first and third arguments - "header" and "how",
are for rpm's consumption. But the second argument, the "key", is for
/yourself/. What you pass here as the key is the very same object that
you get back in the callback in the "key" argument for the packages to
be installed/updated, so that you can open a file descriptor to a
package file and return it to rpm. A couple of trivial examples to
demonstrate this (pass paths to local package(s) on the cli to install
them):
http://laiskiainen.org/rpm/examples/python/minirpm-1.py
http://laiskiainen.org/rpm/examples/python/minirpm-2.py
See - no headers in the callback, and it still works. The sole purpose
of the "key" argument is that you can open and close a file, and nothing
more. Also there are no "keys" for erased elements at all - rpm doesn't
need the help of callback to locate headers of installed packages.
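As a rough sketch of that contract (this is not the linked minirpm scripts, just an illustration), the callback below treats the key as nothing more than a path to open. The event constants and the "fds" bookkeeping dict are stand-ins made up here so the snippet runs without rpm installed; with the real bindings you would compare against rpm.RPMCALLBACK_INST_OPEN_FILE and rpm.RPMCALLBACK_INST_CLOSE_FILE and hand the function to ts.run().

```python
import os

# Stand-ins for rpm.RPMCALLBACK_INST_OPEN_FILE / rpm.RPMCALLBACK_INST_CLOSE_FILE,
# defined here only so the sketch runs without the rpm module.
INST_OPEN_FILE = "inst_open_file"
INST_CLOSE_FILE = "inst_close_file"

# Hypothetical bookkeeping: open file descriptors, indexed by key.
fds = {}

def callback(event, amount, total, key, userdata):
    """The key is whatever the caller passed to addInstall(); here, a path."""
    if event == INST_OPEN_FILE:
        fds[key] = os.open(key, os.O_RDONLY)
        return fds[key]          # rpm reads the package from this descriptor
    if event == INST_CLOSE_FILE:
        os.close(fds.pop(key))   # done with the package file
    return None
```

Note that nothing in the callback touches a header: a plain string is enough for rpm to get its file descriptor.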
Now, for a real-world callback, you'll want to be able to show things
like name/nevra, size, summary etc of the package(s) being installed and
removed. And this is where we get to the rather horrible misdesign of
the python callback: rpm obviously has more information available, but
in the python bindings, apart from the amount/total counters the only
information you get is what you passed in as the "key" to
ts.addInstall(). Since there are no keys for erased packages at all, rpm
"helpfully" passes the name of the package as the key so you have at
least some clue of what's going on. Which just isn't enough, especially
in the multilib era.
So how do you show more information then? These are conveniently
available in the header, so why not pass that along here? Well, in order
to return the object back to you, rpm needs to hold a reference to it
someplace. So what happens behind the scenes of ts.addInstall() is quite
literally:
class TransactionSet:
    def __init__(self, ...):
        self.keys = []
        ...
    def addInstall(self, header, key, how='u'):
        self.keys.append(key)
        ...
Rpm itself never looks at the keys beyond passing around a pointer to
them - the key is entirely the caller's business and rpm has no use for
it (and could not use it even if it wanted to, for that matter). When
you pass a header as (part of) the key, it gets pushed onto that list
and is not freed until the end of the transaction. To get an idea of the
effect, try these two small examples which only differ in the key used:
http://laiskiainen.org/rpm/examples/python/memuse-1.py
http://laiskiainen.org/rpm/examples/python/memuse-2.py
On Fedora 14 DVD contents, I get this:
[pmatilai at localhost pyex]$ ./memuse-1.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 140820 kB
[pmatilai at localhost pyex]$ ./memuse-2.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 252752 kB
That's ~110MB worth of extra baggage that neither rpm nor you have any use
for, when all you want is to show a few tidbits of information like
name, size etc in the callback. It's not entirely unlike lugging your
entire personal library (of books, CDs, DVDs or such) around when
shopping in order to avoid buying duplicates when all you'd really need
is a list of titles and authors. It doesn't make much of a difference
when you have, say, half a dozen of them to carry around, but with
hundreds and thousands...
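The reference-holding effect can be sketched without rpm at all. Everything below is a toy made up for illustration: ToyHeader stands in for a real header and ToyTransactionSet copies only the key-keeping behaviour described earlier, so it says nothing about actual rpm memory numbers. It merely shows that a header passed as the key stays alive after the caller drops it, while a header whose name alone is used as the key does not.

```python
import gc
import weakref

class ToyHeader:
    """Hypothetical stand-in for an rpm header: big, and only needed briefly."""
    def __init__(self, name):
        self.name = name
        self.payload = bytearray(100000)   # pretend this is all the tag data

class ToyTransactionSet:
    """Models only the key list; real rpm does far more than this."""
    def __init__(self):
        self.keys = []

    def addInstall(self, header, key, how="u"):
        self.keys.append(key)   # the key stays referenced while ts lives

def header_survives(use_header_as_key):
    ts = ToyTransactionSet()
    h = ToyHeader("foo")
    probe = weakref.ref(h)      # lets us check liveness without a strong ref
    ts.addInstall(h, h if use_header_as_key else h.name)
    del h                       # the caller is done with the header
    gc.collect()                # not needed on CPython, harmless elsewhere
    return probe() is not None  # True if something still keeps it alive
```

Calling header_survives(True) reports the header as still alive, header_survives(False) as gone: multiply that by a few thousand packages and you get the difference between the two memuse runs.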
The difference is even more dramatic in reality because rpm goes out of
its way (especially in versions >= 4.7.0, but to some extent in older versions
too) to free up memory for the actual transaction run: all dependency
information and very nearly all file data is thrown out, keeping only a
couple of integer arrays per package to remember the actions calculated
for each file.
Here's a version of the same memory use example, with just the data that
an average callback might want for showing a bit of information to the
user (ie the "title and author" from the analogy above):
http://laiskiainen.org/rpm/examples/python/memuse-3.py
With the same F14 package set I get:
[pmatilai at localhost pyex]$ ./memuse-3.py /mnt/Packages/*.rpm
Memory used with 2766 packages in transaction: 142624 kB
...which is not that much more than the minimum of what rpm itself needs
(case 1), even with the information you want to show in the callback
added. Contrast that with case 2, which is what yum is currently doing.
Here's a third version of the minirpm example, with a custom object used
for the callback key for providing a bit of data about the package to
the callback:
http://laiskiainen.org/rpm/examples/python/minirpm-3.py
The memory use would be similar to that of memuse-3, which uses dicts;
the point is just to further demonstrate that the key can be any damn
thing that is convenient /to you/.
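For a flavour of what such a key object might look like (a sketch only, not the contents of minirpm-3.py), here is a small class carrying the package path plus the few fields a callback would display. PkgKey, key_from_header and describe are names invented for this example, and "hdr" is assumed to be anything that can be indexed by the tag names shown; a plain dict is used so it runs without rpm.

```python
class PkgKey:
    """Hypothetical key object: the package path plus a few display fields."""
    __slots__ = ("path", "name", "size", "summary")

    def __init__(self, path, name, size, summary):
        self.path = path
        self.name = name
        self.size = size
        self.summary = summary

def key_from_header(path, hdr):
    """Copy only what the callback will show; the header itself is not kept."""
    return PkgKey(path, hdr["name"], hdr["size"], hdr["summary"])

def describe(key):
    """What a callback could print when rpm hands the key back."""
    return "Installing %s (%d bytes): %s" % (key.name, key.size, key.summary)

# Usage with a plain dict standing in for a header:
hdr = {"name": "foo", "size": 1234, "summary": "A test package"}
key = key_from_header("/tmp/foo-1.0-1.noarch.rpm", hdr)
print(describe(key))
```

The callback would open key.path when asked for a file descriptor and use describe(key) for progress output, without ever holding on to the header.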
For yum, the most convenient item to pass there would be a txmbr, as I
suggested here:
http://lists.baseurl.org/pipermail/yum-devel/2011-February/007964.html.
Besides being convenient, passing txmbr or txmbr.po as the key would use
even less memory than a partial copy of the header data in a dict, as
they'd only be references to data that's already in memory. As it's
yum that calls ts.addInstall(), it's yum that defines the callback
convention for its own API users, so it can't all be changed "just like
that" while API compatibility is needed. In any case, even the opt-in
partial copy of the header is a HUGE step towards stopping the ancient
waste of memory.
I hope this helps explain why I want to change the yum callback
convention so badly :) And if something here is not clear to you, please
DO ASK. I want to get this straightened out for good, finally.
- Panu -