[Yum-devel] RepoView
Ryan Tomayko
rtomayko at naeblis.cx
Fri Mar 4 16:52:31 UTC 2005
On Mar 4, 2005, at 8:56 AM, Konstantin Ryabitsev wrote:
> On Thu, 2005-03-03 at 22:27 -0500, seth vidal wrote:
>> I did some import tests using cElementTree last night, it's, umm, big,
>> memory-wise. reading in filelists.xml.gz for only rawhide ate up 120M
I've been working under the impression that cET is the most efficient
(memory-wise) XML parser available for python based on these numbers:
<http://effbot.org/zone/celementtree.htm#benchmarks>
Maybe we need to run some comparisons of our own if you're seeing
something so different.
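For our own comparisons, something like the following might be a starting point. It is only a rough sketch: it uses the standard library's xml.etree.ElementTree and tracemalloc, the peak_bytes helper and the synthetic filelists-style document are made up for illustration, and tracemalloc only sees allocations made through Python's own allocator, so memory that a C library such as libxml2 allocates directly would not show up in the figure.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET


def peak_bytes(parse, data):
    """Return the peak traced allocation, in bytes, while parse(data) runs.

    Hypothetical helper: `parse` is any callable that accepts a file object.
    """
    tracemalloc.start()
    try:
        # Hold on to the result so the parsed tree is alive when we sample.
        result = parse(io.BytesIO(data))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


# Synthetic stand-in for filelists.xml; the real file is far larger.
doc = b"<filelists>" + b"".join(
    b"<package name='pkg%d'><file>/usr/bin/pkg%d</file></package>" % (i, i)
    for i in range(5000)
) + b"</filelists>"

peak = peak_bytes(ET.parse, doc)
print("document: %d bytes, peak traced memory: %d bytes" % (len(doc), peak))
```

Any other parser that takes a file object could be passed in place of ET.parse, with the caveat above about what tracemalloc can and cannot see.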
I was really surprised to see it weighing in so much lighter than
libxml2. There was some follow up discussion on the xml-sig mailing
list between Fredrik and DV:
<http://www.mail-archive.com/xml-sig@python.org/msg00064.html>
The justification for less memory use in cET as opposed to libxml2
seems to be that cET creates C based python objects where libxml2 has
Python based wrappers over C based stuff:
(Quoting DV from thread above:)
> If you can build a C layer dedicated to Python you
> should be able to get better performances than a generic
> engine with autogenerated python bindings
It also looks like this might not be noticeable directly after parsing
a document because libxml2 doesn't create the Python wrappers until you
start digging into the tree.
At any rate, if you're seeing cET memory use dwarf that of libxml2, I
want to know about it. I guess I should have run a few tests myself...
> (Moving to yum-devel)
>
> Yes, I agree, but cElementTree provides iterparse() method, which
> effectively negates this problem: it allows you to keep just one package
> node at a time in memory, not all of them. For example, when using
> cElementTree.parse('primary.xml'), the memory goes up to 45M and is
> relative to the size of the xml file. When using
> cElementTree.iterparse('primary.xml'), it stays constant at ~9M, which is
> just slightly above
> the normal python footprint. Other than parse<->iterparse, the API
> otherwise stays effectively the same.
iterparse is nice. libxml2 has something similar with XmlReader.
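To make the one-package-at-a-time idea concrete, here is a minimal sketch. It assumes the iterparse API as it exists in the standard library's xml.etree.ElementTree; the sample document, its un-namespaced tag names, and the iter_package_names function are invented for illustration and are not taken from RepoView or the real primary.xml layout.

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for a repodata file; the real primary.xml uses namespaces.
SAMPLE = b"""<metadata>
  <package><name>yum</name><arch>noarch</arch></package>
  <package><name>repoview</name><arch>noarch</arch></package>
  <package><name>python</name><arch>i386</arch></package>
</metadata>"""


def iter_package_names(source):
    """Yield each package name while keeping roughly one <package> in memory."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "package":
            yield elem.findtext("name")
            # Drop the finished children so the tree does not keep growing.
            root.clear()


names = list(iter_package_names(io.BytesIO(SAMPLE)))
print(names)
```

The important part is the root.clear() call: without it, iterparse still builds the whole tree behind the scenes and memory grows with the file, just as it does with parse().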
> I should have used it in RepoView, but the idea of being fully
> compatible with pure-python ElementTree was too appealing, and the
> python version has no iterparse.
It does now:
<http://online.effbot.org/2005_03_01_archive.htm#elementtree-iterparse>
IIRC, he's rolling this into the python implementation soon.
> I'll do some tests with iterparse and mdparser and see how it fares
> against libxml2.
I'd love to see these as well. I also wanted to throw out that there's
the lxml.etree project that implements the ElementTree API on top of
libxml2 at the C layer:
<http://faassen.n--tree.net/blog/view/weblog/2005/01/08/0>
The nice thing here is that you get full XPath, XSLT, schema, etc.
support (not to mention automatic memory management) with an
ElementTree-like interface.
- Ryan