[Yum-devel] RepoView

Fri Mar 4 16:52:31 UTC 2005

On Mar 4, 2005, at 8:56 AM, Konstantin Ryabitsev wrote:
> On Thu, 2005-03-03 at 22:27 -0500, seth vidal wrote:
>> I did some import tests using cElementTree last night, it's, umm, big,
>> memory-wise. reading in filelists.xml.gz for only rawhide ate up 120M

I've been working under the impression that cET is the most efficient 
(memory-wise) XML parser available for Python, based on these numbers:

<http://effbot.org/zone/celementtree.htm#benchmarks>

Maybe we need to run some comparisons of our own if you're seeing 
something so different.
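
A comparison of our own could start from something like the sketch below. It is only a rough harness, not a real benchmark. It uses the standard-library `xml.etree.ElementTree` module in place of cElementTree, and `tracemalloc` to record peak allocation. The document layout and the helper names (`make_sample`, `peak_bytes`, `count_with_parse`) are made up for illustration. Note that `tracemalloc` only sees memory obtained through Python's own allocators, so it would likely under-report a parser such as libxml2 that allocates in C; for that case, process-level numbers (RSS) would be the fairer measure.

```python
import io
import tracemalloc
import xml.etree.ElementTree as ET


def make_sample(n_packages):
    """Build a small synthetic metadata document (hypothetical layout)."""
    parts = ["<metadata>"]
    for i in range(n_packages):
        parts.append(
            "<package><name>pkg%d</name><file>/usr/bin/pkg%d</file></package>"
            % (i, i)
        )
    parts.append("</metadata>")
    return "".join(parts).encode("utf-8")


def peak_bytes(func, *args):
    """Run func(*args) and return (result, peak bytes traced by tracemalloc)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_with_parse(data):
    """Load the whole tree into memory, then count the package elements."""
    tree = ET.parse(io.BytesIO(data))
    return len(tree.getroot().findall("package"))


data = make_sample(2000)
count, peak = peak_bytes(count_with_parse, data)
print("packages:", count, "peak bytes:", peak)
```

Swapping `count_with_parse` for a function built on another parser, and feeding it the real filelists.xml.gz, would give numbers that can be set side by side.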

I was really surprised to see it weighing in so much lighter than 
libxml2. There was some follow-up discussion on the xml-sig mailing 
list between Fredrik and DV:

<http://www.mail-archive.com/xml-sig@python.org/msg00064.html>

The justification for less memory use in cET as opposed to libxml2 
seems to be that cET creates C-based Python objects, whereas libxml2 
has Python-based wrappers over C-based structures:

(Quoting DV from thread above:)
 > If you can build a C layer dedicated to Python you
 > should be able to get better performances than a generic
 > engine with autogenerated python bindings

It also looks like this might not be noticeable directly after parsing 
a document because libxml2 doesn't create the Python wrappers until you 
start digging into the tree.

At any rate, if you're seeing cET memory use dwarf that of libxml2, I 
want to know about it. I guess I should have run a few tests myself...

> (Moving to yum-devel)
>
> Yes, I agree, but cElementTree provides iterparse() method, which
> effectively negates this problem: it allows you to keep just one
> package node at a time in memory, not all of them. For example, when
> using cElementTree.parse('primary.xml'), the memory goes up to 45M and
> is relative to the size of the xml file. When using
> cElementTree.iterparse('primary.xml'), it stays constant at ~9M, which
> is just slightly above the normal python footprint. Other than
> parse<->iterparse, the API otherwise stays effectively the same.

iterparse is nice. libxml2 has something similar with XmlReader.
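
For anyone who hasn't tried it, here is a minimal sketch of the one-package-at-a-time pattern Konstantin describes. It assumes the standard-library `xml.etree.ElementTree` module (which provides the same `iterparse()` call) rather than cElementTree, and a made-up `<metadata>`/`<package>` layout instead of the real primary.xml schema. `SAMPLE` and `iter_package_names` are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in document; the real primary.xml has a different schema.
SAMPLE = b"""<metadata>
  <package><name>foo</name><version>1.0</version></package>
  <package><name>bar</name><version>2.1</version></package>
  <package><name>baz</name><version>0.9</version></package>
</metadata>"""


def iter_package_names(source):
    """Yield each package name while keeping only one package in memory."""
    # Request "start" events as well so the root element can be captured
    # from the very first event.
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "package":
            yield elem.findtext("name")
            # Drop the finished package (and anything else hanging off
            # the root) so the tree does not grow with the file.
            root.clear()


names = list(iter_package_names(io.BytesIO(SAMPLE)))
print(names)  # ['foo', 'bar', 'baz']
```

Clearing the root, not just the element, is what keeps the footprint flat: otherwise every emptied `<package>` element would still be referenced from the root.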

> I should have used it in RepoView, but the idea of being fully
> compatible with pure-python ElementTree was too appealing, and the
> python version has no iterparse.

It does now:

<http://online.effbot.org/2005_03_01_archive.htm#elementtree-iterparse>

IIRC, he's rolling this into the pure-Python implementation soon.

> I'll do some tests with iterparse and mdparser and see how it fares
> against libxml2.

I'd love to see these as well. I also wanted to throw out that there's 
the lxml.etree project that implements the ElementTree API on top of 
libxml2 at the C layer:

<http://faassen.n--tree.net/blog/view/weblog/2005/01/08/0>

The nice thing here is that you get full XPath, XSLT, schema, etc. 
support (not to mention automatic memory management) with an 
ElementTree-like interface.

- Ryan