[Yum-devel] Re: Importing filelists.xml (and SAX)

Menno Smits menno-yum at freshfoo.com
Thu Feb 3 21:55:41 UTC 2005


(Sorry if this breaks threading. I'm out of town and can't get to my 
older mail so I've grabbed the message text out of the web archives.)

Seth Vidal wrote:
> On Sat, 2005-01-29 at 16:06 +1000, Menno Smits wrote:
>> Hi all,
>> 
>> I've been playing around with trying to speed up import of filelist data 
>> into sqlite. See the attached standalone POC script for details. I've 
>> used libxml's push parser (SAX) interface.
>> 
>> Here are my findings:
>> 
>> - Using the SAX parser greatly reduces memory usage and is quite fast.
>>    On my machine this script can parse a 39M filelists.xml in 7 secs and
>>    uses only ~8MB of memory if it's not writing to the database or
>>    otherwise storing the data. Has using SAX for the various XML
>>    metadata been considered for yum?  It could greatly reduce yum's
>>    memory footprint.
> 
> Have you looked at reading in the filelists.xml using the xmlReader
> stream parser w/o storing the data? That's what yum's using right now.
> Try running that w/o writing out the pickle. You'll notice it's only
> after the pickle is written that the memory size goes up. Also - try it
> on python 2.2 and watch the memory size for each import. It's
> irritating: memory usage is a lot better on python 2.2 than on 2.3.

You're right. On my machine the newTextReaderFilename interface is only 
0.3-0.4s slower than the SAX parser and uses minimal memory (like SAX). 
Since this interface is also easier to program for, ignore everything I 
said about using SAX!
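For reference, here's roughly what that looks like (a minimal sketch, not 
yum's actual code; walk_filelists and the callback are made up for 
illustration):

    import libxml2

    def walk_filelists(path, callback):
        # Stream filelists.xml with the xmlTextReader interface, calling
        # callback(pkgid, filename) for each <file> entry. Nothing is
        # accumulated, which is why memory stays flat.
        reader = libxml2.newTextReaderFilename(path)
        pkgid = None
        while reader.Read() == 1:
            if reader.NodeType() != 1:          # 1 == element start
                continue
            name = reader.Name()
            if name == 'package':
                pkgid = reader.GetAttribute('pkgid')
            elif name == 'file':
                # the path is the text node after the <file> start tag
                if reader.Read() == 1 and reader.NodeType() == 3:
                    callback(pkgid, reader.Value())

    walk_filelists('filelists.xml', lambda pkgid, fn: None)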

> How long and how much memory does it eat using the Sax parser if you go
> from sax->pickle?

I've just tested this using the same filelists.xml as before (605 
packages). I pulled the filelist data into a simple dict of lists using 
the SAX parser. At this point the program is using 56M of RAM and the 
process takes ~9s.

I then dumped it to disk using cPickle.dump(). This takes an additional 
~6s and memory usage jumps to 86M. So pickling definitely chews up some 
significant RAM. This pretty much confirms what you're saying.
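For the record, the test was along these lines (sketch only; the helper 
names are made up):

    import cPickle

    filelists = {}                        # pkgid -> list of file paths

    def remember_file(pkgid, filename):
        filelists.setdefault(pkgid, []).append(filename)

    # ... run the parse, feeding each <file> entry to remember_file ...

    out = open('filelists.pickle', 'wb')
    cPickle.dump(filelists, out, 1)       # 1 == binary pickle format
    out.close()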

>>    Is this really acceptable especially when metadata could change
>>    frequently?
> 
> Not terribly. However, if we can do an incremental import we might not
> have to do it very often, for example, if we check the checksum and it's
> not a new pkg then we don't have to touch it, b/c we should already have
> it and its contents in our database. But that initial import is going to
> be a bear.

Agreed.
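Something like this is what I picture for the skip-by-checksum step (a 
sketch written against the sqlite3 module from later Pythons, purely for 
illustration; the table layout is invented):

    import sqlite3

    def needs_import(cur, pkgid):
        # pkgid is the package checksum, so a hit means we already
        # have this package's filelist in the database.
        cur.execute('SELECT 1 FROM packages WHERE pkgid = ?', (pkgid,))
        return cur.fetchone() is None

    conn = sqlite3.connect('filelists.db')
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS packages (pkgid TEXT PRIMARY KEY)')
    for pkgid in ('abc123', 'def456'):    # stand-ins for checksums from the XML
        if needs_import(cur, pkgid):
            # ... parse and insert this package's filelist ...
            cur.execute('INSERT INTO packages (pkgid) VALUES (?)', (pkgid,))
    conn.commit()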

>>    Gijs has already done a lot of good work with sqlite but I think we
>>    should think about this some more before committing to it. I realise
>>    that filelist data is typically used less often but this wait is
>>    still fairly excessive.  Should we be investigating other options
>>    such as dbm style databases?
> 
> Yah, I just can't wait to have to deal with another Berkeley database.
> What a thrill.

Yep, I know... it's a big performance/cleanliness trade-off.
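For comparison, the dbm-style route would be something like this 
(hypothetical layout: one record per pkgid, paths joined with newlines):

    import anydbm

    db = anydbm.open('filelists.db', 'c')   # picks a dbm backend, maybe bsddb
    db['some-pkgid'] = '\n'.join(['/usr/bin/foo', '/usr/share/man/man1/foo.1'])
    files = db['some-pkgid'].split('\n')
    db.close()

Lookups by package would be fast, but anything resembling a cross-package 
query is gone, which is the cleanliness side of the trade-off.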

More to come later...

Menno




