[Yum-devel] Importing filelists.xml (and SAX)

seth vidal skvidal at phy.duke.edu
Sat Jan 29 06:47:52 UTC 2005

On Sat, 2005-01-29 at 16:06 +1000, Menno Smits wrote:
> Hi all,
> I've been playing around with trying to speed up import of filelist data 
> into sqlite. See the attached standalone POC script for details. I've 
> used libxml's push parser (SAX) interface.
> Here's my findings:
> - Using the SAX parser greatly reduces memory usage and is quite fast.
>    On my machine this script can parse a 39M filelists.xml in 7 secs and
>    uses only ~8MB of memory if it's not writing to the database or
>    otherwise storing the data. Has using SAX for the various XML
>    metadata been considered for yum?  It could greatly reduce yum's
>    memory footprint.
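The push-parser approach described above can be sketched roughly like this. The stdlib xml.sax interface stands in here for libxml2's push parser, and the element names (package/file) follow the filelists.xml format; the tiny inline document is just illustrative:

```python
import xml.sax

class FilelistsHandler(xml.sax.ContentHandler):
    """Collect file paths per package without building a full DOM."""
    def __init__(self):
        self.current_pkg = None
        self.in_file = False
        self.chars = []
        self.files = {}          # pkgid -> list of file paths

    def startElement(self, name, attrs):
        if name == 'package':
            self.current_pkg = attrs.get('pkgid')
            self.files[self.current_pkg] = []
        elif name == 'file':
            self.in_file = True
            self.chars = []

    def characters(self, content):
        # characters() may fire several times per text node, so buffer
        if self.in_file:
            self.chars.append(content)

    def endElement(self, name):
        if name == 'file':
            self.files[self.current_pkg].append(''.join(self.chars))
            self.in_file = False

handler = FilelistsHandler()
xml.sax.parseString(b"""<filelists>
  <package pkgid="abc123" name="foo" arch="i386">
    <version epoch="0" ver="1.0" rel="1"/>
    <file>/usr/bin/foo</file>
    <file type="dir">/usr/share/foo</file>
  </package>
</filelists>""", handler)
```

Because the handler only ever holds the current element's text, memory stays flat no matter how large the filelists.xml is.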

Have you looked at reading in the filelists.xml using the xmlReader
stream parser w/o storing the data? That's what yum's using right now.
Try running that w/o writing out the pickle. You'll notice it's only
after the pickle is written that the memory size goes up. Also - try it
on python 2.2 and watch the memory size for each import. It's
irritating - memory usage is a lot better on python 2.2 than on 2.3.
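For comparison, the xmlReader-style streaming read works something like the sketch below. ElementTree's iterparse from the stdlib stands in for libxml2's xmlTextReader: each fully-read <package> element is processed and then cleared, so nothing is retained between packages. The inline document is illustrative:

```python
import io
import xml.etree.ElementTree as ET

doc = io.BytesIO(b"""<filelists>
  <package pkgid="abc123" name="foo" arch="i386">
    <file>/usr/bin/foo</file>
  </package>
  <package pkgid="def456" name="bar" arch="i386">
    <file>/usr/bin/bar</file>
  </package>
</filelists>""")

seen = []
for event, elem in ET.iterparse(doc, events=('end',)):
    if elem.tag == 'package':
        seen.append((elem.get('pkgid'),
                     [f.text for f in elem.findall('file')]))
        elem.clear()    # drop the finished subtree so it isn't retained
```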

How long and how much memory does it eat using the SAX parser if you go
from SAX->pickle?

> - The purpose of this script was to go straight from XML into the
>    sqlite database to see how fast the data could be imported. I can't
>    think how the import could go much faster. Even so, the import of
>    this 39M filelists.xml still takes around 61s on my machine, and
>    this is for just _1_ repository.

oy. That is pretty bad.

>    Is this really acceptable especially when metadata could change
>    frequently?

Not terribly. However, if we can do an incremental import we might not
have to do it very often, for example, if we check the checksum and it's
not a new pkg then we don't have to touch it, b/c we should already have
it and its contents in our database. But that initial import is going to
be a bear.
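The incremental idea sketched above - skip any package whose checksum we already have - could look roughly like this. The table and column names are hypothetical, not yum's actual schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE packages (pkgid TEXT PRIMARY KEY)')
conn.execute('CREATE TABLE filelist (pkgid TEXT, path TEXT)')

def import_package(conn, pkgid, paths):
    """Insert a package's filelist unless its checksum is already known."""
    cur = conn.execute('SELECT 1 FROM packages WHERE pkgid = ?', (pkgid,))
    if cur.fetchone():
        return False     # unchanged package - nothing to do
    conn.execute('INSERT INTO packages VALUES (?)', (pkgid,))
    conn.executemany('INSERT INTO filelist VALUES (?, ?)',
                     [(pkgid, p) for p in paths])
    return True

first = import_package(conn, 'abc123', ['/usr/bin/foo'])
second = import_package(conn, 'abc123', ['/usr/bin/foo'])   # skipped
```

After the first full import, re-running against unchanged metadata touches the database only for the checksum lookups.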

>    Gijs has already done a lot of good work with sqlite but I think we
>    should think about this some more before committing to it. I realise
>    that filelist data is typically used less often but this wait is
>    still fairly excessive.  Should we be investigating other options
>    such as dbm style databases?

Yah, I just can't wait to have to deal with another berkeley database.
What a thrill.

The nicest thing I could see about sqlite is that it's not a mess to
program for.
