[Yum-devel] Importing filelists.xml (and SAX)
skvidal at phy.duke.edu
Sat Jan 29 06:47:52 UTC 2005
On Sat, 2005-01-29 at 16:06 +1000, Menno Smits wrote:
> Hi all,
> I've been playing around with trying to speed up import of filelist data
> into sqlite. See the attached standalone POC script for details. I've
> used libxml's push parser (SAX) interface.
> Here are my findings:
> - Using the SAX parser greatly reduces memory usage and is quite fast.
> On my machine this script can parse a 39M filelists.xml in 7 secs and
> uses only ~8MB of memory if it's not writing to the database or
> otherwise storing the data. Has using SAX for the various XML
> metadata been considered for yum? It could greatly reduce yum's
> memory footprint.
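The push-parsing approach described above can be sketched with Python's
stdlib SAX interface (the attached POC uses libxml's push parser, but the
handler structure is the same). The element names follow the repodata
filelists schema; the handler class itself is illustrative, not from the
attached script:

```python
import io
import xml.sax

# Minimal streaming handler for filelists.xml-style data: events are
# handled one at a time, so memory stays flat regardless of file size.
class FilelistHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.files = []          # collected (pkgid, filename) pairs
        self._pkgid = None
        self._in_file = False
        self._chars = []

    def startElement(self, name, attrs):
        if name == "package":
            self._pkgid = attrs.get("pkgid")
        elif name == "file":
            self._in_file = True
            self._chars = []

    def characters(self, content):
        # characters() may fire several times per text node; buffer them.
        if self._in_file:
            self._chars.append(content)

    def endElement(self, name):
        if name == "file":
            self.files.append((self._pkgid, "".join(self._chars)))
            self._in_file = False

xml_data = b"""<filelists>
  <package pkgid="abc123" name="foo">
    <file>/usr/bin/foo</file>
    <file>/etc/foo.conf</file>
  </package>
</filelists>"""

handler = FilelistHandler()
xml.sax.parse(io.BytesIO(xml_data), handler)
```

In a real import you would write each file entry out (or batch it) inside
endElement() instead of accumulating a list, which is what keeps the
footprint near-constant.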
Have you looked at reading in the filelists.xml using the xmlReader
stream parser w/o storing the data? That's what yum's using right now.
Try running that w/o writing out the pickle. You'll notice it's only
after the pickle is written that the memory size goes up. Also - try it
on python 2.2 and watch the memory size for each import. It's
irritating, but it's a lot better on python 2.2 than on 2.3.
How long does it take, and how much memory does it eat, using the SAX
parser if you go straight into the database?
> - The purpose of this script was to go straight from XML into the
> sqlite database to see how fast the data could be imported. I can't
> think how the import could go much faster. Even so, the import of
> this 39M filelists.xml still takes around 61s on my machine, and
> this is for just _1_ repository.
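For the straight-to-sqlite import, most of the wall-clock time in a naive
loop goes to per-row transaction commits. A sketch of the usual remedy,
batching all inserts in one transaction with executemany; the table and
column names here are illustrative, not yum's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filelist (pkgid TEXT, filename TEXT)")

# Rows as produced by the streaming parser.
rows = [("abc123", "/usr/bin/foo"), ("abc123", "/etc/foo.conf")]

# One transaction around the whole batch avoids a per-row fsync,
# which is usually the dominant cost of a row-at-a-time import.
with conn:
    conn.executemany("INSERT INTO filelist VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM filelist").fetchone()[0]
```

Whether this closes the gap on a 39M filelists.xml is something only
measurement can tell, but it is the first knob worth turning.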
oy. That is pretty bad.
> Is this really acceptable especially when metadata could change
Not terribly. However, if we can do an incremental import we might not
have to do it very often: if we check the checksum and it's not a new
pkg, then we don't have to touch it, because we should already have it
and its contents in our database. But that initial import is going to
be a bear.
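The checksum check described above amounts to a single lookup per package
before deciding to parse its file list. A minimal sketch, assuming a
hypothetical packages table keyed on the pkgid checksum (not yum's actual
schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (pkgid TEXT PRIMARY KEY)")
# Pretend one package was imported on a previous run.
conn.execute("INSERT INTO packages VALUES ('abc123')")

def needs_import(conn, pkgid):
    # A package only needs (re)importing if its checksum is unseen;
    # an unchanged package keeps the same pkgid across metadata pulls.
    row = conn.execute(
        "SELECT 1 FROM packages WHERE pkgid = ?", (pkgid,)).fetchone()
    return row is None

skipped = not needs_import(conn, "abc123")   # already in the database
fresh = needs_import(conn, "def456")         # new checksum, import it
```

With this in place, only the first import pays the full 61-second cost;
later runs touch just the packages whose checksums changed.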
> Gijs has already done a lot of good work with sqlite but I think we
> should think about this some more before committing to it. I realise
> that filelist data is typically used less often but this wait is
> still fairly excessive. Should we be investigating other options
> such as dbm style databases?
Yah, I just can't wait to have to deal with another berkeley database.
What a thrill.
The nicest thing I could see about sqlite is that it's not a mess to