seth vidal skvidal at phy.duke.edu
Fri Apr 30 23:57:51 UTC 2004

> First idea: I remember that hashes are fast to search, but,
> comparatively, very slow to grow.  To overcome this, most
> hash libraries let you define an initial size, which should
> be guessed large enough to accommodate all the entries,
> to avoid frequent time-consuming resizes while filling.
> Does your package offer this feature?

Have you programmed in Python before? A simple Python dict is what I'm
talking about. You can build up that sort of dict and traverse it but
you still have to:

open the package
get the data you want
put the data in the dict
close the package

Doesn't sound too bad - but the process for opening and looking through
a package does take some time.
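
A rough sketch of the four steps above, to show where the time goes.
read_header here is a hypothetical callable standing in for whatever
actually opens a package, pulls out the data and closes it again; it is
not a real rpm or yum function, and fake_read_header only simulates the
per-package cost.

```python
import time


def build_index(paths, read_header):
    """Build a dict of package data keyed by package path.

    read_header is a hypothetical callable: it opens one package,
    returns the data we want as a dict, and closes the package.
    """
    index = {}
    for path in paths:
        # open the package, get the data, close the package
        index[path] = read_header(path)
    return index


def fake_read_header(path):
    # Stand-in for the real (slow) header read. The sleep only
    # simulates the cost of opening and scanning one package.
    time.sleep(0.001)
    filename = path.rsplit("/", 1)[-1]
    return {"name": filename.split("-")[0], "file": filename}


if __name__ == "__main__":
    pkgs = ["/repo/foo-1.0-1.noarch.rpm", "/repo/bar-2.1-3.i386.rpm"]
    index = build_index(pkgs, fake_read_header)
    print(index["/repo/foo-1.0-1.noarch.rpm"]["name"])
```

The dict itself is cheap to fill and to search; nearly all of the time
in a loop like this would be spent inside the read_header calls.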

> Second idea: you mentioned package traversal as time consuming.
> Is this time spent to open each package as a DB, grab the
> info, close it?  If this is the case, have you then considered
> building a cache of package contents, which can be updated
> and used in subsequent runs, to take advantage of the fact that
> most (if not all) the packages do not change between yum runs?

In this case it's opening up each header, getting the data and moving
along, but yes, it can take some time to search each one.

Where do you store that cache? How do you store it? How do you update it
to make sure it's not out of sync with the repository w/o reindexing all
the headers/packages? Feel free to answer any/all of those questions.

Some of these have already been addressed - many of them are why I spent
so much time working on the xml-metadata to sort out
easier/faster/better ways of indexing the packages so yum can:

1. know if there are changes
2. more easily traverse the packages and the metadata
3. have smaller amounts of data to download and sort through on any run.
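
One possible way to handle point 1, as a sketch only: it assumes the
client keeps the digest of each metadata file from its last run and
compares it against the current file. The function names and the choice
of SHA-1 are illustrative assumptions, not a description of how the
xml-metadata actually records changes.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def metadata_changed(path, last_digest):
    """True if the file's digest differs from the one saved last run."""
    return file_digest(path) != last_digest
```

If the digest matches the saved one, nothing needs to be re-read on
that run.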

Right now I'm making those changes work, then I'm going to focus on
trimming time out of each session. It will still be some time b/c I'm
working on this as I can.

If you want to be a big help, don't look at speed improvements for the
2.0.X branch. I don't want to spend more time on 2.0.X if it is at
all possible. A lot of things in the structure have changed and cvs-HEAD
is where I'm trying to work the most. When I have a snapshot that does
some useful things I'll be sure to announce it here and on yum-devel.

If you're a Python programmer and you're familiar with libxml2 - then
take a look at http://linux.duke.edu/metadata/generate/ - feel free to
make that code:

 1. look for an existing repodata dir
 2. if it finds one - use the xml files there to speed up the
creation of the new metadata for that repository.
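
To give an idea of the shape of those two steps, here is a minimal
sketch. It uses the standard library's xml.etree.ElementTree instead of
libxml2 so that it stands alone. The file name packages.xml and the
href/size/mtime attributes are made up for the example and are not the
real metadata layout. Comparing size and mtime is just one cheap way to
guess that a package has not changed.

```python
import os
import xml.etree.ElementTree as ET


def load_old_entries(repo_dir, filename="packages.xml"):
    """Return {href: (size, mtime, element)} from an existing repodata dir.

    Returns an empty dict when there is no old metadata to reuse.
    The file name and attribute names are illustrative only.
    """
    xml_path = os.path.join(repo_dir, "repodata", filename)
    if not os.path.isfile(xml_path):
        return {}
    entries = {}
    for pkg in ET.parse(xml_path).getroot().iter("package"):
        size = int(pkg.get("size"))
        mtime = int(pkg.get("mtime"))
        # Keep the element so it could be copied into the new metadata.
        entries[pkg.get("href")] = (size, mtime, pkg)
    return entries


def plan_update(repo_dir, rpm_names):
    """Split packages into (reuse, reread) lists based on old metadata."""
    old = load_old_entries(repo_dir)
    reuse, reread = [], []
    for name in rpm_names:
        st = os.stat(os.path.join(repo_dir, name))
        known = old.get(name)
        if known and known[0] == st.st_size and known[1] == int(st.st_mtime):
            reuse.append(name)   # unchanged: old entry can be reused
        else:
            reread.append(name)  # new or changed: header must be read
    return reuse, reread
```

Only the packages in the reread list would need their headers opened
again; everything in reuse could be carried over from the old xml.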


