[Yum-devel] Importing filelists.xml (and SAX)

Sat Jan 29 13:44:58 UTC 2005

On Sat, 29 Jan 2005 16:06:51 +1000, Menno Smits <menno-yum at freshfoo.com> wrote:
> Here's my findings:
> - The purpose of this script was to go straight from XML into the
>    sqlite database to see how fast the data could be imported. I can't
>    think how the import could go much faster. Even so, the import of
>    this 39M filelists.xml still takes around 61s on my machine, and
>    this is for just _1_ repository.
Hi Menno,

If you don't use the quoting provided by the python sqlite module, but
do your own quoting instead, you can probably shave another 10 seconds
off, but that's about it I'm afraid. I've run a test that splits
filenames into directory and filename, but that doesn't help much. What
could help would be to organize the table like this:

fileKey INTEGER PRIMARY KEY,
pkgKey INTEGER,
directory TEXT,
files TEXT,
filetypes TEXT

Populated like this:
directory: /etc/samba
files: file1|file2|link3|file4
filetypes: file|file|link|file

Because requirements for files always include the full pathname, this
would still allow for fast searching (given an index on directory),
and it would drastically reduce the number of database entries. I'm
just making this up as I type, so there could be something I'm
overlooking, and I don't know how much of a speed gain this would be.
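
A minimal sketch of the layout above, using Python's sqlite3 module.
The table name (filelist), the index name and the helper functions are
made up for illustration; none of this is taken from the yum code, and
it assumes filenames never contain a '|' character.

```python
import posixpath
import sqlite3


def create_filelist_table(conn):
    # Hypothetical table name; the columns follow the proposal above.
    conn.execute(
        "CREATE TABLE filelist ("
        " fileKey INTEGER PRIMARY KEY,"
        " pkgKey INTEGER,"
        " directory TEXT,"
        " files TEXT,"
        " filetypes TEXT)"
    )
    # The index on directory is what keeps full-path lookups fast.
    conn.execute("CREATE INDEX filelist_directory ON filelist (directory)")


def add_directory(conn, pkg_key, directory, entries):
    """Store one row per (package, directory).

    entries is a list of (name, filetype) pairs. Assumes that names
    never contain '|', which is used as the separator.
    """
    names = "|".join(name for name, _ in entries)
    types = "|".join(ftype for _, ftype in entries)
    conn.execute(
        "INSERT INTO filelist (pkgKey, directory, files, filetypes)"
        " VALUES (?, ?, ?, ?)",
        (pkg_key, directory, names, types),
    )


def packages_providing(conn, path):
    """Return the sorted pkgKeys of packages that contain the full path."""
    directory, name = posixpath.split(path)
    rows = conn.execute(
        "SELECT pkgKey, files FROM filelist WHERE directory = ?",
        (directory,),
    )
    # Split the packed string so that only exact filename matches count.
    return sorted({pkg for pkg, files in rows if name in files.split("|")})
```

With this layout the /etc/samba example ends up as a single row, and a
lookup of /etc/samba/link3 only has to unpack the rows for that one
directory. Whether this is actually faster to import is untested.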

>    Is this really acceptable especially when metadata could change
>    frequently?

The idea is to update the sqlite cache by processing changes only.
This is reasonably fast on my machine, but I think we should move to a
situation where the user runs yum makecache from some sort of cron job.
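
To illustrate what "processing changes only" could look like, here is
a rough sketch. It assumes every package carries a unique identifier
(such as the pkgid checksum in the metadata) and that the identifiers
already in the cache can be listed; plan_cache_update is a hypothetical
helper, not an existing yum function.

```python
def plan_cache_update(cached_ids, new_ids):
    """Work out which packages to add to and remove from the cache.

    cached_ids: identifiers of the packages currently in the cache.
    new_ids: identifiers found in the freshly downloaded metadata.
    Returns (to_add, to_remove) as sorted lists; packages present in
    both are left untouched, which is where the time is saved.
    """
    cached = set(cached_ids)
    new = set(new_ids)
    to_add = sorted(new - cached)
    to_remove = sorted(cached - new)
    return to_add, to_remove
```

Only the packages in to_add would need their file lists parsed and
inserted, and only those in to_remove would need to be deleted.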

BTW Seth, is it possible to tell yum not to check whether the metadata
is still up to date, apart from using the -C option? Very often I just
need to install a package from base, and I find myself delayed for 20
seconds or so because there are new updates or a few new packages in
some 3rd party repo.

>    Gijs has already done a lot of good work with sqlite but I think we
>    should think about this some more before committing to it. I realise
>    that filelist data is typically used less often but this wait is
>    still fairly excessive.  Should we be investigating other options
>    such as dbm style databases?

Another thing we could look at is the cache that smart uses, which is
a custom cache module written in C. I haven't looked at this cache
code yet, but smart has low memory usage and imports metadata
relatively fast, so it might be an alternative.

Greets,
  Gijs