[Yum-devel] yum on an olpc machine (slooooooooooow)

Konstantin Ryabitsev icon at fedoraproject.org
Mon Dec 18 04:09:32 UTC 2006


On 12/17/06, seth vidal <skvidal at linux.duke.edu> wrote:
> we store the checksum of the old db (compressed and uncompressed) as
> well as the new db file in repomd.xml
>
> yum would download repomd.xml - if the checksum of its sqlite db files
> is the same as the old checksum, then it downloads the sqlite
> transaction diff b/c it can use it. when it doesn't match then grab the
> whole file.

Actually, it doesn't have to be that way. You can keep the "diffs" all
the way back to the first run of createrepo. E.g.:

initial run:
create primary.sqlite

during the next run createrepo does effectively the same thing yum
does and instead of blowing away the old primary.sqlite, it does
INSERT/DELETE operations, while creating changes.sqlite, which
contains a table something like:

|createrepo run timestamp|action(add/delete)|pkgdata[....]|

So, let's say the initial primary.sqlite run was at 0 unix seconds.
Next time we run createrepo, we have updated pkgA to 1.1 and removed
pkgA-1.0:

createrepo run at 111111 unix seconds:
|111111|add|pkgA-1.1|
|111111|del|pkgA-1.0|

createrepo run at 222222 unix seconds:
|222222|add|pkgB-3.0|
|222222|del|pkgB-2.5|

createrepo run at 333333 unix seconds:
|333333|add|pkgC-1.2|
|333333|add|pkgB-2.6|

...
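
A minimal sketch of the createrepo side of this, in Python with the
stdlib sqlite3 module. The table and column names (packages, changes,
run_ts, action, pkgdata) and the record_run() helper are made up for
illustration; they are not createrepo's actual schema or API.

```python
import sqlite3


def record_run(primary, changes, run_ts, added, removed):
    """Apply one createrepo run to primary and log each step in changes.

    primary and changes are open sqlite3 connections. The schema here is
    hypothetical: one text column stands in for the real package data.
    """
    primary.execute(
        "CREATE TABLE IF NOT EXISTS packages (pkgdata TEXT PRIMARY KEY)"
    )
    changes.execute(
        "CREATE TABLE IF NOT EXISTS changes "
        "(run_ts INTEGER, action TEXT, pkgdata TEXT)"
    )
    for pkg in added:
        # INSERT into primary, and record the same action in the change log
        primary.execute("INSERT INTO packages VALUES (?)", (pkg,))
        changes.execute(
            "INSERT INTO changes VALUES (?, 'add', ?)", (run_ts, pkg)
        )
    for pkg in removed:
        # DELETE from primary, and record the same action in the change log
        primary.execute("DELETE FROM packages WHERE pkgdata = ?", (pkg,))
        changes.execute(
            "INSERT INTO changes VALUES (?, 'del', ?)", (run_ts, pkg)
        )
    primary.commit()
    changes.commit()
```

With this, the run at 111111 above would be
record_run(primary, changes, 111111, ["pkgA-1.1"], ["pkgA-1.0"]).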

primary.sqlite contains the timestamp when it was generated last. So,
if clientA downloads primary.sqlite when it was at 111111 unix
seconds, and then gets the changes.sqlite some time later, at 333333
unix seconds, it knows exactly what happened to primary.sqlite between
these two revisions and what it needs to do to get from 111111 to
333333.
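
On the client side, replaying the history could look roughly like the
sketch below. Again, apply_changes() and the packages/changes tables
are assumptions for the sake of the example, not anything yum has
today.

```python
import sqlite3


def apply_changes(local_primary, changes, last_seen_ts):
    """Replay every change newer than last_seen_ts onto the local db.

    Returns the timestamp of the newest change applied, or last_seen_ts
    if there was nothing to do. Table and column names are hypothetical.
    """
    rows = changes.execute(
        "SELECT run_ts, action, pkgdata FROM changes "
        "WHERE run_ts > ? ORDER BY run_ts, rowid",
        (last_seen_ts,),
    ).fetchall()
    new_ts = last_seen_ts
    for run_ts, action, pkgdata in rows:
        if action == "add":
            local_primary.execute(
                "INSERT OR REPLACE INTO packages VALUES (?)", (pkgdata,)
            )
        elif action == "del":
            local_primary.execute(
                "DELETE FROM packages WHERE pkgdata = ?", (pkgdata,)
            )
        new_ts = run_ts
    local_primary.commit()
    return new_ts
```

A client that last synced at 111111 would call
apply_changes(local_primary, changes, 111111) and store the returned
timestamp for the next run.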

In other words, changes.sqlite contains the entire history of what
happened to primary.sqlite from the time it was first generated up to
the last createrepo run.

If at some point changes.sqlite becomes larger than primary.sqlite,
then it should be blown away and started over, because any
bandwidth-saving benefits would be moot. After such a reset, repomd.xml
will contain no "changes.sqlite" entry, so clients will know to download
the primary db. If they get the changes.sqlite after it's started all over
again, it would be easy for them to "see" that the "initial primary
run" timestamp is after the last timestamp they have on record for
that repository (hence, continuity is broken), so they should discard
the downloaded changes.sqlite and download the primary db to start the
process over again.
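
The reset rule and the client's continuity check boil down to two
small decisions, sketched here in Python. The function names and
arguments are hypothetical; how the sizes and timestamps would really
be read from repomd.xml and the sqlite files is left open.

```python
def should_reset(changes_size, primary_size):
    """createrepo side: start changes.sqlite over once it outgrows primary."""
    return changes_size > primary_size


def choose_update(has_changes_entry, changes_start_ts, last_seen_ts):
    """Client side: return "full" or "incremental".

    has_changes_entry -- whether repomd.xml lists a changes.sqlite
    changes_start_ts  -- the "initial primary run" timestamp the changes
                         file starts from (None if there is no file)
    last_seen_ts      -- last timestamp the client has on record for
                         this repository (None if it never synced)
    """
    if last_seen_ts is None or not has_changes_entry:
        return "full"
    if changes_start_ts > last_seen_ts:
        # History was restarted after our last sync: continuity is broken.
        return "full"
    return "incremental"
```

"full" means discard any downloaded changes.sqlite and fetch the
primary db; "incremental" means the diffs can be replayed.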

This might seem complex, but it really isn't. The database operations
for createrepo are limited to 2 simple actions -- insert row and delete
row, which are simple to record in changes.sqlite. Using timestamps
should help clients track how many transactions from changes.sqlite
they need to rerun to get the latest changes to the repository.

This shouldn't be too hard to implement, and once done, the benefit
for large repositories like fedora extras would be very significant,
since that would cut down on both download size and parsing speed --
the things everyone complains about the most.

Does that sound sane? Do I need to elaborate on some points? I'm up a
bit late, so I may not come across as clearly as I might hope. :)

Regards,
-- 
Konstantin Ryabitsev
Montréal, Québec

