[Yum-devel] yum on an olpc machine (slooooooooooow)
seth vidal
skvidal at linux.duke.edu
Mon Dec 18 05:56:47 UTC 2006
On Sun, 2006-12-17 at 23:09 -0500, Konstantin Ryabitsev wrote:
> On 12/17/06, seth vidal <skvidal at linux.duke.edu> wrote:
> > we store the checksum of the old db (compressed and uncompressed) as
> > well as the new db file in repomd.xml
> >
> > yum would download repomd.xml - if the checksum of its sqlite db files
> > is the same as the old checksum, then it downloads the sqlite
> > transaction diff b/c it can use it. when it doesn't match then grab the
> > whole file.
>
> Actually, it doesn't have to be that way. You can keep the "diffs" all
> the way back to the first run of createrepo. E.g.:
>
> initial run:
> create primary.sqlite
>
> during the next run createrepo does effectively the same thing yum
> does and instead of blowing away the old primary.sqlite, it does
> INSERT/DELETE operations, while creating changes.sqlite, which
> contains a table something like:
>
> |createrepo run timestamp|action(add/delete)|pkgdata[....]|
>
> So, let's say the initial primary.sqlite run was at 0 unix seconds.
> Next time we run createrepo, we have updated pkgA to 1.1 and removed
> pkgA-1.0:
>
> createrepo run at 111111 unix seconds:
> |111111|add|pkgA-1.1|
> |111111|del|pkgA-1.0|
>
> createrepo run at 222222 unix seconds:
> |222222|add|pkgB-3.0|
> |222222|del|pkgB-2.5|
>
> createrepo run at 333333 unix seconds:
> |333333|add|pkgC-1.2|
> |333333|add|pkgB-2.6|
>
> ...
>
> primary.sqlite contains the timestamp when it was generated last. So,
> if clientA downloads primary.sqlite when it was at 111111 unix
> seconds, and then gets the changes.sqlite some time later, at 333333
> unix seconds, it knows exactly what happened to primary.sqlite between
> these two revisions and what it needs to do to get from 111111 to
> 333333.
>
> In other words, changes.sqlite contains the entire history of what
> happened to primary.sqlite from the time it was first generated
> until the last createrepo run.
>
> If at some point changes.sqlite becomes larger than primary.sqlite,
> then it should be blown away and started over, because any
> bandwidth-saving benefits would be moot. The repomd.xml will contain
> no "changes.sqlite" entry, so clients will know to download the
> primary db. If they get the changes.sqlite after it's started all over
> again, it would be easy for them to "see" that the "initial primary
> run" timestamp is after the last timestamp they have on record for
> that repository (hence, continuity is broken), so they should discard
> the downloaded changes.sqlite and download the primary db to start the
> process over again.
>
> This might seem complex, but it really isn't. The database operations
> for createrepo are limited to 2 simple actions -- insert row and delete
> row, which are simple to record in changes.sqlite. Using timestamps
> should help clients track how many transactions from changes.sqlite
> they need to rerun to get the latest changes to the repository.
>
> This shouldn't be too hard to implement, and once done, the benefit
> for large repositories like fedora extras would be very significant,
> since that would cut down on both download size and parsing speed --
> the things everyone complains about the most.
I'm curious how quickly the changes files would get big, but it seems
like a worthwhile thing to try out.
Would you be interested in working on the above for createrepo?
-sv
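
For illustration only, here is a rough Python sketch of the client side of
the scheme quoted above: check continuity, then replay the add/del rows
newer than the timestamp stored in the local primary.sqlite. Neither
createrepo nor yum has this code. The table and column names (packages,
meta.generated, meta.initial_ts, changes.run_ts/action/pkg) and both
function names are made up for the example, and in-memory databases stand
in for the real files.

```python
import sqlite3


def make_demo_dbs():
    """Build in-memory stand-ins for a client's primary.sqlite (as of
    111111) and a freshly downloaded changes.sqlite. Hypothetical schema."""
    primary = sqlite3.connect(":memory:")
    primary.execute("CREATE TABLE packages (pkg TEXT PRIMARY KEY)")
    primary.execute("CREATE TABLE meta (generated INTEGER)")
    primary.executemany("INSERT INTO packages VALUES (?)",
                        [("pkgA-1.1",), ("pkgB-2.5",)])
    primary.execute("INSERT INTO meta VALUES (111111)")

    changes = sqlite3.connect(":memory:")
    changes.execute("CREATE TABLE meta (initial_ts INTEGER)")
    changes.execute("CREATE TABLE changes (run_ts INTEGER, action TEXT, pkg TEXT)")
    changes.execute("INSERT INTO meta VALUES (0)")  # initial primary run
    changes.executemany("INSERT INTO changes VALUES (?, ?, ?)", [
        (111111, "add", "pkgA-1.1"), (111111, "del", "pkgA-1.0"),
        (222222, "add", "pkgB-3.0"), (222222, "del", "pkgB-2.5"),
        (333333, "add", "pkgC-1.2"), (333333, "add", "pkgB-2.6"),
    ])
    return primary, changes


def update_primary(primary, changes):
    """Replay unseen changes onto primary. Return False when continuity is
    broken and the caller should download the whole primary db instead."""
    (last_seen,) = primary.execute("SELECT generated FROM meta").fetchone()
    (initial_ts,) = changes.execute("SELECT initial_ts FROM meta").fetchone()
    if initial_ts > last_seen:
        # changes.sqlite was started over after our copy was generated.
        return False

    rows = changes.execute(
        "SELECT run_ts, action, pkg FROM changes "
        "WHERE run_ts > ? ORDER BY run_ts, rowid", (last_seen,)).fetchall()
    for run_ts, action, pkg in rows:
        if action == "add":
            primary.execute("INSERT OR IGNORE INTO packages VALUES (?)", (pkg,))
        elif action == "del":
            primary.execute("DELETE FROM packages WHERE pkg = ?", (pkg,))
        else:
            raise ValueError("unknown action: %r" % (action,))
        last_seen = run_ts
    primary.execute("UPDATE meta SET generated = ?", (last_seen,))
    primary.commit()
    return True
```

With the example data from the thread, a client sitting at 111111 skips the
111111 rows, replays the 222222 and 333333 rows, and records 333333 as its
new timestamp. A False return corresponds to the "discard changes.sqlite
and download the primary db" case.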