[Yum-devel] delta repodata

seth vidal skvidal at fedoraproject.org
Fri Mar 11 16:15:37 UTC 2011


Passing along this longer version of Martin's idea for providing deltas
of repodata for yum.

This idea came up at FUDCon in AZ, and I think it may be a great goal
for F16 or beyond.

Anyone have any thoughts on this?

-sv



-------- Forwarded Message --------
From: Martin Langhoff <martin at laptop.org>
To: seth vidal <skvidal at fedoraproject.org>
Subject: Re: git-pack repodata
Date: Sat, 12 Feb 2011 11:36:15 -0500

Hi Seth,

yes, I am definitely interested in helping you guys. I am under a ton
of work (OLPC, a 14-month-old boy) so my latency is high. This has
been in the back of my mind, burning slowly over the last couple of
weeks.

If the notes below make sense, feel free to circulate (if possible,
keep me CC'd in the discussion). If they don't make sense, feel free
to ignore :-)

 - braindump follows -

If I understood correctly, you have

 - rawhide - moves quickly, discards old rpms
 - fedora release repos -- snapshot in time, does not move
 - fedora updates and updates-testing repos -- moves, discards old rpms
 - rhel repo -- moves, never forgets

I assume other repos -- EPEL, etc. -- follow one of the models above.
Are there any other models to worry about? And we hope to optimize, in
order of importance, for these use cases...

 - fedora updates and rhel repos -- this is what users hit the most.
This is what causes "yum is so slow, apt rulez" nags.
 - rawhide -- devs hit this, but it's a world of pain anyway
 - release repos -- if we shrink the data, it benefits network installs

Based on this, my current thinking is as follows:

 - the interesting case is "moving" repos, in two variants:
'forgetful' and not forgetful

 - distributing deltas is clearly needed
   - the repo's history is very long, so you need recent "snapshots"
to base the deltas on -- akin to keyframes in video compression
   - those snapshots make it easy for clients to check they haven't
gotten corrupted
   - any repo state you publish must be verifiable (but that might be
costly for the client side)

So what I would draft at the moment is a scheme:

 - preserving the use of sqlite

 - generating a "delta" file for sqlite -- it can be drafted as SQL
insert/update statements (a sketch follows this list), but we can
improve on this. The sqlite3 format might be friendly to xdelta
binary patches.

 - define a period between keyframes -- a week? -- the period is
controlled by the repo side, naturally

 - during the time between keyframes, the repo update tools create a
delta file based on the last valid keyframe

 - clients keep a pristine copy of the 'keyframe' sqlite and apply the
delta to a working copy.

 - at new-keyframe prep time, the repo tools generate a
keyframe-to-keyframe delta so that clients can upgrade their pristine
keyframe; paranoid or bandwidth-blessed clients can fetch the new
keyframe file outright instead of the delta

 - a facility needs to be added to export the data in sqlite to a
stable, checksummable format (also sketched below) -- at OLPC we use a
variant of JSON called CJSON (for Canonical JSON) -- this is more
compact than XML, and has a single canonical representation for a
given dataset (unlike XML, where for example it is legal to add or
change whitespace in some parts of the file)
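
To make the client side of that scheme concrete, here is a minimal
Python sketch of the simplest variant (a delta shipped as plain SQL
statements). The function name and file layout are made up for
illustration -- this is not an existing yum API:

    import shutil
    import sqlite3

    def apply_delta(keyframe_db, delta_sql, work_db):
        # keep the pristine keyframe untouched; patch a copy of it
        shutil.copyfile(keyframe_db, work_db)
        con = sqlite3.connect(work_db)
        with open(delta_sql) as f:
            # delta assumed to be plain INSERT/UPDATE/DELETE statements
            con.executescript(f.read())
        con.commit()
        con.close()

Note that a database patched this way is not necessarily byte-identical
across clients (different sqlite versions may lay out pages
differently), which is exactly why the stable export below is needed
for verification.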
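
And a sketch of the canonical-export idea, using Python's stdlib json
rather than OLPC's actual CJSON code (so an approximation, not the
real format): sorted keys and no optional whitespace mean every client
produces the same bytes for the same dataset, so one checksum can
verify the state.

    import hashlib
    import json

    def canonical_dump(obj):
        # one canonical byte string per dataset:
        # sorted keys, no optional whitespace
        return json.dumps(obj, sort_keys=True,
                          separators=(',', ':')).encode('utf-8')

    def state_checksum(obj):
        return hashlib.sha256(canonical_dump(obj)).hexdigest()

The repo side would publish state_checksum() of the exported package
data; a client exports its patched database the same way and compares.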

One thing I like about this plan is that you can draft it as a yum
wrapper (or a plugin). No need to rework the guts of yum to do
something other than sqlite, and no need for huge, risky
break-the-world patches or branches.

You'll notice I haven't mentioned git :-) My experiments left me *very*
impressed with sqlite's compact on-disk representation -- sqlite is
generally fast, so I had made assumptions about the disk space vs
performance vs ACIDity tradeoffs; I had assumed the disk format was
bloated (for good reasons).

If you use git you'll also need a keyframe + delta scheme, so either
way you need to drive git or sqlite with custom code. So the payoff
isn't clear.

Sqlite and xdelta notes --

There are perhaps good opportunities in trying xdelta patches to
handle sqlite updates -- with the scheme I propose above, the repo
management tools will have to create the sqlite db once and then apply
updates (right now they regenerate it from scratch, from XML, so the
files are 'different' every time).

These "updated" sqlite databases are likely to be xdelta friendly.
Perhaps not running vacuum helps their xdelta-friendliness -- but
there'll be a space usage tradeoff.
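
A rough sketch of that track, assuming the xdelta3 command-line tool
(the file names are made up):

    import subprocess

    # repo side: encode a binary diff of the new db against the keyframe
    subprocess.check_call(['xdelta3', '-e', '-s',
                           'primary-keyframe.sqlite',
                           'primary-new.sqlite',
                           'primary.sqlite.vcdiff'])

    # client side: reconstruct the new db from keyframe + delta
    subprocess.check_call(['xdelta3', '-d', '-s',
                           'primary-keyframe.sqlite',
                           'primary.sqlite.vcdiff',
                           'primary-new.sqlite'])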

If you use xdelta patches, then just checksumming the sqlite DB is all
you need to validate the state. No CJSON export required...
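
Since xdelta reconstructs the target file bit-for-bit, validation
reduces to a plain file hash. A sketch (where the checksum gets
published is an assumption):

    import hashlib

    def file_sha256(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 16), b''):
                h.update(chunk)
        return h.hexdigest()

    def verify(path, published_checksum):
        # published_checksum would ship in the repo metadata
        # (repomd.xml-style), alongside the delta
        if file_sha256(path) != published_checksum:
            raise RuntimeError('%s does not match published state' % path)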

Sqlite developers might have some recommendations on this track -- I
find this discussion very interesting, though it may be outdated:
http://www.mail-archive.com/sqlite-users@sqlite.org/msg12841.html

 - - - hope that was interesting -- my wife is travelling (Liberia;
she also works for OLPC and is doing groundwork there to get XOs in
the field...) and Mr. Alessandro's woken up. Gotta give him some time
now...

cheers,



m



