[Yum] metadata compression

Mon Apr 20 16:27:43 UTC 2009

On Monday 20 April 2009, James Antill wrote:
> Ville Skyttä <ville.skytta at iki.fi> writes:
>
> > Regarding CPU requirements, xz/lzma should be much better on metadata
> > consumer boxes than bzip2, and somewhat more memory intensive but I doubt
> > this would matter much if any at all as long as lzma compression levels
> > are kept at sane values.
>
>  The 25-35% savings on .sqlite were at -9 ... what do you mean by
> "sane values" here?

That would need to be tested with typical sqlite repodata, but based on the 
lzma benchmarks URL I posted (http://tukaani.org/lzma/benchmarks) I'd guess 
levels <= 7.
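
A rough sketch of how such a test could be run, assuming a Python whose
standard library ships the bz2 and lzma modules (the lzma module postdates
this thread, so that is an assumption, not what yum has available today).
The function name measure() and the chosen levels are purely illustrative;
it compares bzip2 -9 against a few xz presets on the same input:

```python
import bz2
import lzma
import time


def measure(data, levels=(1, 3, 5, 7, 9)):
    """Return (name, compressed_size, seconds) rows for bzip2 -9 and xz presets."""
    rows = []

    # Baseline: bzip2 at its highest level.
    start = time.perf_counter()
    packed = bz2.compress(data, 9)
    rows.append(("bzip2 -9", len(packed), time.perf_counter() - start))

    # xz at each requested preset; higher presets cost more CPU and memory.
    for level in levels:
        start = time.perf_counter()
        packed = lzma.compress(data, preset=level)
        rows.append(("xz -%d" % level, len(packed), time.perf_counter() - start))

    return rows
```

Feeding it the bytes of a typical primary.sqlite and printing the rows would
show where the size gain flattens out relative to the time spent.  It only
times compression; decompression cost and memory use on the consumer side
would need a separate measurement.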

>  Sure, people like choice in lots of things, but those choices have to
> be paid for. For instance some people like to choose to access rawhide
> from apt, or a random RHEL-5 version of yum.
>  So do we now keep N versions of all the .sqlite files, for each
> compression flavor and allow people to choose how many N versions
> (forwards and backwards) to generate?

IMHO it is fair enough to let people with such corner cases know that they'll 
be served with XML metadata or need to upgrade their depsolver (or install 
another copy alongside the distro one for this purpose).

> -- Content-Encoding didn't work so well with that much choice.

Just out of interest, why was that?  Due to it requiring special web server 
config and not being available for protocols other than HTTP, or something 
else?  (I'm assuming you don't mean compressing the database on the fly - I 
can see why that wouldn't be feasible.)

> > e.g. even if the CPU/memory requirements would be a problem
> > for boxes composing something large like Fedora Rawhide all the time, at
> > least for immutable final release repos it should be doable, ditto for
> > many scenarios between these extremes.
>
>  Exactly the opposite, IMNSHO. I download rawhide metadata a couple of
> times a week ...

Yes, of course that would be the scenario benefiting most from the improved 
compression.  But as I tried to explain above, if you can't have that, there 
are still other use cases that could benefit.

> I download "fedora" metadata somewhere between 0 and
> 1 times. I'd be happy with no compression at all there, I think.

Yes, there are quite a few different scenarios.  But I don't think you are 
representative of the general use case: you've been spoiled by fast network 
connectivity if, given the choice, you'd be happy with no compression even for 
those infrequently downloaded files.  The space savings would be useful on DVD 
images etc. as well.

> > Regarding code requirements, if yum devs don't feel like implementing it,
> > I'm sure the code will just magically appear somewhere if there's a clear
> > green light given by the yum devs and when xz and its python bindings
> > reach a stable release.
>
>  It's not like we know what the code will look like, although we can
> imagine. For instance if you think it's adding an import or two and
> doing some code in yum like:
>
> if url.endswith(".lz"): uncompress_lzip()
> if url.endswith(".bz2"): uncompress_bzip2()
> if url.endswith(".gz"): uncompress_gzip()

I haven't really even thought about it and it's pretty unlikely that I will 
spend time on doing that if there are no stronger hints of a buy-in from the 
yum devs (and I don't promise anything anyway at this point), but:

> ...then it's unlikely I'd commit it, because that's just the tip of
> the iceberg.

I think it would be useful for interested parties if you could elaborate on 
that iceberg in a couple more lines, and/or with URLs pointing to 
documentation that explains it.
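
For reference, the suffix dispatch quoted above could be spelled out roughly
as follows.  This is only a sketch of that naive tip, not of the iceberg
under it, and not a proposal for yum's code.  It assumes a Python with the
bz2, gzip and lzma modules in the standard library, uses ".xz" in place of
the quoted ".lz" (the standard library has no lzip support), and the names
DECOMPRESSORS and uncompress_by_suffix() are hypothetical:

```python
import bz2
import gzip
import lzma

# Suffix -> decompression function.  ".xz" stands in for the ".lz" of the
# quoted sketch; that substitution is an assumption made for illustration.
DECOMPRESSORS = {
    ".xz": lzma.decompress,
    ".bz2": bz2.decompress,
    ".gz": gzip.decompress,
}


def uncompress_by_suffix(url, payload):
    """Decompress payload according to the suffix of url.

    Payloads whose URL has no known suffix are returned unchanged.
    """
    for suffix, decompress in DECOMPRESSORS.items():
        if url.endswith(suffix):
            return decompress(payload)
    return payload
```

It deliberately says nothing about which flavors a repo should generate or
how a client picks among them, which is presumably where the rest of the
iceberg lies.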