[Yum-devel] Breaking yum repository metadata

Tue Oct 20 12:24:31 UTC 2009

On Tue, 20 Oct 2009, Hedayat Vatankhah wrote:

> Hi all,
>
> I'd like to create a prototype of a new repository metadata format, and it
> would be nice if you could point out any problems you see in the proposal
> (as happened with the security concerns about the previous one):
>
> In the current implementation, the repository's primary database contains a
> considerable amount of information about each package, and most of that
> information is not used by many users. This wastes bandwidth, and the
> problem gets worse as the repository grows. For example, the current Fedora
> repository primary database is about 12MB compressed and 47MB
> uncompressed. There are still many users for whom downloading 12MB of data
> is not fast, and since yum currently doesn't resume interrupted metadata
> downloads, this can be really frustrating for users with a poor internet
> connection.
>
>
> IMHO, it would be nice if users downloaded only what they really need, not
> the complete repository data. So I think it would be better to split the
> repository metadata by package rather than by kind of information (as in
> the current separation of primary and file lists databases). As an example
> (and the first thing that I want to work on), consider package requirements.
> Currently, package requirements are stored in the primary database, but it
> seems that you need a package's requirements only when you want to install
> that package. Removing the requires table from the Fedora repository's
> primary database shrinks it from 47MB to 28MB (and from 12MB to 6.7MB in
> compressed form). My initial proposal is to store each package's
> requirements in a separate signed file (e.g.
> mypackage-0.0.1.fc10.i386.rpm_requirements), which yum would download only
> when it needs it. What do you think about this? Is it worth implementing?

How much space would we save if we got rid of the most common dependencies
altogether, i.e. /bin/sh and libc? We could just say "all these pkgs assume
the presence of the items provided by bash and glibc".
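
One rough way to put a number on that is to count how many rows of the
requires table are taken up by a handful of near-universal dependencies. The
sketch below is illustrative only: it builds a tiny in-memory SQLite table
with a simplified requires(pkgKey, name) layout (an assumption; a real
primary database carries more columns), and the IMPLICIT set, the helper
names and the sample rows are all made up.

```python
import sqlite3

# Hypothetical set of dependencies every package would be assumed to have.
IMPLICIT = {"/bin/sh", "libc.so.6"}

def implicit_share(conn, implicit=IMPLICIT):
    """Return (implicit_rows, total_rows) for the requires table."""
    total = conn.execute("SELECT COUNT(*) FROM requires").fetchone()[0]
    marks = ",".join("?" for _ in implicit)
    hits = conn.execute(
        "SELECT COUNT(*) FROM requires WHERE name IN (%s)" % marks,
        sorted(implicit),
    ).fetchone()[0]
    return hits, total

def demo_db():
    """Build a toy database; the schema and rows are assumptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE requires (pkgKey INTEGER, name TEXT)")
    conn.executemany("INSERT INTO requires VALUES (?, ?)", [
        (1, "/bin/sh"), (1, "libc.so.6"), (1, "libfoo.so.1"),
        (2, "/bin/sh"), (2, "libc.so.6"),
        (3, "libc.so.6"), (3, "python(abi)"),
    ])
    return conn

hits, total = implicit_share(demo_db())
print("%d of %d requires rows are implicit candidates" % (hits, total))
```

Pointing the same query at a real primary database would give the row count
that could be dropped, though the saving in bytes would still have to be
measured after recompressing.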

> To take the splitting further, it might be nice to store package
> descriptions in separate files too.

If you store them each in separate files it makes searching on the 
description a lot harder.

> I have also thought a little about splitting package provides. That should
> be done based on the provides themselves, but creating a separate file for
> each provide might be overkill. It might be nice instead to split the
> provides into separate small databases based on the first few characters of
> their hash code (e.g. the first 2 characters).
>
> The file lists could also be split, using the same method as for
> requirements or provides (maybe even both!), depending on their most
> important use case (which I'm not sure about).

So searching or installing by provide or filelist wildcard would be 
impossible w/o downloading every single file?
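
To make the concern concrete: with provides bucketed by the first two
characters of a hash, an exact name maps to a single bucket, but a wildcard
cannot be hashed, so every bucket has to be fetched. A minimal sketch,
assuming SHA-1 hex digests and a two-character prefix (the proposal does not
name a hash function; shard_for, all_shards and shards_needed are
hypothetical helpers):

```python
import hashlib

PREFIX_LEN = 2  # assumption: two hex characters, so 256 possible shards

def shard_for(provide):
    """Shard id for an exact provide name (SHA-1 is an assumption)."""
    return hashlib.sha1(provide.encode("utf-8")).hexdigest()[:PREFIX_LEN]

def all_shards():
    """Every shard id that can exist for the chosen prefix length."""
    width = "0{}x".format(PREFIX_LEN)
    return [format(i, width) for i in range(16 ** PREFIX_LEN)]

def shards_needed(pattern):
    """Shards a client must download to resolve a name or a glob."""
    if any(ch in pattern for ch in "*?["):
        # A glob can match names hashed into any shard.
        return all_shards()
    return [shard_for(pattern)]

print(len(shards_needed("libfoo.so.1")))  # exact name: 1 shard
print(len(shards_needed("libfoo*")))      # wildcard: all 256 shards
```

The same reasoning applies to file lists split this way: any query that is
not an exact key degrades to downloading everything, only now as many small
requests instead of one large one.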

> There could be a compatibility period in which both current-style and
> new-style repository metadata are created (once all the desired
> functionality is implemented), which would add only a very small space
> overhead for mirrors.
>
>
> I'd like to hear your opinions.

Wouldn't it be much easier to provide deltas for the metadata, rather than
forcing every tool that reads the repodata to change the format it handles
and how often it has to download?
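
A rough illustration of the delta idea: the client keeps the metadata it
already has and fetches only what changed since then. The sketch below
models a repository snapshot as a dict mapping a package identifier to a
checksum; that model and the make_delta / apply_delta names are assumptions
for illustration, not an existing repodata format.

```python
def make_delta(old, new):
    """Describe how to turn snapshot `old` into snapshot `new`."""
    removed = sorted(k for k in old if k not in new)
    # Entries that are new or whose checksum differs from the old snapshot.
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    return {"removed": removed, "changed": changed}

def apply_delta(old, delta):
    """Rebuild the new snapshot from the old one plus a delta."""
    result = dict(old)
    for key in delta["removed"]:
        result.pop(key, None)
    result.update(delta["changed"])
    return result

# Made-up package identifiers and checksums.
old = {"bash-4.0-1.i386": "aaa", "yum-3.2.24-1.noarch": "bbb"}
new = {"bash-4.0-2.i386": "ccc", "yum-3.2.24-1.noarch": "bbb"}

delta = make_delta(old, new)
assert apply_delta(old, delta) == new
print(delta)
```

Here the unchanged yum entry never appears in the delta, so the download
scales with how much the repository changed rather than with its total size,
and tools that want the full metadata can keep reading it as they do today.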

-sv