[Yum-devel] Breaking yum repository metadata

Tue Oct 20 09:31:42 UTC 2009

Hi all,

I'd like to create a prototype of a new repository metadata format, and 
I would appreciate hearing about any drawbacks you see in the proposal 
(as happened with the previous one, where security concerns were raised):

In the current implementation, the repository's primary database 
contains a considerable amount of information about each package, and 
most users never use most of it. This wastes bandwidth, and the waste 
gets worse as the repository grows. For example, the current Fedora 
repository primary database is about 12MB compressed and 47MB 
uncompressed. For many users, downloading 12MB of data is still not 
fast, and since yum currently doesn't resume interrupted metadata 
downloads, this can be really frustrating for users with a poor 
internet connection.

IMHO, it would be better if users downloaded only what they really 
need, not the complete repository data. So I think the repository 
metadata should be split by package, rather than by the kind of 
information about packages (as in the current separation of the primary 
and file lists databases). As an example (and the first thing that I 
want to work on), consider package requirements. Currently, package 
requirements are stored in the primary database, but it seems that you 
need a package's requirements only when you want to install that 
package. Removing the requires table from the Fedora repository's 
primary database shrinks it from 47MB to 28MB (and from 12MB to 6.7MB 
in compressed form). My initial proposal is to store each package's 
requirements in a separate signed file (e.g. 
mypackage-0.0.1.fc10.i386.rpm_requirements), which yum would download 
only when it needs it. What do you think about this? Is it worth 
implementing?
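
To make the naming scheme concrete, here is a minimal sketch in Python. 
It only assumes the "_requirements" suffix from the example above; the 
function names and the idea of a local cache listing are hypothetical, 
and signature checking is left out entirely.

```python
import os

# Suffix taken from the example file name in this mail.
REQUIREMENTS_SUFFIX = "_requirements"


def requirements_filename(rpm_filename):
    """Return the name of the per-package requirements file."""
    return os.path.basename(rpm_filename) + REQUIREMENTS_SUFFIX


def files_to_fetch(rpm_filenames, already_cached=()):
    """Return the requirements files still missing for a transaction.

    rpm_filenames: packages the user asked to install.
    already_cached: requirements files already present locally
    (hypothetical cache listing, not an existing yum structure).
    """
    cached = set(already_cached)
    wanted = [requirements_filename(name) for name in rpm_filenames]
    return [name for name in wanted if name not in cached]
```

With this layout, installing a package costs one small download per 
package in the transaction instead of the requires table for the whole 
repository.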

To go further in splitting, it might be nice to store package 
descriptions in separate files too. I also thought a little about 
splitting package provides. That should be done based on the provides 
themselves, but creating a separate file for each provide might be 
overkill. Instead, the provides could be split into separate small 
databases based on the first few characters of their hash code (e.g. 
the first 2 characters).
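
A rough sketch of that bucketing, again in Python. The hash function is 
not decided anywhere in this proposal; SHA-1 over the provide name is 
only an assumption for illustration, and the function names are 
hypothetical.

```python
import hashlib
from collections import defaultdict


def provide_bucket(provide_name, prefix_len=2):
    """Return the bucket key for a provide: the first hex characters
    of its hash (SHA-1 here is an assumption, not part of the proposal)."""
    digest = hashlib.sha1(provide_name.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def split_provides(provides, prefix_len=2):
    """Group (package, provide_name) pairs into small per-bucket lists.

    Each bucket would become one small database file in the repository.
    """
    buckets = defaultdict(list)
    for package, provide_name in provides:
        key = provide_bucket(provide_name, prefix_len)
        buckets[key].append((provide_name, package))
    return dict(buckets)
```

With 2 hex characters there are at most 256 buckets, so a client 
resolving a single provide would need to download roughly 1/256 of the 
provides data if the hash spreads names evenly.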

The file lists could also be split, using the same method as for 
requirements or provides (maybe even both!), depending on their most 
important use case (which I'm not sure about yet).

There could be a compatibility period in which both the current style 
and the new style (once all the desired functionality is implemented) 
of repository metadata are created, which would add only a very small 
space overhead for mirrors.

I'd like to hear your opinions.

Thanks a lot,

Hedayat