[Yum-devel] Breaking yum repository metadata

Hedayat Vatankhah hedayat at grad.com
Tue Oct 20 23:20:52 UTC 2009


Hi again,

On 09/10/20 06:20, James Antill wrote:
> On Tue, 2009-10-20 at 13:01 +0330, Hedayat Vatankhah wrote:
>    
>> Hi all,
>>
>> I'd like to create a prototype of a new repository metadata, and it
>> would be nice if you let me know about any negative points you see in
>> the proposal (like the previous one with security concerns):
>>
>> In the current implementation, the repository's primary database
>> contains a considerable amount of information about each package, and
>> most of such information won't be used by many users. It wastes
>> bandwidth, which gets worse as the size of the repository grows. For
>> example, the current Fedora repository primary database is about 12MB in
>> compressed form and 47MB in normal form. There are still many users for
>> which downloading 12MB of data is not fast, and as currently yum doesn't
>> resume downloading metadata files, it could be really frustrating for
>> users with poor internet connection.
>>      
>   This is misleading, updates (the only part of Fedora's release repos.
> that change) is currently (for F11) 22MB uncompressed and 5.5MB
> compressed. This is still not "tiny" but it's much smaller than updating
> everything.
>    
But not everyone can/will update everything. Anyway, the 5.3MB 
compressed updates primary db will shrink to 3MB without requires. Not 
that bad I think.

>> IMHO, it would be nice if users download only what they really need, not
>> the complete repository data. So, I think it is nice to split the
>> repository based on packages, not based on the information about
>> packages (like the current separation of primary and file lists
>> databases). As an example (and the first thing that I want to work on),
>> consider package requirements. Currently, package requirements are
>> stored in the primary database, but it seems that you need a package's
>> requirements only when you want to install that package. By removing the
>> requires table from Fedora repository's primary database, its size
>> shrinks from 47MB to 28MB (and in compressed form from 12MB to 6.7MB).
>> My initial proposal is to store each package's requirements in a
>> separate signed file (e.g. mypackage-0.0.1.fc10.i386.rpm_requirements).
>> So, yum will download such files when it needs them.  Now, what do you
>> think about this? Is it worth implementing?
>>      
>   The problem here is we need the requirements lookups to be fast, and
> being in a single .sqlite DB is going to be much faster than having
> N .xml files.
>    
As I said in the other reply, the downloaded parts can be merged into a 
.sqlite db on the client side.
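
To make the idea concrete, here is a rough sketch of the client-side merge using Python's sqlite3 module. The table layout and the helper names (open_cache, merge_requirements, requirements_of) are my own assumptions for illustration; this is not yum's actual schema or API.

```python
import sqlite3

# Hypothetical local cache schema (an assumption, not yum's real layout).
# Empty strings are used instead of NULL so the primary key can detect
# duplicates when the same file is merged twice.
SCHEMA = """
CREATE TABLE IF NOT EXISTS requires (
    pkg     TEXT NOT NULL,
    name    TEXT NOT NULL,
    flags   TEXT NOT NULL DEFAULT '',
    version TEXT NOT NULL DEFAULT '',
    PRIMARY KEY (pkg, name, flags, version)
);
CREATE INDEX IF NOT EXISTS requires_name ON requires (name);
"""


def open_cache(path=":memory:"):
    """Open (or create) the local requirements cache."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def merge_requirements(conn, pkg, requires):
    """Merge one downloaded package's requirements into the cache.

    `requires` is a list of (name, flags, version) tuples. Merging the
    same data again is a no-op thanks to INSERT OR IGNORE.
    """
    with conn:  # commit on success, roll back on error
        conn.executemany(
            "INSERT OR IGNORE INTO requires (pkg, name, flags, version) "
            "VALUES (?, ?, ?, ?)",
            [(pkg, name, flags, version) for name, flags, version in requires],
        )


def requirements_of(conn, pkg):
    """Look up a package's requirements from the merged cache."""
    rows = conn.execute(
        "SELECT name, flags, version FROM requires WHERE pkg = ? ORDER BY name",
        (pkg,),
    )
    return rows.fetchall()
```

With something like this, lookups for already-downloaded packages stay in a single indexed database, while only the requested packages' data has to be fetched.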

>   Also things like "repoquery --whatrequires" will now be horrible.
>    
I think it is not uncommon to optimize systems for their most common use 
cases, at the expense of making some uncommon use cases a bit worse. 
And is downloading many small files really "horrible"? Especially since 
on later runs you'll download only the missing requirements.
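
A minimal sketch of that "only fetch what is missing" behaviour follows. The function name ensure_requirements, the dict-based cache, and the fetch callable are hypothetical; the only detail taken from the proposal is the "<rpm file name>_requirements" naming.

```python
def ensure_requirements(wanted, cache, fetch):
    """Return requirements for the wanted packages, downloading only
    the per-package files that are not in the local cache yet.

    wanted: list of rpm file names, e.g. "mypackage-0.0.1.fc10.i386.rpm"
    cache:  dict mapping rpm file name -> already downloaded requirements
    fetch:  callable taking a remote file name and returning its content
            (a stand-in for whatever download routine would be used)
    """
    for rpm in wanted:
        if rpm not in cache:
            # Naming follows the proposal: <rpm file name>_requirements
            cache[rpm] = fetch(rpm + "_requirements")
    return {rpm: cache[rpm] for rpm in wanted}
```

On a second run with an overlapping package set, only the packages that were never requested before trigger a download.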

>   Saying that my suspicion is that requirements don't change that much,
> so if we could split them cleverly it's possible we could reuse them a
> lot.
>    
Yes. It might be even possible to share some information between package 
provides and requirements!

>   Feel free to investigate, I just don't think we can promise to accept
> anything.
>    
Yes I know. :) Even if you like the idea, the implementation could be 
terrible! :P

>> To go further in splitting, it might be nice to store package
>> descriptions in separate files too.
>>      
>   One of the things that's on the TODO list is to remove summary and
> description from primary, and have them in locale specific files. This
> should solve a number of problems, and we'd be more than happy to have
> some extra hands to make this happen sooner.
>    
I'll try to help as far as I can, and I'm interested in some TODO items 
too. My main interests in yum are in areas like downloading as little as 
possible, less locking, and multiple downloads in parallel.

>>   Also, I thought a little about
>> splitting package provides too. It should be done based on the provides
>> themselves, but creating a separate file for each provides might be
>> overkill. But it might be nice to split the provides based on some
>> initial characters of their hash code (e.g. based on the first 2
>> characters of their hash code) into separate small databases.
>>      
>   I doubt this would be a win.
>
>    
I do too!
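
Just so it is clear what I had in mind, the hash-prefix split could look roughly like this. The choice of SHA-1 and the function names are assumptions for illustration only; the proposal does not fix a hash function.

```python
import hashlib
from collections import defaultdict


def bucket_for(provide, prefix_len=2):
    """Name of the small database a provide would live in: the first
    `prefix_len` hex characters of its hash (SHA-1 is an assumption)."""
    return hashlib.sha1(provide.encode("utf-8")).hexdigest()[:prefix_len]


def split_provides(provides, prefix_len=2):
    """Group (provide, package) pairs by hash prefix.

    With two hex characters there are at most 256 buckets, and a client
    only needs the bucket matching the provide it is looking for.
    """
    buckets = defaultdict(list)
    for name, pkg in provides:
        buckets[bucket_for(name, prefix_len)].append((name, pkg))
    return dict(buckets)
```

A client resolving "libc.so.6" would compute bucket_for("libc.so.6") and fetch only that one bucket, but a transaction touching many provides would likely end up fetching most buckets anyway.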

>> The file lists could be also split, using the same method as
>> requirements or provides (maybe even both!), based on their most
>> important use case (which I'm not sure of).
>>      
>   What we'd really like to do, long term, is remove file requirements
> completely. But that requires a lot of work, mostly non-technical.
>
>    

Certainly that would be better! But while they exist, there might be a 
solution to provide a better experience. Anyway, since file lists are 
not used in many common use cases, I'm interested in splitting the 
primary metadata much more than the file lists.

Thanks a lot,
Hedayat


