[Yum-devel] Breaking yum repository metadata
Hedayat Vatankhah
hedayat at grad.com
Tue Oct 20 23:20:52 UTC 2009
Hi again,
On 09/10/20 06:20, James Antill wrote:
> On Tue, 2009-10-20 at 13:01 +0330, Hedayat Vatankhah wrote:
>
>> Hi all,
>>
>> I'd like to create a prototype of a new repository metadata format, and it
>> would be nice if you let me know about any negative points you see in
>> the proposal (like the previous one with security concerns):
>>
>> In the current implementation, the repository's primary database
>> contains a considerable amount of information about each package, and
>> most of that information won't be used by many users. It wastes
>> bandwidth, which gets worse as the size of the repository grows. For
>> example, the current Fedora repository primary database is about 12MB in
>> compressed form and 47MB uncompressed. There are still many users for
>> whom downloading 12MB of data is not fast, and since yum currently doesn't
>> resume interrupted metadata downloads, it can be really frustrating for
>> users with a poor internet connection.
>>
> This is misleading: updates (the only part of Fedora's release repos.
> that changes) is currently (for F11) 22MB uncompressed and 5.5MB
> compressed. This is still not "tiny" but it's much smaller than updating
> everything.
>
But not everyone can/will update everything. Anyway, the 5.3MB
compressed updates primary db will shrink to 3MB without the requires
table. Not that bad, I think.
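
One way to reproduce this kind of figure for any primary database is to
drop the table from a copy and compact the copy. The following is only a
minimal sketch: it assumes the table is called `requires`, the function
name is made up for illustration, and it measures the uncompressed .sqlite
file only (compressing the result would be a separate step).

```python
import os
import shutil
import sqlite3
import tempfile


def size_without_table(db_path, table="requires"):
    """Return (original_size, stripped_size) in bytes.

    Works on a temporary copy, so the database at db_path is left untouched.
    """
    before = os.path.getsize(db_path)
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = os.path.join(tmp, "stripped.sqlite")
        shutil.copy(db_path, copy_path)
        # Autocommit mode: VACUUM cannot run inside an open transaction.
        conn = sqlite3.connect(copy_path, isolation_level=None)
        try:
            quoted = table.replace('"', '""')
            conn.execute('DROP TABLE IF EXISTS "%s"' % quoted)
            # VACUUM rebuilds the file so the freed pages are actually released.
            conn.execute("VACUUM")
        finally:
            conn.close()
        after = os.path.getsize(copy_path)
    return before, after
```

Without the VACUUM step the file would keep its old size, because SQLite
only marks the dropped pages as free.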
>> IMHO, it would be nice if users downloaded only what they really need, not
>> the complete repository data. So, I think it would be nice to split the
>> repository based on packages, not based on the information about
>> packages (like the current separation of primary and file lists
>> databases). As an example (and the first thing that I want to work on),
>> consider package requirements. Currently, package requirements are
>> stored in the primary database, but it seems that you need a package's
>> requirements only when you want to install that package. By removing the
>> requires table from Fedora repository's primary database, its size
>> shrinks from 47MB to 28MB (and in compressed form from 12MB to 6.7MB).
>> My initial proposal is to store each package's requirements in a
>> separate signed file (e.g. mypackage-0.0.1.fc10.i386.rpm_requirements).
>> So, yum will download such files when it needs them. Now, what do you
>> think about this? Is it worth implementing?
>>
> The problem here is we need the requirements lookups to be fast, and
> being in a single .sqlite DB is going to be much faster than having
> N .xml files.
>
As I said in the other reply, the downloaded parts can be merged in a
.sqlite db on the client side.
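
A minimal sketch of what such a client-side merge might look like. The
file format (one requirement name per line), the simplified two-column
`requires` table, and both function names are assumptions made for
illustration; nothing here is yum's actual schema or API.

```python
import sqlite3


def merge_requirements(conn, pkg_name, lines):
    """Store one package's requirement names, replacing any earlier copy."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requires (pkg TEXT NOT NULL, name TEXT NOT NULL)"
    )
    # Index on the requirement name so reverse lookups stay fast.
    conn.execute("CREATE INDEX IF NOT EXISTS requires_name ON requires (name)")
    names = [ln.strip() for ln in lines if ln.strip()]
    with conn:  # one transaction per downloaded file
        conn.execute("DELETE FROM requires WHERE pkg = ?", (pkg_name,))
        conn.executemany(
            "INSERT INTO requires (pkg, name) VALUES (?, ?)",
            [(pkg_name, n) for n in names],
        )
    return len(names)


def what_requires(conn, name):
    """Return the packages merged so far that require `name`, sorted."""
    rows = conn.execute(
        "SELECT DISTINCT pkg FROM requires WHERE name = ? ORDER BY pkg", (name,)
    )
    return [r[0] for r in rows]
```

Note that `what_requires` can only answer for packages whose files have
already been merged, which is exactly the limitation for reverse queries
raised in the next point.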
> Also things like "repoquery --whatrequires" will now be horrible.
>
I think it is not uncommon to optimize systems for their most common use
cases, at the expense of making some uncommon use cases a bit worse.
And is downloading many small files really "horrible"? Especially since
on later runs you'll download only the missing requirements.
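
The "only missing requirements" check could be as simple as comparing the
wanted packages against a local cache directory. This is a rough sketch:
the cache layout and the function name are hypothetical, and only the
`.rpm_requirements` suffix comes from the example file name in the
proposal.

```python
import os

SUFFIX = ".rpm_requirements"  # suffix from the proposal's example file name


def missing_requirement_files(packages, cache_dir):
    """Return the requirement file names that are not yet in cache_dir.

    `packages` holds names such as "mypackage-0.0.1.fc10.i386".
    """
    missing = []
    for pkg in packages:
        fname = pkg + SUFFIX
        if not os.path.isfile(os.path.join(cache_dir, fname)):
            missing.append(fname)
    return missing
```

Only the names returned here would need to be fetched; everything else is
served from the cache.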
> Saying that my suspicion is that requirements don't change that much,
> so if we could split them cleverly it's possible we could reuse them a
> lot.
>
Yes. It might even be possible to share some information between package
provides and requirements!
> Feel free to investigate, I just don't think we can promise to accept
> anything.
>
Yes, I know. :) Even if you like the idea, the implementation could be
terrible! :P
>> To go further in splitting, it might be nice to store package
>> descriptions in separate files too.
>>
> One of the things that's on the TODO list is to remove summary and
> description from primary, and have them in locale specific files. This
> should solve a number of problems, and we'd be more than happy to have
> some extra hands to make this happen sooner.
>
I'll try to help as far as I can, and I'm interested in some TODO items
too. My main interests in yum are in areas like downloading as little
as possible, less locking, and multiple downloads in parallel.
>> Also, I thought a little about
>> splitting package provides too. It should be done based on the provides
>> themselves, but creating a separate file for each provides might be
>> overkill. But it might be nice to split the provides based on some
>> initial characters of their hash code (e.g. based on the first 2
>> characters of their hash code) into separate small databases.
>>
> I doubt this would be a win.
>
>
I do too!
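
For reference, the bucketing idea itself is small. In this sketch SHA-1
is an assumption (the proposal does not say which hash), and both
function names are made up; with two hex characters there are at most
256 buckets.

```python
import hashlib
from collections import defaultdict


def bucket_for(provide, prefix_len=2):
    """Return the first prefix_len hex characters of the provide's SHA-1."""
    digest = hashlib.sha1(provide.encode("utf-8")).hexdigest()
    return digest[:prefix_len]


def split_provides(provides, prefix_len=2):
    """Group provide names by hash prefix; each group would become one small database."""
    buckets = defaultdict(list)
    for p in provides:
        buckets[bucket_for(p, prefix_len)].append(p)
    return dict(buckets)
```

A client resolving one provide would compute its prefix and fetch only
the matching bucket.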
>> The file lists could also be split, using the same method as
>> requirements or provides (maybe even both!), based on their most
>> important use case (which I'm not sure of).
>>
> What we'd really like to do, long term, is remove file requirements
> completely. But that requires a lot of work, mostly non-technical.
>
>
Certainly that would be better! But while they exist, there might be a
solution to provide a better experience. Anyway, since file lists are
not used in many common use cases, I'm interested in splitting the
primary metadata much more than the file lists.
Thanks a lot,
Hedayat