[Yum-devel] Breaking yum repository metadata

Tue Oct 20 22:50:45 UTC 2009

Hi,

On 09/10/20 03:54, Seth Vidal wrote:
>
>
> On Tue, 20 Oct 2009, Hedayat Vatankhah wrote:
>
>> Hi all,
>>
>> I'd like to create a prototype of a new repository metadata, and it
>> would be nice if you let me know about any negative points you see in
>> the proposal (like the previous one with security concerns):
>>
>> In the current implementation, the repository's primary database
>> contains a considerable amount of information about each package, and
>> most of such information won't be used by many users. It wastes
>> bandwidth, which gets worse as the size of the repository grows. For
>> example, the current Fedora repository primary database is about 12MB
>> in compressed form and 47MB in normal form. There are still many
>> users for which downloading 12MB of data is not fast, and as
>> currently yum doesn't resume downloading metadata files, it could be
>> really frustrating for users with poor internet connection.
>>
>>
>> IMHO, it would be nice if users download only what they really need,
>> not the complete repository data. So, I think it is nice to split the
>> repository based on packages, not based on the information about
>> packages (like the current separation of primary and file lists
>> databases). As an example (and the first thing that I want to work
>> on), consider package requirements. Currently, package requirements
>> are stored in the primary database, but it seems that you need a
>> package's requirements only when you want to install that package. By
>> removing the requires table from Fedora repository's primary
>> database, its size shrinks from 47MB to 28MB (and in compressed form
>> from 12MB to 6.7MB). My initial proposal is to store each package's
>> requirements in a separate signed file (e.g.
>> mypackage-0.0.1.fc10.i386.rpm_requirements). So, yum will download
>> such files when it needs them.  Now, what do you think about this?
>> Is it worth implementing?
>
>
> How much space would we save if we got rid of the most common
> dependencies altogether? ie: /bin/sh and libc. If we just said "all
> these pkgs assume the presence of the items provided by bash and glibc"
This can be done for something like /bin/sh (binaries), but I'm not sure
it can be done for glibc (and other libraries), since some packages
might depend on different versions of glibc (or on glibc builds compiled
with incompatible flags).
BTW, that would be a nice addition anyway!
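
A minimal sketch of the "implicit dependencies" idea, assuming a
hypothetical IMPLICIT_REQUIRES set and requirements represented as
(name, flags, version) tuples (neither is part of yum). Only unversioned
entries such as /bin/sh are dropped; versioned or library requirements
are kept, for the reason given above:

```python
# Hypothetical set of requirements every package is assumed to satisfy.
IMPLICIT_REQUIRES = {"/bin/sh"}


def strip_implicit(requires):
    """Drop unversioned requirements that the base system is assumed to provide."""
    kept = []
    for name, flags, version in requires:
        # Versioned requirements are always kept, even for implicit names.
        if name in IMPLICIT_REQUIRES and flags is None:
            continue
        kept.append((name, flags, version))
    return kept


reqs = [
    ("/bin/sh", None, None),
    ("libc.so.6(GLIBC_2.4)", None, None),
    ("glibc", "GE", "2.10"),
]
remaining = strip_implicit(reqs)
```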

>
>> To go farther in splitting, it might be nice to store package
>> descriptions in separate files too.
>
> If you store them each in separate files it makes searching on the
> description a lot harder.
Yes, you're right. These are the possibilities IMHO:
1. Do not search the descriptions with yum on the client side! The
descriptions could be on a web server (like the repoview pages), and the
search can be done using Google! Client applications could even run that
Google search themselves.

2. Not "a lot harder". It just needs to download all the description
files. It is even possible to have an all-in-one descriptions database
for such use cases alongside the normal one-package description files
(not very interesting, though!). So, when the user wants to search in
package descriptions (which might be better as a separate command from
the normal search command), yum will download that all-in-one database
(or all of the single description files). I wonder if downloading many
small files is harder than downloading a single large file. Downloading
small files could even be done in parallel (e.g. 5 files at once) to
come close to the behavior of download accelerators.
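
A rough sketch of fetching many small files in parallel, five at a time.
The fetch_all helper, the fake_fetch stand-in and the "_description"
file suffix are all hypothetical, and the actual HTTP download is
replaced by a stub so that the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(names, fetch_one, max_parallel=5):
    """Fetch many small files concurrently; returns {name: content}."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map yields results in the same order as `names`.
        return dict(zip(names, pool.map(fetch_one, names)))


def fake_fetch(name):
    """Stand-in for a real HTTP download (assumption, no network access)."""
    return "description of " + name


names = [
    "foo-1.0-1.fc11.i386.rpm_description",
    "bar-2.1-3.fc11.noarch.rpm_description",
]
descriptions = fetch_all(names, fake_fetch)
```

Passing the download function in as a parameter keeps the concurrency
logic separate from whatever transfer library the client actually uses.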

3. If it's really undesirable, I'll drop this item (splitting package
descriptions), at least for the official yum.

>
>> Also, I thought a little about splitting package provides too. It
>> should be done based on the provides themselves, but creating a
>> separate file for each provides might be overkill. But it might be
>> nice to split the provides based on some initial characters of their
>> hash code (e.g. based on the first 2 characters of their hash code)
>> into separate small databases.
>>
>> The file lists could be also split, using the same method as
>> requirements or provides (maybe even both!), based on their most
>> important use case (which I'm not sure about).
>
> So searching or installing by provide or filelist wildcard would be
> impossible w/o downloading every single file?
If using wildcards... yes. But in that case (the worst case) it won't be
worse than the current situation, will it? I wonder if these use cases
are that common. Also, in future runs you possibly won't need to download
all of the files, just the modified ones (however, delta metadata will
probably provide the same efficiency). (But yes, it is unclear how often
such files are modified.)
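
The hash-prefix split of provides discussed above can be sketched as
follows. The choice of SHA-1 and the function names are assumptions made
for illustration only. An exact provide name maps to a single small
database, while a wildcard pattern cannot be hashed and therefore needs
all 256 of them:

```python
import hashlib


def shard_for(provide):
    """Shard id = first two hex characters of the provide name's SHA-1 digest."""
    return hashlib.sha1(provide.encode("utf-8")).hexdigest()[:2]


# Two hex characters give 256 possible shards: "00" .. "ff".
ALL_SHARDS = ["%02x" % i for i in range(256)]


def shards_needed(pattern):
    """Exact names need one shard; wildcard patterns need every shard."""
    if any(ch in pattern for ch in "*?["):
        return list(ALL_SHARDS)
    return [shard_for(pattern)]
```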

>> There could be a compatibility period in which both current style and
>> new style (after implementing all desired functionality) repository
>> metadata are created; which will have a very small space overhead for
>> mirrors.
>>
>>
>> I'd like to hear from you about your opinions.
>
> Wouldn't it be much easier to think about providing deltas for the
> metadata rather than forcing every tool reading the repodata to change
> its format and how often it has to download?
It might be, and I might work on that instead. The main problem with
delta metadata files is that they won't eliminate the need to download
the metadata the first time. For the Fedora repository, that could be
solved by putting the repository metadata on the installation media. But
I'm not sure that will be done, and it won't solve the problem for other
repositories.
I think it is possible to keep the required changes on the clients
rather small. For example, while requirements and maybe other metadata
would be downloaded as small files, the data could be inserted into a
SQLite database after download, using tables very similar to the
currently used ones. That would also solve the performance problem
mentioned by James Antill, as the lookup would happen in the .sqlite db
rather than in N separate files.
The requirements of each package need to be downloaded only once. For
provides and file lists, the same metadata_expires setting could be
applied separately to each part.
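
A minimal sketch of caching downloaded requirement files in a local
SQLite database. The table layout is a simplified assumption and not
yum's actual schema, and the one-requirement-per-line file format
("name" or "name flags version") is hypothetical:

```python
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) the local cache with a simplified requires table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS requires "
        "(pkgid TEXT, name TEXT, flags TEXT, version TEXT)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS requires_pkgid ON requires (pkgid)")
    return db


def parse_requirements(text):
    """Parse lines of the hypothetical form 'name' or 'name flags version'."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        flags, version = (parts[1], parts[2]) if len(parts) == 3 else (None, None)
        rows.append((parts[0], flags, version))
    return rows


def store_requirements(db, pkgid, text):
    """Insert a package's requirements once; returns False if already cached."""
    cached = db.execute(
        "SELECT 1 FROM requires WHERE pkgid = ? LIMIT 1", (pkgid,)
    ).fetchone()
    if cached:
        return False
    db.executemany(
        "INSERT INTO requires VALUES (?, ?, ?, ?)",
        [(pkgid,) + row for row in parse_requirements(text)],
    )
    db.commit()
    return True


db = open_cache()
first = store_requirements(db, "mypackage-0.0.1.fc10.i386", "/bin/sh\nglibc GE 2.10\n")
```

Lookups then go through one indexed table instead of N separate files.
A package with no requirements at all would need an extra marker row to
count as cached, which this sketch leaves out.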

Yes, using delta metadata files has less overhead, but maybe the extra
overhead is worth it?!
Also, you might dislike splitting package descriptions, provides and/or
file lists, but as I said, splitting only the requirements seems to be a
good improvement. Doesn't it?

(I've also tried removing the requires table from the Fedora updates
repository's primary database. It shrinks the uncompressed file from 22M
to 12M, and the compressed file from 5.3M to 3.0M.)
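
For reference, the relative savings implied by the sizes quoted in this
thread can be computed as below. All four cases come out at roughly 40
to 45 percent. The helper name and the labels are illustrative only:

```python
def saving_percent(before, after):
    """Relative size reduction, in percent of the original size."""
    return 100.0 * (before - after) / before


# (before, after) sizes in MB, as quoted in this thread.
figures = {
    "fedora primary, uncompressed": (47, 28),
    "fedora primary, compressed": (12, 6.7),
    "updates primary, uncompressed": (22, 12),
    "updates primary, compressed": (5.3, 3.0),
}

for label, (before, after) in figures.items():
    print("%s: %.1f%% smaller" % (label, saving_percent(before, after)))
```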

Thanks for your patience,
Hedayat

>
>
> -sv
>
>