[Rpm-metadata] Request-For-Ideas: requires statistics and thoughts on making our metadata smaller

Mon Nov 2 15:09:46 UTC 2009

On Nov 2, 2009, at 9:50 AM, Duncan Mac-Vicar Prett wrote:

>
> This string repetition is the reason why the satsolver uses a hashed
> string pool which is created from the metadata (very fast). The result
> are the solv files.
>

Memoization (as in the satsolver) is an important reduction.

The problem with memoization used to remove data redundancy is that
one cannot do memoization (which forces a dictionary to uniqify all
strings) and simultaneously use a "standard" markup like XML.
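
To make the dictionary idea concrete, here is a minimal sketch of a string
pool in Python. It is not the satsolver's implementation or the *.solv
format; the StringPool class and the package and dependency names are
made up for illustration.

```python
class StringPool:
    """Hypothetical interning pool: each distinct string is stored once."""

    def __init__(self):
        self._ids = {}       # string -> integer id
        self._strings = []   # integer id -> string

    def intern(self, s):
        """Return the id for s, adding it to the pool on first sight."""
        if s not in self._ids:
            self._ids[s] = len(self._strings)
            self._strings.append(s)
        return self._ids[s]

    def lookup(self, i):
        """Decode an id back into its string."""
        return self._strings[i]

    def __len__(self):
        return len(self._strings)


# Made-up requires lists, for illustration only.
requires = {
    "pkg-a": ["libc.so.6", "libm.so.6", "/bin/sh"],
    "pkg-b": ["libc.so.6", "/bin/sh"],
    "pkg-c": ["libc.so.6", "libm.so.6"],
}

pool = StringPool()

# Replace every dependency string with its small integer id.
encoded = {name: [pool.intern(dep) for dep in deps]
           for name, deps in requires.items()}

# Reading the data back requires going through the pool again.
decoded = {name: [pool.lookup(i) for i in ids]
           for name, ids in encoded.items()}

total_refs = sum(len(deps) for deps in requires.values())
```

Here seven dependency references collapse to three stored strings, but
the encoded lists are opaque integers: nothing can read them without the
pool, which is why this does not combine with self-describing markup.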

If anything, there are more redundant strings in the
XML markup than in the dependency content itself in rpm-metadata.
But that flaw is usually dismissed with
	Compress! Compress! Compress!

And sure, one can use a database like sqlite as well, but that assumes
that the data is normalized in the schema, which is mostly not the case
for rpm-metadata stored in a sqlite3 database.
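
As a rough sketch of what "normalized" would mean here, the following
uses Python's sqlite3 module with an in-memory database. The table and
column names are assumptions made for illustration, not the schema yum
actually uses, and the dependency rows are made up.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Hypothetical normalized layout: every distinct string lives once in
# "strings", and "requires" refers to it by integer id.
con.executescript("""
CREATE TABLE strings (
    id    INTEGER PRIMARY KEY,
    value TEXT UNIQUE NOT NULL
);
CREATE TABLE requires (
    pkg_id  INTEGER NOT NULL,
    name_id INTEGER NOT NULL REFERENCES strings(id)
);
""")


def string_id(value):
    """Return the id of value, inserting it only if it is new."""
    con.execute("INSERT OR IGNORE INTO strings (value) VALUES (?)", (value,))
    row = con.execute("SELECT id FROM strings WHERE value = ?", (value,))
    return row.fetchone()[0]


# Made-up (package id, dependency name) pairs.
rows = [(1, "libc.so.6"), (1, "/bin/sh"), (2, "libc.so.6"), (3, "libc.so.6")]
for pkg_id, dep in rows:
    con.execute("INSERT INTO requires VALUES (?, ?)", (pkg_id, string_id(dep)))

n_strings = con.execute("SELECT COUNT(*) FROM strings").fetchone()[0]

# Decoding a package's requires means joining back through "strings".
pkg1 = [r[0] for r in con.execute(
    "SELECT s.value FROM requires r JOIN strings s ON s.id = r.name_id "
    "WHERE r.pkg_id = ? ORDER BY s.value", (1,))]
```

Four dependency rows share two stored strings. A schema that instead
keeps the name as TEXT in every requires row stores "libc.so.6" three
times, which is the unnormalized case described above.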

Both memoization (as used in *.solv) and a database (as used by yum)
also force all lookups to go through the "dictionary" to be decoded.
While that clearly "works" for vendor-specific applications like
zypp and Fedora-specific implementations like yum, there's no clear
"better" yet.

73 de Jeff