[Rpm-metadata] [Fwd: [alikins at redhat.com: Re: metadata format idea]]

seth vidal skvidal at phy.duke.edu
Sat Oct 4 14:27:57 UTC 2003


This is a post from Adrian Likins that describes some of the ideas I
think we had all gotten reasonably comfortable with.

If I'm wrong about that then it bears discussion.

-sv


-----Forwarded Message-----
From: Adrian Likins <alikins at redhat.com>
To: jbj at redhat.com
Cc: seth vidal <skvidal at phy.duke.edu>, jbj at redhat.com, pzb at ximian.com, katzj at redhat.com, niemeyer at conectiva.com, trow at ximian.com, joe at ximian.com, bfox at redhat.com, herrold at owlriver.com, dburcaw at terrasoftsolutions.com
Subject: [alikins at redhat.com: Re: metadata format idea]
Date: Mon, 04 Aug 2003 15:41:55 -0400

Forgot to cc this... This is my proposal for
an alternative to "one file"

Also, for the clients at the end of the network,
the headers wouldn't have to be split out. The
package list could include the size of the headers,
and byte-range requests could be used to get them.
Then ditto for getting the rest of the package.
That way you have a total of about 4 metadata
files, no exploded headers, and about 150k of
bandwidth overhead starting from scratch.
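
The byte-range trick above can be sketched roughly like this (the idea that the package list carries each header's offset and size, and all the names here, are my assumptions for illustration, not an existing format):

```python
import urllib.request

def range_header(offset, size):
    """Build an HTTP/1.1 Range header value for `size` bytes starting
    at `offset`; byte ranges are inclusive on both ends, hence the -1."""
    return "bytes=%d-%d" % (offset, offset + size - 1)

def fetch_rpm_header(url, offset, size):
    """Pull just the header section of a remote package.  Any mirror
    speaking plain HTTP/1.1 can serve this, so no exploded headers or
    special server-side code are needed."""
    req = urllib.request.Request(
        url, headers={"Range": range_header(offset, size)})
    with urllib.request.urlopen(req) as resp:  # expect 206 Partial Content
        return resp.read()
```

The same two-request pattern then covers the payload: one small range for the header, one for the rest of the package.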


Adrian

________________________________________________________________________
From: Adrian Likins <alikins at redhat.com>
To: Daniel Veillard <veillard at redhat.com>
Cc: seth vidal <skvidal at phy.duke.edu>, jbj at redhat.com, pzb at ximian.com, katzj at redhat.com, niemeyer at conectiva.com, trow at ximian.com, joe at ximian.com, bfox at redhat.com, herrold at owlriver.com, dburcaw at terrasoftsolutions.com
Subject: Re: metadata format idea
Date: Sun, 03 Aug 2003 20:13:54 -0400

On Sun, Aug 03, 2003 at 04:39:42PM -0400, Daniel Veillard wrote:
> On Sun, Aug 03, 2003 at 04:13:57PM -0400, Adrian Likins wrote:
> > On Sun, Aug 03, 2003 at 10:57:48AM -0400, Daniel Veillard wrote:
> > > On Sun, Aug 03, 2003 at 12:11:14AM -0400, Adrian Likins wrote:
> > >   they are bloat if you want to minimize downloads for the only
> > > purpose of dependency resolution from external repositories.
> > >
> > 	again, there's no reason to download that info until
> > you need it, and then you can do it in very small targeted chunks. 
> 
>   The problem is that we want to isolate the useful metadata for the
> queries. File data goes in or not, that's the question.
>
	Oh. Seems like the wrong question to me... I'd worry
more about why you are trying to shove all the data down in
one file when saving bandwidth is a concern. It seems like
an odd choice to me. Especially when you have to keep refetching
that file when only small bits change.

	Or is the goal that clients rsync that file from the
servers? I assumed that was ruled out on grounds that distributed
DoS attacks against servers are frowned upon ;->
 
> > > (usually a data file in the case left out) then it breaks. Over the total
> > > set of packages indexed on rpmfind only 40 would not work.
> > >
> > 	deps failing == bad, but alas, if no one else cares, I'm
> > not too picky. 
> 
>   40 "broken" packages over 150,000+, I'm ready to chase the
> 40 packagers individually!
>
	You're braver than me ;->
	
	My experience is that if I ever utter a phrase
like "no one uses rpm package feature FOO right?" all of
a sudden 10 people go "You can do that??!! Cool!" 

	But perhaps I'm biased ;->


> > >   it is a frightening amount of information in my experience.
> > > The file data table on the rpmfind database explodes any other
> > > one, and I restricted the path to wipe out any path with more
> > > than 35 chars... get your facts Adrian, it is huge!!!
> > >
> > 	define huge? Numbers would help here.
> 
>   -rw-rw----    1 mysql    mysql    123994632 Aug  3 10:36 Files.MYD
>   -rw-rw----    1 mysql    mysql    104170936 Aug  3 10:36 Packages.MYD
>   -rw-rw----    1 mysql    mysql    11983976 Aug  3 10:36 Provides.MYD
> 
> This is with a priori removal of any filename with a path longer
> than 35 chars. Take something like a latex package ...
> The general file information would double in my estimation the
> size of the metadata.
> 
	Doesn't seem too bad. Especially considering you
have a bigger package universe than anyone is likely
to maintain. I was figuring around ~1 gig of file
info for a repo that size. 

	Another datapoint, the file -> package
data for debian sid-i386 comes to about 6 megs
gzipped. I suspect that would be considered 
reasonably representative of a "large" distro.

	But, if "one file" is the chosen
path, you probably don't want that data in there.
(packages.gz for the same debian is about 2.5M
gzipped.) But as I already mentioned, I'm not sure
"one file" would have been my first approach[1].


[1] Well, except of course, that it was my first
approach ~4 years ago. But then, I was also trying
to write a dep solver in perl, so my judgement at
the time was perhaps flawed a bit. The LDAP-based
one was a bit warped as well. But I'm feeling
_much_ better now.  

> > 	And again, just because the server has the
> > data doesn't mean the client has to download it. 
> 
>   We are looking at a single static file.
> 
	Okay, I missed that discussion. May
I ask why one static file is so important?

	I assume the answer is "simplicity".
But I don't think a handful of files is that much
more complicated, for potentially strong 
bandwidth savings. 

> > 	also, is the plan for the backend to be 
> > database based? 
> 
>   No, a single static file.
> 
	Okay, so sizes of database files aren't
really relevant. 

> > > We are trying to design the metadata needed for distributed search,
> > > the equivalent of apt/yum/... metadata to allow unified and generic
> > > export of information needed for the dependency resolution. Those
> > > metadata are stored on the servers, and queried by the client to 
> > > drive the transitive closure between package requirement and provides.
> > > The usual stuff, but not using RPM headers directly (too large dixit
> > > Jeff) and trying to design a very simple minimal format to reach 
> > > this goal. Minimizing the amount of data transfer is key for success.
> > 
> > 	distributed how? All the folks that have talked to me
> > about this never mentioned anything about it being distributed,
> > so I'm curious what exactly "distributed" means in this context?
> 
>   The client doesn't have all the information. The information is
> possibly distributed over multiple repositories. E.g.:
>    Solving an epiphany-0.8.2 lookup would require making a closure over
> the set of information coming from
>      - the local installed base
>      - the Mozilla project metadata
>      - the Gnome project metadata
> 
>  it's distributed in the sense that the full set of information is
> initially present on a separate set of machines.
>
	Okay, so apt/yum/redcarpet are already "distributed" in
that sense. 
 
> > 	Do the servers talk to each other? Or is the "distribution"
> > totally client side driven? Or do we just mean "mirrored"? Something
> > else?
> 
>   servers are dumb, export 1 static file and the client does the work
> 
> > 	If you want to minimize the data transfer, I think
> > dynamic content and lazy pull of data is the key. (ie, something
> > more or less like up2date + delta syncing of data + tiny "header
> > lite" blobs is probably about as little data transfer as you
> > are going to be able to get away with). Or is static mirrorable
> > content a requirement? 
> 
>     a requirement. You won't get the mirrors to run a database open
> to the internet for you 
> 
	If you want absolutely minimum bandwidth use, I
think dynamic servers are required. Though, with clever
static content, you could get close enough. If you want to
ignore certain corner cases, you might even be able to
do it with the one big file.
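
One shape that "clever static content" could take, sketched here with an index format I'm inventing for illustration: split the metadata into a few chunks, publish a tiny index of chunk checksums, and let the client refetch only the chunks whose checksum changed.

```python
def chunks_to_refetch(old_index, new_index):
    """Both indexes map chunk name -> checksum.  Return the chunks
    that are new or changed, i.e. the only files that actually need
    downloading; everything else cached locally is still current."""
    return sorted(name for name, csum in new_index.items()
                  if old_index.get(name) != csum)
```

After a successful sync the client stores new_index as its next old_index, so steady-state traffic is just the tiny index plus whatever actually changed.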

> > 	fwiw, I'd lean towards sending headers (duh, since
> > that's the way I already do it). Or something sufficiently 
> 
>   Headers are far too big. Are you gonna fetch all headers of
> the relevant packages before building your transaction?

	Yes. Either that, or fetch all the info in the headers
and then build up fake headers and do it. Seems like you're still
passing around the same info. 

	In my case, headers have the info I need, in the
format I need it in. So, I don't perceive a strong need
to put them in another format.

> It's
> okay in a closed universe where you can precompute the set of exact
> packages needed (which is the case of up2date/anaconda) but IMHO
> not okay if you are subscribed to a number of repositories which may
> export similar packages.

	This doesn't make any sense. Up2date doesn't have a particularly
closed universe. Why is it not okay for yum/apt/etc? That's more
or less what they do now (actually, they download _all_ the headers,
not just the ones for the packages they are interested in). But
it would be fairly easy to fix that with smarter metadata (aka,
fetch a file containing normal deps -> packagename, then just
fetch the headers you need to build the transaction. If you
don't care about file deps, that info is tiny...). 
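
The two-step fetch just described might look something like this (the one-line-per-entry map format and the helper names are mine for illustration, not anything yum/apt actually ships):

```python
def parse_provides_map(text):
    """Parse the small "normal deps -> packagename" file; one
    '<provided-name> <package>' pair per line, file deps left out,
    which is what keeps it tiny."""
    mapping = {}
    for line in text.splitlines():
        if line.strip():
            dep, pkg = line.split()
            mapping.setdefault(dep, []).append(pkg)
    return mapping

def headers_to_fetch(requires, provides_map):
    """Resolve which packages' headers are actually needed to build
    the transaction; only those get downloaded."""
    needed = set()
    for dep in requires:
        needed.update(provides_map.get(dep, []))
    return sorted(needed)
```

The client pulls the map once, resolves its requires locally, and then fetches only the handful of headers the transaction touches.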


Adrian 



