[Rpm-metadata] two other areas needed

Tue Oct 7 22:13:11 UTC 2003

On Sat, Oct 04, 2003 at 04:01:42AM -0400, seth vidal wrote:
> Hi,
>  I wanted to bring up a couple of things that may or may not have been in
> the archives. Originally we have discussed a way to represent the
> metadata, the key information used for depsolving and for grabbing
> package information. The goal was to see if we could come to some
> consensus on a format to use so we don't end up having N repository
> metadata data types. We all wanted something to be as small as possible
> but carry a lot of data.
> 
> I think the idea we got fairly comfortable with was:
> 
>  handful of files idea - this is adrian's - the idea is to have 3 or 4
> files which house all the data. The first file maybe lists the
> channels/repositories and checksums on them - that way if that file has
> changed you know if you need to get the others. The second is the file I
> posted a little bit ago - the main package information file. The third
> is a file containing the complete list of all the files for every
> package. 

	I was thinking that file #2 would probably only be
something like:

name version release epoch arch size headersize [url]

for each package in the channel/repo/dir/whatever
(url in brackets since it wouldn't be needed if you stick
all the rpms in the same dir, but adding it would be
theoretically more flexible).
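
A minimal sketch of reading such a file, assuming one
package per line, whitespace-separated fields in the order
given above, and an optional trailing url. The names
PackageEntry, parse_package_line and parse_package_list are
made up for illustration, and treating size and headersize
as byte counts is an assumption, not something settled here.

```python
from typing import List, NamedTuple, Optional


class PackageEntry(NamedTuple):
    """One line of the proposed package list (hypothetical type)."""
    name: str
    version: str
    release: str
    epoch: str               # kept as text; assumes "0" is written when unset
    arch: str
    size: int                # assumed to be bytes
    headersize: int          # assumed to be bytes
    url: Optional[str] = None  # only present if rpms are not all in one dir


def parse_package_line(line: str) -> PackageEntry:
    """Parse 'name version release epoch arch size headersize [url]'."""
    fields = line.split()
    if len(fields) not in (7, 8):
        raise ValueError(
            f"expected 7 or 8 fields, got {len(fields)}: {line!r}")
    name, version, release, epoch, arch, size, headersize = fields[:7]
    url = fields[7] if len(fields) == 8 else None
    return PackageEntry(name, version, release, epoch, arch,
                        int(size), int(headersize), url)


def parse_package_list(text: str) -> List[PackageEntry]:
    """Parse a whole file, skipping blank lines."""
    return [parse_package_line(l) for l in text.splitlines() if l.strip()]
```

Leaving url as None when the field is absent lets a client
fall back to "same dir as the list" without a special value.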

In up2date terms, this is the listPackages() call that
returns the latest and greatest for a channel, so I
can see what's interesting before going to get any
more metadata. For "update"-oriented clients aiming
for low bandwidth, this is handy.

The data like seth posted would be in another
file in case you want to scan packages by
buildhost or url or packager, or whatever...
Collapsing this into the file with dep/provides/obsoletes
info is probably a workable compromise, though I
would tend toward exploding/normalizing them a bit more.

For the update-only case, the win is that you grab maybe
20k of data, see which packages you want to update,
grab the needed headers, solve deps, etc. Then fetch
the rest of the packages (skipping the header, since
you already have it).

Doing it that way uses less bandwidth, but
you potentially have less data lying around to
do arbitrary poking at the repos with (say, the
"show me the packages with changelog entries from
today" query someone mentioned the other day).

There's probably a good compromise in the middle
somewhere.

Adrian