[Rpm-metadata] [Patch] createrepo --check option take two

seth vidal skvidal at phy.duke.edu
Mon Jul 11 06:02:08 UTC 2005


On Wed, 2005-06-01 at 22:19 +0200, Hans-Peter Jansen wrote:
> On Thursday, 26 May 2005 06:03, seth vidal wrote:
> > On Tue, 2005-05-24 at 12:07 +0200, Hans-Peter Jansen wrote:
> > > Hi Seth,
> > >
> > > Here's a slightly reworked version of the --check option. It now
> > > checks the timestamp of the directory containing the rpm, because an
> > > older rpm appeared today in one of my rsynced SUSE update repos (due
> > > to some internal lag) and didn't trigger the rebuild.
> > >
> > > As a nice plus, the number of stat calls is greatly reduced if a dir
> > > in the repo is not up to date (not that it matters compared to the
> > > following repo rebuild...).
> > >
> > > Do you think it's worth including upstream now?
> >
> > quite possibly, yes.
> 
> Then please commit the attached patch on top of the previous one.
> It fixes a problem that occurs when the rpm files are in the current
> directory: os.path.dirname() then returns an empty string, which
> os.path.getmtime() doesn't like :-(.
> 
> > There are some other things I'd like to see done to the
> > format/program as well:
> > 1. make the checksum be an internal package checksum and/or store a
> > cache of package checksums and rebuild based on timestamp change (for
> > quicker re-indexing of a repo)
> 
> Will need to take a deeper look into things to grok this.
> 
> > 2. split out the metadata some more as described a few months ago
> 
> Do you have a pointer handy? Either I missed it, or I wasn't subscribed 
> then..
> 
> > 3. work on any ways to make the repo creation as fast as possible.
> 
> Sure. This option already has a nice ROI for the pretty common case
> of an unchanged repo, but you're right, speeding up the creation case
> wouldn't hurt either ;-).
> 
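
For reference, here is a minimal sketch of the directory-timestamp check and the empty-dirname problem described in the quoted text. It is not the attached patch: the function names dir_mtime and repo_is_stale and the metadata_mtime parameter are made up for illustration, and the fallback to the current directory is just one plausible way to handle the empty string.

```python
import os


def dir_mtime(rpm_path):
    """Return the mtime of the directory that contains rpm_path."""
    # os.path.dirname("foo.rpm") is "", and os.path.getmtime("") fails,
    # so fall back to the current directory in that case.
    dirname = os.path.dirname(rpm_path) or os.curdir
    return os.path.getmtime(dirname)


def repo_is_stale(rpm_paths, metadata_mtime):
    """Return True if any directory holding an rpm is newer than the metadata.

    metadata_mtime is assumed to be the mtime of the existing repodata,
    obtained by the caller.
    """
    seen = set()
    for path in rpm_paths:
        dirname = os.path.dirname(path) or os.curdir
        if dirname in seen:
            continue  # one stat per directory instead of one per package
        seen.add(dirname)
        if dir_mtime(path) > metadata_mtime:
            return True
    return False
```

Stopping at the first directory that is newer than the metadata is what keeps the number of stat calls low when the repo is out of date.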

Pete,
 sorry for such a long delay, but I don't spend much time on my
createrepo mail. It's a shame, but I don't. I spent some time today
futzing with this problem at the behest of Dan Williams, who is working
with the extras-buildsystem. I took a look at the hotshot profile
statistics and compared them to the timing of yum-arch runs as well.

The majority of the additional time in any createrepo run is eaten up
by checksum creation. On a repository that is relatively unchanged, you
shouldn't need to regenerate a checksum you've already computed once.
The problem is that reading through the existing repodata to find the
checksums takes a lot of time too, and it would add a lot of code. So
instead I just added an option for createrepo to point to a cachedir:

createrepo -c /path/somewhere repodir

If /path/somewhere does not exist, createrepo creates it. When it
checksums the packages in repodir, it writes one file per package into
that directory. The file is named 'name-hdrid', for example:
   yum-utils-e5d663a415d33630bb94596153ae57a80447e38f
The file contains just the checksum of the package file. The whole
point is to be able to skip the checksumming if the package hasn't
changed, and if the hdrid is unchanged it is extremely unlikely that
the package has changed.

The next time you run createrepo pointing to that cachedir, if it
finds a name-hdrid file matching a package, it opens that file and reads
the first line, which should be the checksum of the package file. Now,
I grant you that someone who can write to your cachedir could change
the checksum createrepo puts into the metadata. But anyone who can
write to the cachedir can presumably write to the repository too, in
which case they could just modify a package directly.
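
A rough sketch of the cache lookup described above, written in present-day Python rather than copied from the checked-in code. The helper names file_checksum and cached_checksum are illustrative, the caller is assumed to have already read the package name and hdrid from the rpm header, and sha1 is an assumption about the checksum type.

```python
import hashlib
import os


def file_checksum(path, algo="sha1"):
    """Checksum a file by reading it in chunks (the expensive step)."""
    digest = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_checksum(pkg_path, name, hdrid, cachedir):
    """Return the package checksum, using cachedir to skip recomputation."""
    os.makedirs(cachedir, exist_ok=True)  # create the cachedir if missing
    cache_file = os.path.join(cachedir, "%s-%s" % (name, hdrid))

    if os.path.exists(cache_file):
        # Same name-hdrid seen before: trust the first line of the file.
        with open(cache_file) as f:
            cached = f.readline().strip()
        if cached:
            return cached

    # First time for this name-hdrid: checksum the file and remember it.
    checksum = file_checksum(pkg_path)
    with open(cache_file, "w") as f:
        f.write(checksum + "\n")
    return checksum
```

Note that the second call never opens the package file at all, which is where the time saving comes from, and also why the cachedir has to be as trusted as the repository itself.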


I've checked in the code and tested it. It makes a huge difference in
repo creation time.

First-run generation of a repository is still the same speed b/c it has
to checksum each file. The second run takes about half the time. On a
p3-1ghz with 384M of ram indexing 1989 packages, the first run took 3m8s
and the second run took 1m43s.

Does that sound pretty reasonable?

-sv




