[Rpm-metadata] [Patch] createrepo --check option take two

Hans-Peter Jansen hpj at urpla.net
Mon Jul 11 21:09:14 UTC 2005


Hi Seth,

On Monday, 11 July 2005 08:02, seth vidal wrote:
>
> Pete,
>  sorry for such a long delay but I don't focus too much time on my
> createrepo mail - it's a shame but I don't.

Seth, that's absolutely no problem, as long as you come back from time 
to time to look at it ;-)

> I spent some time today 
> futzing with this problem at the behest of dan williams working with
> the extras-buildsystem. So I took a look at the hotshot profiles
> statistics and compared them to the timing of yum-arch runs as well.
>
> The majority of the additional time in any createrepo run is being
> eaten up by the checksum creation. So on a repository that is
> relatively unchanged you shouldn't really need to regenerate the
> checksum if you've already done it once. The problem is - reading
> through the repodata present to find the checksums takes a lot of
> time, too and it would add a lot of code. So instead of that I just
> added an option for createrepo to point to a cachedir.
>
> createrepo -c /path/somewhere repodir
>
> now if there is no /path/somewhere then it will make that directory
> and when it checksums the packages in repodir it will write a file
> out to that directory for each package. The file is named
> 'name-hdrid' for each file ex:
>    yum-utils-e5d663a415d33630bb94596153ae57a80447e38f
> The contents of the file is just the checksum of the file. The whole
> point is to be able to skip over the checksumming if the file hasn't
> changed and if the hdrid is unchanged it is extremely unlikely that
> the file has changed.
>
> So the next time you run createrepo pointing to that cachedir then if
> it finds a name-hdrid that is the same it will open that file and
> read out the first line, which should be the checksum of the package
> file.  Now, I grant you if someone can write to your cachedir then
> they could change the checksum createrepo would put into the metadata
> but if they can write to your cachedir then you'd think they'd just
> go and modify a package b/c if they can write to the cachedir then
> they can write to the repository, too.
>
>
> I've checked in the code and tested it. It makes a huge difference in
> repo creation time.
>
> First run generation of a repository is still the same speed b/c it
> has to checksum each file. The second run is about half the time. On
> a p3-1ghz with 384M of ram indexing 1989 packages: first run took
> 3m8s second run took 1m43s
>
> Does that sound pretty reasonable?

More than that: this is tremendous progress in my eyes.
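
To check my understanding of the cachedir scheme described above, here is a
minimal sketch in Python. It is not the code that was checked in: the function
name cached_checksum, the choice of SHA-1, and passing name and hdrid in as
plain strings (in createrepo they would come from the RPM header) are all
assumptions made for illustration.

```python
import hashlib
import os


def cached_checksum(pkg_path, name, hdrid, cachedir):
    """Return the package checksum, reusing a cached value when present.

    Hypothetical helper: name and hdrid are assumed to be supplied by
    the caller; SHA-1 is an assumed checksum type.
    """
    # Create the cache directory on first use, as described in the mail.
    os.makedirs(cachedir, exist_ok=True)
    cache_file = os.path.join(cachedir, "%s-%s" % (name, hdrid))

    # Cache hit: the first line of the file is the stored checksum.
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            return f.readline().strip()

    # Cache miss: checksum the package in chunks and store the result.
    digest = hashlib.sha1()
    with open(pkg_path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    checksum = digest.hexdigest()
    with open(cache_file, "w") as f:
        f.write(checksum + "\n")
    return checksum
```

Note that a cache hit never opens the package itself, which is where the
saving on the second run would come from. It also means a package whose
contents change while its name and hdrid stay the same would keep its old
checksum.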

Unfortunately I'm away from my office at the moment; I will check out the 
code and test it as soon as I'm back (next week at the latest). Do you 
think that also applying a timestamp check to the checksum files could 
further improve the runtime, or would it just add unnecessary 
complexity?
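
What I have in mind with the timestamp check is roughly the following. This
is only a sketch under my own assumptions: the helper name
cache_entry_is_fresh is hypothetical, and it treats a cached checksum file as
valid only if it is at least as new as the package it describes.

```python
import os


def cache_entry_is_fresh(pkg_path, cache_file):
    """Return True if cache_file exists and is not older than pkg_path.

    Hypothetical helper, not part of createrepo: compares modification
    times only and does not read either file.
    """
    if not os.path.exists(cache_file):
        return False
    # A package modified after its cache entry was written is treated
    # as changed, so the caller should recompute the checksum.
    return os.stat(cache_file).st_mtime >= os.stat(pkg_path).st_mtime
```

This costs one extra stat call per file, so whether it pays off depends on
how that compares with the work it would let createrepo skip.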

Pete


