[Yum-devel] [PATCH] DMD: use pkgId to join filelists_db & primary_db.
James Antill
james at fedoraproject.org
Fri Nov 9 17:16:23 UTC 2012
On Fri, 2012-11-09 at 06:07 -0500, Zdenek Pavlas wrote:
> > > Yum relies too much on createrepo inner workings, assumes that
> > > pkgKeys in filelists_db and primary_db are equal.
>
> > That's not true, AIUI. pkgKey is generated by the order things are
> > found for primary, that is true ... but filelists/other both lookup
> > pkgKey based on pkgId, which is why you have to generate primary
> > before filelists/other in createrepo.
>
> Uhm, haven't checked that.. Yes, createrepo creates "current_packages"
> and "all_packages" pkgId hashes, but does not seem to actually use it,
> as adding a dummy <package> entry at the beginning of filelists.xml
> produces different pkgKeys:
Yeh, the hashes seem to only be used in the update code. *sigh*.
Still, given it's worked for 6 years or so I'm reluctant to change it
without a pressing need.
> createrepo/db.c:
> yum_db_package_ids_prepare():
> INSERT INTO packages (pkgId) VALUES (?)
> ..so {filelists,other}.packages.pkgKey is auto-generated.
This is in y-m-p, but yeh.
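To make the above concrete, here is a minimal sketch (not the y-m-p or createrepo code) of what auto-generated pkgKeys do when a dummy entry comes first in one file, and how a lookup through pkgId sidesteps it. The two-column schema, the build_db()/keys_by_pkgid() helpers and the pkgId values are all made up for illustration.

```python
import sqlite3


def build_db(pkg_ids):
    """Build an in-memory DB where pkgKey is auto-assigned in insertion order."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT)")
    db.executemany("INSERT INTO packages (pkgId) VALUES (?)",
                   [(pkg_id,) for pkg_id in pkg_ids])
    return db


def keys_by_pkgid(db):
    """Map pkgId -> pkgKey for one DB."""
    return dict(db.execute("SELECT pkgId, pkgKey FROM packages"))


# Hypothetical pkgIds; real ones are package checksums.
primary = keys_by_pkgid(build_db(["aaa", "bbb"]))
# Same packages, but with a dummy entry first, as in the filelists.xml test.
filelists = keys_by_pkgid(build_db(["dummy", "aaa", "bbb"]))

# pkgIds whose pkgKey differs between the two DBs.
mismatched = sorted(p for p in primary if primary[p] != filelists[p])

# Joining through pkgId instead: primary pkgKey -> filelists pkgKey.
primary_to_filelists = {primary[p]: filelists[p] for p in primary}
```

With the dummy entry in place every pkgKey is off by one, but the pkgId mapping still pairs the right rows.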
> > Any form of delta metadata that doesn't produce a byte for byte
> > compatible version of _something_ from upstream is going to require a
> > huge amount of verification work.
>
> Have been thinking about this for some time. Being byte-compatible,
> and handling updates at package level is impossible.
Impossible seems like a strong word to use.
> Even if we
> give up on "fast" DB updates, and patch and compile XML instead,
> due to things like inconsistent use of whitespace between </package>
> and the following <package> tags, checksums still won't match!
That's like saying "git diff" is impossible because if you screw the
whitespace up on a patch then you won't get the same result even though
it might be functionally identical.
If we've written the code which generates the XML in all the spots,
then it's even easier ... we can just have some std. whitespace
convention and always use it and it'll always be byte compatible. I
doubt anyone has a problem saying DMD is unsupported if someone wants to
do things like hand hack their primary.xml file.
> But there's no sane reason to require byte-level compatibility.
> All we need is to make sure the local DB contains the right set
> of packages. So something like:
>
> 'SELECT pkgId from packages ORDER BY pkgId'| sha256sum
Yes, in theory we could use a "yum version" like checksum. But that
will have a number of downsides:
1. It's almost certainly going to be slower to calculate than a simple
file check.
2. We have to update all the code which uses metadata to move to the new
checksumming method, like _preload_md_from_system_cache() and
intelligent mirror etc.
3. Instead of checksumming all the data, we are now checksumming some
subset and assuming that means it's all good.
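A rough sketch of the pkgId checksum, mostly to illustrate point 3. The three-column table, the build_db()/pkgid_checksum() helpers and the one-pkgId-per-line encoding are assumptions; the quoted command doesn't pin down the exact bytes that get hashed.

```python
import hashlib
import sqlite3


def pkgid_checksum(db):
    """sha256 over the sorted pkgIds, one per line (encoding is an assumption)."""
    digest = hashlib.sha256()
    for (pkg_id,) in db.execute("SELECT pkgId FROM packages ORDER BY pkgId"):
        digest.update(pkg_id.encode("utf-8") + b"\n")
    return digest.hexdigest()


def build_db(rows):
    """Hypothetical cut-down packages table: pkgKey, pkgId and one data column."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE packages "
               "(pkgKey INTEGER PRIMARY KEY, pkgId TEXT, summary TEXT)")
    db.executemany("INSERT INTO packages (pkgId, summary) VALUES (?, ?)", rows)
    return db


good = build_db([("aaa", "first"), ("bbb", "second")])
reordered = build_db([("bbb", "second"), ("aaa", "first")])
damaged = build_db([("aaa", "first"), ("bbb", "WRONG DATA")])
missing = build_db([("aaa", "first")])

same_when_reordered = pkgid_checksum(good) == pkgid_checksum(reordered)
same_when_damaged = pkgid_checksum(good) == pkgid_checksum(damaged)
same_when_missing = pkgid_checksum(good) == pkgid_checksum(missing)
```

The checksum doesn't care about insertion order and does notice a missing package, but it also can't see the damaged summary column, which is the "checksumming some subset" problem.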
> This also allows using a single DMD file that could be applied
> to ANY recent snapshot, and bring it to the current state.
I don't see how this is affected either way.
> > ...and we'll have to keep and follow the entire chain (and we can't
> > actually verify any of the repomd's that aren't current).
>
> I think chaining diffs is a bad idea. Just add a single file that
> contains packages recently added or removed. Or add two such files,
> one to cover last 2 days, other for last 2 weeks.
The point is that we'd need to allow the chain back to when we had the
first valid full metadata, how we get each set of patches is orthogonal.
> > We had the same problem when we used to download just the new
> > primary.xml files and update our local .sqlite files ... and we just
> > assumed it'd be fine (we didn't do verification) ... and it mostly
> > worked, except when it didn't. We eventually fixed these problems
> > by just not updating.
>
> Why didn't this work? (I assume the primary.xml was checksummed).
Yes, everything that was downloaded was checksummed. The problem was
that there is a near infinite chain when a bug occurs.
As I recall we started to see pkgKey mismatches (the big problem being
pkgKey for X was different for primary/filelists). It was impossible to
debug though, because it almost never happened and when it did we had no
way to know what had happened to cause it (random numbers of
primary.xml/filelists.xml had been downloaded and updates applied to
the .sqlite files, one of which broke something).
Obviously removing the .sqlite file, and thus getting yum to
regenerate it from scratch, fixed everything.
After trying to read the code and see what could be the problem for 6+
months (only seeing 2-3 cases, IIRC), we just gave up and turned updates
off.
> > It also makes deltas _much_ safer if we can just test "did the
> > repodata that came out match what we would have downloaded".
>
> If we loop over all packages in a repo in the same order as createrepo,
> and po.dump_xml() it, we should also get byte-compatible XML,
> and use it to detect bugs in y-m-p.
Right, so now we can check that against the open-checksum of
primary.xml.gz and we'll have the byte compatible check.
It's possible we'll want to have this be a check on a "non-open" file,
either by compressing the dump_xml locally with gzip/etc. (what we do
for rpm deltas) ... or even listing the raw primary.xml, when/if we have
multiple compressed metadata.
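The open-checksum check could look something like the sketch below. The open-checksum in repomd.xml covers the uncompressed file, so the locally generated XML can be hashed directly. The helper names and the sample XML are invented, and sha256 is assumed as the checksum type.

```python
import gzip
import hashlib


def open_checksum(gz_bytes):
    """Checksum of the uncompressed contents, as open-checksum records."""
    return hashlib.sha256(gzip.decompress(gz_bytes)).hexdigest()


def matches_open_checksum(local_xml, expected):
    """True if locally generated XML is byte-identical to what upstream hashed."""
    return hashlib.sha256(local_xml).hexdigest() == expected


# Stand-in for upstream's primary.xml and its compressed form.
upstream_xml = b"<metadata>\n<package/>\n<package/>\n</metadata>\n"
expected = open_checksum(gzip.compress(upstream_xml))

# Functionally identical output with one extra newline between packages.
sloppy_xml = b"<metadata>\n<package/>\n\n<package/>\n</metadata>\n"

exact_ok = matches_open_checksum(upstream_xml, expected)
sloppy_ok = matches_open_checksum(sloppy_xml, expected)
```

One stray newline is enough to fail the check, which is why a fixed whitespace convention in the generating code matters.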
> Isn't being a tiny bit slower better than being broken?
Again, I would not label it as broken given how long it's worked and
that to break it people need to hand edit their repodata.