[Yum-devel] [PATCH] DMD: use pkgId to join filelists_db & primary_db.

Zdenek Pavlas zpavlas at redhat.com
Fri Nov 9 11:07:20 UTC 2012


> > Yum relies too much on createrepo inner workings, assumes that
> > pkgKeys in filelists_db and primary_db are equal.

>  That's not true, AIUI. pkgKey is generated by the order things are
> found for primary, that is true ... but filelists/other both lookup
> pkgKey based on pkgId, which is why you have to generate primary
> before filelists/other in createrepo.

Uhm, haven't checked that..  Yes, createrepo creates "current_packages"
and "all_packages" pkgId hashes, but does not seem to actually use them,
as adding a dummy <package> entry at the beginning of filelists.xml
produces different pkgKeys:

$ sqlite3 primary.sqlite 'SELECT pkgKey,pkgId from packages'
1|1cdbab9c472ae1f093aa855c02acb9abc8a163a92716b628c889734d5ff7fd6f
2|58be58073b41ef0023455ccaf24535f70260725713156d56d1b1158e670a485e
3|9f42e4b19496f3a53e13db50164970fc5d09db65d4f40466694ded80d96b157e

$ sqlite3 filelists.sqlite 'SELECT pkgKey,pkgId from packages'
1|dummy00000000000000000000000000000000000000000000000000000000000
2|1cdbab9c472ae1f093aa855c02acb9abc8a163a92716b628c889734d5ff7fd6f
3|58be58073b41ef0023455ccaf24535f70260725713156d56d1b1158e670a485e
4|9f42e4b19496f3a53e13db50164970fc5d09db65d4f40466694ded80d96b157e

createrepo/db.c:
  yum_db_package_ids_prepare():
    INSERT INTO packages (pkgId) VALUES (?)
    ..so {filelists,other}.packages.pkgKey is auto-generated.
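
The effect is easy to reproduce with Python's sqlite3 module.  This is
only a minimal sketch: it assumes a simplified packages table rather
than createrepo's real schema, and the helper name load_pkgids() and the
shortened pkgIds are made up for illustration.

```python
import sqlite3

def load_pkgids(pkgids):
    """Insert pkgIds only and return the {pkgId: pkgKey} mapping SQLite assigned.

    Assumption: a simplified schema, not createrepo's real one.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT)")
    # Same shape as the INSERT quoted above: pkgKey is never supplied,
    # so SQLite assigns the next rowid in insertion order.
    con.executemany("INSERT INTO packages (pkgId) VALUES (?)",
                    [(p,) for p in pkgids])
    mapping = dict((pkgid, key) for key, pkgid in
                   con.execute("SELECT pkgKey, pkgId FROM packages"))
    con.close()
    return mapping

primary = load_pkgids(["1cdbab9c", "58be5807", "9f42e4b1"])
filelists = load_pkgids(["dummy000", "1cdbab9c", "58be5807", "9f42e4b1"])

# The same pkgId ends up with a different pkgKey in each database.
for pkgid in sorted(primary):
    print(pkgid, primary[pkgid], filelists[pkgid])
```

With the dummy entry first, every real package gets a pkgKey in
filelists that is one higher than its pkgKey in primary.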

>  Any form of delta metadata that doesn't produce a byte for byte
> compatible version of _something_ from upstream is going to require a
> huge amount of verification work.

I have been thinking about this for some time.  Being byte-compatible
while handling updates at the package level is impossible.  Even if we
give up on "fast" DB updates and patch and compile XML instead,
checksums still won't match, due to things like inconsistent use of
whitespace between </package> and the following <package> tag!

But there's no sane reason to require byte-level compatibility.
All we need is to make sure the local DB contains the right set
of packages.  So something like:

'SELECT pkgId from packages ORDER BY pkgId'| sha256sum

should be included in the DMD update info.  Then we can download
the changeset and verify that, once applied, we end up with a
consistent database.

This also allows using a single DMD file that could be applied
to ANY recent snapshot, and bring it to the current state.
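
A rough Python equivalent of that checksum, as a sketch only.  The
function name pkgid_set_digest() and the toy make_db() helper are
assumptions for illustration; the digest is taken over the sorted
pkgIds, one per line, which is intended to mirror piping the query
output into sha256sum.

```python
import hashlib
import sqlite3

def pkgid_set_digest(con):
    """Return the SHA-256 hex digest of all pkgIds, sorted, one per line."""
    h = hashlib.sha256()
    for (pkgid,) in con.execute("SELECT pkgId FROM packages ORDER BY pkgId"):
        h.update(pkgid.encode("ascii") + b"\n")
    return h.hexdigest()

def make_db(pkgids):
    """Build a toy in-memory packages table (simplified, hypothetical schema)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT)")
    con.executemany("INSERT INTO packages (pkgId) VALUES (?)",
                    [(p,) for p in pkgids])
    return con

a = make_db(["aaa", "bbb", "ccc"])
b = make_db(["ccc", "aaa", "bbb"])   # same set, different pkgKeys
c = make_db(["aaa", "bbb"])          # one package missing

print(pkgid_set_digest(a) == pkgid_set_digest(b))   # True
print(pkgid_set_digest(a) == pkgid_set_digest(c))   # False
```

The digest depends only on which packages are present, not on pkgKey
assignment or insertion order, which is the property we want here.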

> ...and we'll have to keep and follow the entire chain (and we can't
> actually verify any of the repomd's that aren't current).

I think chaining diffs is a bad idea.  Just add a single file that
contains packages recently added or removed.  Or add two such files,
one covering the last 2 days, the other the last 2 weeks.
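
To illustrate why one such file can serve every snapshot inside its
window, here is a sketch under assumptions: the delta is modelled as two
sets of pkgIds ("added" and "removed") plus the expected digest of the
current package set.  That format, and the names digest() and
apply_delta(), are hypothetical and not an existing yum or createrepo
interface.

```python
import hashlib

def digest(pkgids):
    """SHA-256 over the sorted pkgIds, one per line."""
    data = "".join(p + "\n" for p in sorted(pkgids))
    return hashlib.sha256(data.encode("ascii")).hexdigest()

def apply_delta(snapshot, added, removed, expected):
    """Bring a snapshot to the current state, or raise if the result is wrong.

    Hypothetical format: 'added' and 'removed' are sets of pkgIds covering
    the whole window; 'expected' is the digest of the current package set.
    """
    result = (set(snapshot) - set(removed)) | set(added)
    if digest(result) != expected:
        raise ValueError("delta does not fit this snapshot; do a full download")
    return result

current = {"a", "c", "d", "e"}
added = {"d", "e"}      # added during the window
removed = {"b", "x"}    # removed during the window ("x" came and went)
expected = digest(current)

monday = {"a", "b", "c"}            # snapshot from the start of the window
wednesday = {"a", "c", "d", "x"}    # snapshot from the middle of the window

print(apply_delta(monday, added, removed, expected) == current)
print(apply_delta(wednesday, added, removed, expected) == current)
```

A snapshot older than the window ends up with a different digest, so
the client can detect that and fall back to a full download.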

>  We had the same problem when we used to download just the new
> primary.xml files and update our local .sqlite files ... and we just
> assumed it'd be fine (we didn't do verification) ... and it mostly
> worked, except when it didn't. We eventually fixed these problems
> by just not updating.

Why didn't this work?  (I assume the primary.xml was checksummed.)

>  It also makes delta's _much_ safer if we can just test "did the
> repodata that came out match what we would have downloaded".

Yes.. the "checksum of the set of pkgIds" should cover this.
If we loop over all packages in a repo in the same order as createrepo,
and call po.dump_xml() on each, we should also get byte-compatible XML,
which we could use to detect bugs in y-m-p.

>  Putting some numbers there:
> 'SELECT * FROM filelist WHERE pkgKey=19133'
> 0.003
> 'SELECT * FROM filelist JOIN packages USING(pkgKey) WHERE
> pkgId="af3720e24e9a509ee263916b7061387c8bb16b8679bd848ddcd2199fd2a4d030"'
> 0.004

The patch changes the file-provides queries; the WHERE clause does not
change.  They just need to return pkgId instead of pkgKey.

~$ time sqlite3 filelists_db.sqlite 'SELECT pkgKey FROM filelist WHERE dirname LIKE "/usr/share/mc/%"'
106963
106963
real	0m0.174s
~$ time sqlite3 filelists_db.sqlite 'SELECT pkgId FROM filelist JOIN packages USING(pkgKey) WHERE dirname LIKE "/usr/share/mc/%"'
6478b25cc455013b8e2bbcaa15d1c742a10a41d8999ba8f7e91341d8ec74139b
6478b25cc455013b8e2bbcaa15d1c742a10a41d8999ba8f7e91341d8ec74139b
real	0m0.178s

Isn't being a tiny bit slower better than being broken?
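
A small self-contained sketch of the failure and the fix.  The schema
is reduced to the columns needed, and the package names, shortened
pkgIds and the helpers name_by_key() and name_by_id() are made up for
illustration; this is not the actual yum code.

```python
import sqlite3

IDS = ["1cdbab9c", "58be5807", "9f42e4b1"]   # shortened, illustrative pkgIds
NAMES = ["bash", "mc", "zsh"]                 # illustrative package names

primary = sqlite3.connect(":memory:")
primary.execute(
    "CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT, name TEXT)")
primary.executemany("INSERT INTO packages (pkgId, name) VALUES (?, ?)",
                    list(zip(IDS, NAMES)))

filelists = sqlite3.connect(":memory:")
filelists.execute("CREATE TABLE packages (pkgKey INTEGER PRIMARY KEY, pkgId TEXT)")
filelists.execute("CREATE TABLE filelist "
                  "(pkgKey INTEGER, dirname TEXT, filenames TEXT, filetypes TEXT)")
# A dummy first entry shifts every pkgKey in filelists by one.
filelists.executemany("INSERT INTO packages (pkgId) VALUES (?)",
                      [(p,) for p in ["dummy000"] + IDS])
# mc has pkgKey 3 here, but pkgKey 2 in primary.
filelists.execute(
    "INSERT INTO filelist VALUES (3, '/usr/share/mc/help', 'mc.hlp', 'f')")

def name_by_key(pattern):
    """Old behaviour: carry the filelists pkgKey over to primary."""
    (key,) = filelists.execute(
        "SELECT pkgKey FROM filelist WHERE dirname LIKE ?",
        (pattern,)).fetchone()
    return primary.execute(
        "SELECT name FROM packages WHERE pkgKey = ?", (key,)).fetchone()[0]

def name_by_id(pattern):
    """Patched behaviour: join to packages and carry the pkgId over."""
    (pkgid,) = filelists.execute(
        "SELECT pkgId FROM filelist JOIN packages USING(pkgKey) "
        "WHERE dirname LIKE ?", (pattern,)).fetchone()
    return primary.execute(
        "SELECT name FROM packages WHERE pkgId = ?", (pkgid,)).fetchone()[0]

print(name_by_key("/usr/share/mc/%"))   # zsh: the wrong package
print(name_by_id("/usr/share/mc/%"))    # mc
```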
A slightly changed patch: instead of overloading _sql_pkgKey2po(), it adds _sql_pkgId2po().

diff --git a/yum/sqlitesack.py b/yum/sqlitesack.py
index a955895..d07f892 100644
--- a/yum/sqlitesack.py
+++ b/yum/sqlitesack.py
@@ -438,6 +438,7 @@ class YumSqlitePackageSack(yumRepo.YumPackageSack):
             'requires' : { },
             }
         self._key2pkg = {}
+        self._id2pkg = {}
         self._pkgname2pkgkeys = {}
         self._pkgtup2pkgs = {}
         self._pkgnames_loaded = set()
@@ -504,6 +505,7 @@ class YumSqlitePackageSack(yumRepo.YumPackageSack):
             del self.pkgobjlist
         self._pkgobjlist_dirty = False
         self._key2pkg = {}
+        self._id2pkg = {}
         self._pkgname2pkgkeys = {}
         self._pkgnames_loaded = set()
         self._pkgmatch_fails = set()
@@ -837,6 +839,24 @@ class YumSqlitePackageSack(yumRepo.YumPackageSack):
             pkgs.append(pkg)
         return pkgs
 
+    def _sql_pkgId2po(self, repo, cur, pkgs):
+        """ Takes a cursor and maps the pkgId rows into a list of packages. """
+        for ob in cur:
+            pkgId = ob['pkgId']
+            try:
+                pkg = self._id2pkg[repo][pkgId]
+            except KeyError:
+                ob = self._sql_MD('primary', repo, '''
+                    SELECT pkgKey, pkgId, name, epoch, version, release, arch
+                    FROM packages WHERE pkgId = ?''', (pkgId,)).fetchone()
+                if ob is None:
+                    msg = "pkgId %s doesn't exist in repo %s" % (pkgId, repo)
+                    raise Errors.RepoError, msg
+                pkg = self._packageByKeyData(repo, ob['pkgKey'], ob)
+                self._id2pkg.setdefault(repo, {})[pkgId] = pkg
+            if pkg:
+                pkgs.append(pkg)
+
     def _skip_all(self):
         """ Are we going to skip every package in all our repos? """
         skip_all = True
@@ -964,10 +984,10 @@ class YumSqlitePackageSack(yumRepo.YumPackageSack):
 
                 cur = cache.cursor()
                 sql_params.append(dirname)
-                executeSQL(cur, """SELECT pkgKey FROM filelist
+                executeSQL(cur, """SELECT pkgId FROM filelist JOIN packages USING(pkgKey)
                                    WHERE dirname %s ?""" % (querytype,),
                            sql_params)
-                self._sql_pkgKey2po(rep, cur, pkgs)
+                self._sql_pkgId2po(rep, cur, pkgs)
 
             return misc.unique(pkgs)
 
@@ -979,11 +999,11 @@ class YumSqlitePackageSack(yumRepo.YumPackageSack):
 
             # grab the entries that are a single file in the 
             # filenames section, use sqlites globbing if it is a glob
-            executeSQL(cur, "select pkgKey from filelist where \
+            executeSQL(cur, "SELECT pkgId FROM filelist JOIN packages USING(pkgKey) WHERE \
                     %s length(filetypes) = 1 and \
                     dirname || ? || filenames \
                     %s ?" % (dirname_check, querytype), sql_params + ['/',name])
-            self._sql_pkgKey2po(rep, cur, pkgs)
+            self._sql_pkgId2po(rep, cur, pkgs)
 
             if file_glob:
                 name_re = re.compile(fnmatch.translate(name))
@@ -1005,12 +1025,12 @@ class YumSqlitePackageSack(yumRepo.YumPackageSack):
             cache.create_function("filelist_globber", 2, filelist_globber)
             # for all the ones where filenames is multiple files, 
             # make the files up whole and use python's globbing method
-            executeSQL(cur, "select pkgKey from filelist where \
+            executeSQL(cur, "SELECT pkgId FROM filelist JOIN packages USING(pkgKey) WHERE \
                              %s length(filetypes) > 1 \
                              and filelist_globber(dirname,filenames)" % dirname_check,
                        sql_params)
 
-            self._sql_pkgKey2po(rep, cur, pkgs)
+            self._sql_pkgId2po(rep, cur, pkgs)
 
         pkgs = misc.unique(pkgs)
         return pkgs

