[Yum-devel] [PATCH] configure sqlite to return utf-8-encoded strs instead of unicode objects

Mike Bonnet mikeb at redhat.com
Fri Aug 7 15:54:23 UTC 2009


On 08/06/2009 05:25 PM, James Antill wrote:
> On Thu, 2009-08-06 at 13:50 -0400, Mike Bonnet wrote:
>> sqlite by default returns all text as unicode objects, and this causes a
>> number of problems when merging repos which contain utf-8 characters in
>> Provides or Requires (as the current F11/F12 repos do).  For a testcase,
>> try merging 2 F12 repos, and you should see it fail with a
>> UnicodeDecodeError in packages.py:_dump_pco().  This patch instructs
>> sqlite to return all text as utf-8-encoded strs, which avoids these
>> encoding issues.
>
>   Ugh. While it would be nice to move everything to utf8, the timing here
> seems bad ... in that it seems too close to F12/RHEL-6 to do this kind of
> change without a good idea it isn't going to break anything else.

Yeah, I understand the timing might be bad, but I'm honestly surprised nothing other than mergerepo is breaking.  I guess that's the only place that a lot of string concatenation happens.  But we now have packages that include utf8 in their Provides, so I'd expect this problem to get worse, not better.

http://koji.fedoraproject.org/koji/rpminfo?rpmID=1118421

Look at the bottom of the Provides list.  Those entries are generated from the output of fc-query on the font files, and apparently utf8 is valid there.

The rpm python api is also returning all strings as utf-8-encoded strs, not unicode, so this should increase consistency and remove a lot of cases where we're doing unnecessary unicode->str conversions.

>   So what testing did you do to see what else it broke?
>   For instance, does "yum search ®" still work?
>   Did you test with any weird LANG= values?
>   I'm pretty sure you didn't test with ./test/yum-release-i18n-test.sh,
> but if you did that'd be cool.

I just ran ./test/yum-release-i18n-test.sh.  It generated a ton of output and seemed to complete successfully (return value of 0).  Anything specific I should be looking for?

yum search ® works with and without the patch, and returns the same list of packages.

yum search 'font(эвристика)' works with and without the patch, though it doesn't find anything in either case.

Without the patch:

# yum install 'font(эвристика)'
Loaded plugins: fastestmirror, presto, refresh-packagekit
Loading mirror speeds from cached hostfile
  * fedora: download.bos.redhat.com
  * updates: download.bos.redhat.com
Setting up Install Process
Error: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

# repoquery --whatprovides 'font(эвристика)'
<big long traceback truncated>
   File "/usr/lib/python2.6/site-packages/yum/sqlitesack.py", line 1209, in returnPackages
     pkgobjlist = self._buildPkgObjList(repoid, patterns, ignore_case)
   File "/usr/lib/python2.6/site-packages/yum/sqlitesack.py", line 53, in newFunc
     raise Errors.RepoError, str(e)
yum.Errors.RepoError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.


With the patch:

# yum install 'font(эвристика)'
Loaded plugins: fastestmirror, presto, refresh-packagekit
Loading mirror speeds from cached hostfile
  * fedora: download.bos.redhat.com
  * updates: download.bos.redhat.com
Setting up Install Process
Resolving Dependencies
--> Running transaction check
---> Package apanov-heuristica-fonts.noarch 0:20090125-5.fc11 set to be updated
--> Finished Dependency Resolution

Dependencies Resolved

=================================================================================================================================================
  Package                                      Arch                        Version                              Repository                   Size
=================================================================================================================================================
Installing:
  apanov-heuristica-fonts                      noarch                      20090125-5.fc11                      fedora                      185 k

Transaction Summary
=================================================================================================================================================
Install      1 Package(s)
Update       0 Package(s)
Remove       0 Package(s)

Total download size: 185 k
Is this ok [y/N]:

# repoquery --whatprovides 'font(эвристика)'
/usr/lib/python2.6/site-packages/yum/packages.py:397: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
   if reqn == n:

(No results, so this is still a failure, but there's no traceback.  Probably worth looking into where the unicode is coming from in this case.)
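
For illustration, the warning above is what you get when a utf-8-encoded str is compared against a unicode object holding the same text: the comparison cannot convert one side and treats the two as unequal.  A rough sketch of normalizing both sides before comparing follows.  It is written for Python 3, where bytes stands in for the Python 2 str type, and the helper names as_utf8 and same_name are hypothetical, not anything in yum.

```python
def as_utf8(value):
    """Return value as UTF-8 bytes; text is encoded, bytes pass through."""
    if isinstance(value, str):
        return value.encode("utf-8")
    return value


def same_name(reqn, n):
    """Compare two names after normalizing both to UTF-8 bytes."""
    return as_utf8(reqn) == as_utf8(n)


# One value as raw UTF-8 bytes (e.g. from the database), one as text
# (e.g. from the command line).  Both spell the same Provides name.
from_db = "font(эвристика)".encode("utf-8")
from_cli = "font(эвристика)"

# Comparing the two representations directly never matches.
naive_match = (from_db == from_cli)

# Normalizing first makes the comparison meaningful.
normalized_match = same_name(from_db, from_cli)
```

This only shows the shape of a possible workaround; it does not identify where the unicode value in repoquery actually comes from.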

So I see no regressions in functionality, and increased functionality in at least one case.  All of these tests were run on F11.  It's a one-line patch, so it's extremely easy to test locally.  Anything else you want me to try?

>   How much did you look for more local change that would make mergerepo
> happy?

The Koji mergerepos script (whose failures prompted all this) already has a number of local hacks to deal with unicode values for the pkgId, name, version, etc.  But with even Provides and Requires coming back as unicode, I would essentially have to iterate over every PRCO value for every package and convert each element of each of those tuples from unicode to str.  If *any* element we try to concatenate onto the xml is unicode, then the entire xml gets coerced to unicode via the ascii codec, and the previously-concatenated utf-8 strs fail decoding.  Treating everything as utf-8 seems like the only reasonable way to handle this.  An alternative would be to construct the xml using a DOM library, which could hopefully avoid all the string concatenation and deal sensibly with unicode values, or to stream the data out to disk directly instead of building up a huge in-memory string.  But either of those options seemed much more invasive.
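
To make the coercion failure concrete, here is a small sketch.  It is written for Python 3, which refuses to mix bytes and text at all, so the function py2_style_concat (a hypothetical name) emulates what Python 2 did implicitly: decode the byte string with the ascii codec before joining.  The xml fragments are made up for the example and are not real mergerepo output.

```python
def py2_style_concat(left, right):
    """Emulate Python 2's implicit str/unicode coercion on concatenation."""
    # Python 2 decoded the byte-string side with the ascii codec whenever
    # the other operand was a unicode object.
    if isinstance(left, bytes) and isinstance(right, str):
        return left.decode("ascii") + right
    if isinstance(left, str) and isinstance(right, bytes):
        return left + right.decode("ascii")
    return left + right


def to_utf8(value):
    """Encode text to UTF-8 bytes; leave bytes untouched."""
    return value.encode("utf-8") if isinstance(value, str) else value


# XML accumulated so far as UTF-8 bytes, already containing non-ASCII data.
xml = '<rpm:entry name="font(эвристика)"/>'.encode("utf-8")
# The next piece arrives as a text (unicode) object.
next_piece = '<rpm:entry name="foo"/>'

try:
    py2_style_concat(xml, next_piece)
    coercion_failed = False
except UnicodeDecodeError:
    # The ascii codec cannot decode the earlier non-ASCII bytes.
    coercion_failed = True

# Keeping every piece as UTF-8 bytes avoids the implicit decode entirely.
safe_xml = b"".join(to_utf8(piece) for piece in (xml, next_piece))
```

Note that a single unicode piece is enough to trigger the failure, even when that piece itself is pure ASCII, which is why converting only some fields in mergerepos does not help.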

>> ---
>>    sqlitecachec.py |    1 +
>>    1 files changed, 1 insertions(+), 0 deletions(-)
>>
>> diff --git a/sqlitecachec.py b/sqlitecachec.py
>> index 7ed5056..8b0ca08 100644
>> --- a/sqlitecachec.py
>> +++ b/sqlitecachec.py
>> @@ -29,6 +29,7 @@ class RepodataParserSqlite:
>>            if not filename:
>>                return None
>>            con = sqlite.connect(filename)
>> +        con.text_factory = str
>>            if sqlite.version_info[0] > 1:
>>                con.row_factory = sqlite.Row
>>            cur = con.cursor()
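
For anyone who wants to see the effect of text_factory in isolation, here is a minimal sketch using the standard sqlite3 module and an in-memory database.  It assumes Python 3, where bytes plays the role that str plays in the patch above (a raw utf-8 byte string rather than a unicode object).  The table name "provides" and the helper fetch_names are made up for the example and have nothing to do with the real yum schema.

```python
import sqlite3


def fetch_names(text_factory=None):
    """Store one non-ASCII value and read it back with the given text_factory."""
    con = sqlite3.connect(":memory:")
    if text_factory is not None:
        # Controls the Python type returned for TEXT columns on this connection.
        con.text_factory = text_factory
    con.execute("CREATE TABLE provides (name TEXT)")
    con.execute("INSERT INTO provides VALUES (?)", ("font(эвристика)",))
    rows = con.execute("SELECT name FROM provides").fetchall()
    con.close()
    return [row[0] for row in rows]


# Default behaviour: TEXT comes back as a unicode text object.
default_names = fetch_names()

# With the factory set: TEXT comes back as the raw UTF-8 bytes.
byte_names = fetch_names(bytes)
```

The setting is per connection, so it only affects queries made through the connection it was set on.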


