[Yum-devel] [PATCH] configure sqlite to return utf-8-encoded strs instead of unicode objects

Seth Vidal skvidal at fedoraproject.org
Tue Aug 18 15:19:03 UTC 2009



On Fri, 7 Aug 2009, Mike Bonnet wrote:

> On 08/06/2009 05:25 PM, James Antill wrote:
>> On Thu, 2009-08-06 at 13:50 -0400, Mike Bonnet wrote:
>>> sqlite by default returns all text as unicode objects, and this causes a
>>> number of problems when merging repos which contain utf-8 characters in
>>> Provides or Requires (as the current F11/F12 repos do).  For a testcase,
>>> try merging 2 F12 repos, and you should see it fail with a
>>> UnicodeDecodeError in packages.py:_dump_pco().  This patch instructs
>>> sqlite to return all text as utf-8-encoded strs, which avoids these
>>> encoding issues.
>>
>>   Ugh. While it would be nice to move everything to utf8, the timing here
>> seems bad ... in that it seems too close to F12/RHEL-6 to do this kind of
>> change without a good idea it isn't going to break anything else.
>
> Yeah, I understand the timing might be bad, but I'm honestly surprised 
> nothing other than mergerepo is breaking.  I guess that's the only place that 
> a lot of string concatenation happens.  But we now have packages that include 
> utf8 in their Provides, so I'd expect this problem to get worse, not better.
>
> http://koji.fedoraproject.org/koji/rpminfo?rpmID=1118421
>
> Look at the bottom of the Provides list.  This is generated by the output of 
> fc-query on the font files, and apparently utf8 is valid there.
>
> The rpm python api is also returning all strings as utf-8-encoded strs, not 
> unicode, so this should increase consistency and remove a lot of cases where 
> we're doing unnecessary unicode->str conversions.
>
>>   So what testing did you do to see what else it broke?
>>   For instance, does "yum search ®" still work?
>>   Did you test with any weird LANG= values?
>>   I'm pretty sure you didn't test with ./test/yum-release-i18n-test.sh,
>> but if you did that'd be cool.
>
> I just ran ./test/yum-release-i18n-test.sh.  It generated a ton of output and 
> seemed to complete successfully (return value of 0).  Anything specific I 
> should be looking for?
>
> yum search ® works with and without the patch, and returns the same list of 
> packages.
>
> yum search 'font(эвристика)' works with and without the patch, though it 
> doesn't find anything in either case.
>
> Without the patch:
>
> # yum install 'font(эвристика)'
> Loaded plugins: fastestmirror, presto, refresh-packagekit
> Loading mirror speeds from cached hostfile
> * fedora: download.bos.redhat.com
> * updates: download.bos.redhat.com
> Setting up Install Process
> Error: You must not use 8-bit bytestrings unless you use a text_factory that 
> can interpret 8-bit bytestrings (like text_factory = str). It is highly 
> recommended that you instead just switch your application to Unicode strings.
>
> # repoquery --whatprovides 'font(эвристика)'
> <big long traceback truncated>
>  File "/usr/lib/python2.6/site-packages/yum/sqlitesack.py", line 1209, in returnPackages
>    pkgobjlist = self._buildPkgObjList(repoid, patterns, ignore_case)
>  File "/usr/lib/python2.6/site-packages/yum/sqlitesack.py", line 53, in newFunc
>    raise Errors.RepoError, str(e)
> yum.Errors.RepoError: You must not use 8-bit bytestrings unless you use a 
> text_factory that can interpret 8-bit bytestrings (like text_factory = str). 
> It is highly recommended that you instead just switch your application to 
> Unicode strings.
>
>
> With the patch:
>
> # yum install 'font(эвристика)'
> Loaded plugins: fastestmirror, presto, refresh-packagekit
> Loading mirror speeds from cached hostfile
> * fedora: download.bos.redhat.com
> * updates: download.bos.redhat.com
> Setting up Install Process
> Resolving Dependencies
> --> Running transaction check
> ---> Package apanov-heuristica-fonts.noarch 0:20090125-5.fc11 set to be updated
> --> Finished Dependency Resolution
>
> Dependencies Resolved
>
> =================================================================================================================================================
> Package                                      Arch   Version                              Repository                   Size
> =================================================================================================================================================
> Installing:
> apanov-heuristica-fonts                      noarch 20090125-5.fc11                      fedora                      185 k
>
> Transaction Summary
> =================================================================================================================================================
> Install      1 Package(s)
> Update       0 Package(s)
> Remove       0 Package(s)
>
> Total download size: 185 k
> Is this ok [y/N]:
>
> # repoquery --whatprovides 'font(эвристика)'
> /usr/lib/python2.6/site-packages/yum/packages.py:397: UnicodeWarning: Unicode 
> equal comparison failed to convert both arguments to Unicode - interpreting 
> them as being unequal
>  if reqn == n:
>
> (No results, so still failure, but no traceback.  Probably worth looking into 
> where the unicode is coming from in this case.)
>
> So I see no regressions in functionality, and increased functionality in at 
> least one case.  All of these tests were run on F11.  It's a one-line patch, 
> so it's extremely easy to test locally.  Anything else you want me to try?
>
>>   How much did you look for more local change that would make mergerepo
>> happy?
>
> The Koji mergerepos script (whose failures prompted all this) already has a 
> number of local hacks to deal with unicode values for the pkgId, name, 
> version, etc.  But with even Provides and Requires coming back as unicode, I 
> would have to essentially iterate over every PRCO value for every package and 
> convert each element of each of those tuples from unicode to str.  If *any* 
> element we try to concatenate onto the xml is unicode, then the entire xml 
> gets coerced to unicode via the ascii codec, and the previously-concatenated 
> utf-8 strs fail decoding.  Dealing with everything as utf-8 seems like the 
> only reasonable way to deal with this.  An alternative would be to construct 
> the xml using a DOM library, which could hopefully avoid all the string 
> concatenation and deal sensibly with unicode values, or stream the data out 
> to disk directly instead of building up a huge in-memory string.  But either 
> of those options seemed much more invasive.
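
The coercion failure described in the paragraph above is specific to Python 2, where mixing a `str` with a `unicode` object implicitly decodes the `str` with the ascii codec. The sketch below mimics that rule in Python 3 so it can be run today; `py2_style_concat` and the sample XML fragment are hypothetical illustrations, not code from yum or the Koji mergerepos script.

```python
def py2_style_concat(left, right):
    """Mimic Python 2's `str + unicode`: any bytes operand is decoded
    with the ascii codec as soon as the other operand is text."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        return left + right  # bytes + bytes never triggers decoding
    if isinstance(left, bytes):
        left = left.decode("ascii")
    if isinstance(right, bytes):
        right = right.decode("ascii")
    return left + right


# A utf-8-encoded Provides name, as the rpm Python API returned it.
name_utf8 = "font(эвристика)".encode("utf-8")

# Building XML from utf-8 bytes alone works.
xml = py2_style_concat(b'<entry name="', name_utf8)

# Appending a single text (unicode) value forces the whole buffer
# through the ascii codec, which fails on the non-ASCII bytes.
try:
    py2_style_concat(xml, '" ver="1.0"/>')
    coercion_failed = False
except UnicodeDecodeError:
    coercion_failed = True

# Keeping every piece as utf-8 bytes avoids the implicit decode.
xml_ok = py2_style_concat(xml, '" ver="1.0"/>'.encode("utf-8"))
```

Only the mixed-type concatenation fails; once every value arrives as utf-8 bytes, which is what the patch arranges for sqlite results, the same concatenation goes through.
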
>
>>> ---
>>>    sqlitecachec.py |    1 +
>>>    1 files changed, 1 insertions(+), 0 deletions(-)
>>> 
>>> diff --git a/sqlitecachec.py b/sqlitecachec.py
>>> index 7ed5056..8b0ca08 100644
>>> --- a/sqlitecachec.py
>>> +++ b/sqlitecachec.py
>>> @@ -29,6 +29,7 @@ class RepodataParserSqlite:
>>>            if not filename:
>>>                return None
>>>            con = sqlite.connect(filename)
>>> +        con.text_factory = str
>>>            if sqlite.version_info[0] > 1:
>>>                con.row_factory = sqlite.Row
>>>            cur = con.cursor()
>
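
For readers trying this outside yum, the effect of the one-line change can be reproduced with the standard `sqlite3` module. This is a minimal sketch under stated assumptions: it targets Python 3, where `str` is already Unicode, so `bytes` stands in for the Python 2 `str` used in the patch; the in-memory database and the `provides` table are hypothetical and not yum's actual schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE provides (name TEXT)")  # hypothetical table
con.execute("INSERT INTO provides VALUES (?)", ("font(эвристика)",))

# Default behaviour: TEXT columns come back as Unicode text.
default_value = con.execute("SELECT name FROM provides").fetchone()[0]

# Analogue of the patch: ask sqlite for the raw utf-8-encoded bytes.
con.text_factory = bytes
raw_value = con.execute("SELECT name FROM provides").fetchone()[0]

print(type(default_value).__name__, type(raw_value).__name__)
con.close()
```

The setting applies per connection, so it only affects queries issued through the connection it was set on.
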

James and I talked about this on irc for a bit. Seems like you've done the 
due diligence and testing, so if it breaks the world in rawhide we'll 
find out quickly enough and blame it on you. :)

So there's no good reason to keep it out.

Thanks,
-sv

