[Yum-devel] [PATCH] configure sqlite to return utf-8-encoded strs instead of unicode objects
Seth Vidal
skvidal at fedoraproject.org
Tue Aug 18 15:19:03 UTC 2009
On Fri, 7 Aug 2009, Mike Bonnet wrote:
> On 08/06/2009 05:25 PM, James Antill wrote:
>> On Thu, 2009-08-06 at 13:50 -0400, Mike Bonnet wrote:
>>> sqlite by default returns all text as unicode objects, and this causes a
>>> number of problems when merging repos which contain utf-8 characters in
>>> Provides or Requires (as the current F11/F12 repos do). For a testcase,
>>> try merging 2 F12 repos, and you should see it fail with a
>>> UnicodeDecodeError in packages.py:_dump_pco(). This patch instructs
>>> sqlite to return all text as utf-8-encoded strs, which avoids these
>>> encoding issues.
>>
>> Ugh. While it would be nice to move everything to utf8, the timing here
>> seems bad ... in that it seems too close to F12/RHEL-6 to do this kind of
>> change without a good idea it isn't going to break anything else.
>
> Yeah, I understand the timing might be bad, but I'm honestly surprised
> nothing other than mergerepo is breaking. I guess that's the only place that
> a lot of string concatenation happens. But we now have packages that include
> utf8 in their Provides, so I'd expect this problem to get worse, not better.
>
> http://koji.fedoraproject.org/koji/rpminfo?rpmID=1118421
>
> Look at the bottom of the Provides list. This is generated by the output of
> fc-query on the font files, and apparently utf8 is valid there.
>
> The rpm python api is also returning all strings as utf-8-encoded strs, not
> unicode, so this should increase consistency and remove a lot of cases where
> we're doing unnecessary unicode->str conversions.
>
>> So what testing did you do to see what else it broke?
>> For instance, does "yum search ®" still work?
>> Did you test with any weird LANG= values?
>> I'm pretty sure you didn't test with ./test/yum-release-i18n-test.sh,
>> but if you did that'd be cool.
>
> I just ran ./test/yum-release-i18n-test.sh. It generated a ton of output and
> seemed to complete successfully (return value of 0). Anything specific I
> should be looking for?
>
> yum search ® works with and without the patch, and returns the same list of
> packages.
>
> yum search 'font(эвристика)' works with and without the patch, though it
> doesn't find anything in either case.
>
> Without the patch:
>
> # yum install 'font(эвристика)'
> Loaded plugins: fastestmirror, presto, refresh-packagekit
> Loading mirror speeds from cached hostfile
> * fedora: download.bos.redhat.com
> * updates: download.bos.redhat.com
> Setting up Install Process
> Error: You must not use 8-bit bytestrings unless you use a text_factory that
> can interpret 8-bit bytestrings (like text_factory = str). It is highly
> recommended that you instead just switch your application to Unicode strings.
>
> # repoquery --whatprovides 'font(эвристика)'
> <big long traceback truncated>
> File "/usr/lib/python2.6/site-packages/yum/sqlitesack.py", line 1209, in
> returnPackages
> pkgobjlist = self._buildPkgObjList(repoid, patterns, ignore_case)
> File "/usr/lib/python2.6/site-packages/yum/sqlitesack.py", line 53, in
> newFunc
> raise Errors.RepoError, str(e)
> yum.Errors.RepoError: You must not use 8-bit bytestrings unless you use a
> text_factory that can interpret 8-bit bytestrings (like text_factory = str).
> It is highly recommended that you instead just switch your application to
> Unicode strings.
>
>
> With the patch:
>
> # yum install 'font(эвристика)'
> Loaded plugins: fastestmirror, presto, refresh-packagekit
> Loading mirror speeds from cached hostfile
> * fedora: download.bos.redhat.com
> * updates: download.bos.redhat.com
> Setting up Install Process
> Resolving Dependencies
> --> Running transaction check
> ---> Package apanov-heuristica-fonts.noarch 0:20090125-5.fc11 set to be
> updated
> --> Finished Dependency Resolution
>
> Dependencies Resolved
>
> =================================================================================================================================================
> Package Arch
> Version Repository Size
> =================================================================================================================================================
> Installing:
> apanov-heuristica-fonts noarch
> 20090125-5.fc11 fedora 185 k
>
> Transaction Summary
> =================================================================================================================================================
> Install 1 Package(s)
> Update 0 Package(s)
> Remove 0 Package(s)
>
> Total download size: 185 k
> Is this ok [y/N]:
>
> # repoquery --whatprovides 'font(эвристика)'
> /usr/lib/python2.6/site-packages/yum/packages.py:397: UnicodeWarning: Unicode
> equal comparison failed to convert both arguments to Unicode - interpreting
> them as being unequal
> if reqn == n:
>
> (No results, so still failure, but no traceback. Probably worth looking into
> where the unicode is coming from in this case.)
>
> So I see no regressions in functionality, and increased functionality in at
> least one case. All of these tests were run on F11. It's a one-line patch,
> so it's extremely easy to test locally. Anything else you want me to try?
>
>> How much did you look for more local change that would make mergerepo
>> happy?
>
> The Koji mergerepos script (whose failures prompted all this) already has a
> number of local hacks to deal with unicode values for the pkgId, name,
> version, etc. But with even Provides and Requires coming back as unicode, I
> would have to essentially iterate over every PRCO value for every package and
> convert each element of each of those tuples from unicode to str. If *any*
> element we try to concatenate onto the xml is unicode, then the entire xml
> gets coerced to unicode via the ascii codec, and the previously-concatenated
> utf-8 strs fail decoding. Dealing with everything as utf-8 seems like the
> only reasonable way to deal with this. An alternative would be to construct
> the xml using a DOM library, which could hopefully avoid all the string
> concatenation and deal sensibly with unicode values, or stream the data out
> to disk directly instead of building up a huge in-memory string. But either
> of those options seemed much more invasive.
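[Illustration of the coercion described above. This is only a sketch, written for Python 3, where mixing bytes and text raises TypeError instead of decoding implicitly. The helper py2_style_concat is hypothetical and just imitates what Python 2 did for "str + unicode" (decode the str through the ascii codec); the XML fragments are illustrative, not taken from mergerepo.]

```python
def py2_style_concat(left, right):
    """Imitate Python 2 'str + unicode': if either side is text, the
    bytes side is first decoded with the ascii codec (hypothetical helper)."""
    if isinstance(left, bytes) and isinstance(right, bytes):
        return left + right  # both utf-8 encoded: plain byte concatenation
    parts = [p.decode("ascii") if isinstance(p, bytes) else p
             for p in (left, right)]
    return parts[0] + parts[1]


def try_concat(left, right):
    """Return the concatenation, or the UnicodeDecodeError it raised."""
    try:
        return py2_style_concat(left, right)
    except UnicodeDecodeError as exc:
        return exc


# A Provides entry containing non-ASCII characters, stored as utf-8 bytes.
utf8_entry = '<rpm:entry name="font(эвристика)"/>'.encode("utf-8")
# A pure-ASCII entry that happens to arrive as a text (unicode) object.
text_entry = '<rpm:entry name="foo"/>'

# Everything is utf-8 bytes: concatenation just works.
all_bytes = try_concat(b"<rpm:provides>", utf8_entry)

# One text element is enough: the accumulated bytes get decoded as ascii,
# and the earlier non-ASCII bytes make that decode fail.
mixed = try_concat(all_bytes, text_entry)

print(type(all_bytes).__name__)
print(type(mixed).__name__)
```

Note that the text element that triggers the failure is itself pure ASCII; the error comes from the previously concatenated utf-8 bytes, which is why a single unicode value anywhere in the PRCO data is enough to break the merge.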
>
>>> ---
>>> sqlitecachec.py | 1 +
>>> 1 files changed, 1 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/sqlitecachec.py b/sqlitecachec.py
>>> index 7ed5056..8b0ca08 100644
>>> --- a/sqlitecachec.py
>>> +++ b/sqlitecachec.py
>>> @@ -29,6 +29,7 @@ class RepodataParserSqlite:
>>> if not filename:
>>> return None
>>> con = sqlite.connect(filename)
>>> + con.text_factory = str
>>> if sqlite.version_info[0]> 1:
>>> con.row_factory = sqlite.Row
>>> cur = con.cursor()
>
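[For readers trying the effect of the one-line change outside yum: a minimal sketch using the standard-library sqlite3 module on Python 3. Assumptions: on Python 3 the equivalent of the patch's "text_factory = str" is "text_factory = bytes", since str there is already the unicode type; the in-memory database and the "provides" table with a single "name" column are hypothetical stand-ins, not the real repodata schema.]

```python
import sqlite3

# Hypothetical stand-in for a repodata sqlite file.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE provides (name TEXT)")
con.execute("INSERT INTO provides VALUES (?)", ("font(эвристика)",))

# Default behaviour: TEXT columns come back as unicode text objects.
default_value = con.execute("SELECT name FROM provides").fetchone()[0]

# With the factory switched, the same column comes back as the raw
# utf-8 encoded byte string stored in the database.
con.text_factory = bytes
raw_value = con.execute("SELECT name FROM provides").fetchone()[0]

print(type(default_value).__name__, type(raw_value).__name__)
con.close()
```

The factory is a per-connection setting, so it only affects queries made through the connection it is set on.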
James and I talked about this on irc for a bit. Seems like you've done the
due diligence and testing, so if it breaks the world in rawhide we'll
find out quickly enough and blame it on you. :)
So there's no good reason to keep it out.
Thanks,
-sv