[Rpm-metadata] createrepo/utils.py

Toshio Kuratomi a.badger at gmail.com
Wed Apr 16 20:23:36 UTC 2008


James Antill wrote:
> On Wed, 2008-04-16 at 12:16 -0400, Luke Macken wrote:
> 
>> Ok, so it looks like we're losing here.
>>
>> This utf8String method seems to be a bit misleading, and full of pain.  I assume we
>> want to give it a utf-8 encoded string, and get back a unicode object, right?
> 
>  See my later patch, that is probably less mis-leading?
> 
>  In the caller, that we are having problems with, we want to give it a
> str() from RPM (which may or may not be utf8) and get a valid utf8 str()
> object back _that is also valid inside an XML document_ (excepting
> random < > & bytes, which get converted).
> 
I took a look, and libxml2 has a bug: when creating an XML document, it 
should remove control characters, since they are not valid XML.  Does 
Daniel Veillard know about this?  (I know you talked to him about reading, 
which he's right about, but I don't know about writing.)

Doing this in utf8String() instead of libxml2 is certainly a valid 
workaround.
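
As a rough sketch of what that workaround amounts to (written for 
Python 3, where str is the unicode type; the helper name 
strip_invalid_xml_chars is made up for illustration and is not what 
createrepo or the attached patch uses), the idea is to drop every 
character that falls outside the XML 1.0 "Char" production before the 
text reaches libxml2:

```python
import re

# Characters that XML 1.0 does not allow in a document: the C0 control
# codes other than tab (0x09), LF (0x0A) and CR (0x0D), the surrogate
# range, and the noncharacters U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that are not valid XML 1.0 removed."""
    return _INVALID_XML_CHARS.sub("", text)


cleaned = strip_invalid_xml_chars("summary\x01 with\ttab")
```

Tab, LF and CR are left alone because XML allows them; escaping of 
< > & is a separate step and is not attempted here.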

>  The big problem here being that a bunch of the "small bytes" like 0x01
> are valid utf8 but aren't valid XML data. Hence the patches.
> 
>  After looking again, it's now obvious that we still screw up if we pass
> a unicode() object in that has 0x01 bytes in it ... so we should
> probably fix that too (although I'm not sure if that's possible).
> 
Yeah.  Forgive me for saying this, but utf8String() is a bit crazy :-)

I'm attaching a saner version.

One note: this function should be split in two.  When you read from the 
rpm information, you should pass it to a unicodeString() function that 
performs the conversion into a unicode string.  When you output the 
value to libxml2, or to any other place that doesn't understand Python 
unicode strings, you pass that unicode string to the much smaller 
utf8String(), which strips the control codes and encodes it as a utf-8 
byte string.  That way everything you operate on within your program is 
a unicode string.  You only encode it to a byte representation when you 
output.
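
To make the split concrete, here is a minimal sketch.  It is not the 
attached patch: it assumes Python 3 (bytes in, str inside the program, 
bytes out), the fallback encoding list is only a guess at what one 
might use, and the bodies are illustrative even though the function 
names follow the ones discussed above.

```python
import re

# C0 control codes that are not allowed in XML 1.0 (tab, LF, CR are fine).
_CONTROL_CODES = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def unicodeString(raw, encodings=("utf-8", "iso-8859-15")):
    """Input side: turn bytes read from RPM into a unicode string.

    The encodings tuple is an assumption for this sketch.
    """
    if isinstance(raw, str):
        return raw  # already unicode, nothing to decode
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what decodes and mark the rest with U+FFFD.
    return raw.decode("utf-8", errors="replace")


def utf8String(text):
    """Output side: strip control codes, then encode to utf-8 bytes."""
    return _CONTROL_CODES.sub("", text).encode("utf-8")


# Decode once on the way in, work with str, encode once on the way out.
summary = unicodeString(b"caf\xe9\x01")
xml_ready = utf8String(summary)
```

Because utf8String() only ever sees unicode strings, the case of a 
unicode object carrying a 0x01 character is handled by the same 
stripping step as everything else.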

P.S. Mike Bonnet notes that iso-8859-15 covers every single-byte value, 
so no encoding listed after it will ever be tried.  I didn't change 
that as I don't know how you want to address that.
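
A quick way to see the effect (Python 3 sketch; the helper name and the 
encoding list in the example are hypothetical, not the list 
createrepo actually uses): Python's iso-8859-15 codec decodes all 256 
byte values, so a try-each-encoding loop never gets past it.

```python
def first_successful_encoding(raw, encodings):
    """Return the first encoding in the list that decodes raw, or None."""
    for encoding in encodings:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Every possible byte value decodes under iso-8859-15 without error.
every_byte = bytes(range(256))
decoded = every_byte.decode("iso-8859-15")

# Bytes that are not valid utf-8 fall through to iso-8859-15, which
# always succeeds, so the (hypothetical) third entry is never reached.
chosen = first_successful_encoding(
    b"\xff\xfe\x81", ["utf-8", "iso-8859-15", "iso-8859-1"]
)
```

So any encoding that needs a real chance of matching has to come 
before iso-8859-15 in the list.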

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: createrepo-utf8string-fix.patch
Type: text/x-patch
Size: 2560 bytes
Desc: not available
Url : http://lists.baseurl.org/pipermail/rpm-metadata/attachments/20080416/453aa61a/attachment.bin 