[Rpm-metadata] createrepo/utils.py
Luke Macken
lmacken at redhat.com
Wed Apr 16 16:16:13 UTC 2008
On Wed, Apr 16, 2008 at 10:34:21AM -0400, James Antill wrote:
> createrepo/utils.py | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> New commits:
> commit 0255b00e7246ec019a0e7c96fb04b1b0dbf6399f
> Author: James Antill <james at and.org>
> Date: Wed Apr 16 10:34:10 2008 -0400
>
> Just remove bad small bytes, like 0x01 atm.
>
> diff --git a/createrepo/utils.py b/createrepo/utils.py
> index ffd7f14..1af6b94 100644
> --- a/createrepo/utils.py
> +++ b/createrepo/utils.py
> @@ -79,9 +79,10 @@ def utf8String(string):
>          return ''
>      elif isinstance(string, unicode):
>          return string
> +    du = False
>      try:
>          x = unicode(string, 'ascii')
> -        return string
> +        du = True
>      except UnicodeError:
>          encodings = ['utf-8', 'iso-8859-1', 'iso-8859-15', 'iso-8859-2']
>          for enc in encodings:
> @@ -93,8 +94,12 @@ def utf8String(string):
>                  if x.encode(enc) == string:
>                      return x.encode('utf-8')
>      newstring = ''
> +    # Allow BS, HT, LF, VT, FF, CR
> +    bad_small_bytes = range(0, 8) + range(14, 32)
>      for char in string:
> -        if ord(char) > 127:
> +        if ord(char) in bad_small_bytes:
> +            newstring = newstring + '?'
> +        elif not du and ord(char) > 127:
>              newstring = newstring + '?'
>          else:
>              newstring = newstring + char
Ok, so it looks like we're losing here.
This utf8String method seems to be a bit misleading, and full of pain. I assume we
want to give it a utf-8 encoded string, and get back a unicode object, right?
If we can assume everything is already utf8 encoded, couldn't we just do
something like this (replacing undecodable bytes with U+FFFD, the Unicode
replacement character, which usually renders as a question mark):
def utf8String(string):
    """hands back a unicoded string"""
    if string is None:
        return u''
    elif isinstance(string, unicode):
        return string
    else:
        return unicode(string, 'utf-8', errors='replace')
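
As a rough illustration of what errors='replace' does in the version above,
here is a minimal sketch in Python 3 syntax, where the old unicode type is
str and byte strings are bytes. The name to_unicode is hypothetical and not
part of createrepo; it only mirrors the logic of the snippet above.

```python
def to_unicode(data):
    """Hypothetical Python 3 counterpart of the simple utf8String above."""
    if data is None:
        return ''
    if isinstance(data, str):
        # Already text, nothing to decode.
        return data
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return data.decode('utf-8', errors='replace')


# Valid UTF-8 decodes cleanly.
print(repr(to_unicode(b'caf\xc3\xa9')))
# A lone Latin-1 byte (0xe9) is not valid UTF-8 and is replaced by U+FFFD.
print(repr(to_unicode(b'caf\xe9')))
```

Note that the substitute character is U+FFFD, not an ASCII '?', so callers
that later encode the result to ASCII would still need to handle it.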
Or, if there is a reason to try falling back to ['iso-8859-1', 'iso-8859-15', 'iso-8859-2']
encodings, we could probably do something like this:
def utf8String(string):
    """hands back a unicoded string"""
    if string is None:
        return u''
    elif isinstance(string, unicode):
        return string
    try:
        x = unicode(string, 'utf-8')
    except UnicodeError:
        encodings = ['iso-8859-1', 'iso-8859-15', 'iso-8859-2']
        for enc in encodings:
            try:
                x = unicode(string, enc)
                break
            except UnicodeError:
                pass
        else:
            # no fallback encoding worked, so decode lossily instead
            x = unicode(string, 'utf-8', errors='replace')
    return x
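
One thing worth checking before settling on the fallback list is how far
down it decoding can actually get. The sketch below again uses Python 3
syntax, and decode_with_fallback is a hypothetical name used only for
illustration. It tries utf-8 first, then the same three encodings in the
same order, and only decodes lossily if all of them fail.

```python
# Same fallback order as in the message above.
FALLBACK_ENCODINGS = ['iso-8859-1', 'iso-8859-15', 'iso-8859-2']


def decode_with_fallback(data):
    """Hypothetical Python 3 counterpart of the fallback utf8String above."""
    if data is None:
        return ''
    if isinstance(data, str):
        return data
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        pass
    for enc in FALLBACK_ENCODINGS:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Reached only if every fallback encoding rejected the input.
    return data.decode('utf-8', errors='replace')


# iso-8859-1 maps every byte value 0-255 to a code point, so it never fails.
all_bytes = bytes(range(256))
print(len(all_bytes.decode('iso-8859-1')))

# 0xa4 is the currency sign in iso-8859-1 and the euro sign in iso-8859-15;
# because iso-8859-1 comes first, the iso-8859-15 reading is never used.
print(repr(decode_with_fallback(b'\xa4')))
```

Since iso-8859-1 accepts any byte sequence, the later entries and the lossy
last step are effectively unreachable with this ordering, which may or may
not be what we want.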
What do you guys think?
luke