[Yum-devel] [PATCH] clean up misc.to_xml(), make it faster, add tests. BZ 716235.

Toshio Kuratomi a.badger at gmail.com
Mon Nov 19 18:29:40 UTC 2012


On Mon, Nov 19, 2012 at 04:36:12AM -0500, Zdenek Pavlas wrote:
> >     $ python -c "import yum.misc; print yum.misc.to_xml('Skytt\xe4')"
> >     Skyttä
> > 
> > After this patch:
> > 
> >     $ python -c "import yum.misc; print yum.misc.to_xml('Skytt\xe4')"
> >     Skytt�
> > 
> > That'd be a regression in my opinion.
> 
> I see, and agree.
> 
Note that I would encourage you to mark this behaviour as deprecated and
schedule its removal at some defined point in the future.  People using
latin-1 can switch to utf-8 with extremely limited repercussions (compared
to, say, people who use big5 or shift-jis, who get hit with 1) more of their
characters being outside of the ascii subset, and 2) more extra bytes being
needed to represent those characters in utf-8 than are needed to represent
latin-1 characters in utf-8).

Falling back to latin-1 is also more confusing to users of other encodings
(for instance, the large number of people who use shift-jis and big5).  All
of them will get gibberish instead of a replacement character.
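
To make that concrete, here is a rough sketch of the two fallbacks applied to
shift-jis input.  It is written in Python 3 syntax rather than the Python 2
that yum uses, and the sample string and variable names are made up for
illustration:

```python
# Illustration only: bytes as a shift-jis user might hand them to us.
raw = "日本語".encode("shift_jis")

# Old behaviour: anything that is not valid UTF-8 becomes U+FFFD.
replaced = raw.decode("utf-8", "replace")

# Patched behaviour: every byte is reinterpreted as one latin-1 character.
# This never fails, so the caller cannot tell that the guess was wrong.
guessed = raw.decode("iso-8859-1")

print(repr(replaced))
print(repr(guessed))
```

The latin-1 decode always succeeds, which is exactly why the result is
silently wrong instead of visibly marked.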

We can never be 100% correct with this.  For instance:
"Driver for SKYTT\xc4\xae Brand Video Cards" (SKYTTÄ®) would be interpreted
as UTF-8 and thus rendered as gibberish.  I do note that these cases are
pretty rare.  They would need to have a character whose byte is in the range
0xC0-0xDF followed by characters whose bytes are in the range 0xA0-0xBF.  If
you look at a latin-1 character chart, you can see that sequences of the
latin-1 characters that map to these bytes aren't impossible but they are
very rare:
http://en.wikipedia.org/wiki/Latin1#Codepage_layout
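
A small sketch of the ambiguity, again in Python 3 syntax.  guess_decode() is
a hypothetical stand-in for the try-utf-8-then-latin-1 logic in the patch, not
the actual yum function:

```python
def guess_decode(raw):
    """Hypothetical helper: try UTF-8 first, fall back to latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

# A lone 0xE4 is not valid UTF-8, so the latin-1 fallback kicks in.
common_case = guess_decode(b"Skytt\xe4")      # 'Skyttä'

# 0xC4 0xAE is latin-1 for "Ä®", but it is also a valid two-byte UTF-8
# sequence (U+012E), so the fallback is never reached.
ambiguous = b"SKYTT\xc4\xae"
as_guessed = guess_decode(ambiguous)          # 'SKYTT\u012e'
as_intended = ambiguous.decode("iso-8859-1")  # 'SKYTTÄ®'
```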

Also note that upstream python has been unsympathetic to arguments about
latin-1 locales.  (They're unsympathetic to non-utf-8 locales in general, but
the space savings and widespread use of shift-jis and big5 make them
slightly more sympathetic to issues there than to latin-1.)

>          # check if valid utf8
>          try: unicode(item, 'utf-8')
>          except UnicodeDecodeError:
> -            # replace invalid bytes with \ufffd
> -            item = unicode(item, 'utf-8', 'replace').encode('utf-8')
> +            # assume iso-8859-1
> +            item = unicode(item, 'iso-8859-1').encode('utf-8')
>      elif type(item) is unicode:
>          item = item.encode('utf-8')
>      elif item is None:

ACK.

Also a bit of bad news... I re-read
http://en.wikipedia.org/wiki/XML#Valid_characters while researching this.
It would appear that we should also be removing the C1 control codes from
the output, not just the C0 control codes.

Unfortunately, the C1 control codes fall outside of the ascii subset.  And
that means that we can't use str.translate to remove them.  In kitchen,
where I'm already taking the hit of transforming to unicode and using
unicode.translate(), I can extend it to delete these bytes just by
modifying the translation table but if we want to avoid that with the yum
code....  Options:

* Don't do anything to the C1 control codes: The current yum code does not
  handle C1 control codes and we haven't seen any problems yet.  This
  probably means that the C1 codes are not being used in the wild.  We might
  also want to try passing some C1 control codes into libxml2 and seeing
  what happens -- perhaps libxml2 only barfs on C0 codes.
* Convert to unicode and use unicode.translate() -- more correct and we
  still get a significant speedup over the original code.
  * A variant of this: it would be possible to translate the control codes
    to their escaped equivalents, since XML-1.1 specifies this as valid.
    Probably not needed for yum, though.
* Write our own loop that keeps track of multibyte sequences to decide if
  the sequence is a control code -- this is the kind of code we're trying to
  remove so I'd be highly against this.
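
For the second option, a minimal sketch of what the unicode translate
approach could look like.  This is Python 3 syntax (where str.translate
plays the role of unicode.translate); the function and table names are made
up, and the exact set of code points to delete is my assumption, not
something taken from the yum or kitchen code:

```python
# Assumed set: C0 controls except tab, LF and CR, plus all C1 controls.
_CONTROL_CODES = [c for c in range(0x00, 0x20) if c not in (0x09, 0x0A, 0x0D)]
_CONTROL_CODES += range(0x80, 0xA0)

# Mapping a code point to None tells translate() to delete it.
_DELETE_TABLE = dict.fromkeys(_CONTROL_CODES)

def strip_control_codes(text):
    """Hypothetical helper: drop control codes from already-decoded text."""
    return text.translate(_DELETE_TABLE)

# Why a byte-level table is not enough: in UTF-8 a C1 code is two bytes,
# and its second byte also appears inside ordinary characters.
c1_nel = "\x85".encode("utf-8")     # b'\xc2\x85'
a_ring = "Å".encode("utf-8")        # b'\xc3\x85'

cleaned = strip_control_codes("a\x00b\x85c\td\n")
```

Deleting byte 0x85 from the encoded string would corrupt "Å", while working on
code points leaves it alone.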

-Toshio