[Yum-devel] [PATCH] Fix main speed issue in to_xml(), slows down new createrepo a lot. BZ 716235.

Toshio Kuratomi a.badger at gmail.com
Thu Nov 15 19:02:51 UTC 2012


This function has always made me cringe.  It's riddled with false
assumptions.  Attempting to fix a small part of it usually interacts badly
with assumptions made elsewhere in the code.  Here are some problems with the
function even after adding both James's and Zdenek's changes:

* If the string is valid utf-8, this function returns unicode.  If not, it
  returns a utf-8 encoded byte str (mitigated by calling to_utf8 in the
  calling function, to_xml).
* Everything after the except UnicodeError: block is dead code because:
  - any sequence of bytes will decode with 'iso-8859-1'.  This makes the
    for enc in encodings loop superfluous.  We'll only use the first
    encoding in that list.
  - Similarly, try: x = unicode(item, enc) doesn't need to be a try block
    because this will never fail.
  - Since that try: never fails, you always enter the else: block.
    + if x.encode(enc) == item seems like it will always be True here.
    (It might not be true with utf-8, because utf-8 can have different
    sequences of codepoints that represent the same glyph.  For
    instance, an é can be a single codepoint or composed from a
    combining accent and the "e" character.  But we've already taken
    care of utf-8.)
  - No longer need the if enc != 'utf-8' since enc can no longer be utf-8
  - We then hit the return statement so we never get to anything beyond
    this point
* The return x.encode('utf-8') also needs to do translation as well.  This
  is probably right: return x.translate(_bad_small_bytes).encode('utf-8')
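
A quick Python 3 sketch of two of those points (the original is Python 2,
where unicode(item, enc) corresponds to item.decode(enc)):

```python
import unicodedata

# 1. Every possible byte value decodes under iso-8859-1, so the loop
#    over fallback encodings can never advance past its first entry.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('iso-8859-1')  # never raises UnicodeError

# 2. utf-8 round-trips need not be byte-identical: a composed and a
#    decomposed "é" look the same but encode differently.
composed = '\u00e9'
decomposed = unicodedata.normalize('NFD', composed)  # 'e' + combining accent
```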

However, before you go making changes to address these, I took a look at how
I reimplemented this in kitchen.  The code there is more flexible than the
one in yum.misc but even with that flexibility it's more readable than
what's in _ugly_utf8_string_hack().  It's much faster than the present
code, but in testing it's still about 20 seconds slower than the patched
yum.misc code (for f17 createrepo [*No* --update] on 3k packages).  I
moved generation of the translation table (similar to your
_bad_small_bytes variable) outside of the function and optimized
kitchen's strategy for replacing control characters to favor strings
that do not contain control characters, which reduced it to just
slightly slower (there was more variance between runs of the same code
than there was between the different versions of the code).
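
For concreteness, here's a rough Python 3 sketch of that strategy (the
names are mine, not kitchen's): build the translation table once at
module level, and only pay for translate() when a control character is
actually present.

```python
# Control codes to strip, mirroring the patch's set (illustrative).
_CONTROL_CODES = frozenset(list(range(0, 8)) + [11, 12] + list(range(14, 32)))
_CONTROL_CHARS = frozenset(map(chr, _CONTROL_CODES))
_REPLACE_TABLE = {code: '?' for code in _CONTROL_CODES}

def strip_control_chars(text):
    # Cheap set-disjointness test first: most strings carry no control
    # characters, so the translate() call is usually skipped entirely.
    if _CONTROL_CHARS.isdisjoint(text):
        return text
    return text.translate(_REPLACE_TABLE)
```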

I then removed the additional flexibility that the code had and pushed it
into the yum.misc file as a replacement for _ugly_utf8_string_hack() and
to_xml().  Benchmarks showed it marginally faster than the code with the
patches suggested here (once again, with greater variance between the
test runs than between the different versions of the code).

The kitchen code handles strings that aren't valid utf-8 slightly
differently than the code with the proposed changes.  The proposed changes
will take invalid utf-8, decode it using latin-1, and then reencode it using
utf-8 [Which will mangle the characters if they aren't latin-1].  The
kitchen-derived code will replace invalid utf-8 bytes with the unicode
unknown character: "�".  This seems okay since the original code attempted
to do something very similar::

    elif not du and ord(char) > 127:
        newitem = newitem + '?' # byte by byte equiv of escape
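
The behavioral difference shows up in a small Python 3 sketch (the byte
string is a made-up example of invalid utf-8):

```python
raw = b'caf\xc3e'  # invalid utf-8: lone lead byte 0xc3 with no continuation

# Proposed patches: fall back to latin-1 (then re-encode as utf-8).
via_latin1 = raw.decode('latin-1')                   # -> 'caf\xc3e' ("cafÃe")

# Kitchen-derived code: substitute the unicode replacement character.
via_replace = raw.decode('utf-8', errors='replace')  # -> 'caf\ufffde'
```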


I'll attach the patch for the kitchen-derived code and you can look for
flaws.

-Toshio
-------------- next part --------------
diff --git a/yum/misc.py b/yum/misc.py
index a0bac7b..183f296 100644
--- a/yum/misc.py
+++ b/yum/misc.py
@@ -23,6 +23,8 @@ import bz2
 import gzip
 import shutil
 import urllib
+import itertools
+
 _available_compression = ['gz', 'bz2']
 try:
     import lzma
@@ -897,52 +899,11 @@ def seq_max_split(seq, max_entries):
     ret.append(seq[beg:])
     return ret
 
-def _ugly_utf8_string_hack(item):
-    """hands back a unicoded string"""
-    # this is backward compat for handling non-utf8 filenames 
-    # and content inside packages. :(
-    # content that xml can cope with but isn't really kosher
-
-    # if we're anything obvious - do them first
-    if item is None:
-        return ''
-    elif isinstance(item, unicode):    
-        return item
-    
-    # this handles any bogon formats we see
-    du = False
-    try:
-        x = unicode(item, 'ascii')
-        du = True
-    except UnicodeError:
-        encodings = ['utf-8', 'iso-8859-1', 'iso-8859-15', 'iso-8859-2']
-        for enc in encodings:
-            try:
-                x = unicode(item, enc)
-            except UnicodeError:
-                pass
-                
-            else:
-                if x.encode(enc) == item:
-                    if enc != 'utf-8':
-                        print '\n%s encoding on %s\n' % (enc, item)
-                    return x.encode('utf-8')
-    
-    
-    # Kill bytes (or libxml will die) not in the small byte portion of:
-    #  http://www.w3.org/TR/REC-xml/#NT-Char
-    # we allow high bytes, if it passed the utf8 check above. Eg.
-    # good chars = #x9 | #xA | #xD | [#x20-...]
-    newitem = ''
-    bad_small_bytes = range(0, 8) + [11, 12] + range(14, 32)
-    for char in item:
-        if ord(char) in bad_small_bytes:
-            pass # Just ignore these bytes...
-        elif not du and ord(char) > 127:
-            newitem = newitem + '?' # byte by byte equiv of escape
-        else:
-            newitem = newitem + char
-    return newitem
+
+# ASCII control codes that are illegal in xml 1.0
+_CONTROL_CODES = frozenset(range(0, 8) + [11, 12] + range(14, 32))
+_CONTROL_CHARS = frozenset(itertools.imap(unichr, _CONTROL_CODES))
+_CONTROL_REPLACE_TABLE = dict(zip(_CONTROL_CODES, [u'?'] * len(_CONTROL_CODES)))
 
 __cached_saxutils = None
 def to_xml(item, attrib=False):
@@ -951,13 +912,24 @@ def to_xml(item, attrib=False):
         import xml.sax.saxutils
         __cached_saxutils = xml.sax.saxutils
 
-    item = _ugly_utf8_string_hack(item)
-    item = to_utf8(item)
-    item = item.rstrip()
+    item = to_unicode(item, encoding='utf-8', errors='replace')
+    data = frozenset(item)
+    # Most strings do not have control codes so test before modifying
+    # is a performance win
+    if not _CONTROL_CHARS.isdisjoint(data):
+        item = item.translate(_CONTROL_REPLACE_TABLE)
+
+    # Escape characters that have special meaning in xml
     if attrib:
         item = __cached_saxutils.escape(item, entities={'"':"&quot;"})
     else:
         item = __cached_saxutils.escape(item)
+
+    # We shouldn't need xmlcharrefreplace when encoding to utf-8 (as utf-8 can
+    # represent all unicode codepoints) but use it in case we ever change the
+    # encoding
+    item = item.encode('utf-8', 'xmlcharrefreplace')
+
     return item
 
 def unlink_f(filename):
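
For readers skimming the diff, a compressed Python 3 rendition of the
new to_xml() path might look like the following (the actual patch is
Python 2 and imports saxutils lazily; this is just a sketch of the
flow):

```python
import xml.sax.saxutils

# Same control-code set as the patch builds at module level.
_CONTROL_CODES = frozenset(list(range(0, 8)) + [11, 12] + list(range(14, 32)))
_CONTROL_CHARS = frozenset(map(chr, _CONTROL_CODES))
_REPLACE_TABLE = {code: '?' for code in _CONTROL_CODES}

def to_xml_sketch(item, attrib=False):
    if isinstance(item, bytes):
        # Invalid utf-8 becomes U+FFFD, as in the kitchen-derived code.
        item = item.decode('utf-8', 'replace')
    # Only translate when a control character is actually present.
    if not _CONTROL_CHARS.isdisjoint(item):
        item = item.translate(_REPLACE_TABLE)
    # Escape characters with special meaning in xml.
    if attrib:
        item = xml.sax.saxutils.escape(item, entities={'"': '&quot;'})
    else:
        item = xml.sax.saxutils.escape(item)
    # utf-8 covers all codepoints; xmlcharrefreplace is just insurance.
    return item.encode('utf-8', 'xmlcharrefreplace')
```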