[Yum-devel] [PATCH] clean up misc.to_xml(), make it faster, add tests. BZ 716235.

Fri Nov 16 17:05:10 UTC 2012

Logic looks good.

James should test that this doesn't cause the performance regression that he
mentioned yesterday.  Since it doesn't use saxutils.escape anymore, it
shouldn't but we never tracked down why it was doing that so it's better to
check.

I have some style comments.  I'll mention them inline

On Fri, Nov 16, 2012 at 02:47:12PM +0100, Zdeněk Pavlas wrote:
> diff --git a/yum/misc.py b/yum/misc.py
> index a0bac7b..9f403db 100644
> --- a/yum/misc.py
> +++ b/yum/misc.py
> +_deletechars = ''.join(chr(i) for i in range(32) if i not in (9, 10, 13))
>  
>  def to_xml(item, attrib=False):
> +    if type(item) is str:

type checking is better done with isinstance
    if isinstance(item, str):

> +        # check if valid utf8
> +        try: unicode(item, 'utf8')

I benchmarked 'utf8', 'UTF8', 'utf-8', 'UTF-8', and unicode() vs
str.decode()  at one point (with both python2.6 and python-2.7)
at one point.  unicode('item', 'utf-8') was a runaway winner.  There was
a thread about it on upstream's python-dev list at one point.  I believe the
reason was that 'utf-8' had been hardcoded in somewhere but the others had
to do a lookup in the codecs module when they were used.

> +        except UnicodeDecodeError:
> +            item = unicode(item, 'utf8', 'replace').encode('utf8')
> +    elif type(item) is unicode:
> +        item = item.encode('utf8')
> +    elif item is None:
> +        return ''
>  
Should probably have a final else: clause here.  It can be as simple as
throwing an TypeError exception or complex and retrieve the string
representation of item (if you do more complex: the output should then go
through the above steps if: else: steps to make sure it is valid utf-8).

> +    # compat cruft...
>      item = item.rstrip()
> +
> +    # kill ivalid low bytes
> +    item = item.translate(None, _deletechars)
> +
> +    # quote reserved XML characters
> +    item = item.replace("&", "&")
> +    item = item.replace("<", "<")
> +    item = item.replace(">", ">")
>      if attrib:
> +        item = item.replace('"', '"')
> +
>      return item

In certain circumstances (if you don't know whether the string will be used
in an attribute that uses single or double quotes) you'll need to do this
replacement as well:
   item = item.replace("'", ''')

If you had a unicode string here, you could do all of the replacing and
killing of bad bytes with a single call to item.translate() [because unicode
string's translate takes a dict].  You'd need to benchmark with some actual
data to see whether the function call overhead of multiple item.replace()
calls or the cost of encoding everything [note: the cost of decoding strings
is already paid for as that was part of the test at the top for being valid
utf-8] is more expensive.

(Function calls are surprisingly expensive in python so I wouldn't be
surprised either way).

-Toshio
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://lists.baseurl.org/pipermail/yum-devel/attachments/20121116/06b86eaf/attachment.asc>