[Yum-devel] [Kitchen-devel] str_eq within yum (and kitchen)

Wed Aug 25 05:10:22 UTC 2010

On Tue, Aug 24, 2010 at 08:14:11PM -0400, David Malcolm wrote:
> On Tue, 2010-08-24 at 18:28 -0400, David Malcolm wrote:
> > Profiling of a Fedora 14 "yum install *" showed it spending 80 seconds
> > out of 274 within its implementation of "str_eq" (24 million calls), so
> > optimizing this seems worthwhile.
> 
> (replying to self, sorry)
> 
> I added some instrumentation to my copy of yum/i18n.py:
> 
> str_eq_calls = {}
> def str_eq(a, b):
>     """ convert between unicode and not and compare them, w/o warning or
> being annoying"""
>     typesig = (type(a), type(b))
>     if typesig in str_eq_calls:
>         str_eq_calls[typesig] += 1
>     else:
>         str_eq_calls[typesig] = 1
>     (rest of function follows)
> 
> 
> to capture the type signature of all of these calls, adding:
>     from yum.i18n import str_eq_calls
>     from pprint import pprint
>     pprint(str_eq_calls)
> to yummain.py, to dump the stats on output.
> 
> The output indicates that in this case at least [1], all of the calls
> are for the str vs str case:
> 
>     {(<type 'str'>, <type 'str'>): 24274562}
> 
Hah!  So there's a few things that we should think about here:

* We're dealing almost exclusively with byte strings here.
* This function exists because we must occassionally get unicode strings
* I count exactly two places in the yum code where this function is called

There's multiple ways that we could approach this issue.  We could track
down where we're getting unicode strings from and make sure that it's
converted to (or remains a) byte string instead.

We could move the str_eq code back to the two places that it exists and put
it there.  (Function calls are expensive in python... 24 million function
calls can add up).

We could decide that it's okay to issue a UnicodeWarning here (not
a UnicodeError -- we'd need to keep a patch for RHEL5/python-2.4.x systems
to pass errors) and just use str1 == str2.  The two effects of this would
be:
  1) UnicodeWarning would be issued whenever data that is unicode gets
  compared to the data that is byte str.

2) When unicode and str is compared and the two are not ASCII subsets,
  the strings will compare False -- currently if the byte string is utf8 and
  the converted utf8 is equal to the unicode string, the two will compare
  true.

> So it seems to me to be worth optimizing for this case (24 million
> calls, taking 29% of user time on a warm cache).  I can try to cook
> something up in .c or Cython, if this would be acceptable to the
> yum/kitchen maintainers?
> 
We could.  But I've sped this up so that 24,274,562 calls on a trivial data
set is much quicker for most cases:

                            yum         kitchen
  no unicode      19.1686849594   9.54083299637
  eq high_m       75.4539899826   174.357154846
  eq high_b       25.5565230846   10.6922521591
  no bytes        26.7292191982   9.38552594185
  eq high_u       16.9556469917   11.5864810944
  no high_b       28.2855761051   12.2306518555
  no mixed        78.3247509003   12.7160179615
  no high_m       80.1753909588   191.129307985
  eq mixed         75.383769989    17.843116045
  no high_u       19.0863358974   11.2000300884
  no mixed 2      76.1719119549   12.2282829285
  eq unicode      16.7433609962    11.509526968
  eq bytes        26.5654168129   11.2505509853

The right hand columns are wall clock seconds for the vanilla yum code and
the currently checked in kitchen code.  The key to the left hand column is:

eq: The two strings evaluate equal
no: The two strings evaluate unequal
unicode: Both strings are unicode type
bytes: Both strings are byte type
mixed: One string is unicode, the other is bytes
high: The strings are outside of the ASCII subset
high_(u|b|m): strings outside the ASCII subset where the strings are
      (u)nicode, (bytes), or (m)ixed

As you can see, the two pathological cases for the new code are when dealing
with non-ascii characters mixed between unicode and bytes.  The common case
of <str> == <str> is between 2 and 3 times faster.

If you'd care to try the new code out on the real data, it's here:
http://bzr.fedorahosted.org/bzr/kitchen/devel/annotate/head%3A/kitchen/text/misc.py#L96

As another point of interest, here's the timings for running with:
def str_eq(str1, str2):
    try:
        return str1 == str2
    except UnicodeError:
        pass
    return False

This basically allows us to emulate the python-2.5+ behaviour of
``<unicode> == <str>`` in older versions of python.  As you can see, it
has problems with the same cases as the current code but those cases degrade
performance by a much smaller margin and overall performance is improved.
(the <str> <str> common case is 3 to 4 times faster)

                            yum         kitchen
  no unicode      19.1630501747   7.34791111946
  eq high_m       77.9920949936   116.689537048
  eq high_b       24.3305389881   6.65209698677
  no bytes        26.4734601974   6.85160303116
  eq high_u       17.3166980743   7.12775802612
  no high_b       28.1892650127   8.82573199272
  no mixed        75.0359661579   10.1168701649
  no high_m        77.378319025   119.901776791
  eq mixed        74.2815010548   10.0155761242
  no high_u       18.7383151054   7.27747106552
  no mixed 2       72.941822052   9.82383584976
  eq unicode      16.5604941845   7.07374501228
  eq bytes        24.1317751408   6.49520802498

* Note that this version prints the following warning to stderr:
/home/badger/kitchen/devel/kitchen/text/misc.py:118: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal return str1 == str2

I'll attach the script I'm using to generate these timings

-Toshio
-------------- next part --------------
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-

import sys
import timeit
from kitchen.text.misc import str_eq as kitchen
from yum.i18n import str_eq as yum

reps = 24274562
#reps = 10000000
datasets={'eq bytes': ('a', 'a'), 'eq unicode': (u'a', u'a'), 'eq mixed': ('a', u'a'),
        'no bytes': ('a', 'b'), 'no unicode': (u'a', u'b'), 'no mixed': ('a', u'b'), 'no mixed 2': (u'a', 'b'),
        'eq high_b': ('?', '?'), 'eq high_u': (u'?', u'?'), 'eq high_m': (u'?', '?'), 
        'no high_b': ('?', '?'), 'no high_u': (u'?', u'?'), 'no high_m': ('?', u'?'),
        }

def invoke_yum(foo1, foo2):
    for i in xrange(0, reps):
        if yum(foo1, foo2):
            pass

def invoke_kitchen(foo1, foo2):
    for i in xrange(0, reps):
        if kitchen(foo1, foo2):
            pass

if __name__ == '__main__':
        print '%-13s %15s %15s' % (' ', 'yum', 'kitchen')
        for i in datasets.keys():
            a = timeit.Timer('invoke_yum(datasets["%s"][0], datasets["%s"][1])' % (i, i), 'from __main__ import datasets, invoke_yum')
            b = timeit.Timer('invoke_kitchen(datasets["%s"][0], datasets["%s"][1])' % (i, i), 'from __main__ import datasets, invoke_kitchen')
            print '%-13s %15s %15s' % (i, a.timeit(1), b.timeit(1))
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://lists.baseurl.org/pipermail/yum-devel/attachments/20100825/f9769e0a/attachment.pgp>