[Yum-devel] [Kitchen-devel] str_eq within yum (and kitchen)
Toshio Kuratomi
a.badger at gmail.com
Wed Aug 25 05:10:22 UTC 2010
On Tue, Aug 24, 2010 at 08:14:11PM -0400, David Malcolm wrote:
> On Tue, 2010-08-24 at 18:28 -0400, David Malcolm wrote:
> > Profiling of a Fedora 14 "yum install *" showed it spending 80 seconds
> > out of 274 within its implementation of "str_eq" (24 million calls), so
> > optimizing this seems worthwhile.
>
> (replying to self, sorry)
>
> I added some instrumentation to my copy of yum/i18n.py:
>
> str_eq_calls = {}
> def str_eq(a, b):
> """ convert between unicode and not and compare them, w/o warning or
> being annoying"""
> typesig = (type(a), type(b))
> if typesig in str_eq_calls:
> str_eq_calls[typesig] += 1
> else:
> str_eq_calls[typesig] = 1
> (rest of function follows)
>
>
> to capture the type signature of all of these calls, adding:
> from yum.i18n import str_eq_calls
> from pprint import pprint
> pprint(str_eq_calls)
> to yummain.py, to dump the stats on output.
>
> The output indicates that in this case at least [1], all of the calls
> are for the str vs str case:
>
> {(<type 'str'>, <type 'str'>): 24274562}
>
Hah! So there's a few things that we should think about here:
* We're dealing almost exclusively with byte strings here.
* This function exists because we must occassionally get unicode strings
* I count exactly two places in the yum code where this function is called
There's multiple ways that we could approach this issue. We could track
down where we're getting unicode strings from and make sure that it's
converted to (or remains a) byte string instead.
We could move the str_eq code back to the two places that it exists and put
it there. (Function calls are expensive in python... 24 million function
calls can add up).
We could decide that it's okay to issue a UnicodeWarning here (not
a UnicodeError -- we'd need to keep a patch for RHEL5/python-2.4.x systems
to pass errors) and just use str1 == str2. The two effects of this would
be:
1) UnicodeWarning would be issued whenever data that is unicode gets
compared to the data that is byte str.
2) When unicode and str is compared and the two are not ASCII subsets,
the strings will compare False -- currently if the byte string is utf8 and
the converted utf8 is equal to the unicode string, the two will compare
true.
> So it seems to me to be worth optimizing for this case (24 million
> calls, taking 29% of user time on a warm cache). I can try to cook
> something up in .c or Cython, if this would be acceptable to the
> yum/kitchen maintainers?
>
We could. But I've sped this up so that 24,274,562 calls on a trivial data
set is much quicker for most cases:
yum kitchen
no unicode 19.1686849594 9.54083299637
eq high_m 75.4539899826 174.357154846
eq high_b 25.5565230846 10.6922521591
no bytes 26.7292191982 9.38552594185
eq high_u 16.9556469917 11.5864810944
no high_b 28.2855761051 12.2306518555
no mixed 78.3247509003 12.7160179615
no high_m 80.1753909588 191.129307985
eq mixed 75.383769989 17.843116045
no high_u 19.0863358974 11.2000300884
no mixed 2 76.1719119549 12.2282829285
eq unicode 16.7433609962 11.509526968
eq bytes 26.5654168129 11.2505509853
The right hand columns are wall clock seconds for the vanilla yum code and
the currently checked in kitchen code. The key to the left hand column is:
eq: The two strings evaluate equal
no: The two strings evaluate unequal
unicode: Both strings are unicode type
bytes: Both strings are byte type
mixed: One string is unicode, the other is bytes
high: The strings are outside of the ASCII subset
high_(u|b|m): strings outside the ASCII subset where the strings are
(u)nicode, (bytes), or (m)ixed
As you can see, the two pathological cases for the new code are when dealing
with non-ascii characters mixed between unicode and bytes. The common case
of <str> == <str> is between 2 and 3 times faster.
If you'd care to try the new code out on the real data, it's here:
http://bzr.fedorahosted.org/bzr/kitchen/devel/annotate/head%3A/kitchen/text/misc.py#L96
As another point of interest, here's the timings for running with:
def str_eq(str1, str2):
try:
return str1 == str2
except UnicodeError:
pass
return False
This basically allows us to emulate the python-2.5+ behaviour of
``<unicode> == <str>`` in older versions of python. As you can see, it
has problems with the same cases as the current code but those cases degrade
performance by a much smaller margin and overall performance is improved.
(the <str> <str> common case is 3 to 4 times faster)
yum kitchen
no unicode 19.1630501747 7.34791111946
eq high_m 77.9920949936 116.689537048
eq high_b 24.3305389881 6.65209698677
no bytes 26.4734601974 6.85160303116
eq high_u 17.3166980743 7.12775802612
no high_b 28.1892650127 8.82573199272
no mixed 75.0359661579 10.1168701649
no high_m 77.378319025 119.901776791
eq mixed 74.2815010548 10.0155761242
no high_u 18.7383151054 7.27747106552
no mixed 2 72.941822052 9.82383584976
eq unicode 16.5604941845 7.07374501228
eq bytes 24.1317751408 6.49520802498
* Note that this version prints the following warning to stderr:
/home/badger/kitchen/devel/kitchen/text/misc.py:118: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal return str1 == str2
I'll attach the script I'm using to generate these timings
-Toshio
-------------- next part --------------
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import sys
import timeit
from kitchen.text.misc import str_eq as kitchen
from yum.i18n import str_eq as yum
reps = 24274562
#reps = 10000000
datasets={'eq bytes': ('a', 'a'), 'eq unicode': (u'a', u'a'), 'eq mixed': ('a', u'a'),
'no bytes': ('a', 'b'), 'no unicode': (u'a', u'b'), 'no mixed': ('a', u'b'), 'no mixed 2': (u'a', 'b'),
'eq high_b': ('?', '?'), 'eq high_u': (u'?', u'?'), 'eq high_m': (u'?', '?'),
'no high_b': ('?', '?'), 'no high_u': (u'?', u'?'), 'no high_m': ('?', u'?'),
}
def invoke_yum(foo1, foo2):
for i in xrange(0, reps):
if yum(foo1, foo2):
pass
def invoke_kitchen(foo1, foo2):
for i in xrange(0, reps):
if kitchen(foo1, foo2):
pass
if __name__ == '__main__':
print '%-13s %15s %15s' % (' ', 'yum', 'kitchen')
for i in datasets.keys():
a = timeit.Timer('invoke_yum(datasets["%s"][0], datasets["%s"][1])' % (i, i), 'from __main__ import datasets, invoke_yum')
b = timeit.Timer('invoke_kitchen(datasets["%s"][0], datasets["%s"][1])' % (i, i), 'from __main__ import datasets, invoke_kitchen')
print '%-13s %15s %15s' % (i, a.timeit(1), b.timeit(1))
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 198 bytes
Desc: not available
URL: <http://lists.baseurl.org/pipermail/yum-devel/attachments/20100825/f9769e0a/attachment.pgp>
More information about the Yum-devel
mailing list