[Yum-devel] [PATCH] clean up misc.to_xml(), make it faster, add tests. BZ 716235.
Toshio Kuratomi
a.badger at gmail.com
Mon Nov 19 23:24:49 UTC 2012
On Mon, Nov 19, 2012 at 10:29 AM, Toshio Kuratomi <a.badger at gmail.com> wrote:
>
> Also a bit of bad news... I re-read
> http://en.wikipedia.org/wiki/XML#Valid_characters while researching this.
> It would appear that we should also be removing the C1 control codes from
> the output, not just the C0 control codes.
>
> Unfortunately, the C1 control codes fall outside of the ascii subset. And
> that means that we can't use str.translate to remove them. In kitchen,
> where I'm already taking the hit of transforming to unicode and using
> unicode.translate(), I can extend it to delete these bytes just by
> modifying the translation table but if we want to avoid that with the yum
> code.... Options:
>
> * Don't do anything to the C1 control codes: The current yum code does not
> handle C1 control codes and we haven't seen any problems yet. This
> probably means that the C1 codes are not being used in the wild. We might
> also want to try passing some C1 control codes into libxml2 and seeing
> what happens -- perhaps libxml2 only barfs on C0 codes.
I took a look at this on F17 with some interesting results. Test code:
#!/usr/bin/python -tt
# -*- coding: utf-8 -*-
import libxml2

content_list = []
# Note: Null byte: 0x00 raises an error, so start at code point 1
for i, v in enumerate(map(unichr, range(1, 256))):
    # i is zero-based, so i+1 is the code point of v
    content_list.append(u'%s:%s:' % (i+1, v))
content = u'\n'.join(content_list).encode('utf-8')

# test.xml just contains a start and end tag like: <test></test>
doc = libxml2.parseFile('test.xml')
root = doc.children
root.newTextChild(None, 'test', content)
# "-" writes the serialized document to stdout
doc.saveFormatFileEnc("-", "UTF-8", 1)
doc.freeDoc()
This printed a complete list from 1 to 255. Many of the control code
values were empty. Code point 13 (carriage return) was escaped as a
character reference (&#13;). Code points 8 (backspace), 27 (escape) and
155 (Control Sequence Introducer) might be dangerous because they caused
changes to other characters. Code points 24 and 26 printed some odd
characters that can cause Python's print to barf even though they passed
through libxml2.
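For comparison with what libxml2 tolerated, here is a rough sketch of what the XML 1.0 spec itself says about the same 1-255 range. The ranges come from the spec's Char production and its list of discouraged characters. The helper names (is_xml10_char, is_discouraged) are made up for illustration, and the snippet is written for Python 3 rather than the Python 2 used in the test above.

```python
def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (cp in (0x09, 0x0A, 0x0D)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_discouraged(cp):
    """True for DEL and the C1 controls that XML 1.0 allows but discourages.

    Only the part of the discouraged list below 256 is covered here;
    0x85 (NEL) is deliberately not in the range.
    """
    return 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F


# Same range the libxml2 test fed in (0x00 is excluded there as well).
tested = range(1, 256)
invalid = [cp for cp in tested if not is_xml10_char(cp)]
discouraged = [cp for cp in tested if is_xml10_char(cp) and is_discouraged(cp)]

print("not allowed by XML 1.0:", invalid)
print("allowed but discouraged:", discouraged)
```

By this reading, backspace (8) and escape (27) are not legal XML 1.0 characters at all, while code point 155 is legal but discouraged, which fits with libxml2 passing it through.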
So it appears that, strictly speaking, it's only necessary to strip out
0x00 to make libxml2 happy; libxml2 will deal with the other
characters somehow. However, some of these other characters may cause
enough problems in the surrounding text that we need to get rid of
them anyway. Unfortunately, some of those (code point 155) are outside
of the ASCII subset, so if we deal with them we still have the same
issue of handling multiple bytes per character when the text is encoded
in UTF-8.
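To make the multi-byte issue concrete, here is a minimal sketch of two ways the C1 codes could be stripped. It is written for Python 3, and the function names and the exact set of code points to delete are assumptions for illustration, not what yum or kitchen actually do. The first variant works on decoded text with a translation table. The second works directly on UTF-8 bytes, relying on the fact that U+0080..U+009F always encode as 0xC2 followed by a byte in 0x80..0x9F.

```python
import re

# Assumed deletion set: C0 controls except tab, LF and CR, plus all C1 controls.
_C0 = [c for c in range(0x00, 0x20) if c not in (0x09, 0x0A, 0x0D)]
_C1 = list(range(0x80, 0xA0))
_DELETE_TABLE = dict.fromkeys(_C0 + _C1)  # code point -> None means "delete"


def strip_controls_text(text):
    """Hypothetical helper: drop control codes from decoded text (str)."""
    return text.translate(_DELETE_TABLE)


# In valid UTF-8, 0xC2 only ever appears as a lead byte, so the two-byte
# pattern below cannot match in the middle of another character.
_CONTROL_BYTES = re.compile(rb'[\x00-\x08\x0b\x0c\x0e-\x1f]|\xc2[\x80-\x9f]')


def strip_controls_utf8(data):
    """Hypothetical helper: drop the same codes from UTF-8 bytes, no decode."""
    return _CONTROL_BYTES.sub(b'', data)


sample = 'a\x9bb\x08c\t\u00e9'  # CSI (155), backspace, tab, e-acute
print(repr(strip_controls_text(sample)))
print(repr(strip_controls_utf8(sample.encode('utf-8'))))
```

The byte-level variant assumes the input is already valid UTF-8; on input in some other encoding it could delete bytes that belong to real characters.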
-Toshio