[Yum-devel] [PATCH] clean up misc.to_xml(), make it faster, add tests. BZ 716235.

James Antill james at fedoraproject.org
Tue Nov 20 14:54:16 UTC 2012


On Mon, 2012-11-19 at 15:24 -0800, Toshio Kuratomi wrote:
> On Mon, Nov 19, 2012 at 10:29 AM, Toshio Kuratomi <a.badger at gmail.com> wrote:
> >
> > Also a bit of bad news... I re-read
> > http://en.wikipedia.org/wiki/XML#Valid_characters while researching this.
> > It would appear that we should also be removing the C1 control codes from
> > the output, not just the C0 control codes.
> >
> > Unfortunately, the C1 control codes fall outside of the ASCII subset.  And
> > that means that we can't use str.translate to remove them.  In kitchen,
> > where I'm already taking the hit of transforming to unicode and using
> > unicode.translate(), I can extend it to delete these bytes just by
> > modifying the translation table but if we want to avoid that with the yum
> > code....  Options:
> >
> > * Don't do anything to the C1 control codes: The current yum code does not
> >   handle C1 control codes and we haven't seen any problems yet.  This
> >   probably means that the C1 codes are not being used in the wild.  We might
> >   also want to try passing some C1 control codes into libxml2 and seeing
> >   what happens -- perhaps libxml2 only barfs on C0 codes.
> 
> I took a look at this on F17 with some interesting results.  test code:
> 
> #!/usr/bin/python -tt
> # -*- coding: utf-8 -*-
> import libxml2
> 
> content_list = []
> # Note: Null byte: 0x00 raises an error
> for i, v in enumerate(map(unichr, range(1,256))):
>     content_list.append(u'%s:%s:' % (i+1, v))
> 
> content = u'\n'.join(content_list).encode('utf-8')
> # test.xml just contains a start and end tag like: <test></test>
> doc = libxml2.parseFile('test.xml')
> root = doc.children
> root.newTextChild(None, 'test', content)
> doc.saveFormatFileEnc("-", "UTF-8", 1)
> doc.freeDoc()
> 
> This printed a complete list from 1-255.  Many of the control code
> values were empty.  Codepoint 13 (carriage return) was escaped.
> Codepoints 8 (backspace), 27 (escape) and 155 (Control Sequence
> Introducer) might be dangerous because they caused changes to other
> characters.  Codepoints 24 and 26 printed some odd characters that can
> cause python's print to barf even though they passed through libxml2.
> 
> So it appears that strictly speaking it's only necessary to strip out
> 0x00 to make libxml2 happy.
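
For reference, the translation-table approach discussed above can be sketched
as follows. This is only an illustration, not the yum or kitchen code: it
assumes Python 3, where str is already Unicode text (so it corresponds to
unicode.translate() in the Python 2 code under discussion), and the names
DELETE_TABLE and strip_control_chars are made up for the example. Whether the
C1 range should be deleted at all is exactly the open question in this thread.

```python
# Sketch only: assumes Python 3, where str is Unicode text.
# Hypothetical names; this is not the yum or kitchen implementation.

# C0 controls that XML text may contain: tab, line feed, carriage return.
C0_KEEP = {0x09, 0x0A, 0x0D}

# Map every other C0 control code (0x00-0x1F) to None, which deletes it.
DELETE_TABLE = {cp: None for cp in range(0x00, 0x20) if cp not in C0_KEEP}

# Optionally delete the C1 control codes (0x80-0x9F) as well.
DELETE_TABLE.update({cp: None for cp in range(0x80, 0xA0)})


def strip_control_chars(text):
    """Return text with C0 (except tab/LF/CR) and C1 control codes removed."""
    return text.translate(DELETE_TABLE)
```

Because the table is keyed by codepoint, extending it from C0 to C1 is a
one-line change, which is the point made above about unicode.translate().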

 I retried this on RHEL-5, assuming that would be "the worst", and it
looked like it showed the same thing (only 0 disappeared). However, when
I redirected the output to a file, moved that file back over to test.xml
and ran it again, I got:

test.xml:3: parser error : PCDATA invalid Char value 1
  <test>1::
          ^
test.xml:4: parser error : PCDATA invalid Char value 2
2::
  ^

...and the same for values 3, 4, 5, 6, 7, 8, 11, 12, 14, 15, 16, 17, 18,
19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31.
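
That list is the same set the Char production in the XML 1.0 spec excludes
within the 1-255 range tested here: everything below 0x20 except tab (9),
line feed (10) and carriage return (13). A minimal sketch of that check,
assuming Python 3; is_xml10_char is a made-up helper name, not something in
yum or libxml2:

```python
# Sketch only: assumes Python 3. is_xml10_char is a hypothetical helper.

def is_xml10_char(cp):
    """True if codepoint cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


# Codepoints from the 1-255 test range that a parser should reject.
rejected = [cp for cp in range(1, 256) if not is_xml10_char(cp)]
print(rejected)
```

Note that none of the C1 codes (128-159) show up in the rejected list, which
agrees with the parser errors above stopping at 31.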



