[Rpm-metadata] createrepo: initial comments and a UTF-8 patch
Ville Skyttä
ville.skytta at iki.fi
Sat Jul 24 17:11:34 UTC 2004
On Sat, 2004-07-24 at 17:23, seth vidal wrote:
> > It does not currently do a decent job in UTF-8'ifying content. Not that
> > it would generate broken XML, but for example the UTF-8 "ä" in my
> > surname turns into two "?"s, when it's already UTF-8 in an RPM header!
> >
> > Patch attached. This is a simplified version of what I use in fancix,
> > and the idea originates from decode() in Skip Montanaro's query.py at
> > http://manatee.mojam.com/~skip/python/query.py
>
> my only concern is that it works with some of the suse and pld rpms.
> When I tested them before it was very difficult to guess what encoding
> they were in. I'll take a look again, thanks.
I tested with some self-created nasty ones, as well as actual Conectiva
packages. Could not find a PLD package with "bad" chars in any of the
fields in a quick search (but added ISO-8859-2 to the list of encodings
to guess anyway, thinking about PLD :)
Note that if the conversion is unsuccessful, it falls back to the old
"?" stuff. And as said, the current code fails with stuff that is
already in UTF-8. Try any recent package by yours truly...
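
For illustration, the guessing approach described above could look roughly like the following. This is a minimal sketch and not the attached patch: the function name to_utf8, the order of the encoding list, and the Python 3 bytes handling are all assumptions made for the example.

```python
def to_utf8(raw, encodings=("utf-8", "iso-8859-2", "iso-8859-1")):
    """Return the header bytes `raw` re-encoded as UTF-8 bytes.

    Hypothetical helper: the name and the default encoding list are
    assumptions for this sketch, not taken from the patch.
    """
    for enc in encodings:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
        return text.encode("utf-8")
    # Nothing matched: fall back to the old behaviour and replace
    # every non-ASCII byte with "?".
    return bytes(b if b < 0x80 else ord("?") for b in raw)


if __name__ == "__main__":
    print(to_utf8(b"Skytt\xc3\xa4"))  # already UTF-8, passes through
    print(to_utf8(b"Skytt\xe4"))      # Latin-2 byte, gets converted
```

UTF-8 has to be tried first because it is the only strict test in the list. The single-byte ISO-8859 codecs accept any byte sequence, so whichever of them comes first will always "succeed", and the "?" fallback is only reached when the list contains no such codec.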
> That only happens if you do serialize the whole thing, not just a node.
> You can't serialize the whole thing b/c it would grow in memory use w/o
> bound. That's why I use xmlCleanString().
Well, in my tests I could not find a case where libxml would not
automatically do the string cleanup, i.e. where createrepo would produce
invalid XML without xmlCleanString() calls in place. If you have a case
where the auto-escaping does not happen, let me know.
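
To show the kind of per-node test meant here, below is a small sketch that serializes a single node and relies on the serializer to escape the text. It uses Python's standard xml.etree.ElementTree as a stand-in, not the libxml2 bindings createrepo uses, so it only illustrates the idea; libxml2 may behave differently, in particular for control characters. The helper name serialize_node is made up for the example.

```python
import xml.etree.ElementTree as ET


def serialize_node(tag, text):
    """Serialize one element on its own, not the whole document.

    Hypothetical helper; ElementTree is used here only as a stand-in
    for libxml2.
    """
    node = ET.Element(tag)
    node.text = text  # raw text, no manual cleanup beforehand
    return ET.tostring(node, encoding="unicode")


if __name__ == "__main__":
    print(serialize_node("summary", "a < b & c"))
```

Serializing node by node keeps memory use bounded, and the markup characters in the text still come out escaped.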
> > The attribute name "author" in <changelog> is not a very good choice
> > IMO. RPM defines it as the "name" of the changelog entry. It is very
> > common that for RPMs the author attribute will contain stuff like "John
> > Doe <john at doe dot com> - 2.6.8-0.1", i.e. it's not only the
> > author -> suggesting changing "author" to "name" unless it causes too
> > many problems.
>
> There is no standard for the 'author' field and it is what rpm calls it
> for the changelog.
Where? I see RPMTAG_CHANGELOG{TIME,NAME,TEXT}, no author.
> I think just dumping the output as it occurs in the
> rpm and letting the client program mangle it would be best.
Agreed, but isn't it that way already? Changelog is stored in three
arrays in the rpm header, not one big formatted lump.
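
As a sketch of what "three arrays" means for a client: the i-th entry of each array belongs to the same changelog entry, so they can be paired up without any parsing of formatted text. The function name changelog_entries and the sample values are hypothetical; only the three tag names come from rpm.

```python
def changelog_entries(times, names, texts):
    """Pair up the three parallel changelog arrays, entry by entry.

    The arguments stand for the values of RPMTAG_CHANGELOGTIME,
    RPMTAG_CHANGELOGNAME and RPMTAG_CHANGELOGTEXT; how they are read
    from the header is left out of this sketch.
    """
    return [
        {"time": t, "name": n, "text": x}
        for t, n, x in zip(times, names, texts)
    ]


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    entries = changelog_entries(
        [1090627200],
        ["John Doe <john at doe dot com> - 2.6.8-0.1"],
        ["- Example entry."],
    )
    for entry in entries:
        print(entry["time"], entry["name"], entry["text"])
```

The "name" value is passed through untouched, version suffix and all, which is the "let the client mangle it" behaviour.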
> > Issue 3:
> >
> > $ createrepo .
> > [...]
> > Saving Primary metadata
> > Saving file lists metadata
> > Saving other metadata
> > $ echo foo > repodata/foo.txt
> > $ createrepo .
> > [...]
> > Saving Primary metadata
> > Saving file lists metadata
> > Saving other metadata
> > Could not remove old metadata dir: .olddata
> > Error was [Errno 39] Directory not empty: '.olddata'
> > $ createrepo .
> > Old data directory exists, please remove: .olddata
> >
> > Bug or feature?
>
> feature, I think. Why would you be putting more data into the repodata
> dir?
Accidentally, .htaccess, something? Whatever, I tend to think that
createrepo should not choke on it.
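
A minimal sketch of what I have in mind, assuming the old directory is currently removed with something like os.rmdir() (a guess based on the Errno 39 message, not checked against the code). The function name remove_old_metadata is made up.

```python
import os
import shutil


def remove_old_metadata(olddir):
    """Remove the old metadata dir even if it contains stray files.

    Hypothetical helper. os.rmdir() fails on a non-empty directory,
    while shutil.rmtree() removes the directory and its contents.
    Returns True if something was removed.
    """
    if not os.path.isdir(olddir):
        return False
    shutil.rmtree(olddir)
    return True
```

Whether silently deleting unknown files is acceptable is a separate question; a more cautious variant could warn about them first.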