Bug #191199 “XML from Bugzilla is not guaranteed well-formed or ...” : Bugs : Launchpad itself

Revision history for this message

In Mozilla Bugzilla #105960, Matty-is-a-geek (matty-is-a-geek) wrote on 2001-11-09:

#1

We expect proper escaping for HTML templates, we should do the same for XML
templates. Template Toolkit has the 'html' filter, does it have an 'xml'
filter, or are they the same?

Revision history for this message

In Mozilla Bugzilla #105960, Xyzzy-dataphile (xyzzy-dataphile) wrote on 2001-12-24:

#2

Even for cases that have been anticipated, it looks like something weird is
going on here. Go to http://bugzilla.mozilla.org/xml.cgi?id=1000 and browse the
output and notice the invalid entities being generated for known cases.

QuoteXMLChars() in Bug.pm is supposed to be taking care of &, <, >, ', and ",
but the output above is somehow translating & into &amp;, < into &lt;,
and so on.

It looks as though translation is somehow happening twice, once expanding < to
< for example, then expanding that to &lt; . But QuoteXMLChars() doesn't
seem to be getting called twice within the <text /> tags output, and it
translates & characters before <, not after, as my guess would indicate. If it
is because of a recursive substitution, why does it stop? I am completely
baffled. Perhaps it is an escaping problem in QuoteXMLChars()?

Templates might be able to fix this, but without knowing what the problem is or
what the html filter covers, I can't be certain. IIRC, the html filter didn't
cover ", much less umlauts, and I didn't see an xml filter.

Revision history for this message

In Mozilla Bugzilla #105960, James Henstridge (jamesh) wrote on 2002-04-13:

#3

To handle the accented characters, would it make sense to set the encoding of
the xml output to latin1? Something like this:
<?xml version="1.0" encoding="iso8859-1" standalone="no"?>

Alternatively, the data could be reencoded to UTF-8 (which is the default
encoding for XML files).

The first option is probably the easiest one, and is probably correct if
bugzilla is assuming things work in latin1.

The encoding could easily be part of the template, so that bugzillas in other
countries which have bugs using a different encoding could be exported correctly.

I think a compromise like this would be necessary, given that the encoding of
the text returned by the bugzilla user isn't known (or isn't recorded).

Revision history for this message

In Mozilla Bugzilla #105960, James Henstridge (jamesh) wrote on 2002-04-13:

#4

An example of the accented chars causing problems is:
http://bugzilla.mozilla.org/xml.cgi?id=384

Setting the encoding in the <?xml?> PI makes the file parseable.

Revision history for this message

In Mozilla Bugzilla #105960, Axel-pike (axel-pike) wrote on 2002-04-16:

#5

note that xml.cgi sends the wrong mimetype, too.
The current version on b.m.o sends text/plain, and lxr said it would send
text/html. tststs, text/xml it should be.

Talking about entities, the xml produces by xml.cgi should be readable for
Mozilla, IMHO, so it needs to be standalone. No dtds are loaded over the
network at all.
If you want to use entities, declare them in the document, or stick to the basic
XML entities plus numerical.

Revision history for this message

In Mozilla Bugzilla #105960, Bradley Baetz (bbaetz) wrote on 2002-04-16:

#6

lxr is showing cvs bugzilla - that first content-type happens if you don't put
an id in at all, in which case we do generate an html page.

The dtd is just for validation - there are no entities used.

justdave: should we fix the mimetype for 2.16? Its a simple, obviously correct
fix. (Which will mean that we can't see the output in mozilla directly, w/o
using view source, but....)

Revision history for this message

In Mozilla Bugzilla #105960, Axel-pike (axel-pike) wrote on 2002-04-16:

#7

If the DTD is only good for validation, then the xml pi should have
<?xml version="1.0" standalone="yes"?>, but xml.cgi serves "no".
If you want to keep validation information, give the right hint that this is
only needed for validation, but not for the creation of the content.
http://www.w3.org/TR/2000/REC-xml-20001006#sec-rmd

Revision history for this message

In Mozilla Bugzilla #105960, David D Miller (justdave) wrote on 2002-04-17:

#8

Brad: if you guys agree it needs to be done, I won't stop it from getting
checked in. :-) Although I learned a lot about DTD-writing recently I'm still
generally clueless on the way the XML gets transmitted.

Revision history for this message

In Mozilla Bugzilla #105960, James Henstridge (jamesh) wrote on 2002-04-17:

#9

An easy way to test if the output is formal XML is with the xmllint program that
comes with the libxml2 library (http://www.xmlsoft.org). Just download the
output of xml.cgi, then run xmllint on it like this:

xmllint -noout file.xml

If it prints anything, then the file has character set or tag balancing issues.
You can also perform validity checks (it should be able to download the DTD) with:

xmllint -noout -valid file.xml

At the moment, it looks like the output from bugzilla.mozilla.org has formality
problems (isn't utf-8, but doesn't specify its character set), and validity
problems (looks like some elements are out of order).

For my purposes (my python module for talking to a bugzilla installation), only
the formality issue is a problem. It might be worth fixing the validity bug though.

Here is the output from a validity check after setting the character set to
iso8859-1 in the <?xml?> PI:

$ xmllint -noout -valid 384.xml
384.xml:20: validity error: No declaration for element resolution
  <resolution>INVALID</resolution>
                                 ^
384.xml:199: validity error: Element bug content doesn't follow the DTD
Expecting (bug_id , exporter , urlbase , bug_status , resolution? , product ,
priority , version , rep_platform , assigned_to , delta_ts , component ,
reporter , target_milestone? , bug_severity , creation_ts , qa_contact? ,
status_whiteboard? , op_sys , short_desc? , keywords* , dependson* , blocks* ,
cc* , long_desc? , attachment*), got (bug_id bug_status product priority version
rep_platform assigned_to delta_ts component reporter target_milestone
bug_severity creation_ts qa_contact op_sys resolution short_desc long_desc
long_desc long_desc long_desc long_desc long_desc long_desc long_desc long_desc )
</bug>
     ^

Revision history for this message

In Mozilla Bugzilla #105960, Gervase Markham (gerv-mozilla) wrote on 2002-04-26:

#10

Axel: to summarise: for 2.16, you want:

a) The XML served as text/xml
b) The standalone="no" to become standalone="yes"?

Gerv

Revision history for this message

In Mozilla Bugzilla #105960, Bradley Baetz (bbaetz) wrote on 2002-04-27:

#11

Created attachment 81244
patch

That seems like the correct thing to do

Revision history for this message

In Mozilla Bugzilla #105960, Caillon (caillon) wrote on 2002-04-27:

#12

Comment on attachment 81244
patch

2r=caillon. This change is fine. (is there anything else we need to do though
before closing this bug? Comment 9 hints that we don't validate per the
bugzilla DTD)

Revision history for this message

In Mozilla Bugzilla #105960, Bradley Baetz (bbaetz) wrote on 2002-04-27:

#13

we validate now - that was a separate bug.

Charset encoding stuff is covered by a separate bug (bug 126255) Won't using
utf8 require us to use perl 5.6? That bug has a patch on it, but it doesn't
really cover whats needed here, it just allows an encoding to be set.

Leaving open, -> 2.18, and depending on bug 126255. I'll mark the other patch as
obsolete, too, so that it doesn't affect queries.

Revision history for this message

In Mozilla Bugzilla #105960, James Henstridge (jamesh) wrote on 2002-04-27:

#14

Bug 126255 seems to be something about logging into amazon.com, rather than a
bugzilla bug. I think you meant bug 126266. That patches for that bug seems to
cover setting the encoding of the html output of xml.cgi (ie. when you don't
give any bug numbers). Not sure which bug the charset issue fits into better
(we definitely aren't producing UTF-8 at the moment, so we should be setting it
to _some_ encoding, if we want it to validate).

As for the charset issue, specifying the encoding as iso8859-1 is probably the
best option at the moment.

Revision history for this message

In Mozilla Bugzilla #105960, Bradley Baetz (bbaetz) wrote on 2002-04-27:

#15

I forgot to mention that the reason I marked my patch obsolete, and moved this
to 2.18, was because I checked that patch in. Oops.

Yes, I did mean bug 126266, sorry. If that patch goes in as is, we could just
add the use of the Param to the xml header. In any event, working out the
encoding for the xml output implies working out the encoding for the rest of
bugzilla.

We only really support ascii at the moment, I believe, so UTF8 is sort of
correct. The problem in that other bug and related issues is that we don't have
a way of defined what encoding is being used.

Revision history for this message

In Mozilla Bugzilla #105960, Axel-pike (axel-pike) wrote on 2002-04-27:

#16

I just tried on http://landfill.tequilarista.org/bugzilla-tip/xml.cgi?id=267,
that looks pretty much like what you would expect.
It works in document inspector, entities are as they should be.

Rossi, would you take a look and see if the output works for you?

Revision history for this message

In Mozilla Bugzilla #105960, Rossi (rossi) wrote on 2002-04-27:

#17

yes, this one works great. thanks for checking with me!

Revision history for this message

In Mozilla Bugzilla #105960, Rhg-marxmeier (rhg-marxmeier) wrote on 2003-08-20:

#18

I found another problem caused by the QuoteXMLChars function in Bug.pm regarding
the '&' -> '&' conversion.

If I have for example a sequence like '->' in a bug description it is converted
to '-&gt;' when exporting this bug with xml.cgi.

Seems to be caused by QuoteXMLChars to be applied more than once on the same
field, thus first converting '>' to '&gt' and then '&gt' to '&gt;'.

The solution is an additional look-ahead in QuoteXMLChars:
$_[0] =~ s/&/&/g;
should be replaced with
$_[0] =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g;

This way, a '&' is not touched if it is used as a character prefix (as in >
or Ä or ™).

Revision history for this message

In Mozilla Bugzilla #105960, Bradley Baetz (bbaetz) wrote on 2003-08-20:

#19

What verson are you using? I have a recollection of our double (and in some
cases triple) XML Escaping being fixed a while back. See bug 109530.

This is fixed in CVS where the whole xml thing got redone to just be a display
format.

Revision history for this message

In Mozilla Bugzilla #105960, Joel Peshkin (bugreport) wrote on 2004-03-19:

#20

Unloved bugs targetted for 2.18 but untouched since 9-15-2003 are being
retargeted to 2.20
If you plan to act on one immediately, go ahead and pull it back to 2.18.

Revision history for this message

In Mozilla Bugzilla #105960, Daniel Berlin (dberlin) wrote on 2004-09-14:

#21

It still generates invalid xml if you use things like smart quotes, etc. It
should convert them to &<ordinal value>; type deals.
I've attached a patch that fixes it.

Revision history for this message

In Mozilla Bugzilla #105960, Daniel Berlin (dberlin) wrote on 2004-09-14:

#22

Created attachment 158828
XML quoting fix

Revision history for this message

In Mozilla Bugzilla #105960, Daniel Berlin (dberlin) wrote on 2004-10-08:

#23

Sigh, i was up too late that night.
What i meant is that you still have to remove the unrepresentable characters.
There are some characters that xml simply doesn't allow.
However, they can still make it into the db occassionally.
I believe they are 0x01-0x08, 0x0b-0x0c 0x0e-0x19. THe XML Spec 2.2 says:

Those not in here need to be simply dropped from the output.

ie $var =~ s<([\x01\x02\x03\x04\x05\x06\x07\x08\x0b\x0c\x0e-\x19])><>seg;

Revision history for this message

In Mozilla Bugzilla #105960, Max Kanat-Alexander (mkanat) wrote on 2005-03-05:

#24

This bug has not been touched by its owner in over six months, even though it is
targeted to 2.20, for which the freeze is 10 days away. Unsetting the target
milestone, on the assumption that nobody is actually working on it or has any
plans to soon.

If you are the owner, and you plan to work on the bug, please give it a real
target milestone. If you are the owner, and you do *not* plan to work on it,
please reassign it to <email address hidden> or a .bugs component owner. If you are
*anybody*, and you get this comment, and *you* plan to work on the bug, please
reassign it to yourself if you have the ability.

Revision history for this message

In Mozilla Bugzilla #105960, Myk (myk) wrote on 2005-03-15:

#25

Comment on attachment 158828
XML quoting fix

Canceling review requests since comment 23 indicates this patch needs more
work. Daniel, are you still around to update your patch?

Revision history for this message

In Mozilla Bugzilla #105960, David D Miller (justdave) wrote on 2005-04-22:

#26

*** Bug 291503 has been marked as a duplicate of this bug. ***

Revision history for this message

In Mozilla Bugzilla #105960, Luis-tieguy (luis-tieguy) wrote on 2005-07-07:

#27

Poking dberlin, who AFAICT was not on the cc list.

Revision history for this message

In Mozilla Bugzilla #105960, Daniel Berlin (dberlin) wrote on 2005-07-07:

#28

I'll update it. It's a one liner.

Revision history for this message

In Mozilla Bugzilla #105960, Timeless-bemail (timeless-bemail) wrote on 2006-12-26:

#29

the following seem to work:
https://bugzilla-test.mozilla.org/show_bug.cgi?ctype=xml&id=384
https://bugzilla-test.mozilla.org/show_bug.cgi?ctype=xml&id=267

is this bug still real?

Revision history for this message

In Mozilla Bugzilla #105960, Bugzilla-tevp (bugzilla-tevp) wrote on 2007-03-12:

#30

It's still possible to generate bad XML. See for example http://landfill.bugzilla.org/bugzilla-tip/show_bug.cgi?ctype=xml&id=5032 (plus real-world example at http://bugzilla.gnome.org/show_bug.cgi?id=417196&ctype=xml)

Revision history for this message

Gavin Panella (allenap) wrote on 2008-02-12:

#31

On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <email address hidden> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which
>>> bothered me; elinks.cz fails when pulling information for bug 987.
>>>
>>> 10:58:23 INFO Updating 1 watches on http://
>>> bugzilla.elinks.cz
>>>
>>> 10:58:26 ERROR Failed to parse XML description for
>>> http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10,
>>> column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to
>>> put a non-printable character in a bug comment:
>>>
>>> http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklisting bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>> https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing. This
> should give us some data even for invalid pages:
>
>>>> import urllib2
>>>> from BeautifulSoup import BeautifulStoneSoup
>>>> data = urllib2.urlopen(
> ... 'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
>>>> soup = BeautifulStoneSoup(data)
>>>> for comment in soup.findAll('long_desc'):
> ... print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/
> bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in
> ELinks strings\n'
> 'Looks good. Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James.

On 11 Feb 2008, at 14:17, James Henstridge wrote:
> On 11/02/2008, Gavin Panella <gavin.panella@canonical.com> wrote:
>> On 9 Feb 2008, at 21:26, Christian Robottom Reis wrote:
>>> Because I don't have anything better to do I also looked at the
>>> checkwatches failures today. They are not all bad. But there's one
>>> which
>>> bothered me; elinks.cz fails when pulling information for bug 987.
>>>
>>>     10:58:23    INFO    Updating 1 watches on http://
>>> bugzilla.elinks.cz
>>>
>>>     10:58:26    ERROR   Failed to parse XML description for
>>>     http://bugzilla.elinks.cz bugs [u'987']: syntax error: line 10,
>>> column 62
>>>
>>> Now this is failing because somebody decided it would be a good
>>> idea to
>>> put a non-printable character in a bug comment:
>>>
>>>     http://bugzilla.elinks.cz/show_bug.cgi?id=987#c2
>>>
>>> What should our long-term plan be for this sort of situation? Get it
>>> fixed upstream? Or replace unprintables when importing comments? Or
>>> blacklisting bugs so they stop spamming our logs?
>>
>> Bugzilla shouldn't create invalid XML, so it should ideally be fixed
>> there. See:
>>
>>    https://bugzilla.mozilla.org/show_bug.cgi?id=105960
>>
>> But... it's been open since 2001 and has actually been commented on
>> by James H as recently as April 2002. Looks like this one is not
>> getting fixed, and we should probably try to unfuck the XML from
>> Bugzilla ourselves.
>
> My opinion is that since Bugzilla does not guarantee that it will
> produce valid XML, we should not treat said data as XML.
>
> I'd suggest using the BeautifulSoup.BeautifulStoneSoup class
> (BeautifulSoup minus HTML specific tweaks) to do the parsing.  This
> should give us some data even for invalid pages:
>
>>>> import urllib2
>>>> from BeautifulSoup import BeautifulStoneSoup
>>>> data = urllib2.urlopen(
> ...     'http://bugzilla.elinks.cz/xml.cgi?id=987').read()
>>>> soup = BeautifulStoneSoup(data)
>>>> for comment in soup.findAll('long_desc'):
> ...     print repr(comment.find('thetext').renderContents())
> ...
> 'Patch against elinks-0.11 GIT based on https://bugs.launchpad.net/ 
> bugs/64590'
> 'Created an attachment (id=423)\nTypos and language corrections in
> ELinks strings\n'
> 'Looks good.  Should I credit\x01Malcolm Parsons in the AUTHORS file?'
> 'Yes, please credit Malcolm.'
>
> We still need to work out what to do about character encodings, but
> that is necessary anyway: as mentioned in the bug report old Bugzilla
> had no concept of character encoding, so old bug data can be
> misencoded (one of the sources of invalid XML from bugzilla).
>
> James.

Diogo Matsubara (matsubara) on 2008-02-27

Changed in malone:
status:	New → Confirmed

Diogo Matsubara (matsubara) on 2008-04-16

description:

updated

Revision history for this message

In Mozilla Bugzilla #105960, Matthew-beermann (matthew-beermann) wrote on 2008-04-29:

#32

I stumbled across another real-world example of this today; the following link results in invalid XML...

https://bugs.eclipse.org/bugs/show_bug.cgi?ctype=xml&id=140108

Bug Watch Updater (bug-watch-updater) on 2008-05-06

Changed in bugzilla:
status:	Unknown → Confirmed

Revision history for this message

In Mozilla Bugzilla #105960, Frédéric Buclin (lpsolit-deactivatedaccount) wrote on 2008-12-28:

#33

*** Bug 458218 has been marked as a duplicate of this bug. ***

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-29:

#34

Created attachment 354687
V1

The proposed patch addresses two issues, both related to generating well-formed XML:

1) It replaces instances of '&' with '&' while skipping over valid character entities (so that '™' would *not* become '&trade;' for example).

2) It knocks out characters from disallowed character ranges (control characters, etc) per the XML 1.0 Spec (production 2.2). Note that the more verbose form of expressing the hex characters is used in the regex because the abbreviated form appearing in character classes evidently makes certain perl's lexers cry bitter tears of failure.

Revision history for this message

In Mozilla Bugzilla #105960, Max Kanat-Alexander (mkanat) wrote on 2008-12-29:

#35

Comment on attachment 354687
V1

>+ # substitute & for & unless it is already
>+ # used in a character entity.
>+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g;

That's getting too complex. The way the filter is used, it should be displaying "™" if somebody writes "™".

>+ # the following nukes characters disallowed by the XML 1.0
>+ # spec, Production 2.2. 1.0 declares that only the following
>+ # are valid:
>+ # (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF])
>+ $var =~ s/([\x{0001}-\x{0008}]|
>+ [\x{000B}-\x{000C}]|
>+ [\x{00E}-\x{0019}]|
>+ [\x{D800}-\x{DFFF}]|
>+ [\x{FFFE}-\x{FFFF}])//gx;

I'd rather replace them with HTML entities, is that possible? People export data via XML, and theoretically some of these characters could be in comments (as unlikely as it seems).

Revision history for this message

In Mozilla Bugzilla #105960, L10n-mozilla (l10n-mozilla) wrote on 2008-12-29:

#36

html entities have nothing to do in xml output. You could use numeric xml entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.

Revision history for this message

In Mozilla Bugzilla #105960, Philringnalda (philringnalda) wrote on 2008-12-29:

#37

Except that you can't, see the fourth line in that section (or any of the other thousands of places googling "form feed xml" will show you people being shocked to find they can't "just use XML" to wrap any data). You simply *cannot* represent a form-feed in XML 1.0 content. You can use &#x12;, to turn an actual form-feed character into the literal string "&x12;" if you find that noisy dataloss is better than silent dataloss, but you can no more use  than you can use the actual character.

(Thanks to its desire to be as incompatible and unused as possible, you actually can have  in your XML 1.1, though good luck finding anything to consume XML 1.1 :)

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#38

(In reply to comment #34)

> >+ $var =~ s/&(?![#A-Za-z][0-9A-Za-z]+;)/&/g;
>
> That's getting too complex. The way the filter is used, it should be
> displaying "™" if somebody writes "™".

Well, that's the real question, I suppose: "display" vs. "consume". If I'm only looking at a bug "in XML" in a browser I'd expect what you expect; if I'm passing that data to some kind of XML tool chain in order to do something with it, I'd expect the entities to be preserved as-is.
>
> >+ # the following nukes characters disallowed by the XML 1.0
> >+ # spec, Production 2.2.

::snipping fulgy regex::

> I'd rather replace them with HTML entities, is that possible? People export
> data via XML, and theoretically some of these characters could be in comments
> (as unlikely as it seems).

As Phil rightly points out below, any instance of one of the characters not allowed in Production 2.2-- whether expressed by cutting and pasting from a Word doc or written by hand as entity ref to the code point-- instantly makes the document not-well-formed and all XML 1.0 parsers are /required/ to throw a fatal error. I realize that just nuking them is lossy and might break expectations, but there's simply no way to doll them up to make XML parsers happy and still maintain data integrity.

Revision history for this message

In Mozilla Bugzilla #105960, Frédéric Buclin (lpsolit-deactivatedaccount) wrote on 2008-12-30:

#39

(In reply to comment #37)
> Well, that's the real question, I suppose: "display" vs. "consume".

If I type "& means & in HTML" in a comment, I want it to be set back to this exact string after the XML parser has read the XML file. I don't want the XML parser to convert it back to "& means & in HTML". So when you generate the XML file, you should encode & in all cases, and not assume that the input data was already escaped.

About illegal characters, as philor says we have no way to pass them due to limitations of the XML 1.0 spec, I think it's better to silently skip them as it's very unlikely that we will "see" them anyway.

So simply remove the first part of your patch about & and re-request review.

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#40

(In reply to comment #36)
> Except that you can't, see the fourth line in that section (or any of the other
> thousands of places googling "form feed xml" will show you people being shocked
> to find they can't "just use XML" to wrap any data). You simply *cannot*
> represent a form-feed in XML 1.0 content.

Exactly.

> You can use &#x12;, to turn an
> actual form-feed character into the literal string "&x12;" if you find that
> noisy dataloss is better than silent dataloss, but you can no more use 
> than you can use the actual character.

Okay, that's interesting. It doesn't buy much but at least it keep the code
from silently eating parts of the users' data. Personally, I favor just nuking them since the most common case in my experience is that these characters are artifacts of copy/paste and are not intentional.

> (Thanks to its desire to be as incompatible and unused as possible, you
> actually can have  in your XML 1.1, though good luck finding anything to
> consume XML 1.1 :)

Heh, yeah. I expect to have my flying car /long/ before any 1.1 parsers make it
out of the lab.

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#41

(In reply to comment #35)
> html entities have nothing to do in xml output. You could use numeric xml
> entities, fwiw, http://www.w3.org/TR/2006/REC-xml11-20060816/#sec-references.

Check the version on that spec. XML 1.1 does indeed allow for certain characters that XML 1.0 does not, but, to my knowledge, there are /no/ 1.1 parsers in the wild.

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#42

(In reply to comment #38)

> So simply remove the first part of your patch about & and re-request review.

Thanks, Frédéric. Will do.

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#43

Created attachment 354764
Character Stripping V2

Resubmitted with suggested changes. The patch now only nukes non-XML 1.0-friendly characters.

Revision history for this message

In Mozilla Bugzilla #105960, Frédéric Buclin (lpsolit-deactivatedaccount) wrote on 2008-12-30:

#44

Comment on attachment 354764
Character Stripping V2

>+ [\x{00E}-\x{0019}]|

For consistency, should you write 000E instead of 00E? That's minor, though.

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#45

(In reply to comment #43)
> For consistency, should you write 000E instead of 00E? That's minor, though.

Yep, typo. Good catch. Resubmitting.

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2008-12-30:

#46

Created attachment 354766
Character Stripping V3

Fixed typo.

Revision history for this message

In Mozilla Bugzilla #105960, Max Kanat-Alexander (mkanat) wrote on 2008-12-30:

#47

Comment on attachment 354766
Character Stripping V3

This looks fine to me. :-)

Revision history for this message

In Mozilla Bugzilla #105960, Frédéric Buclin (lpsolit-deactivatedaccount) wrote on 2008-12-30:

#48

tip:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.78; previous revision: 1.77
done

3.2:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.69.2.6; previous revision: 1.69.2.5
done

Bug Watch Updater (bug-watch-updater) on 2008-12-30

Changed in bugzilla:
status:	Confirmed → Fix Released

Revision history for this message

Gavin Panella (allenap) wrote on 2008-12-31:

#49

The root bug has been marked as Fix Released in Bugzilla, so we should revisit this issue, if only to prioritise and schedule it.

Changed in malone:
milestone:	none → 2.2.1

Björn Tillenius (bjornt) on 2009-02-05

Changed in malone:
milestone:	2.2.1 → none

Revision history for this message

In Mozilla Bugzilla #105960, David Marshall (dmarshal) wrote on 2009-03-05:

#50

Created attachment 365723
patch to V3

I've made this same mistake twice, publicly, in the last 24 hours!

The character that precedes 0x20 is 0x1f, not 0x19.

Revision history for this message

In Mozilla Bugzilla #105960, Max Kanat-Alexander (mkanat) wrote on 2009-03-05:

#51

ubu: Is dwm correct?

Revision history for this message

In Mozilla Bugzilla #105960, Khampton (khampton) wrote on 2009-03-06:

#52

(In reply to comment #49)
> ubu: Is dwm correct?

Yes.

Revision history for this message

In Mozilla Bugzilla #105960, Frédéric Buclin (lpsolit-deactivatedaccount) wrote on 2009-03-06:

#53

Comment on attachment 365723
patch to V3

Yes, this change is correct, per the spec. r=LpSolit

Revision history for this message

In Mozilla Bugzilla #105960, Frédéric Buclin (lpsolit-deactivatedaccount) wrote on 2009-03-06:

#54

tip:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.86; previous revision: 1.85
done

3.2.2:

Checking in Bugzilla/Util.pm;
/cvsroot/mozilla/webtools/bugzilla/Bugzilla/Util.pm,v <-- Util.pm
new revision: 1.69.2.9; previous revision: 1.69.2.8
done

Gavin Panella (allenap) on 2010-01-25

tags:

added: story-reliable-bug-syncing

Bug Watch Updater (bug-watch-updater) on 2010-09-18

Changed in bugzilla:
importance:	Unknown → Medium

Curtis Hovey (sinzui) on 2010-11-26

Changed in malone:
status:	Confirmed → Triaged
importance:	Undecided → Low

Robert Collins (lifeless) on 2011-01-12

Changed in launchpad:
importance:	Low → Critical

Curtis Hovey (sinzui) on 2011-03-05

description:

updated

Benji York (benji) on 2011-06-07

Changed in launchpad:
assignee:	nobody → Benji York (benji)

Revision history for this message

Launchpad QA Bot (lpqabot) wrote on 2011-06-15:

#55

Fixed in stable r13240 <http://bazaar.launchpad.net/~launchpad-pqm/launchpad/stable/revision/13240>.

tags:	added: qa-needstesting
Changed in launchpad:
status:	Triaged → Fix Committed

Benji York (benji) on 2011-06-16

tags:

added: qa-untestable
removed: qa-needstesting

William Grant (wgrant) on 2011-06-17

Changed in launchpad:
status:	Fix Committed → Fix Released

Launchpad itself

XML from Bugzilla is not guaranteed well-formed or correctly encoded, so treat it as such

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Remote bug watches

Affects		Status	Importance	Assigned to	Milestone
	Bugzilla	Fix Released	Medium	mozilla-bugs #105960
	Launchpad itself	Fix Released	Critical	Benji York