tostring truncates output when encoding set to "utf8"

Bug #1944751 reported by Micha Moskovic
This bug affects 1 person
Affects        Status   Importance   Assigned to   Milestone
lxml           New      Undecided    Unassigned
lxml (Debian)  New      Unknown

Bug Description

I ran into a bug that causes lxml to truncate the output when using "tostring" with encoding set to "utf8", while it works correctly when encoding is set to "utf-8". Running the attached example file produces the following output for me:

Bad:
b'<record><datafield tag="520" ind1=" " ind2=" "><subfield code="9">APS</subfield><subfield code="a">The first measurement of the dependence of &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xb3&lt;/mi&gt;&lt;mi&gt;\xce\xb3&lt;/mi&gt;&lt;mo stretchy="false"&gt;\xe2\x86\x92&lt;/mo&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo&gt;\xe2\x88\x92&lt;/mo&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/math&gt; production on the multiplicity of neutrons emitted very close to the beam direction in ultraperipheral heavy ion collisions is reported. Data for lead-lead interactions at &lt;math display="inline"&gt;&lt;mrow&gt;&lt;msqrt&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mrow&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/msqrt&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;5.02&lt;/mn&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mi&gt;TeV&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt;, with an integrated luminosity of approximately &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mn&gt;1.5&lt;/mn&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;nb&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo&gt;-&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/math&gt;, are collected using the CMS detector at the LHC. 
The azimuthal correlations between the two muons in the invariant mass region &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mn&gt;8&lt;/mn&gt;&lt;mo&gt;&amp;lt;&lt;/mo&gt;&lt;msub&gt;&lt;mrow&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;&amp;lt;&lt;/mo&gt;&lt;mn&gt;60&lt;/mn&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mi&gt;GeV&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; are extracted for events including 0, 1, or at least 2 neutrons detected in the forward pseudorapidity range &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mrow&gt;&lt;mo stretchy="false"&gt;|&lt;/mo&gt;&lt;mi&gt;\xce\xb7&lt;/mi&gt;&lt;mo stretchy="false"&gt;|&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;&amp;gt;&lt;/mo&gt;&lt;mn&gt;8.3&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;. The back-to-back correlation structure from leading-order photon-photon scattering is found to be significantly broader for events with a larger number of emitted neutrons from each nucleus, corresponding to interactions with a smaller impact parameter. This observation provides a data-driven demonstration that the average transverse momentum of photons emitted from relativistic heavy ions has an impact parameter dependence. These results provide new constraints on models of photon-induced interactions in ultraperipheral collisions. They also provide a baseline to search for possible final-state effects on lepton pairs caused by traversing a quark-gluon plasma produced in hadronic heavy ion collisions.</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="9">arXiv</subfield><subfield code="a">The first measurement of the dependence of $\\gamma\\gamma$$\\to$$\\mu^{+}\\mu^{-}$ production on the multiplicity of neutrons emitted very close to the beam direction in ultraperipheral heavy ion collisions is reported. 
Data for lead-lead interactions at $\\sqrt{s_\\mathrm{NN}} =$ 5.02 TeV, with an integrated luminosity of approximately 1.5 nb$^{-1}$, were collected using the CMS detector at the LHC. The azimuthal correlations between the two muons in the invariant mass region 8 $\\lt$$m_{\\mu\\mu}$$\\lt$ 60 GeV are extracted for events including 0, 1, or at least 2 neutrons detected in the forward pseudorapidity range $|\\eta|$$\\gt$ 8.3. The back-to-back correlation structure from leading-order photon-photon scattering is found to be significantly broader for events with a larger number of emitted neutrons from each nucleus, corresponding to interactions with a smaller impact parameter. This observation provides a data-driven demonstrat</subfield></datafield></record>'
Good:
b'<record><datafield tag="520" ind1=" " ind2=" "><subfield code="9">APS</subfield><subfield code="a">The first measurement of the dependence of &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xb3&lt;/mi&gt;&lt;mi&gt;\xce\xb3&lt;/mi&gt;&lt;mo stretchy="false"&gt;\xe2\x86\x92&lt;/mo&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo&gt;+&lt;/mo&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo&gt;\xe2\x88\x92&lt;/mo&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/math&gt; production on the multiplicity of neutrons emitted very close to the beam direction in ultraperipheral heavy ion collisions is reported. Data for lead-lead interactions at &lt;math display="inline"&gt;&lt;mrow&gt;&lt;msqrt&gt;&lt;mrow&gt;&lt;msub&gt;&lt;mrow&gt;&lt;mi&gt;s&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;mi&gt;N&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;/mrow&gt;&lt;/msqrt&gt;&lt;mo&gt;=&lt;/mo&gt;&lt;mn&gt;5.02&lt;/mn&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mi&gt;TeV&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt;, with an integrated luminosity of approximately &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mn&gt;1.5&lt;/mn&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;msup&gt;&lt;mrow&gt;&lt;mi&gt;nb&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mo&gt;-&lt;/mo&gt;&lt;mn&gt;1&lt;/mn&gt;&lt;/mrow&gt;&lt;/msup&gt;&lt;/mrow&gt;&lt;/math&gt;, are collected using the CMS detector at the LHC. 
The azimuthal correlations between the two muons in the invariant mass region &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mn&gt;8&lt;/mn&gt;&lt;mo&gt;&amp;lt;&lt;/mo&gt;&lt;msub&gt;&lt;mrow&gt;&lt;mi&gt;m&lt;/mi&gt;&lt;/mrow&gt;&lt;mrow&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;mi&gt;\xce\xbc&lt;/mi&gt;&lt;/mrow&gt;&lt;/msub&gt;&lt;mo&gt;&amp;lt;&lt;/mo&gt;&lt;mn&gt;60&lt;/mn&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mtext&gt;\xe2\x80\x89&lt;/mtext&gt;&lt;mi&gt;GeV&lt;/mi&gt;&lt;/mrow&gt;&lt;/math&gt; are extracted for events including 0, 1, or at least 2 neutrons detected in the forward pseudorapidity range &lt;math display="inline"&gt;&lt;mrow&gt;&lt;mrow&gt;&lt;mo stretchy="false"&gt;|&lt;/mo&gt;&lt;mi&gt;\xce\xb7&lt;/mi&gt;&lt;mo stretchy="false"&gt;|&lt;/mo&gt;&lt;/mrow&gt;&lt;mo&gt;&amp;gt;&lt;/mo&gt;&lt;mn&gt;8.3&lt;/mn&gt;&lt;/mrow&gt;&lt;/math&gt;. The back-to-back correlation structure from leading-order photon-photon scattering is found to be significantly broader for events with a larger number of emitted neutrons from each nucleus, corresponding to interactions with a smaller impact parameter. This observation provides a data-driven demonstration that the average transverse momentum of photons emitted from relativistic heavy ions has an impact parameter dependence. These results provide new constraints on models of photon-induced interactions in ultraperipheral collisions. They also provide a baseline to search for possible final-state effects on lepton pairs caused by traversing a quark-gluon plasma produced in hadronic heavy ion collisions.</subfield></datafield><datafield tag="520" ind1=" " ind2=" "><subfield code="9">arXiv</subfield><subfield code="a">The first measurement of the dependence of $\\gamma\\gamma$$\\to$$\\mu^{+}\\mu^{-}$ production on the multiplicity of neutrons emitted very close to the beam direction in ultraperipheral heavy ion collisions is reported. 
Data for lead-lead interactions at $\\sqrt{s_\\mathrm{NN}} =$ 5.02 TeV, with an integrated luminosity of approximately 1.5 nb$^{-1}$, were collected using the CMS detector at the LHC. The azimuthal correlations between the two muons in the invariant mass region 8 $\\lt$$m_{\\mu\\mu}$$\\lt$ 60 GeV are extracted for events including 0, 1, or at least 2 neutrons detected in the forward pseudorapidity range $|\\eta|$$\\gt$ 8.3. The back-to-back correlation structure from leading-order photon-photon scattering is found to be significantly broader for events with a larger number of emitted neutrons from each nucleus, corresponding to interactions with a smaller impact parameter. This observation provides a data-driven demonstration that the average transverse momentum of photons emitted from relativistic heavy ions has an impact parameter dependence. These results provide new constraints on models of photon-induced interactions in ultraperipheral collisions. They also provide a baseline to search for possible final-state effects on lepton pairs caused by traversing a quark-gluon plasma produced in hadronic heavy ion collisions.</subfield></datafield></record>'

As you can see, the output of the last subfield is truncated in the first case.
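
The two byte strings are long enough that the cut-off point is hard to spot by eye. The following is a minimal sketch for locating it, not the attached example file: `describe_truncation` is a hypothetical helper name, and the sample data is a shortened stand-in for the real output.

```python
def describe_truncation(good: bytes, bad: bytes) -> str:
    """Report where two serializations of the same tree diverge."""
    if good == bad:
        return "outputs are identical"
    # Count the bytes both outputs have in common at the start.
    prefix = 0
    for a, b in zip(good, bad):
        if a != b:
            break
        prefix += 1
    return (
        f"outputs diverge at byte {prefix}: "
        f"good has {len(good)} bytes, bad has {len(bad)} bytes"
    )


# Stand-in data (assumption). With lxml one would instead compare
#   good = etree.tostring(root, encoding="utf-8")
#   bad  = etree.tostring(root, encoding="utf8")
good = b"<a><b>demonstration that the average</b></a>"
bad = b"<a><b>demonstrat</b></a>"
print(describe_truncation(good, bad))
```

Applied to the real output, the reported offset should land inside the text of the last subfield, where the "Bad" output stops at "demonstrat".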

Required information:

Python : sys.version_info(major=3, minor=9, micro=2, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Micha Moskovic (micha-mosk) wrote :

Further testing shows that this affects the Debian package: on the same system, the binary wheel works correctly. I've therefore reported the bug against the Debian package.

Changed in lxml (Debian):
status: Unknown → New
Micha Moskovic (micha-mosk) wrote :

Might this be related to https://bugs.launchpad.net/lxml/+bug/1873306, which is also a serialization issue in "etree.tostring" that goes away when specifying the "UTF-8" encoding?
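
Until the root cause is known, one possible workaround is to normalise the encoding name before it reaches "tostring". This is only a sketch. It assumes, based solely on the reports above, that the hyphenated spelling avoids the truncation. `canonical_encoding` is a hypothetical helper built on Python's standard `codecs` registry and is not part of lxml.

```python
import codecs


def canonical_encoding(name: str) -> str:
    """Return Python's canonical spelling of a codec name."""
    # codecs.lookup resolves aliases such as "utf8" to the registered codec.
    return codecs.lookup(name).name


# Both spellings resolve to the same codec in Python.
print(canonical_encoding("utf8"))
print(canonical_encoding("UTF-8"))
```

The result could then be passed on, for example as `etree.tostring(root, encoding=canonical_encoding("utf8"))`. Whether this avoids the truncation on an affected Debian build has not been verified here.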
