Truncated serialized value (etree.tostring) for long tag value and encoding != 'utf-8'

Bug #1893462 reported by Marcin Raczyński
This bug affects 3 people
Affects  Status   Importance  Assigned to  Milestone
lxml     Triaged  Undecided   Unassigned

Bug Description

The function etree.tostring(xml, encoding=ENC) returns serialized XML with truncated text content when the encoding is anything other than 'utf-8' and the document contains an element with a long text value.

Failing test:

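  from lxml import etree
  N = 5000  # text length large enough to trigger the truncation
  xml = etree.Element('x')
  xml.text = 'a' * N
  out_str = etree.tostring(xml, encoding='iso-8859-2')
  # The output holds all N characters plus markup, so it must be longer than N
  assert len(out_str) > N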

Python : sys.version_info(major=3, minor=7, micro=8, releaselevel='final', serial=0)
lxml.etree : (4, 5, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
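
Until a fixed libxml2 is available, one possible workaround is to serialize to a Python string first and encode it in Python afterwards. The sketch below is not from the report: `tostring_via_unicode` is a hypothetical helper, and it assumes the truncation only happens in libxml2's encoding-conversion path (the report says 'utf-8' output is fine), so `encoding='unicode'` should be unaffected. The XML declaration is added by hand because serializing to a string does not emit one.

```python
try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch still runs without lxml installed.
    import xml.etree.ElementTree as etree


def tostring_via_unicode(element, encoding):
    """Hypothetical helper: serialize to str, then encode in Python."""
    # encoding='unicode' returns a str and skips the encoding conversion.
    text = etree.tostring(element, encoding='unicode')
    declaration = "<?xml version='1.0' encoding='%s'?>\n" % encoding
    # Characters the target encoding cannot represent become &#NNN; references.
    return (declaration + text).encode(encoding, errors='xmlcharrefreplace')


N = 5000
xml = etree.Element('x')
xml.text = 'a' * N
out_bytes = tostring_via_unicode(xml, 'iso-8859-2')
assert len(out_bytes) > N
```

The `xmlcharrefreplace` error handler is only safe for text and attribute values; a tag name that cannot be represented in the target encoding would still produce invalid output.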

Marcin Raczyński (marc1nr) wrote :

Bug was introduced in lxml 5.0.0 (the last 4.4.x version, or 4.4.3, is OK)

description: updated
Marcin Raczyński (marc1nr) wrote :

Correction: Bug was introduced in lxml 4.5.0 (there is no problem in the last 4.4.x version or 4.4.3)

description: updated
description: updated
scoder (scoder) wrote :

This is most likely an artefact of switching to libxml2 2.9.10 for the lxml 4.5.0 binary wheels. Worth checking if there's a bug report on their side and/or if it has already been fixed for their next release.

Changed in lxml:
status: New → Triaged
Marcin Raczyński (marc1nr) wrote :

This issue may be related to https://gitlab.gnome.org/GNOME/libxml2/-/issues/166:

"I'm using xmlDocDumpFormatMemoryEnc to dump a DOM tree value with string length more than 5000 chars. The return content is truncated. The problem only exists in 2.9.10 and it's ok in 2.9.9."

summary: - Invalid serialization (etree.tostring) for long tag value and encoding
- != 'utf-8'
+ Truncated serialized value (etree.tostring) for long tag value and
+ encoding != 'utf-8'
Marcin Raczyński (marc1nr) wrote :

The issue is still valid in lxml 4.6.3

scoder (scoder) wrote :

The ticket was resolved in libxml2, but there hasn't been a new release on their side yet that fixes this.

Marcin Raczyński (marc1nr) wrote :

Could you paste a link to this libxml2 issue here, please?

scoder (scoder) wrote :