lxml

Lower casing the attribute names

Bug #1849229 reported by Vishwas on 2019-10-21

This bug affects 1 person

| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| lxml    | New    | Undecided  | Unassigned  |           |

Bug Description

Version Information

Python : sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)

For HTML strings like

```html
<span myCamelCasedAttr="...">TestSpan</span>
```

parsing it with lxml changes the attribute name from camel case to lower case:

```html
<span mycamelcasedattr="...">TestSpan</span>
```

This is especially a problem when you search for attributes by string comparison. For example, with BeautifulSoup it looks something like this:

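```python
from bs4 import BeautifulSoup
import re

# The attribute value "..." stands in for any value; the regex below matches anything.
htmlstr = '<span myCamelCasedAttr="...">TestSpan</span>'

soup = BeautifulSoup(htmlstr, "html.parser")
# The result is the same when the soup is initialized with the lxml parser:
# soup = BeautifulSoup(htmlstr, "lxml")
res = soup.find_all(True, {'myCamelCasedAttr': re.compile(r".*")})

# Returns
# res = []
```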

Parsing HTML strings should preserve the casing of the attribute names that get parsed.
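
Until such an option exists, one possible workaround is to lowercase the search key the same way the parser lowercases attribute names. The sketch below uses only the standard library's `html.parser.HTMLParser`, which documents that attribute names are translated to lower case; `AttrCollector` and `find_attr` are hypothetical helper names introduced for illustration, and the value `"x"` is a placeholder.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attrs) pairs exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; names arrive lowercased.
        self.seen.append((tag, dict(attrs)))


def find_attr(html, name):
    """Return the values of attribute `name`, matched case-insensitively."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    key = name.lower()  # normalize the query the same way the parser does
    return [attrs[key] for _, attrs in collector.seen if key in attrs]


values = find_attr('<span myCamelCasedAttr="x">TestSpan</span>', "myCamelCasedAttr")
```

The same idea should carry over to the BeautifulSoup call above: passing `'myCamelCasedAttr'.lower()` as the attribute key avoids the mismatch, at the cost of losing the original spelling.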

Leonard Richardson (leonardr) wrote on 2019-10-22:

This ticket was originally filed against Beautiful Soup: https://bugs.launchpad.net/beautifulsoup/+bug/1849211

The HTML spec defines tag names and attribute names as case-insensitive (http://w3c.github.io/html-reference/documents.html#case-insensitivity), so lxml's behavior is correct, but it might make sense to offer case preservation as an option.
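
For comparison, XML is case-sensitive, so an XML parser keeps attribute names as written. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`, assuming the markup is a well-formed XML fragment (most real-world HTML is not); the value `"x"` is a placeholder.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment is well-formed XML, so an XML parser accepts it.
fragment = '<span myCamelCasedAttr="x">TestSpan</span>'

element = ET.fromstring(fragment)

# The XML parser does not normalize case: the name is stored as written.
attr_names = list(element.attrib)
camel_value = element.get("myCamelCasedAttr")
lower_value = element.get("mycamelcasedattr")  # no such attribute in XML terms
```

This only helps when the input can be treated as XML; it is not a substitute for a case-preserving option in the HTML parser.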
