Lower casing the attribute names
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | New | Undecided | Unassigned | | |
Bug Description
Version Information
Python : sys.version_
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)
For HTML strings like
```html
<span myCamelCasedAtt
```
parsing them with lxml converts the attribute name from camel case to lowercase:
```html
<span mycamelcasedatt
```
This is especially a problem if you search for attributes by string comparison. For example, with BeautifulSoup it looks something like this:
```python
from bs4 import BeautifulSoup
import re
htmlstr = "<span mycamelcasedatt
soup = BeautifulSoup(
# Even if initialized as below.
# soup = BeautifulSoup(
res = soup.find_all(True, {'myCamelCasedA
# Returns
# res = []
```
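One caller-side workaround, sketched under the assumption that the parser has already lowercased every attribute name, is to lowercase the filter keys before searching. The helper name `lowercase_attr_filter` is hypothetical and is not part of Beautiful Soup or lxml.

```python
def lowercase_attr_filter(attrs):
    """Return a copy of an attribute filter with lowercased attribute names.

    Hypothetical helper (not a Beautiful Soup or lxml API). Values are left
    untouched, because only the attribute *names* are normalized by the parser.
    """
    return {name.lower(): value for name, value in attrs.items()}


# The camel-cased name from the report is mapped to the lowercased name
# that exists in the parsed tree.
query = lowercase_attr_filter({"myCamelCasedAtt": True})
print(query)
```

The resulting dictionary can then be passed as the `attrs` argument of `soup.find_all`. This only hides the problem for lookups; it does not recover the original casing.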
Parsing HTML strings should preserve the casing of the attribute names that get parsed.
This ticket was originally filed against Beautiful Soup: https://bugs.launchpad.net/beautifulsoup/+bug/1849211
The HTML spec defines tag names and attribute names as case-insensitive (http://w3c.github.io/html-reference/documents.html#case-insensitivity), so lxml's behavior is correct, but it might make sense to allow preserving the original casing as an option.
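The difference between case-insensitive HTML parsing and case-sensitive XML parsing can be illustrated with the Python standard library alone. This sketch does not use lxml; it assumes that `html.parser.HTMLParser` and `xml.etree.ElementTree` are acceptable stand-ins for the two behaviors. The class name `AttrCollector` and the sample attribute value are made up for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Sample markup; the attribute value "x" is an assumption for illustration.
MARKUP = '<span myCamelCasedAtt="x">text</span>'


class AttrCollector(HTMLParser):
    """Collects the attribute names seen on start tags."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over attribute names already lowercased.
        self.names.extend(name for name, _value in attrs)


collector = AttrCollector()
collector.feed(MARKUP)
collector.close()
html_names = collector.names

# XML is case-sensitive, so an XML parser keeps the name as written.
xml_names = list(ET.fromstring(MARKUP).attrib)

print(html_names)  # ['mycamelcasedatt']
print(xml_names)   # ['myCamelCasedAtt']
```

When the input happens to be well-formed XML, parsing it as XML is one possible way to keep the original casing. Whether lxml should offer something comparable for HTML input is the open question of this report.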