Encoding declared with <meta name="content-type" content="charset=utf-8" /> is not recognized

Bug #1779541 reported by tablecell on 2018-07-01
This bug affects 1 person
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

## -- environment
---------- python27 ----------
Python : sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 2, 2, 0)
libxml used : (2, 9, 7)
libxml compiled : (2, 9, 7)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

## -----
test script
-------------------
# -*- coding: utf-8 -*-

from lxml import html

# CSS selector for the span whose text is printed below
style = '.item_title'
html_recognize_meta_has_issue = '''

 <!doctype html>
 <html lang="en">
 <head>
  <meta name="content-type" content="charset=utf-8" /> <!-- charset not recognized -->
  <title></title>
 </head>
 <body>
<span class="item_title"><a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了,比 7.0 流畅很多</a></span>
</body></html>

'''
html_recognize_meta_correct = '''

 <!doctype html>
 <html lang="en">
 <head>
     <meta charset="utf-8"> <!-- charset recognized correctly -->
  <title></title>
 </head>
 <body>
<span class="item_title"><a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了,比 7.0 流畅很多</a></span>
</body></html>

'''
doc = html.fromstring(html_recognize_meta_has_issue)
# doc = html.fromstring(html_recognize_meta_correct)  # swap in to compare

for span in doc.cssselect(style):
    # text_content() returns the text of the element and all its descendants
    text = span.text_content()
    # print(repr(text))
    print(text)

scoder (scoder) wrote :

The parsing is done by libxml2. Please report the problem there.
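
Since the maintainer attributes the parsing to libxml2, one way to sidestep the encoding detection is to decode the bytes before handing them to lxml. The sketch below is an illustration under stated assumptions, not lxml or libxml2 behavior: `decode_html` and `_META_CHARSET` are hypothetical names, and the regular expression simply takes the first `charset=` found inside any `<meta>` tag, which covers both tag forms in the test script.

```python
import codecs
import re

# Hypothetical helper pattern, not part of lxml: capture the first
# "charset=..." value written anywhere inside a <meta ...> tag.
_META_CHARSET = re.compile(
    br'<meta\b[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def decode_html(raw, default='utf-8'):
    """Decode HTML bytes using the charset named in a <meta> tag, if any.

    Falls back to `default` when no charset is declared or when the
    declared name is not a codec Python knows.
    """
    match = _META_CHARSET.search(raw)
    name = match.group(1).decode('ascii') if match else default
    try:
        codecs.lookup(name)
    except LookupError:
        name = default
    # Undecodable bytes are replaced rather than raising an error.
    return raw.decode(name, 'replace')


# The non-standard tag form from the report, followed by two CJK characters.
raw = (u'<meta name="content-type" content="charset=utf-8" />'
       u'<a>\u4e09\u661f s7</a>').encode('utf-8')
text = decode_html(raw)
```

The returned text can then be passed to `html.fromstring(text)` in place of the raw bytes, so the result should no longer depend on which `<meta>` form the page uses.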

Changed in lxml:
status: New → Invalid