Encoding not recognized from <meta name="content-type" content="charset=utf-8" /> tag

Bug #1779541 reported by tablecell
Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

## -- environment
---------- python27 ----------
Python : sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 2, 2, 0)
libxml used : (2, 9, 7)
libxml compiled : (2, 9, 7)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

## -- test script
# -*- coding=utf8 -*-

from lxml import html
style='.item_title'
html_recognize_meta_has_issue = '''

 <!doctype html>
 <html lang="en">
 <head>
  <meta name="content-type" content="charset=utf-8" /> <-- recognize wrong -->
  <title></title>
 </head>
 <body>
<span class="item_title"><a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了,比 7.0 流畅很多</a></span>
</span>
<body></html>

'''
html_recognize_meta_correct = '''

 <!doctype html>
 <html lang="en">
 <head>
     <meta charset="utf-8"> <-- recognize correct -->
  <title></title>
 </head>
 <body>
<span class="item_title"><a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了,比 7.0 流畅很多</a></span>
</span>
<body></html>

'''
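# Parse the variant whose <meta> tag is not recognized; swap the comments to compare.
doc = html.fromstring(html_recognize_meta_has_issue)
#doc = html.fromstring(html_recognize_meta_correct)

# Print the text of every element matching the CSS selector.
for span in doc.cssselect(style):
    text = span.text_content()
    #print(repr(text))
    print(text)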

scoder (scoder) wrote :

The parsing is done by libxml2. Please report the problem there.
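
A possible workaround on the lxml side, independent of how libxml2 sniffs the tag: HTML defines <meta charset="..."> and <meta http-equiv="Content-Type" content="text/html; charset=..."> as encoding declarations, and name="content-type" is not one of them, so the parser has nothing to go on. The sketch below is a minimal illustration, not a confirmed fix for this report. It assumes the input really is UTF-8 and uses a shortened version of the reporter's markup; the names TITLE, raw, text_from_bytes and text_from_unicode are made up for the example, and find_class is used instead of cssselect only to avoid the extra dependency.

```python
# -*- coding: utf-8 -*-
from lxml import html

# Shortened version of the title from the report (illustrative only).
TITLE = u'三星 s7 国行可更新安卓 8.0 了'

# Raw UTF-8 bytes, as they might arrive from disk or the network.
# The <meta> tag is the one from the report that is not recognized.
raw = (u'<!doctype html><html><head>'
       u'<meta name="content-type" content="charset=utf-8" />'
       u'<title></title></head><body>'
       u'<span class="item_title">' + TITLE + u'</span>'
       u'</body></html>').encode('utf-8')

# Option 1: state the encoding explicitly instead of relying on <meta> sniffing.
parser = html.HTMLParser(encoding='utf-8')
doc = html.fromstring(raw, parser=parser)
text_from_bytes = doc.find_class('item_title')[0].text_content()

# Option 2: decode the bytes yourself and hand lxml a unicode string.
doc = html.fromstring(raw.decode('utf-8'))
text_from_unicode = doc.find_class('item_title')[0].text_content()

print(text_from_bytes == TITLE, text_from_unicode == TITLE)
```

Either option takes encoding detection out of the parser's hands, which may be preferable whenever the encoding is already known from an HTTP header or from the file's origin.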

Changed in lxml:
status: New → Invalid