lxml

Encoding not recognized from <meta name="content-type" content="charset=utf-8" />

Bug #1779541 reported by tablecell on 2018-07-01

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned    -

Bug Description

## -- environment
---------- python27 ----------
Python : sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml.etree : (4, 2, 2, 0)
libxml used : (2, 9, 7)
libxml compiled : (2, 9, 7)
libxslt used : (1, 1, 32)
libxslt compiled : (1, 1, 32)

## -- test script
# -*- coding: utf-8 -*-

from lxml import html
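# CSS selector for the elements to print. The sample markup below has no
# .item_title element, so the link itself is selected.
style = 'a'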
html_recognize_meta_has_issue = '''

<!doctype html>
<html lang="en">
<head>
<meta name="content-type" content="charset=utf-8" /> <!-- charset is not recognized -->
<title></title>
</head>
<body>
<a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了，比 7.0 流畅很多</a>

</body></html>

'''
html_recognize_meta_correct = '''

<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8"> <!-- charset is recognized correctly -->
<title></title>
</head>
<body>
<a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了，比 7.0 流畅很多</a>

</body></html>

'''
# Swap the two lines below to compare the failing and the working case.
doc = html.fromstring(html_recognize_meta_has_issue)
#doc = html.fromstring(html_recognize_meta_correct)

for span in doc.cssselect(style):
    text = span.text_content()
    #print(repr(text))
    print(text)

scoder (scoder) wrote on 2018-07-04:

The parsing is done by libxml2. Please report the problem there.

Changed in lxml:
status:	New → Invalid
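
Since the encoding detection happens inside libxml2, one way to sidestep it from the lxml side is to state the encoding explicitly instead of relying on the <meta> tag. The following is a minimal sketch, assuming the input is known to be UTF-8 bytes. `PAGE`, `EXPECTED`, and `link_text` are names made up for this illustration, and the markup is a shortened copy of the failing case from the report.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

from lxml import html

# Shortened copy of the failing markup, encoded to UTF-8 bytes
# (assumption: the real input is also UTF-8).
PAGE = '''<!doctype html>
<html lang="en">
<head>
<meta name="content-type" content="charset=utf-8" />
<title></title>
</head>
<body>
<a href="/t/466992#reply59">三星 s7 国行可更新安卓 8.0 了</a>
</body>
</html>'''.encode('utf-8')

EXPECTED = '三星 s7 国行可更新安卓 8.0 了'


def link_text(doc):
    """Return the text of the first <a> element in the parsed document."""
    return doc.xpath('//a')[0].text_content()


# Option 1: give the parser the encoding, so the <meta> tag is not needed.
utf8_parser = html.HTMLParser(encoding='utf-8')
via_parser = link_text(html.fromstring(PAGE, parser=utf8_parser))

# Option 2: decode the bytes first and pass lxml a unicode string.
via_decode = link_text(html.fromstring(PAGE.decode('utf-8')))

print(via_parser == EXPECTED, via_decode == EXPECTED)
```

Either option only helps when the real encoding is known in advance. The sketch uses `xpath` rather than `cssselect` so that it does not depend on the separate cssselect package.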
