lxml.html.document_fromstring produces a tree that differs from the original HTML document

Bug #1913029 reported by bringbladetodream
This bug affects 1 person
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I use lxml to parse HTML into a document, but the resulting lxml document differs from what the browser shows.

import requests
import lxml.html  # "import lxml" alone does not load the lxml.html submodule

url = 'http://www.bjdch.gov.cn/n3952/n9279505/c10412743/content.html'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60'}
# Fetch the page and decode it as UTF-8, ignoring undecodable bytes
html_string = requests.get(url, headers=headers, timeout=3).content.decode('utf-8', 'ignore')
doc = lxml.html.document_fromstring(html_string)
xpath_result = doc.xpath('//div[@class="header navBlock"]/h1/p')

xpath_result is empty (no elements match), but these other XPath queries do return results:
doc.xpath('//div[@class="header navBlock"]/h1')
doc.xpath('//div[@class="header navBlock"]/p')

The HTML as shown in the browser looks like:
<div>
  <h1>
    <p>
    </p>
  </h1>
</div>
The HTML as parsed by lxml looks like:
<div>
  <h1>
  </h1>
  <p>
  </p>
</div>
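
The difference described above can be checked without fetching the page. The following is only a minimal sketch: SNIPPET is a hypothetical reduction of the page markup (the real page is not reproduced here), and locate_p is an illustrative helper name, not part of lxml. It reports whether the p element ends up nested inside h1 or as a sibling of h1; which of the two you see may depend on the libxml2 version that lxml is built against.

```python
import lxml.html

# Hypothetical stand-in for the relevant fragment of the page.
SNIPPET = '<div class="header navBlock"><h1><p>title</p></h1></div>'


def locate_p(html_string):
    """Return (nested_count, sibling_count) for the <p> element."""
    doc = lxml.html.document_fromstring(html_string)
    # <p> as a child of <h1>, as written in the source markup
    nested = doc.xpath('//div[@class="header navBlock"]/h1/p')
    # <p> as a direct child of <div>, i.e. moved out of <h1> by the parser
    sibling = doc.xpath('//div[@class="header navBlock"]/p')
    return len(nested), len(sibling)


if __name__ == "__main__":
    doc = lxml.html.document_fromstring(SNIPPET)
    # Print the tree as lxml actually built it
    print(lxml.html.tostring(doc, pretty_print=True).decode())
    print(locate_p(SNIPPET))
```

Printing the serialized tree makes it easy to compare what lxml built against what the browser's inspector shows.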

May I ask what causes this? How can I get the true document structure, so that an XPath written against the original HTML also works on the parsed document?
Thanks.
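
One possible workaround, offered only as a sketch and not as a fix confirmed by the lxml maintainers, is to query with the descendant axis (//p) instead of the exact child path (/h1/p). Such a query matches the p element whether the parser left it inside h1 or moved it next to h1. The function name find_header_paragraphs and the sample markup are hypothetical, and the looser query may also match other p elements inside the same div.

```python
import lxml.html


def find_header_paragraphs(html_string):
    """Find <p> elements anywhere below the header <div>."""
    doc = lxml.html.document_fromstring(html_string)
    # '//p' after the div step matches at any depth, so it does not
    # depend on whether <p> stayed inside <h1> after parsing.
    return doc.xpath('//div[@class="header navBlock"]//p')


if __name__ == "__main__":
    # Hypothetical markup in the shape the browser shows
    sample = '<div class="header navBlock"><h1><p>title</p></h1></div>'
    for p in find_header_paragraphs(sample):
        print(p.text)
```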
