hxselect crashes on unclosed tag

Bug #1878637 reported by Fabio
This bug affects 2 people
Affects: html-xml-utils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Hello
I can't parse any HTML page anymore because hxselect crashes when it encounters a tag that was not closed.
For example:
<img src=".......>
<a>....</a>
crashes because img was not closed correctly, even after running the page through hxnormalize.

Example

curl http://coincapital.live | hxnormalize | hxselect '#BTC_DTCO'

  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 37821 100 37821 0 0 22182 0 0:00:01 0:00:01 --:--:-- 22182

End tag </a> doesn't match start tag <img> <------------------------------------------------

Thank you

Carlos (colosseum) wrote :

Same problem, different tags:

curl -s "https://www.altrogiornale.org/aristarco-di-samo-e-la-luna/" | tac | tac | hxclean | hxnormalize | hxselect "div.cmsmasters_post_content:nth-child(1)"
End tag </div> doesn't match start tag <input>
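
Both failures involve elements that HTML allows without an end tag (img, input), which suggests hxselect is reading the input as XML and expects every element to be closed. The hxnormalize man page documents an -x option that writes empty elements XML-style (with a trailing slash). The sketch below tries that option on a small made-up sample; the sample markup and the id "x" are hypothetical, and whether this resolves the pages in this report has not been confirmed here.

```shell
#!/bin/sh
# Hypothetical sample modelled on the report: <img> has no closing slash.
sample='<html><body><p id="x"><img src="a.png"><a href="#">link</a></p></body></html>'

out=""
if command -v hxnormalize >/dev/null 2>&1 && command -v hxselect >/dev/null 2>&1; then
    # -x asks hxnormalize for XML conventions, so <img> should come out as <img ... />
    # before hxselect sees it.
    out=$(printf '%s\n' "$sample" | hxnormalize -x | hxselect '#x')
    printf '%s\n' "$out"
else
    # Assumption: html-xml-utils may not be installed where this is run.
    echo "html-xml-utils not installed; skipping" >&2
fi
```

If this works on the sample, the same change would apply to the pipelines above by replacing hxnormalize with hxnormalize -x.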

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in html-xml-utils (Ubuntu):
status: New → Confirmed