Beautiful Soup 4.5.3-1 (Python 3) - refusing to return the first div in a 'div with attribute' CSS select?
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Low | Unassigned |
Bug Description
Originally posted this in the Google Group but it was ignored (https:/
I have attached a tweet page I'm trying to parse - I fetch all tweets
via 'div.original-
detect retweets via 'div[data-
'data-retweeter') - however that select never returns a result, and I
have fallen back to detecting '.Icon--retweeted' instead.
I have noticed other times where the first div in a list of divs was not
being returned by a select, but this is the first time for me to look
into it properly.
Here is a cut down version of the code:
=======
import io
import bs4
html_data = io.open(
parsed = bs4.BeautifulSo
parsed.
=======
From 'parsed.
first element is a div with attribute 'data-retweeter', but it is not
returned.
For reference if you create a new BeautifulSoup object from the 'parsed.
Python 3.5.3, BS 4.5.3-1, lxml 3.7.1-1 in a Devuan Testing system (very close to Debian Testing).
Thanks
tags: added: css
Changed in beautifulsoup: importance: Undecided → Low
I have looked into this failure: matching a top-level element will never work. In my case, the tags that the code will try to match are determined by bs4/element.py:1509:
=======
_use_candidate_generator = lambda tag: tag.descendants
=======
So by definition, Beautiful Soup will only try to match the tags inside the top-level tag - whereas I want to be able to match on ANY tag inside the test HTML, and in this case the top-level tag itself should match.
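To make the effect concrete, here is a minimal sketch of the same candidate logic using only the standard library. It is not Beautiful Soup's code: the `descendants` and `with_attribute` helpers, the `include_self` flag and the sample markup are all hypothetical, made up for illustration. It only shows why an attribute test run over descendants alone can never return the element the search starts from.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a single tweet; not the attached page.
tweet = ET.fromstring(
    '<div class="original-tweet" data-retweeter="someone">'
    '<div class="content"><span class="Icon--retweeted"/></div>'
    '</div>'
)


def descendants(elem):
    """Yield every element below elem, but never elem itself."""
    for child in elem:
        yield child
        yield from descendants(child)


def with_attribute(elem, attr, include_self=False):
    """Return the candidates carrying attr, optionally testing elem as well."""
    candidates = descendants(elem)
    if include_self:
        candidates = itertools.chain([elem], candidates)
    return [c for c in candidates if attr in c.attrib]


# Descendants only: the starting div is never a candidate, so nothing matches.
descendants_only = with_attribute(tweet, "data-retweeter")
# Starting element included: the div itself is found.
including_self = with_attribute(tweet, "data-retweeter", include_self=True)

print(len(descendants_only), len(including_self))  # prints "0 1"
```

With the starting element added to the candidates, the attribute test finds the div, which is the behaviour I was expecting from the select.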
For testing earlier I reimplemented the example with lxml - this works:
=======
import io
from lxml import etree
from lxml.cssselect import CSSSelector
# Assumption: `parser` is not defined in the original snippet; an HTML parser is assumed.
parser = etree.HTMLParser()
first_sel = CSSSelector('div.original-tweet')
second_sel = CSSSelector('div[data-retweeter]')
tree = etree.parse(io.open('tweet-page'), parser)
# Run the second selector against the second tweet and serialise the first match.
etree.tostring(second_sel(first_sel(tree)[1])[0])
=======
For others who want to debug CSS selecting with Beautiful Soup, see _select_debug at element.py:1322: set it to True and you get a lot of debugging output.