Beautiful Soup 4.5.3-1 (Python 3) - refusing to return the first div in a 'div with attribute' CSS select?

Bug #1684968 reported by OmegaPhil
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
Beautiful Soup   Fix Released   Low          Unassigned

Bug Description

Originally posted this in the Google Group but it was ignored (https://groups.google.com/forum/#!topic/beautifulsoup/Q0F5SyfAx_4), so escalating to a bug here:

I have attached a tweet page I'm trying to parse - I fetch all tweets
via 'div.original-tweet', discard examples of 'div.pinned', then try to
detect retweets via 'div[data-retweeter]' (so a div with attribute
'data-retweeter') - however that select never returns a result, and I
have fallen back to detecting '.Icon--retweeted' instead.

I have noticed other times where the first div in a list of divs was not
being returned by a select, but this is the first time for me to look
into it properly.

Here is a cut down version of the code:

========================================================================

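import io

import bs4

# 'tweet-page' is the attached HTML file
with io.open('tweet-page') as f:
    html_data = f.read()
parsed = bs4.BeautifulSoup(html_data, 'lxml')
# Prints None, although the selected div itself carries 'data-retweeter'
print(parsed.select('div.original-tweet')[1].select_one('div[data-retweeter]'))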

========================================================================

From 'parsed.select('div.original-tweet')[1]' you can see that the very
first element is a div with attribute 'data-retweeter', but it is not
returned.

For reference, if you create a new BeautifulSoup object from the 'parsed.select('div.original-tweet')[1]' result, it's wrapped in html/body, and the select succeeds.
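
The re-parsing behaviour can be reproduced without the attached page. This is a minimal sketch, assuming a made-up two-tweet snippet in place of the real tweet page and the built-in 'html.parser' instead of lxml (so there is no html/body wrapper, but the effect should be the same):

```python
import bs4

# Hypothetical stand-in for the attached tweet page.
html_data = """
<div class="original-tweet"><p>first tweet</p></div>
<div class="original-tweet" data-retweeter="someone"><p>second tweet</p></div>
"""

parsed = bs4.BeautifulSoup(html_data, "html.parser")
tweet = parsed.select("div.original-tweet")[1]

# Selecting from the tag only searches below it, so the tag itself is missed.
direct = tweet.select_one("div[data-retweeter]")

# Re-parsing the tag's markup makes the div a descendant of a new document,
# so the same selector now finds it.
reparsed = bs4.BeautifulSoup(str(tweet), "html.parser")
wrapped = reparsed.select_one("div[data-retweeter]")

print(direct)
print(wrapped["data-retweeter"])
```

In this sketch the div that was the starting point of the first select becomes an ordinary descendant in the second one, which is the only difference between the two calls.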

Python 3.5.3, BS 4.5.3-1, lxml 3.7.1-1 in a Devuan Testing system (very close to Debian Testing).

Thanks

Tags: css
Revision history for this message
OmegaPhil (omegaphil) wrote :

I have looked into this failure - matching the top-level element itself will never work. In my case, the tags that the code will try to match are determined by bs4/element.py:1509:

======================================================

_use_candidate_generator = lambda tag: tag.descendants

======================================================

So by definition, Beautiful Soup will only try to match the tags inside the top-level tag - whereas I want to be able to match on ANY tag inside the test HTML, and in this case the top-level tag itself should match.
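
One possible way to get the "tag itself or anything inside it" behaviour without touching Beautiful Soup is to check the tag first and only then search its descendants. This is a sketch only: find_with_attr is a hypothetical helper (not part of bs4), the HTML is made up, and it handles a plain attribute check rather than arbitrary CSS:

```python
import bs4


def find_with_attr(tag, attr):
    """Return tag if it carries attr, else the first descendant that does."""
    if tag.has_attr(attr):
        return tag
    # attrs={attr: True} matches any descendant that has the attribute at all.
    return tag.find(attrs={attr: True})


# Hypothetical stand-in for the attached tweet page.
html_data = """
<div class="original-tweet"><p>plain tweet</p></div>
<div class="original-tweet" data-retweeter="top"><p>retweet</p></div>
<div class="original-tweet"><div data-retweeter="nested">retweet</div></div>
"""

parsed = bs4.BeautifulSoup(html_data, "html.parser")
tweets = parsed.select("div.original-tweet")

for tweet in tweets:
    match = find_with_attr(tweet, "data-retweeter")
    print(match["data-retweeter"] if match is not None else None)
```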

For testing earlier I reimplemented the example with lxml - this works:

===========================================================

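import io

from lxml import etree
from lxml.cssselect import CSSSelector

first_sel = CSSSelector('div.original-tweet')
second_sel = CSSSelector('div[data-retweeter]')
# 'parser' was not defined in the snippet as posted; an HTML parser is assumed
parser = etree.HTMLParser()
with io.open('tweet-page', 'rb') as f:
    tree = etree.parse(f, parser)
etree.tostring(second_sel(first_sel(tree)[1])[0])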

===========================================================

For others that want to debug CSS selecting with Beautiful Soup, see _select_debug on element.py:1322 - change to True and you get a lot of debugging output.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Thanks for reporting this bug and for going deeper to diagnose the problem. The CSS selector system is contributed code and for my own sanity I only add to it when a patch and test are contributed. I'm going to leave this issue open in a 'confirmed' state and if someone provides a patch or pull request I'll merge it.

Changed in beautifulsoup:
status: New → Confirmed
tags: added: css
Changed in beautifulsoup:
importance: Undecided → Low
Revision history for this message
Isaac Muse (facelessuser) wrote :

This is how the select algorithm is supposed to work: it selects tags under the parent tag and does not match the parent tag itself. This is how "querySelectorAll" and "querySelector" work in browsers, unless I am misunderstanding the problem here.

Revision history for this message
Leonard Richardson (leonardr) wrote :

In the forthcoming 4.7.0 release, all CSS selector logic is delegated to the Soup Sieve project (https://facelessuser.github.io/soupsieve/). This should dramatically improve the overall support for CSS selectors in Beautiful Soup, and provide a responsive channel for improvements.

I'm closing this issue as of the 4.7.0 release. If it's still a problem in 4.7.0, file a bug against Soup Sieve. However, Isaac currently believes it's not a bug at all, so he'd need some more information.
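
For anyone landing here on 4.7.0 or later: testing whether a tag itself matches a selector, as opposed to searching beneath it, can be done with Soup Sieve's match() function. A minimal sketch, assuming soupsieve is installed alongside Beautiful Soup and using made-up HTML in place of the attached page:

```python
import bs4
import soupsieve as sv

# Hypothetical stand-in for the attached tweet page.
html_data = """
<div class="original-tweet"><p>plain tweet</p></div>
<div class="original-tweet" data-retweeter="someone"><p>retweet</p></div>
"""

parsed = bs4.BeautifulSoup(html_data, "html.parser")
tweets = parsed.select("div.original-tweet")

# select_one() searches descendants only, so the tag itself is not returned.
below = tweets[1].select_one("div[data-retweeter]")

# match() tests the tag itself against the selector.
is_retweet = [sv.match("div[data-retweeter]", tweet) for tweet in tweets]

print(below)
print(is_retweet)
```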

Changed in beautifulsoup:
status: Confirmed → Fix Committed
Revision history for this message
Leonard Richardson (leonardr) wrote :

In 4.7.0 release.

Changed in beautifulsoup:
status: Fix Committed → Fix Released