Beautiful Soup

Beautiful Soup 4.5.3-1 (Python 3) - refusing to return the first div in a 'div with attribute' CSS select?

Bug #1684968 reported by OmegaPhil on 2017-04-20

Affects          Status         Importance   Assigned to   Milestone
Beautiful Soup   Fix Released   Low          Unassigned

Bug Description

Originally posted this in the Google Group but it was ignored (https://groups.google.com/forum/#!topic/beautifulsoup/Q0F5SyfAx_4), so escalating to a bug here:

I have attached a tweet page I'm trying to parse - I fetch all tweets
via 'div.original-tweet', discard examples of 'div.pinned', then try to
detect retweets via 'div[data-retweeter]' (so a div with attribute
'data-retweeter') - however that select never returns a result, and I
have fallen back to detecting '.Icon--retweeted' instead.

I have noticed other cases where the first div in a list of divs was not
returned by a select, but this is the first time I have looked into it
properly.

Here is a cut down version of the code:

========================================================================

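import io

import bs4

# Read the saved tweet page (attached to this bug as 'tweet-page')
with io.open('tweet-page') as f:
    html_data = f.read()
parsed = bs4.BeautifulSoup(html_data, 'lxml')
# Returns None, although the selected div itself has a 'data-retweeter' attribute
parsed.select('div.original-tweet')[1].select_one('div[data-retweeter]')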

========================================================================

From 'parsed.select('div.original-tweet')[1]' you can see that the very
first element is a div with attribute 'data-retweeter', but it is not
returned.

For reference, if you create a new BeautifulSoup object from the 'parsed.select('div.original-tweet')[1]' result, it's wrapped in html/body, and the select succeeds.

Python 3.5.3, BS 4.5.3-1, lxml 3.7.1-1 in a Devuan Testing system (very close to Debian Testing).

Thanks

OmegaPhil (omegaphil) wrote on 2017-04-20:

tweet-page (295.5 KiB, text/html)

OmegaPhil (omegaphil) wrote on 2017-06-19:

I have looked into this failure - matching the top-level element itself will never work. In my case, the tags that the code will try to match are determined by bs4/element.py:1509:

======================================================

_use_candidate_generator = lambda tag: tag.descendants

======================================================

So by definition, Beautiful Soup will only try to match the tags inside the top-level tag - whereas I want to be able to match on ANY tag inside the test HTML, and in this case the top-level tag itself should match.
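To make the behaviour described above concrete, here is a minimal sketch. The HTML fragment is invented for illustration (it is not the attached tweet page) and the variable names are my own. It uses the built-in 'html.parser' so that lxml is not needed. It shows that the tag you call select_one() on is not among its own candidates, that checking the tag's attribute directly is one possible workaround, and that re-parsing the tag makes the select succeed because the div is then a descendant of the new root.

```python
import bs4

# Invented stand-in for one tweet; the outer div carries the attribute.
html = """
<div class="original-tweet" data-retweeter="someone">
  <div class="content">retweeted text</div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")
tweet = soup.select("div.original-tweet")[0]

# The tag itself is never one of its own descendants ...
tweet_is_candidate = any(node is tweet for node in tweet.descendants)

# ... so a select scoped to the tag cannot return the tag.
inner = tweet.select_one("div[data-retweeter]")

# Possible workaround: test the tag's own attribute directly.
is_retweet = tweet.has_attr("data-retweeter")

# Re-parsing turns the div into a descendant of a new root, so it matches.
rewrapped = bs4.BeautifulSoup(str(tweet), "html.parser")
outer = rewrapped.select_one("div[data-retweeter]")
```

Here 'inner' is None while 'is_retweet' is True and 'outer' is the re-parsed div.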

For testing earlier I reimplemented the example with lxml - this works:

===========================================================

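import io

from lxml import etree
from lxml.cssselect import CSSSelector

# 'parser' was not defined in the original snippet; an HTML parser is assumed
parser = etree.HTMLParser()

first_sel = CSSSelector('div.original-tweet')
second_sel = CSSSelector('div[data-retweeter]')
tree = etree.parse(io.open('tweet-page'), parser)
# Apply the second selector to the second tweet and serialise the first match
etree.tostring(second_sel(first_sel(tree)[1])[0])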

===========================================================

For others that want to debug CSS selecting with Beautiful Soup, see _select_debug on element.py:1322 - change to True and you get a lot of debugging output.

Leonard Richardson (leonardr) wrote on 2018-07-16:

Thanks for reporting this bug and for going deeper to diagnose the problem. The CSS selector system is contributed code and for my own sanity I only add to it when a patch and test are contributed. I'm going to leave this issue open in a 'confirmed' state and if someone provides a patch or pull request I'll merge it.

Changed in beautifulsoup:
status:	New → Confirmed

Leonard Richardson (leonardr) on 2018-07-19

tags:	added: css

Leonard Richardson (leonardr) on 2018-07-21

Changed in beautifulsoup:
importance:	Undecided → Low

Isaac Muse (facelessuser) wrote on 2018-12-19:

This is how the select algorithm is supposed to work: it selects tags under the parent tag and does not match the parent tag itself. This is how "querySelectorAll" and "querySelector" work in browsers, unless I am misunderstanding the problem here.
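Given those semantics, one way to get at the retweets is to put both conditions into a single compound selector evaluated from the document root, instead of selecting the tweet first and then selecting again inside it. The sketch below assumes a Beautiful Soup version whose select() accepts compound class-plus-attribute selectors (true of the Soup Sieve based releases). The markup and variable names are invented for illustration.

```python
import bs4

# Invented timeline: one plain tweet and one retweet.
html = """
<div id="timeline">
  <div class="original-tweet">plain tweet</div>
  <div class="original-tweet" data-retweeter="someone">retweet</div>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Compound selector from the root: class AND attribute on the same div.
retweets = soup.select("div.original-tweet[data-retweeter]")
retweeters = [tag["data-retweeter"] for tag in retweets]

# Scoped to each tweet, the attribute selector only looks below the tweet,
# so it finds nothing in either case.
scoped = [
    tweet.select_one("div[data-retweeter]")
    for tweet in soup.select("div.original-tweet")
]
```

With this fragment, 'retweeters' holds the single value 'someone', and every entry of 'scoped' is None.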

Leonard Richardson (leonardr) wrote on 2018-12-24:

In the forthcoming 4.7.0 release, all CSS selector logic is delegated to the Soup Sieve project (https://facelessuser.github.io/soupsieve/). This should dramatically improve the overall support for CSS selectors in Beautiful Soup, and provide a responsive channel for improvements.

I'm closing this issue as of the 4.7.0 release. If it's still a problem in 4.7.0, file a bug against Soup Sieve. However, Isaac currently believes it's not a bug at all, so he'd need some more information.
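For anyone landing here on 4.7.0 or later: Soup Sieve can also be called directly, and it has a match() function that tests a single element against a selector, which covers the "does this tag itself match?" case from the original report. This is a rough sketch with invented markup, not a statement about how Beautiful Soup calls Soup Sieve internally.

```python
import bs4
import soupsieve as sv

# Invented stand-in for one tweet; the outer div carries the attribute.
html = '<div class="original-tweet" data-retweeter="someone"><p>text</p></div>'
soup = bs4.BeautifulSoup(html, "html.parser")
tweet = soup.select_one("div.original-tweet")

# match() checks the element itself against the selector.
matches_itself = sv.match("div[data-retweeter]", tweet)

# select_one() still searches only below the element.
found_below = sv.select_one("div[data-retweeter]", tweet)
```

Here 'matches_itself' is True and 'found_below' is None, so the two calls together distinguish "this tag matches" from "something inside this tag matches".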