New feature: soup.a.href

Bug #1828188 reported by Boštjan
This bug affects 1 person
Affects          Status      Importance   Assigned to   Milestone
Beautiful Soup   Won't Fix   Undecided    Unassigned

Bug Description

It would be great to have the ability to do
soup.a.href
which finds all the href tags (links).
Is this something that was ever considered?

Changed in beautifulsoup:
assignee: nobody → Boštjan (pedantic-coder)
assignee: Boštjan (pedantic-coder) → nobody
assignee: nobody → Boštjan (pedantic-coder)
assignee: Boštjan (pedantic-coder) → nobody
Revision history for this message
Leonard Richardson (leonardr) wrote :

Thanks for writing with this suggestion. I'm assuming you're not at a loss as to how to do this; you just think it could be easier. Things could always be easier, but I don't think this is the way to do it.

In Beautiful Soup, soup.a.href already has a meaning -- it directs Beautiful Soup to extract the <href> tag from this markup:

<a>
<href>
</a>

That doesn't make sense for HTML markup, but that could be a valid XML document. That's the first issue -- the dot operator already means something, and changing what it means would break a lot of scripts.
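The current behaviour can be checked with a minimal sketch. It assumes Beautiful Soup 4 is installed as `bs4` and uses the built-in `html.parser`; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Markup where <href> is a child *tag* of <a> (XML-style, illustrative only).
xml_like = BeautifulSoup("<a><href>target</href></a>", "html.parser")
# The dot operator looks up the first child tag with that name.
child_tag = xml_like.a.href

# Ordinary HTML, where href is an *attribute* of <a>.
html_like = BeautifulSoup('<a href="https://example.com/">link</a>', "html.parser")
# There is no <href> tag here, so the same expression yields None.
missing_tag = html_like.a.href

print(child_tag)    # <href>target</href>
print(missing_tag)  # None
```

In both cases the dot operator means "first tag with this name", never "attribute value".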

The other issue is that I think it's very important that all the Beautiful Soup operators and methods have a _consistent_ meaning. To make soup.a.href return the href _attributes_ of each <a> _tag_, the dot operator would have to change its meaning halfway through a line of code. It would start out meaning "get all the <a> tags" (which is itself different from its current meaning) and then start meaning "get all the 'href' values for each tag in this list".

I'd have to come up with rules about when the dot operator means 'find the first child tag' (as it does now), when it means 'find all the child tags', and when it means 'find the value of an attribute'. All of these things are already part of the Beautiful Soup API under separate names, so the library would get no new capabilities. There would be many disagreements about which meaning should apply in which case, which I'd have to judge.
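For reference, the three operations mentioned above already exist under separate names. The following is a small sketch, assuming `bs4` with the built-in `html.parser`; the markup is invented for the example.

```python
from bs4 import BeautifulSoup

html = '<p><a href="/one">1</a> <a href="/two">2</a></p>'
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a              # dot operator: the first <a> tag only
all_a = soup.find_all("a")    # find_all(): every <a> tag
first_href = first_a["href"]  # square brackets: one attribute of one tag

print(first_href)  # /one
print(len(all_a))  # 2
```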

There are similar cases, like bug #1768330, which deals with text extraction, where the potential payoff might be worth adding extra complexity. Even then I'm very reluctant to add that complexity. In this case, it's already pretty easy to find all the href attributes in all the <a> tags. Making it into a simple one-liner doesn't seem worth the disruption it would cause to the API as a whole.

Changed in beautifulsoup:
status: New → Won't Fix
Leonard Richardson (leonardr) wrote :

Another way of doing this would be to use currently unused operators for this purpose. Then you could get syntax like:

soup % 'a' / 'href'

This avoids most of the problems I mentioned, but most of the currently unused operators are math operators. There's no intuitive connection between the meaning of the operator and what the operator does to a Tag or a ResultSet. It could just as easily look like this:

soup / 'a' % 'href'

So the resulting system would be hard to learn and remember. The dot operator (generally used to move from a Python object to one of its attributes) and the square-brackets operator (generally used to index a Python array or dictionary) don't have this problem. Their Beautiful Soup uses are similar to their normal Python uses.
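To make the arbitrariness concrete, here is a purely hypothetical sketch of such operators. The `Query` wrapper class is invented for illustration and is not part of Beautiful Soup; the choice of `%` for "find all tags" and `/` for "collect an attribute" is just one of the two equally plausible mappings. It assumes `bs4` with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup


class Query:
    """Hypothetical wrapper; NOT part of the Beautiful Soup API."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __mod__(self, tag_name):
        # '%' arbitrarily means "find all descendant tags with this name".
        found = [t for node in self.nodes for t in node.find_all(tag_name)]
        return Query(found)

    def __truediv__(self, attr_name):
        # '/' arbitrarily means "collect this attribute from each tag".
        return [n[attr_name] for n in self.nodes if n.has_attr(attr_name)]


soup = BeautifulSoup('<a href="/one">1</a><a href="/two">2</a>', "html.parser")

# '%' and '/' have equal precedence and associate left to right.
links = Query([soup]) % "a" / "href"
print(links)  # ['/one', '/two']
```

Swapping the bodies of `__mod__` and `__truediv__` would work just as well, which is exactly why nothing in the syntax helps a reader remember which is which.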

Overall I think list comprehensions are the right tool for this sort of thing -- that's the syntax the Python devs came up with and even if I could do slightly better, the fact that it's different from normal Python would itself be a negative.
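A list comprehension for this case might look like the following sketch (assuming `bs4` with the built-in `html.parser`; the markup is invented for the example):

```python
from bs4 import BeautifulSoup

html = '<a href="/one">1</a><a name="anchor">2</a><a href="/two">3</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the <a> tags that actually have an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)  # ['/one', '/two']
```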

Boštjan (pedantic-coder) wrote :

for href in soup.find_all("a", href=True):
    print(href)

Is this the only way to find links or do you know other variations?

Isaac Muse (facelessuser) wrote :

You could use selectors (which I actually prefer):

for href in soup.select('a[href]'):
    print(href)

Leonard Richardson (leonardr) wrote :

find_all() and CSS selectors are the main ways. Selectors are more compact. It's a different domain-specific language, so it's less Pythonic, but a lot of web developers already know that language.
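A side-by-side sketch of the two approaches, assuming `bs4` with the built-in `html.parser` and a Beautiful Soup version where `select()` is available (recent releases rely on the `soupsieve` package for this); the markup is invented for the example:

```python
from bs4 import BeautifulSoup

html = '<a href="/one">1</a><a>2</a><a href="/two">3</a>'
soup = BeautifulSoup(html, "html.parser")

via_find_all = soup.find_all("a", href=True)  # Python keyword arguments
via_select = soup.select("a[href]")           # CSS attribute selector

# Both return the matching Tag objects; the attribute is read from each tag.
hrefs_find_all = [tag["href"] for tag in via_find_all]
hrefs_select = [tag["href"] for tag in via_select]
print(hrefs_find_all == hrefs_select)  # True
```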

Boštjan (pedantic-coder) wrote :

Maybe
soup.a[href]
is an implementation possibility?

Isaac Muse (facelessuser) wrote :

soup.a['href']

Will get the first 'a' element's href, but not all. You can do that, but you only get one.
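A short sketch of that limitation, assuming `bs4` with the built-in `html.parser` (the markup is invented for the example). It also shows that square brackets raise `KeyError` when the attribute is missing, while `.get()` returns `None`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/one">1</a><a href="/two">2</a>', "html.parser")
first_only = soup.a["href"]  # only the first <a> is consulted
print(first_only)            # /one

bare = BeautifulSoup("<a>no href</a>", "html.parser")
safe = bare.a.get("href")    # None, because the attribute is absent
try:
    bare.a["href"]           # square brackets raise KeyError instead
    raised = False
except KeyError:
    raised = True
```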

I should also clarify that the select method only returns elements, not attributes directly. My example was only returning 'a' elements that have 'href' attributes; you still have to extract the attribute:

for href in soup.select('a[href]'):
    print(href['href'])

I did not mean to imply that it returns the attributes, only that it returns the elements that have the attribute you are looking for.
