Bug #1828188 “New feature: soup.a.href” : Bugs : Beautiful Soup

Boštjan (pedantic-coder) on 2019-05-08

Changed in beautifulsoup:
assignee:	nobody → Boštjan (pedantic-coder)
assignee:	Boštjan (pedantic-coder) → nobody
assignee:	nobody → Boštjan (pedantic-coder)
assignee:	Boštjan (pedantic-coder) → nobody

Revision history for this message

Leonard Richardson (leonardr) wrote on 2019-05-08:

#1

Thanks for writing with this suggestion. I'm assuming you're not at a loss as to how to do this; you just think it could be easier. Things could always be easier, but I don't think this is the way to do it.

In Beautiful Soup, soup.a.href already has a meaning -- it directs Beautiful Soup to extract the <href> tag from this markup:

That doesn't make sense for HTML markup, but that could be a valid XML document. That's the first issue -- the dot operator already means something, and changing what it means would break a lot of scripts.

The other issue is that I think it's very important that all the Beautiful Soup operators and methods have a _consistent_ meaning. To make soup.a.href return the href _attributes_ of the each <a> _tag_, the dot operator would have to change its meaning halfway through a line of code. It would start out meaning "get all the <a> tags" (which is itself different from its current meaning) and then start meaning "get all the 'href' values for each tag in this list".

I'd have to come up with rules about when the dot operator means 'find the first child tag' (as it does now), when it means 'find all the child tags', and when it means 'find the value of an attribute'. All of these things are already part of the Beautiful Soup API under separate names, so the library would get no new capabilities. There would be many disagreements about which meaning should apply in which case, which I'd have to judge.

There are similar cases, like bug #1768330, which deals with text extraction, where the potential payoff might be worth adding extra complexity. Even then I'm very reluctant to add that complexity. In this case, it's already pretty easy to find all the href attributes in all the <a> tags. Making it into a simple one-liner doesn't seem worth the disruption it would cause to the API as a whole.

Changed in beautifulsoup:
status:	New → Won't Fix

Revision history for this message

Leonard Richardson (leonardr) wrote on 2019-05-08:

#2

Another way of doing this would be to use currently unused operators for this purpose. Then you could get syntax like:

soup % 'a' / 'href'

This avoids most of the problems I mentioned, but most of the currently unused operators are math operators. There's no intuitive connection between the meaning of the operator and what the operator does to a Tag or a ResultSet. It could just as easily look like this:

soup / 'a' % 'href'

So the resulting system would be hard to learn and remember. The dot operator (generally used to move from a Python object to one of its attributes) and the square-brackets operator (generally used to index a Python array or dictionary) don't have this problem. Their Beautiful Soup uses are similar to their normal Python uses.

Overall I think list comprehensions are the right tool for this sort of thing -- that's the syntax the Python devs came up with and even if I could do slightly better, the fact that it's different from normal Python would itself be a negative.

Revision history for this message

Boštjan (pedantic-coder) wrote on 2019-05-08: Re: [Bug 1828188] Re: New feature: soup.a.href

#3

for href in soup.find_all("a", href=True):
print(href)

Is this the only way to find links or do you know other variations?

On Wed, May 8, 2019, 14:10 Leonard Richardson <email address hidden> wrote:

> Thanks for writing with this suggestion. I'm assuming you're not at a
> loss as to how to do this; you just think it could be easier. Things
> could always be easier, but I don't think this is the way to do it.
>
> In Beautiful Soup, soup.a.href already has a meaning -- it directs
> Beautiful Soup to extract the <href> tag from this markup:
>
> <a>
> <href>
> </a>
>
> That doesn't make sense for HTML markup, but that could be a valid XML
> document. That's the first issue -- the dot operator already means
> something, and changing what it means would break a lot of scripts.
>
> The other issue is that I think it's very important that all the
> Beautiful Soup operators and methods have a _consistent_ meaning. To
> make soup.a.href return the href _attributes_ of the each <a> _tag_, the
> dot operator would have to change its meaning halfway through a line of
> code. It would start out meaning "get all the <a> tags" (which is itself
> different from its current meaning) and then start meaning "get all the
> 'href' values for each tag in this list".
>
> I'd have to come up with rules about when the dot operator means 'find
> the first child tag' (as it does now), when it means 'find all the child
> tags', and when it means 'find the value of an attribute'. All of these
> things are already part of the Beautiful Soup API under separate names,
> so the library would get no new capabilities. There would be many
> disagreements about which meaning should apply in which case, which I'd
> have to judge.
>
> There are similar cases, like bug #1768330, which deals with text
> extraction, where the potential payoff might be worth adding extra
> complexity. Even then I'm very reluctant to add that complexity. In this
> case, it's already pretty easy to find all the href attributes in all
> the <a> tags. Making it into a simple one-liner doesn't seem worth the
> disruption it would cause to the API as a whole.
>
> ** Changed in: beautifulsoup
> Status: New => Won't Fix
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1828188
>
> Title:
> New feature: soup.a.href
>
> Status in Beautiful Soup:
> Won't Fix
>
> Bug description:
> It would be great to have the ability to do
> soup.a.href
> which finds all the href tags (links).
> Is this something that was ever considered?
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/beautifulsoup/+bug/1828188/+subscriptions
>

for href in soup.find_all("a", href=True):
    print(href)

Is this the only way to find links or do you know other variations?

On Wed, May 8, 2019, 14:10 Leonard Richardson <leonardr@segfault.org> wrote:

> Thanks for writing with this suggestion. I'm assuming you're not at a
> loss as to how to do this; you just think it could be easier. Things
> could always be easier, but I don't think this is the way to do it.
>
> In Beautiful Soup, soup.a.href already has a meaning -- it directs
> Beautiful Soup to extract the <href> tag from this markup:
>
> <a>
> <href>
> </a>
>
> That doesn't make sense for HTML markup, but that could be a valid XML
> document. That's the first issue -- the dot operator already means
> something, and changing what it means would break a lot of scripts.
>
> The other issue is that I think it's very important that all the
> Beautiful Soup operators and methods have a _consistent_ meaning. To
> make soup.a.href return the href _attributes_ of the each <a> _tag_, the
> dot operator would have to change its meaning halfway through a line of
> code. It would start out meaning "get all the <a> tags" (which is itself
> different from its current meaning) and then start meaning "get all the
> 'href' values for each tag in this list".
>
> I'd have to come up with rules about when the dot operator means 'find
> the first child tag' (as it does now), when it means 'find all the child
> tags', and when it means 'find the value of an attribute'. All of these
> things are already part of the Beautiful Soup API under separate names,
> so the library would get no new capabilities. There would be many
> disagreements about which meaning should apply in which case, which I'd
> have to judge.
>
> There are similar cases, like bug #1768330, which deals with text
> extraction, where the potential payoff might be worth adding extra
> complexity. Even then I'm very reluctant to add that complexity. In this
> case, it's already pretty easy to find all the href attributes in all
> the <a> tags. Making it into a simple one-liner doesn't seem worth the
> disruption it would cause to the API as a whole.
>
> ** Changed in: beautifulsoup
>        Status: New => Won't Fix
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1828188
>
> Title:
>   New feature: soup.a.href
>
> Status in Beautiful Soup:
>   Won't Fix
>
> Bug description:
>   It would be great to have the ability to do
>   soup.a.href
>   which finds all the href tags (links).
>   Is this something that was ever considered?
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/beautifulsoup/+bug/1828188/+subscriptions
>

Revision history for this message

Isaac Muse (facelessuser) wrote on 2019-05-08:

#4

You could use selectors (which I actually prefer):

for href in soup.select('a[href]'):
print(href)

Revision history for this message

Leonard Richardson (leonardr) wrote on 2019-05-08:

#5

find_all() and CSS selectors are the main ways. Selectors are more compact. It's a different domain-specific language, so it's less Pythonic, but a lot of web developers already know that language.

Revision history for this message

Boštjan (pedantic-coder) wrote on 2019-05-08:

#6

Maybe
soup.a[href]
is an implementation possibility?

Revision history for this message

Isaac Muse (facelessuser) wrote on 2019-05-09:

#7

soup.a['href']

Will get the first 'a' element's href, but not all. You can do that, but you only get one.

I should also clarify with the select method, it only returns elements, not attributes directly, so it was only returning 'a' elements that have 'href' attributes, but you still have to extract the attribute:

for href in soup.select('a[href]'):
print(href['href')

I was not meaning to insinuate that it the attributes, only that it will return the elements that have the attribute you are looking for.

Beautiful Soup

New feature: soup.a.href

Bug Description

Other bug subscribers

Remote bug watches