BeautifulSoup incorrectly warns me that I'm an idiot

Bug #1955450 reported by Dale Maggee
This bug affects 1 person
Affects: Beautiful Soup
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

"UserWarning: "http://example.com" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup."

Um, no. That's not what happened; I just happened to pass in user-generated content that looks like a URL.

There's nothing strange at all about this - a URL is also a perfectly well-formed piece of XML content.

You are wasting my CPU time doing this test and clogging up my logs with incorrect trash because you're wrongly assuming I don't know what I'm doing.

A software library's job is to do as it's told and get out of my way, not to uselessly tell me about some incorrect assumption the developer has made.

This warning is a bug because it's surprising behaviour. If you want this feature in the library, it should have to be explicitly enabled with some setting, or at the very least there should be a very simple way to disable this useless test and warning.

I'd suggest "dumbass_mode" as a good setting name ;).

Sorry if my lecture-ish tone offends, but you thought it was perfectly fine to condescendingly lecture me about the difference between a URL and a piece of HTML, so I say it's fine ;)

Revision history for this message
Isaac Muse (facelessuser) wrote :

Beautiful Soup generally takes the approach of giving "helpful" error/warning messages so that a user understands why things are not working the way they expect. While every developer may have a different opinion on how helpful errors and warnings should be, Beautiful Soup has taken a more ambitious approach.

> There's nothing strange at all about this - a URL is also a perfectly well-formed piece of XML content.

It is *only* well-formed if it is inside an XML tag. Running a URL through the XML parser will yield nothing if it is not provided as the content of an actual tag:

>>> from bs4 import BeautifulSoup
>>> print(BeautifulSoup("http://example.com", 'lxml-xml'))
<?xml version="1.0" encoding="utf-8"?>
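
Wrap the same string in a tag, though, and the XML parser keeps it as a text node (a quick sketch; the <url> element name here is just an arbitrary example):

>>> print(BeautifulSoup("<url>http://example.com</url>", 'lxml-xml'))
<?xml version="1.0" encoding="utf-8"?>
<url>http://example.com</url>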

Now, if you are using an HTML parser, those are known to be quite forgiving and will often "correct" a user's HTML to be valid in some circumstances. html5lib, for instance, is known to do this quite heavily and more closely matches how modern browsers work.

Generally, Beautiful Soup expects that you are feeding it proper content within tags, not stray text fragments. The fact that some of the HTML parsers will take it does not change this. As noted above, XML will do nothing with it. That is not Beautiful Soup's behavior, but that of the underlying lxml parser when in XML mode.
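
For what it's worth, even a minimally tagged fragment sails through without the warning (quick sketch using the built-in parser):

>>> print(BeautifulSoup("<p>http://example.com</p>", 'html.parser'))
<p>http://example.com</p>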

So, personally, I think the warning is fine. While I am not the maintainer of Beautiful Soup, I do maintain a number of open-source libraries (such as the CSS selector library used by Beautiful Soup), and I view anything that helps me avoid answering the same question over and over again, when someone uses the tool in an unintended way, as a good thing.

It may annoy you that Beautiful Soup alerts you to something you think you know, but from the perspective of a maintainer who understands exactly why such a warning is there - no doubt because the same question was asked over and over - it makes perfect sense to me. And anything that makes the maintainer's life easier is okay by me. If you do not like it, I imagine you can fork it and maintain a version that does not annoy you so much.

Considering that open source maintainers are often doing this in their free time at no cost to you, I would consider checking your tone when making a request.

Revision history for this message
Dale Maggee (antisol) wrote :

> Beautiful Soup has taken a more ambitious approach.

And by doing so, it is now giving a warning which is incorrect, in a surprising way, as the default behaviour, with no way that I can see to disable it.

Weren't you arguing *for* correctness somewhere else in your response?

> It is *only* well-formed if it is inside an XML tag

A text node is perfectly valid.

Interestingly, the default parser would seem to disagree with your assessment on what is and isn't valid:

>>> from bs4 import BeautifulSoup
>>> print(BeautifulSoup("http://example.com"))
bs4/__init__.py:417: MarkupResemblesLocatorWarning: "http://example.com" looks like a URL. Beautiful Soup is not an HTTP client, idiot. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  warnings.warn(
<html><body><p>http://example.com</p></body></html>

> That is not based on Beautiful Soup's behavior, but the underlying lxml parser when in XML mode

So what you're saying here is that Beautiful Soup has no business caring whether what I pass it is valid or not, or looks like a URL or not, because it's handing it off to another parser, so this test and warning are doubly superfluous. Thanks for clearing that up :)

> anything that helps me not answer the same question over and over again when someone uses the tool in an unintended way I view as a good thing.

It takes 15 seconds to add a new entry to the FAQ section of your readme. It took a lot longer than that to implement this unnecessary, incorrect, and condescending test.

> If you do not like it, I imagine you can fork it and maintain a version that does not annoy you so much.

Aah, the good old "you can always fork if you don't like it" panacea, which in my experience often translates to "I can't be bothered making my software flexible to the needs of others (and in this case, correct) by investing the five minutes of effort it would take to make that behaviour optional. Instead, you should invest hours or days figuring out my bizarre naming conventions".

Not very surprising to see it brought out so early in the dialogue.

> Considering that open source maintainers are often doing this in their free time at no cost to you, I would consider checking your tone when making a request.

Lol, apparently somebody didn't read the whole bug report. Or the offensive warning that caused me to raise it.

Revision history for this message
Isaac Muse (facelessuser) wrote :

> Interestingly, the default parser would seem to disagree with your assessment on what is and isn't valid:

Sigh, I very clearly stated that HTML parsers are generally more forgiving. And yes, I know what a text node is, and they are wrapped in tags. That's also not the default parser; that is most likely lxml or html5lib. The default parser that ships with Python just gives you back the URL, which isn't even valid HTML.

>>> print(BeautifulSoup("http://example.com", 'html.parser'))
/usr/local/lib/python3.9/site-packages/bs4/__init__.py:431: MarkupResemblesLocatorWarning: "http://example.com" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
  warnings.warn(
http://example.com

> A text node is perfectly valid.

Not in XML, which has a very strict spec. Text nodes are perfectly valid only within the context of a tag. This is true in HTML as well, but most browsers are very forgiving, and some parsers mimic that forgiving behavior.

> Aah, the good old "you can always fork if you don't like it" panacea,

No, that is simply the response I give to entitled people who cannot communicate like adults. I'm not sure why you think people are going to be amenable to child-like rantings.

Again, I'm not the maintainer, so I'm moving on. Maybe Leonard will have more patience with you than me :).

Revision history for this message
Dale Maggee (antisol) wrote :

Interesting how you totally failed to address the majority of my response while continuing to argue about inconsequential minutiae.

> I'm not the maintainer, so I'm moving on

I have no idea why you responded in the first place, particularly with such an offensive tone. Was it purely in order to raise the level of antagonism?

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit):

Dale,

Before instituting this warning, I got many support requests from people who didn't understand why passing a filename or URL into the BeautifulSoup constructor doesn't read the file or download the URL. I don't think these people are idiots, but there's a particular thing they didn't understand, and they couldn't continue their work without an understanding.

As a maintainer, I can't be there with everyone using the library, so to handle large numbers of support requests on a given theme, I have to change the software's behavior for everyone. When I do, I have two choices: take it on myself to just make Beautiful Soup work in all situations, or add a warning that gives an explanation.

"Make it work in all situations" is a non-starter here because there _is_ no correct behavior for all situations. Most people who run BeautifulSoup("http://domain/") want, on a high level, to download the representation of that URL and parse it. But some people, like you, really do want to parse the URL as markup.

That leaves the other option: add a warning giving an explanation. To quote the documentation of Python's 'warnings' module (https://docs.python.org/3/library/warnings.html):

"Warning messages are typically issued in situations where it is useful to alert the user of some condition in a program, where that condition (normally) doesn’t warrant raising an exception and terminating the program."

That fits the situation here. It's useful to alert the user as to precisely what will happen when the code they wrote is executed, because most users of Beautiful Soup don't intend that behavior, but it doesn't warrant raising an exception, because some users _do_ intend it.

When the behavior is intentional, the warning is irrelevant and -- as you discovered -- can read as condescending. The last time someone brought this up was in bug 1873787 (https://bugs.launchpad.net/beautifulsoup/+bug/1873787). The case was very similar to yours: Beautiful Soup was being used to process text entered by users of another application, not text input by the programmer.

Bug 1873787 has a longer explanation of my thinking based on the "When to use logging" section of the Python documentation (https://docs.python.org/3/howto/logging.html#when-to-use-logging). In the end, I made warnings of this type instances of a distinct class, MarkupResemblesLocatorWarning. This allows you to use Python's standard mechanisms to filter out warnings you know to be irrelevant to your application:

---
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
import warnings

warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning)
BeautifulSoup("http://domain/") # no warning
---

This meets your request for a simple way to disable the warning.
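
If you would rather not change the warning filter for the entire process, the same standard mechanism works in a scoped form as well, roughly like this (an untested sketch; parse_user_text is just a hypothetical helper name):

---
import warnings
from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning

def parse_user_text(text):
    # Silence MarkupResemblesLocatorWarning for this call only;
    # warning behaviour elsewhere in the process is unchanged.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", MarkupResemblesLocatorWarning)
        return BeautifulSoup(text, "html.parser")

parse_user_text("http://domain/") # no warning
---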

I'm not going to add an option to disable the test itself, because the time saved is not worth the additional API complexity. If your application is performance-sensitive to the point that this test is a serious issue for you, I recommend you write your application directly against lxml's HTML parser, which is much faster than lxml plus Beautiful Soup.

Revision history for this message
Dale Maggee (antisol) wrote :

Hi Leonard,

Thanks very much for your excellent and detailed response. I can see that you have indeed thought this through pretty well, and I can't really disagree with any of your conclusions too much, though I do think that just closing those issues/emails you talk about without reply and setting up filters in your issue tracker is a better option overall - the people who don't know the difference between a URL and an HTML document will either figure it out, or they won't, and it's not an HTML parser library's place to educate them. An HTML parser library should only be for parsing HTML :)

But we don't really need to get into that - I also think that there could be some big improvements made here, even if I can't change your mind on any of the more philosophical issues.

Firstly, the wording of the warning should be different. I have a real problem with "You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup". A piece of software should not be taking that tone with the user if there's any chance at all that it's wrong. And it *is* wrong.

I find your assertion that most instances of this happening are people trying to download the URL to be dubious - that's only what most of your emails were about, and it doesn't necessarily have any bearing on real-world usage. I'd argue that the more common use case in the real world is almost certainly user-generated content being parsed, and that most working people just don't tend to look at their logs or bother to email you about warnings which can obviously be safely ignored.

Perhaps consider wording more along the lines of:

"This looks like a URL, not HTML. Perhaps you meant to download '<INPUT>' with something like urllib.get first? BeautifulSoup doesn't download things from the internet, it only parses HTML. See <LINK> for more detail. You can disable this warning with <PARAMETER>."

This is a) a more plain-English but also more in-depth explanation, with a better quick-fix hint and a link to a full explanation for the "less technical" person who might actually need this message, b) not incorrect, c) an actual quick-fix, no-web-search-needed solution for people who know enough to simply ignore it, and d) not condescending as hell.

Thanks for the code snippet - that's helpful. But I don't think it's nearly simple enough. That's two whole extra lines of code! And an import!!! My definition of "simple" would be something along the lines of:

BeautifulSoup("http://example.com", assume_im_a_dumb=False)

...I'll leave a more diplomatic option name to your more eloquent imagination.

I don't think this is too much to ask, given that the message is not correct. I don't think it's right (ethically) to be penalising people for doing the correct thing. If you must do that, you should also make an effort to reduce that penalty to the absolute minimum. Adding a new import everywhere I'm using beautifulsoup doesn't fit the bill IMO - e.g. I can't do that with a simple regex replace in my source tree.

As for the performance point, fair enough :)

Revision history for this message
Leonard Richardson (leonardr) wrote :

In revision 632 I standardized the three MarkupResemblesLocatorWarning texts to consistently use "may want to" language instead of the more judgmental "probably should" language. Previously the language was inconsistent, and "may want to" language was used only in situations where the markup was the name of a directory on disk.

I'm not going to add an argument to an already crowded method signature (and document it and support it forever) when Python already has a standard mechanism for suppressing warnings of a certain type across a running process.
