Beautiful Soup

Suppress UserWarning * looks like a URL

Bug #1873787 reported by Ben Armstrong on 2020-04-20

This bug affects 2 people

Affects          Status         Importance   Assigned to   Milestone
Beautiful Soup   Fix Released   Undecided    Unassigned

Bug Description

I would like to be able to suppress this message:

UserWarning: "https://*" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.

In my use case, I am parsing comments a user typed into the description of an iNaturalist observation. The website allows arbitrary text with a "safe" subset of HTML. It is common practice for observers to enter nothing but a link to an observation made on another platform (e.g. eBird, YouTube). bs4 wrongly flags this with a warning that suggests a likely programmer error, whereas it is simply user data from a web page that happens to contain nothing but a URL, and so should be considered legitimate input.

Would you please provide for passing an option to the parser to suppress the warning?

This affects these downstreams:

- https://github.com/dlon/html2markdown
    - the package I'm using to parse the HTML description, making it into acceptable
      input for Discord
- https://github.com/synrg/dronefly
    - my Discord bot code which uses html2markdown, which in turn uses bs4

The input my bot is handling is the response from this API request:

https://api.inaturalist.org/v1/observations/34067615

This contains: "description": "https://ebird.org/view/checklist/S48279561"

The way this information is presented on the iNaturalist website is: https://www.inaturalist.org/observations/34067615

My bot produces a much simplified preview of this page for members of the Discord channel where the link was shared. Because descriptions often contain HTML, I need to pass them through bs4 (via html2markdown) to convert them to Markdown for display in Discord.

I could suppress the warning using the warnings module, but that is a rather dirty solution and is prone to break if you ever change your code so that my filter no longer catches it.
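For reference, the filter-based workaround might look roughly like the sketch below. It uses only the standard library: `parse_description` and `parse_quietly` are hypothetical names, and `parse_description` is a stand-in that emits a warning with similar wording so the example runs without bs4. The filter matches on the message text, so it would silently stop matching if the wording ever changed.

```python
import warnings


def parse_description(markup):
    # Hypothetical stand-in for the bs4 call; emits a similar UserWarning.
    if markup.startswith(("http://", "https://")):
        warnings.warn(
            '"%s" looks like a URL. Beautiful Soup is not an HTTP client.' % markup,
            UserWarning,
        )
    return markup


def parse_quietly(markup):
    with warnings.catch_warnings():
        # The message argument is a regular expression matched against the
        # start of the warning text, ignoring case.
        warnings.filterwarnings(
            "ignore", message=r".*looks like a URL", category=UserWarning
        )
        return parse_description(markup)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parse_quietly("https://ebird.org/view/checklist/S48279561")
    leaked = len(caught)  # number of warnings that escaped the filter
```

The filter is scoped with `catch_warnings()` so it does not affect warnings raised elsewhere in the program.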

I could suppress the warning by putting a blank in front of the whole description string before passing it to html2markdown, but that couples my code to a specific implementation choice made by bs4.

Therefore, I would like to be able to write intentional code instead that tells bs4 to not issue this warning. Ideally, it would be a module-level control I could set without having to bother html2markdown to also support passing through an option when it instantiates a parser, but if you decide to offer it in kwargs on the parser & if that means I'd have to file a bug on html2markdown, I would happily do that.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2020-04-20:

Let's talk this out using https://docs.python.org/3/howto/logging.html#when-to-use-logging as a guide.

We want to "Issue a warning regarding a particular runtime event". The recommendation is:

* a UserWarning, "if the issue is avoidable and the client application should be modified to eliminate the warning"
* logging.warning() "if there is nothing the client application can do about the situation, but the event should still be noted"

IMO those are the two options. An example that comes to mind is bug 1013862, where I changed a UserWarning to a logging call because I got convinced that "there is nothing the client application can do". There are two built-in ways of doing this, and it's a very tough sell to convince me I should add a third way by changing the Beautiful Soup API.
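To make the two options concrete, here is a minimal sketch using only the standard library. The function names and the logger name are hypothetical and are not part of Beautiful Soup.

```python
import logging
import warnings

log = logging.getLogger("example_library")  # hypothetical logger name


def check_with_warning(markup):
    # Option 1: the caller can fix the problem, so warn about the calling code.
    if markup.startswith("https://"):
        warnings.warn("markup looks like a URL", UserWarning, stacklevel=2)
    return markup


def check_with_logging(markup):
    # Option 2: the caller cannot do anything about it, so just note the event.
    if markup.startswith("https://"):
        log.warning("markup looks like a URL: %s", markup)
    return markup
```

A caller silences the first with a warnings filter and the second through logging configuration, for example by raising the level of the `example_library` logger.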

So the question in my mind is whether "the client application should be modified" here or whether "there is nothing the client application can do". The problem is that there's no clear answer for a library like Beautiful Soup. Up to today, I would have sided 100% with "the client application should be modified", because I'd only encountered cases where the "client application" looked like:

BeautifulSoup("https://url-i-want-to-download.com/")

But in your case, "there is nothing the client application can do", because you're passing in data you got from somewhere else.

On balance, I think "the client application should be modified" is still the way to go, and since you're in "there is nothing the client application can do" territory, you should take that into account by filtering the warning. Among other things, you say you don't want a solution to break if I change the code in the future. That's reasonable, but if I change this from a warning to a logging call, it would break someone else's current solution based on filtering.

The furthest I'm willing to go down the path of special code is to define a custom subclass of UserWarning for this. That would let a caller filter this warning without filtering any others, in a way that's backwards compatible with existing solutions.

Let me know if a warning subclass would be better for you. There's a similar warning issued if the markup looks like a filename, so I'd cover that case as well, with a class called MarkupResemblesLocatorWarning or something.
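A sketch of what the subclass approach would give callers, using only the standard library. The class is defined locally for illustration, borrowing the name floated above. `parse` and the "no parser" warning text are hypothetical stand-ins.

```python
import warnings


class MarkupResemblesLocatorWarning(UserWarning):
    """Illustrative local definition; not imported from bs4."""


def parse(markup):
    # Hypothetical stand-in that issues two different warnings.
    warnings.warn("no parser was specified", UserWarning)
    if markup.startswith(("http://", "https://")):
        warnings.warn("markup looks like a URL", MarkupResemblesLocatorWarning)
    return markup


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Filter only the subclass; other UserWarnings still get through.
    warnings.simplefilter("ignore", category=MarkupResemblesLocatorWarning)
    parse("https://ebird.org/view/checklist/S48279561")
    remaining = [str(w.message) for w in caught]

with warnings.catch_warnings(record=True) as caught_all:
    warnings.simplefilter("always")
    # An existing filter on UserWarning also catches the subclass.
    warnings.simplefilter("ignore", category=UserWarning)
    parse("https://ebird.org/view/checklist/S48279561")
    remaining_with_old_filter = len(caught_all)
```

The second block shows why this is backwards compatible: a filter written against `UserWarning` keeps working because the new class inherits from it.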

Ben Armstrong (synrg) wrote on 2020-04-21:

A warning subclass would be perfect. Thanks.

Leonard Richardson (leonardr) wrote on 2020-04-21:

OK, revision 569 introduces GuessedAtParserWarning (when no parser was specified) and MarkupResemblesLocatorWarning (when the markup looks like a URL or local path).
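Assuming a Beautiful Soup release that includes this revision and exports `MarkupResemblesLocatorWarning` from the `bs4` package, a caller could filter just this warning along the following lines. `description_to_text` and `HAVE_BS4` are hypothetical names, and the import guard is only there so the sketch still runs where bs4 is missing or too old.

```python
import warnings

try:
    # Assumes a bs4 release that includes the new warning class.
    from bs4 import BeautifulSoup, MarkupResemblesLocatorWarning
    HAVE_BS4 = True
except ImportError:
    HAVE_BS4 = False


def description_to_text(description):
    """Return the text of a user-supplied description, minus the locator warning."""
    if not HAVE_BS4:
        return description  # fallback for this sketch only: no parsing is done
    with warnings.catch_warnings():
        # Ignore only the "looks like a URL or path" warning.
        warnings.simplefilter("ignore", category=MarkupResemblesLocatorWarning)
        # Naming the parser explicitly also avoids GuessedAtParserWarning.
        return BeautifulSoup(description, "html.parser").get_text()
```

For a module-level control, the same `simplefilter` call could instead be made once at import time, outside `catch_warnings()`.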