Bug #2052988 “Do a better job of distinguishing between markup t...” : Bugs : Beautiful Soup

Leonard Richardson (leonardr) wrote on 2024-02-13 (last edit on 2024-02-13):

#1

Thanks for taking the time to file this issue. For the history of this warning, you may find it useful to read these comments on bug 1873787 and bug 1955450:

https://bugs.launchpad.net/beautifulsoup/+bug/1873787/comments/1
https://bugs.launchpad.net/beautifulsoup/+bug/1955450/comments/5

Beautiful Soup issues this warning if the markup is less than 256 bytes long, does not contain the '<' character (which almost all HTML fragments do contain), and the BeautifulSoup._markup_resembles_filename class method returns True. Here's the code:

https://git.launchpad.net/beautifulsoup/tree/bs4/__init__.py#n440

For security reasons, Beautiful Soup can't look to see if incoming markup actually exists as a file on disk, so it has to make a decision based on the structure of the string. Currently it checks whether the string ends in a textual file extension like ".html", or contains path characters, i.e. slashes or backslashes. Since your string contains slashes, _markup_resembles_filename returns True.
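
A minimal sketch of the checks described above, for illustration only. It is not the code in bs4/__init__.py: the function name `looks_like_filename`, the extension list, and the order of the checks are all assumptions made for this example.

```python
# Hypothetical sketch of the heuristic described above; NOT the bs4 implementation.
TEXT_EXTENSIONS = (".html", ".htm", ".xml", ".xhtml", ".txt")  # assumed list


def looks_like_filename(markup: str) -> bool:
    """Guess from the shape of the string alone; the disk is never consulted."""
    if len(markup.encode("utf8")) >= 256:
        return False  # long strings are assumed to be real markup
    if "<" in markup:
        return False  # almost every HTML fragment contains '<'
    if markup.lower().endswith(TEXT_EXTENSIONS):
        return True  # ends in a textual file extension
    # Otherwise, any path separator is enough to trigger the warning.
    return "/" in markup or "\\" in markup
```

Under these assumptions, any short string with a slash in it counts as a filename, which is how ordinary text that happens to contain slashes ends up triggering the warning.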

Looking at your example with my human eyes, I can see some hints that this is probably not a file path. For example, I could change _markup_resembles_filename to return False if it finds double slashes like in a URL, a combination of slashes and backslashes, characters with special meaning to Unix shells like #?*&>, or a colon (except near the beginning of the string, as with a Windows path). This will reduce the amount of markup that triggers this warning.
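
The hints listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was later committed: the name `probably_not_a_path` is made up, and "near the beginning of the string" is interpreted narrowly as the drive-letter position (index 1).

```python
# Hypothetical sketch of the proposed hints; names and thresholds are assumptions.
SHELL_SPECIAL = set("#?*&>")


def probably_not_a_path(markup: str) -> bool:
    """Return True if the string carries a hint that it is not a file path."""
    if "//" in markup:
        return True  # double slashes, as in a URL
    if "/" in markup and "\\" in markup:
        return True  # mixed separators are unusual in a real path
    if SHELL_SPECIAL & set(markup):
        return True  # characters with special meaning to Unix shells
    # A colon is tolerated only at index 1, as in the Windows path "C:\docs".
    return any(ch == ":" and i != 1 for i, ch in enumerate(markup))
```

A filename check could consult a helper like this first and skip the warning whenever it returns True.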

But there will always be some spurious warnings here. Some people who pass the string "myfile.html" into the BeautifulSoup constructor are doing the wrong thing: they actually want to open myfile.html and parse the contents. Those people should take note of the warning. But others really do want Beautiful Soup to parse the string "myfile.html", and those people should filter the warning.

Leonard Richardson (leonardr) wrote on 2024-02-13:

#2

I committed some improvements to the filename detection as revision 0120dbe in the 4.13 branch. Regardless of whether this solves your issue, I recommend filtering the warning, since you know you're not passing a filename into the BeautifulSoup constructor, and there's no telling what markup you might receive from an external source in the future.

Changed in beautifulsoup:
status:	New → Fix Committed
summary:	- spurious "MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup." warnings
	+ Do a better job of distinguishing between markup that looks like a realistic filename and markup that doesn't.

Matija Nalis (mnalis) wrote on 2024-02-13:

#3

Thanks for that background, Leonard, it's much appreciated!

I can see why the change was made, although I probably would've done it differently, e.g. only use special handling if the string starts with regex `^https?://` or `^.:\\`, or ends with `\.[a-z]{3,4}$`. But as you said, there will always be some false positives when trying to "automagically" handle such values.
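
The three patterns above, combined into one check, might look like this. It is only a sketch of the suggestion as worded in this comment; the names `LOCATOR_HINT` and `resembles_locator` are made up for the example.

```python
import re

# The three patterns from the comment above, joined into one alternation.
LOCATOR_HINT = re.compile(r"^https?://|^.:\\|\.[a-z]{3,4}$")


def resembles_locator(markup: str) -> bool:
    """True if the string starts like a URL or drive path, or ends like a file."""
    return LOCATOR_HINT.search(markup) is not None
```

Note that under this rule a bare relative path such as `a/b/c` would no longer trigger the special handling, and neither would a two-letter extension such as `.md`.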

However, I was somewhat surprised that `warnings.filterwarnings` is the officially recommended way to handle it. I personally would only consider ignoring warnings like this a quick kludge/workaround, to be revisited as soon as a properly fixed package is released. (IOW, IMHO a warning is something whose root cause one should find and fix, rather than ignore because it doesn't seem related to one's case.)

If one can get over the rudeness of the poster in the mentioned issue, I too feel a much cleaner solution would be something akin to `BeautifulSoup("http://example.com", force_html=True)` or `BeautifulSoup("http://example.com", ignore_urls=False)` or similar, to allow the user to *explicitly* specify what handling they want.

While I get your concerns about documenting and supporting it, I'd find such a solution much cleaner and preferable. `filterwarnings()` sounds almost as *dirty* as the library dying in the middle of parsing and the caller having to handle it with try/except.

But as the saying goes, "nothing is ever hard for the man who doesn't have to do it himself", so I'll leave the final decision to you.

Leonard Richardson (leonardr) wrote on 2024-02-13 (last edit on 2024-02-13):

#4

The case of markup that's just a URL is handled separately, by a different method that issues a MarkupResemblesLocatorWarning with a different message. As you suggest, I may well restrict _markup_resembles_filename further to trigger only on markup that ends with an apparent file extension. That's the most common type of filename-as-markup error (you download a file using your web browser and pass its filename into Beautiful Soup). The second most common filename-as-markup error (it's just a temporary file called "foo" or something) already won't be caught, because there's usually no slash in the path. I can always refine _markup_resembles_filename later if I start getting support requests again.

IMO filtering a warning that you know not to be applicable *is* a way of explicitly handling the warning, and it's the one documented in the Python standard library, so it's better than anything else I might come up with. Changing an internal API to not issue the warning does exactly the same thing as putting a filter in place; it just doesn't look like "filtering a warning" is what you're doing.

Aesthetics aside, there's also the problem that MarkupResemblesLocatorWarning is not the only warning of this type. I just added another warning class, AttributeResemblesVariableWarning, in response to issue 2025089. There the problem is that Beautiful Soup users sometimes type "_class" instead of "class_" when filtering tags on the "class" attribute. ("class_" itself being a necessary hack because "class" is a Python reserved word.) Currently, Beautiful Soup goes ahead and filters on the "_class" attribute, which shows up as a silent failure. The warning doesn't change that behavior, it just points out that they may have misspelled "class_" in their code.
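
To make the "_class" case concrete, here is a small stand-alone sketch. It does not use Beautiful Soup: `attribute_filters` is a hypothetical helper and the warning class is a stand-in defined locally. It only mirrors the behaviour described above, where the typo is still honoured but flagged.

```python
import warnings


class AttributeResemblesVariableWarning(UserWarning):
    """Stand-in for the warning class named above (assumption for this sketch)."""


def attribute_filters(**kwargs):
    """Hypothetical helper: turn keyword arguments into attribute filters."""
    filters = {}
    for name, value in kwargs.items():
        if name == "_class":
            # Behaviour is unchanged; the warning only flags a likely typo.
            warnings.warn(
                'Filtering on the "_class" attribute; did you mean "class_"?',
                AttributeResemblesVariableWarning,
                stacklevel=2,
            )
        # "class_" is the spelling that maps to the real "class" attribute.
        key = "class" if name == "class_" else name
        filters[key] = value
    return filters
```

With this helper, `attribute_filters(class_="nav")` filters on `class`, while `attribute_filters(_class="nav")` still filters on `_class` and emits the warning.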

But you know there's someone out there who really does want to filter on the "_class" attribute, and that warning is going to annoy them once they upgrade to 4.13. The best thing for them to do is filter the warning. The warning was based on guesswork that works in most cases, but was wrong in their case. I can't keep adding arguments to the BeautifulSoup constructor to suppress different kinds of warnings: that's what the warnings module is for.

Writing this out, I do think I'll give all of these warnings a common superclass: GuessworkBasedWarning or UnusualUsageWarning or something that doesn't sound authoritative the way DeprecationWarning does. That way it's easy to express "I know what I'm doing" by filtering all warnings of that class:

warnings.filterwarnings("ignore", category=UnusualUsageWarning)
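
The effect of a common superclass can be shown with the standard library alone. In the sketch below the three classes are local stand-ins (the superclass name is one of the candidates floated above, not a confirmed API), and `categories_seen` is a made-up helper that records which warnings survive the filter.

```python
import warnings


class UnusualUsageWarning(UserWarning):
    """Stand-in superclass; the final name was still undecided in this thread."""


class MarkupResemblesLocatorWarning(UnusualUsageWarning):
    """Stand-in subclass."""


class AttributeResemblesVariableWarning(UnusualUsageWarning):
    """Stand-in subclass."""


def emit_all():
    warnings.warn("looks like a filename", MarkupResemblesLocatorWarning)
    warnings.warn("did you mean class_?", AttributeResemblesVariableWarning)
    warnings.warn("unrelated", DeprecationWarning)


def categories_seen():
    """Return the categories still reported once the superclass is filtered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # One filter on the superclass covers every subclass.
        warnings.filterwarnings("ignore", category=UnusualUsageWarning)
        emit_all()
    return [w.category for w in caught]
```

Calling `categories_seen()` records only the `DeprecationWarning`: both subclasses are matched by the single superclass filter, and unrelated warnings are left alone.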

Leonard Richardson (leonardr) wrote on 2024-02-13 (last edit on 2024-02-13):

#5

Another warning that's a bit like this is XMLParsedAsHTMLWarning, which is issued when you use an HTML parser to parse an XML document that's not XHTML. Without the warning, everything kind of works, but you can get bizarre un-XML-like behavior depending on what you're doing. But some people can't or don't want to install lxml, or they have other reasons for doing what they're doing. That's fine; they can filter the warning.

GuessedAtParserWarning is *not* like this. If you don't specify a parser to use, the behavior of your script across environments is not precisely defined, and you need to fix the issue.
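
One way to encode that distinction in a project's warning policy is sketched below. Escalating to an error is an assumption of this example, not something proposed in the thread, and both classes are local stand-ins so the snippet runs without Beautiful Soup; `configure_warning_policy` is a made-up name.

```python
import warnings


class XMLParsedAsHTMLWarning(UserWarning):
    """Stand-in for the bs4 class of the same name."""


class GuessedAtParserWarning(UserWarning):
    """Stand-in for the bs4 class of the same name."""


def configure_warning_policy():
    # A deliberate choice the user has made: safe to silence.
    warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)
    # A real problem in the calling code: fail loudly instead of hiding it.
    warnings.filterwarnings("error", category=GuessedAtParserWarning)
```

With this policy in place, a guessed parser raises an exception during testing, and the fix is to name the parser explicitly in the constructor call.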

Matija Nalis (mnalis) wrote on 2024-02-13:

#6

> IMO filtering a warning that you know not to be applicable *is* a way of explicitly handling the warning

In general, I'd agree with you if the user knew with absolute certainty that the warning applies exactly to that one use case, and that its behaviour will never change. However, each of those claims is usually a tall order even by itself; combined, they are very taxing on the user. What will IMHO more likely happen instead is that most users will either:

- lock themselves to the exact version of the library (one they've checked works, having verified that its code does exactly what they want) to avoid surprises, which will then cause problems when e.g. a security issue in the library is detected, or
- just add that ignore without much checking of what it does, as long as the warning goes away and things seem to work (e.g. checking only that it "fixes" one specific use case, not whether it affects anything else too), and never check whether it changes in the future either, thus opening themselves to a whole lot of bugs which will be silently ignored

Very few users will actually verify that the warning applies to their use case (and ONLY their use case), and then keep re-verifying that behaviour on every library upgrade (which is the only way ignoring warnings could safely work).
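
The risk of an over-broad ignore can be reduced with the standard `warnings` module by matching on the message text as well as the category, and by limiting the filter to a single call. The sketch below assumes the message wording quoted earlier in this thread; `call_quietly` is a made-up helper and the warning class is a local stand-in.

```python
import warnings


class MarkupResemblesLocatorWarning(UserWarning):
    """Stand-in; with bs4 installed, import the real class from bs4 instead."""


def call_quietly(func, *args, **kwargs):
    """Run one call with a single, narrowly matched warning suppressed."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            # Regex matched against the start of the warning message.
            message=r".*looks more like a filename",
            category=MarkupResemblesLocatorWarning,
        )
        return func(*args, **kwargs)
```

The filter disappears when the `with` block exits, so other call sites, and other messages issued under the same category, still warn as usual.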

I do agree with you, though, that `MarkupResemblesLocatorWarning` in particular is probably specific enough that its behaviour might never change.

> I can't keep adding arguments to the BeautifulSoup constructor to suppress different kinds of warnings: that's what the warnings module is for.

I agree with you here completely, but that is not what I suggested.

The suggestion was NOT to "add a constructor argument to ignore warnings"; that would indeed be not only pointless but actively contraindicated, given the existence of warnings.filterwarnings().

However, when I wrote that, I was under the impression that BeautifulSoup actually does support reading from a file or URL, and was guessing what to do (open the file/URL, or parse as raw HTML), just like it guesses for "GuessedAtParserWarning".

Thus my idea was instead to add a constructor argument that forces the behaviour of the library, i.e. if `assume_data_is_always_html_fragment=True` is set, then the BeautifulSoup() call would never even attempt to use the data as a file or URL (and thus would never need to warn the user "did you mean this as a file/URL or as an HTML fragment?", as the user already explicitly said what they want when calling BeautifulSoup).

But if BeautifulSoup never actually supports URL/file handling (which seems to be the case, if I understand you correctly), then `assume_data_is_always_html_fragment=True` would indeed reduce to "just suppress a warning", as you indicated.

Note that my (wrong) assumption that it does support URL/file handling comes from the phrasing of the warning message "MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.", which to me seems to imply that it is capable of using a file/URL too, but is currently just guessing what to do, so it would be better for the user to specify it explicitly if they mea...
