Do a better job of distinguishing between markup that looks like a *realistic* filename and markup that doesn't.

Bug #2052988 reported by Matija Nalis
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Fix Committed
Undecided
Unassigned

Bug Description

Upgrading from `4.9.3` to `4.11.2` I've started getting following spurious `MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.` warning from tools using BeautifulSoup.

searching the web found several reports linking them all to BeautifulSoup change, and a workaround, e.g.:
- https://gitlab.com/chaica/feed2toot/-/issues/76
- https://github.com/benbusby/whoogle-search/issues/967

issue is still present in `4.12.3`:

```
% python3
Python 3.11.8 (main, Feb 7 2024, 21:52:08) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(b'19.05.2024. Grude Biciklijada 2024. https://biciklijade.com/detail/4704 #Grude', 'html.parser').get_text()
<stdin>:1: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
'19.05.2024. Grude Biciklijada 2024. https://biciklijade.com/detail/4704 #Grude'
```

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit ):

Thanks for taking the time to file this issue. For the history of this warning, you may find it useful to read these comments on bug 1873787 and bug 1955450:

https://bugs.launchpad.net/beautifulsoup/+bug/1873787/comments/1
https://bugs.launchpad.net/beautifulsoup/+bug/1955450/comments/5

Beautiful Soup decides to issue this warning if the markup is less than 256 bytes long, does not contain the '<' character (as almost all HTML fragments do), and if BeautifulSoup._markup_resembles_filename class method returns True. Here's the code:

https://git.launchpad.net/beautifulsoup/tree/bs4/__init__.py#n440

For security reasons, Beautiful Soup can't look to see if incoming markup actually exists as a file on disk, so it has to make a decision based on the structure of the string. Currently it checks whether the string ends in a textual file extension like ".html", or contains path characters--i.e. slashes or backslashes. Since your string contains slashes, _markup_resembles_filename returns true.

Looking at your example with my human eyes, I can see some hints that this is probably not a file path. For example, I could change _markup_resembles_filename to return False if it finds double slashes like in a URL, a combination of slashes and backslashes, characters with special meaning to Unix shells like #?*&>, or a colon (except near the beginning of the string, as with a Windows path). This will reduce the amount of markup that triggers this warning.

But there will always be some spurious warnings here. Some people who pass the string "myfile.html" into the BeautifulSoup constructor are doing the wrong thing: they actually want to open myfile.html and parse the contents. Those people should take note of the warning. But others really do want Beautiful Soup to parse the string "myfile.html", and those people should filter the warning.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I committed some improvements to the filename detection as revision 0120dbe in the 4.13 branch. Regardless of whether this solves your issue, I recommend filtering the warning, since you know you're not passing a filename into the BeautifulSoup constructor, and there's no telling what markup you might receive from an external source in the future.

Changed in beautifulsoup:
status: New → Fix Committed
summary: - spurious "MarkupResemblesLocatorWarning: The input looks more like a
- filename than markup. You may want to open this file and pass the
- filehandle into Beautiful Soup." warnings
+ Do a better job of distinguishing between markup that looks like a
+ *realistic* filename and markup that doesn't.
Revision history for this message
Matija Nalis (mnalis) wrote :

Thanks for that background, Leonard, it's much appreciated!

I can see why the change was done -- although I probably would've done it differently - e.g. only use special handling if the string starts with regex `^https?://` or `^.:\\` or ends with `\.[a-z]{3,4}$`). But as you said, there would always be some false positives when trying to "automagically" handle such values.

However, I was somewhat surprised that `warnings.filterwarnings` is the officially recommended way to handle it. I personally would only consider such ignoring of warnings as a quick kludge/workaround, and to be revisited as soon as properly fixed package is released. (IOW, IMHO warnings are something which one should find a root cause of and fix it, instead of ignoring them if they do not seem related to their case)

If one can get over the rudeness of the poster in mentioned issue, I'd too feel much cleaner solution would be something akin to `BeautifulSoup("http://example.com", force_html=True)` or `BeautifulSoup("http://example.com", ignore_urls=False)` or similar, to allow user to *explicitly* specify what handling they want.

While I get your concerns about documenting and supporting it, I'd find such solution much cleaner and preferable. `filterwarnings()` sounds almost as *dirty* as library dying in the middle of parsing, and caller having to handle it with try/except.

But as the saying goes, "nothing is ever hard for the man who doesn't have to do it himself", so I'll leave the final decision to you.

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit ):

The case of markup that's just a URL is handled separately, by a different method that issues a MarkupResemblesLocatorWarning with a different message. As you suggest, I may well restrict _markup_resembles_filename further to trigger only on markup that ends with an apparent file extension. That's the most common type of filename-as-markup error (you download a file using your web browser and pass its filename into Beautiful Soup). The second most common filename-as-markup error (it's just a temporary file called "foo" or something) already won't be caught, because there's usually no slash in the path. I can always refine _markup_resembles_filename later if I start getting support requests again.

IMO filtering a warning that you know not to be applicable *is* a way of explicitly handling the warning, and it's the one documented in the Python standard library, so it's better than anything else I might come up with. Changing an internal API to not issue the warning does exactly the same thing as putting a filter in place; it just doesn't look like "filtering a warning" is what you're doing.

Aesthetics aside, there's also the problem that MarkupResemblesLocatorWarning is not the only warning of this type. I just added another warning class, AttributeResemblesVariableWarning, in response to issue 2025089. There the problem is that Beautiful Soup users sometimes type "_class" instead of "class_" when filtering tags on the "class" attribute. ("class_" itself being a necessary hack because "class" is a Python reserved word.) Currently, Beautiful Soup goes ahead and filters on the "_class" attribute, which shows up as a silent failure. The warning doesn't change that behavior, it just points out that they may have misspelled "class_" in their code.

But you know there's someone out there who really does want to filter on the "_class" attribute, and that warning is going to annoy them once they upgrade to 4.13. The best thing for them to do is filter the warning. The warning was based on guesswork that works in most cases, but was wrong in their case. I can't keep adding arguments to the BeautifulSoup constructor to suppress different kinds of warnings: that's what the warnings module is for.

Writing this out, I do think I'll give all of these warnings a common superclass: GuessworkBasedWarning or UnusualUsageWarning or something that doesn't sound authoritative the way DeprecationWarning does. That way it's easy to express "I know what I'm doing" by filtering all warnings of that class:

warnings.filterwarnings("ignore", category=UnusualUsageWarning)

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit ):

Another warning that's a bit like this is XMLParsedAsHTMLWarning, which is issued when you use an HTML parser to parse an XML document that's not XHTML. Without the warning, everything kind of works, but you can get bizarre un-XML-like behavior depending on what you're doing. But some people can't or don't want to install lxml, or they have other reasons for doing what they're doing. That's fine; they can filter the warning.

GuessedAtParserWarning is *not* like this. If you don't specify a parser to use, the behavior of your script across environments is not precisely defined, and you need to fix the issue.

Revision history for this message
Matija Nalis (mnalis) wrote :
Download full text (5.2 KiB)

> IMO filtering a warning that you know not to be applicable *is* a way of explicitly handling the warning

In general, I'd agree with you if the user knew with absolute certainty that that warning applies exactly to that one use case, and that it will never change its behaviour. However usually each of those claims is tall order even by itself; combined, they are very taxing on the user. What will IMHO more likely happen instead is that most users will either:

- lock themselves to the exact version of library (that they've checked works and verified its code it does exactly what they want) to avoid surprises, but that will then bring problems when e.g. security issue in library is detected, or
- they will just add that ignore without much checking what it does as long as warning goes away and it seems to work (e.g. check only that it "fixes" specific one use case, not if it affects any other things too), and will not be checking if it changes in the future either - and thus open themselves to the whole lot of bugs which will be silently ignored

Very few users will actually verify if the warning applies to their use case (and ONLY their use case), and then keep re-verifying that behaviour on every library upgrade (which is only way ignoring warnings could safely work).

Although I agree with you that `MarkupResemblesLocatorWarning` in particular is probably specific enough that its behaviour might never change.

> I can't keep adding arguments to the BeautifulSoup constructor to suppress different kinds of warnings: that's what the warnings module is for.

I agree with you here completely - but that is not what I suggested.

But the suggestion was NOT to "add constructor to ignore warnings" - that would indeed be not only pointless but actively contraindicated, given existence of warnings.filterwarnings()

However while I wrote that, I was under impression that BeautifulSoup actually does support reading from file or URL, and was guessing what to do (open file/url, or parse as raw HTML) - just like it guesses for "GuessedAtParserWarning"

Thus my idea was instead to add constructor that forces the behaviour of the library, i.e. if `assume_data_is_always_html_fragment=True` is set, then the BeautifulSoup() call would never even attempt to use it as a file or URL (and thus would never need to warn the user "did you mean this as file/URL or as HTML fragment?" - as the user already explicitly said what they want when calling the BeautifulSoup)

But if BeautifulSoup actually never supports URL / file handling (which seems to be the case, if I understand you correctly), than that `assume_data_is_always_html_fragment=True` would indeed reduce itself to "just suppress a warning", as you indicated.

Note that my (wrong) assumption that it does support URL/file handling comes from the phrasing that warning message "MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup." which to me seems to imply that it is capable of using file/url too, but is currently just guessing what to do, so it would be better for user to specify it explicitly if they mea...

Read more...

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit ):

To be explicit: Beautiful Soup does not support fetching resources by URL, or opening a file by its filename. If you pass in a filehandle, Beautiful Soup will read the contents of the file and parse the contents. If you pass in a string, it will parse the string, no matter what the string looks like.

The MarkupResemblesLocatorWarning messages used to try to explain this with text like "Beautiful Soup is not an HTTP client," but I took that language out. This was IMO the main thing some users found condescending about the warnings, because they weren't *trying* to use Beautiful Soup as an HTTP client. They were using Beautiful Soup as intended, passing in data they got from somewhere else. Then Beautiful Soup was issuing a warning, complaining about that data, saying it wasn't capable of doing something they hadn't asked it to do in the first place.

The UnusualUsageWarning family of warnings are explicitly designed to be suppressed if you know what you're doing. The thing I'm taking from this conversation is that the text of each warning needs to explicitly say how to make the warning go away. 99.5% of the time, that means fixing a problem with something you typed five minutes ago. But 0.5% of the time, the warning is spurious and the right thing to do is suppress it.

I haven't gotten any complaints about the wording of GuessedAtParserWarning, which probably triggers 100x more often than MarkupResemblesLocatorWarning, because it explains what you need to change to make the warning go away. I haven't gotten any complaints about XMLParsedAsHTMLWarning, which explicitly says you should either fix the problem, or decide there is no problem and filter the warning.

I'm going to do something like your suggestion, but I'm going to put the entire explanatory text in the warning message itself, rather than create a cross-reference that most people won't follow. That's what I do with GuessedAtParserWarning.

Revision history for this message
Matija Nalis (mnalis) wrote :

> The thing I'm taking from this conversation is that the text of each warning needs to explicitly say how to make the warning go away.

Yup

> 99.5% of the time, that means fixing a problem with something you typed five minutes ago. But 0.5% of the time, the warning is spurious and the right thing to do is suppress it.

But, if I understood you correctly, everybody who KNOWS that they use the lib correctly (i.e. that they are always feeding HTML fragments instead of URL/file to BeautifulSoup) should always use that warnings.filterwarnings code fragment -- regardless if in majority of cases that warning won't actually trigger for them (because it will randomly trigger eventually even if rarely, and they know with 100% certainty that if it triggers, it would always be spurious). Yes?

> I'm going to do something like your suggestion, but I'm going to put the entire explanatory text in the warning message itself

Sure, that sounds even better - it's just that I myself didn't see how to make it clear enough without making warning message huge so I suggested using external URL. It would IMHO still be beneficial to explain the procedure (and its reasoning, if possible) in documentation, even if you don't link to it from warning message.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.