Do a better job of distinguishing between markup that looks like a *realistic* filename and markup that doesn't.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Committed | Undecided | Unassigned |
Bug Description
After upgrading from `4.9.3` to `4.11.2`, I've started getting the following spurious `MarkupResemble
Searching the web, I found several reports that all link these warnings to a Beautiful Soup change, along with a workaround, e.g.:
- https:/
- https:/
The issue is still present in `4.12.3`:
```
% python3
Python 3.11.8 (main, Feb 7 2024, 21:52:08) [GCC 13.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(
<stdin>:1: MarkupResembles
'19.05.2024. Grude Biciklijada 2024. https:/
```
Thanks for taking the time to file this issue. For the history of this warning, you may find it useful to read these comments on bug 1873787 and bug 1955450:
https://bugs.launchpad.net/beautifulsoup/+bug/1873787/comments/1
https://bugs.launchpad.net/beautifulsoup/+bug/1955450/comments/5
Beautiful Soup decides to issue this warning if the markup is less than 256 bytes long, does not contain the '<' character (as almost all HTML fragments do), and the `BeautifulSoup._markup_resembles_filename` class method returns True. Here's the code:
https://git.launchpad.net/beautifulsoup/tree/bs4/__init__.py#n440
For security reasons, Beautiful Soup can't look to see if incoming markup actually exists as a file on disk, so it has to make a decision based on the structure of the string. Currently it checks whether the string ends in a textual file extension like ".html", or contains path characters--i.e. slashes or backslashes. Since your string contains slashes, `_markup_resembles_filename` returns True.
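To make the current decision concrete, here is a rough sketch of the check as described above. It is not the actual Beautiful Soup source: the function names and the extension list are assumptions chosen for illustration, and only the three conditions (length, no '<', filename-like structure) come from the description.

```python
# Illustrative extension list; the real list in bs4 may differ.
TEXT_EXTENSIONS = (".html", ".htm", ".xml", ".xhtml", ".txt")


def resembles_filename(markup: str) -> bool:
    """Hypothetical stand-in for the structural filename check."""
    # A textual file extension at the end of the string counts as a match.
    if markup.lower().endswith(TEXT_EXTENSIONS):
        return True
    # Any path character (slash or backslash) also counts as a match.
    return "/" in markup or "\\" in markup


def should_warn(markup: str) -> bool:
    """Apply the three conditions in the order described above."""
    if len(markup.encode("utf-8")) >= 256:
        return False  # long strings are assumed to be real markup
    if "<" in markup:
        return False  # almost all HTML fragments contain '<'
    return resembles_filename(markup)


print(should_warn("myfile.html"))                   # extension match
print(should_warn("see https://example.com/page"))  # slashes match
print(should_warn("<p>either/or</p>"))              # contains '<'
```

Under this sketch, any short string with a slash in it triggers the warning, which is why plain text containing a URL is reported as a possible filename.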
Looking at your example with my human eyes, I can see some hints that this is probably not a file path. For example, I could change `_markup_resembles_filename` to return False if it finds double slashes like in a URL, a combination of slashes and backslashes, characters with special meaning to Unix shells like #?*&>, or a colon (except near the beginning of the string, as with a Windows path). This will reduce the amount of markup that triggers this warning.
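The extra hints listed above could be sketched roughly as follows. This covers only the path-character half of the check (the file-extension test is left out), it is not the committed fix, and the function name and the exact rule for "near the beginning" (a colon is tolerated only at index 1, as in `C:\`) are assumptions.

```python
SHELL_SPECIAL = set("#?*&>")  # characters with special meaning to Unix shells


def path_chars_suggest_filename(markup: str) -> bool:
    """Hypothetical refinement: True only for realistic-looking paths."""
    has_slash = "/" in markup
    has_backslash = "\\" in markup
    if not (has_slash or has_backslash):
        return False  # no path characters at all
    if "//" in markup:
        return False  # double slashes look like a URL
    if has_slash and has_backslash:
        return False  # real paths rarely mix both separators
    if SHELL_SPECIAL & set(markup):
        return False  # unlikely characters for a filename
    # Assumption: a colon is acceptable only as a drive-letter separator.
    if any(i != 1 for i, ch in enumerate(markup) if ch == ":"):
        return False
    return True


print(path_chars_suggest_filename("data/pages/index"))
print(path_chars_suggest_filename("C:\\Users\\me\\page"))
print(path_chars_suggest_filename("see https://example.com/page"))
```

With these rules, a sentence that happens to contain a URL no longer looks like a path, while ordinary Unix and Windows paths still do.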
But there will always be some spurious warnings here. Some people who pass the string "myfile.html" into the BeautifulSoup constructor are doing the wrong thing: they actually want to open myfile.html and parse the contents. Those people should take note of the warning. But others really do want Beautiful Soup to parse the string "myfile.html", and those people should filter the warning.
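For those who really do want to parse a filename-like string, one way to filter the warning is with the standard `warnings` module. The sketch below assumes the category is `MarkupResemblesLocatorWarning` as exported by `bs4`; the `parse_quietly` helper is hypothetical, and the fallback class exists only so the snippet runs where `bs4` is not installed.

```python
import warnings

try:
    from bs4 import MarkupResemblesLocatorWarning
except ImportError:
    # Stand-in so this sketch runs without bs4; not part of any real API.
    class MarkupResemblesLocatorWarning(UserWarning):
        pass


def parse_quietly(parse, markup):
    """Call parse(markup) with the filename-resemblance warning silenced.

    The filter is scoped to this call, so the warning still appears
    everywhere else in the program.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=MarkupResemblesLocatorWarning)
        return parse(markup)
```

With `bs4` installed, usage might look like `parse_quietly(lambda m: BeautifulSoup(m, "html.parser"), "myfile.html")`. Scoping the filter to a single call, instead of calling `warnings.filterwarnings` globally, keeps the warning available for code paths where a filename really was passed by mistake.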