Priorities of BOM and from_encoding should be switched

Bug #1889014 reported by John Wodder on 2020-07-26
This bug affects 1 person
Affects: Beautiful Soup
Importance: Undecided
Assigned to: Unassigned

Bug Description

If I'm reading the bs4 source correctly, when BeautifulSoup attempts to determine the encoding of a binary document, it tries the user-specified encoding first, and then after that it tries the encoding implied by the BOM (if any). However, the WHATWG standard for determining character encodings (https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding) says that the BOM encoding should take precedence over all other encoding sources, with user-specified encodings (and transport-layer-declared encodings, like the HTTP Content-Type charset, which I would wager is a major source of `from_encoding` values) coming in next. BeautifulSoup4 should thus try the BOM encoding first in order to be conformant.
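To illustrate the ordering the report asks for, here is a minimal sketch in which a BOM, when present, wins over a caller-supplied encoding. It is not bs4's actual code: `sniff_bom`, `choose_encoding` and the `fallback` parameter are hypothetical names invented for this example, and only the three BOMs covered by the WHATWG BOM-sniffing step are checked.

```python
import codecs

# BOMs checked by the WHATWG BOM-sniffing step, mapped to Python codec names.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def choose_encoding(data: bytes, from_encoding=None, fallback="utf-8"):
    """Hypothetical helper: BOM first, then the caller's encoding, then a fallback."""
    bom_encoding, _ = sniff_bom(data)
    return bom_encoding or from_encoding or fallback


# A UTF-8 document with a BOM, wrongly labelled latin-1 by the caller.
doc = codecs.BOM_UTF8 + "<p>café</p>".encode("utf-8")
print(choose_encoding(doc, from_encoding="latin-1"))
```

With the order reversed (caller's encoding first), the same document would be decoded as latin-1, which does not fail but produces mojibake, so the error goes unnoticed.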

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this bug.

The "override_encodings" argument is designed to handle the "known definite encoding" case (https://html.spec.whatwg.org/multipage/parsing.html#parsing-with-a-known-character-encoding).

But in library code there's not a strong distinction between "known definite encoding" and "user has explicitly instructed the user agent to override the document's character encoding with a specific encoding". There's some passive voice in 12.2.3.1 -- who "knows" that the input has a certain encoding, if not the "user"?

Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?
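The two-list idea could look roughly like the sketch below. The function and parameter names (`candidate_encodings`, `encodings_before_bom`, `encodings_after_bom`) are hypothetical and chosen only for illustration; they are not taken from bs4.

```python
import codecs

# BOMs checked during sniffing, mapped to Python codec names.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def candidate_encodings(data, encodings_before_bom=(), encodings_after_bom=()):
    """Yield encodings to try, in priority order, skipping duplicates.

    encodings_before_bom: "known definite" encodings that outrank the BOM.
    encodings_after_bom:  weaker hints (e.g. an HTTP charset) tried after it.
    """
    bom = next((name for mark, name in _BOMS if data.startswith(mark)), None)
    seen = set()
    for name in (*encodings_before_bom, bom, *encodings_after_bom):
        if name and name.lower() not in seen:
            seen.add(name.lower())
            yield name


doc = codecs.BOM_UTF8 + b"<p>hi</p>"
# A transport-layer hint goes in the "after" list, so the BOM is tried first.
print(list(candidate_encodings(doc, encodings_after_bom=["latin-1"])))
# A caller who is certain of the encoding uses the "before" list instead.
print(list(candidate_encodings(doc, encodings_before_bom=["latin-1"])))
```

Keeping both lists lets the caller state how much they trust their own information, rather than the library guessing which of the two spec cases applies.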

John Wodder (jwodder) wrote :

> Would it solve your problem if there were two arguments like "override_encodings", one list to be applied before BOM sniffing and one list to be applied afterwards?

Yes, it would.
