BeautifulSoup4 not reading local addresses
Bug #1407988 reported by Michael Courtney
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Beautiful Soup | Invalid | Undecided | Unassigned | |
Bug Description
When I try to create a soup of a locally stored HTML file with Beautiful Soup 4, I get a raft of error messages ending with 'maximum recursion depth exceeded'.
The command BeautifulSoup(…) works, whereas if I download the file and point BeautifulSoup at the local address with BeautifulSoup(…) or BeautifulSoup(…), I get the error messages. (The arguments to the calls were lost from the report.)
I am using Python 3.4.2.
This error does not occur for Beautiful Soup 3.
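For reference, a minimal sketch of the intended workflow: parsing a locally saved HTML file with Beautiful Soup 4. The file name and contents here are stand-ins, since the reporter's actual saved page is not attached; the `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup

# Write a small stand-in for the saved page (the reporter's actual file is unknown).
with open("saved_page.html", "w", encoding="utf-8") as f:
    f.write("<html><head><title>Example</title></head><body><p>Hi</p></body></html>")

# Parse the local file by reading its contents; plain open() takes a
# filesystem path, not a URL.
with open("saved_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

print(soup.title.string)  # -> Example
```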
I can't duplicate this and I have a number of questions.
The core of the problem: I don't understand what your open() function does. You're using it like normal Python open() but it's acting more like urllib.urlopen(). Normal Python open() can't open http: or file: URLs. I'm not a Python 3 expert but that doesn't seem to have changed. Maybe it's different on Windows?
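The distinction can be demonstrated directly. Built-in `open()` treats its argument as a filesystem path, so handing it an `http:` URL raises an `OSError` on every platform; fetching a URL requires `urllib.request.urlopen()` instead. A small sketch:

```python
import urllib.request

# Built-in open() expects a filesystem path. A URL is not a valid path,
# so this raises an OSError subclass (FileNotFoundError on POSIX,
# OSError for the invalid ':' on Windows).
try:
    open("http://www.nytimes.com/")
except OSError as exc:
    print("open() rejected the URL:", type(exc).__name__)

# urllib.request.urlopen() is the stdlib call that actually fetches URLs
# (not executed here, since it needs network access):
# page = urllib.request.urlopen("http://www.nytimes.com/")
```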
If you still have your saved copy of the file, please upload it as an attachment to this issue. The New York Times home page is one of the fastest-changing web pages in the world, and if the problem is caused by bad markup, that markup is probably long gone.
It would also be useful to see the raft of error messages you mentioned. That would help me determine whether the problem lies in the markup, in your open(), or in the Beautiful Soup constructor.
I would also like to know which parser backend you are using. Try passing 'html.parser', 'lxml', and 'html5lib' as the second argument to the BeautifulSoup constructor, and tell me if they all have the same problem.
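Trying all three backends might look like the sketch below; `lxml` and `html5lib` are optional third-party installs, so any that are missing are skipped rather than treated as failures. The sample markup is a placeholder.

```python
from bs4 import BeautifulSoup

markup = "<html><body><p>test</p></body></html>"

# Pass each backend name as the second constructor argument; a missing
# backend makes the constructor raise, which we report and move past.
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        soup = BeautifulSoup(markup, parser)
        print(parser, "->", soup.p.string)
    except Exception as exc:
        print(parser, "unavailable:", exc)
```

If only one backend misbehaves, the bug is in that parser rather than in Beautiful Soup itself.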
What does the open() method return? What if you call read() on the return value of open() before passing it into Beautiful Soup?
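Calling read() first hands the constructor a plain string, which rules out any oddity in the file-like object's behavior. A sketch, again with a hypothetical stand-in file:

```python
from bs4 import BeautifulSoup

# Stand-in for the saved page; the real file from the report is unknown.
with open("page.html", "w", encoding="utf-8") as f:
    f.write("<p>hello</p>")

# Read the whole file into a str before constructing the soup, instead of
# letting Beautiful Soup call read() on the file object itself.
with open("page.html", encoding="utf-8") as f:
    html = f.read()

soup = BeautifulSoup(html, "html.parser")
print(soup.p.string)  # -> hello
```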