Beautiful Soup

Activity log for bug #2047713

Date	Who	What changed	Old value	New value	Message
2023-12-29 20:20:40	Chris Papademetrious	bug			added bug
2023-12-29 20:22:32	Chris Papademetrious	description

Old value:

Beautiful Soup and XSLT/XQuery object types correlate as follows:

* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)

In XSLT, a node() object type matches any object type that can be contained in the document. For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:

====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====

where is-whitespace-text() is a function that returns true() for whitespace text() nodes.

I want to similarly filter through arbitrary object types in Beautiful Soup too. But if I define a custom filter function:

====
def is_whitespace_text(tag) -> bool:
    return isinstance(tag, NavigableString) and tag.text.isspace()

def is_not_whitespace_text(tag) -> bool:
    return not is_whitespace_text(tag)
====

there is not a "node" argument that considers all object types that I can pass my filter function to:

====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace)
====

The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different (they are an AND condition, plus the string filters also apply an inheritance behavior).

This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on.

Possible argument names for this filter type could be:

node=
object=

I think this new argument should accept only the following:

Callable - return matching objects
True - return all objects
False - return no objects
None - (??? not sure what makes sense here ???)

Here is an example testcase:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p>
  <b>bold</b>
  <i>italic</i>
  and
  <u>underline</u>
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# this is workaround function #1
def workaround_find_next_sibling_non_whitespace(thing) -> bool:
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# this is workaround function #2
def workaround_find_first_child_non_whitespace(thing) -> bool:
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# get the first non-whitespace thing in <p>
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))

# print all following non-whitespace sibling elements in <p>
while this_thing:
    #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====

New value:

Beautiful Soup and XSLT/XQuery object types correlate as follows:

* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)

In XSLT, a node() object type matches any object type that can be contained in the document. For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:

====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====

where is-whitespace-text() is a function that returns true() for whitespace text() nodes.

I want to similarly filter through arbitrary object types in Beautiful Soup. But if I define a custom filter function:

====
def is_whitespace_text(tag) -> bool:
    return isinstance(tag, NavigableString) and tag.text.isspace()

def is_not_whitespace_text(tag) -> bool:
    return not is_whitespace_text(tag)
====

then I would need some kind of "node" argument that considers all object types via my filter function:

====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace)
====

The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different (they are an AND condition, plus the string filters also apply an inheritance behavior).

This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on.

Possible argument names for this filter type could be:

node=
object=

I think this new argument should accept only the following:

Callable - return matching objects
True - return all objects
False - return no objects
None - (??? not sure what makes sense here ???)

Here is an example testcase:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p>
  <b>bold</b>
  <i>italic</i>
  and
  <u>underline</u>
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# this is workaround function #1
def workaround_find_next_sibling_non_whitespace(thing) -> bool:
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# this is workaround function #2
def workaround_find_first_child_non_whitespace(thing) -> bool:
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# get the first non-whitespace thing in <p>
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))

# print all following non-whitespace sibling elements in <p>
while this_thing:
    #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====
2023-12-30 12:48:42	Chris Papademetrious	description

Old value:

Beautiful Soup and XSLT/XQuery object types correlate as follows:

* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)

In XSLT, a node() object type matches any object type that can be contained in the document. For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:

====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====

where is-whitespace-text() is a function that returns true() for whitespace text() nodes.

I want to similarly filter through arbitrary object types in Beautiful Soup. But if I define a custom filter function:

====
def is_whitespace_text(tag) -> bool:
    return isinstance(tag, NavigableString) and tag.text.isspace()

def is_not_whitespace_text(tag) -> bool:
    return not is_whitespace_text(tag)
====

then I would need some kind of "node" argument that considers all object types via my filter function:

====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace)
====

The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different (they are an AND condition, plus the string filters also apply an inheritance behavior).

This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on.

Possible argument names for this filter type could be:

node=
object=

I think this new argument should accept only the following:

Callable - return matching objects
True - return all objects
False - return no objects
None - (??? not sure what makes sense here ???)

Here is an example testcase:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p>
  <b>bold</b>
  <i>italic</i>
  and
  <u>underline</u>
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# this is workaround function #1
def workaround_find_next_sibling_non_whitespace(thing) -> bool:
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# this is workaround function #2
def workaround_find_first_child_non_whitespace(thing) -> bool:
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# get the first non-whitespace thing in <p>
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))

# print all following non-whitespace sibling elements in <p>
while this_thing:
    #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====

New value:

Beautiful Soup and XSLT/XQuery object types correlate as follows:

* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)

In XSLT, a node() object type matches any object type that can be contained in the document. For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:

====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====

where is-whitespace-text() is a function that returns true() for whitespace text() nodes.

I want to similarly filter through arbitrary object types in Beautiful Soup. But if I define a custom filter function:

====
def is_whitespace_text(tag) -> bool:
    return isinstance(tag, NavigableString) and tag.text.isspace()

def is_not_whitespace_text(tag) -> bool:
    return not is_whitespace_text(tag)
====

then I would need some kind of "node" argument that considers all object types via my filter function:

====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace)
====

The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different (they are an AND condition, plus the string filters also apply an inheritance behavior).

This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on.

Possible argument names for this filter type could be:

node=
object=

I think this new argument should accept only the following:

Callable - return matching objects
True - return all objects
False - return no objects
None - (??? not sure what makes sense here ???)

Here is an example testcase:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p>
  <b>bold</b>
  <i>italic</i>
  and
  <u>underline</u>
  <br />
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# this is workaround function #1
def workaround_find_next_sibling_non_whitespace(thing) -> bool:
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# this is workaround function #2
def workaround_find_first_child_non_whitespace(thing) -> bool:
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# get the first non-whitespace thing in <p>
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))

# print all following non-whitespace sibling elements in <p>
while this_thing:
    #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====
2024-01-22 18:00:50	Leonard Richardson	summary	enhance find*() methods to filter through all object types	Create an easy way to apply a filter to any kind of PageElement
2024-02-02 16:46:13	Leonard Richardson	beautifulsoup: status	New	Fix Committed
2024-02-02 16:46:24	Leonard Richardson	beautifulsoup: status	Fix Committed	In Progress
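
The two workaround functions in the description differ only in which sequence of nodes they walk, so they could be folded into one helper. The following is a minimal sketch of that idea, not the fix that was committed for this bug: first_match() is a hypothetical name, the sample markup is invented for illustration, and the built-in 'html.parser' is assumed in place of 'lxml' so that no extra parser is needed. It relies only on the existing .contents, .next_siblings and .previous_siblings attributes.

```python
from typing import Callable, Iterable, Optional

from bs4 import BeautifulSoup, Comment, NavigableString
from bs4.element import PageElement

Predicate = Callable[[PageElement], bool]


def is_non_whitespace(node: PageElement) -> bool:
    """Accept every node except strings that consist only of whitespace."""
    return not (isinstance(node, NavigableString) and node.isspace())


def first_match(nodes: Iterable[PageElement], keep: Predicate) -> Optional[PageElement]:
    """Hypothetical helper: return the first node accepted by `keep`, else None."""
    return next((node for node in nodes if keep(node)), None)


# Invented sample markup containing a tag, a comment, and a non-blank string.
html_doc = "<p> <b>bold</b> <!--note--> <i>italic</i> and </p>"
soup = BeautifulSoup(html_doc, "html.parser")

# Walk the children of <p>, skipping whitespace-only strings.
chain = []
node = first_match(soup.p.contents, is_non_whitespace)
while node is not None:
    chain.append(node)
    node = first_match(node.next_siblings, is_non_whitespace)

for item in chain:
    print(repr(item))

# The same helper works backwards through .previous_siblings.
before_italic = first_match(soup.i.previous_siblings, is_non_whitespace)
```

Because Comment is a subclass of NavigableString, a comment containing only whitespace would also be skipped by this predicate, which matches the behaviour of the filter in the description.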