XML tags not removed on fulltext indexing

Bug #101440 reported by Guido Wesdorp
Affects: Silva
Status: Fix Released
Importance: Low
Assigned to: Martijn Faassen
Milestone: 3.0

Bug Description

During fulltext indexing, XML tags are not removed from the XML content, so tag
names end up getting indexed too. There's a method on SilvaDocument that should
remove those tags, but it only contains a comment saying it hopes/expects
TextIndexNG to do that, and returns what comes in as-is.

It would be nice to add something like the following code to that method:

def flattenxml(self, node):
    ret = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            ret.append(child.nodeValue)
        elif child.nodeType == child.ELEMENT_NODE:
            # recurse into child elements and gather their text too
            ret.append(self.flattenxml(child))
    return ' '.join(ret)

(Note that this is recursive, traversing the DOM using some iterator would be
way lighter, and it expects a DOM node as argument rather than a string, but you
get the idea).
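
A non-recursive variant along the lines hinted at above, using an explicit stack
in place of the call stack, might look roughly like this (a sketch only, assuming
the usual DOM node attributes; flattenxml_iterative is a made-up name):

def flattenxml_iterative(node):
    # collect the text of all descendant text nodes without recursing
    parts = []
    stack = [node]
    while stack:
        current = stack.pop()
        if current.nodeType == current.TEXT_NODE:
            parts.append(current.nodeValue)
        elif current.nodeType == current.ELEMENT_NODE:
            # push children in reverse so they are visited in document order
            stack.extend(reversed(current.childNodes))
    return ' '.join(parts)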

Revision history for this message
sacco (timothy-heap) wrote :

I have the impression that we may be waiting here for something to happen
elsewhere in the Zope indexing code.

I don't want to write a long essay about it, but having thought a little about
what that something might be, I suspect that it may never happen, at least not
in the near future.

In short, it would be nice to send something with some structure, like XML, to
the indexing pipeline, which could then use a sophisticated set of tools to
decide how to index.
However, I don't think that this something/XML is ever going to be the raw XML
representing the document structure, and neither, probably, should it be.
Probably what is needed is an XML schema for "somewhat structured text to be
indexed".

In the meantime, it would be better to have something along the lines of what
has been suggested here, at least as a temporary solution. I'll try to lash
something together sometime today.

Revision history for this message
sacco (timothy-heap) wrote :

O.K.

Here's the fulltext() method I'm using at present (i.e. this is where I'm
starting from):

    def _stringify(self, obj):
        if not isinstance(obj, basestring):
            try:
                obj = str(obj)
            except UnicodeEncodeError:
                obj = unicode(obj)
        return obj

    security.declareProtected(SilvaPermissions.AccessContentsInformation,
                              'fulltext')
    def fulltext(self):
        """Return the full text content of this object."""
        if self.version_status() == 'unapproved':
            return ''
        fulltext = [self.get_title()]
        text = self._flattenxml(self.content_xml())
        if isinstance(text, (list, tuple, )):
            fulltext.extend(map(self._stringify, filter(None, text)))
        else:
            text = self._stringify(text)
            if text:
                fulltext.append(text)
        return fulltext

Revision history for this message
sacco (timothy-heap) wrote :

Incidentally, even though Silva 1.4 is described as requiring Zope 2.7.8 and not
working with previous versions, the patch to fulltext() for compatibility
with 2.7.[67] has actually been left in place.

Revision history for this message
sacco (timothy-heap) wrote :

Here is a new version of fulltext() along the lines of johnny's suggestion
which strips the XML to return a list of strings, each of which represents the
content of a top-level element of the DocumentVersion.

    def fulltext(self):
        """Return the full text content of this object."""
        if self.version_status() == 'unapproved':
            return ''
        fulltext = [self.get_title()]
        ### text = self._flattenxml(self.content_xml())
        text = self._get_textContents(self.content.documentElement, [])
        if isinstance(text, (list, tuple, )):
            fulltext.extend(map(self._stringify, filter(None, text)))
        else:
            text = self._stringify(text)
            if text:
                fulltext.append(text)
        return fulltext

    def _get_textContents(self, node, L_res=False, textStrip=True):
        if not L_res and not isinstance(L_res, list): L_res = []
        for child in node.childNodes:
            nodeType = child.nodeType
            if ( nodeType == child.COMMENT_NODE
                 or nodeType == child.PROCESSING_INSTRUCTION_NODE
                 ):
                continue
            if nodeType == child.TEXT_NODE:
                if textStrip:
                    L_res.append(child.nodeValue.strip())
                else:
                    L_res.append(child.nodeValue)
            elif ( nodeType in [
                child.ATTRIBUTE_NODE,
                child.DOCUMENT_TYPE_NODE,
                child.NOTATION_NODE,
                child.ENTITY_NODE,
                ] ):
                continue # or you might prefer, e.g, to look inside attributes
            else:
                ### This (first) call would simply grow the list recursively
                ### self._get_textContents(child, L_res, textStrip=textStrip)
                # or one could decide to join or not (and how) based on, e.g,
                # the element tag type --- for now we just join:
                L_res.append(' '.join(
                    self._get_textContents(child, [], textStrip=textStrip) ))
        return L_res

Revision history for this message
sacco (timothy-heap) wrote :

Whilst, as johnny notes, it would be lighter to use an iterator, I'm not
convinced that this would be better overall, because:

1) The recursion is depth-first, and we can be careful not to put anything on
the stack more than once; thus, unless the structure is _really_ deep, it
shouldn't be too bad.

If we were to decide that all that is required is a single string to represent
the contents of the whole document, then (as the commented out recursion below
illustrates) we could also 'thread' an accumulator through the recursion by
passing (a reference to) a mutable list. The compiler could then use the same
stack frame for the whole recursion (in theory: I've no idea whether the Python
compiler actually knows anything about tail-recursion ... removing the last
'return L_res' might help?)

2)
More importantly, the recursive code is *much* easier to adapt to particular
requirements.
For example, one might wish to have a separate string in the list returned for
each paragraph contained in the document (to avoid compound terms being indexed
across paragraph boundaries, for instance). It's fairly easy to see how this
can be done with a recursive version of this function (and should be
straightforward to implement); even such a simple requirement would be *much*
more complicated with a function based on one of the basic iterators of
ParsedXML (as far as I can see --- this is the first time I've looked inside it!).

Revision history for this message
sacco (timothy-heap) wrote :

Finally, one could combine the recursive and iterator-based approaches:
the idea would be to start at the top with a recursive function which is
clear, flexible, and easy to modify to particular requirements, but,
once one has used this to select and divide as required the text from the
various elements, an iterator is used to extract the text more efficiently
from the lower levels.

Below is a continuation of the previous example, showing how this might
be done (the desired output, a list of strings representing top-level elements,
is as before).

I'd welcome any comments, especially from somebody who knows more
about either ParsedXML or about Silva document structure; my choices of
_nodeAcceptMap and whatToShow , in particular, are just what I happened
to type in/cut and paste on the spur of the moment while I was experimenting.

This does, however, all seem to work. Can we include something like this,
at least as a stop-gap until a better approach is ready?

    def _get_textContents(self, node, L_res=False, textStrip=True):
        if not L_res and not isinstance(L_res, list): L_res = []
        for child in node.childNodes:
            nodeType = child.nodeType
            if ( nodeType == child.COMMENT_NODE
                 or nodeType == child.PROCESSING_INSTRUCTION_NODE
                 ):
                continue
            if nodeType == child.TEXT_NODE:
                if textStrip:
                    L_res.append(child.nodeValue.strip())
                else:
                    L_res.append(child.nodeValue)
            elif ( nodeType in [
                child.ATTRIBUTE_NODE,
                child.DOCUMENT_TYPE_NODE,
                child.NOTATION_NODE,
                child.ENTITY_NODE,
                ] ):
                continue # or you might prefer, e.g, to look inside attributes
            else:
                ### This (first) call would simply grow the list recursively
                ### self._get_textContents(child, L_res, textStrip=textStrip)
                # or one could decide to join or not (and how) based on, e.g,
                # the element tag type --- for now we just join:
                ##L_res.append(' '.join(
                ## self._get_textContents(child, [], textStrip=textStrip) ))
                L_res.append(' '.join(
                    self._get_textFlat(child, textStrip=textStrip, ) ))
        return L_res

    _nodeAcceptMap = dict( { # REJECT some sub-trees we don't want to see
        Node.COMMENT_NODE : NodeFilter.FILTER_REJECT,
        Node.PROCESSING_INSTRUCTION_NODE : NodeFilter.FILTER_REJECT,
        Node.TEXT_NODE : NodeFilter.FILTER_ACCEPT,
        Node.ATTRIBUTE_NODE : NodeFilter.FILTER_REJECT,
        Node.DOCUMENT_TYPE_NODE : NodeFilter.FILTER_REJECT,
        Node.NOTATION_NODE : NodeFilter.FILTER_REJECT,
        Node.ENTITY_NODE : NodeFilter.FILTER_REJECT,
        Node.ENTITY_REFERENCE_NODE : NodeFilter.FILTER_ACCEPT,
        Node.CDATA_SECTION_NODE : NodeFilter.FILTER_ACCEPT,
        Node.ELEMENT_NODE : NodeFilter.FILTER_SKIP,
        Node.DOC...


Revision history for this message
sacco (timothy-heap) wrote :

Hi all,

given that this has never worked or even been implemented,
does anybody have a strong objection to me checking in
one of these versions, at least until something better comes
along?

By the way, can somebody tell me if xml.dom is the "right"
place from which to get Node , i.e.

from xml.dom import Node
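
(For reference: the standard library's xml.dom module does define a Node class
carrying the node-type constants, so that import does give you the constants used
above; whether ParsedXML nodes need anything different is a separate question.)
A quick check, using minidom purely for illustration:

from xml.dom import Node
from xml.dom.minidom import parseString

doc = parseString("<doc><p>hello</p></doc>")
para = doc.documentElement.firstChild
print(para.nodeType == Node.ELEMENT_NODE)           # True
print(para.firstChild.nodeType == Node.TEXT_NODE)   # True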

Revision history for this message
Martijn Faassen (faassen) wrote :

I do want to review this before it's checked in. While the XML tag removal never
worked, fulltext indexing has been around for a while.

I'm a bit scared to see the DOM code; I know quite a bit about DOM and ParsedXML
but I'm trying to avoid them. :) The NodeFilter code and such worries me --
ParsedXML does have some implementation of this, but I remember it never really
got a lot of review so I'm worried about it failing in obscure cases. This is
why I'd prefer the simpler DOM tree walking approach.

I'm also slightly worried about the performance impact of this. A simple version
should be as fast as the XML generating form. I don't think that'll be too hard
to accomplish -- the XML generation in ParsedXML isn't particularly fancy
either, but some simple measurements extracting this information from large
documents would comfort me.
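
A rough, Silva-agnostic way to get such a measurement might be the following
sketch; it uses the standard library's minidom as a stand-in for ParsedXML, so
the numbers are only indicative:

# build a large synthetic document and time serialisation against extraction
import time
from xml.dom.minidom import parseString

doc = parseString("<doc>" + "<p>some paragraph text here</p>" * 10000 + "</doc>")

def extract(node, acc):
    # simple depth-first text extraction, as in the recursive proposals above
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            acc.append(child.nodeValue)
        elif child.nodeType == child.ELEMENT_NODE:
            extract(child, acc)
    return acc

start = time.time()
doc.toxml()  # the "XML generating form"
print("serialise: %.3fs" % (time.time() - start))

start = time.time()
' '.join(extract(doc.documentElement, []))
print("extract:   %.3fs" % (time.time() - start))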

Revision history for this message
Martijn Faassen (faassen) wrote :

Hm, making sacco 'nosy' on the issue. Could you respond to my "I'm worried" item
above? I'm deferring this for the Silva 1.5 beta.

Revision history for this message
sacco (timothy-heap) wrote :

hi

Sorry not to reply sooner --- I've been a bit tied up with something else and
hadn't checked in for a while on this.

Personally, *for a beta*, I'd be inclined to check in at least the recursive
tree-walking approach of msg7975 so that it gets some exposure (i.e. chance to
break) while it's less critical, or even the more complicated msg7977 version ...

I too found the code using NodeFilter a bit scary, although it's less
complicated than it looks:
much of it is minor fiddling around, e.g. defining a myNodeFilter class only
because NodeFilter is defined in a way which doesn't easily allow one to define
one's own filter :(

Really, the NodeFilter code in msg7977 was just written in response to
johnny's comment
"Note that this is recursive, traversing the DOM using some iterator would be
way lighter"
as if to say "this is how much it apparently takes to do this
iteratively in ParsedXML; is it really worth it?"
...and so as to have at least something that could be cleaned up by somebody who
knows more about the Silva document structure and the internal pitfalls of
ParsedXML, or possibly rejected as too complicated relative to the recursive
tree-walking approach for the likely gain in performance (if any)
(or even just tested and improved if this does look like the right way to go).

For an *actual release*, on the other hand, I would tend to stick with the
recursive DOM tree-walking approach a la msg 7975, especially given your
comments about the ParsedXML implementation.

Revision history for this message
sacco (timothy-heap) wrote :

As for the performance impact, I'm afraid I don't have the time to do much
testing for now.

What I expect though:

1)
For large documents msg7977 should (in theory) be as efficient as it can be,
but this rather depends on there not being any obscure glitches in ParsedXML's
iterative tree walker: if you're worried about this possibly failing, then this
might be an unduly optimistic assumption. As I said though, for a first beta I
would be tempted to check it in and see what breaks (which was partly why I
wrote it).

On the other hand, unless the ParsedXML DOM data structure was specifically
designed to allow efficient iteration (I haven't checked) it's not really likely
to be any faster than a careful depth-first recursion.

2)
For small documents msg7975 would probably be lighter (no cloneNode(), for
instance), so if one expects a large number of small documents...

3)
As I mentioned in msg7976, the latter approach is depth-first (and doesn't pass
anything much inwards), so the stack requirements depend on the depth rather
than the size of documents, and it shouldn't be too heavy unless you have really
deep documents (in which case you may have trouble anyway).

As I also mentioned, it could be made even lighter by passing around just a
reference to a single string to be used as an accumulator, but this comes at the
cost of making the code less easy to follow for anybody who isn't habitually
recursive.

Revision history for this message
sacco (timothy-heap) wrote :

In summary, the problem really can't be that complicated.

Unless stripping XML from Silva documents is intrinsically too complex (and it
shouldn't be), then I think a recursive approach a la msg7975 would be
adequate and, as I suggest in msg7976, there are probably reasons to prefer it.

If, however, this doesn't perform, then you need to use the iterator; however,
from what you say, there's no guarantee that this would actually be any lighter
in practice.

If it were my codebase, I would check in the iterative version now to see what
happens, with the intention of later abandoning it in favour of the recursive
approach unless the latter turns out to be dramatically slower on real data.
The two versions also have enough in common that testing the iterative version
also effectively tests some aspects of the recursive version.

A final consideration is that any performance hit might be offset against
lighter indexing with less spam.

Revision history for this message
Martijn Faassen (faassen) wrote :

Thanks for the analysis! I think I'm in favor of the recursive version, so I'll
look at checking this in before the beta tomorrow.

Revision history for this message
Martijn Faassen (faassen) wrote :

Okay, I reviewed things right now instead of tomorrow. Some comments:

* Thinking about this some more, I won't check this code in without at least a
bunch of tests. The tests would contain small Silva documents with various
constructs, and we would check whether the output is correct. Even though the
tests are for SilvaDocument, put them in Silva core for now. Alternatively, we
could extend ParsedXML with this functionality, provided we can write it without
Silva-specific knowledge in it. I suspect tests would have helped us find the
next issue sooner:

* I have my doubts about the ''.join() operation you do. This would mean that
  the ZCTextIndex could see, for this text:
  <p>One</p><p>Two</p>

  OneTwo

  And this is undesirable. Am I missing something?

* You could join with a space, though that would mean this doesn't get indexed
  accurately:

  <p>Foo<strong>bar</strong></p>

  As it'd show up as 'Foo bar' instead of 'Foobar'. It's failing on this now too
  though, and I'm willing to live with this fairly rare problem of subword markup
  for now.

I looked at your version and attempted to rewrite it using a generator, ripping
out some code that I thought complicated matters. It's untested, but perhaps
useful if you want to work on this further. It's simpler as it doesn't have the
L_res initialization code (which I dislike anyway, even if we retained an L_res
parameter), nor does it have the textStrip parameter. The default should be the
right one, and since whitespace characters shouldn't hurt during indexing (as
tokenizing into words already takes place there), let's leave the whitespace in:

    def _get_textContents(self, node):
        for child in node.childNodes:
            nodeType = child.nodeType
            if (nodeType == child.COMMENT_NODE or
                nodeType == child.PROCESSING_INSTRUCTION_NODE):
                continue
            if nodeType == child.TEXT_NODE:
                yield child.nodeValue
            elif (nodeType in [
                child.ATTRIBUTE_NODE,
                child.DOCUMENT_TYPE_NODE,
                child.NOTATION_NODE,
                child.ENTITY_NODE,
                ]):
                continue # or you might prefer, e.g., to look inside attributes
            else:
                # what's left are elements; recurse and re-yield their text
                for text in self._get_textContents(child):
                    yield text

The result of this generator would be a sequence of texts, which could be joined
using ' '.join(self._get_textContents(node)) (note the space in the ' ').

Another option is returning the sequence directly from fulltext, in a list form,
instead of doing the join here. This should be safe except that some phrases
with bold in them get broken up. I.e. if you have "my <strong>special</strong>
phrase", this would become
["my ", "special", " phrase"], and you couldn't find "my special phrase" anymore
with a phrase search.

I'm redeferring this one, as I don't think we can work all of this out in the beta.

Revision history for this message
sacco (timothy-heap) wrote :

Unless there has been some cut and paste problem, my join operation was on a
space (it is on a space in the code I am using here).
The sub-word markup would be a problem in this case, but needn't be...

The function I wrote was just meant to be the first step/example and wasn't
written to any particular spec. It just returned a list consisting of the
joined contents of each top-level element simply to indicate that the recursive
structure could be used to do various things, hence the comment about deciding
to join or not, etc.
To create a specification for something more appropriate for Silva would require
more detailed knowledge of the Silva Document model and how it is used than I have.

This need for somewhat specialised knowledge of the document type is one reason
why I suspect this should be done here in SilvaDocument for now, rather than
trying to pass the XML and sort it all out in the indexing pipeline. In theory
it would be nice to think that there will be common issues to be dealt with in
the processing of many different document types, and that the indexing package
could provide some nice generic tools to help with the process, but in practice
I don't see any evidence that anybody is near even beginning to think about this
kind of abstraction yet in Zope, and I'd like to do my indexing next month
rather than in my next lifetime.

Perhaps what we learn here could be generalised later ...

Revision history for this message
sacco (timothy-heap) wrote :

As was possibly clear, I too tend to favour the recursive version: things may
get split into various clauses and possibly even mutually recursive auxiliary
functions, but it's usually fairly clear what you are dealing with at each point
and how to carry around the little extra pieces of information you may need.

With an iterator this can get tricky even when a full (and efficient) set of
neighbourhood inspection and navigation functions is available ... and in this
case I don't believe they are.

The difference is often more convincing when the two versions can be seen side
by side, though.

Revision history for this message
sacco (timothy-heap) wrote :

I have several other comments: will try to find time to post more over the weekend.
In particular, some serious reservations about the performance as a recursive
generator (I think that you're looking for a lazy list here, but what you're
really getting looks a lot more complex).

Revision history for this message
sacco (timothy-heap) wrote :

In the meantime, without explanation (sorry - it still uses L_res but the reason
should be clearer here):

    def fulltext(self):
        """Return the full text content of this object."""
        if self.version_status() == 'unapproved':
            return ''
        fulltext = [self.get_title()]
        text = list()
        self._get_textContents(self.content.documentElement, text)
        fulltext.extend(filter(None, text))
        return fulltext

    def _get_textContents(self, node, L_res, textStrip=True):
        for child in node.childNodes:
            nodeType = child.nodeType
            if ( nodeType == child.COMMENT_NODE
                 or nodeType == child.PROCESSING_INSTRUCTION_NODE
                 ):
                continue
            if nodeType == child.TEXT_NODE:
                if textStrip:
                    L_res.append(child.nodeValue.strip())
                else:
                    L_res.append(child.nodeValue)
            elif ( nodeType in [
                child.ATTRIBUTE_NODE,
                child.DOCUMENT_TYPE_NODE,
                child.NOTATION_NODE,
                child.ENTITY_NODE,
                ] ):
                continue # or you might prefer, e.g., to look inside attributes
            elif nodeType == child.ELEMENT_NODE:
                if tag_is_p(child.tagName):
                    # text in a paragraph should be joined with no space,
                    # but not stripped
                    P_text = list()
                    self._get_textContents(child, P_text, textStrip=False)
                    L_res.append(''.join(P_text))
                    # can alternatively call a mutually recursive helper
                    # function if we want to do something more complicated,
                    # e.g. to enforce a schema
                else:
                    self._get_textContents(child, L_res, textStrip=textStrip)
            else:
                self._get_textContents(child, L_res, textStrip=textStrip)

Here tag_is_p() would look, I suppose, something like:

def tag_is_p(tagName):
    if ":" in tagName:
        parts = tagName.split(":")
        if len(parts) != 2 or not parts[1]:
            raise ValueError("malformed tag name: %r" % tagName)
        else: # check parts[0] is a suitable prefix if you want
            tagName = parts[1]
    return tagName == "p"

Revision history for this message
Martijn Faassen (faassen) wrote :

"Unless there has been some cut and paste problem, my join operation was on a
space (it is on a space in the code I am using here)." you are right, I think I
misread your code somehow.

Some comments on your recent comments and code:

* I don't see the performance implications of my code. Why would my code produce
something more complicated than a lazy list? I added a 'microdom.py' which
demonstrates my code on a fake mockup DOM tree. It produces a list (actually a
generator).

* I don't see a reason to support the textStrip extension. Either we always do
it or never; there's no need to make this configurable at the call site and
complicate the code. I don't think we need to use any stripping.

* manually splitting on ':' in tag_is_p is rather inefficient and hard to read.
DOM supports 'localName' on the Node interface to get that information. The DOM
keeps this information directly internally as well, as far as I can recall.
Splitting off prefixes is a task for an XML parser, not for someone who works
with an XML API.

* no matter what we do, we need a testsuite for this functionality. :)
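
A minimal sketch of the kind of tests meant here, written against plain minidom
rather than Silva's real document fixtures (so the helper function and the
expected lists are assumptions for illustration, not the actual SilvaDocument
API), could look like this:

import unittest
from xml.dom.minidom import parseString

def get_text_contents(node, result):
    # stand-in for the _get_textContents method discussed above, adapted to a
    # plain function so it can be exercised without a Silva instance
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            result.append(child.nodeValue)
        elif child.nodeType == child.ELEMENT_NODE:
            if child.localName == "p":
                parts = []
                get_text_contents(child, parts)
                result.append("".join(parts))
            else:
                get_text_contents(child, result)
    return result

class FulltextStrippingTest(unittest.TestCase):

    def test_tags_are_stripped(self):
        doc = parseString("<doc><p>One</p><p>Two</p></doc>")
        self.assertEqual(
            get_text_contents(doc.documentElement, []), ["One", "Two"])

    def test_subword_markup_stays_joined(self):
        doc = parseString("<doc><p>Foo<strong>bar</strong></p></doc>")
        self.assertEqual(
            get_text_contents(doc.documentElement, []), ["Foobar"])

if __name__ == "__main__":
    unittest.main()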

Revision history for this message
Martijn Faassen (faassen) wrote :

Deferring this into future; won't do this in Silva 1.5

Revision history for this message
sacco (timothy-heap) wrote :

> manually splitting on ':' in tag_is_p is rather inefficient and hard to read.
> DOM supports 'localName' on the Node interface to get that information. The DOM
> keeps this information directly internally as well, as far as I can recall.
> Splitting off prefixes is a task for an XML parser, not for someone who works
> with an XML API.

Fine: the tag_is_p() was just what I threw in the second before posting.

In this case, instead of if tag_is_p( ... I should write:

    if child.localName == "p":

Revision history for this message
sacco (timothy-heap) wrote :

> I don't see a reason to support the textStrip extension. Either we always do
> it or never, but no need to configure this when calling and complicated the
> code. I don't think we need to use any stripping.

I think we certainly can't do it always!

Just as the idea of the original version returning a list
consisting of the joined contents of each top-level element
was simply intended as an example of how to do something,
so was the textStrip parameter. The reason stripping
whitespace was chosen as the example is that I
frequently see XML documents which are over 50%
whitespace, particularly those generated in Python
(Python programmers don't tend to use tabs ;?> ),
but stripping can't simply be applied throughout;
however, I tend not to examine the XML internal to
Silva if at all possible, so you will know more than
me about whether things are better here.

An example of what?
Sometimes it may become necessary to treat a
node differently depending upon where it occurs in
the document tree, e.g. whether or not it occurs inside
another particular type of node. Unless this depends
only upon strictly "local" information (e.g. the difference
is that the node in question is a direct child of an
'li' element, in which case it may be possible to add
a suitable clause to the part of the function processing
element node) there are essentially two ways to deal with
this:
1) passing some (limited) information down the stack
    (in this example via the textStrip parameter);
2) using an auxiliary function.

But even when an auxiliary function is used,
unless the situation is *really* complicated
(and it really shouldn't be in this case)
it is far neater and more maintainable to
make it mutually recursive (i.e. to call back
into the main function for the inner recursions);
in this case it is usually necessary to put
some info on the stack as well to alter the
behaviour of the inner recursions.

Summary: if you don't think we ever need to
strip then let's omit the parameter; however,
I wouldn't yet rule out using something similar
to tune the algorithm to the Silva document
model.

Revision history for this message
sacco (timothy-heap) wrote :

> Why would my code produce
> something more complicated than a lazy list?

What I meant was that:

i) The use of the generator seems to be trying to
provide the advantages of a lazy list (but I'm not
convinced either that it does, or that these
advantages would be very significant here);

ii) I suspect that the computational complexity
of the generator version is likely to be somewhat
worse than that of generating a list (lazy or
otherwise) ... of which more shortly.

I'm afraid the two comments got telescoped
in my haste to go home.

Revision history for this message
sacco (timothy-heap) wrote :

Two comments on the last version I posted:

1)
It solves the markup problems Martijn posed earlier
(e.g. msg8140)

2)
Accumulating text in a single list (L_res) (passed
by reference to the inner recursions) and returning
nothing avoids creating new variables at each level
and uses the minimum stack space.

This was actually how the function was originally
written but (as suggested by the comments) I added
the return value and the initialisation code in the
original version just because it is sometimes clearer
to see how things are supposed to work this way.

Revision history for this message
sacco (timothy-heap) wrote :

MORE IMPORTANT
(given the concern expressed about performance)

The generator idea is conceptually nice --- effectively
using yield to convert a recursive function into an
iterator --- but I have some practical reservations.

Basically, I don't think yield() can play well with
recursion, not in the sense of the semantics (which
I'm sure will work fine) but in terms of performance ...

Also, to me, yield seems to make it harder to control
recursion precisely, and I suspect that when we come
to look at tuning the results to get precisely what is
required it may turn out that yield also shares some
of the disadvantages of an iterator, and that the
solution will come to look more and more like the
old recursive version.

But returning to matters of performance:
with a recursive generator, every yield() statement
executed must effectively pass a result back up
the stack, and every level of recursion needs to
be wrapped in a construct which iterates
over the generated list, pulling out values one at
a time and feeding them on upwards. This
almost certainly involves freezing a local context
for every level of recursion each time a value is
returned, and I'd be surprised if Python can do
much optimisation here.

In particular, when you do:
    for text in self._get_textContents(child):
        yield text
as well as introducing an extra level of iteration,
I suspect you are actually using one next()+yield()
(effectively a function call and return) for every
level of recursion each time you return a single
value in the list!

By contrast, the 'threaded' version using L_res
returns nothing on the stack and uses just one
call/return for each node visited.

I've uploaded a version of your microdom (microdom2.py)
which prints some traces to demonstrate what
I mean (the parameters L_test and depth are
obviously just for demonstration purposes):
for this example the generator appears to
use five times as many call/returns!
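
The microdom2.py attachment isn't reproduced here; the following is only a
hypothetical reconstruction of the kind of comparison it makes, counting how
many times a text value passes through a stack frame in the recursive generator
versus how many appends the accumulator version performs, on a small hand-built
mock tree:

class MockNode(object):
    # tiny stand-in for a DOM node: just nodeType, nodeValue and childNodes
    TEXT_NODE, ELEMENT_NODE = 3, 1

    def __init__(self, node_type, value=None, children=()):
        self.nodeType = node_type
        self.nodeValue = value
        self.childNodes = list(children)

def text(value):
    return MockNode(MockNode.TEXT_NODE, value)

def elem(*children):
    return MockNode(MockNode.ELEMENT_NODE, children=children)

counts = {"generator yields": 0, "accumulator appends": 0}

def gen_text(node):
    # recursive generator: a value produced deep in the tree is re-yielded
    # once per enclosing frame on its way back out
    for child in node.childNodes:
        if child.nodeType == MockNode.TEXT_NODE:
            counts["generator yields"] += 1
            yield child.nodeValue
        else:
            for t in gen_text(child):
                counts["generator yields"] += 1
                yield t

def acc_text(node, result):
    # accumulator threading: each value is appended exactly once,
    # however deep it sits
    for child in node.childNodes:
        if child.nodeType == MockNode.TEXT_NODE:
            counts["accumulator appends"] += 1
            result.append(child.nodeValue)
        else:
            acc_text(child, result)
    return result

tree = elem(elem(elem(elem(text("deeply"), text("nested")))), text("top"))
print(' '.join(gen_text(tree)))
print(' '.join(acc_text(tree, [])))
print(counts)  # the generator figure grows with nesting depth; the other doesn't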

Revision history for this message
Sylvain Viollon (thefunny) wrote :

In Silva 3.0 the document type changed, and the fulltext now includes only the text of the document, with no other HTML or XML tags or special attributes.

Changed in silva:
milestone: none → 3.0
status: Incomplete → Fix Committed
Changed in silva:
status: Fix Committed → Fix Released