Comment 14 for bug 101440

Revision history for this message
Martijn Faassen (faassen) wrote :

Okay, I reviewed things right now instead of tomorrow. Some comments:

* Thinking about this some more, I won't check this code in without at least a
bunch of tests. The tests would containsmall Silva documents with some various
constructs, and we expect whether the output is correct. Even though the tests
are for SilvaDocument,
put them in Silva core for now. Or alternatively, we could extend ParsedXML with
this functionality, if at least we can write it so that we don't have Silva
specific knowledge in it. I suspect tests would have helped us finding the next
issue sooner:

* I have my doubts about the ''.join() operation you do. This would mean that
  the ZCTextIndex could see, for this text:
  <p>One</p><p>Two</p>

  OneTwo

  And this is undesirable. Am I missing something?

* You could join with a space, though that would mean this doesn't get indexed
  accurately:

  <p>Foo<strong>bar</strong></p>

  As it'd show up as 'Foo bar', instead of 'Foobar'. It's failing on this now
too though, and
  I'm willing to live with this fairly rare problem of subword markup for now.

I looked at your version and attempted to rewrite it using a generator, ripping
out some
code that I thought complicated matters. It's untested, but perhaps useful if
you want to work on this further. It's simpler as it doesn't have L_res
initialization code (which I dislike anyway even if we retained a L_res
parameter, nor do we have the textStrip parameter. The default should be the
right one, and since whitespace characters shouldn't hurt during indexing as
tokenizing into words already takes place there, let's leave it in:

  def _get_textContents(self, node):
        for child in node.childNodes:
            nodeType = child.nodeType
            if (nodeType == child.COMMENT_NODE or
                nodeType == child.PROCESSING_INSTRUCTION_NODE):
                continue
            if nodeType == child.TEXT_NODE:
                yield child.nodeValue
            elif (nodeType in [
                child.ATTRIBUTE_NODE,
                child.DOCUMENT_TYPE_NODE,
                child.NOTATION_NODE,
                child.ENTITY_NODE,
                ]):
                continue # or you might prefer, e.g, to look inside attributes
            else:
                # iterate through elements are left
                for text in self._get_textContents(child):
                    yield text

the result of this generator would be a sequence of texts, which could be joined
using ' '.join(_get_textContents). (note the space in the ' ').

Another option is returning the sequence directly from fulltext, in a list form,
instead of doing the join here. This should be safe except that some phrases
with bold in them get broken up. I.e. if you have "my <strong>special</strong>
phrase", this would become
["my ", "special", " phrase"], and you couldn't find "my special phrase" anymore
with a phrase search.

I'm redeferring this one, as I don't think we can work all of this out in the beta.