Okay, I reviewed things right now instead of tomorrow. Some comments:
* Thinking about this some more, I won't check this code in without at least a
bunch of tests. The tests would containsmall Silva documents with some various
constructs, and we expect whether the output is correct. Even though the tests
are for SilvaDocument,
put them in Silva core for now. Or alternatively, we could extend ParsedXML with
this functionality, if at least we can write it so that we don't have Silva
specific knowledge in it. I suspect tests would have helped us finding the next
issue sooner:
* I have my doubts about the ''.join() operation you do. This would mean that
the ZCTextIndex could see, for this text:
<p>One</p><p>Two</p>
OneTwo
And this is undesirable. Am I missing something?
* You could join with a space, though that would mean this doesn't get indexed
accurately:
<p>Foo<strong>bar</strong></p>
As it'd show up as 'Foo bar', instead of 'Foobar'. It's failing on this now
too though, and
I'm willing to live with this fairly rare problem of subword markup for now.
I looked at your version and attempted to rewrite it using a generator, ripping
out some
code that I thought complicated matters. It's untested, but perhaps useful if
you want to work on this further. It's simpler as it doesn't have L_res
initialization code (which I dislike anyway even if we retained a L_res
parameter, nor do we have the textStrip parameter. The default should be the
right one, and since whitespace characters shouldn't hurt during indexing as
tokenizing into words already takes place there, let's leave it in:
def _get_textContents(self, node):
for child in node.childNodes: nodeType = child.nodeType
if (nodeType == child.COMMENT_NODE or nodeType == child.PROCESSING_INSTRUCTION_NODE): continue
if nodeType == child.TEXT_NODE: yield child.nodeValue
elif (nodeType in [ child.ATTRIBUTE_NODE, child.DOCUMENT_TYPE_NODE, child.NOTATION_NODE, child.ENTITY_NODE,
]): continue # or you might prefer, e.g, to look inside attributes
else:
# iterate through elements are left
for text in self._get_textContents(child): yield text
the result of this generator would be a sequence of texts, which could be joined
using ' '.join(_get_textContents). (note the space in the ' ').
Another option is returning the sequence directly from fulltext, in a list form,
instead of doing the join here. This should be safe except that some phrases
with bold in them get broken up. I.e. if you have "my <strong>special</strong>
phrase", this would become
["my ", "special", " phrase"], and you couldn't find "my special phrase" anymore
with a phrase search.
I'm redeferring this one, as I don't think we can work all of this out in the beta.
Okay, I reviewed things right now instead of tomorrow. Some comments:
* Thinking about this some more, I won't check this code in without at least a
bunch of tests. The tests would containsmall Silva documents with some various
constructs, and we expect whether the output is correct. Even though the tests
are for SilvaDocument,
put them in Silva core for now. Or alternatively, we could extend ParsedXML with
this functionality, if at least we can write it so that we don't have Silva
specific knowledge in it. I suspect tests would have helped us finding the next
issue sooner:
* I have my doubts about the ''.join() operation you do. This would mean that /p><p>Two< /p>
the ZCTextIndex could see, for this text:
<p>One<
OneTwo
And this is undesirable. Am I missing something?
* You could join with a space, though that would mean this doesn't get indexed
accurately:
<p>Foo< strong> bar</strong> </p>
As it'd show up as 'Foo bar', instead of 'Foobar'. It's failing on this now
too though, and
I'm willing to live with this fairly rare problem of subword markup for now.
I looked at your version and attempted to rewrite it using a generator, ripping
out some
code that I thought complicated matters. It's untested, but perhaps useful if
you want to work on this further. It's simpler as it doesn't have L_res
initialization code (which I dislike anyway even if we retained a L_res
parameter, nor do we have the textStrip parameter. The default should be the
right one, and since whitespace characters shouldn't hurt during indexing as
tokenizing into words already takes place there, let's leave it in:
def _get_textConten ts(self, node):
nodeType = child.nodeType
nodeType == child.PROCESSIN G_INSTRUCTION_ NODE):
continue
yield child.nodeValue
child. ATTRIBUTE_ NODE,
child. DOCUMENT_ TYPE_NODE,
child. NOTATION_ NODE,
child. ENTITY_ NODE,
continue # or you might prefer, e.g, to look inside attributes textContents( child):
yield text
for child in node.childNodes:
if (nodeType == child.COMMENT_NODE or
if nodeType == child.TEXT_NODE:
elif (nodeType in [
]):
else:
# iterate through elements are left
for text in self._get_
the result of this generator would be a sequence of texts, which could be joined _get_textConten ts). (note the space in the ' ').
using ' '.join(
Another option is returning the sequence directly from fulltext, in a list form, special< /strong>
instead of doing the join here. This should be safe except that some phrases
with bold in them get broken up. I.e. if you have "my <strong>
phrase", this would become
["my ", "special", " phrase"], and you couldn't find "my special phrase" anymore
with a phrase search.
I'm redeferring this one, as I don't think we can work all of this out in the beta.