Plugin not indexing

Bug #606975 reported by Michael Neradkov
This bug affects 1 person
Affects: DokuWiki Sphinx Search plugin
Status: In Progress
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

After installing the plugin and running the indexer, I get:

# indexer -c sphinx.conf dk_main
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'sphinx.conf'...
indexing index 'dk_main'...
ERROR: index 'dk_main': source 'dk_main': XML parse error: not well-formed (invalid token) (line=675, pos=28, docid=732506693).
total 0 docs, 0 bytes
total 8.296 sec, 0 bytes/sec, 0.00 docs/sec
total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
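The wording `not well-formed (invalid token)` matches the error text used by the expat XML parser, so the failure can probably be reproduced without running the indexer at all. Below is a minimal Python sketch under that assumption. The function name `first_xml_error` and the sample bytes are illustrative only and are not part of the plugin or of Sphinx.

```python
from xml.parsers import expat


def first_xml_error(data: bytes):
    """Return (line, column, message) for the first well-formedness error, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk of input
    except expat.ExpatError as err:
        return err.lineno, err.offset, expat.ErrorString(err.code)
    return None


# Illustrative input (an assumption, not real wiki data):
# a raw control byte inside element content on line 2.
sample = b"<doc>\n<body>bad \x01 byte</body>\n</doc>"
print(first_xml_error(sample))
```

For a real check, the bytes would come from the generated XML file read in binary mode. The parser stops at the first error, so it reports only one location per run.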

Ivinco (info-ivinco) wrote :

Hi, do you get this error with the latest version (0.3)? We fixed several related bugs in that version. Let us know if you still get the error with 0.3.

Thanks!

Michael Neradkov (michael-neradkov) wrote :

Yes

Version 0.3

CentOS 5.4

Ivinco (info-ivinco)
Changed in dokuwiki-sphinxsearch:
status: New → In Progress
Yaroslav Vorozhko (vorozhko) wrote :

You can create an XML file with the DokuWiki content using the following command:
php xmlall.php > dw.xml

Then show us the lines around line 675 (lines 671 to 680) with the following command:
head -n 680 dw.xml | tail -n10
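Since the offending token may be an invisible control character or a broken multi-byte sequence, it can help to look at the raw bytes rather than at rendered text. A minimal Python sketch of the same "show a range of lines" idea follows. The helper name `show_lines` is hypothetical, and the file name `dw.xml` is taken from the command above.

```python
def show_lines(path, first, last):
    """Return (line_number, repr_of_raw_bytes) for lines first..last (1-based, inclusive)."""
    out = []
    with open(path, "rb") as handle:  # binary mode: no decoding, so nothing is hidden
        for number, raw in enumerate(handle, start=1):
            if number > last:
                break
            if number >= first:
                # repr() shows non-printable bytes as escapes such as \x01 or \xff
                out.append((number, repr(raw)))
    return out


# Example call (assumes dw.xml exists in the current directory):
# for number, text in show_lines("dw.xml", 671, 680):
#     print(number, text)
```

In the output, anything shown as a `\x..` escape is a byte that would not be visible in a normal editor view, which makes it a good first candidate for the invalid token.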

--
Thanks Yaroslav Vorozhko

Yaroslav Vorozhko (vorozhko) wrote :

There are some invalid XML tokens in those lines, so seeing them will help us understand the problem.

urusha (urusha) wrote :

I have the same error:
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'sphinx.conf'...
indexing index 'dk_main'...
ERROR: index 'dk_main': source 'dk_main': XML parse error: not well-formed (invalid token) (line=4008, pos=22, docid=404657722).
total 0 docs, 0 bytes
total 3.388 sec, 0 bytes/sec, 0.00 docs/sec
total 0 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 0 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg

The output of the following command is attached:
php xmlall.php |head -4008 |tail -1 > err.xml

The offending token is at the end of the page.
I tried editing the page with the error (removing the characters around the token, then removing the whole line), but the error reappeared elsewhere on the same page, again near the end. I then removed that page, but the error appeared on another page, also near the end. See err2.xml in the next attachment.
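Because the parser stops at the first bad token, fixing one spot only reveals the next one. A scan that lists every suspicious spot in one pass may save some of that back-and-forth. The Python sketch below flags bytes that are not valid UTF-8 and characters outside the XML 1.0 `Char` production. The names `is_xml_char` and `find_bad_spots` are illustrative, and it is an assumption that these two checks cover the cause of the error reported here.

```python
def is_xml_char(ch):
    """True if ch is allowed by the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_bad_spots(data: bytes):
    """Return a list of (line, byte_offset, reason) for suspicious spots."""
    problems = []
    for number, raw in enumerate(data.split(b"\n"), start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # Only the first undecodable byte on a line is reported.
            reason = "invalid UTF-8 byte 0x%02x" % raw[err.start]
            problems.append((number, err.start, reason))
            continue
        offset = 0  # byte offset of the current character within the line
        for ch in text:
            if not is_xml_char(ch):
                reason = "character U+%04X not allowed in XML 1.0" % ord(ch)
                problems.append((number, offset, reason))
            offset += len(ch.encode("utf-8"))
    return problems


# Example call (assumes the XML was saved to dw.xml):
# with open("dw.xml", "rb") as handle:
#     for line, offset, reason in find_bad_spots(handle.read()):
#         print(line, offset, reason)
```

Line numbers are 1-based and offsets are 0-based byte positions within the line. They may not line up exactly with the `pos` value printed by the indexer.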

urusha (urusha) wrote :
Ivinco (info-ivinco) wrote :

Urusha, thank you for the report.
Could you send me a sample of that page?

urusha (urusha) wrote :

Here is an example of a page that produces the error.

It is generated by an awk script from a .csv file, so it may be a UTF-8 issue in awk. However, the file looks fine in mcedit and Firefox.

I forgot to mention that I use sphinxsearch plugin 0.3.3 and the sphinxsearch-0.9.9-6 deb from squeeze.

urusha (urusha) wrote :

The next page with the error was the last one, but I can't send it because of the information it contains. I tried cutting it down (top, bottom, middle) but found no pattern in when the error appears or disappears. I hope it's the same error as on the first page. Otherwise, please give me instructions for finding the cause of the error.

Yaroslav Vorozhko (vorozhko) wrote :

Urusha,
I have added encoding detection to plugin version 0.3.4.
It should be published here soon.

In your case, some pages mix UTF-8 and ASCII encodings, which led to the XML problems.
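The general idea of normalizing page text before it goes into the XML stream can be sketched as follows. This is not the plugin's code (the plugin is PHP, and the 0.3.4 change is not shown in this thread). It is a Python illustration with a hypothetical helper `to_clean_utf8`, and the `cp1252` fallback encoding is purely a guess that would need to match the wiki's actual content.

```python
import re

# Characters outside the XML 1.0 Char production that can survive decoding:
# C0 control characters other than tab, LF, CR, plus U+FFFE and U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def to_clean_utf8(raw: bytes, fallback="cp1252"):
    """Decode raw as UTF-8 if possible, else with the fallback; drop XML-illegal chars."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # ASSUMPTION: the fallback encoding is a guess. errors="replace"
        # substitutes U+FFFD for bytes the fallback cannot map.
        text = raw.decode(fallback, errors="replace")
    return _XML_ILLEGAL.sub("", text).encode("utf-8")
```

Trying strict UTF-8 first and falling back only on failure keeps already-valid pages byte-for-byte intact, apart from the removal of control characters that XML 1.0 does not allow.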

urusha (urusha) wrote :

Everything works fine now in 0.3.5. Thanks.
