Wordcount reports extra words

Bug #502508 reported by jtmasaki
This bug affects 2 people
Affects   Status   Importance   Assigned to   Milestone
PyRoom    New      Undecided    Unassigned
gedit     New      Undecided    Unassigned

Bug Description

A document containing "this doesn't seem right" is reported as having 5 words instead of 4; it looks like PyRoom splits words on the apostrophe.

Also, it looks like a trailing space at the end of a document is counted as an additional word (this is rather minor, however).

Gedit looks like it has the same bug with the apostrophes, though until just now I couldn't figure out the whitespace bug.

It looks like the bug is due to your using GTK's text widget to advance word by word. I'm not sure where to report the bug upstream, however, and it might be worth working around.

I tried to modify the wordcount code to just use whitespace a while back, but it kind of just blew up in my face (I don't know quite enough python to be helpful, I am afraid).
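
For reference, a minimal sketch of the whitespace-only approach in Python. The function names are hypothetical and not taken from PyRoom's source, and the regex variant only imitates the apostrophe-splitting behaviour described above; it is not GTK's actual word-boundary algorithm.

```python
import re


def count_words_whitespace(text):
    """Count words as runs of non-whitespace characters."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so a trailing space adds nothing.
    return len(text.split())


def count_words_alnum_runs(text):
    """Imitate the reported behaviour, where an apostrophe ends a word."""
    # \w+ matches runs of letters, digits and underscores, so "doesn't"
    # yields two matches: "doesn" and "t".
    return len(re.findall(r"\w+", text))


sample = "this doesn't seem right "
print(count_words_whitespace(sample))  # 4
print(count_words_alnum_runs(sample))  # 5
```

Counting with plain whitespace splitting also sidesteps the trailing-space issue, since empty fields are never returned.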

Tags: gedit
Nathan Haines (nhaines) wrote :

This should definitely be reported upstream, and filing the bug against gedit is probably a good start.

I wanted to thank you for investigating this bug as much as you did--while patches are always nice, they are by no means necessarily the most important part of a bug report, and the clear explanation as well as the research you did is extremely valuable.

jtmasaki (jtmasaki) wrote :

Glad I could help.

I'll see if I can find the faulty function again (it's been a few months since I tried to come up with a patch) and try to report it upstream properly (a bug against the library, that is).

Rafael Gattringer (rafael.gattringer) wrote :

OpenOffice Writer 3.1 gets it right (4 words).

tags: added: gedit
Lionel Dricot (ploum-deactivatedaccount) wrote :

This is a lot less trivial than you might expect. Indeed, it depends on the language.

In French, apostrophes fall between words, so you cannot have a uniform way of counting.

I'm also curious about English, because "I'm" is obviously two words. "doesn't" could be counted as two words too ("does not").

jtmasaki (jtmasaki) wrote :

In English, to the best of my knowledge, both "I'm" and "doesn't" are counted as single words.

I am actually not aware of a situation in English where words are separated by something other than whitespace. (Though one could possibly make the case for hyphenated words counting as multiple words.)

I believe GTK's text widget changes behavior based on the language selected. So at least a fix to the English word breaking algorithm should be possible.
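
As a rough sketch of what a per-language rule could look like (this is an illustration only, not how GTK does it): the `apostrophe_separates` flag below is a hypothetical parameter, and which languages should set it is an assumption based on the French example given earlier in this thread.

```python
def count_words(text, apostrophe_separates=False):
    """Count whitespace-separated words, optionally splitting on apostrophes."""
    if apostrophe_separates:
        # Treat both the ASCII apostrophe and the typographic one (U+2019)
        # as word separators, as suggested for French.
        for mark in ("'", "\u2019"):
            text = text.replace(mark, " ")
    return len(text.split())


# English-style counting: contractions stay single words.
print(count_words("this doesn't seem right"))  # 4

# French-style counting: the elided article counts as its own word.
print(count_words("l'homme qui rit", apostrophe_separates=True))  # 4
```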

Rafael Gattringer (rafael.gattringer) wrote :

This bug still exists in Ubuntu 10.10 Maverick Meerkat Alpha 2 + updates: version 2.30.3-0ubuntu2 (64bit).
