Whitespace is too significant for factoids

Bug #411160 reported by Space Hobo
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Yardbird
Fix Released
Medium
Zen

Bug Description

Our factoid code nicely becomes case-insensitive, but unfortunately it treats whitespace as significant. This causes the following problem::

 <@SpaceHobo> shakalaka: zoot boot
 <shakalaka> No way!
 <@SpaceHobo> shakalaka: zoot boot
 <shakalaka> I don't know what you mean, SpaceHobo.

We need to collapse whitespace down both in learn() and trigger(). The solution to this may also inform the discussion on bug#409225.

Tags: iotower

Related branches

Space Hobo (spacehobo)
description: updated
Changed in yardbird:
status: New → Triaged
importance: Undecided → Medium
milestone: none → another-hairdo
Revision history for this message
Space Hobo (spacehobo) wrote :

Per our discussion on IRC, we agreed that the 'key' variable from user input should be normalized in a function that is used both in learn() and in trigger(), and we should add an re.sub(r'\s+', ' ', key) call to it.

Changed in yardbird:
assignee: nobody → Zen (shaboom)
Revision history for this message
Space Hobo (spacehobo) wrote :

We ought to further consider normalizing away all punctuation, so that we would replace \s+ with \W+ in the substitution.

Revision history for this message
Zen (shaboom) wrote :

I implemented both, so that whitespace is preserved as single spaces and punctuation is removed completely. This has the advantage of not letting the presence of a comma or hyphen matter, but it does mean you can't make a unique response in cases where it might be important.

Changed in yardbird:
status: Triaged → Fix Committed
Revision history for this message
Space Hobo (spacehobo) wrote :

Here's a log of the discovery of this potential wrinkle:

 <@SpaceHobo> the a-team
 <shakalaka> In 1972 a crack commando unit was sent to prison by a military court for a crime they didn’t commit. These men promptly escaped from a maximum security stockade to the Los Angeles underground. Today, still wanted by the government, they survive as soldiers of fortune. If you have a problem, if no one else can help, and if you can find them, maybe you can hire the A-Team.
 <@SpaceHobo> the ateam
 <shakalaka> In 1972 a crack commando unit was sent to prison by a military court for a crime they didn’t commit. These men promptly escaped from a maximum security stockade to the Los Angeles underground. Today, still wanted by the government, they survive as soldiers of fortune. If you have a problem, if no one else can help, and if you can find them, maybe you can hire the A-Team.
 <@Zen> delicious
 <@SpaceHobo> shakalaka: delicious is pernicious
 <shakalaka> Roger that, SpaceHobo.
 <@SpaceHobo> del.icio.us
 <shakalaka> del.icio.us is pernicious
 <@Zen> haha
 <@Zen> except
 <@Zen> I would want del.icio.us to be per.nicio.us

Revision history for this message
Space Hobo (spacehobo) wrote :

Zen, looking at your branch, I see that you change \s+ to a single space first, then remove \W+. Since \W includes all of \s (and, in fact, the single space you replace it with), this is a little redundant. You may want to try [^\w\s]+ to kill all non-alnum non-whitespace characters.

Revision history for this message
Zen (shaboom) wrote :

I fixed that. Also, punctuation was handled after whitespace, so there were sometimes multiple spaces. I switched them around.

Also altered commands.py to leave trailing punctuation, since it was only aesthetic anymore. Input and output were differing in confusing ways.

We still need to decide whether stripping punctuation is really desired.

Revision history for this message
Zen (shaboom) wrote :

Example of why I edited commands.py:

16:45 <@SpaceHobo> shakalaka: literal f.r.o.o.-f.r.o.o.
16:45 < shakalaka> f.r.o.o.-f.r.o.o =is= <reply> ballin'

The factoid key is stored as just "froofroo", but at its earliest parsing the trailing "." was snipped. This is kind of awkward, but since the key is stored without this anyway I just left the trailing punctuation on for all other purposes:

16:58 <@SpaceHobo> shakalaka: literal f.r.o.o.-f.r.o.o.
16:58 < shakalaka> f.r.o.o.-f.r.o.o. =is= <reply> ballin'

That caused this problem:

22:20 <@Zen> shakalaka: ballin'?
22:20 < shakalaka> ballin'? is the illest

We don't want the question mark, but we DO want the apostrophe. So I hacked up commands.py some more so that when factoids are triggered or literaled, any periods, question marks, and exclamation points sitting at the end of the key get removed.

This means acronyms with periods and morse code are both kind of impossible. But it basically works.

Revision history for this message
Space Hobo (spacehobo) wrote :

The current branch has the following problem:

 23:00 <@SpaceHobo> "internet"
 23:00 <shakalaka> internet" is http://ecx.images-amazon.com/images/I/51K5MWCVQAL._SS500_.jpg

Since we're still stripping leading \W

Revision history for this message
Space Hobo (spacehobo) wrote :
Revision history for this message
xkcd (xkcd) wrote :

There's nothing more heart-warming than the sound of a child's laughter.

Revision history for this message
xkcd (xkcd) wrote :

We do this with Robot9000 (the no-repeating-sentences enforcer) and it seems to work without too many unexpected collisions, although I'm not sure how we'd tell when one happened. But my guess is we'll get some weird moments plus one or two places where it's annoying that we can't have separate factoids for something (like possibly "ha ha" versus "haha"). On balance I'm in favor, at least until we see what it's like. Just leave the code so it's not too hard to uncollapse whitespace and if we need to we'll change it back.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

xkcd:

  The trick here is that infobots need to *repeat* the text back to you. We are trying to normalize just the database keys, and leave the printed text as-was. Most of the movement on this merge request (https://code.launchpad.net/~shaboom/yardbird/shakalaka/+merge/9953) is about stripping trailing question marks and such without losing trailing apostrophes or other data.

We decided to leave individual spaces after normalization to keep the database keys legible when we use the Web-based admin.

Revision history for this message
Nick Moffitt (nick-moffitt) wrote : Re: [Bug 411160] Re: Whitespace is too significant for factoids

xkcd:
> (like possibly "ha ha" versus "haha"). On balance I'm in favor, at
> least until we see what it's like. Just leave the code so it's not
> too hard to uncollapse whitespace and if we need to we'll change it
> back.

So if you look at the patch, we *collapse* whitespace, but *remove*
punctuation. So "ha ha" and "haha" are already two different keys, but
"ha ha" (two spaces) and "ha ha" (one space) are the same.

--
Nick Moffitt

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

After discussion in IRC, I've decided to merge Zen's branch, and we can later open further bugs that reference this one.

Space Hobo (spacehobo)
Changed in yardbird:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.