non-normalized concepts exist

Bug #445125 reported by Ken Arnold
This bug affects 1 person
Affects: ConceptNet
Status: Fix Committed
Importance: Medium
Assigned to: Rob Speer

Bug Description

I noticed that some concepts don't seem to be normalized:

>>> Concept.get('balls', 'en')
<Concept: <en: balls>>
>>> Concept.get('ball', 'en')
<Concept: <en: ball>>
>>> Concept.get('balls', 'en').surfaceform_set.all()[0]
<SurfaceForm: balls>
>>> Concept.get('balls', 'en').get_assertions().count()
45

Where'd that come from?

Ken Arnold (kenneth-arnold) wrote :

_Lots_ of non-normalized concepts exist:

>>> from csc.conceptnet.models import *
>>> from csc.nl import get_nl
>>> en_nl = get_nl('en')
>>> bad_surfaces = []
>>> for text, normalized in SurfaceForm.objects.filter(language='en').order_by().values_list('text', 'concept__text').iterator():
...     if en_nl.normalize(text) != normalized:
...         bad_surfaces.append(text)
...
>>> len(bad_surfaces)
29955

Ken Arnold (kenneth-arnold) wrote :

And they're not the Verbosity import, at least not all of them. Here's a sample:

communicate verbally
may be charming, but they can also be expensive to heat
take the path less taken
probably got the book at the library
the statement 'visa is one credit card' helps answer the question 'what
no one else will take care of them
getting loved
can't sleep during the day
when a saw
express repressed feelings
getting bored easily
statement 'a criminal is likely to have a weapon.' helps answer the question 'what

Though part of the issue may be that we changed normalization. For example, the last phrase's concept text is:
"statement 'a criminal likely weapon.' help answer question 'what"

but en_nl.normalize(that) is:
"statement criminal likely weapon help answer question"

(This ignores the utter uselessness of that concept; it's just for the purpose of example.)
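
One way to find concepts left over from an older normalizer is to check whether each stored concept text is a fixed point of the current one. The sketch below assumes the normalizer is meant to be idempotent on its own output, which this thread does not confirm. `toy_normalize` and `stale_concepts` are hypothetical names, and `toy_normalize` is only a stand-in for `en_nl.normalize`, not the real csc.nl code.

```python
import string

# Hypothetical stopword list; the real normalizer's list is not shown in this thread.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "what"}


def toy_normalize(text):
    """Stand-in normalizer: lowercase, strip punctuation, drop stopwords."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return " ".join(w for w in words if w not in STOPWORDS)


def stale_concepts(concept_texts, normalize=toy_normalize):
    """Return concept texts that change when normalized again."""
    return [t for t in concept_texts if normalize(t) != t]


concepts = [
    "statement 'a criminal likely weapon.' help answer question 'what",
    "statement criminal likely weapon help answer question",
]
stale = stale_concepts(concepts)
```

Here only the first text is flagged, since re-normalizing it strips the leftover quotes and period.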

Rob Speer (rspeer) wrote :

In working on fixing this, I've stumbled across a suboptimal decision I made a while ago. I wanted the concept "people" to not match its normalized form, which MBLEM thinks is still "people", singular. So I associated the SurfaceForm "people" with the concept "person" manually, and I guess I assumed that we'd be using SurfaceForms to do nl normalization. It seemed like a good way to override special cases.

This was probably dumb. We want to be able to do normalization without having the database at all. I'm going to try to convince MBLEM that "people" is foremost the plural of "person".

Ken Arnold (kenneth-arnold) wrote : Re: [Bug 445125] Re: non-normalized concepts exist

[If you're getting this message and don't actually care about the
internals of ConceptNet, let me know -- I suspect our bug traffic is
bugging more people than necessary.]

I agree that normalization should be able to happen without an active
database -- I use that behavior, actually.

Since we have full control of calling MBLEM, we can maintain a static
list of lemmatization overrides, just as a dict. Unless it turns out
to be very easy to change MBLEM for a special case, the override
approach would give us an easy way to fix anything we notice in the
future also.
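
A minimal sketch of that override idea, assuming a plain dict consulted before the lemmatizer. `LEMMA_OVERRIDES`, `toy_lemmatize`, and `lemmatize` are hypothetical names, and `toy_lemmatize` is a crude stand-in, not MBLEM's actual interface.

```python
# Hypothetical static override table, checked before the lemmatizer runs.
LEMMA_OVERRIDES = {
    "people": "person",
}


def toy_lemmatize(word):
    """Stand-in for MBLEM (not its real API): strips a trailing 's'."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def lemmatize(word, overrides=LEMMA_OVERRIDES, fallback=toy_lemmatize):
    """Return the override for `word` if one exists, else use the fallback."""
    if word in overrides:
        return overrides[word]
    return fallback(word)


people = lemmatize("people")
balls = lemmatize("balls")
```

Because the table is just module-level data, it needs no database, and a new special case is a one-line addition.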

Thanks for working on this. I suspect we'll need to reparse at some
point. Let's unit test the parse process before we do that, though ;)

-Ken

On Thu, Oct 8, 2009 at 3:32 PM, Rob Speer <email address hidden> wrote:
> In working on fixing this, I've stumbled across a suboptimal decision I
> made a while ago. I wanted the concept "people" to not match its
> normalized form, which MBLEM thinks is still "people", singular. So I
> associated the SurfaceForm "people" with the concept "person" manually,
> and I guess I assumed that we'd be using SurfaceForms to do nl
> normalization. It seemed like a good way to override special cases.
>
> This was probably dumb. We want to be able to do normalization without
> having the database at all. I'm going to try to convince MBLEM that
> "people" is foremost the plural of "person".

Rob Speer (rspeer) wrote :

I implemented an exception mechanism, fixing "people -> person" and "ground -> ground" (not "grind") while I was at it, and started a script to update the concepts (fix_abnormal_concepts.py).

The update is estimated to finish in about 3 days. :(

Changed in conceptnet:
assignee: nobody → Rob Speer (rspeer)
importance: Undecided → Medium
milestone: none → 4.0
status: New → In Progress
Francisco Dalla Rosa (francisco-s) wrote :

I guess I should post this here rather than as a new bug.

Some concepts exist both in a fully normalized version and in a not-quite-normalized counterpart: "hold meet" and "hold meeting" are the pair I caught. I don't know whether this will already be handled along with the rest of the fixes for the "not properly normalized concepts" problem, but I thought it was worth mentioning.
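
Pairs like this could be found by grouping concept texts on their re-normalized form and keeping any group with more than one member. This is only a sketch: `toy_normalize` and `find_duplicate_groups` are hypothetical names, and the toy normalizer just strips a trailing "ing" in place of the real lemmatizer.

```python
from collections import defaultdict


def toy_normalize(text):
    """Stand-in normalizer (not the real one): strips a trailing 'ing'."""
    words = []
    for word in text.lower().split():
        if word.endswith("ing") and len(word) > 5:
            word = word[:-3]
        words.append(word)
    return " ".join(words)


def find_duplicate_groups(concept_texts, normalize=toy_normalize):
    """Map each normalized form to the concept texts that collapse onto it."""
    groups = defaultdict(list)
    for text in concept_texts:
        groups[normalize(text)].append(text)
    # Keep only forms shared by more than one concept text.
    return {key: texts for key, texts in groups.items() if len(texts) > 1}


groups = find_duplicate_groups(["hold meeting", "hold meet", "ball"])
```

Each resulting group is a candidate for merging into the single concept named by its key.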

Rob Speer (rspeer)
Changed in conceptnet:
status: In Progress → Fix Committed