Pluggable normalizers

Bug #913095 reported by Jeroen T. Vermeulen
This bug affects 1 person
Affects: langmatch
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

There are many ways in which you might want to normalize text: keep whitespace, replace it with a single space, or remove it altogether; treat certain kinds of punctuation as interchangeable; remove markup or non-textual data; and so on.

Advanced use cases are better handled by external programs that pre-process data for use with langmatch. But even if it's just for experimentation, it'd be helpful if the user could specify a normalization regime.

The normalization should be applied to language maps and text alike. This also means that two distinct grams in a map can become identical once normalized; when that happens, the loading code should sum their occurrence counts into a single entry.
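
To illustrate the merging requirement, here is a minimal sketch in Python. It does not use langmatch's actual loading code: `collapse_whitespace` and `load_map`, and the assumption that a map arrives as (gram, count) pairs, are all hypothetical and chosen only for the example.

```python
import re
from collections import Counter


def collapse_whitespace(text):
    """Hypothetical normalizer: replace each run of whitespace with one space."""
    return re.sub(r"\s+", " ", text)


def load_map(raw_counts, normalize):
    """Build a gram -> count map, summing counts of grams that normalize alike.

    `raw_counts` is assumed to be an iterable of (gram, count) pairs; the
    real langmatch map format may differ.
    """
    merged = Counter()
    for gram, count in raw_counts:
        merged[normalize(gram)] += count
    return merged


# Three spellings of the same gram collapse into one entry.
raw = [("a  b", 3), ("a b", 2), ("a\tb", 1), ("bc", 4)]
language_map = load_map(raw, collapse_whitespace)

# The same function is applied to the text being matched.
sample_text = collapse_whitespace("a \n b")
```

Passing the same `normalize` function to the map loader and to the input text is what keeps the two sides comparable.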

Jeroen T. Vermeulen (jtv) wrote :

The basics for this have now been done. Just add new normalizers to the langmatchlib.normalize.normalizers dict.
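
A rough sketch of what a name-to-function registry of this kind could look like follows. The layout of the real langmatchlib.normalize.normalizers dict is not described here, so the code uses a local stand-in dict and assumes each entry maps a name to a callable that takes and returns a string. The entry names and the `fold_quotes` and `normalize` helpers are hypothetical.

```python
import re

# Local stand-in for a registry of normalizers. This is NOT the real
# langmatchlib.normalize.normalizers dict; its actual shape may differ.
normalizers = {
    "keep": lambda text: text,
    "single-space": lambda text: re.sub(r"\s+", " ", text),
    "strip": lambda text: re.sub(r"\s+", "", text),
}

# Map typographic quotes onto their ASCII equivalents.
_QUOTE_TABLE = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def fold_quotes(text):
    """Hypothetical normalizer: treat curly and straight quotes as the same."""
    return text.translate(_QUOTE_TABLE)


# Adding a normalizer is a single dict assignment.
normalizers["fold-quotes"] = fold_quotes


def normalize(text, name):
    """Apply the normalizer registered under `name` (hypothetical helper)."""
    return normalizers[name](text)
```

With a registry like this, a user-selected regime is just a key lookup, and an unknown name fails loudly with a `KeyError`.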

Changed in langmatch:
importance: Undecided → Wishlist
status: New → Triaged