Pluggable normalizers
Bug #913095 reported by Jeroen T. Vermeulen
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
langmatch | Triaged | Wishlist | Unassigned | |
Bug Description
There are many ways in which you might want to normalize text: keep whitespace, replace it with a single space, or remove it altogether; treat certain kinds of punctuation as interchangeable; remove markup or non-textual data; and so on.
Advanced use cases are better handled by external programs that pre-process data for use with langmatch. But even if it's just for experimentation, it'd be helpful if the user could specify a normalization regime.
The normalization should be applied to language maps and text alike. This also means that two different grams in a map can become identical once normalized; when that happens, the loading code should add their occurrence counts together.
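To illustrate the count-merging point, here is a minimal sketch. It assumes a language map is a plain dict from gram to occurrence count and that a normalizer is a function from string to string; neither is confirmed as langmatch's actual representation, and the names `collapse_whitespace`, `strip_whitespace` and `normalize_map` are made up for this example.

```python
import re
from collections import Counter


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def strip_whitespace(text):
    """Remove whitespace altogether."""
    return re.sub(r"\s+", "", text)


def normalize_map(gram_counts, normalize):
    """Normalize every gram in a map, summing the counts of grams that
    become identical after normalization.

    Assumption: a map is a dict of gram -> count. This is a hypothetical
    helper, not part of langmatch.
    """
    merged = Counter()
    for gram, count in gram_counts.items():
        merged[normalize(gram)] += count
    return dict(merged)


# Three of these grams differ only in their whitespace.
raw_map = {"a b": 3, "a\tb": 2, "a  b": 1, "abc": 5}
merged = normalize_map(raw_map, collapse_whitespace)
# merged == {"a b": 6, "abc": 5}
```

The same function would be applied to the input text before it is split into grams, so that maps and text go through identical normalization.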
Changed in langmatch:
importance: Undecided → Wishlist
status: New → Triaged
The basics for this have now been done. To add a new normalizer, just add it to the langmatchlib.normalize.normalizers dict.
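As a rough sketch of what registering a normalizer might look like: the comment above only says that the normalizers live in a dict, so the key type (a name string), the value type (a function from string to string) and the lookup helper below are all assumptions. The local `normalizers` dict is a stand-in for illustration, not the real langmatchlib object.

```python
# Stand-in for the langmatchlib.normalize.normalizers dict.
# Assumption: it maps a name to a callable taking and returning a string.
normalizers = {
    "none": lambda text: text,
}


def lowercase(text):
    """Hypothetical normalizer: fold all text to lower case."""
    return text.lower()


# Registering a new normalizer is then a single dict assignment.
normalizers["lowercase"] = lowercase


def get_normalizer(name):
    """Hypothetical lookup helper: fetch a normalizer by name."""
    try:
        return normalizers[name]
    except KeyError:
        raise ValueError("Unknown normalizer: %r" % name)


normalized = get_normalizer("lowercase")("Hello World")
# normalized == "hello world"
```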