langmatch

Pluggable normalizers

Bug #913095 reported by Jeroen T. Vermeulen on 2012-01-07


Affects     Status    Importance  Assigned to  Milestone
langmatch   Triaged   Wishlist    Unassigned

Bug Description

There are many ways in which you might want to normalize text: keep whitespace, replace it with a single space, or remove it altogether; treat certain kinds of punctuation as interchangeable; remove markup or non-textual data; and so on.
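To make the options above concrete, here is a minimal sketch of a few such normalizers as plain Python functions. The function names and the choice of which quotation marks count as interchangeable are illustrative assumptions, not part of langmatch.

```python
import re


def keep_whitespace(text):
    """Leave the text untouched."""
    return text


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text)


def remove_whitespace(text):
    """Drop all whitespace."""
    return re.sub(r"\s+", "", text)


# Assumption for illustration: curly quotes are interchangeable with
# their straight ASCII counterparts.
_QUOTE_TABLE = str.maketrans({
    "\u2018": "'",
    "\u2019": "'",
    "\u201c": '"',
    "\u201d": '"',
})


def fold_quotes(text):
    """Treat curly and straight quotation marks as the same character."""
    return text.translate(_QUOTE_TABLE)
```

Each function takes a string and returns a string, so they can be chained or swapped freely.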

Advanced use cases are better handled by external programs that pre-process the data before it reaches langmatch. But even if it's just for experimentation, it would be helpful if the user could specify a normalization regime.

The normalization should be applied to language maps and input text alike. As a consequence, two grams that are distinct in a map may become identical once normalized; the loading code should then add their occurrence counts together.
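The count merging could look roughly like the sketch below. It assumes a language map can be represented as a mapping from gram to occurrence count; `normalize_map` and `collapse_whitespace` are hypothetical names for illustration, not langmatch's actual loading code.

```python
import re
from collections import Counter


def collapse_whitespace(gram):
    """Example normalizer: replace each run of whitespace with one space."""
    return re.sub(r"\s+", " ", gram)


def normalize_map(gram_counts, normalize):
    """Normalize every gram, summing the counts of grams that collide."""
    merged = Counter()
    for gram, count in gram_counts.items():
        merged[normalize(gram)] += count
    return merged


# "a  b" and "a\tb" both normalize to "a b", so their counts are added.
raw = {"a  b": 3, "a\tb": 2, "abc": 5}
merged = normalize_map(raw, collapse_whitespace)
```

Applying the same `normalize` function to the input text before extracting grams keeps the map and the text comparable.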

Jeroen T. Vermeulen (jtv) wrote on 2012-01-10:

The basics for this have now been done. Just add new normalizers to the langmatchlib.normalize.normalizers dict.
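As a rough sketch of what adding a normalizer might involve: the example below uses a local stand-in dict rather than importing `langmatchlib.normalize.normalizers`, because this report does not show how the real dict is keyed or what signature its entries have. The regime names, the string-in/string-out signature, and the `normalize` helper are all assumptions.

```python
import re

# Stand-in for langmatchlib.normalize.normalizers. Assumed shape:
# regime name -> function taking a string and returning a string.
normalizers = {
    "none": lambda text: text,
    "collapse-space": lambda text: re.sub(r"\s+", " ", text),
}


def strip_space(text):
    """New normalizer: remove all whitespace."""
    return re.sub(r"\s+", "", text)


# Registering a new normalizer is a single dict assignment.
normalizers["strip-space"] = strip_space


def normalize(text, regime):
    """Apply the normalizer registered under the given regime name."""
    try:
        normalizer = normalizers[regime]
    except KeyError:
        raise ValueError("Unknown normalization regime: %r" % regime)
    return normalizer(text)
```

With a registry like this, a user-specified regime name is enough to select the normalization applied to both maps and text.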

Jeroen T. Vermeulen (jtv) on 2012-04-16

Changed in langmatch:
importance:	Undecided → Wishlist
status:	New → Triaged
