Quotient

Database persistence for spambayes classifier data could be faster

Bug #1040874 reported by Jean-Paul Calderone on 2012-08-23

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Quotient	Fix Committed	Undecided	Unassigned

Bug Description

The database storage implementation for spambayes classifier information removes the O(N) cost of reading and writing a huge pickle, but there's still a big cost for loading the spam/ham information for each document token individually.

It would be better to load and save all of this data in a single statement to eliminate the repeated go-into-the-database overhead.
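As a rough illustration of the idea (not the code in the branch), here is a minimal sketch using the standard library sqlite3 module. The token_counts table and the function names are hypothetical, made up for this example; Quotient's real storage layer is not shown here and will differ.

```python
import sqlite3


def load_counts_one_by_one(conn, tokens):
    """One SELECT per token: pays the per-query overhead len(tokens) times."""
    counts = {}
    for token in tokens:
        row = conn.execute(
            "SELECT spam, ham FROM token_counts WHERE token = ?", (token,)
        ).fetchone()
        if row is not None:
            counts[token] = row
    return counts


def load_counts_batched(conn, tokens, chunk_size=500):
    """One SELECT per chunk of tokens instead of one per token."""
    tokens = list(tokens)
    counts = {}
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        # Build "?,?,?" with one placeholder per token in this chunk.
        placeholders = ",".join("?" * len(chunk))
        query = (
            "SELECT token, spam, ham FROM token_counts "
            "WHERE token IN (%s)" % placeholders
        )
        for token, spam, ham in conn.execute(query, chunk):
            counts[token] = (spam, ham)
    return counts


def save_counts(conn, counts):
    """Write all updated counts with one executemany call in one transaction."""
    rows = [(token, spam, ham) for token, (spam, ham) in counts.items()]
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO token_counts (token, spam, ham) "
            "VALUES (?, ?, ?)",
            rows,
        )


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts "
    "(token TEXT PRIMARY KEY, spam INTEGER, ham INTEGER)"
)
save_counts(conn, {"viagra": (10, 0), "meeting": (1, 7)})
result = load_counts_batched(conn, ["viagra", "meeting", "unknown"])
```

The chunk size is kept below 999 because older SQLite builds limit a single statement to 999 bound parameters by default. Tokens that have never been seen are simply absent from the returned dictionary.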

Related branches

lp:~exarkun/divmod.org/spambayes-fewer-potatoes

Merged into lp:divmod.org at revision 2698

Tristan Seligmann: Approve on 2012-09-13

Jean-Paul Calderone (exarkun) wrote on 2012-08-23:

I added a benchmark in this branch. Here are the results on my laptop:

exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ python Quotient/benchmarks/spambayes
Learning: 0.71 ms/word
Guessing: 0.09 ms/word
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ chbranch Divmod spambayes-fewer-potatoes
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ python Quotient/benchmarks/spambayes
Learning: 0.72 ms/word
Guessing: 0.02 ms/word

I guess I shouldn't have tried to speed up writes (learning); SQLite3 must do a pretty good job of those already. Read (guessing) performance seems like a good win, though.

Also did a little bit of test enhancement so the various extra codepaths introduced by caching all get covered.

Jean-Paul Calderone (exarkun) on 2012-09-13

Changed in quotient:
status:	New → Fix Committed
