Database persistence for spambayes classifier data could be faster

Bug #1040874 reported by Jean-Paul Calderone
Status: Fix Committed

Bug Description

The database storage implementation for spambayes classifier information removes the O(N) cost of reading and writing a huge pickle, but there is still a big cost to loading the spam/ham information for each document token individually.

It would be better to load and save all of this data in a single statement, eliminating the overhead of going into the database once per token.
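As a rough illustration of the idea (not the actual Quotient code), here is a minimal sketch using the standard library `sqlite3` module. The `token_counts` table, its columns, and both function names are hypothetical, made up for this example. The first function issues one query per token; the second fetches the counts for many tokens with a single `IN (...)` query per chunk.

```python
import sqlite3


def load_counts_one_by_one(conn, tokens):
    """One SELECT per token: pays the query overhead len(tokens) times."""
    counts = {}
    for token in tokens:
        row = conn.execute(
            "SELECT spam, ham FROM token_counts WHERE token = ?", (token,)
        ).fetchone()
        # Tokens never seen in training have no row; treat them as zero counts.
        counts[token] = row if row is not None else (0, 0)
    return counts


def load_counts_batched(conn, tokens, chunk_size=500):
    """One SELECT per chunk of tokens instead of one per token."""
    counts = {token: (0, 0) for token in tokens}
    unique = list(counts)
    # Chunking keeps the number of bound parameters below SQLite's
    # per-statement limit (999 by default in older releases).
    for start in range(0, len(unique), chunk_size):
        chunk = unique[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        query = (
            "SELECT token, spam, ham FROM token_counts "
            "WHERE token IN (%s)" % placeholders
        )
        for token, spam, ham in conn.execute(query, chunk):
            counts[token] = (spam, ham)
    return counts


# Hypothetical schema and data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts ("
    "token TEXT PRIMARY KEY, spam INTEGER, ham INTEGER)"
)
conn.executemany(
    "INSERT INTO token_counts VALUES (?, ?, ?)",
    [("viagra", 12, 0), ("meeting", 1, 9)],
)
tokens = ["viagra", "meeting", "unseen"]
batched = load_counts_batched(conn, tokens)
```

Both functions return the same mapping; the batched one just crosses into the database far fewer times for a message with many tokens.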

Jean-Paul Calderone (exarkun) wrote :

I added a benchmark in this branch. Here are the results on my laptop:

exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ python Quotient/benchmarks/spambayes
Learning: 0.71 ms/word
Guessing: 0.09 ms/word
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ chbranch Divmod spambayes-fewer-potatoes
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ python Quotient/benchmarks/spambayes
Learning: 0.72 ms/word
Guessing: 0.02 ms/word

I guess I shouldn't have tried to speed up writes (learning); SQLite3 must do a pretty good job of those already. Read (guessing) performance seems like a good win, though: 0.09 ms/word down to 0.02 ms/word.

Also did a little bit of test enhancement so the various extra codepaths introduced by caching all get covered.
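For anyone wanting to reproduce per-word timings like the ones above, a figure in ms/word can be computed along these lines. This is only a sketch and not the benchmark script in the branch; `ms_per_word` and its parameters are hypothetical names, and `operation` stands in for whatever learn or guess call is being measured.

```python
import time


def ms_per_word(operation, words, repeat=3):
    """Return the best-of-`repeat` cost of operation(words), in ms per word."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        operation(words)
        elapsed = time.perf_counter() - start
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, elapsed)
    return best * 1000.0 / len(words)


# Stand-in workload; a real run would call the classifier here.
words = ["spam", "ham", "eggs"] * 1000
cost = ms_per_word(lambda ws: sorted(ws), words)
print("Sorting: %.5f ms/word" % cost)
```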

Changed in quotient:
status: New → Fix Committed