Database persistence for spambayes classifier data could be faster
Bug #1040874 reported by
Jean-Paul Calderone
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Quotient | Fix Committed | Undecided | Unassigned |
Bug Description
The database storage implementation for spambayes classifier information removes the O(N) cost of reading and writing a huge pickle, but there is still a big cost in loading the spam/ham information for each document token individually.
It would be better to load and save all of this data in a single statement to eliminate the repeated go-into-
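To illustrate the difference, here is a minimal sketch using Python's standard `sqlite3` module. It is not the code in Quotient/xquotient/spam.py: the `token_counts` table, its columns, and the function names are all hypothetical, and the chunk size is an assumption chosen to stay under SQLite's limit on bound parameters per statement.

```python
import sqlite3

CHUNK = 500  # assumed batch size, kept below SQLite's bound-parameter limit


def make_db(rows):
    """Build a throwaway in-memory table of (token, spam, ham) counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token_counts "
        "(token TEXT PRIMARY KEY, spam INTEGER, ham INTEGER)")
    conn.executemany("INSERT INTO token_counts VALUES (?, ?, ?)", rows)
    return conn


def load_one_at_a_time(conn, words):
    """One SELECT per token: the repeated round trip the report describes."""
    result = {}
    for word in words:
        row = conn.execute(
            "SELECT spam, ham FROM token_counts WHERE token = ?",
            (word,)).fetchone()
        if row is not None:
            result[word] = row
    return result


def load_in_batches(conn, words):
    """One SELECT per chunk of tokens, using an IN (...) clause."""
    unique = list(set(words))
    result = {}
    for start in range(0, len(unique), CHUNK):
        chunk = unique[start:start + CHUNK]
        placeholders = ", ".join("?" * len(chunk))
        query = ("SELECT token, spam, ham FROM token_counts "
                 "WHERE token IN (%s)" % placeholders)
        for token, spam, ham in conn.execute(query, chunk):
            result[token] = (spam, ham)
    return result
```

Both functions return the same mapping of token to (spam, ham) counts. The batched version issues one statement per chunk of tokens instead of one per token, which is where a read-side saving would be expected.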
Related branches
lp:~exarkun/divmod.org/spambayes-fewer-potatoes
- Tristan Seligmann: Approve
- Diff: 327 lines (+222/-14), 3 files modified
  - Quotient/benchmarks/spambayes (+44/-0)
  - Quotient/xquotient/spam.py (+91/-14)
  - Quotient/xquotient/test/test_spambayes.py (+87/-0)
Changed in quotient:
status: New → Fix Committed
I added a benchmark in this branch. Here are the results on my laptop:
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ python Quotient/benchmarks/spambayes
Learning: 0.71 ms/word
Guessing: 0.09 ms/word
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ chbranch Divmod spambayes-fewer-potatoes
exarkun@top:~/Projects/Divmod/branches/spambayes-fewer-potatoes$ python Quotient/benchmarks/spambayes
Learning: 0.72 ms/word
Guessing: 0.02 ms/word
I guess I shouldn't have tried to speed up writes (learning); SQLite3 must already do a pretty good job of those. Read (guessing) performance seems like a good win, though: it went from 0.09 to 0.02 ms/word.
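For reference, a per-word figure like the ones above can be produced with a few lines of timing code. This is only a sketch of the general idea, not the contents of Quotient/benchmarks/spambayes; the `ms_per_word` helper and its arguments are hypothetical.

```python
import time


def ms_per_word(operation, words):
    """Time operation(words) once and return milliseconds per word."""
    if not words:
        raise ValueError("need at least one word to compute a rate")
    start = time.perf_counter()
    operation(words)
    elapsed = time.perf_counter() - start
    # Convert seconds to milliseconds, then average over the word count.
    return elapsed * 1000.0 / len(words)
```

A call such as `ms_per_word(classifier_guess, words)`, where `classifier_guess` stands in for whatever function does the guessing, would give a number comparable in form to the "Guessing" lines above.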
Also did a little bit of test enhancement so the various extra codepaths introduced by caching all get covered.