Bogofilter seems to fail decoding base64

Bug #320829 reported by Christian Frommeyer on 2009-01-24
This bug affects 1 person
Affects: bogofilter (Ubuntu)
Assigned to: Loïc Minier

Bug Description

Binary package hint: bogofilter-bdb

Description: Ubuntu 8.04.1
Release: 8.04
Package: bogofilter-bdb
Source-Package: bogofilter
Version: 1.1.5-2ubuntu5

Over the last few days I received a lot of similar spam that passed bogofilter marked as ham. Even after tagging many of these mails (>50) as spam, this did not improve, neither for already tagged mails nor for new mails.

Looking at the raw mail text, I found that the mails, although plain text in cp1251, were base64 encoded. I therefore first assumed that bogofilter might be unable to handle base64 encoding. But base64 support has been integrated since version 0.10 and should therefore still be present in 1.1.5-2ubuntu5, which I have installed here.

A brief test brought up the following:

I tagged one of the spam mails using a new database with "bogofilter -s" and compared the database contents (retrieved via "bogoutil -d") with another new database where I tagged the same mail, but with the body and subject already decoded.

In the first DB only information on header fields was present. In the second DB there was also information regarding the body of the mail.

Thus I conclude that bogofilter did not manage to decode the mail - whereas KMail does this flawlessly.

I attach an mbox folder with a selection of mails.

Hi Christian,

thanks for providing some samples of faulty messages and a hint to the cause of the problem.

Apparently bogofilter (including upstream version 1.2.0) indeed has issues with decoding the message bodies; at least bogolexer doesn't come up with body tokens.

WRT Ubuntu Core Developers as maintainers, please forward such reports upstream. We're not actively monitoring distributor package bugs, so this isn't ever gonna get fixed unless you forward reports on short notice. Letting reports linger for half a year isn't useful.

Changed in bogofilter:
status: New → Confirmed
Changed in bogofilter (Ubuntu):
status: New → Confirmed

This is an upstream bogofilter bug.

The lexer (which extracts words from messages) misattributes part of the base64 message part to the header and splits the long base64 line into two pieces. It trashes part of the first piece and then drops it on the floor. The second piece is properly attributed to the body, but it was not split off at a four-character boundary, so the base64 decoder is out of sync and produces garbage.
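Why the four-character boundary matters can be shown with a small Python sketch. It is only an illustration of base64 alignment, not bogofilter's lexer code; the helper name decode_from and the sample text are made up for this example. Base64 maps every 4 characters to 3 bytes, so a decoder that starts in the middle of a group reads shifted bits.

```python
import base64


def decode_from(encoded: bytes, cut: int) -> bytes:
    """Decode a base64 stream starting at character offset `cut`.

    Hypothetical helper for illustration only.
    """
    tail = encoded[cut:]
    # Drop a trailing partial group so the decoder accepts the input.
    tail = tail[: len(tail) // 4 * 4]
    return base64.b64decode(tail)


# 24 bytes of sample text: a multiple of 3, so there is no '=' padding.
text = b"cheap pills, buy now!!!!"
encoded = base64.b64encode(text)

# Offset 8 is a multiple of 4: decoding resumes cleanly 6 bytes into the text.
aligned = decode_from(encoded, 8)

# Offset 6 is not a multiple of 4: every decoded byte is built from shifted bits.
misaligned = decode_from(encoded, 6)

print(aligned)
print(misaligned)
```

With the aligned cut the decoder returns the remainder of the original text, so word tokens can still be extracted. With the misaligned cut the output no longer contains any of the original words, which matches the symptom of a database holding only header tokens.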

Sorry for that.

Fixed in bogofilter's upstream Subversion repository. Relevant commit is r6848. Unfortunately, it's non-trivial, so that revision may have to be backported manually.

Thanks, Christian, for the test cases.

Changed in bogofilter:
status: Confirmed → Fix Committed

r6850 is also required to fix up an indentation issue in r6848.

bogofilter 1.2.1 has just been released. It fixes this bug as well as a quoted-printable bug where =\r sequences (in ANSI C escape notation) at line ends were not recognized. Please upgrade or backport the fixes.
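To illustrate the kind of quoted-printable problem mentioned here, the following Python sketch compares two ways of joining soft line breaks. It is an assumption-laden illustration, not bogofilter's actual code: both function names are hypothetical, and the sample line is invented. In quoted-printable, a "=" at the end of a line marks a soft line break, and in mail with CRLF line endings that "=" is followed by \r\n.

```python
import re


def join_soft_breaks_lf_only(data: bytes) -> bytes:
    """Hypothetical naive version: only recognizes '=' directly before LF."""
    return data.replace(b"=\n", b"")


def join_soft_breaks(data: bytes) -> bytes:
    """Hypothetical tolerant version: accepts '=' before CRLF or bare LF."""
    return re.sub(rb"=\r?\n", b"", data)


# A word wrapped by a soft line break, using CRLF line endings.
sample = b"this line was wr=\r\napped"

print(join_soft_breaks_lf_only(sample))
print(join_soft_breaks(sample))
```

The naive version leaves the "=\r\n" in place, so the wrapped word stays split in two, while the tolerant version rejoins it.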

Changed in bogofilter:
status: Fix Committed → Confirmed
Loïc Minier (lool) wrote :

Actually this bug uncovers an important issue with parsing of the first line of the body; bumping to high.

Changed in bogofilter (Ubuntu):
status: Confirmed → Fix Committed
importance: Undecided → Medium
assignee: nobody → Loïc Minier (lool)
Changed in bogofilter (Ubuntu Lucid):
importance: Medium → High
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bogofilter - 1.2.1-0ubuntu1

bogofilter (1.2.1-0ubuntu1) lucid; urgency=low

  * New upstream bugfix release; LP: #557468.
    - Fixes parsing of the first line of the body in MIME messages;
      LP: #320829.
 -- Loic Minier <email address hidden> Sat, 10 Apr 2010 11:08:53 +0200

Changed in bogofilter (Ubuntu Lucid):
status: Fix Committed → Fix Released