libpst / readpst incorrectly decodes latin1 contacts, etc.

Bug #1470032 reported by Martin Møller on 2015-06-30
This bug affects 1 person
Affects: LIBPST. Status: New. Importance: Undecided. Assigned to: Unassigned.
Affects: libpst (Ubuntu). Importance: Undecided. Assigned to: Unassigned.

Bug Description

After a client of ours moved from Exchange 2003 to Office 365, we had to get some data out of PST files. That mostly worked well, but Contacts and some Tasks tend to be incorrectly decoded into gibberish.

As far as I can tell, the problem is that the data is assumed to be UTF-16 that needs converting to UTF-8, and the charset I specified on the readpst command line is not consulted during this conversion.

When inspecting the debug log, a human reader can see that this conversion is wrong: if anything, it should have been from the charset I specified to UTF-8, not from UTF-16 to UTF-8.

As far as I can tell, the problem occurs in the 'pst_vb_utf16to8' function, which seems to be called indiscriminately, and the charset I specify to readpst seems to be rarely used, if ever.
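To illustrate the suspected mechanism, here is a minimal Python sketch. It is not libpst code; the helper name `misread_as_utf16` is hypothetical, and it assumes the stored strings are single-byte ISO-8859-1 text that gets decoded as little-endian UTF-16.

```python
def misread_as_utf16(text: str, charset: str = "iso-8859-1") -> str:
    """Encode text in a single-byte charset, then wrongly decode it as UTF-16LE."""
    raw = text.encode(charset)
    # UTF-16 consumes two bytes per code unit, so each pair of Latin-1
    # letters collapses into one unrelated character. A trailing odd byte
    # cannot be decoded and becomes U+FFFD (the replacement character).
    return raw.decode("utf-16-le", errors="replace")


print(misread_as_utf16("Johnny"))   # three CJK characters instead of six letters
print(misread_as_utf16("Kloster"))  # odd length: the final "r" becomes U+FFFD
```

Under these assumptions, the pattern matches what shows up in the garbled contacts: CJK-looking characters, and a replacement character at the end of values with an odd number of bytes.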

I wonder if it would be possible to have a switch that presents the user with the unconverted version, possibly alongside a couple of candidate encodings, and lets the user pick the proper one. Several contacts are fine, but over 200 suffer from this garbling. Unfortunately, as far as I can tell, it is more or less impossible to get from the UTF-8 version of the non-UTF-16 data back to Latin-1.
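For values where no replacement character was emitted, the garbling may be reversible in principle by re-encoding each suspicious character to UTF-16LE and decoding the bytes as Latin-1. The following is only a sketch under that assumption: `recover_latin1` is a hypothetical helper and not part of readpst, and bytes already turned into U+FFFD stay lost.

```python
def recover_latin1(garbled: str, charset: str = "iso-8859-1") -> str:
    """Best-effort reversal of Latin-1 text that was mis-decoded as UTF-16LE."""
    out = []
    for ch in garbled:
        if ch == "\ufffd":
            # The original byte(s) were discarded; mark the gap.
            out.append("?")
        elif ord(ch) <= 0xFF:
            # Plain ASCII/Latin-1 characters were not garbled; keep them.
            out.append(ch)
        else:
            # Turn the character back into its two (or four) raw bytes
            # and read those bytes in the intended single-byte charset.
            out.append(ch.encode("utf-16-le").decode(charset))
    return "".join(out)


# "\u6f4a\u6e68\u796e" is what "Johnny" looks like after the mis-decoding.
print(recover_latin1("\u6f4a\u6e68\u796e"))
```

This operates on the text of an already exported file, so it would be a post-processing step rather than a fix; it also cannot tell a genuinely non-Latin contact from a garbled one, so it should only be applied to values known to be damaged.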

This is a sample contact that has the issue (Most are totally illegible, but a few had some text I could search for):
FN:Ballerup Politi
N:汋獯整�;潊湨祮;;;
EMAIL:慂汬牥灵倠汯瑩<U+2069>䨨䡃灀汯瑩<U+2E69>此�
ADR;TYPE=work:;;;;;;
LABEL;TYPE=work:汇<U+202E><U+E552>桤獵敶<U+206A>㤱\n慂汬牥灵 㜲〵\n慄浮牡�
TEL;TYPE=work,voice:㤳㔠‴㐱㐠‸潬慫<U+206C>㐠㌲�
TEL;TYPE=cell,voice: 72 58 78 29 (20 90 98 02)
TITLE:楖散潰楬楴潫浭獩狦
NOTE:Gladsaxe Politi (kredsen) 3969 1448\n
VERSION: 3.0
END:VCARD

Attached is a debug version of the parsing of this contact.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: pst-utils 0.6.59-1build1
ProcVersionSignature: Ubuntu 3.13.0-24.47-generic 3.13.9
Uname: Linux 3.13.0-24-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.11
Architecture: amd64
CurrentDesktop: X-Cinnamon
Date: Tue Jun 30 11:05:51 2015
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-07-27 (337 days ago)
InstallationMedia: Linux Mint 17 "Qiana" - Release amd64 20140624
ProcEnviron:
 SHELL=/bin/bash
 TERM=xterm
 PATH=(custom, no user)
 LANG=da_DK.UTF-8
 XDG_RUNTIME_DIR=<set>
SourcePackage: libpst
UpgradeStatus: No upgrade log present (probably fresh install)

Martin Møller (martin-moller) wrote :

PS:
This is the full command line I used to get the above result (and all the other contacts not shown here):

readpst -C iso-8859-1 -cv -tc -o /tmp/TAN/ -d /tmp/TAN/TAN-debug.txt Postkasse\ -\ Torben\ Andersen.pst

Also, Office 2013 behaves exactly the same way as libpst seems to, but wouldn't it be nice to have a leg up on Microsoft here?
Clearly, the data is in there, and if we can only avoid it getting garbled, it is easy enough to convert the Latin-1 entries to UTF-8 later, if need be.

Martin Møller (martin-moller) wrote :

Well, I took a stab at the problem, since our client would really like his contacts restored.

Attached is a very crude patch (I am not a C programmer, so I am taking the trial-and-error route here).
With this patch I seem to get both my Latin-1 contacts and the UTF-16 contacts written correctly.
The Latin-1 contacts are often missing their last character, and I have not dared to try removing the BOM marker, which is thus translated to a couple of y characters with different accents. Still, I would much rather have this and then replace them with a question mark before importing the VCF file, to inform the receiver that something is probably missing here.

Also, I have hardcoded the charset, as I am not sure where I would get the one I specified on the command line, or whether the one I used there would even be enough.
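The kind of decision the patch has to make can be sketched in Python as follows. This is not the attached patch and not libpst's API: `decode_pst_string` and its NUL-byte heuristic are assumptions for illustration, and the heuristic only suits Western text, where UTF-16LE has a zero in most odd byte positions.

```python
import codecs


def decode_pst_string(raw: bytes, fallback_charset: str = "iso-8859-1") -> str:
    """Guess whether raw bytes are UTF-16LE or a single-byte charset, then decode."""
    # An explicit byte order mark settles it; drop the mark before decoding.
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le", errors="replace")
    # Heuristic (assumption): Western UTF-16LE text has NUL as the high
    # byte of most code units, and its length is always even.
    high_bytes = raw[1::2]
    if len(raw) >= 2 and len(raw) % 2 == 0 and high_bytes.count(0) * 2 >= len(high_bytes):
        return raw.decode("utf-16-le", errors="replace")
    # Otherwise fall back to the user-supplied single-byte charset.
    return raw.decode(fallback_charset)


print(decode_pst_string("Møller".encode("utf-16-le")))   # treated as UTF-16LE
print(decode_pst_string("Johnny".encode("iso-8859-1")))  # treated as Latin-1
print(repr(codecs.BOM_UTF16_LE.decode("iso-8859-1")))    # the BOM read as Latin-1
```

The last line shows why a BOM that is not stripped turns into two accented-looking characters when the bytes are read as Latin-1. Taking the fallback charset as a parameter, rather than hardcoding it, is what would let a command-line option such as readpst's -C reach this decision.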

I hope someone with better understanding of the code than me can clean up the patch and then hopefully submit it.
Now if only Microsoft/Recovery Tool didn't export the data incorrectly in the first place ...

/MMO

The attachment "Crude patch that does what I need, but probably fails for non-western languages." seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Sebastien Bacher (seb128) wrote :

Thanks, that should probably be subscribed upstream for review...

Martin Møller (martin-moller) wrote :

That would probably be best, yes. I didn't see how to do so.

On Wed, 16 Sep 2015 at 15:31, Sebastien Bacher <email address hidden> wrote:

> Thanks, that should probably be subscribed upstream for review...

Sebastien Bacher (seb128) wrote :

Unsure; there is no bug tracker info on their website. Maybe email the current maintainer (info in http://hg.five-ten-sg.com/libpst/file/602869b958a3/AUTHORS), or at least forward the bug to Debian, since the Ubuntu package is based on theirs.

Ivan Zakharyaschev (imz) wrote :

The upstream maintainer turns out to be the same person as the Red Hat package maintainer, namely Carl.

So I'd suggest just reporting your problem in the Red Hat Bugzilla and reaching upstream that way.

Martin Møller (martin-moller) wrote :

I'll look into that in the near future. Thanks.
