libpst / readpst incorrectly decodes latin1 contacts, etc.

Bug #1470032 reported by Martin Møller
This bug affects 1 person
Affects          Status  Importance  Assigned to  Milestone
LIBPST           New     Undecided   Unassigned
libpst (Ubuntu)  New     Undecided   Unassigned

Bug Description

After a client of ours moved from Exchange 2003 to Office 365, we had to get some data out of PST files. That mostly worked well, but Contacts and some Tasks tend to be incorrectly decoded into gibberish.

As far as I can tell, the data is interpreted as UTF-16 that needs to be converted to UTF-8, and the charset I specified on the readpst command line is not consulted in this conversion.

When inspecting the debug log, it is clear to the human eye that this conversion is incorrect; if anything, it should have been from the charset I specified to UTF-8, not from UTF-16 to UTF-8.

As far as I can tell, the problem occurs in 'pst_vb_utf16to8', which seems to be called indiscriminately, and the charset I specify to readpst seems to be rarely used, if ever.
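
To illustrate the suspected mechanism, here is a minimal Python sketch (not libpst code) that takes single-byte Latin-1 text and decodes its bytes as if they were UTF-16LE. The helper name garble_as_utf16le is made up for this illustration, and the input string is only a guess at what the original field contained.

```python
def garble_as_utf16le(text: str, charset: str = "iso-8859-1") -> str:
    """Encode text in a single-byte charset, then misread the bytes as UTF-16LE."""
    raw = text.encode(charset)
    # errors="replace" turns a dangling odd byte into U+FFFD instead of raising.
    return raw.decode("utf-16-le", errors="replace")


garbled = garble_as_utf16le("Ballerup Politi ")
print(garbled)
# Show the code points, since the glyphs themselves are hard to compare by eye.
print([f"U+{ord(ch):04X}" for ch in garbled])
```

Each pair of Latin-1 bytes collapses into one 16-bit code unit, which would explain why the output lands mostly in the CJK range, and why a field with an odd number of bytes ends in a replacement character.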

I wonder if it would be possible to have a switch that presents the user with the unconverted version, possibly alongside a couple of candidate encodings, and lets the user decide on the proper one. There are several contacts that are fine, but over 200 that suffer from this garbling of the data. Unfortunately, it is more or less impossible to get from the UTF-8 version of the non-UTF-16 data back to Latin-1, as far as I can tell.
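
For what it is worth, where no bytes were dropped, the garbling looks mechanically reversible: re-encode the text as UTF-16LE and decode the resulting bytes with the intended charset. The Python sketch below assumes exactly that; ungarble is a hypothetical helper, iso-8859-1 is an assumed charset, and anything already replaced by U+FFFD cannot be recovered this way.

```python
def ungarble(garbled: str, charset: str = "iso-8859-1") -> str:
    """Undo a single-byte-text-read-as-UTF-16LE mix-up where that is possible."""
    # U+FFFD marks bytes that were lost in the bad conversion; drop the marker.
    cleaned = garbled.replace("\ufffd", "")
    return cleaned.encode("utf-16-le").decode(charset, errors="replace")


# Simulate the damage for an even-length and an odd-length name, then undo it.
for name in ("Johnny", "Kloster"):
    damaged = name.encode("iso-8859-1").decode("utf-16-le", errors="replace")
    print(name, "->", ungarble(damaged))
```

The odd-length name comes back one character short, so this would only be a partial rescue of an already exported VCF file; the real fix still belongs in the conversion step.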

This is a sample contact that has the issue (Most are totally illegible, but a few had some text I could search for):
FN:Ballerup Politi
N:汋獯整�;潊湨祮;;;
EMAIL:慂汬牥灵倠汯瑩<U+2069>䨨䡃灀汯瑩<U+2E69>此�
ADR;TYPE=work:;;;;;;
LABEL;TYPE=work:汇<U+202E><U+E552>桤獵敶<U+206A>㤱\n慂汬牥灵 㜲〵\n慄浮牡�
TEL;TYPE=work,voice:㤳㔠‴㐱㐠‸潬慫<U+206C>㐠㌲�
TEL;TYPE=cell,voice: 72 58 78 29 (20 90 98 02)
TITLE:楖散潰楬楴潫浭獩狦
NOTE:Gladsaxe Politi (kredsen) 3969 1448\n
VERSION: 3.0
END:VCARD

Attached is a debug version of the parsing of this contact.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: pst-utils 0.6.59-1build1
ProcVersionSignature: Ubuntu 3.13.0-24.47-generic 3.13.9
Uname: Linux 3.13.0-24-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.11
Architecture: amd64
CurrentDesktop: X-Cinnamon
Date: Tue Jun 30 11:05:51 2015
EcryptfsInUse: Yes
InstallationDate: Installed on 2014-07-27 (337 days ago)
InstallationMedia: Linux Mint 17 "Qiana" - Release amd64 20140624
ProcEnviron:
 SHELL=/bin/bash
 TERM=xterm
 PATH=(custom, no user)
 LANG=da_DK.UTF-8
 XDG_RUNTIME_DIR=<set>
SourcePackage: libpst
UpgradeStatus: No upgrade log present (probably fresh install)

Martin Møller (martin-moller) wrote :

PS:
This is the full command line I used to get the above result (and all the other contacts not shown here):

readpst -C iso-8859-1 -cv -tc -o /tmp/TAN/ -d /tmp/TAN/TAN-debug.txt Postkasse\ -\ Torben\ Andersen.pst

Also, Office 2013 acts precisely the same way as libpst seems to, but wouldn't it be nice to have a leg up on Microsoft here?
Clearly, the data is in there and if we can only avoid it getting garbled then it is easy enough to convert the Latin1 entries later to UTF-8, if need be.

Martin Møller (martin-moller) wrote :

Well, I took a stab at the problem, since our client would really like his contacts restored.

Attached is a very crude patch (I am not a C programmer, so I am going the trial-and-error route here).
With this patch I seem to get both my Latin-1 contacts and the UTF-16 contacts written correctly.
The Latin-1 contacts are often missing their last character, and I have not dared to remove the BOM marker, which is therefore translated into a couple of y characters with different accents. Still, I would much rather have this and be able to replace them with a question mark before importing the VCF file, to tell the recipient that something is probably missing here.

Also, I have hardcoded the charset, as I am not sure where I would get the one I specified on the command line, or whether the one I used there would even be enough.
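
The idea of telling UTF-16 fields apart from single-byte ones could be sketched roughly as below. This is Python rather than C, it is not the attached patch and not how libpst works; the function names, the 50% threshold, and the iso-8859-1 fallback are all assumptions, and the heuristic would likely misfire on non-Western text.

```python
def looks_like_utf16le(raw: bytes) -> bool:
    """Rough guess: a BOM, or mostly-zero high bytes, suggests UTF-16LE."""
    if raw.startswith(b"\xff\xfe"):
        return True
    if len(raw) < 2 or len(raw) % 2:
        return False
    high_bytes = raw[1::2]
    # Western text stored as UTF-16LE has 0x00 in nearly every high byte.
    return high_bytes.count(0) >= len(high_bytes) * 0.5  # assumed threshold


def decode_field(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode as UTF-16LE only when it looks like UTF-16LE, else use fallback."""
    if looks_like_utf16le(raw):
        # The utf-16-le codec keeps a BOM as U+FEFF, so strip it afterwards.
        return raw.decode("utf-16-le", errors="replace").lstrip("\ufeff")
    return raw.decode(fallback, errors="replace")


print(decode_field("Ballerup Politi".encode("utf-16-le")))
print(decode_field("Ballerup Politi".encode("iso-8859-1")))
```

In a real fix the fallback would presumably come from the charset given to readpst rather than being hardcoded.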

I hope someone with better understanding of the code than me can clean up the patch and then hopefully submit it.
Now if only Microsoft/Recovery Tool didn't export the data incorrectly in the first place ...

/MMO

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "Crude patch that does what I need, but probably fails for non-western languages." seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Sebastien Bacher (seb128) wrote :

Thanks, that should probably be subscribed upstream for review...

Martin Møller (martin-moller) wrote : Re: [Bug 1470032] Re: libpst / readpst incorrectly decodes latin1 contacts, etc.

That would probably be best, yes. I didn't see how to do so.

On Wed, 16 Sep 2015 at 15:31, Sebastien Bacher <email address hidden> wrote:

> Thanks, that should probably be subscribed upstream for review...

Sebastien Bacher (seb128) wrote :

Unsure; there is no bug tracker info on their website. Maybe email the current maintainer (info in http://hg.five-ten-sg.com/libpst/file/602869b958a3/AUTHORS), or at least forward the bug to Debian, since the Ubuntu package is based on theirs.

Ivan Zakharyaschev (imz) wrote :

The upstream maintainer turns out to be the same person as the Red Hat maintainer, namely Carl.

So I'd suggest just reporting your problem in the Red Hat Bugzilla and reaching upstream that way.

Martin Møller (martin-moller) wrote :

I'll look into that in the near future. Thanks.

Paul Wise (Debian) (pabs) wrote : libpst: 1470032: reported upstream?

Martin Møller: did you manage to report this libpst issue to upstream via Red Hat?

I've also had success forwarding patches to upstream directly.

--
bye,
pabs

https://wiki.debian.org/PaulWise

Martin Møller (martin-moller) wrote : Re: [Bug 1470032] libpst: 1470032: reported upstream?

I don’t think I ever managed that, unfortunately.
It has not been a priority since.

On Mon, 16 Dec 2019 at 03:20, Paul Wise (Debian) <email address hidden> wrote:

> Martin Møller: did you manage to report this libpst issue to upstream
> via Red Hat?
>
> I've also had success forwarding patches to upstream directly.

Paul Wise (Debian) (pabs) wrote : libpst: 1470032: please forward patch to new libpst project on GitHub

Martin Møller: since my last post on this issue, Carl Byington has
suggested I take over the libpst project and I have done that and have
moved the project to GitHub.

https://github.com/pst-format/libpst/

Please file an upstream issue explaining the problem. Please ensure you
attach a PST file & commands that can be used to reproduce the problem.

Please also file an upstream pull request for the patch. Please mark
the pull request as a draft since you mentioned it needs cleaning up.

If you're no longer affected by this or don't have time to forward
things to the new upstream then I may eventually try to work on it.

--
bye,
pabs

https://bonedaddy.net/pabs3/
