[upstream] Calc truncates data from HTML based .xls

Bug #480130 reported by Circa Lucid on 2009-11-10
This bug affects 2 people
Affects                   Status         Importance   Assigned to   Milestone
LibreOffice               Fix Released   Medium
OpenOffice                Invalid        Undecided    Unassigned
libreoffice (Ubuntu)      Fix Released   Medium       Unassigned
openoffice.org (Ubuntu)   Won't Fix      Low          Unassigned

Bug Description

Binary package hint: openoffice.org

1) lsb_release -rd
Description: Ubuntu 11.04
Release: 11.04

2) apt-cache policy libreoffice-calc
libreoffice-calc:
  Installed: 1:3.3.2-1ubuntu5
  Candidate: 1:3.3.2-1ubuntu5
  Version table:
 *** 1:3.3.2-1ubuntu5 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty-updates/main i386 Packages
        100 /var/lib/dpkg/status
     1:3.3.2-1ubuntu4 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty/main i386 Packages

3) What is expected to happen in LibreOffice Calc via the Terminal:

cd ~/Desktop && wget https://bugs.launchpad.net/ubuntu/+source/openoffice.org/+bug/480130/+attachment/1019499/+files/OE_Enrollment_Audit_20091110114007-C55555.2.xls && localc -nologo OE_Enrollment_Audit_20091110114007-C55555.2.xls

is that the file displays all 12384 rows.

4) What happens instead is it only displays data for the first 6700 rows. It shows border formatting for rows 6701 to 12384, but no data.
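To check how many rows the file really contains, independent of any spreadsheet application, a small script along these lines could count the table rows directly. This is only a sketch using Python's standard html.parser module; the names RowCounter, count_rows_and_cells and count_file are made up for illustration, and it assumes the file is plain HTML table markup with one <tr> per row.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> start tags and <td>/<th> cells in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> both match.
        if tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self.cells += 1


def count_rows_and_cells(html_text):
    """Return (rows, cells) found in the given HTML text."""
    parser = RowCounter()
    parser.feed(html_text)
    parser.close()
    return parser.rows, parser.cells


def count_file(path):
    """Same as count_rows_and_cells, but reads the HTML from a file."""
    # errors="replace" keeps the count going if the encoding is not UTF-8.
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_rows_and_cells(f.read())
```

Calling count_file() on the downloaded .xls gives a row count to compare against the last row that Calc actually fills with data.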

WORKAROUND: Use Gnumeric.

apt-cache policy gnumeric
gnumeric:
  Installed: 1.10.13-1ubuntu1
  Candidate: 1.10.13-1ubuntu1
  Version table:
 *** 1.10.13-1ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty/universe i386 Packages
        100 /var/lib/dpkg/status

WORKAROUND: Excel 2003 via WINE.

Microsoft Office Excel 2003 (11.5612.6505)

apt-cache policy wine1.3
wine1.3:
  Installed: 1.3.19-0ubuntu1~maverick1~ppa1
  Candidate: 1.3.19-0ubuntu1~maverick1~ppa1
  Version table:
 *** 1.3.19-0ubuntu1~maverick1~ppa1 0
        100 /var/lib/dpkg/status
     1.3.15-0ubuntu5 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty/universe i386 Packages

Original Report Comments: I have a web app that generates HTML-based spreadsheets with very simple tables; the amount of data displayed varies with the length of the cell contents.

ProblemType: Bug
Architecture: i386
Date: Tue Nov 10 11:42:07 2009
DistroRelease: Ubuntu 9.10
InstallationMedia: Ubuntu 9.10 "Karmic Koala" - Release i386 (20091028.5)
Package: openoffice.org-core 1:3.1.1-5ubuntu1
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-14.48-generic
SourcePackage: openoffice.org
Uname: Linux 2.6.31-14-generic i686

Chris Cheney (ccheney) on 2010-05-13
tags: added: karmic
Chris Cheney (ccheney) wrote :

[This is an automatic notification.]

This bug was reported against an earlier version of Ubuntu. Can you test whether it still occurs on Ubuntu 10.04 LTS (Lucid Lynx)?

Please note we also provide technical support for older versions of Ubuntu, but not in the bug tracker. Instead, to raise the issue through normal support channels, please see:

    http://www.ubuntu.com/support

If you are the original reporter and can still reproduce the issue on Lucid, please run the following command to refresh the report:

  apport-collect 480130

Bear in mind that you may need to install the python-launchpadlib package from the universe repository. Additionally, when prompted to give apport-collect permissions for Launchpad you will need to give it at least the ability to "Change Non-Private" data as it will be adding information to your bug report.

If you are not the original reporter, please file a new bug report, so we can work with you as the original reporter instead (you can reference bug 480130 in your report if you think it may be related):

  ubuntu-bug openoffice.org

If by chance you can no longer reproduce the issue on Lucid or if you feel it is no longer relevant, please mark the bug report 'Fix Released' or 'Invalid' as appropriate, at the following URL:

  https://bugs.launchpad.net/ubuntu/+bug/480130

Changed in openoffice.org (Ubuntu):
status: New → Incomplete
Bryan Quigley (bryanquigley) wrote :

I found this bug on OpenOffice's bug tracker. You can follow progress here: http://www.openoffice.org/issues/show_bug.cgi?id=110486

Changed in openoffice.org (Ubuntu):
status: Incomplete → Confirmed
Changed in openoffice:
status: Unknown → Confirmed

I just upgraded to 10.04, which has OOO320m12 (Build:9483), and tested. I have an Excel sheet with 17000 rows and 22 columns, and it's blank after row 3508. I can coax it to show more by slowly scrolling down and waiting while the program locks up for a good 10 seconds generating the view. Also, the scroll bar is at the bottom as if it were the last row, yet I can still scroll down, though it's all blank.

Chris Cheney (ccheney) on 2010-05-21
Changed in openoffice.org (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
summary: - OOSpreadsheet only partially opens HTML based spreadsheet.
+ [upstream] OOo calc only partially opens HTML based spreadsheet.

When importing a document (an HTML file named .XLS or .HTML) with many rows (more
than 15000), OpenOffice Calc truncates the data without showing any error
or warning.

The issue can be reproduced by importing attached file into calc. On my machine
it imports just 13635 of the 20000 rows.

Created attachment 44977
20000 rows as HTML table
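For anyone who wants to produce a comparable test file without downloading the attachment, a generator along the following lines should do. It is a sketch only: the function names, the cell text and the default of 5 columns are assumptions, since the layout of the attached file is not described here.

```python
def make_html_table(rows, cols):
    """Return an HTML document containing a single rows x cols table."""
    lines = ['<html><body><table border="1">']
    for r in range(1, rows + 1):
        # Each cell carries its own coordinates, so a missing cell is easy to spot.
        cells = "".join(f"<td>R{r}C{c}</td>" for c in range(1, cols + 1))
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)


def write_test_file(path, rows=20000, cols=5):
    """Write the generated table to path (e.g. a file named *.xls or *.html)."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(make_html_table(rows, cols))
```

Because every cell holds its row and column number, the last cell shown after import tells directly where the data stops.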

The effect is reproducible with reporter's sample document and "LibreOffice 3.3.2 – WIN7 Home Premium (64bit) German UI [OOO330m19 (Build:202 / tag 3.3.2.2)]"

I have seen a lot of documents with the file name extension .xls that have nothing to do with an EXCEL spreadsheet; the user or the application only used that name because the contents are "somehow a table".

To be honest, I do not know much about EXCEL HTML documents, except that it is a mess to work with them. IMHO that's an EXCEL problem; EXCEL should create documents with correct syntax.

Reporter's sample is not correct HTML, although the source text pretends to be HTML. At least the HTML type information is missing.
It's also not an EXCEL type spreadsheet.

MS EXCEL viewer will not open that document.

Some other observations:
OOo 3.1.1 (from an open WRITER document) will by default open the document as a WRITER-HTML document in Writer, with correct table view until "A12800"; then the table view stops and the strings from the table are shown as an endless plain text line.
I can force OOo to open the document as html-calc; then it opens the document as a spreadsheet, "E13105" is the last content shown correctly, and after that the table formatting breaks.

Exactly the same with OOo-dev 3.4

My result:
My aversion to such documents has nothing to do with the reported problem; LibO should reject the document or open it correctly (maybe with a warning message). Low priority: important data should be exported to a document with correct syntax; that's a problem of the application creating such documents.

@Marco:
You get such documents from what application?

Although the "html" code is completely different, I see something similar to the reported problem with the attachment of OOo bug
 Bug 111579 - Opening large html excel document from SAS
<http://openoffice.org/bugzilla/show_bug.cgi?id=111579>
Opening that document with LibO CALC (from WIN Explorer) the last correctly shown cell 'F6712' will have contents "PXXX09.001.AAAA.BBBB 1728". Next cell will be broken, no further contents will be shown, Table ends with date 15/09/2009

Renaming the document to .html and opening it with SeaMonkey shows that there is much more content after "15/09/2009".

(In reply to comment #3)
> Although the "html" code is completely different, I see something similar to
> the reported problem with the attachment of OOo bug
> Bug 111579 - Opening large html excel document from SAS
> <http://openoffice.org/bugzilla/show_bug.cgi?id=111579>
> Opening that document with LibO CALC (from WIN Explorer) the last correctly
> shown cell 'F6712' will have contents "PXXX09.001.AAAA.BBBB 1728". Next cell
> will be broken, no further contents will be shown, Table ends with date
> 15/09/2009
>
> Renaming document to .html and opening with SeaMonkey shows: there is much
> content behind "15/09/2009"

Yes I agree, it seems to be same issue.

(In reply to comment #3)
> Although the "html" code is completely different, I see something similar to
> the reported problem with the attachment of OOo bug
> Bug 111579 - Opening large html excel document from SAS
> <http://openoffice.org/bugzilla/show_bug.cgi?id=111579>
> Opening that document with LibO CALC (from WIN Explorer) the last correctly
> shown cell 'F6712' will have contents "PXXX09.001.AAAA.BBBB 1728". Next cell
> will be broken, no further contents will be shown, Table ends with date
> 15/09/2009
>
> Renaming document to .html and opening with SeaMonkey shows: there is much
> content behind "15/09/2009"

The .XLS extension is used for users' convenience, as that extension is
associated with LibreOffice or MS Excel by default.

Trying with MS Excel 2010, it imports that example file without a problem. It
just showed a warning that it's not an Excel file.

Such files are generated by applications which cannot create native .XLS (or
.XLSX). The example file is one I created manually to demonstrate the
issue.

However, the main issue I see here is that LibreOffice cannot import huge HTML
tables. It should either import the whole data or show a warning message.

description: updated
Changed in libreoffice (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
tags: added: lo33
summary: - [upstream] OOo calc only partially opens HTML based spreadsheet.
+ [upstream] Calc truncates data from HTML based .xls
Changed in df-libreoffice:
importance: Unknown → Medium
status: Unknown → Confirmed
Changed in openoffice.org (Ubuntu):
importance: Medium → Low

I can confirm this bug too in LibreOffice 3.4.2. It happens for me on somewhat smaller tables with around 3000 rows. Interestingly, the borders of the table are rendered down to the last row, but the data is truncated at a seemingly random point in each file, somewhere in the middle.

Changed in openoffice.org (Ubuntu):
status: Triaged → Won't Fix

[This is an automated message.]
There are no new official OpenOffice.org releases in Ubuntu packaging anymore => Won't Fix

If the problem persists, please mark this bug as "also affects project Libreoffice" or "also affects distribution Libreoffice (Ubuntu)" if that has not happened already.

Please leave references to upstream OpenOffice.org bugs in place to allow cross pollination.

It's already been marked as "also affects". I just retested and it still stops all data at about row 6700.

1) lsb_release -rd
Description: Ubuntu 11.04
Release: 11.04

2) apt-cache policy libreoffice-calc
libreoffice-calc:
  Installed: 1:3.3.3-1ubuntu2
  Candidate: 1:3.3.3-1ubuntu2
  Version table:
 *** 1:3.3.3-1ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty-updates/main i386 Packages
        100 /var/lib/dpkg/status
     1:3.3.2-1ubuntu4 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty/main i386 Packages

I have some free time this weekend and just realized LibreOffice has some easy documentation on getting started hacking at this issue. Let's see if I can fix it myself.

[This is an automated message.]
This bug was filed before the changes to Bugzilla on 2011-10-16. Thus it
started right out as NEW without ever being explicitly confirmed. The bug is
changed to state NEEDINFO for this reason. To move this bug from NEEDINFO back
to NEW please check if the bug still persists with the 3.5.0 beta1 or beta2 prereleases.
Details on how to test the 3.5.0 beta1 can be found at:
http://wiki.documentfoundation.org/QA/BugHunting_Session_3.5.0.-1

more detail on this bulk operation: http://nabble.documentfoundation.org/RFC-Operation-Spamzilla-tp3607474p3607474.html

The issue is still open and reproducible with "3.5.0 beta2".

Issue is still reproducible under v3.5.7.2 (Ubuntu v10.04 x86_64) and v4.0.1.2 (Win7).

Working on this. The limit is about 64k data cells, imposed by some underlying structures used during import.
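As a rough illustration of what a cap on cells (rather than rows) means for tables of different widths, the arithmetic can be sketched as below. Everything here is an assumption for illustration: the comment only says "64k", so CELL_LIMIT is taken as exactly 65536, the function names are hypothetical, and the sketch ignores whether empty cells count toward the limit, so it will not necessarily reproduce the exact cut-off rows reported above.

```python
CELL_LIMIT = 64 * 1024  # assumption: "64k" read as exactly 65536 cells


def rows_before_limit(cols, limit=CELL_LIMIT):
    """Number of complete rows that fit under a cap on imported cells."""
    if cols <= 0:
        raise ValueError("cols must be positive")
    return limit // cols


def is_truncated(rows, cols, limit=CELL_LIMIT):
    """True if a rows x cols table holds more cells than the cap allows."""
    return rows * cols > limit
```

Under these assumptions a wider table would hit the cap at an earlier row than a narrow one, which would fit with the cut-off row differing from file to file.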

Eike Rathke committed a patch related to this issue.
It has been pushed to "master":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=2af1f5691e8d64afd5246d245d7876b5a2cd5cd8

resolved fdo#35756 import more than 64k HTML table cells

The patch should be included in the daily builds available at
http://dev-builds.libreoffice.org/daily/ in the next 24-48 hours. More
information about daily builds can be found at:
http://wiki.documentfoundation.org/Testing_Daily_Builds
Affected users are encouraged to test the fix and report feedback.

Changed in df-libreoffice:
status: Confirmed → Fix Released

*** Bug 64168 has been marked as a duplicate of this bug. ***

*** Bug 64572 has been marked as a duplicate of this bug. ***

*** Bug 60354 has been marked as a duplicate of this bug. ***

Backport pending review for 4-0 as https://gerrit.libreoffice.org/4368

Eike Rathke committed a patch related to this issue.
It has been pushed to "libreoffice-4-0":

http://cgit.freedesktop.org/libreoffice/core/commit/?id=da11528150df545a31df3c9863bd4c3925ccdf21&h=libreoffice-4-0

resolved fdo#35756 import more than 64k HTML table cells

It will be available in LibreOffice 4.0.5.

Fix released with upstream 4.1.0 and thus included in saucy.

Changed in libreoffice (Ubuntu):
status: Triaged → Fix Released
Changed in openoffice:
importance: Unknown → Undecided
status: Confirmed → New
status: New → Invalid