Time out error when text box has heavy HTML

Bug #547773 reported by Craig Eves
This bug affects 1 person
Affects: Mahara
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned
Milestone: none

Bug Description

The following error comes up when trying to access a user's view:

Fatal error: Maximum execution time of 30 seconds exceeded in /var/www/mahara-site-myportfolio-ac-nz/lib/htmlpurifier/HTMLPurifier/Strategy/MakeWellFormed.php on line 395

The view number is
http://myportfolio.ac.nz/view/view.php?id=13222

This view includes text blocks that have been copied and pasted from Word without using the Paste from Word button (this is hidden unless you go to full screen mode).

Unfortunately once this happens the content in the view is not able to be edited.

By the way I experimented with the clean messy code button and this doesn't seem to have any effect on content pasted from Word - is this right?

I'm not sure this is strictly a bug; it seems to be more a case of content being pasted from Word and HTMLPurifier having difficulty purifying the Word HTML.

This bug was imported from eduforge.org, see:
https://eduforge.org/tracker/index.php?func=detail&aid=3430&group_id=176&atid=739

Revision history for this message
Nigel-catalyst (nigel-catalyst) wrote :

It's a bug, the site certainly shouldn't crash like that.

The view is quite big, it seems to have 29 blocks in it. One of them is a text box with some 300K of text in it. I'm not that surprised that HTMLPurifier is choking on that, though I'm not sure what we can do about it exactly.

The 'paste from word' button - no idea. It's just some button that wysiwyg editors have. I presume it does _something_, but I couldn't tell you what.

For now, I have deleted the really big block out of that view.

Revision history for this message
Craig Eves (ceves) wrote :

Thanks Nigel

Yes, this view is a bit big - students should be encouraged to use multipage views and not to paste text from Word.

The best way of encouragement is to make this as part of the interface somehow - developing the multipage view is a step in the right direction.

Revision history for this message
François Marier (fmarier) wrote :

This might have been fixed in HTML Purifier 4.1.0 (Mahara.org and MyPortfolio.ac.nz are now running that code).

Craig, could you test that again to see if it's still a problem?

Changed in mahara:
status: Confirmed → Incomplete
Revision history for this message
François Marier (fmarier) wrote :

If it turns out that this is still a problem, we may be able to run a filter over Word's HTML before handing it over to HTML Purifier:

  http://www.codinghorror.com/blog/2006/01/cleaning-words-nasty-html.html
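
A minimal sketch of that pre-filter idea, written in Python for brevity (Mahara itself is PHP). The function name `strip_word_markup` and the pattern list are assumptions for illustration only; they are not taken from the linked article or from Mahara, and they are far from an exhaustive Word cleaner. The aim is just to shrink the input before the purifier sees it, so the purifier remains responsible for security.

```python
import re

# Illustrative patterns only; real Word output contains more variants.
_WORD_PATTERNS = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    re.compile(r"<!--\[if.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE),
    # Any remaining HTML comments
    re.compile(r"<!--.*?-->", re.DOTALL),
    # Office namespace tags such as <o:p> and </o:p>
    re.compile(r"</?[a-z]+:[^>]*>", re.IGNORECASE),
    # class="MsoNormal" and similar Word class attributes
    re.compile(r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*')", re.IGNORECASE),
    # Inline style attributes, which Word emits on almost every element
    re.compile(r"\s+style=(\"[^\"]*\"|'[^']*')", re.IGNORECASE),
]


def strip_word_markup(html: str) -> str:
    """Remove common Word-specific markup before full purification."""
    for pattern in _WORD_PATTERNS:
        html = pattern.sub("", html)
    return html
```

Regex-based cleaning is fragile on malformed markup, so this would only ever be a size-reducing first pass, not a replacement for HTML Purifier.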

Revision history for this message
Craig Eves (ceves) wrote : Re: [Bug 547773] Re: Time out error when opening a view

Hi Francois

Unfortunately I can't test with this view - this was someone else's view.

I think the paste from Word button works better now.

What happened to the full screen editor in V 1.3 - this is required for
editing long text.

regards
Craig

On Tue, Jul 13, 2010 at 4:00 PM, François Marier <email address hidden> wrote:

> This might have been fixed in HTML Purifier 4.1.0 (Mahara.org and
> MyPortfolio.ac.nz are now running that code).
>
> Craig, could you test that again to see if it's still a problem?

Revision history for this message
Richard Mansfield (richard-mansfield) wrote : Re: Time out error when opening a view

I can reproduce this by adding html to a view which links to stuff behind apache basic auth.

Changed in mahara:
status: Incomplete → Confirmed
Revision history for this message
Richard Mansfield (richard-mansfield) wrote :

I lied, I can reproduce it because my html has lots of images, the auth is not important

Changed in mahara:
milestone: none → 1.4alpha1
milestone: 1.4alpha1 → 1.4.0
Revision history for this message
Richard Mansfield (richard-mansfield) wrote :

We should check in the htmlpurifier forums to see if anyone has a good solution for this.

Revision history for this message
Richard Mansfield (richard-mansfield) wrote :

I've had another look at this, but haven't found a good solution yet, so I'm going to remove it from the 1.4 milestone for now. It's pretty easy to reproduce by putting 300k or so of relatively heavy html (I used a bunch of big tables) into a textbox. It's possible to fix it in a dirty way (tried hacking some code to add a timer & throw exceptions into HTMLPurifier/Strategy/MakeWellFormed.php and it works fine), but we need a clean way to fix the bug. It'd be nice if there was a way to set a timeout on an individual function call rather than just on the entire script.
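
The "timer and throw exceptions" approach described above can be sketched as a deadline check inside the processing loop. This is a Python illustration under stated assumptions, not the actual hack applied to MakeWellFormed.php: `PurifyTimeout`, `process_tokens`, and the per-token work are all hypothetical stand-ins.

```python
import time


class PurifyTimeout(Exception):
    """Raised when processing exceeds its own time budget."""


def process_tokens(tokens, budget_seconds, clock=time.monotonic):
    """Process tokens one by one, giving up once the budget is spent."""
    deadline = clock() + budget_seconds
    out = []
    for token in tokens:
        # Check the deadline on every iteration so the caller gets a
        # catchable exception instead of a fatal script-wide timeout.
        if clock() > deadline:
            raise PurifyTimeout(f"gave up after {budget_seconds} s")
        out.append(token.strip())  # stand-in for the real per-token work
    return out
```

The caller can then catch `PurifyTimeout` and show a message for that one block while the rest of the page still renders, which is the per-call timeout the script-wide limit does not provide.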

Changed in mahara:
milestone: 1.4.0 → 1.5.0
Changed in mahara:
milestone: 1.5.0 → none
summary: - Time out error when opening a view
+ Time out error when text box has heavy HTML
Changed in mahara:
importance: Medium → High
milestone: none → 1.6.0
Revision history for this message
Son Nguyen (ngson2000) wrote :

I realised that all text is purified each time the blocks in a view (page) are rendered. This causes the problem when a page contains lots of artefacts.
My solution is to clean text right after it is added to an artefact.
 - Many text fields in artefacts are limited to 64k characters, so they should not exceed the maximum execution time.
 - Only the HTML file block may cause the problem, as an HTML file can be big (>300k). In this case the system should raise an error message asking the user to simplify the HTML file before adding it to the page.

Any discussion?
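
The clean-on-save proposal above can be sketched roughly as follows, in Python for brevity (Mahara itself is PHP). The names `clean_on_save` and `ContentTooLarge`, the `purify` callable, and the exact limit are assumptions for illustration; the limit is one reading of the "64k characters" mentioned above.

```python
MAX_TEXT_CHARS = 64 * 1024  # assumed reading of the "64k characters" limit


class ContentTooLarge(ValueError):
    """Raised when submitted HTML exceeds the accepted size."""


def clean_on_save(raw_html, purify, max_chars=MAX_TEXT_CHARS):
    """Reject oversized input, otherwise purify once at save time."""
    if len(raw_html) > max_chars:
        raise ContentTooLarge(
            f"HTML is {len(raw_html)} characters; limit is {max_chars}. "
            "Please simplify the file before adding it to the page."
        )
    # The purified result is what gets stored, so rendering a view
    # never has to run the purifier again for this artefact.
    return purify(raw_html)
```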

Revision history for this message
Son Nguyen (ngson2000) wrote :

For existing data, the system can schedule a cron job to purify artefacts' text and send notifications to users if errors occur.

Son Nguyen (ngson2000)
Changed in mahara:
assignee: nobody → Son Nguyen (ngson2000)
status: Confirmed → In Progress
Son Nguyen (ngson2000)
Changed in mahara:
assignee: Son Nguyen (ngson2000) → nobody
status: In Progress → Confirmed
assignee: nobody → Son Nguyen (ngson2000)
status: Confirmed → In Progress
Revision history for this message
Hugh Davenport (hugh-davenport) wrote :
Changed in mahara:
assignee: Son Nguyen (ngson2000) → Hugh Davenport (hugh-catalyst)
Revision history for this message
Son Nguyen (ngson2000) wrote :

Hi Hugh;

Caching cleaned HTML output is a good solution for improving the performance of Mahara. However, it does not solve this problem, as the HTML output of some pages (views) is too big and causes HTMLPurifier to die.

Revision history for this message
Hugh Davenport (hugh-davenport) wrote :

The current thinking on a solution is to use caching as well as splitting the large HTML blocks into smaller chunks, and then using JSON or something similar to refresh on timeouts.
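
The caching half of that idea might look something like the sketch below, in Python for brevity (Mahara itself is PHP). `PurifiedCache` and its in-memory dictionary are hypothetical; a real implementation would presumably persist the cleaned output, and the chunking and refresh-on-timeout parts are not shown.

```python
import hashlib


class PurifiedCache:
    """Cache purified HTML keyed by a hash of the raw input."""

    def __init__(self, purify):
        self._purify = purify  # the expensive cleaning function
        self._store = {}       # in-memory stand-in for persistent storage
        self.misses = 0        # how often the purifier actually ran

    def get(self, raw_html):
        key = hashlib.sha256(raw_html.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._purify(raw_html)
        return self._store[key]
```

Keying on the content hash means an edited block is re-purified automatically, while unchanged blocks are served from the cache. The first purification of a very large block can still time out, which is the limitation raised in the previous comment.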

Changed in mahara:
assignee: Hugh Davenport (hugh-catalyst) → Son Nguyen (ngson2000)
Melissa Draper (melissa)
Changed in mahara:
milestone: 1.6.0 → 1.7.0
Son Nguyen (ngson2000)
Changed in mahara:
milestone: 1.7.0 → 1.8.0
Aaron Wells (u-aaronw)
Changed in mahara:
importance: High → Medium
milestone: 1.8.0rc1 → 1.6.7
milestone: 1.6.7 → none
assignee: Son Nguyen (ngson2000) → nobody
Robert Lyon (robertl-9)
Changed in mahara:
status: In Progress → Triaged
Revision history for this message
Son Nguyen (ngson2000) wrote :

I think the best way is to purify HTML text before storing it in the database.
For some big HTML blocks, we will use the chunking method Hugh mentioned.

Revision history for this message
Son Nguyen (ngson2000) wrote :

Test case:

1. Edit a page
2. Add a 'Some HTML' block
3. Upload a big html file. See my attached file
4. Click save and check the error log to see the error message

Revision history for this message
Son Nguyen (ngson2000) wrote :

Another big html file to test

Revision history for this message
Son Nguyen (ngson2000) wrote :

After analysing the profiling of Mahara, I found the most CPU-consuming task is to purify the HTML text.

I pushed https://reviews.mahara.org/2759 to speed up viewing of big HTML in a Mahara view by over 100%.

Revision history for this message
Kristina Hoeppner (kris-hoeppner) wrote :

Son, there is also Hugh's patch at https://reviews.mahara.org/#/c/1488/

Revision history for this message
Aaron Wells (u-aaronw) wrote :

I abandoned patch https://reviews.mahara.org/2759 because it requires a substantial rewrite for our current version of HTMLPurifier.

Revision history for this message
Aaron Wells (u-aaronw) wrote :

This one no longer seems to be an issue. I can no longer noticeably slow down a page's load time by putting long blocks of text in it. I'm guessing this is probably a combination of HTMLPurifier improving with later versions, and our improved enforcement of the 65,000-character limit on text fields.

Changed in mahara:
status: Triaged → Won't Fix