Comment 7 for bug 1768330

Leonard Richardson (leonardr) wrote :

I came back to this following the 4.8 release and I think I have an efficient algorithm that groups text blocks together. The catch is that there's no way to get the text blocks as a nice flat list, because block elements can contain other block elements. That's why my original plan fell apart when I started looking at nested lists. I was trying to turn a nested data structure into a list, and there's no general way to do that. Any given strategy will look good on some pages (or parts of pages) and bad on others.
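To make the nested-list problem concrete, here is a minimal sketch. It is only an illustration, not the algorithm mentioned in this comment: the `doc` tuple is a hypothetical stand-in for a parsed page, and `innermost_blocks` and `outermost_blocks` are made-up names for two plausible flattening strategies.

```python
# Hypothetical stand-in for a parsed page: a block is (tag, children),
# where each child is either a text string or another block.
doc = ("ul", [
    ("li", ["Fruit", ("ul", [("li", ["Apple"]), ("li", ["Pear"])])]),
    ("li", ["Bread"]),
])


def innermost_blocks(block):
    """Strategy A: one entry per block, holding only that block's direct text."""
    _tag, children = block
    own = " ".join(c for c in children if isinstance(c, str))
    out = [own] if own else []
    for child in children:
        if not isinstance(child, str):
            out.extend(innermost_blocks(child))
    return out


def outermost_blocks(block):
    """Strategy B: one entry per top-level child, with nested text merged in."""
    def all_text(node):
        if isinstance(node, str):
            return node
        return " ".join(all_text(c) for c in node[1])

    return [all_text(child) for child in block[1]]


print(innermost_blocks(doc))  # ['Fruit', 'Apple', 'Pear', 'Bread']
print(outermost_blocks(doc))  # ['Fruit Apple Pear', 'Bread']
```

Strategy A loses the fact that "Apple" and "Pear" belong under "Fruit"; strategy B keeps them together but glues three items into one string. Neither is right for every page, which is the point above.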

My algorithm focuses on removing 'junk' and presenting the text nodes in a way that reflects the structure of the original tree.
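A rough sketch of what "remove junk, keep the structure" could look like follows. This is not the actual algorithm, just one way to read the description. It uses the standard library's `html.parser` so it runs with no dependencies, and the contents of `JUNK_TAGS` and `BLOCK_TAGS`, along with the `TextOutline` and `outline` names, are assumptions made for the example.

```python
from html.parser import HTMLParser

JUNK_TAGS = {"script", "style"}  # assumed definition of 'junk'
BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "blockquote"}  # assumed block set


class TextOutline(HTMLParser):
    """Collect (depth, text) pairs, where depth counts open block elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of block elements currently open
        self.junk = 0    # greater than zero while inside a junk element
        self.rows = []   # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in JUNK_TAGS:
            self.junk += 1
        elif tag in BLOCK_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in JUNK_TAGS:
            self.junk = max(0, self.junk - 1)
        elif tag in BLOCK_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.junk:
            self.rows.append((self.depth, text))


def outline(html):
    """Return the non-junk text nodes of `html` with their nesting depth."""
    parser = TextOutline()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = ("<ul><li>Fruit<ul><li>Apple</li></ul></li><li>Bread</li></ul>"
          "<script>var x = 1;</script>")
for depth, text in outline(sample):
    print("  " * depth + text)
```

Keeping the depth next to each text node, instead of flattening to a bare list, lets the caller decide how much of the nesting to show.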