KARL3

undisplayed link in blog post preview

Bug #663399 reported by Jim B. Glenn on 2010-10-19

This bug affects 1 person

Affects  Status        Importance  Assigned to  Milestone
KARL3    Fix Released  Medium      Tres Seaver  KARL3 m49

Bug Description

Jim, once you get this entered in, I'll hand it to Tres for triage.

Tres, if you read the conversation below, I think you'll get up to speed on this pretty quick. I think karl.content.views.utils.extract_description uses an XPath expression which could probably be tinkered with.

--Paul

On Oct 18, 2010, at 2:35 PM, Evan McGonagill wrote:

Hi Jim and Paul,

Sorry I lost track of this message when it was originally sent. I
believe our preferences are as follows; let me know if there's anything
I'm missing:

Yes, include the text inside <a> tags, but not bold or italics or any
other HTML markers. The primary concern is just with link displays, but
bold and italicized text are not a priority.

Yes include this in all content types and "descriptions," not just blog
descriptions.

Are those requests broad enough to be easily implemented, and make
sense/won't cause problems?

Evan

-----Original Message-----
From: James B Glenn via RT
Sent: Monday, October 18, 2010 11:00 AM
To: Evan McGonagill
Subject: [sixfeetup/karl-support #81715] Bug report: undisplayed link in
blog post preview

Hi Evan,
Paul asked questions below. Would you like me to keep this support
request open, or would you like to deal with this issue separately?

Thanks,

Jim Glenn
KARL Champion

On Wed Sep 22 10:17:49 2010, peveritt wrote:

As background, the blog listing shows the "description" of each blog
entry. The description is something KARL auto-extracts on each
content item and shows in various places: search results, RSS
feeds, content Feeds, and this blog listing.

We don't make the person manually enter a description, as most people
won't do that. So we guess by taking the first X characters of
text. Since this is HTML, we have to be careful how we extract the
text: we can't, for example, open an italics tag and forget to close
it. The same goes for a table.

Our current algorithm for extraction simply ignores anything nested
inside HTML markup and only looks at the words. The words you
mentioned are in an HTML hyperlink, thus they are getting ignored.

We can make a change; we just need to think through it:

- Do we only include text inside a <a> tag? Or also bold, italics?
Anything else?

- Is this a policy to apply on all content types, or only on blog
entries?

- Do you want it to apply everywhere the "description" appears, or
only in this spot?

We need to be careful about exceptions to rules, as they lead to a
system that almost nobody can remember the requirements for. :)
Thus, if possible, a policy that applies broadly is preferable.

--Paul

On Sep 21, 2010, at 6:11 PM, Evan McGonagill via RT wrote:

Hi Jim,

I found what might be a bug: looking at the blog page in a community,
I'm noticing that one of the blog previews displays a kind of blank
space where there is a link in the actual post. It reads "Hi All, I just
came across this post via the Council on Foundations' Twitter Feed.
Apparently the , is going..."

You can see that after the word "the" there is a blank space followed by
a comma. The actual blog post has a link in that space which displays
fine when you click to view the full post. Can this be fixed so that the
link itself, or at least the text, is also displayed on the page before
you click on the individual post?

I've attached a screenshot for your reference.

Evan

Jim B. Glenn (jimbglenn) wrote on 2010-10-19:

Bug shot undisplayed link in blog post preview.bmp (2.3 MiB, image/x-ms-bmp)

Jim B. Glenn (jimbglenn) wrote on 2010-10-19:

reported via rt:
https://rt01.sixfeetup.com/Ticket/Display.html?id=81715

Paul Everitt (paul-agendaless) wrote on 2010-10-19:

This can be handed to Chris or Tres. My guess is, we just need to tweak the utility function which extracts description, as described in the bug text.

Changed in karl3:
assignee:	nobody → Tres Seaver (tseaver)
importance:	Undecided → Medium
milestone:	none → m49
status:	New → Confirmed

Tres Seaver (tseaver) wrote on 2010-10-26:

Hmm, I can't see how the code in 'extract_description' would be discarding
the text inside the link tags::

  >>> from lxml.html import document_fromstring
  >>> x = document_fromstring('before ' +
  ... '<a href="http://example.com/">inside</a>' +
  ... 'after.')
  >>> for chunk in x.xpath('//text()'):
  ...     print(chunk.strip())
  ...
  before
  inside
  after.

Changed in karl3:
status:	Confirmed → In Progress

Tres Seaver (tseaver) wrote on 2010-10-26:

The previous example was from my development sandbox, which
has the buildout-compiled libxml2 version 2.6.32.

The deployment build on the karlhost01 box uses the system version
of libxml2 libraries, which are backleveled to 2.6.26. With that version,
we get::

  >>> from lxml.html import document_fromstring
  >>> x = document_fromstring('<p>before ' +
  ... '<a href="/example">inside</a>' +
  ... ' after</p>')
  >>> list(x.xpath('//text()'))
  ['before ', ' after', 'inside']

which is clearly a problem (and how the link text is being lost).
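
To illustrate why the ordering matters, here is a minimal sketch. It assumes
the description is built by joining the text chunks and keeping only the
first few words; the `preview` helper and its `max_words` parameter are
hypothetical stand-ins, not KARL's actual code. With the chunks out of
document order, the link text is pushed to the end and is the first thing
lost to truncation.

```python
def preview(chunks, max_words):
    """Join text chunks and keep only the first `max_words` words.

    Hypothetical helper for illustration; not the real extract_description.
    """
    words = " ".join(chunks).split()
    return " ".join(words[:max_words])


# Chunks in document order, as returned in the development sandbox.
in_order = ['before ', 'inside', ' after']
# Chunks as returned on the deployment box: link text sorted last.
out_of_order = ['before ', ' after', 'inside']

print(preview(in_order, 2))      # link text survives truncation
print(preview(out_of_order, 2))  # link text is cut off
```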

The release notes for libxml2[1] show a bugfix in 2.6.30:

Bugfixes: xmlXPathNodeSetSort problem (William Brack)

which sounds like it would explain this behavior. We can fix this bug by
changing the deployment buildout to compile libxml2 / libxslt, or else
figure out how to get a newer libxml2 RPM for the environment.

[1] http://xmlsoft.org/news.html
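
If it helps to tell the two environments apart, a version comparison along
these lines could be used. This is only a sketch: the threshold assumes the
2.6.30 bugfix noted above really is the relevant one, and `parse_version`,
`FIRST_FIXED` and `has_sort_fix` are hypothetical names. In a real check the
running version could be read from `lxml.etree.LIBXML_VERSION`, which is
already a tuple of integers.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.6.26' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Assumption: first libxml2 release whose notes list the
# xmlXPathNodeSetSort bugfix; not verified against the library source.
FIRST_FIXED = (2, 6, 30)


def has_sort_fix(version_text):
    """Return True if the given libxml2 version is at or past FIRST_FIXED."""
    return parse_version(version_text) >= FIRST_FIXED


print(has_sort_fix("2.6.32"))  # development sandbox version
print(has_sort_fix("2.6.26"))  # deployment box version
```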

Tres Seaver (tseaver) wrote on 2010-10-26:

Another solution would be to bag the xpath() call and iterate the nodes
of the tree directly, e.g.::

  >>> from lxml.html import document_fromstring
  >>> def iter_words(html):
  ...     d = document_fromstring(html)
  ...     for element in d.iter():
  ...         # .text and .tail may be None, so guard before splitting
  ...         for word in (element.text or '').split():
  ...             yield word
  ...         for word in (element.tail or '').split():
  ...             yield word

and then consume that iterator using itertools.islice.
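
A self-contained sketch of the tree-walking idea, paired with
itertools.islice, might look like the following. It is not the committed
fix: it uses the standard library's xml.etree.ElementTree instead of lxml, so
it assumes well-formed markup, and the names `iter_words` and `first_words`
are hypothetical. One subtlety it handles: an element's tail follows its
closing tag, so the tail is emitted after the element's children to keep the
words in document order.

```python
from itertools import islice
import xml.etree.ElementTree as ET


def iter_words(element):
    """Lazily yield the words under `element` in document order."""
    for word in (element.text or "").split():
        yield word
    for child in element:
        # Everything inside the child comes before the child's tail.
        yield from iter_words(child)
        for word in (child.tail or "").split():
            yield word


def first_words(markup, max_words):
    """Return the first `max_words` words of well-formed markup."""
    root = ET.fromstring(markup)
    # islice stops pulling from the generator once enough words are seen.
    return " ".join(islice(iter_words(root), max_words))


print(first_words('<p>before <a href="/example">inside</a> after</p>', 2))
```

Because the generator is lazy, the walk stops as soon as islice has enough
words, which is where the saving on large texts would come from.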

Paul Everitt (paul-agendaless) wrote on 2010-10-26: Re: [Bug 663399] Re: undisplayed link in blog post preview

I wouldn't mind gradually weaning ourselves off libxml2 and lxml. I'm pretty sure this function could be implemented pretty easily using elementtree, although we couldn't avail ourselves of libxml2's HTMLParser, which handles malformed content.

Hmm, I guess a really easy way to fix this without throwing out libxml2 or upgrading would be to just not use XPath for this extraction. Tres, any experience with any of the other interfaces for walking a tree?

--Paul

Tres Seaver (tseaver) wrote on 2010-10-26:

After writing a test which passed with the original implementation in my
sandbox, but failed in the staging environment, I morphed the implementation
to use the lazy islice() technique I outlined above: the test then passes
in both environments. As a bonus, the code should be a little clearer and
more efficient, especially for large texts.