undisplayed link in blog post preview

Bug #663399 reported by Jim B. Glenn
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
KARL3     Fix Released   Medium       Tres Seaver

Bug Description

Jim, once you get this entered in, I'll hand it to Tres for triage.

Tres, if you read the conversation below, I think you'll get up to speed on this pretty quick. I think karl.content.views.utils.extract_description uses an XPath expression which could probably be tinkered with.

--Paul

On Oct 18, 2010, at 2:35 PM, Evan McGonagill wrote:

Hi Jim and Paul,

Sorry I lost track of this message when it was originally sent. I
believe our preferences are as follows; let me know if there's anything
I'm missing:

Yes, include the text inside <a> tags, but not bold or italics or any
other HTML markers. The primary concern is just with link displays;
bold and italicized text are not a priority.

Yes include this in all content types and "descriptions," not just blog
descriptions.

Are those requests broad enough to be easily implemented, and do they
make sense without causing problems?

Evan

-----Original Message-----
From: James B Glenn via RT
Sent: Monday, October 18, 2010 11:00 AM
To: Evan McGonagill
Subject: [sixfeetup/karl-support #81715] Bug report: undisplayed link in
blog post preview

Hi Evan,
Paul asked questions below. Would you like me to keep this support
request open? Or would you like to deal with this issue separately?

Thanks,

Jim Glenn
KARL Champion

On Wed Sep 22 10:17:49 2010, peveritt wrote:

As background, the blog listing shows the "description" of each blog
 entry. The description is something KARL auto-extracts on each
 content item and shows in various places: search results, RSS
 feeds, content Feeds, and this blog listing.

We don't make the person manually enter a description, as most people
 won't do that. So we guess by taking the first X characters of
 text. Since this is HTML, we have to be careful how we extract the
 text. We can't, for example, open an italics tag and forget to close
 it. Same for a table.

Our current algorithm for extraction simply ignores anything nested
 inside HTML markup and only looks at the words. The words you
 mentioned are in an HTML hyperlink, thus they are getting ignored.

We can make a change, we just need to think through it:

- Do we only include text inside an <a> tag? Or also bold, italics?
 Anything else?

- Is this a policy to apply on all content types, or only on blog
 entries?

- Do you want it to apply everywhere the "description" appears, or
 only in this spot?

We need to be careful about exceptions to rules, as they lead to a
 system that almost nobody can remember the requirements for. :)
 Thus, if possible, a policy that applies broadly is preferable.

--Paul

On Sep 21, 2010, at 6:11 PM, Evan McGonagill via RT wrote:

Hi Jim,

I found what might be a bug: looking at the blog page in a community,
I'm noticing that one of the blog previews displays a kind of blank
space where there is a link in the actual post. It reads "Hi All, I just
came across this post via the Council on Foundations' Twitter Feed.
Apparently the , is going..."

You can see that after the word "the" there is a blank space followed by
a comma. The actual blog post has a link in that space which displays
fine when you click to view the full post. Can this be fixed so that the
link itself, or at least the text, is also displayed on the page before
you click on the individual post?

I've attached a screenshot for your reference.

Evan

<Bug shot undisplayed link in blog post preview.bmp>

--

Revision history for this message
Jim B. Glenn (jimbglenn) wrote :
Revision history for this message
Paul Everitt (paul-agendaless) wrote :

This can be handed to Chris or Tres. My guess is, we just need to tweak the utility function which extracts description, as described in the bug text.

Changed in karl3:
assignee: nobody → Tres Seaver (tseaver)
importance: Undecided → Medium
milestone: none → m49
status: New → Confirmed
Revision history for this message
Tres Seaver (tseaver) wrote :

Hmm, I can't see how the code in 'extract_description' would be discarding
the text inside the link tags::

  >>> from lxml.html import document_fromstring
  >>> x = document_fromstring('before ' +
  ...     '<a href="http://example.com/">inside</a>' +
  ...     'after.')
  >>> for chunk in x.xpath('//text()'):
  ...     print(chunk.strip())
  ...
  before
  inside
  after.

Changed in karl3:
status: Confirmed → In Progress
Revision history for this message
Tres Seaver (tseaver) wrote :

The previous example was from my development sandbox, which
has the buildout-compiled libxml2 version 2.6.32.

The deployment build on the karlhost01 box uses the system version
of libxml2 libraries, which are backleveled to 2.6.26. With that version,
we get::

  >>> from lxml.html import document_fromstring
  >>> x = document_fromstring('<p>before ' +
  ... '<a href="/example">inside</a>' +
  ... ' after</p>')
  >>> list(x.xpath('//text()'))
  ['before ', ' after', 'inside']

which is clearly a problem (and how the link text is being lost).

The release notes for libxml2[1] show a bugfix in 2.6.30:

  Bugfixes: xmlXPathNodeSetSort problem (William Brack)

which sounds like it would explain this behavior. We can fix this bug by
changing the deployment buildout to compile libxml2 / libxslt, or else
figure out how to get a newer libxml2 RPM for the environment.

[1] http://xmlsoft.org/news.html
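
To tell the two environments apart at runtime, a version check along these lines might help. This is only a sketch: it assumes the 2.6.30 xmlXPathNodeSetSort fix really is the relevant one, which is suspected above but not confirmed. The names SORT_FIX_VERSION and has_sort_fix are made up for illustration; LIBXML_VERSION is the attribute lxml.etree exposes for the libxml2 version in use.

```python
# Version in which the suspected fix (xmlXPathNodeSetSort) was released.
# Assumption: this is the fix that restores document order for //text().
SORT_FIX_VERSION = (2, 6, 30)


def has_sort_fix(libxml_version):
    """Return True if a (major, minor, patch) tuple is at or past 2.6.30."""
    return tuple(libxml_version[:3]) >= SORT_FIX_VERSION


print(has_sort_fix((2, 6, 32)))  # development sandbox -> True
print(has_sort_fix((2, 6, 26)))  # system libraries on the deployment host -> False

try:
    from lxml import etree
except ImportError:
    etree = None  # lxml not installed; skip the live check

if etree is not None:
    # LIBXML_VERSION is the libxml2 version lxml is running against.
    print(etree.LIBXML_VERSION, has_sort_fix(etree.LIBXML_VERSION))
```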

Revision history for this message
Tres Seaver (tseaver) wrote :

Another solution would be to bag the xpath() call and iterate the nodes
of the tree directly, e.g.::

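 >>> from lxml.html import document_fromstring
 >>> def iter_words(element):
 ...     # Own text first, then each child's subtree, then the tail,
 ...     # so words come out in document order. text/tail may be None.
 ...     for word in (element.text or '').split():
 ...         yield word
 ...     for child in element:
 ...         for word in iter_words(child):
 ...             yield word
 ...     for word in (element.tail or '').split():
 ...         yield word
 ...
 >>> words = iter_words(document_fromstring(x.text))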
and then consume that iterator using itertools.islice.
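
As a rough illustration of the islice step, here is a self-contained sketch. It is not the committed KARL code: iter_words, describe and max_words are hypothetical names, and the standard library's ElementTree stands in for lxml so the example runs without extra dependencies (lxml elements offer the same itertext() method).

```python
from itertools import islice
import xml.etree.ElementTree as ET


def iter_words(root):
    """Yield words lazily, in document order, without using XPath."""
    for chunk in root.itertext():
        for word in chunk.split():
            yield word


def describe(root, max_words=50):
    """Join at most max_words words; islice stops the tree walk early."""
    return ' '.join(islice(iter_words(root), max_words))


root = ET.fromstring(
    '<p>before <a href="/example">inside</a> after the link</p>')
print(describe(root, max_words=3))  # before inside after
```

Because both the generator and islice are lazy, the walk stops as soon as enough words have been collected, which is where the saving on large texts would come from.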

Revision history for this message
Paul Everitt (paul-agendaless) wrote : Re: [Bug 663399] Re: undisplayed link in blog post preview

I wouldn't mind gradually weaning ourselves off libxml2 and lxml. I'm pretty sure this function could be implemented pretty easily using elementtree, although we couldn't avail ourselves of libxml2's HTMLParser, which handles malformed content.

Hmm, I guess a really easy way to fix this without throwing out libxml2 or upgrading would be to just not use XPath for this extraction. Tres, any experience with any of the other interfaces for walking a tree?

--Paul
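
For what the elementtree trade-off mentioned above might look like in practice, here is a small sketch using the standard library's xml.etree.ElementTree. The plain_text helper is a hypothetical name, not anything in KARL. Well-formed markup yields its text in document order, while markup with an unclosed tag is rejected outright rather than repaired.

```python
import xml.etree.ElementTree as ET


def plain_text(markup):
    """Return whitespace-normalized text, or None if markup is not well-formed."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # ElementTree is a strict XML parser: no recovery from bad markup.
        return None
    return ' '.join(''.join(root.itertext()).split())


# Well-formed: link text is kept, in document order.
print(plain_text('<p>before <a href="/example">inside</a> after</p>'))
# Malformed (the <a> is never closed): parsing fails, so None is returned.
print(plain_text('<p>before <a href="/example">inside after</p>'))
```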


Revision history for this message
Tres Seaver (tseaver) wrote :

After writing a test which passed with the original implementation in my
sandbox, but failed in the staging environment, I morphed the implementation
to use the lazy islice() technique I outlined above: the test then passes
in both environments. As a bonus, the code should be a little clearer and
more efficient, especially for large texts.

Changed in karl3:
status: In Progress → Fix Committed
Revision history for this message
JimPGlenn (jpglenn09) wrote :

released

Changed in karl3:
status: Fix Committed → Fix Released