I wouldn't mind gradually weaning ourselves off libxml2 and lxml. I'm pretty sure this function could be implemented fairly easily using elementtree, although we couldn't avail ourselves of libxml2's HTMLParser, which handles malformed content.

Hmm, I guess a really easy way to fix this without throwing out libxml2 or upgrading would be to just not use XPath for this extraction. Tres, any experience with any of the other interfaces for walking a tree?

--Paul

On Oct 26, 2010, at 4:32 PM, Tres Seaver wrote:

> The previous example was from my development sandbox, which
> has the buildout-compiled libxml2 version 2.6.32.
>
> The deployment build on the karlhost01 box uses the system version
> of libxml2 libraries, which are backleveled to 2.6.26. With that version,
> we get::
>
>   >>> from lxml.html import document_fromstring
>   >>> x = document_fromstring('<p>before ' +
>   ...                         '<a>inside</a>' +
>   ...                         ' after</p>')
>   >>> list(x.xpath('//text()'))
>   ['before ', ' after', 'inside']
>
> which is clearly a problem (and how the link text is being lost).
>
> The release notes for libxml2[1] show a bugfix in 2.6.30:
>
>   Bugfixes: xmlXPathNodeSetSort problem (William Brack)
>
> which sounds like it would explain this behavior. We can fix this bug by
> changing the deployment buildout to compile libxml2 / libxslt, or else
> figure out how to get a newer libxml2 RPM for the environment.
>
> [1] http://xmlsoft.org/news.html
>
> --
> undisplayed link in blog post preview
> https://bugs.launchpad.net/bugs/663399
> You received this bug notification because you are subscribed to KARL3.
>
> Status in KARL3: In Progress
>
> Bug description:
>
> Jim, once you get this entered in, I'll hand it to Tres for triage.
>
> Tres, if you read the conversation below, I think you'll get up to speed
> on this pretty quickly. I think karl.content.views.utils.extract_description
> uses an XPath expression which could probably be tinkered with.
>
> --Paul
>
> On Oct 18, 2010, at 2:35 PM, Evan McGonagill wrote:
>
> Hi Jim and Paul,
>
> Sorry I lost track of this message when it was originally sent. I
> believe our preferences are as follows; let me know if there's anything
> I'm missing:
>
> Yes, include the text inside <a> tags, but not bold or italics or any
> other HTML markers. The primary concern is just with link displays;
> bold and italicized text are not a priority.
>
> Yes, include this in all content types and "descriptions," not just blog
> descriptions.
>
> Are those requests broad enough to be easily implemented, and make
> sense/won't cause problems?
>
> Evan
>
> -----Original Message-----
> From: James B Glenn via RT
> Sent: Monday, October 18, 2010 11:00 AM
> To: Evan McGonagill
> Subject: [sixfeetup/karl-support #81715] Bug report: undisplayed link in
> blog post preview
>
> Hi Evan,
> Paul asked questions below. Would you like me to keep this support
> request open?
> or would you like to deal with this issue separately?
>
> Thanks,
>
> Jim Glenn
> KARL Champion
>
>
> On Wed Sep 22 10:17:49 2010, peveritt wrote:
>
> As background, the blog listing shows the "description" of each blog
> entry. The description is something KARL auto-extracts on each
> content item and shows in various places: search results, RSS
> feeds, content Feeds, and this blog listing.
>
> We don't make the person manually enter a description, as most people
> won't do that. So we guess by taking the first X characters of
> text. Since this is HTML, we have to be careful how we extract the
> text....we can't, for example, open an italics and forget to close
> it. Same for a table.
>
> Our current algorithm for extraction simply ignores anything nested
> inside HTML markup and only looks at the words. The words you
> mentioned are in an HTML hyperlink, thus they are getting ignored.
>
> We can make a change, we just need to think through it:
>
> - Do we only include text inside a tag? Or also bold, italics?
>   Anything else?
>
> - Is this a policy to apply on all content types, or only on blog
>   entries?
>
> - Do you want it to apply everywhere the "description" appears, or
>   only in this spot?
>
> We need to be careful about exceptions to rules, as they lead to a
> system that almost nobody can remember the requirements for. :)
> Thus, if possible, a policy that applies broadly is preferable.
>
> --Paul
>
> On Sep 21, 2010, at 6:11 PM, Evan McGonagill via RT wrote:
>
> > Hi Jim,
> >
> > I found what might be a bug: looking at the blog page in a community,
> > I'm noticing that one of the blog previews displays a kind of blank
> > space where there is a link in the actual post. It reads "Hi All, I just
> > came across this post via the Council on Foundations' Twitter Feed.
> > Apparently the , is going..."
> >
> > You can see that after the word "the" there is a blank space followed by
> > a comma.
> > The actual blog post has a link in that space which displays
> > fine when you click to view the full post. Can this be fixed so that the
> > link itself, or at least the text, is also displayed on the page before
> > you click on the individual post?
> >
> > I've attached a screenshot for your reference.
> >
> > Evan
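
For reference, a minimal sketch of the non-XPath approach floated at the top of this thread: walk the tree in document order instead of relying on how an XPath node set gets sorted. It uses the standard library's ElementTree rather than lxml, so it assumes well-formed markup and will not cope with the malformed HTML that libxml2's HTMLParser tolerates. The function name extract_text, the limit parameter, and the sample markup are hypothetical; none of this is taken from KARL's extract_description.

```python
import xml.etree.ElementTree as ET


def extract_text(markup, limit=None):
    """Return the text of well-formed markup in document order.

    Hypothetical helper for illustration only; it is not KARL's
    extract_description and handles only well-formed input.
    """
    root = ET.fromstring(markup)
    # itertext() walks the tree depth-first and yields each element's
    # text, then its children's text, then each child's tail, so link
    # text stays between the words that surround it.
    text = "".join(root.itertext())
    # Collapse runs of whitespace left over from the markup layout.
    text = " ".join(text.split())
    return text if limit is None else text[:limit]


# Assumed sample markup: a paragraph with a link in the middle.
snippet = "<p>before <a href='#'>inside</a> after</p>"

print(list(ET.fromstring(snippet).itertext()))
# ['before ', 'inside', ' after']
print(extract_text(snippet))
# before inside after
```

Because the text is gathered by a plain tree walk, the order does not depend on the installed libxml2 version. Truncating with limit after the tags are gone also sidesteps the unclosed-tag concern raised earlier in the thread, at the cost of dropping all formatting.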