Using filters (text match / exclusion / etc)

Bug #485966 reported by Siegfried Gevatter on 2009-11-20
This bug affects 1 person
Affects               Status         Importance   Assigned to   Milestone
Zeitgeist Framework   Fix Released   Critical     Markus Korn
  Declined for 0.1 by Seif Lotfy
  Declined for 0.2 by Seif Lotfy
  0.3                 Invalid        Critical     Markus Korn
  0.4                 Fix Released   Critical     Unassigned

Bug Description

 - How can I get all events except those with interpretation VISIT_EVENT?

 - In case you give me above the awful answer "looking for all interpretations except VISIT_EVENT", how can I get all events except those from application "firefox.desktop"?

 - How can I get all those events whose URI ends with "myfile.txt"?

 - And those events whose title contains "zeitgeist"?

Changed in zeitgeist:
importance: Undecided → Critical
milestone: none → 0.3

For this question to make sense you have to give a use case and explain why it is not good enough to filter out events manually in the Firefox example.

For nitpicking: events can have a "title that contains zeitgeist". They can have a subject with text field containing 'zeitgeist'. If we want to make such queries efficient we need a full text index on the event.subject_text column, which would require the FTS extension for sqlite and I am not quite sure that this is a standard dependency..?

If we are serious about scaling to millions of events then we can never under any circumstance allow queries that force us into a table scan. And I think massive scalability is more important than esoteric queries, but this is simply my opinion, and up to discussion.

If zeitgeist was implemented on top of something like Xapian or Lucene instead of SQLite we could allow more complex queries. But for the near- to mid term future I don't see that happening.

On the negation issue I suppose we could add some sort of negation switch (e.g. '!') before the fields where we use controlled identifiers (that'd be uri, interpretation, manifestation, mimetype, etc., but not payload or subject_text). But I still need a convincing argument showing that we need this...
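
A minimal sketch of how such a switch could be read off a template field, assuming a leading '!' marks exclusion and an empty template field matches anything. The names `parse_negation` and `field_matches` are hypothetical and not part of the Zeitgeist API.

```python
NEGATION_PREFIX = "!"  # assumed operator, as floated in this comment


def parse_negation(template_value):
    """Return (negated, bare_value) for one template field."""
    if template_value.startswith(NEGATION_PREFIX):
        return True, template_value[len(NEGATION_PREFIX):]
    return False, template_value


def field_matches(template_value, actual_value):
    """Compare one event field against one template field."""
    if not template_value:
        return True  # assumption: an empty template field matches anything
    negated, wanted = parse_negation(template_value)
    if negated:
        return actual_value != wanted
    return actual_value == wanted
```

With this, `field_matches("!firefox.desktop", actor)` would keep every event whose actor is not firefox.desktop.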

Sorry; in the second paragraph above I meant "... events CAN'T have a..."

Seif Lotfy (seif) wrote :

I think the exclusion makes a lot of sense, and I think the "!" would be a good indicator for exclusion. For example, this would be interesting if I want to query for my last 100 events excluding those from Firefox, since Firefox kinda bloats everything here. I can see myself using this a lot later in relevancy and for blacklists. Especially blacklists will be an issue sooner or later.

Do we really want to block 0.3.0 because of this bug? I would say no, otherwise we will ship some very untested code if we want to ship 0.3.0 this weekend. I say defer to 0.3.1, it's only a micro API break.

Let's push this to 0.3.1 then.
This is no break to the API, so we can roll it out later.

2009/11/26 Mikkel Kamstrup Erlandsen <email address hidden>

> Do we really want to block 0.3.0 because of this bug? I would say no,
> otherwise we will ship some very untested code if we want to ship 0.3.0
> this weekend. I say defer to 0.3.1, it's only a micro API break.

Changed in zeitgeist:
milestone: 0.3.0 → 0.3.1

Markus Korn (thekorn) wrote :

I added a negation switch, using "!" as operator to some fields in the attached branch.
There is also *one* testcase, I would like to add some more before merging this, but I'm running out of time now.

Any comments?

Changed in zeitgeist:
assignee: nobody → Markus Korn (thekorn)
status: New → In Progress

> Any comments?

Yes :-)

We also need to support matching semantics in the Event.matches_template() method. If FindEventIds() and Event.matches_template() do not line up 100% then monitors installed with InstallMonitor() will not coincide with the Find* results.

There is also the open request to have prefix queries. This could be done by appending (or prepending, for coherency?) a * to the template field. This again raises the problem of integrating this with the ! operator. If the branch is merged without prefix support then we should open a separate ticket on this.
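
One way to check that the database query and the in-memory matcher line up is to run both over the same data and compare. The sketch below does this for the "!" operator only; `sql_filter`, `matches_template_field` and `check_consistency` are hypothetical helpers, and the one-column `event` table is an illustration rather than the real Zeitgeist schema.

```python
import sqlite3


def sql_filter(column, template_value):
    """Translate one template field into a WHERE clause and its parameters."""
    # The column name is a trusted constant here, never user input.
    if template_value.startswith("!"):
        return f"{column} != ?", [template_value[1:]]
    return f"{column} = ?", [template_value]


def matches_template_field(template_value, actual_value):
    """In-memory counterpart of sql_filter(), as a monitor would need."""
    if template_value.startswith("!"):
        return actual_value != template_value[1:]
    return actual_value == template_value


def check_consistency(actors, template_value):
    """True if the SQL path and the in-memory path select the same rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, actor TEXT)")
    conn.executemany("INSERT INTO event (actor) VALUES (?)",
                     [(actor,) for actor in actors])
    clause, params = sql_filter("actor", template_value)
    found = [row[0] for row in conn.execute(
        f"SELECT actor FROM event WHERE {clause} ORDER BY id", params)]
    expected = [a for a in actors if matches_template_field(template_value, a)]
    conn.close()
    return found == expected
```

One place where the two paths can silently diverge is missing values: in SQL, `NULL != 'x'` is not true, so such rows are dropped, while a Python `None != "x"` comparison is true.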

Markus Korn (thekorn) wrote :

I updated lp:~thekorn/zeitgeist/negation_switch,
 * it has wildcard support ('*' is used)
 * added logic to Event.matches_template() and Subject.matches_template()

This update adds a new convention to zeitgeist: "*" in fields which support wildcard queries must be escaped as "\*"

TODO:
 * adding helper function to escape "*"
 * adding more tests
 * documenting "!" and "*"
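
The escaping helper from the TODO list could look roughly like this. It is only a sketch of the "\*" convention described above; `escape_wildcard` and `unescape_wildcard` are hypothetical names, not functions from the branch.

```python
WILDCARD = "*"


def escape_wildcard(text):
    """Make a literal value safe to use in a wildcard-enabled field."""
    return text.replace(WILDCARD, "\\" + WILDCARD)


def unescape_wildcard(text):
    """Inverse of escape_wildcard(): turn every '\\*' back into '*'."""
    return text.replace("\\" + WILDCARD, WILDCARD)
```

A value that already has a backslash directly in front of a "*" is ambiguous under this simple convention; the sketch does not try to resolve that.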

Markus Korn (thekorn) wrote :

Moved to the 0.3.2 milestone as implementation details need to be discussed; see the discussion on the merge proposal.

Changed in zeitgeist:
milestone: 0.3.1 → 0.3.2
Changed in zeitgeist:
milestone: 0.3.2 → 0.3.3

Let's revive this old bug, and get the code in a mergeable state. Would be a nice feature for 0.3.3.

I still think we should limit *-queries to prefix queries only. Without a full text index queries with * inter*sper*sed in the strings will require a full table scan and expensive string checking. With a log of 1M events this will *completely* take down a regular netbook for several minutes.

This will diminish the syntax to be ! as prefix and * as suffix and should simplify the code a bit... Prefix queries can use the index on the textual columns (at least they do so in most db systems i know).
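
To illustrate the point about indexes, here is a small SQLite experiment. It assumes a prefix query is rewritten as a half-open range comparison, which is one common technique and not necessarily what the branch does; the `subject` table, the index name and both helpers are made up for the example.

```python
import sqlite3


def prefix_bounds(prefix):
    """Half-open range [lower, upper) covering all strings that start with prefix."""
    # Bump the last character by one code point to get the exclusive upper bound.
    # Assumes a non-empty prefix ending in an ordinary character.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper


def query_plan(conn, sql, params):
    """Return the detail strings of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subject (id INTEGER PRIMARY KEY, uri TEXT)")
conn.execute("CREATE INDEX subject_uri ON subject (uri)")
conn.executemany(
    "INSERT INTO subject (uri) VALUES (?)",
    [("file:///home/a.txt",), ("file:///tmp/b.txt",), ("http://example.org/",)],
)

lower, upper = prefix_bounds("file:///home/")
prefix_sql = "SELECT id FROM subject WHERE uri >= ? AND uri < ?"
infix_sql = "SELECT id FROM subject WHERE uri LIKE ?"

prefix_plan = query_plan(conn, prefix_sql, (lower, upper))  # range on indexed column
infix_plan = query_plan(conn, infix_sql, ("%a.txt",))       # leading wildcard
prefix_hits = conn.execute(prefix_sql, (lower, upper)).fetchall()
```

On typical SQLite builds the range query is reported as an index SEARCH while the leading-wildcard LIKE is reported as a SCAN over every row. The range comparison uses SQLite's default binary collation, so it is also case sensitive.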

I am not saying that full text querying is not useful - quite the contrary. I am saying that if we want to support full text querying then we should do it properly.

Seif Lotfy (seif) wrote :

Can we split it into little parts?
I would like to have exclusion on its own :)

2010/3/2 Mikkel Kamstrup Erlandsen <email address hidden>

> Let's revive this old bug, and get the code in a mergeable state. Would
> be a nice feature for 0.3.3.
>
> I still think we should limit *-queries to prefix queries only. Without
> a full text index queries with * inter*sper*sed in the strings will
> require a full table scan and expensive string checking. With a log of
> 1M events this will *completely* take down a regular netbook for several
> minutes.
>
> This will diminish the syntax to be ! as prefix and * as suffix and
> should simplify the code a bit... Prefix queries can use the index on
> the textual columns (at least they do so in most db systems i know).
>
> I am not saying that full text querying is not useful - quite the
> contrary. I am saying that if we want to support full text querying then
> we should do it properly.

Siegfried Gevatter (rainct) wrote :

We should finally get this in with 0.3.4. (I don't think the wildcards will be a problem, btw, but will do some benchmarking next week).

Markus Korn (thekorn) wrote :

I will start hacking on this again today; I plan to split it into two parts.

1.) I will start with negation
2.) the next step is wildcards

There are still a few open questions for me:

wrt 1.)
   * for which fields do we allow negation?
      - Subject.{Uri, Interpretation, Manifestation, Origin, Mimetype}
        (Text is too expensive, and Storage makes no sense as we have an Enum for both cases)
      - Event.{Interpretation, Manifestation, Actor}
        (we do not allow searching by Id, we have different ways to query by Timestamp)

wrt 2.)
   * we will only allow wildcard searches like "sometext*"; and not "some*xt" or "*text"
   * Allowed fields for wildcard searches are:
      - Subject.{Uri, Origin, Mimetype}
      - Event.{Actor,}

        (in both cases Manifestation/Interpretation are covered by expansion feature, I think allowing wildcards is not necessary)
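
The field lists above can be written down as two lookup tables plus a check that rejects unsupported operators. This is a sketch only: lower-case field names, the `check_template_field` helper and the use of `ValueError` are assumptions, and escaped "\*" is not handled here.

```python
# Fields that may carry a leading "!" (taken from the list under 1.).
NEGATION_FIELDS = {
    "subject": {"uri", "interpretation", "manifestation", "origin", "mimetype"},
    "event": {"interpretation", "manifestation", "actor"},
}
# Fields that may carry a trailing "*" (taken from the list under 2.).
WILDCARD_FIELDS = {
    "subject": {"uri", "origin", "mimetype"},
    "event": {"actor"},
}


def check_template_field(kind, field, value):
    """Raise ValueError if value uses an operator that kind.field does not allow."""
    body = value
    if body.startswith("!"):
        if field not in NEGATION_FIELDS[kind]:
            raise ValueError(f"negation is not allowed on {kind}.{field}")
        body = body[1:]
    if "*" in body:
        if field not in WILDCARD_FIELDS[kind]:
            raise ValueError(f"wildcards are not allowed on {kind}.{field}")
        # Only "sometext*" is accepted; "some*xt" and "*text" are rejected.
        # A lone "*" passes this check; whether it should is left open here.
        if body.index("*") != len(body) - 1:
            raise ValueError("the wildcard must be the last character")
```

For example, `check_template_field("subject", "mimetype", "text/*")` passes silently, while `check_template_field("subject", "uri", "*text")` raises.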

Any comments?

I think it makes perfect sense what you laid out. The only thing I am considering is subject.text. I am not sure whether or not to add negation and/or prefix queries for that one...

Markus Korn (thekorn) wrote :

I'll not support Subject.text for now, but make it easy to add this field.

I mean, when I look at the values in my database, they turn out to be very random, like titles of web pages (e.g. "Television program of 2010-05-12, 4 - 5 p.m"); negation support makes close to no sense for this case.

Markus Korn (thekorn) wrote :

I think I got the negation part working in lp:~thekorn/zeitgeist/negation_support
Before a merge, this branch needs a few more tests, I hope to work on it tomorrow early morning.
After I got this branch landed I will work on the wildcard part.

Markus Korn (thekorn) wrote :

I've started working on the wildcard part in lp:~thekorn/zeitgeist/wildcard_support
And I'm adding another constraint: searches are case sensitive.

Please cry out loud if you think that case sensitive searches are a bad thing.

Siegfried Gevatter (rainct) wrote :

2010/5/14 Markus Korn <email address hidden>:
> Please cry out loud if you think that case sensitive searches is a bad
> thing.

I'm crying out loud!

(not really constructive, I know :P)

Markus Korn (thekorn) wrote :

This bug is fixed; both the negation part and the wildcard part have landed in lp:zeitgeist.

------------------------------------------------------------
revno: 1470 [merge]
committer: Markus Korn <email address hidden>
branch nick: trunk
timestamp: Fri 2010-05-14 13:02:32 +0200
message:
  Added negation support to some template fields, this is the first part of a
  fix of bug 485966, thanks Mikkel for doing the review.
------------------------------------------------------------

------------------------------------------------------------
revno: 1474 [merge]
committer: Markus Korn <email address hidden>
branch nick: trunk
timestamp: Sun 2010-05-16 17:21:55 +0200
message:
  Added wildcard support to some query fields, thanks to Seif and Siegfried
  for reviewing the code.
  Wildcards can be at the end of some query fields, like
     mimetype=text/*
  which queries for all subjects with mimetype beginning with 'text/', see the
  related bugreport (LP: #485966) for detailed information.
------------------------------------------------------------

Just for the record: +1 for case sensitivity. We don't wanna go down the case-insensitive route. That's for full text indexers, as it's generally not as simple as that (far from it, in fact). E.g. how about transliteration? For instance, does 'û' match 'u' (apply the same logic for all Unicode glyphs ad nauseam)? So I am all for strict prefix matching.
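
A few lines of Python show why "just ignore case" is less simple than it sounds. This is only an illustration using the standard library; `strict_prefix_match` is a hypothetical name for the strict behaviour argued for above.

```python
import unicodedata


def strict_prefix_match(prefix, value):
    """Case-sensitive, code-point-exact prefix test."""
    return value.startswith(prefix)


# 'û' (U+00FB) is a single code point, but it decomposes into
# 'u' followed by a combining circumflex.
decomposed = unicodedata.normalize("NFD", "\u00fb")

# Case folding can even change the length of a string: 'ß' folds to 'ss'.
folded = "Stra\u00dfe".casefold()
```

Under strict matching none of these equivalences apply: "Text/" is not a prefix of "text/plain", and "u" is not a prefix of "û".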

Changed in zeitgeist:
milestone: 0.3.4 → 0.4.0