New FindRelated method (using "apriori")

Bug #494288 reported by Seif Lotfy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Zeitgeist Framework
Fix Released
Undecided
Unassigned

Bug Description

We have a branch with the 1-step apriori algorithm built.
Right now it throws out the most used items with another item
We should make it configurable to be able to ask for most used interpretations of items with other items
This way we can for example ask for most used "websites" with document X
etc....
what do u think?

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

I think this is crucial. If one can not ask for related subjects withing some specific category this feature more or less just gives apps "a shot in the dark".

I have a few comments on the branch:

 * I don't grok what the key_events.revers() call is used for, but I can see the unit test fail if I remove it... Can you add a comment in the code?

 * The code is not particularly optimized. It allocates a lot of list and the SQL could be leaner too. The attached patch goes some of the way, but there is still room for improvement

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

Forgot to add - great work on the branch!

Revision history for this message
Seif Lotfy (seif) wrote : Re: [Bug 494288] Re: "apriori": get most used (websites/notes/documents/etc...)

The biggest problem right now is i only consider the first 7 items after
document X as part of the shopping list.
We have 3 options to optimize our results.
- We pick out the all entires between the open and close events of X
- We pick out all entries between an open and a modfiy event. If an open
comes after the modify we take the 7 items.
- We pick out the average amount of items between the latest 7 events for X
and use that as a shopping list size.
I am just brainstorming here but we should try.

2009/12/9 Mikkel Kamstrup Erlandsen <email address hidden>

> I think this is crucial. If one can not ask for related subjects withing
> some specific category this feature more or less just gives apps "a shot
> in the dark".
>
> I have a few comments on the branch:
>
> * I don't grok what the key_events.revers() call is used for, but I can
> see the unit test fail if I remove it... Can you add a comment in the
> code?
>
> * The code is not particularly optimized. It allocates a lot of list
> and the SQL could be leaner too. The attached patch goes some of the
> way, but there is still room for improvement
>
> ** Attachment added: "Don't build too many lists and optimize some SQL"
> http://launchpadlibrarian.net/36618860/apriori-optimize-1.patch
>
> --
> "apriori": get most used (websites/notes/documents/etc...)
> https://bugs.launchpad.net/bugs/494288
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Zeitgeist Framework: New
>
> Bug description:
> We have a branch with the 1-step apriori algorithm built.
> Right now it throws out the most used items with another item
> We should make it configurable to be able to ask for most used
> interpretations of items with other items
> This way we can for example ask for most used "websites" with document X
> etc....
> what do u think?
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/zeitgeist/+bug/494288/+subscribe
>
>

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote : Re: "apriori": get most used (websites/notes/documents/etc...)

I don't think we should consider open/close events when calculating these relations. That way it wont work for contacts and other non-file-like items.

The initial step of the algorithm: "Fetch the last 7 events for this subject uri" seems good.

The next step where you create a time range neighbourhood around each of these events, is a bit unclear to me... You create the neighbourhood as (event.timestamp, <next_event_timestamp>). This seems odd at a glance. Why not (event.timestamp - delta, event.timestamp + delta) ?

Next thing is that I think you can do the two last steps of the algorithm in one SQL query. Ie. the parts where you create the k_tuples and the part where you calculate the support of the k_tuples. Possibly:

SELECT subj_uri, count(subject_uri)
FROM event_view
WHERE (timestamp > ? AND timestamp < ?) OR (timestamp > ? timestamp < ?) OR (...) ...
GROUP BY subj_uri
ORDER BY timestamp ASC
LIMIT 5

I am sure Siegfried can do this even better though :-D

Revision history for this message
Seif Lotfy (seif) wrote : Re: [Bug 494288] Re: "apriori": get most used (websites/notes/documents/etc...)

2009/12/9 Mikkel Kamstrup Erlandsen <email address hidden>

> I don't think we should consider open/close events when calculating
> these relations. That way it wont work for contacts and other non-file-
> like items.
>
> The initial step of the algorithm: "Fetch the last 7 events for this
> subject uri" seems good.
>
> The next step where you create a time range neighbourhood around each of
> these events, is a bit unclear to me... You create the neighbourhood as
> (event.timestamp, <next_event_timestamp>). This seems odd at a glance.
> Why not (event.timestamp - delta, event.timestamp + delta) ?
>

please exlpain ur delta

The reason why i went with this neighbourhood generation is:
imagine
*
*
*x*, 1, *x*, 3, 8, 1, 2, *x*, 5, 4, *x*, 2, 6, 7

if i took the next 7 events i get

[*x*, 1, *x*, 3, 8, 1, 2, *x*]
[*x*, 3, 8, 1, 2, *x*, 5, 4]
[*x*, 5, 4, *x*, 2, 6, 7]
[*x*, 2, 6, 7]

this has lot of overlapping

however what i do by figuring out the ranges between 2 x is to allow me to
get...

[x,1] [x,3,8,1,2] [x,5,4] [x,2,6,7]

makes sense?

>
> Next thing is that I think you can do the two last steps of the
> algorithm in one SQL query. Ie. the parts where you create the k_tuples
> and the part where you calculate the support of the k_tuples. Possibly:
>
> SELECT subj_uri, count(subject_uri)
> FROM event_view
> WHERE (timestamp > ? AND timestamp < ?) OR (timestamp > ? timestamp < ?) OR
> (...) ...
> GROUP BY subj_uri
> ORDER BY timestamp ASC
> LIMIT 5
>
> I am sure Siegfried can do this even better though :-D
>
> --
> "apriori": get most used (websites/notes/documents/etc...)
> https://bugs.launchpad.net/bugs/494288
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Zeitgeist Framework: New
>
> Bug description:
> We have a branch with the 1-step apriori algorithm built.
> Right now it throws out the most used items with another item
> We should make it configurable to be able to ask for most used
> interpretations of items with other items
> This way we can for example ask for most used "websites" with document X
> etc....
> what do u think?
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/zeitgeist/+bug/494288/+subscribe
>
>

Revision history for this message
Siegfried Gevatter (rainct) wrote : Re: "apriori": get most used (websites/notes/documents/etc...)

Mikkel:

Yeah I'd like to have a single query, but the problem then is how to calculate the min_support (just having a "LIMIT 5" doesn't really make sense, as it may catch not relevant stuff or leave out very relevant stuff).

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

Siegfried: Ouch. Got me :-) How about if we do ORDER count(subj_uri) LIMIT 5 instead? It would change the ordering (quelle surprise :-)), but I am wondering what the "correct" ordering here is? By popularity or by recency?

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

Seif: Fx. delta = 15 minutes, (=900000ms). So you take into account all events that happened in a 15 minute radius of the timestamp.

As far as I can see the way you do it would be meen that you consider two events related if there are no other events in between?

I honestly don't know which approach makes the most sense. They both have cases where they are obviously wrong.

Revision history for this message
Siegfried Gevatter (rainct) wrote :

Order is by popularity, the problem is determining what is popular enough to be displayed and what not. But now that I'm thinking about it we could do the min_support relevancy as a subselect, I think I'll give that a try.

Revision history for this message
Siegfried Gevatter (rainct) wrote :

> As far as I can see the way you do it would be meen that you consider
> two events related if there are no other events in between?

No, what the timestamps do is consider the stuff in-between of each occurrence of the subject as a "set" (taking a look at the Wikipedia article on "Apriori" may help understand how it works, although what's used here isn't really the same).

I think I'd be interesting to use both this and your suggestion together, so that instead of looking for everything between timestamp1 and timestamp2 we'd look for everything from timestamp1 to timestamp1+X and from timestamp2-X to timestamp2. The problem with this is that I may open a PDF (triggering a VisitEvent), look at it for half an hour (where half an hour > X), switch to Firefox and read up some related topics for another big amount of time and go back to the PDF. The relationship between the PDF and the websites would be ignored with this approach.

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

You are right. I think a hybrid approach would give the best results

Revision history for this message
Seif Lotfy (seif) wrote : Re: [Bug 494288] Re: "apriori": get most used (websites/notes/documents/etc...)

2009/12/9 Siegfried Gevatter <email address hidden>

> > As far as I can see the way you do it would be meen that you consider
> > two events related if there are no other events in between?
>
> No, what the timestamps do is consider the stuff in-between of each
> occurrence of the subject as a "set" (taking a look at the Wikipedia
> article on "Apriori" may help understand how it works, although what's
> used here isn't really the same).
>

It would be the same if we use the close event for X
but since we dont haee it we have to use our imaginary "shopping cart"
either using the delta offset like mikkel suggested
or we use the event offset i suggested as in "next seven" non X as subject
events.

> I think I'd be interesting to use both this and your suggestion
> together, so that instead of looking for everything between timestamp1
> and timestamp2 we'd look for everything from timestamp1 to timestamp1+X
> and from timestamp2-X to timestamp2. The problem with this is that I may
> open a PDF (triggering a VisitEvent), look at it for half an hour (where
> half an hour > X), switch to Firefox and read up some related topics for
> another big amount of time and go back to the PDF. The relationship
> between the PDF and the websites would be ignored with this approach.
>
> --
> "apriori": get most used (websites/notes/documents/etc...)
> https://bugs.launchpad.net/bugs/494288
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Zeitgeist Framework: New
>
> Bug description:
> We have a branch with the 1-step apriori algorithm built.
> Right now it throws out the most used items with another item
> We should make it configurable to be able to ask for most used
> interpretations of items with other items
> This way we can for example ask for most used "websites" with document X
> etc....
> what do u think?
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/zeitgeist/+bug/494288/+subscribe
>
>

Revision history for this message
Siegfried Gevatter (rainct) wrote : Re: "apriori": get most used (websites/notes/documents/etc...)

Okay, this is now merged into lp:zeitgeist (after doing some more changes and having Markus review it again), but of course this doesn't mean that we can't continue improving it there (all the opposite!).

>> You are right. I think a hybrid approach would give the best results
Yeah. But for this we need to find a solution to the problem I explain in the last paragraph of my previous comment (https://bugs.edge.launchpad.net/zeitgeist/+bug/494288/comments/10).

Unrelated to this, one thing I've just been thinking about is that maybe we should change the function to take a list of subject URIs, instead of only one, so that you can ask for eg. stuff related to any of the currently open files and get more results than with just one. What are your thought on this?

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

> ... maybe we should change the function to take a list of subject URIs ...

Agreed

Revision history for this message
Seif Lotfy (seif) wrote : Re: [Bug 494288] Re: "apriori": get most used (websites/notes/documents/etc...)

I have a little prototype of a full apriori! sadly it is very intensive
calculation. Basically I create a daily neighborhood over the last 3 weeks.
The results hve ot be tested.

As for our current apriori we need an adaptive way of generating our
neighborhoods instead of
pulling out only x elements in a neighborhood.

I will write down my analysis in another blueprint. So we can discuss these
issues

2009/12/12 Mikkel Kamstrup Erlandsen <email address hidden>

> > ... maybe we should change the function to take a list of subject URIs
> ...
>
> Agreed
>
> --
> "apriori": get most used (websites/notes/documents/etc...)
> https://bugs.launchpad.net/bugs/494288
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Zeitgeist Framework: New
>
> Bug description:
> We have a branch with the 1-step apriori algorithm built.
> Right now it throws out the most used items with another item
> We should make it configurable to be able to ask for most used
> interpretations of items with other items
> This way we can for example ask for most used "websites" with document X
> etc....
> what do u think?
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/zeitgeist/+bug/494288/+subscribe
>
>

Revision history for this message
Seif Lotfy (seif) wrote : Re: "apriori": get most used (websites/notes/documents/etc...)

I think we need to change the api for apriori
right now we ask for "most used documents" with x
what about asking for "most used documents of mimetype a" with x
a simple solution for that is to send a second argument to the most used in form of a list of Subjects where some specifications are set. this way we can ask for not only mimetypes but interpretations or manifestations too.
What do u think?

Revision history for this message
Siegfried Gevatter (rainct) wrote : Re: [Zeitgeist] [Bug 494288] Re: "apriori": get most used (websites/notes/documents/etc...)

2009/12/18 Seif Lotfy <email address hidden>:
> what about asking for "most used documents of mimetype a" with x

That's what the result_event_template is for.

--
Siegfried-Angel Gevatter Pujals (RainCT)
Free Software Developer 363DEAE3

Revision history for this message
Siegfried Gevatter (rainct) wrote : Re: "apriori": get most used (websites/notes/documents/etc...)

Actually, changing the list of URIs to another list of events might prove beneficial.

Revision history for this message
Mikkel Kamstrup Erlandsen (kamstrup) wrote :

I mistakenly thought this bug was closed so I filed another about the Apriori API: bug #498878. Sorry for the fuzz.

Revision history for this message
Siegfried Gevatter (rainct) wrote :

I am closing this as we have the other bug for further discussion.

summary: - "apriori": get most used (websites/notes/documents/etc...)
+ New FindRelated method (using "apriori")
Changed in zeitgeist:
milestone: none → 0.3.1
milestone: 0.3.1 → none
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.