Activity log for bug #778140

Date Who What changed Old value New value Message
2011-05-05 21:15:30 Siegfried Gevatter bug added bug
2011-05-05 21:16:38 Siegfried Gevatter zeitgeist: importance Undecided Medium
2011-05-05 21:16:38 Siegfried Gevatter zeitgeist: status New Triaged
2011-05-05 21:16:38 Siegfried Gevatter zeitgeist: assignee Siegfried Gevatter (rainct)
2011-05-07 11:40:16 Siegfried Gevatter summary MOVE_EVENT handling doesn't support unsorted events for the same uri Time warp problem in MOVE_EVENT handling
2011-05-07 11:42:18 Launchpad Janitor branch linked lp:zeitgeist
2012-02-18 22:43:26 Siegfried Gevatter description

Old value:

<RainCT> seiflotfy: the query updating current_uri on MOVE_EVENT should only change stuff with timestamp<move_event_timestamp
<seiflotfy> RainCT, true
<seiflotfy> good catch
<RainCT> seiflotfy: there's also another ugly case
<seiflotfy> RainCT, do tell
<RainCT> seiflotfy: Imagine you insert event: 0. A, 1. A, 2. A, 3. B, 4. C, 5. A. Then with timestamp between events 3 and 4 you get A->T, so now you have "T, T, T, B, C, A"
[...]
<seiflotfy> i see the problem
<seiflotfy> the last A should be a T
<RainCT> no, the last A is fine
<RainCT> because it is a new file with the same name
[...]
<seiflotfy> yeah ok
<RainCT> seiflotfy: now you get A-L with timestamp between 1 and 2, so it should have "L, L, T, B, C, A", but since the current_uri of the first is already L you won't see it. for this it'd need to check the original URI instead of the current_uri
<RainCT> are you with me so far?
<seiflotfy> trying to
<seiflotfy> RainCT, ok i cont get your last point
<seiflotfy> A-L wont change the T
<seiflotfy> because A has been move to T
<seiflotfy> you can not move A again
<seiflotfy> you need to move T
<seiflotfy> thus its does not work
<RainCT> yeah, but it should, because you're being told that it was moved before that
<seiflotfy> unless you really want to you will have to look for all "MOVE_EVENTS" with A and figure out what A is now
<seiflotfy> its doable
<RainCT> so the move that happened later in time didn't affect those events, only the later ones
<seiflotfy> RainCT, true
<RainCT> the easy way to solve this is checking subj_id instead of subj_id_current
<seiflotfy> RainCT, i think we should raise an exception
<RainCT> but now when it gets really messed up is if there was even another move event before that
<RainCT> which was already logged
<seiflotfy> "You tried to move and event after it was used in a new location"
<seiflotfy> RainCT, actually we also have the MOVE_EVENT logged
<seiflotfy> you can then try to figrue out the patch of A
<RainCT> yes, that's the solution
<seiflotfy> the path
<seiflotfy> RainCT, but i highly discourage that
<RainCT> you can find the previos move event and set timestamp>previous_move_event.timestamp
<seiflotfy> RainCT, exactly
<seiflotfy> i am +- 0 on that tbh
<seiflotfy> not sure
<RainCT> ok, I don't dislike finding the previous timestamp
<RainCT> i'll open a bug

New value:

MOVE EVENTS
============================================

PRESENTATION

By definition, Zeitgeist's events are immutable, and the subject metadata they contain is a snapshot of how a given resource was back when the event happened.

To be useful, some way of linking event subjects to their physical representation is needed. The primary identifier for doing this is the subject's URI. However, URIs, especially local ones, are transient and may change.

To solve this problem, a new field was added to subjects, and it is special in that it isn't considered to be immutable. This is the `current_uri' field.

INITIAL IDEA

When a subject is inserted, its `current_uri' field is initially set to the same value as its `uri' field. When Zeitgeist receives a MOVE_EVENT for that file (with a coherent timestamp), the value of `current_uri' is updated to its new file name.

The idea here is that this is done in a way that, if we deleted the `current_uri' of all subjects and restored them looking at all MOVE_EVENTs in the database, the result would be the same as before.

CURRENT IMPLEMENTATION

As of now, `current_uri' is initially set to the same value as `uri'. Once a MOVE_EVENT is inserted, all events with a timestamp before that of the move are updated. However, once the MOVE_EVENT has been inserted, it is never considered again.

This is so for performance reasons, since the initial plan would require pretty much "rebuilding the database".

PROBLEMS

There are numerous problems with this implementation, at least in theoretical situations.

One problem is that of events coming in after the MOVE_EVENT (maybe because the application is batching them). In this case they won't be updated.

We also have the opposite problem, a MOVE_EVENT coming in late after another conflicting MOVE_EVENT happened. For instance, we have the following events:

> T5 a.txt, T10 a.txt, T15 a.txt

We receive a first MOVE_EVENT from a.txt to b.txt with timestamp T7. Now we have (time / current_uri):

> T5 a.txt, T10 b.txt, T15 b.txt

Finally, we receive a further MOVE_EVENT from a.txt to c.txt with timestamp T0. The result is:

> T5 c.txt, T10 b.txt, T15 b.txt

This is totally inconsistent; the correct result would have been:

> T5 c.txt, T10 c.txt, T15 b.txt

Further, even if implemented as described in the "initial idea" section, the concept is flawed in that events may be inserted retrospectively already using their updated URI. This could give rise to further inconsistencies.

PROPOSAL

No clear way to avoid this problem is evident. Maybe the best idea is to formalize the current behavior by documenting it and requesting that MOVE and DELETE events be inserted in near real time (for local files).

ADDITIONAL PROPOSAL

So far we haven't taken resource deletions into account at all. However, those also affect the URI of a resource, in that it ceases to exist (and may be subsequently reused for an unrelated resource).

For this reason, I propose that DELETE_EVENTs also update `current_uri'. In particular, they should change said URI to "" (empty).
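The replay described under INITIAL IDEA can be sketched in Python and applied to the scenario from the IRC discussion (events A, A, A, B, C, A; a move A->T between events 3 and 4; a late-arriving move A->L between events 1 and 2). This is only an illustration under stated assumptions: `Event`, `Move` and `replay_moves` are hypothetical names, not Zeitgeist's API, and the fractional timestamps 1.5 and 3.5 are assumed values standing in for "between" the events.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float
    uri: str               # immutable URI recorded at insertion time
    current_uri: str = ""  # mutable; recomputed by replay_moves()


@dataclass(frozen=True)
class Move:
    timestamp: float
    old_uri: str
    new_uri: str


def replay_moves(events, moves):
    """Recompute current_uri for every event, starting from its original uri."""
    ordered = sorted(moves, key=lambda m: m.timestamp)
    for event in events:
        current = event.uri
        for move in ordered:
            # Only a move that happened after the event can rename its subject.
            if move.timestamp > event.timestamp and move.old_uri == current:
                current = move.new_uri
        event.current_uri = current
    return [e.current_uri for e in events]


# Events 0..5 with URIs A, A, A, B, C, A (timestamp = position, an assumption).
events = [Event(t, uri) for t, uri in enumerate("AAABCA")]
# Moves listed in arrival order; the replay sorts them by timestamp.
moves = [Move(3.5, "A", "T"), Move(1.5, "A", "L")]

print(replay_moves(events, moves))  # ['L', 'L', 'T', 'B', 'C', 'A']
```

Because the replay starts from the original `uri' and walks the moves in timestamp order, the order in which the MOVE_EVENTs arrived does not matter. The cost is the one named under CURRENT IMPLEMENTATION: every event is revisited.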
2012-03-18 12:19:55 Siegfried Gevatter description

Old value: identical to the new value of the 2012-02-18 22:43:26 entry above.

New value:

MOVE EVENTS
============================================

PRESENTATION

By definition, Zeitgeist's events are immutable, and the subject metadata they contain is a snapshot of how a given resource was back when the event happened.

To be useful, some way of linking event subjects to their physical representation is needed. The primary identifier for doing this is the subject's URI. However, URIs, especially local ones, are transient and may change.

To solve this problem, a new field was added to subjects, and it is special in that it isn't considered to be immutable. This is the `current_uri' field.

INITIAL IDEA

When a subject is inserted, its `current_uri' field is initially set to the same value as its `uri' field. When Zeitgeist receives a MOVE_EVENT for that file (with a coherent timestamp), the value of `current_uri' is updated to its new file name.

The idea here is that this is done in a way that, if we deleted the `current_uri' of all subjects and restored them looking at all MOVE_EVENTs in the database, the result would be the same as before.

CURRENT IMPLEMENTATION

As of now, `current_uri' is initially set to the same value as `uri'. Once a MOVE_EVENT is inserted, all events with a timestamp before that of the move are updated. However, once the MOVE_EVENT has been inserted, it is never considered again.

This is so for performance reasons, since the initial plan would require pretty much "rebuilding the database".

PROBLEMS

There are numerous problems with this implementation, at least in theoretical situations.

One problem is that of events coming in after the MOVE_EVENT (maybe because the application is batching them). In this case they won't be updated.

We also have the opposite problem, a MOVE_EVENT coming in late after another conflicting MOVE_EVENT happened. For instance, we have the following events:

> T5 a.txt, T10 a.txt, T15 a.txt

We receive a first MOVE_EVENT from a.txt to b.txt with timestamp T7. Now we have (time / current_uri):

> T5 a.txt, T10 b.txt, T15 b.txt

Finally, we receive a further MOVE_EVENT from a.txt to c.txt with timestamp T0. The result is:

> T5 c.txt, T10 b.txt, T15 b.txt

This is totally inconsistent; the correct result would have been:

> T5 c.txt, T10 c.txt, T15 b.txt

Further, even if implemented as described in the "initial idea" section, the concept is flawed in that events may be inserted retrospectively already using their updated URI. This could give rise to further inconsistencies.

PROPOSAL

No clear way to avoid this problem is evident. Maybe the best idea is to formalize the current behavior by documenting it and requesting that MOVE and DELETE events be inserted in near real time (for local files).

OUTSTANDING ISSUES

a) Deletion of MOVE_EVENT

What happens upon deletion of a MOVE_EVENT? Should the current_uri changes be reverted?

b) Insertion of other events

When inserting an event, should Zeitgeist check whether a MOVE_EVENT happened for that URI after the event's timestamp, and update it accordingly?

c) Directory

Should the insertion of a MOVE_EVENT with the renaming from "file:///home/user/dir1" to "file:///home/user/dir2" also update all events with uri "file:///home/user/dir1/*" to "file:///home/user/dir2/*"? I think so.

SEE ALSO

Related to this, please also check my proposal for improved DELETE_EVENT handling in bug #954206.
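For outstanding issue c), the directory case comes down to a prefix rewrite on the URI. The following Python sketch shows one way it could work; `rewrite_prefix` is a hypothetical helper, not part of Zeitgeist, and it assumes URIs are compared as plain strings with "/" as the path separator.

```python
def rewrite_prefix(uri, old_dir, new_dir):
    """Return uri with old_dir replaced by new_dir if uri is old_dir or lies below it."""
    old_dir = old_dir.rstrip("/")
    new_dir = new_dir.rstrip("/")
    # Require an exact match or a "/" boundary so that ".../dir1" does not
    # accidentally match the unrelated sibling ".../dir10".
    if uri == old_dir or uri.startswith(old_dir + "/"):
        return new_dir + uri[len(old_dir):]
    return uri


old = "file:///home/user/dir1"
new = "file:///home/user/dir2"

print(rewrite_prefix("file:///home/user/dir1/notes/a.txt", old, new))
# file:///home/user/dir2/notes/a.txt
print(rewrite_prefix("file:///home/user/dir10/a.txt", old, new))
# file:///home/user/dir10/a.txt
```

Checking for the "/" boundary rather than a bare string prefix is the design choice that matters here; without it, a rename of dir1 would also rewrite events under dir10.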
2012-03-18 12:20:09 Siegfried Gevatter description

Old value: identical to the new value of the 2012-03-18 12:19:55 entry above.

New value: identical to the old value, except that the heading of outstanding issue c) now reads "Directories" instead of "Directory".
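Outstanding issue b) in the description asks whether an event that arrives late should be checked against MOVE_EVENTs that are already stored. A minimal Python sketch of such a check follows. It assumes that a move only affects events with a timestamp before that of the move (as stated under CURRENT IMPLEMENTATION); `current_uri_on_insert` and the `(timestamp, old_uri, new_uri)` tuple layout are hypothetical and chosen only for illustration.

```python
def current_uri_on_insert(uri, timestamp, known_moves):
    """Compute the current_uri for a newly inserted event.

    known_moves is an iterable of (timestamp, old_uri, new_uri) tuples for
    MOVE_EVENTs that are already stored (hypothetical representation).
    """
    current = uri
    # Walk the stored moves in chronological order and follow every rename
    # that happened after the new event's timestamp.
    for move_time, old_uri, new_uri in sorted(known_moves):
        if move_time > timestamp and old_uri == current:
            current = new_uri
    return current


stored_moves = [(7, "a.txt", "b.txt")]

# An event batched by the application and delivered after the move was logged:
print(current_uri_on_insert("a.txt", 5, stored_moves))   # b.txt
# An event after the move refers to a new file that reuses the old name:
print(current_uri_on_insert("a.txt", 10, stored_moves))  # a.txt
```

This adds a lookup to every insertion, which is the kind of cost the description weighs against simply requesting that MOVE and DELETE events be inserted near real time.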
2012-03-18 12:20:59 Siegfried Gevatter summary Time warp problem in MOVE_EVENT handling Improved MOVE_EVENT handling (was: Time warp problem)
2012-04-10 15:17:48 Siegfried Gevatter zeitgeist: milestone 0.9.1