(Redis) Unable to claim messages after some messages expired

Bug #1548135 reported by Eva Balycheva
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
zaqar
Fix Released
High
Eva Balycheva

Bug Description

How to reproduce:
1. Post some messages
2. Wait for them to expire
3. (optional) You can post some more messages if you wish, it doesn't matter.
4. Try to claim messages. You'll get the error 503 on the client, and on the server you'll get:
ResponseError: Error running script (call to f_f8726c7ad2f323131fb1bbd395fff6b501b316c8): @user_script:59: user_script:59: attempt to compare nil with number

Why it happens:
Redis can automatically expire top-level keys. We ask Redis to expire our top-level keys after some time: <claim id> or <message id> and Redis cleans them wonderfully.
However, there are also these top-level keys:
1. <project_id>.<queue_name>.messages (contains set of message ids)
2. <project_id>.<queue_name>.claims (contains set of claim ids)
And these are not cleaned. Nor Zaqar, nor Redis wipes out expired ids from these sets.
So claim_messages.lua script grabs message ids from <project_id>.<queue_name>.messages top-level key set:
https://github.com/openstack/zaqar/blob/master/zaqar/storage/redis/scripts/claim_messages.lua#L40
Then it takes first grabbed message id and tries to get two field values of corresponding <message id> key:
https://github.com/openstack/zaqar/blob/master/zaqar/storage/redis/scripts/claim_messages.lua#L57
In case <message id> key was already expired and cleaned by Redis, this function returns [nil, nil] instead of [claim id, expiration date number].
So this condition fails, because it can't be evaluated: https://github.com/openstack/zaqar/blob/master/zaqar/storage/redis/scripts/claim_messages.lua#L59

Possible solution:
Because it's impossible to make Redis auto expire row in a set (See https://github.com/antirez/redis/issues/167#issuecomment-2559040), make Zaqar check row's expiration date on each query in sets <project_id>.<queue_name>.messages and <project_id>.<queue_name>.claims. If the date is expired, delete the row.
Note: should also include expiration date in each row of <project_id>.<queue_name>.messages. It's only implemented in <project_id>.<queue_name>.claims.

Tags: redis
Eva Balycheva (ubershy)
summary: - (Redis) Unable to claim messages after some messages have expired
+ (Redis) Unable to claim messages after some messages expired
Eva Balycheva (ubershy)
description: updated
Eva Balycheva (ubershy)
description: updated
Revision history for this message
wangxiyuan (wangxiyuan) wrote :

the ttl doesn't work in redis backend. So it leads a series of bugs. I think we should first find a sutiable way to solve the bug:
https://review.openstack.org/#/c/276548/ . Then other bugs, like this one, will be fixed easily.

Feilong Wang (flwang)
Changed in zaqar:
importance: Undecided → High
Eva Balycheva (ubershy)
Changed in zaqar:
assignee: nobody → Eva Balycheva (ubershy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to zaqar (master)

Fix proposed to branch: master
Review: https://review.openstack.org/314273

Changed in zaqar:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to zaqar (master)

Reviewed: https://review.openstack.org/314273
Committed: https://git.openstack.org/cgit/openstack/zaqar/commit/?id=48f3e9f9a284b6f979bc0ba770146208164e0c8e
Submitter: Jenkins
Branch: master

commit 48f3e9f9a284b6f979bc0ba770146208164e0c8e
Author: Eva Balycheva <email address hidden>
Date: Sun May 8 04:42:53 2016 +0300

    Fix fail of claim after some message expire(Redis)

    If some messages in Redis are already expired, it's impossible to claim
    any new messages.

    Zaqar internally catches this error from running "claim_messages.lua":
    "ResponseError: Error running script: @user_script:59: user_script:59:
    attempt to compare nil with number". And returns 503 response code to
    the user.

    This happens because if some message is expired (it's <message_id>
    record was automatically deleted from database), it's ID is still listed
    in set "<project_id>.<queue_name>.messages" and Zaqar uses this ID. In
    this case on attempt to get some values from <message_id>'s fields, it
    gets 'nil' values.

    This patch makes Zaqar check if returned field values from <message_id>
    record are 'nil'. If values are not 'nil', process the record normally.
    If values are 'nil', remember message ID. In the end of the script
    "claim_messages.lua" garbage collect all such message IDs, because it's
    a good opportunity to do so.

    Closes-Bug: 1548135
    Change-Id: Iedd2b82f6da30dd38bfdbb837bf9d0d4c282e222

Changed in zaqar:
status: In Progress → Fix Released
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/zaqar 3.0.0.0b2

This issue was fixed in the openstack/zaqar 3.0.0.0b2 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.