The Schedule Operation does not trigger using operationengine multiple- node

Bug #1845049 reported by Hernando Clareth Ariza Perez
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Karbor
Invalid
Undecided
Unassigned

Bug Description

- Description:

After ScheduleOperation creation if you set a date-time to run this operation it doesn't trigger if the node that takes the trigger operation task, isn't where the operation was created.

The function that executes the Operation gets the operation from a set (_operation_ids) saved in Memory, So only the node where the operation was created can execute it. For others node the set (_operation_ids) is empty so, those can't execute Operation.

- code:

```def _trigger_operations(cls, trigger_id, expect_run_time, window):
        """Trigger operations once"""

        # The executor execute_operation may have I/O operation.
        # If it is, this green thread will be switched out during looping
        # operation_ids. In order to avoid changing self._operation_ids
        # during the green thread is switched out, copy self._operation_ids
        # as the iterative object.
        trigger = cls._triggers.get(trigger_id)
        if not trigger:
            LOG.warning("Can't find trigger: %s" % trigger_id)
            return
        ########set in Memory -> trigger._operation_ids ############
        ############################################################
        operations_ids = trigger._operation_ids.copy()
        sent_ops = set()
        end_time = expect_run_time + timedelta(seconds=window)

        for operation_id in operations_ids:
            if operation_id not in trigger._operation_ids:
                # Maybe, when traversing this operation_id, it has been
                # removed by self.unregister_operation
                LOG.warning("Execute operation %s which is not exist, "
                            "ignore it", operation_id)
                continue

            now = datetime.utcnow()
            if now >= end_time:
                LOG.error("Can not trigger operations to run. Because it is "
                          "out of window time. now=%(now)s, "
                          "end time=%(end_time)s, waiting operations=%(ops)s",
                          {'now': now, 'end_time': end_time,
                           'ops': operations_ids - sent_ops})
                break

            try:
                trigger._executor.execute_operation(
                    operation_id, now, expect_run_time, window)
            except Exception:
                LOG.exception("Submit operation to executor failed, operation"
                              " id=%s", operation_id)

            sent_ops.add(operation_id)```

- Environment condition:

    * karbor version: master (I think all versions has this problem)
    * Operation Engine service running on 3 nodes

- Test to reproduce it:

Create Karbor plan and trigger
karbor plan-create "myplan001" ac9b986e-cd0a-49fb-9513-9d93a4d9dac8 1a1441d3-d825-4258-ac9a-05787476fa99=OS::Nova::Server="karbor_keep_test"
+-------------+---------------------------------------------------+
| Property | Value |
+-------------+---------------------------------------------------+
| description | None |
| id | df1c0584-b9fc-4f8c-87d9-df54cc6bcd38 |
| name | myplan001 |
| parameters | {} |
| provider_id | ac9b986e-cd0a-49fb-9513-9d93a4d9dac8 |
| resources | [ |
| | { |
| | "id": "1a1441d3-d825-4258-ac9a-05787476fa99", |
| | "name": "karbor_keep_test", |
| | "type": "OS::Nova::Server" |
| | } |
| | ] |
| status | suspended |
+-------------+---------------------------------------------------+

karbor trigger-create 'mytrigger001' 'time' "pattern"="BEGIN:VEVENT\nRRULE:FREQ=DAILY;INTERVAL=1;\nEND:VEVENT","start_time"="2019-09-19 17:20:00","format"="calendar"
+------------+--------------------------------------------------------------------------+
| Property | Value |
+------------+--------------------------------------------------------------------------+
| id | ab103b0e-d4a6-4ff0-acf7-4b8022b23f52 |
| name | mytrigger001 |
| properties | { |
| | "format": "calendar", |
| | "pattern": "BEGIN:VEVENT\\nRRULE:FREQ=DAILY;INTERVAL=1;\\nEND:VEVENT", |
| | "start_time": "2019-09-19 17:20:00" |
| | } |
| type | time |
+------------+--------------------------------------------------------------------------+
Monitoring the operation engine log (debug mode)
Every nodes shows the same log they are in the infinity loop

2019-09-19 17:08:47.092 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-4d71847c-8edf-4e57-a22b-bd52bbe82450 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
2019-09-19 17:09:02.092 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-634487a9-f71e-48ef-9a8d-302ea3cdf5b4 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88.....
Create the schedule operation based on the created plan and trigger
karbor scheduledoperation-create 'myprotection_retention' retention_protect ab103b0e-d4a6-4ff0-acf7-4b8022b23f52 "plan_id"="df1c0584-b9fc-4f8c-87d9-df54cc6bcd38","provider_id"="ac9b986e-cd0a-77fb-9513-9d93a779dac8"
+----------------------+---------------------------------------------------------+
| Property | Value |
+----------------------+---------------------------------------------------------+
| description | None |
| enabled | True |
| id | b5f4401c-3e7e-4f6a-9b90-60aaa9e3d0da |
| name | myprotection_retention |
| operation_definition | { |
| | "plan_id": "df1c0584-b9fc-4f8c-87d9-df54cc6bcd38", |
| | "provider_id": "ac9b986e-cd0a-49fb-9513-9d93a4d9dac8" |
| | } |
| operation_type | retention_protect |
| trigger_id | ab103b0e-d4a6-4ff0-acf7-4b8022b23f52 |
+----------------------+---------------------------------------------------------+
Continue monitoring the logs and the operation was created on Node2 (a simple name to call one of the nodes)

2019-09-19 17:13:55.041 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-7aeddb89-4f8e-4875-91f6-f36dd19e7f20 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
2019-09-19 17:13:59.769 8 DEBUG karbor.services.operationengine.manager [req-4d0f783e-cd1e-41b8-aa63-7b9264e20a7e 14a7f940db0f4ba6b8f418d4b419c131 2f3ebe37aded44d8ae49f009bf31cbfe - - -] Create scheduled operation. create_scheduled_operation /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/manager.py:162
2019-09-19 17:14:10.039 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-a7524d10-a928-48df-92ae-ea98a6d25859 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
2019-09-19 17:14:25.039 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-509c4b7c-4f7f-4dc5-98ab-a2c069328bbd - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
Continue monitoring the operation engine service logs,
All nodes are in the infinitive loop.
Waiting the setted time in the trigger to see the execution of the operation....

2019-09-19 17:17:47.106 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-4d97a8e2-10b4-40c7-89c1-dc95016b7023 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
2019-09-19 17:18:02.107 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-5369854d-6d9e-4722-9d47-e1ca43e90ad7 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
At 2019-09-19 17:20:02 the Node3 took the created trigger to start the schedule operation..., see the following log, You can see that the set _operation_ids is empty, cause the schedule operation was created on a different node (Node2).

NOTE: I added some lines in the code to log the trace and understand this behavior
2019-09-19 17:19:47.113 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-2505f17b-9aab-49d5-837f-cc710503a7a5 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
2019-09-19 17:20:02.116 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-a6c8aef7-0763-4345-9c15-bc049b7f3e17 - - - - -] Time trigger (ab103b0e-d4a6-4ff0-acf7-4b8022b23f52) is due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:101
2019-09-19 17:20:02.118 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-a6c8aef7-0763-4345-9c15-bc049b7f3e17 - - - - -] Rescheduling (ab103b0e-d4a6-4ff0-acf7-4b8022b23f52) from 2019-09-19 17:20:00 to 2019-09-20 17:20:00 _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:118
2019-09-19 17:20:02.129 8 DEBUG karbor.db.sqlalchemy.api [req-8227b9af-4342-4c94-a0f3-a139cf794993 - - - - -] Updated trigger execution (a7a47abc-9bbe-4595-ab79-589c9cb5a9e7) from 2019-09-19 17:20:00 to 2019-09-20 17:20:00 trigger_execution_update /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/db/sqlalchemy/api.py:502
2019-09-19 17:20:02.131 8 INFO karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-8227b9af-4342-4c94-a0f3-a139cf794993 - - - - -] HERE STARTING:_trigger_operations function
2019-09-19 17:20:02.131 8 INFO karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-8227b9af-4342-4c94-a0f3-a139cf794993 - - - - -] HERE set In Memory (operation ids)
2019-09-19 17:20:02.132 8 INFO karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-8227b9af-4342-4c94-a0f3-a139cf794993 - - - - -] set([])
2019-09-19 17:20:02.144 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-0270ef7e-aa41-44c7-9e77-e755eaee18e3 - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88
2019-09-19 17:20:17.113 8 DEBUG karbor.services.operationengine.engine.triggers.timetrigger.time_trigger_multi_node [req-470348c3-cf55-42ef-b2d8-ca859be5460a - - - - -] Time trigger not yet due _loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/karbor/services/operationengine/engine/triggers/timetrigger/time_trigger_multi_node.py:88

- Workaround

Enable the schedule operation service only on one node, stop it on others.

Changed in karbor:
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.