Initial execution of parallel engines can be racy

Bug #1598946 reported by Dave Walker on 2016-07-04
This bug affects 1 person
Affects: watcher
Importance: Undecided
Assigned to: Zhai, Edwin

Bug Description

When running multiple Watcher decision engines in parallel (e.g., via an automated HA deployment), the initial run can create duplicate Goals and Strategies.

For example:
+--------------------------------------+----------------------+----------------------+
| UUID | Name | Display name |
+--------------------------------------+----------------------+----------------------+
| 55211ddd-e9a5-4105-b8e8-c177c0de59fa | server_consolidation | Server consolidation |
| b084a6c7-e03e-4ef6-ba72-10e493ba2b6e | dummy | Dummy goal |
| 2ff77cc0-0f20-4fd7-a739-4346a044a961 | thermal_optimization | Thermal optimization |
| 542de4f0-73c7-461d-b4e3-b8dbe808e0d6 | thermal_optimization | Thermal optimization |
| f78952c5-294b-4f33-9fb7-7b2f4d7494a8 | unclassified | Unclassified |
| 6beb2ff1-fffb-488b-a7db-92615761fe13 | unclassified | Unclassified |
| 45a9d181-0641-451e-97b5-0f705fd9c16f | workload_balancing | Workload balancing |
| aa619ddd-7265-4857-a57e-e3776b668a5c | workload_balancing | Workload balancing |
+--------------------------------------+----------------------+----------------------+

The duplicates then break lookups that expect exactly one row per name, and the decision engine fails on startup:
2016-07-04 17:22:41.053 1 CRITICAL python-watcher [req-b99b79f4-2b24-4dab-a1be-c0c90713b726 - - - - -] MultipleResultsFound: Multiple rows were found for one()
2016-07-04 17:22:41.053 1 ERROR python-watcher Traceback (most recent call last):
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/bin/watcher-decision-engine", line 10, in <module>
2016-07-04 17:22:41.053 1 ERROR python-watcher sys.exit(main())
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/cmd/decisionengine.py", line 43, in main
2016-07-04 17:22:41.053 1 ERROR python-watcher syncer.sync()
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/decision_engine/sync.py", line 119, in sync
2016-07-04 17:22:41.053 1 ERROR python-watcher self.strategy_mapping.update(self._sync_strategy(strategy_map))
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/decision_engine/sync.py", line 167, in _sync_strategy
2016-07-04 17:22:41.053 1 ERROR python-watcher strategy.goal_id = objects.Goal.get_by_name(self.ctx, goal_name).id
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/objects/goal.py", line 116, in get_by_name
2016-07-04 17:22:41.053 1 ERROR python-watcher db_goal = cls.dbapi.get_goal_by_name(context, name)
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/db/sqlalchemy/api.py", line 432, in get_goal_by_name
2016-07-04 17:22:41.053 1 ERROR python-watcher return self._get_goal(context, fieldname="name", value=goal_name)
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/db/sqlalchemy/api.py", line 421, in _get_goal
2016-07-04 17:22:41.053 1 ERROR python-watcher fieldname=fieldname, value=value)
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/watcher/db/sqlalchemy/api.py", line 243, in _get
2016-07-04 17:22:41.053 1 ERROR python-watcher obj = query.one()
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/kolla/venv/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 2727, in one
2016-07-04 17:22:41.053 1 ERROR python-watcher "Multiple rows were found for one()")
2016-07-04 17:22:41.053 1 ERROR python-watcher MultipleResultsFound: Multiple rows were found for one()
2016-07-04 17:22:41.053 1 ERROR python-watcher
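
A minimal sketch of how this failure can arise, using the standard-library sqlite3 module rather than Watcher's actual SQLAlchemy models. The table layout and the helper names `goal_exists` and `fetch_one` are hypothetical stand-ins for illustration; the point is only that a check-then-insert sequence with no uniqueness constraint on the name lets two engines each insert the same goal, after which a strict single-row lookup fails.

```python
import sqlite3


def goal_exists(conn, name):
    """Return True if at least one goal row has this name."""
    row = conn.execute("SELECT 1 FROM goals WHERE name = ?", (name,)).fetchone()
    return row is not None


def fetch_one(conn, name):
    """Return the only row with this name; raise unless exactly one exists."""
    rows = conn.execute(
        "SELECT uuid, name FROM goals WHERE name = ?", (name,)
    ).fetchall()
    if len(rows) != 1:
        raise LookupError("expected 1 row for %r, found %d" % (name, len(rows)))
    return rows[0]


conn = sqlite3.connect(":memory:")
# Hypothetical schema: no UNIQUE constraint on name, so duplicates are allowed.
conn.execute("CREATE TABLE goals (uuid TEXT PRIMARY KEY, name TEXT)")

# Racy interleaving: both engines check before either one inserts.
engine_a_sees_goal = goal_exists(conn, "thermal_optimization")
engine_b_sees_goal = goal_exists(conn, "thermal_optimization")
if not engine_a_sees_goal:
    conn.execute("INSERT INTO goals VALUES (?, ?)", ("uuid-a", "thermal_optimization"))
if not engine_b_sees_goal:
    conn.execute("INSERT INTO goals VALUES (?, ?)", ("uuid-b", "thermal_optimization"))

try:
    fetch_one(conn, "thermal_optimization")
    lookup_error = None
except LookupError as exc:
    lookup_error = str(exc)
```

Here `lookup_error` ends up set because two rows share the same name, which mirrors the MultipleResultsFound error in the traceback.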

Zhai, Edwin (edwin-zhai) on 2016-07-06
Changed in watcher:
assignee: nobody → Zhai, Edwin (edwin-zhai)

Fix proposed to branch: master
Review: https://review.openstack.org/339285

Changed in watcher:
status: New → In Progress
Changed in watcher:
milestone: none → newton-3
importance: Undecided → Medium
Jean-Emile DARTOIS (jed56) wrote :

This is not a bug. We should discuss that. IMHO, we don't need to be HA-ready for now. Moreover, you are missing many things.

Changed in watcher:
importance: Medium → Undecided
Dave Walker (davewalker) wrote :

This most certainly is a bug. Inadvertently running multiple services concurrently shouldn't cause the thing to explode.

No other OpenStack service that I have come across will happily create duplicate service entries on startup and then bail out.

Why would Watcher not want to be deployed HA?

Creating the initial fixtures should either be split out as part of the watcher-db-manage process, or (as the currently proposed branch does) be idempotent.
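
As an illustration of the idempotent option, here is a minimal sketch using the standard-library sqlite3 module. It is not the code from the proposed branch; the table layout and the `sync_goals` function are assumptions made for the example. The idea is to let the database enforce uniqueness on the goal name, so that repeated or concurrent syncs cannot create a second row.

```python
import sqlite3
import uuid


def sync_goals(conn, goals):
    """Insert each (name, display_name) pair at most once, however often this runs."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO goals (uuid, name, display_name) VALUES (?, ?, ?)",
            [(str(uuid.uuid4()), name, display) for name, display in goals],
        )


conn = sqlite3.connect(":memory:")
# Hypothetical schema: the UNIQUE constraint on name is what prevents duplicates.
conn.execute(
    "CREATE TABLE goals ("
    "uuid TEXT PRIMARY KEY, name TEXT NOT NULL UNIQUE, display_name TEXT)"
)

goals = [("dummy", "Dummy goal"), ("thermal_optimization", "Thermal optimization")]
sync_goals(conn, goals)  # first engine starts
sync_goals(conn, goals)  # second engine starts; its inserts are ignored

names = [row[0] for row in conn.execute("SELECT name FROM goals ORDER BY name")]
```

`INSERT OR IGNORE` is SQLite syntax; MySQL spells it `INSERT IGNORE` and PostgreSQL uses `INSERT ... ON CONFLICT DO NOTHING`. Whichever form is used, the uniqueness constraint does the real work, since an application-level existence check alone is still racy.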

The reason why this cannot be considered a bug is that there are many outstanding issues that would have to be addressed in the decision engine for it to be HA-ready.

The first one is indeed the database sync that would have to be handled across 2+ decision engine processes.

Once this phase is correctly handled, we would also need to make the whole strategy execution become stateless, which is not currently the case:

- How do we handle continuous audits across multiple nodes given that they are background tasks?
- If we instantiate multiple engines, that implies we have to make N times more queries that would hit Nova, Neutron, and so on, as we build our data models in memory. Is it OK for us to increase the network I/O? Is there something we need to do to mitigate this?
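
For the first question, one common pattern (not something proposed in this thread, and not how Watcher works) is to have each engine claim an audit with a single conditional UPDATE, so that only one engine runs it. The sketch below uses the standard-library sqlite3 module; the `audits` table, its `owner` column, and the `claim_audit` function are all hypothetical.

```python
import sqlite3


def claim_audit(conn, audit_id, engine_id):
    """Try to take ownership of an audit; return True only for the single winner."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE audits SET owner = ? WHERE id = ? AND owner IS NULL",
            (engine_id, audit_id),
        )
    # The UPDATE matches a row only while the audit is still unowned.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
# Hypothetical schema: owner is NULL until some engine claims the audit.
conn.execute("CREATE TABLE audits (id INTEGER PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO audits (id, owner) VALUES (1, NULL)")

first = claim_audit(conn, 1, "engine-a")
second = claim_audit(conn, 1, "engine-b")
```

The first claim succeeds and the second does not. A real design would also need a way to release or expire claims when an engine dies, which this sketch leaves out.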

If we have a decision engine that is HA-ready, that would also mean that we have to make the Watcher Applier HA-ready in order to be able to cope with the expected increase of action plans to execute:

- How do we handle the execution of action plans across multiple processes?
- ...

My point here is to give you an insight into some of the questions that need to be answered in order for Watcher to fully support HA. However, the added value at the moment is quite minimal, given that Watcher isn't necessary for the proper functioning of an OpenStack infrastructure. This is the reason why I would argue we shouldn't push for this (at least for now), even though it may become a priority in the future.

Dave Walker (davewalker) wrote :

But it is still a bug. If I accidentally start two processes concurrently, I corrupt the database, which Watcher can't fix. Manual database surgery is needed.

Please stop calling this not-a-bug, because it clearly is. The other issues are, well, separate issues.

Jean-Emile DARTOIS (jed56) wrote :

I agree, in a way, we can say that it's a bug. However, in our minds, when the scope or the impact is huge, or when we are adding a new feature to Watcher, we usually call that a blueprint.

Change abandoned by Edwin Zhai (<email address hidden>) on branch: master
Review: https://review.openstack.org/339285
Reason: as discussed, it's not a bug

Changed in watcher:
status: In Progress → Won't Fix
Dave Walker (davewalker) wrote :

I'm disappointed this has been marked Won't Fix; based on the discussion, it seemed more appropriate to leave it open.
