Initial execution of parallel engines can be racy
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
watcher | | Undecided | Zhai, Edwin | |
Bug Description
When running multiple watcher engines in parallel (for example, via an automated HA deployment), the initial run can create duplicate Goals and Strategies.
For example:
+------
| UUID | Name | Display name |
+------
| 55211ddd-
| b084a6c7-
| 2ff77cc0-
| 542de4f0-
| f78952c5-
| 6beb2ff1-
| 45a9d181-
| aa619ddd-
+------
This has bad consequences, such as:
2016-07-04 17:22:41.053 1 CRITICAL python-watcher [req-b99b79f4-
2016-07-04 17:22:41.053 1 ERROR python-watcher Traceback (most recent call last):
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher sys.exit(main())
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher syncer.sync()
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher self.strategy_
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher strategy.goal_id = objects.
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher db_goal = cls.dbapi.
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher return self._get_
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher fieldname=
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher obj = query.one()
2016-07-04 17:22:41.053 1 ERROR python-watcher File "/var/lib/
2016-07-04 17:22:41.053 1 ERROR python-watcher "Multiple rows were found for one()")
2016-07-04 17:22:41.053 1 ERROR python-watcher MultipleResults
2016-07-04 17:22:41.053 1 ERROR python-watcher
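The failure mode above can be reproduced in miniature. The following is only a sketch, not Watcher's code. It assumes a check-then-insert sync against a table with no uniqueness constraint, and it uses the standard-library `sqlite3` module as a stand-in for the real database layer. The `goals` table, the `goal_exists` and `get_one_goal` helpers, the local `MultipleResults` class, and the goal name `dummy` are all hypothetical names chosen for illustration.

```python
import sqlite3


class MultipleResults(Exception):
    """Hypothetical stand-in for the error seen in the traceback."""


def goal_exists(conn, name):
    # The "check" half of a check-then-insert sync.
    row = conn.execute("SELECT 1 FROM goals WHERE name = ?", (name,)).fetchone()
    return row is not None


def get_one_goal(conn, name):
    # Strict lookup: more than one matching row is treated as an error.
    rows = conn.execute("SELECT id FROM goals WHERE name = ?", (name,)).fetchall()
    if len(rows) > 1:
        raise MultipleResults("Multiple rows were found for one()")
    return rows[0][0] if rows else None


conn = sqlite3.connect(":memory:")
# Assumption: nothing in the schema prevents two rows with the same name.
conn.execute("CREATE TABLE goals (id INTEGER PRIMARY KEY, name TEXT)")

# Simulate the race: both engines run their check before either one inserts.
engine_a_sees_goal = goal_exists(conn, "dummy")
engine_b_sees_goal = goal_exists(conn, "dummy")
for sees_goal in (engine_a_sees_goal, engine_b_sees_goal):
    if not sees_goal:
        conn.execute("INSERT INTO goals (name) VALUES (?)", ("dummy",))

duplicate_count = conn.execute(
    "SELECT COUNT(*) FROM goals WHERE name = ?", ("dummy",)
).fetchone()[0]

try:
    get_one_goal(conn, "dummy")
    lookup_failed = False
except MultipleResults:
    lookup_failed = True
```

Once both inserts have landed, every later strict lookup by name fails until the extra row is removed by hand.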
Changed in watcher: | |
assignee: | nobody → Zhai, Edwin (edwin-zhai) |
Changed in watcher: | |
status: | New → In Progress |
Changed in watcher: | |
milestone: | none → newton-3 |
importance: | Undecided → Medium |
Jean-Emile DARTOIS (jed56) wrote : | #2 |
This is not a bug. We should discuss that. IMHO, we don't need to be HA-ready for now. Moreover, you are missing many things.
Changed in watcher: | |
importance: | Medium → Undecided |
Dave Walker (davewalker) wrote : | #3 |
This most certainly is a bug. Inadvertently running multiple services concurrently shouldn't cause the thing to explode.
No other OpenStack service that I have come across will happily create duplicate service entries on startup and then bail out.
Why would Watcher not want to be deployed HA?
Creating the initial fixtures should either be split out as part of the watcher-db-manage process, or (as the currently proposed branch does) be idempotent.
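As a rough illustration of what "idempotent" could mean here (this is not the code from the proposed branch): the sketch below assumes a uniqueness constraint on the goal name and lets the database reject repeats, so the sync can run any number of times, from any number of processes, without adding rows. It uses the standard-library `sqlite3` module, and `INSERT OR IGNORE` is SQLite syntax; other databases spell this differently. The `sync_goals` helper, the `goals` table, and the goal names are hypothetical.

```python
import sqlite3


def sync_goals(conn, goal_names):
    """Register goals so that repeated or concurrent calls add no duplicates."""
    # Assumption: uniqueness is enforced by the schema, not by a prior SELECT.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS goals ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            # Rows that would violate the UNIQUE constraint are skipped.
            "INSERT OR IGNORE INTO goals (name) VALUES (?)",
            [(name,) for name in goal_names],
        )


conn = sqlite3.connect(":memory:")

# Running the sync several times stands in for several engines starting up.
for _ in range(3):
    sync_goals(conn, ["dummy", "server_consolidation"])

goal_rows = conn.execute("SELECT name FROM goals ORDER BY name").fetchall()
```

Because the constraint lives in the database, it holds no matter how the processes interleave, which a check in application code cannot guarantee.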
The reason why this cannot be considered a bug is that there are many outstanding issues that would have to be addressed in the decision engine for it to be HA-ready.
The first one is indeed the database sync that would have to be handled across 2+ decision engine processes.
Once this phase is correctly handled, we would also need to make the whole strategy execution become stateless, which is not currently the case:
- How do we handle continuous audits across multiple nodes given that they are background tasks?
- If we instantiate multiple engines, that implies we have to make N times more queries that would hit Nova, Neutron and so on as we build our data models in-memory. Is it OK for us to increase the network I/O? Is there something we need to do to mitigate this?
If we have a decision engine that is HA-ready, that would also mean that we have to make the Watcher Applier HA-ready in order to be able to cope with the expected increase of action plans to execute:
- How do we handle the execution of action plans across multiple processes?
- ...
My point here is to give you some insight into the questions that need to be answered in order for Watcher to fully support HA. However, the added value at the moment is quite minimal given that Watcher isn't necessary to the good functioning of an OpenStack infrastructure. This is the reason why I would argue we shouldn't push for this (at least for now), even though this may become a priority in the future.
Dave Walker (davewalker) wrote : | #5 |
But.. it is still a bug.. if i accidentally start two processes concurrently.. i corrupt the database... which watcher can't fix.. Manual database surgery is needed.
Please stop calling this not-a-bug, because it clearly is. The other issues are, well, separate issues.
Jean-Emile DARTOIS (jed56) wrote : | #6 |
I agree, in a way, we can say that it's a bug. However, in our mind, when the scope or the impact is huge or when we are adding a new feature to Watcher, we are used to calling that a blueprint.
Change abandoned by Edwin Zhai (<email address hidden>) on branch: master
Review: https:/
Reason: as discussed, it's not a bug
Changed in watcher: | |
status: | In Progress → Won't Fix |
Dave Walker (davewalker) wrote : | #8 |
I'm disappointed this has been marked Won't Fix, it seemed more appropriate based on the discussions to leave it open.
Fix proposed to branch: master
Review: https://review.openstack.org/339285