Pacemaker arbitrarily restarts service after 33 failure counter increments

Bug #1840505 reported by Jesse Pendergrass
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Invalid
Importance: High
Assigned to: MOS Maintenance
Milestone: 9.x-updates

Bug Description

Detailed bug description:
This report is related to https://bugs.launchpad.net/mos/+bug/1840503.

Under the conditions described in that report, and only in the customer environment, the neutron-dhcp-agent resource is killed and is not started again cleanly by the OCF scripts.

Pacemaker detects this and, based on the log output, appears to try to start the process repeatedly over 30-40 minutes, about once per minute. A failure counter evidently increments over this period.

After about 33 failures, PCS bans the resource, and then about 50 seconds later starts the resource successfully. It is unclear which PCS settings govern this behavior, as the configuration for neutron-dhcp-agent does not appear to match what is being seen.

Steps to reproduce:
So far, this is only reproducible in the customer environment.

Expected results:
PCS should start up the resource immediately after the issue reported in https://bugs.launchpad.net/mos/+bug/1840503 has occurred.

Actual result:
It takes 33+ minutes for the resource to be started again, and the resource is banned along the way even though no failure threshold appears to be set in the resource configuration.

Impact:
Because the precise configuration that governs this behavior is unknown, the behavior appears arbitrary. As a result, the customer cannot set a timeout threshold on the Fuel task that would account for this situation.
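As a rough aid for sizing such a timeout, the figures quoted in this report can be combined into a single estimate. This is only a sketch based on the observed numbers (about 33 failures, roughly one attempt per minute, about 50 seconds between the ban and the successful start). The function and parameter names are illustrative and do not correspond to any Pacemaker or Fuel setting.

```python
# Rough estimate of the observed recovery time, using only the figures
# quoted in this report. All names here are illustrative.

def estimated_recovery_seconds(failed_attempts, retry_interval_s, post_ban_delay_s):
    """Time spent on failed start attempts plus the delay after the ban."""
    return failed_attempts * retry_interval_s + post_ban_delay_s


observed = estimated_recovery_seconds(
    failed_attempts=33,    # "about 33 failures"
    retry_interval_s=60,   # "once per minute"
    post_ban_delay_s=50,   # "about 50 seconds later"
)
print(f"{observed} s (~{observed / 60:.1f} min)")  # 2030 s (~33.8 min)
```

Any task timeout derived this way would need a margin on top, since the report describes the recovery window as anywhere between 30 and 40 minutes.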

Description of the environment:
- Operating system: Ubuntu 14.04
- Reference architecture: MOS 9.0
- Network model: Neutron + OVS

Additional information:
Clone: clone_neutron-dhcp-agent
 Meta Attrs: interleave=true
 Resource: neutron-dhcp-agent (class=ocf provider=fuel type=neutron-dhcp-agent)
  Attributes: plugin_config=/etc/neutron/dhcp_agent.ini remove_artifacts_on_stop_start=true
  Operations: monitor interval=20 timeout=30 (neutron-dhcp-agent-monitor-20)
              start interval=0 timeout=60 (neutron-dhcp-agent-start-0)
              stop interval=0 timeout=60 (neutron-dhcp-agent-stop-0)
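One way to double-check the observation that no failure threshold is set is to scan the configuration dump for Pacemaker's failure-handling meta attributes. The sketch below looks for `migration-threshold` and `failure-timeout`, which are standard Pacemaker resource options. Whether they actually govern the behavior described here is not confirmed by this report, and such values may also come from resource defaults that do not appear in a per-resource dump. The helper name is hypothetical.

```python
import re

# Configuration dump copied from this report.
PCS_OUTPUT = """\
Clone: clone_neutron-dhcp-agent
 Meta Attrs: interleave=true
 Resource: neutron-dhcp-agent (class=ocf provider=fuel type=neutron-dhcp-agent)
  Attributes: plugin_config=/etc/neutron/dhcp_agent.ini remove_artifacts_on_stop_start=true
  Operations: monitor interval=20 timeout=30 (neutron-dhcp-agent-monitor-20)
              start interval=0 timeout=60 (neutron-dhcp-agent-start-0)
              stop interval=0 timeout=60 (neutron-dhcp-agent-stop-0)
"""

# Pacemaker meta attributes that influence failure handling.
FAILURE_OPTIONS = ("migration-threshold", "failure-timeout")


def find_failure_options(text):
    """Return {option: value} for failure-handling options present as name=value pairs."""
    pairs = dict(re.findall(r"([\w-]+)=(\S+)", text))
    return {name: pairs[name] for name in FAILURE_OPTIONS if name in pairs}


found = find_failure_options(PCS_OUTPUT)
print(found or "no migration-threshold or failure-timeout set on this resource")
```

For the dump above, the function returns an empty dictionary, which matches the statement that the resource configuration itself carries no failure threshold.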

Changed in mos:
milestone: none → 9.2-mu-14
assignee: nobody → MOS Maintenance (mos-maintenance)
importance: Undecided → High
status: New → Confirmed
tags: added: customer-found
Denis Meltsaykin (dmeltsaykin) wrote :

Closed per Jesse's request.

Changed in mos:
status: Confirmed → Invalid
milestone: 9.2-mu-14 → 9.x-updates