Pacemaker turned neutron-ovs-agent into unmanaged state because proc_kill sends wrong signals to pkill

Bug #1528889 reported by Ilya Shakhat
This bug affects 1 person
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  Critical    Bogdan Dobrelya
6.1.x               Won't Fix     Critical    MOS Maintenance
7.0.x               Won't Fix     Critical    MOS Maintenance

Bug Description

As a result of some failover circumstances, neutron-ovs-agent became unmanaged.

According to agent logs:
2015-12-23T14:06:41 -- ovs agent started
2015-12-23T14:07:19 -- Agent initialized successfully, now running...
2015-12-23T14:08:55 -- Error while processing VIF ports due to MessagingTimeout: Timed out waiting for a reply to message
2015-12-23T14:09:01 -- the last message

In pacemaker logs:
Dec 23 14:08:51 [11810] node-4.domain.tld pacemaker_remoted: warning: child_timeout_callback: p_neutron-plugin-openvswitch-agent_stop_0 process (PID 12067) timed out
Dec 23 14:08:51 [11810] node-4.domain.tld pacemaker_remoted: warning: operation_finished: p_neutron-plugin-openvswitch-agent_stop_0:12067 - timed out after 80000ms
Dec 23 14:08:51 [11813] node-4.domain.tld crmd: warning: update_failcount: Updating failcount for p_neutron-plugin-openvswitch-agent on node-4.domain.tld after failed stop: rc=1 (update=INFINITY, time=1450879731)
Dec 23 14:08:51 [11812] node-4.domain.tld pengine: info: native_print: p_neutron-plugin-openvswitch-agent (ocf::fuel:ocf-neutron-ovs-agent): FAILED node-4.domain.tld (unmanaged)

It appears that the agent was slow to start and Pacemaker decided to stop it permanently.
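
A rough reconstruction of the timeline from the two log excerpts above. It assumes the agent and Pacemaker timestamps come from the same clock and that the 80000 ms timeout was counted from the moment the stop operation was launched. The variable names are illustrative only.

```python
from datetime import datetime, timedelta

# Timestamps copied from the agent and Pacemaker log excerpts.
agent_initialized = datetime(2015, 12, 23, 14, 7, 19)
stop_timed_out_at = datetime(2015, 12, 23, 14, 8, 51)
last_agent_message = datetime(2015, 12, 23, 14, 9, 1)
stop_timeout = timedelta(milliseconds=80000)

# Assumption: the timeout clock started when the stop operation was launched.
stop_requested_at = stop_timed_out_at - stop_timeout
gap_after_init = (stop_requested_at - agent_initialized).total_seconds()
still_logging_for = (last_agent_message - stop_timed_out_at).total_seconds()

print(stop_requested_at.time())  # 14:07:31
print(gap_after_init)            # 12.0
print(still_logging_for)         # 10.0
```

Under those assumptions, the stop was requested about 12 seconds after the agent reported successful initialization, and the agent was still logging 10 seconds after the stop had been declared failed.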

Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

DEPLOYMENT:
  Neutron VXLAN + KVM

Ilya Shakhat (shakhat) wrote :
  • Logs (3.4 MiB, application/x-tar)
tags: added: ha neutron
Changed in fuel:
milestone: none → 8.0
Changed in fuel:
status: New → Confirmed
Matthew Mosesohn (raytrac3r) wrote :

Ilya, this isn't enough logging to get a full idea of what changed on the host. Can you provide steps to reproduce or a diagnostic snapshot?

Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
tags: added: area-library team-bugfix tricky
Changed in fuel:
importance: Undecided → High
status: Confirmed → Incomplete
Bogdan Dobrelya (bogdando) wrote :

If the stop action fails, the resource becomes unmanaged, or is fenced if STONITH is enabled. That is expected behavior for Pacemaker. Normally, though, the stop action should not fail, so we should fix the OCF script.

Bogdan Dobrelya (bogdando) wrote :
Changed in fuel:
status: Incomplete → Confirmed
Bogdan Dobrelya (bogdando) wrote :

Yes, please attach the full diagnostic logs snapshot.

Changed in fuel:
status: Confirmed → Incomplete
Bogdan Dobrelya (bogdando) wrote :

Note that pacemaker.log shows many timeouts for monitor actions as well, see http://pastebin.com/ELmthmBM
This points to overly aggressive timeouts configured for the Pacemaker resources, which is not good.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/262043

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
status: Incomplete → In Progress
Bogdan Dobrelya (bogdando) wrote :

Raised to Critical because the current implementation of proc_kill may be very harmful to running processes.

summary: - Pacemaker turned neutron-ovs-agent into unmanaged state
+ Pacemaker turned neutron-ovs-agent into unmanaged state because
+ proc_kill sends wrong signals to pkill
Changed in fuel:
importance: High → Critical
tags: added: regression-8.0
Bogdan Dobrelya (bogdando) wrote :

Although I'm not sure why the current behavior, which sends pkill -1 (SIGHUP) to the agent's process group 5 times with a 2-second retry interval, would cause the stop operation to time out after 80000 ms.
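
For scale, a back-of-the-envelope check of the numbers in the comment above. It assumes each pkill call returns almost immediately and that the helper sleeps 2 seconds after every try. Neither detail is confirmed by this report.

```python
# Figures taken from the comment and the Pacemaker log.
tries = 5
retry_interval_s = 2
stop_timeout_ms = 80000

# Assumption: one sleep per try, and pkill itself takes negligible time.
retry_budget_ms = tries * retry_interval_s * 1000
unexplained_ms = stop_timeout_ms - retry_budget_ms

print(retry_budget_ms)  # 10000
print(unexplained_ms)   # 70000
```

On these assumptions the retry loop accounts for roughly 10 of the 80 seconds, which is consistent with the doubt expressed above that the retries alone explain the timeout.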

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/262043
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=d961691e5fa7ba49305e35a4c17de09b2dae2eb6
Submitter: Jenkins
Branch: master

commit d961691e5fa7ba49305e35a4c17de09b2dae2eb6
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 28 18:06:38 2015 +0100

    Fix the proc_kill ocf helper func

    W/o this fix, the command "proc_kill pid name count" will misplace
    the passed count of tries as the signal name for the pkill, which
    is wrong and might be unpredictably harmful for the process group
    containing the given pid.

    The fix is to correctly specify params for the proc_kill calls.
    Add bats tests for the proc_kill as well.

    Closes-bug: #1528889

    Change-Id: I992480f4e0d3380215d3f8bd910553070334c343
    Signed-off-by: Bogdan Dobrelya <email address hidden>
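
The misplacement described in the commit message can be modelled with a short Python sketch. This is not the fuel-library shell code: the function `proc_kill_command`, its parameter order (pid, name, signal, count), its defaults, the PID 4242, and the name "agent" are all assumptions chosen to reproduce the symptom reported above (pkill -1 retried 5 times). For simplicity the sketch also treats the given pid as the process group ID.

```python
def proc_kill_command(pid, name, signal="TERM", count=5):
    """Hypothetical stand-in for a helper with positional parameters
    (pid, name, signal, count). Returns the pkill argv and the try
    count instead of running anything."""
    # name is accepted only to mirror the call shape; it is unused here.
    # pkill takes the signal as -<signal> and a process group via -g.
    argv = ["pkill", f"-{signal}", "-g", str(pid)]
    return argv, count


# Intended meaning: "try once". The 1 lands in the signal slot instead,
# so the count silently falls back to its default.
buggy_argv, buggy_tries = proc_kill_command(4242, "agent", 1)

# Passing the signal explicitly puts the count where it belongs.
fixed_argv, fixed_tries = proc_kill_command(4242, "agent", "TERM", 1)

print(buggy_argv, buggy_tries)  # ['pkill', '-1', '-g', '4242'] 5
print(fixed_argv, fixed_tries)  # ['pkill', '-TERM', '-g', '4242'] 1
```

In the buggy call nothing fails visibly: the command is still valid, it just delivers signal 1 (SIGHUP) to the whole group, which is why this kind of positional mix-up is easy to miss without tests.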

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Tatyanka (tatyana-leontovich) wrote :

Verified on ISO 519.

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/316085

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/316803

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/317979

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/8.0)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/8.0
Review: https://review.openstack.org/317979

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/7.0)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/7.0
Review: https://review.openstack.org/316803

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (stable/6.1)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/6.1
Review: https://review.openstack.org/316085

Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 6.1-updates and 7.0-updates, as this change is too large to be accepted into a stable branch.
