neutron-openvswitch-agent failed after cluster cold shutdown

Bug #1585678 reported by ElenaRossokhina
Affects             Status   Importance  Assigned to      Milestone
Fuel for OpenStack  Invalid  Medium      Fuel Sustaining
Mitaka              Invalid  Medium      Fuel Sustaining
Newton              Invalid  Medium      Fuel Sustaining

Bug Description

Detailed bug description:
The HA test suite cannot be run against the cluster after a cold restart. All tests fail with "Can not set proxy for Health Check.Make sure that network configuration for controllers is correct"
Steps to reproduce:
1. Pre-condition - do steps from 'deploy_ha_cinder' test
2. Create 2 instances
3. Create 2 volumes
4. Attach volumes to instances
5. Fill cinder storage up to 30%
6. Cold shutdown of all nodes
7. Wait 5 min
8. Start of all nodes
9. Wait for HA services ready <== FAIL
10. Verify networks
11. Run OSTF tests
Expected results:
HA suite PASS
Actual result:
The cluster does not recover even after a long wait (more than 0.5 to 1 hour).
'fuel node' shows that all nodes are online.
[root@nailgun ~]# fuel health --env 1 --check ha
[ 1 of 7] [failure] 'Check state of haproxy backends on controllers' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 2 of 7] [failure] 'Check data replication over mysql' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 3 of 7] [failure] 'Check if amount of tables in databases is the same on each node' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 4 of 7] [failure] 'Check galera environment state' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 5 of 7] [failure] 'Check pacemaker status' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 6 of 7] [failure] 'RabbitMQ availability' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 7 of 7] [failure] 'RabbitMQ replication' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
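
For triage, the health-check output above can be summarized with a small script. This is a minimal sketch, assuming the line layout is exactly as pasted in this report; the names `parse_health_output` and `summarize` are hypothetical helpers and are not part of the Fuel client.

```python
import re
from collections import Counter

# One result line as pasted above, for example:
# [ 5 of 7] [failure] 'Check pacemaker status' (0.0 s) <message>
# The layout is inferred from this report only; other Fuel versions may differ.
LINE_RE = re.compile(
    r"^\[\s*(?P<index>\d+)\s+of\s+(?P<total>\d+)\]\s+"
    r"\[(?P<status>\w+)\]\s+"
    r"'(?P<name>[^']*)'\s+"
    r"\((?P<seconds>[\d.]+)\s*s\)\s*"
    r"(?P<message>.*)$"
)


def parse_health_output(text):
    """Return one dict per recognized result line; other lines are skipped."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            results.append(match.groupdict())
    return results


def summarize(results):
    """Count results per status, e.g. how many tests reported 'failure'."""
    return Counter(r["status"] for r in results)


# Two lines copied from the output in this report.
SAMPLE = """\
[ 5 of 7] [failure] 'Check pacemaker status' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
[ 6 of 7] [failure] 'RabbitMQ availability' (0.0 s) Can not set proxy for Health Check.Make sure that network configuration for controllers is correct
"""

if __name__ == "__main__":
    parsed = parse_health_output(SAMPLE)
    print(summarize(parsed))
    # A single distinct message across all tests points at one shared cause.
    print({r["message"] for r in parsed})
```

All seven tests fail in 0.0 s with the same message, which suggests that the checks never reached the controllers rather than seven independent service failures.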

root@node-5:~# crm status
Last updated: Wed May 25 12:43:53 2016 Last change: Wed May 25 06:57:55 2016 by root via cibadmin on node-1.test.domain.local
Stack: corosync
Current DC: node-2.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-2.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-2.test.domain.local
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-5.test.domain.local ]
 Clone Set: clone_p_mysqld [p_mysqld]
     Started: [ node-2.test.domain.local ]
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started node-2.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-2.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ node-2.test.domain.local ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-2.test.domain.local ]
 Clone Set: clone_neutron-openvswitch-agent [neutron-openvswitch-agent]
     neutron-openvswitch-agent (ocf::fuel:neutron-ovs-agent): FAILED node-2.test.domain.local

Failed Actions:
* neutron-openvswitch-agent_monitor_20000 on node-2.test.domain.local 'not running' (7): call=91, status=complete, exitreason='none',
    last-rc-change='Wed May 25 09:34:15 2016', queued=0ms, exec=0ms
* sysinfo_node-2.test.domain.local_monitor_15000 on node-2.test.domain.local 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 25 09:16:00 2016', queued=0ms, exec=0ms
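
The "Failed Actions" entries above can be pulled apart the same way. The sketch below assumes the entry layout shown in this report (Pacemaker 1.1.14) and that operation names follow the `<resource>_<action>_<interval>` shape seen here; `parse_failed_actions` is a hypothetical helper, not a Pacemaker API.

```python
import re

# First line of a "Failed Actions" entry as pasted above. The indented
# continuation line (last-rc-change=...) is ignored by this sketch.
FAILED_RE = re.compile(
    r"^\*\s+(?P<operation>\S+)\s+on\s+(?P<node>\S+)\s+"
    r"'(?P<reason>[^']*)'\s+\((?P<rc>\d+)\):\s+"
    r"call=(?P<call>\d+),\s+status=(?P<status>[^,]+),"
)


def parse_failed_actions(text):
    """Return one dict per failed action found in crm status output."""
    failures = []
    for line in text.splitlines():
        match = FAILED_RE.match(line.strip())
        if not match:
            continue
        entry = match.groupdict()
        # Assumed shape: <resource>_<action>_<interval in ms>
        resource, action, interval_ms = entry.pop("operation").rsplit("_", 2)
        entry.update(
            resource=resource,
            action=action,
            interval_ms=int(interval_ms),
            rc=int(entry["rc"]),
        )
        failures.append(entry)
    return failures


# Entries copied from the crm status output in this report.
SAMPLE = """\
* neutron-openvswitch-agent_monitor_20000 on node-2.test.domain.local 'not running' (7): call=91, status=complete, exitreason='none',
    last-rc-change='Wed May 25 09:34:15 2016', queued=0ms, exec=0ms
* sysinfo_node-2.test.domain.local_monitor_15000 on node-2.test.domain.local 'unknown error' (1): call=51, status=Timed Out, exitreason='none',
    last-rc-change='Wed May 25 09:16:00 2016', queued=0ms, exec=0ms
"""

if __name__ == "__main__":
    for failure in parse_failed_actions(SAMPLE):
        print(failure["resource"], failure["action"], failure["rc"], failure["status"])
```

In this output the agent's monitor operation completed and returned 7 ('not running'), while the SysInfo monitor timed out; both failures are on node-2.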

Description of the environment:
[root@nailgun ~]# shotgun2 short-report
cat /etc/fuel_build_id:
 376
cat /etc/fuel_build_number:
 376
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6346.noarch
 fuel-bootstrap-cli-9.0.0-1.mos281.noarch
 fuel-migrate-9.0.0-1.mos8376.noarch
 rubygem-astute-9.0.0-1.mos745.noarch
 fuel-misc-9.0.0-1.mos8376.noarch
 network-checker-9.0.0-1.mos72.x86_64
 fuel-mirror-9.0.0-1.mos136.noarch
 fuel-openstack-metadata-9.0.0-1.mos8693.noarch
 fuel-notify-9.0.0-1.mos8376.noarch
 nailgun-mcagents-9.0.0-1.mos745.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8693.noarch
 python-fuelclient-9.0.0-1.mos315.noarch
 fuelmenu-9.0.0-1.mos270.noarch
 fuel-9.0.0-1.mos6346.noarch
 fuel-utils-9.0.0-1.mos8376.noarch
 fuel-setup-9.0.0-1.mos6346.noarch
 fuel-library9.0-9.0.0-1.mos8376.noarch
 shotgun-9.0.0-1.mos88.noarch
 fuel-agent-9.0.0-1.mos281.noarch
 fuel-ui-9.0.0-1.mos2688.noarch
 fuel-ostf-9.0.0-1.mos934.noarch
 python-packetary-9.0.0-1.mos136.noarch
 fuel-nailgun-9.0.0-1.mos8693.noarch

logs: https://drive.google.com/a/mirantis.com/file/d/0B2ag_Bf-ShtTRmRwOUItY0VGbTQ/view?usp=sharing

Tags: area-qa
Changed in fuel:
milestone: none → 9.0
assignee: nobody → Fuel Sustaining (fuel-sustaining-team)
importance: Undecided → High
status: New → Confirmed
tags: added: area-library
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Kyrylo Galanov (kgalanov)
status: Confirmed → In Progress
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Kyrylo Galanov (kgalanov) → MOS Neutron (mos-neutron)
tags: removed: area-library
Oleg Bondarev (obondarev) wrote :

Last logs from ovs agent on node-2:

2016-05-25 09:17:02.039 17506 ERROR neutron.agent.ovsdb.impl_vsctl [req-4c1e47fe-17bc-4cb5-86d3-3963a626701c - - - - -] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', 'list-br']. Exception: Exit code: 142; Stdin: ; Stdout: ; Stderr: 2016-05-25T09:16:57Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)

2016-05-25 09:17:02.310 17506 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-4c1e47fe-17bc-4cb5-86d3-3963a626701c - - - - -] Exit code: 142; Stdin: ; Stdout: ; Stderr: 2016-05-25T09:16:57Z|00001|fatal_signal|WARN|terminating with signal 14 (Alarm clock)
 Agent terminated!

After that there were no attempts to start the agent
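
A note on the exit code in the log above: 142 is consistent with the common "128 + signal number" convention, and 142 - 128 = 14 matches the "terminating with signal 14 (Alarm clock)" text in stderr. The `--timeout=10` argument in the logged command suggests that `ovs-vsctl` gave up after about 10 seconds, although that reading is an assumption and the thread does not confirm it. A minimal sketch of the decoding, with `describe_exit_code` as a hypothetical helper:

```python
import signal


def describe_exit_code(code):
    """Return the signal name if code looks like 128 + signal number, else None."""
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number on this platform.
            return None
    return None


if __name__ == "__main__":
    # On Linux, signal 14 is SIGALRM, matching "Alarm clock" in the log.
    print(describe_exit_code(142))
```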

Oleg Bondarev (obondarev) wrote :

This looks like an out-of-memory issue: there are lots of page allocation failures in the kernel logs from node-2.
3G for a controller is not enough!

Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Fuel QA Team (fuel-qa)
Nastya Urlapova (aurlapova) wrote :

Could you redeploy your env? ISO #376 is too old.

Changed in fuel:
importance: High → Medium
tags: added: area-qa
Andrey Lavrentyev (alavrentyev) wrote :

I was able to reproduce it on 9.0-mos #490, so I consider it still a valid issue.

[root@nailgun ~]# shotgun2 short-report | head -n8
cat /etc/fuel_build_id:
 490
cat /etc/fuel_build_number:
 490
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0

RAM for the controller was 3 GB, though.

@Oleg, how much RAM is enough for a controller from your point of view?

Andrey Lavrentyev (alavrentyev) wrote :

Update:

Tried one more time after a bit longer timeout on the same 9.0-mos #490.
It looks like the issue is gone.
Can't reproduce it.
Instead, I hit the known issue https://bugs.launchpad.net/fuel/newton/+bug/1592876.

I suggest moving this bug to Invalid or another status.

Dmitry Belyaninov (dbelyaninov) wrote :
Changed in fuel:
assignee: ElenaRossokhina (esolomina) → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov) wrote :

We need a new reproduction to investigate.

Changed in fuel:
status: Confirmed → Incomplete
Dmitry Belyaninov (dbelyaninov) wrote :
Oleksiy Molchanov (omolchanov) wrote :
Dmitry Belyaninov (dbelyaninov) wrote :

^ not merged
the issue was reproduced.

Alexander Kurenyshev (akurenyshev) wrote :
Oleksiy Molchanov (omolchanov) wrote :

The patch was merged on 17.02; please reopen if the issue is reproduced.

Changed in fuel:
status: Incomplete → Invalid