[10.0 swarm] Launch of instance fails due to PortBindingFailed

Bug #1674381 reported by Sergey Novikov
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Confirmed
Importance: High
Assigned to: Fuel Sustaining
Milestone: 10.1

Bug Description

Detailed bug description:

The issue was found by https://product-ci.infra.mirantis.net/job/10.0.system_test.ubuntu.ha_neutron/213/testReport/fuel_tests.tests.test_neutron/TestNeutronTunHa/test_deploy_neutron_gre_ha/

Steps to reproduce:
            1. Create cluster
            2. Add 3 nodes with controller role
            3. Add 2 nodes with compute role
            4. Deploy the cluster
            5. Run network verification
            6. Run OSTF

Actual result:
AssertionError: Failed 1 OSTF tests; should fail 0 tests.
Names of failed tests:
  - Launch instance, create snapshot, launch instance from snapshot (failure):
    Failed to get to expected status. In error state. Please refer to OpenStack logs for more details.

Some details from the logs:
http://paste.openstack.org/show/603441/
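
(Illustrative, not from the original report.) One way to confirm the binding failure on the failed instance's port is to look at its binding:vif_type, which is set to 'binding_failed' when ML2 cannot bind the port. A minimal sketch using python-neutronclient, assuming admin credentials in the usual OS_* environment variables; the instance UUID is a placeholder:

import os

from neutronclient.v2_0 import client as neutron_client

neutron = neutron_client.Client(
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    tenant_name=os.environ['OS_TENANT_NAME'],
    auth_url=os.environ['OS_AUTH_URL'])

# ports of the instance that failed to boot; '<instance-uuid>' is a placeholder
for port in neutron.list_ports(device_id='<instance-uuid>')['ports']:
    print(port['id'], port.get('binding:vif_type'), port.get('binding:host_id'))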

Sergey Novikov (snovikov) wrote :
Changed in fuel:
milestone: none → 10.1
Changed in fuel:
assignee: nobody → MOS Neutron (mos-neutron)
Changed in fuel:
status: New → Confirmed
tags: added: area-neutron
Changed in fuel:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
Oleg Bondarev (obondarev) wrote :

From neutron logs:

 Refusing to bind port c0ebb7e1-ffb9-4cfc-a418-46b1a8dc3dbe to dead agent

 2017-03-20 01:32:25.458 17933 WARNING neutron.db.agents_db [req-faf604d7-11a7-4e32-a727-65171800240d - - - - -] Agent healthcheck: found 10 dead agents out of 14:
                Type Last heartbeat host
  Open vSwitch agent 2017-03-20 01:31:34 node-5.test.domain.local
            L3 agent 2017-03-20 01:31:39 node-2.test.domain.local
      Metadata agent 2017-03-20 01:31:34 node-2.test.domain.local
      Metadata agent 2017-03-20 01:31:41 node-1.test.domain.local
      Metadata agent 2017-03-20 01:31:32 node-4.test.domain.local
  Open vSwitch agent 2017-03-20 01:31:34 node-1.test.domain.local
  Open vSwitch agent 2017-03-20 01:31:32 node-3.test.domain.local
            L3 agent 2017-03-20 01:31:40 node-1.test.domain.local
  Open vSwitch agent 2017-03-20 01:31:38 node-4.test.domain.local
          DHCP agent 2017-03-20 01:31:32 node-2.test.domain.local

Possibly there are some problems with RabbitMQ/messaging.
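
(For context, not from the original comment.) A simplified sketch of the liveness check neutron-server applies: the real logic lives in neutron.db.agents_db and compares an agent's last heartbeat against the configured agent_down_time (75 seconds by default) using the server's own clock, which is why a forward clock jump on the server can mark otherwise healthy agents as dead; ML2 then refuses to bind ports to hosts whose OVS agent looks dead, as in the warning above:

import datetime

AGENT_DOWN_TIME = 75  # seconds, the neutron.conf default

def is_agent_down(heartbeat, now=None):
    # an agent is considered dead if its last recorded heartbeat is older
    # than agent_down_time according to the *server's* clock
    now = now or datetime.datetime.utcnow()
    return (now - heartbeat) > datetime.timedelta(seconds=AGENT_DOWN_TIME)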

Oleg Bondarev (obondarev) wrote :

Node-1 started reporting dead agents at 2017-03-20 01:29:40, while the other controllers did not: "found 14 active agents" on both node-2 and node-4.

However, there is a strange time jump on the other controllers as well; node-2:

2017-03-20 01:32:39.586 5600 INFO neutron.wsgi [req-82c09c56-2b96-4a98-b209-1558903e3d37 6be4335fc7594520a6644ff2bc24da68 b6f9819e09e145eb8793ded57ad6f407 - - -] 10.109.54.8 - - [20/Mar/2017 01:32:39] "GET /v2.0/subnets.json?fields=id&fields=cidr&id=9a04f9ae-8b59-4551-88c1-831d24afa596&id=48a1bc6d-7c96-415c-bc6c-bdc6f84a9c85 HTTP/1.1" 200 370 0.133562
2017-03-20 01:33:24.949 5601 DEBUG neutron.wsgi [-] (5601) accepted ('10.109.54.2', 55042) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867

and node-4:

2017-03-20 01:32:43.967 23558 DEBUG neutron.api.rpc.handlers.resources_rpc [req-490fc077-a8fc-4ca5-927b-181425bf04ca - - - - -] neutron.api.rpc.handlers.resources_rpc.ResourcesPushToServerRpcCallback method report_agent_resource_versions called with arguments (<neutron.context.Context object at 0x7fa479994bd0>,) {u'version_map': {u'SubPort': u'1.0', u'QosPolicy': u'1.3', u'Trunk': u'1.0'}, u'agent_type': u'Open vSwitch agent', u'agent_host': u'node-2.test.domain.local'} wrapper /usr/lib/python2.7/dist-packages/oslo_log/helpers.py:47
2017-03-20 01:33:28.589 23559 DEBUG oslo_messaging._drivers.amqpdriver [-] received message with unique_id: 0015f4278ffc49b29e378035d5b7c10d __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:196

All service logs on nodes 2 and 4 show this jump (~40 sec).

Oleg Bondarev (obondarev) wrote :

Node-1 has the same time jump (~40 sec) a bit earlier, at 01:28:41.161:

2017-03-20 01:28:41.161 17933 DEBUG neutron.db.agents_db [req-faf604d7-11a7-4e32-a727-65171800240d - - - - -] Agent healthcheck: found 14 active agents agent_health_check /usr/lib/python2.7/dist-packages/neutron/db/agents_db.py:310
2017-03-20 01:29:26.181 17922 DEBUG oslo_messaging._drivers.amqpdriver [-] received message msg_id: c6ff2fb951264277ad4fe3a907d0644b reply to reply_0f40b8804c2d4211bfc601d4462faf2b __call__ /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:194

And after this node-1 starts reporting dead agents.

Currently it looks like a time synchronization issue...

Oleg Bondarev (obondarev) wrote :

Looking at the Fuel master node logs, the time jump is also clearly visible; node-1 nova-api log:

2017-03-20T01:28:39.120193+00:00 info: 2017-03-20 01:28:39.465 24878 INFO nova.osapi_compute.wsgi.server [req-f4dbc9f9-31ee-45ca-9ea2-298ca47db84b 6be4335fc7594520a6644ff2bc24da68 b6f9819e09e145eb8793ded57ad6f407 - default default] 10.109.54.8 "POST /v2.1/os-floating-ips HTTP/1.1" status: 200 len: 483 time: 0.8551819
2017-03-20T01:28:41.369852+00:00 debug: 2017-03-20 01:29:26.009 24902 DEBUG nova.metadata.wsgi.server [req-25a6d42c-6337-4d15-b152-a0e77d253bd7 - - - - -] (24902) accepted ('10.109.54.8', 60288) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867

node-2 nova-api log:

2017-03-20T01:32:39.496879+00:00 info: 2017-03-20 01:32:39.840 24128 INFO nova.metadata.wsgi.server [-] 10.109.54.2 "OPTIONS / HTTP/1.0" status: 200 len: 234 time: 0.0004840
2017-03-20T01:32:40.334387+00:00 debug: 2017-03-20 01:33:24.980 24128 DEBUG nova.metadata.wsgi.server [req-0efcb845-b371-440c-a3e4-e08e71a7866d - - - - -] (24128) accepted ('10.109.54.8', 52580) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867

node-4:

2017-03-20T01:32:41.424439+00:00 info: 2017-03-20 01:32:41.776 4887 INFO nova.metadata.wsgi.server [-] 240.0.0.2 "OPTIONS / HTTP/1.0" status: 200 len: 234 time: 0.0005469
2017-03-20T01:32:44.256129+00:00 debug: 2017-03-20 01:33:28.903 4884 DEBUG nova.metadata.wsgi.server [req-4b076af4-a646-4feb-babc-6b611db632f5 - - - - -] (4884) accepted ('10.109.54.8', 35338) server /usr/lib/python2.7/dist-packages/eventlet/wsgi.py:867
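
(Illustrative, not from the original comment.) The skew can be read off a single line of these aggregated logs by comparing the syslog header timestamp with the service's own in-message timestamp; for the second node-1 nova-api line above it comes out to roughly 45 seconds:

import datetime as dt

# the second node-1 nova-api line quoted above, truncated for brevity
LINE = ('2017-03-20T01:28:41.369852+00:00 debug: 2017-03-20 01:29:26.009 '
        '24902 DEBUG nova.metadata.wsgi.server ...')

syslog_ts, rest = LINE.split(' ', 1)
service_ts = ' '.join(rest.split()[1:3])

header = dt.datetime.strptime(syslog_ts[:26], '%Y-%m-%dT%H:%M:%S.%f')
message = dt.datetime.strptime(service_ts, '%Y-%m-%d %H:%M:%S.%f')
print('in-message clock ahead of syslog header by %.1f s'
      % (message - header).total_seconds())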

Oleg Bondarev (obondarev) wrote :

Given the above, it seems to be a time desynchronisation issue on the nodes.
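
(Not part of the original comment.) A quick way to confirm clock drift is to compare each controller's clock against the Fuel master, assuming passwordless SSH from the master to the nodes as in a standard Fuel deployment; node names are taken from the logs above:

import subprocess
import time

NODES = ['node-1.test.domain.local', 'node-2.test.domain.local',
         'node-3.test.domain.local', 'node-4.test.domain.local',
         'node-5.test.domain.local']

for node in NODES:
    # ask the remote node for its epoch time and compare with the local clock
    out = subprocess.check_output(['ssh', node, 'date', '+%s.%N'])
    skew = float(out.decode()) - time.time()
    print('%-28s skew vs master: %+.1f s' % (node, skew))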

Changed in fuel:
assignee: Oleg Bondarev (obondarev) → Fuel Sustaining (fuel-sustaining-team)