Instance doesn't get an IP (DHCP) after compute node re-installation

Bug #1531241 reported by Artem Panchenko
Affects                    Status      Importance   Assigned to   Milestone
Fuel for OpenStack         Confirmed   High         MOS Nova
  8.0.x                    Confirmed   High         MOS Nova
  Mitaka                   Confirmed   High         MOS Nova

Bug Description

After compute node re-installation the Nova instance is running, but it is not accessible over the network because the DHCP client doesn't configure its interfaces (see attached screenshot):

root@node-1:~# nova list
+--------------------------------------+--------------------+--------+------------+-------------+------------------------------------------------+
| ID                                   | Name               | Status | Task State | Power State | Networks                                       |
+--------------------------------------+--------------------+--------+------------+-------------+------------------------------------------------+
| 50118e0e-e104-4cc0-b154-b0dcdaef0494 | test-serv818246635 | ACTIVE | -          | Running     | admin_internal_net=10.109.29.10, 10.109.28.132 |
+--------------------------------------+--------------------+--------+------------+-------------+------------------------------------------------+
root@node-1:~# ip netns
qrouter-36585fb6-089e-42ea-91cb-07785decd05e
qdhcp-d2486119-b310-41af-be67-9c52dec90277
haproxy
vrouter
root@node-1:~# ip netns exec qdhcp-d2486119-b310-41af-be67-9c52dec90277 ping -c 1 10.109.29.10
PING 10.109.29.10 (10.109.29.10) 56(84) bytes of data.

--- 10.109.29.10 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

root@node-1:~# ping -c 1 -W 1 10.109.28.132
PING 10.109.28.132 (10.109.28.132) 56(84) bytes of data.

--- 10.109.28.132 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
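
An additional check that could help narrow this down (not shown above; the libvirt domain name below is illustrative) is whether the VM's interface is actually plugged into Open vSwitch on the compute node:

# on the compute node hosting the VM
virsh domiflist instance-00000001   # illustrative domain name; shows which bridge each VIF is plugged into
ovs-vsctl list-ports br-int         # the VM's port (tap... or qvo..., depending on the VIF driver) should be listed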

Steps to reproduce:

1. Run the auto test 'cinder_nova_partition_preservation' or manually follow its steps: https://github.com/openstack/fuel-qa/blob/77a672c3c6a37723a4d24d3db3260ab58661f518/fuelweb_test/tests/test_node_reinstallation.py#L463-L481

Expected result: the test passes.
Actual result: the test fails with the error "test-servNNNNNNN VM is not accessible via its FloatingIP".

There is only one error in the nova-compute logs from the snapshot:

http://paste.openstack.org/show/483057/

But after I reverted the environment, I also got these errors (see diagnostic_snapshot_after_revert.xz):

http://paste.openstack.org/show/483059/

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkUFJyNmZMQXRPMk0/view?usp=sharing

Tags: area-nova
Roman Podoliaka (rpodolyaka) wrote:

Artem, I followed the steps and didn't manage to reproduce the issue: after the re-installation the instance is in ACTIVE state and I can ping it / ssh to it (even the volume attached to the instance remains available).

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "410"
  build_id: "410"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "7ef751bdc0e4601310e85b8bf713a62ed4aee305"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "8bb8c70efc61bcf633e02d6054dbf5ec8dcf6699"
  fuelmenu_sha: "2a0def56276f0fc30fd949605eeefc43e5d7cc6c"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "62573cb2a8aa54845de9303b4a30935a90e1db61"

root@node-2:~# dpkg -l | grep nova
ii nova-common 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - common files
ii nova-compute 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - compute node
ii nova-compute-kvm 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - compute node (KVM)
ii python-nova 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - libraries
ii python-novaclient 2:2.30.2-1~u14.04+mos3 all client library for OpenStack Compute API
root@node-2:~# dpkg -l | grep neutron
ii neutron-common 2:7.0.1-1~u14.04+mos55 all OpenStack virtual network service - common files
ii neutron-plugin-ml2 2:7.0.1-1~u14.04+mos55 all Neutron is a virtual network service for Openstack - ML2 plugin
ii neutron-plugin-openvswitch-agent 2:7.0.1-1~u14.04+mos55 all OpenStack virtual network service - Open vSwitch agent
ii python-neutron 2:7.0.1-1~u14.04+mos55 all OpenStack virtual network service - Python library
ii python-neutronclient 1:3.1.0-1~u14.04+mos9 all client API library for Neutron

Roman Podoliaka (rpodolyaka) wrote:

At the same time, the snippet http://paste.openstack.org/show/483057/ looks interesting. It is neutron-openvswitch-agent that creates the OVS integration bridge (br-int). It looks like in your case the neutron agent started after nova-compute, and by the time the latter tried to plug a VM's VIF into br-int, the bridge didn't exist yet.
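
A quick way to check this on the affected compute node (a sketch; the upstart job name is assumed to match the neutron-plugin-openvswitch-agent package name) would be:

ovs-vsctl br-exists br-int; echo $?       # prints 0 if the integration bridge exists, 2 if it does not
status neutron-plugin-openvswitch-agent   # upstart status of the OVS agent job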

I checked the upstart config of nova-compute, and it looks like we don't have an explicit dependency on neutron-openvswitch-agent, so upstart *may* start the agent later. I guess we could try to tweak the upstart scripts here, although a started neutron agent process would not actually mean that all of the initialization steps have been performed (e.g., in our case, that br-int has been created).
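
A rough sketch of such a tweak (untested; it assumes the agent's upstart job is named neutron-plugin-openvswitch-agent and that replacing the job's start condition and pre-start stanza is acceptable) could be an override file for the nova-compute job:

# /etc/init/nova-compute.override (hypothetical sketch, not a tested fix)
# Start nova-compute only after the OVS agent job has been started.
start on (runlevel [2345] and started neutron-plugin-openvswitch-agent)

pre-start script
    # A started agent process still doesn't guarantee that br-int exists,
    # so also wait (best effort, up to 60 seconds) for the bridge.
    for i in $(seq 1 60); do
        ovs-vsctl br-exists br-int && break
        sleep 1
    done
end script

This only narrows the race between the two services; it does not remove it entirely.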

Roman Podoliaka (rpodolyaka) wrote:

Another point is that the test itself should probably do a better job of migrating the existing workloads off the node / disabling them properly before the node reinstallation. Leaving them in ACTIVE state during the node provisioning and deployment phases seems odd and error-prone from the user's standpoint.
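
For example (a sketch only; the host name and flags are illustrative), the test could drain the compute node with the standard nova CLI before triggering the reinstallation:

nova service-disable node-2 nova-compute   # stop scheduling new instances to this node
nova host-evacuate-live node-2             # live-migrate running instances elsewhere (add --block-migrate if there is no shared storage)
# ... reinstall the node ...
nova service-enable node-2 nova-compute    # return the node to the scheduling pool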

Artem Panchenko (apanchenko-8) wrote:

Roman,

the issue is reproduced consistently on CI starting from ISO build #355 (Dec 26, 2015). Please contact me on IRC if you need an environment after the tests for investigation.

Alexander Gubanov (ogubanov) wrote:

I can't reproduce it on MOS 8.0 (build 402).
Env: Neutron VLAN - 3 controller nodes, 2 compute nodes with Cinder.
Proof: http://pastebin.com/kGjsHgku
