Instance doesn't get an IP (DHCP) after compute node re-installation

Bug #1531241 reported by Artem Panchenko
Affects                    Status      Importance   Assigned to   Milestone
Fuel for OpenStack         Confirmed   High         MOS Nova
  8.0.x                    Confirmed   High         MOS Nova
  Mitaka                   Confirmed   High         MOS Nova

Bug Description

After compute node re-installation the Nova instance is running, but it is not accessible over the network because the DHCP client doesn't configure its interfaces (see attached screenshot):

root@node-1:~# nova list
+--------------------------------------+--------------------+--------+------------+-------------+------------------------------------------------+
| ID                                   | Name               | Status | Task State | Power State | Networks                                       |
+--------------------------------------+--------------------+--------+------------+-------------+------------------------------------------------+
| 50118e0e-e104-4cc0-b154-b0dcdaef0494 | test-serv818246635 | ACTIVE | -          | Running     | admin_internal_net=10.109.29.10, 10.109.28.132 |
+--------------------------------------+--------------------+--------+------------+-------------+------------------------------------------------+
root@node-1:~# ip netns
qrouter-36585fb6-089e-42ea-91cb-07785decd05e
qdhcp-d2486119-b310-41af-be67-9c52dec90277
haproxy
vrouter
root@node-1:~# ip netns exec qdhcp-d2486119-b310-41af-be67-9c52dec90277 ping -c 1 10.109.29.10
PING 10.109.29.10 (10.109.29.10) 56(84) bytes of data.

--- 10.109.29.10 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

root@node-1:~# ping -c 1 -W 1 10.109.28.132
PING 10.109.28.132 (10.109.28.132) 56(84) bytes of data.

--- 10.109.28.132 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
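
An additional check that could help narrow this down (not shown above; the libvirt domain name below is illustrative) is whether the VM's interface is actually plugged into Open vSwitch on the compute node:

# on the compute node hosting the VM
virsh domiflist instance-00000001   # illustrative domain name; shows which bridge each VIF is plugged into
ovs-vsctl list-ports br-int         # the VM's port (tap... or qvo..., depending on the VIF driver) should be listed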

Steps to reproduce:

1. Run the auto test 'cinder_nova_partition_preservation' or manually follow its steps: https://github.com/openstack/fuel-qa/blob/77a672c3c6a37723a4d24d3db3260ab58661f518/fuelweb_test/tests/test_node_reinstallation.py#L463-L481

Expected result: the test passes.
Actual result: the test fails with the error "test-servNNNNNNN VM is not accessible via its FloatingIP".

There is only one error in the nova-compute logs from the snapshot:

http://paste.openstack.org/show/483057/

But after I reverted the environment, I also got these errors (see diagnostic_snapshot_after_revert.xz):

http://paste.openstack.org/show/483059/

Diagnostic snapshot: https://drive.google.com/file/d/0BzaZINLQ8-xkUFJyNmZMQXRPMk0/view?usp=sharing

Tags: area-nova
Roman Podoliaka (rpodolyaka) wrote:

Artem, I followed the steps and didn't manage to reproduce the issue: after the re-installation the instance is in ACTIVE state and I can ping it / ssh to it (even the volume attached to the instance remains available).

[root@nailgun ~]# cat /etc/fuel/version.yaml
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "410"
  build_id: "410"
  fuel-nailgun_sha: "9ebbaa0473effafa5adee40270da96acf9c7d58a"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "df16d41cd7a9445cf82ad9fd8f0d53824711fcd8"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "7ef751bdc0e4601310e85b8bf713a62ed4aee305"
  fuel-ostf_sha: "214e794835acc7aa0c1c5de936e93696a90bb57a"
  fuel-mirror_sha: "8bb8c70efc61bcf633e02d6054dbf5ec8dcf6699"
  fuelmenu_sha: "2a0def56276f0fc30fd949605eeefc43e5d7cc6c"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "62573cb2a8aa54845de9303b4a30935a90e1db61"

root@node-2:~# dpkg -l | grep nova
ii nova-common 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - common files
ii nova-compute 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - compute node
ii nova-compute-kvm 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - compute node (KVM)
ii python-nova 2:12.0.0-1~u14.04+mos21 all OpenStack Compute - libraries
ii python-novaclient 2:2.30.2-1~u14.04+mos3 all client library for OpenStack Compute API
root@node-2:~# dpkg -l | grep neutron
ii neutron-common 2:7.0.1-1~u14.04+mos55 all OpenStack virtual network service - common files
ii neutron-plugin-ml2 2:7.0.1-1~u14.04+mos55 all Neutron is a virtual network service for Openstack - ML2 plugin
ii neutron-plugin-openvswitch-agent 2:7.0.1-1~u14.04+mos55 all OpenStack virtual network service - Open vSwitch agent
ii python-neutron 2:7.0.1-1~u14.04+mos55 all OpenStack virtual network service - Python library
ii python-neutronclient 1:3.1.0-1~u14.04+mos9 all client API library for Neutron

Roman Podoliaka (rpodolyaka) wrote:

At the same time, the snippet http://paste.openstack.org/show/483057/ looks interesting. It is neutron-openvswitch-agent that creates the OVS integration bridge (br-int). It looks like in your case the neutron agent started after nova-compute, and by the time the latter tried to plug a VM's VIF into br-int, the bridge didn't exist yet.
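
A quick way to check this on the affected compute node (a sketch; the upstart job name is assumed to match the neutron-plugin-openvswitch-agent package name) would be:

ovs-vsctl br-exists br-int; echo $?       # prints 0 if the integration bridge exists, 2 if it does not
status neutron-plugin-openvswitch-agent   # upstart status of the OVS agent job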

I checked the upstart config of nova-compute, and it looks like we don't have an explicit dependency on neutron-openvswitch-agent, so upstart *may* start the agent later. I guess we could try to tweak the upstart scripts here, although a started neutron agent process would not actually mean that all of the initialization steps have been performed (e.g., in our case, that br-int has been created).
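
A rough sketch of such a tweak (untested; it assumes the agent's upstart job is named neutron-plugin-openvswitch-agent and that replacing the job's start condition and pre-start stanza is acceptable) could be an override file for the nova-compute job:

# /etc/init/nova-compute.override (hypothetical sketch, not a tested fix)
# Start nova-compute only after the OVS agent job has been started.
start on (runlevel [2345] and started neutron-plugin-openvswitch-agent)

pre-start script
    # A started agent process still doesn't guarantee that br-int exists,
    # so also wait (best effort, up to 60 seconds) for the bridge.
    for i in $(seq 1 60); do
        ovs-vsctl br-exists br-int && break
        sleep 1
    done
end script

This only narrows the race between the two services; it does not remove it entirely.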

Roman Podoliaka (rpodolyaka) wrote:

Another point is that the test itself should probably do a better job of migrating the existing workloads off the node / disabling them properly before the node reinstallation. Leaving them in ACTIVE state during the node provisioning and deployment phases seems odd and error-prone from the user's standpoint.
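
For example (a sketch only; the host name and flags are illustrative), the test could drain the compute node with the standard nova CLI before triggering the reinstallation:

nova service-disable node-2 nova-compute   # stop scheduling new instances to this node
nova host-evacuate-live node-2             # live-migrate running instances elsewhere (add --block-migrate if there is no shared storage)
# ... reinstall the node ...
nova service-enable node-2 nova-compute    # return the node to the scheduling pool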

Artem Panchenko (apanchenko-8) wrote:

Roman,

the issue is reproduced consistently on CI starting from ISO build #355 (Dec 26, 2015). Please contact me on IRC if you need an environment after the tests for investigation.

Alexander Gubanov (ogubanov) wrote:

I can't reproduce it on MOS 8.0 (build 402).
Env: Neutron VLAN - 3 controller nodes, 2 compute nodes with Cinder.
Proof: http://pastebin.com/kGjsHgku
