Fuel for OpenStack

HA cluster deployment fails on Network/Exec[waiting-for-neutron-api]

Bug #1421723 reported by Dmitry Sutyagin on 2015-02-13

This bug report is a duplicate of: Bug #1396126: Deployment doesn't work without an active public gateway.

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Fuel for OpenStack	Confirmed	Undecided	Fuel Library (Deprecated)	Fuel for OpenStack 6.1

Bug Description

I got this error in a virtual environment created from scratch with the VirtualBox scripts available on mirantis.com.

1. Created MOS system (fuel + 4 nodes) via VirtualBox script provided by Mirantis.
2. Created a new environment in Fuel: CentOS -> HA -> Neutron+VLAN.
3. Selected 3 machines as Controllers, one as Compute.
4. Fuel installed CentOS on all nodes successfully, then started to install OpenStack on the first node. After a while I noticed this in the Puppet log:
--------------------------------------------------
Fri Feb 13 14:15:05 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api] (info): Starting to evaluate the resource
Fri Feb 13 14:15:05 +0000 2015 Exec[waiting-for-neutron-api](provider=posix) (debug): Executing check 'test -r /root/openrc'
Fri Feb 13 14:15:05 +0000 2015 Puppet (debug): Executing 'test -r /root/openrc'
Fri Feb 13 14:15:05 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Exec try 1/30
Fri Feb 13 14:15:05 +0000 2015 Exec[waiting-for-neutron-api](provider=posix) (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:05 +0000 2015 Puppet (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:21 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Sleeping for 4.0 seconds between tries
Fri Feb 13 14:15:25 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Exec try 2/30
Fri Feb 13 14:15:25 +0000 2015 Exec[waiting-for-neutron-api](provider=posix) (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:25 +0000 2015 Puppet (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:35 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Sleeping for 4.0 seconds between tries
Fri Feb 13 14:15:39 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Exec try 3/30
Fri Feb 13 14:15:39 +0000 2015 Exec[waiting-for-neutron-api](provider=posix) (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:39 +0000 2015 Puppet (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:49 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Sleeping for 4.0 seconds between tries
Fri Feb 13 14:15:53 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Exec try 4/30
Fri Feb 13 14:15:53 +0000 2015 Exec[waiting-for-neutron-api](provider=posix) (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:15:53 +0000 2015 Puppet (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:16:02 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Sleeping for 4.0 seconds between tries
--------------------------------------------------

...and so on until all 30 tries are done:

--------------------------------------------------
Fri Feb 13 14:21:35 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Exec try 30/30
Fri Feb 13 14:21:35 +0000 2015 Exec[waiting-for-neutron-api](provider=posix) (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:21:35 +0000 2015 Puppet (debug): Executing 'bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null'
Fri Feb 13 14:21:45 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (debug): Sleeping for 4.0 seconds between tries
Fri Feb 13 14:21:49 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (notice): Unable to establish connection to http://172.16.0.2:9696/v2.0/networks.json
Fri Feb 13 14:21:49 +0000 2015 Puppet (err): bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1 > /dev/null returned 1 instead of one of [0]
Fri Feb 13 14:21:49 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api]/returns (err): change from notrun to 0 failed: bash -c "source /root/openrc ; neutron net-list --http-timeout=4 " 2>&1
Fri Feb 13 14:21:49 +0000 2015 /Stage[main]/Openstack::Network/Exec[waiting-for-neutron-api] (info): Evaluated in 403.76 seconds
Fri Feb 13 14:21:49 +0000 2015 /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service] (info): Starting to evaluate the resource
Fri Feb 13 14:21:49 +0000 2015 /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service] (notice): Dependency Exec[waiting-for-neutron-api] has failures: true
Fri Feb 13 14:21:49 +0000 2015 /Stage[main]/Neutron::Agents::Ml2::Ovs/Service[neutron-ovs-agent-service] (warning): Skipping because of failed dependencies
--------------------------------------------------

So Puppet carried on but skipped the dependent resources, which inevitably resulted in a failed deployment.

I had to wait another 50 minutes for the deployment to "officially" fail in the Fuel interface.
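
For reference, the retry pattern visible in the log (30 tries, a 4-second sleep after each failed try, failure reported only after the last one) can be sketched roughly as below. This is a minimal Python illustration, not Puppet's actual implementation. `retry_command` and its parameters are hypothetical names, and the command is passed in as a callable so the sketch runs without a Neutron endpoint.

```python
import time
from typing import Callable


def retry_command(
    run_once: Callable[[], int],
    tries: int = 30,
    try_sleep: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_once until it returns 0 or all tries are used up.

    Returns the exit code of the last attempt. As in the log above,
    the sleep happens after every failed attempt, including the last.
    """
    code = 1
    for _attempt in range(1, tries + 1):
        code = run_once()
        if code == 0:
            return 0
        sleep(try_sleep)
    return code
```

In the log each cycle took roughly 13 to 14 seconds (about 10 seconds for the command plus the 4-second sleep), which over 30 tries is consistent with the 403.76 seconds Puppet reports.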

Revision history for this message

Dmitry Sutyagin (dsutyagin) wrote on 2015-02-13:

fuel-snapshot-2015-02-13_16-38-14.tgz (17.4 MiB, application/x-tar)

Dmitry Sutyagin (dsutyagin) wrote on 2015-02-13:

Extra info - the IP which fails can be found in /etc/astute.yaml of this node:

public_vip: 172.16.0.2

I could not find out why the neutron CLI tries to connect to the public VIP and not to 192.168... or 10.20.0.X.
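
The lookup described above can be sketched as a small Python check that reads the `public_vip` value and compares it with the host in the failing URL. This is only an illustration: `find_public_vip` and `host_of` are hypothetical helper names, and the regex assumes the key appears on a single line exactly as quoted above rather than parsing YAML in general.

```python
import re
from typing import Optional
from urllib.parse import urlparse


def find_public_vip(yaml_text: str) -> Optional[str]:
    """Return the value of the first `public_vip:` line, or None if absent."""
    match = re.search(
        r"^\s*public_vip:\s*['\"]?([^'\"\s#]+)", yaml_text, re.MULTILINE
    )
    return match.group(1) if match else None


def host_of(url: str) -> Optional[str]:
    """Return the host part of a URL such as the one in the Puppet notice."""
    return urlparse(url).hostname
```

On the node, one would pass the contents of /etc/astute.yaml to `find_public_vip` and the URL from the "Unable to establish connection" notice to `host_of`, then compare the two.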

Ryan Moe (rmoe) wrote on 2015-02-13:

The public VIP failed to start, most likely because the public gateway (172.16.0.1) can't be pinged. Can you verify that you can ping the public gateway from your controller?
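
A quick way to run that check from a controller is sketched below in Python. It assumes a Linux node with the iputils `ping` binary (`-c` sets the packet count, `-W` the per-reply timeout in seconds). `ping_command` and `is_reachable` are hypothetical helper names, not part of Fuel.

```python
import subprocess
from typing import List


def ping_command(host: str, count: int = 3, timeout_s: int = 2) -> List[str]:
    """Build a ping command line: -c = packet count, -W = reply timeout (s)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def is_reachable(host: str) -> bool:
    """Return True if ping exits with status 0, i.e. a reply was received."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Calling `is_reachable("172.16.0.1")` on the controller would then answer the question for the public gateway discussed here.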

Ryan Moe (rmoe) on 2015-02-14

Changed in fuel:
status:	New → Confirmed
assignee:	nobody → Fuel Library Team (fuel-library)
milestone:	none → 6.1

Dmitry Sutyagin (dsutyagin) wrote on 2015-02-16:

Hi Ryan,

172.16.0.1 can be pinged from the node. The network is provided by VirtualBox, so I suppose this IP was pingable from the beginning of the installation (I cannot be 100% sure).

How do I restart the public VIP?

Dmitry Sutyagin (dsutyagin) wrote on 2015-02-16:

Ryan,

If you check puppet.log from node-1 within the attached snapshot, you can see that service vip__public did not fail to start:

--------------------------------
Fri Feb 13 14:03:36 +0000 2015 Puppet (debug): Executing '/usr/sbin/cibadmin -Q'
Fri Feb 13 14:03:37 +0000 2015 Puppet (debug):
-> Simple primitive 'vip__public' global status: start
node-1.domain.tld: start
--------------------------------

But within a minute it had already stopped:

--------------------------------
Fri Feb 13 14:04:13 +0000 2015 Puppet (debug): Executing '/usr/sbin/cibadmin -Q'
Fri Feb 13 14:04:15 +0000 2015 Puppet (debug):
-> Cloned primitive 'clone_ping_vip__public' global status: stop
node-1.domain.tld: stop
-> Simple primitive 'vip__public' global status: stop
node-1.domain.tld: stop
--------------------------------

Today I tried to deploy again; this time the same setup fails on:

--------------------------------
Mon Feb 16 07:26:43 +0000 2015 Puppet (debug): Executing '/usr/sbin/pcs resource enable vip__public'
Mon Feb 16 07:26:45 +0000 2015 Puppet (debug): Choose global start for Pacemaker service 'vip__public'
Mon Feb 16 07:26:45 +0000 2015 Puppet (debug): Waiting 600 seconds for service 'vip__public' to start
Mon Feb 16 07:26:45 +0000 2015 Puppet (debug): Executing '/usr/sbin/cibadmin -Q'
Mon Feb 16 07:26:53 +0000 2015 Puppet (debug): Executing '/usr/sbin/cibadmin -Q'
Mon Feb 16 07:27:01 +0000 2015 Puppet (debug): Executing '/usr/sbin/cibadmin -Q'
Mon Feb 16 07:27:09 +0000 2015 Puppet (debug): Executing '/usr/sbin/cibadmin -Q'
--------------------------------

And so on for 10 minutes. At this time (after reboot and during second deploy attempt), 172.16.0.1 CANNOT be pinged from the node.

I have checked host machine IP settings and found that the host address 172.16.0.1 is not set by VirtualBox:
--------------------------------
3: vboxnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 0a:00:27:00:00:00 brd ff:ff:ff:ff:ff:ff
    inet 10.20.0.1/24 brd 10.20.0.255 scope global vboxnet0
       valid_lft forever preferred_lft forever
    inet 10.20.0.8/24 brd 10.20.0.255 scope global secondary dynamic vboxnet0
       valid_lft 7049sec preferred_lft 7049sec
4: vboxnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 0a:00:27:00:00:01 brd ff:ff:ff:ff:ff:ff
5: vboxnet2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN group default qlen 1000
    link/ether 0a:00:27:00:00:02 brd ff:ff:ff:ff:ff:ff
--------------------------------

Not sure why, seems like a problem with VirtualBox (using 4.3.20 on Fedora 20 x64).
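
To spot this kind of problem quickly, the `ip addr` output on the host can be scanned for interfaces that carry no IPv4 address. The following Python sketch does that for text in the format shown above. The function names are hypothetical, and the parsing only handles the header and `inet` lines that appear in this excerpt.

```python
import re
from typing import Dict, List, Optional


def addresses_by_interface(ip_addr_output: str) -> Dict[str, List[str]]:
    """Map each interface name to its IPv4 addresses (CIDR form)."""
    result: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in ip_addr_output.splitlines():
        # Header lines look like "3: vboxnet0: <BROADCAST,...> mtu 1500 ..."
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            result[current] = []
            continue
        # Address lines look like "    inet 10.20.0.1/24 brd ..."
        inet = re.match(r"^\s+inet\s+(\S+)", line)
        if inet and current is not None:
            result[current].append(inet.group(1))
    return result


def interfaces_without_ipv4(ip_addr_output: str) -> List[str]:
    """Return the interfaces that have no `inet` line at all."""
    addrs = addresses_by_interface(ip_addr_output)
    return [name for name, found in addrs.items() if not found]
```

For the output pasted above, `interfaces_without_ipv4` reports vboxnet1 and vboxnet2, the two host-only networks with no host address assigned.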

Meanwhile, I would like to note that the documentation nowhere mentions (at least I could not find it) that a functioning public network is required for a successful deployment.

Bogdan Dobrelya (bogdando) wrote on 2015-02-16:

@Dmitry, the documentation is being updated to reflect this requirement. https://review.openstack.org/#/c/154130/5/pages/reference-architecture/network-concepts/6011-ha-networking.rst

I set this bug as a duplicate of a known issue. The public gateway must be accessible from the controller nodes for the deployment to succeed.