Using different interface for Admin/PXE fails provisioning

Bug #1466148 reported by Sam Stoelinga
This bug affects 1 person
Affects             Status        Importance  Assigned to     Milestone
Fuel for OpenStack  Fix Released  High        IK
6.1.x               Won't Fix     High        Rodion Tikunov
7.0.x               Fix Released  High        IK

Bug Description

Version found: Fuel 6.1 nightly build 2015-06-13

Steps to reproduce:
1. Create a 1 node cluster consisting only of a controller.
2. Make sure that 2 of the interfaces of this controller are on the same L2 segment as the PXE/Admin network. For example, our eth0 and eth1 are both on the same L2 segment.
3. Now specify that the controller PXE boots in following order: eth0, eth1
4. At discovery phase let the controller PXE boot from eth1
5. After node is discovered assign the controller role and assign eth0 as the Admin/PXE network
6. Deploy as usual

Current result:
Deployment fails at provisioning for the node whose PXE/admin interface was changed. fuel-agent fails with a KeyError: http://paste.openstack.org/show/295544/
fuel_agent.cmd.agent
fuel_agent.cmd.agent KeyError: 'ip_address'
fuel_agent.cmd.agent admin_ip=admin_interface['ip_address'],
fuel_agent.cmd.agent File "/usr/lib/python2.6/site-packages/fuel_agent/drivers/nailgun.py", line 365, in parse_configdrive_scheme

Expected result:
Deployment should succeed, or the UI should prevent changing the PXE/admin interface.

Workaround:
I had to manually edit /usr/lib/python2.6/site-packages/fuel_agent/drivers/nailgun.py (at line 349) to make it return eth1 instead of eth0.

Changed in fuel:
milestone: none → 6.1
assignee: nobody → Fuel provisioning team (fuel-provisioning)
importance: Undecided → High
Alexander Gordeev (a-gordeev) wrote :

Sam Stoelinga, could you provide at least the full fuel-agent.log from the failed node?

It can be found on the bootstrap-loaded node in /var/log/ or on the master node in /var/log/docker-logs/remote/<node>/bootstrap.

If possible, please share a diagnostic snapshot with us.

Sam Stoelinga (sammiestoel) wrote :

I don't have it anymore, but if I can reproduce the issue in a VirtualBox environment, I will provide it.

Changed in fuel:
assignee: Fuel provisioning team (fuel-provisioning) → Aleksandr Gordeev (a-gordeev)
Ryan Moe (rmoe) wrote :

Sam, how were you changing the admin network (step 5 in your description) ?

I believe the reason this doesn't work is as follows:

1. Node is booted and gets admin IP on eth1. The nailgun-agent reports back to nailgun that eth1 has an IP in the admin subnet.

2. You change which nic the admin network maps to (presumably by making a direct API call with something like curl).

3. When you go to deploy, the provisioning serializer looks for the admin interface by looking for which interface has that network mapped (which will be eth0 in this case because of the update in step 2). The serializer stores the MAC address of this in the node's cobbler profile as netcfg/choose_interface.

4. fuel-agent uses the MAC in netcfg/choose_interface to locate the admin interface. In this case it will think eth0 is the admin interface but in reality it's eth1. This results in the traceback you see.
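
The four steps above can be sketched in a few lines of Python. This is an illustration only: the dictionaries and the helpers `serializer_pick_admin_mac` and `agent_admin_ip` are hypothetical stand-ins, not the real nailgun or fuel-agent code, and the MAC and IP values are made up.

```python
# Hypothetical model of the mismatch; not the actual nailgun/fuel-agent code.

# What the node reported at discovery: only eth1 obtained an admin IP.
interfaces = {
    "eth0": {"mac_address": "aa:aa:aa:aa:aa:00"},
    "eth1": {"mac_address": "aa:aa:aa:aa:aa:01", "ip_address": "10.20.0.5"},
}

# Network-to-NIC mapping after the admin network was moved to eth0.
network_mapping = {"fuelweb_admin": "eth0"}


def serializer_pick_admin_mac(interfaces, network_mapping):
    """Pick the admin MAC from the network mapping (steps 2 and 3)."""
    nic = network_mapping["fuelweb_admin"]
    return interfaces[nic]["mac_address"]


def agent_admin_ip(interfaces, admin_mac):
    """Locate the admin interface by MAC and read its IP (step 4)."""
    admin_interface = next(
        iface for iface in interfaces.values()
        if iface["mac_address"] == admin_mac
    )
    return admin_interface["ip_address"]  # KeyError if this NIC has no IP


if __name__ == "__main__":
    mac = serializer_pick_admin_mac(interfaces, network_mapping)
    try:
        agent_admin_ip(interfaces, mac)
    except KeyError as exc:
        print("KeyError:", exc)
```

In this toy model the lookup by MAC succeeds, but the selected NIC never received an address, so reading `ip_address` raises the same kind of error as in the report.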

I don't see how this is a high-priority bug at all.

Sam Stoelinga (sammiestoel) wrote :

Ryan,
In the web interface I just select Configure Interfaces and move the Admin/PXE interface from eth1 to eth0. The hardest part is getting the node to PXE bootstrap from eth1, but this can be done by changing the PXE boot order to: 1. eth1, 2. eth0

Your point 2 is incorrect, as this can easily be done from the web interface. I agree that this is medium or low priority, as I guess it's not normal to have both eth1 and eth0 on the same L2 segment as admin/PXE.

Ryan Moe (rmoe) wrote :

How are you able to move the admin network in the UI? I've never seen that before. We specifically restrict moving that network.

Sam Stoelinga (sammiestoel) wrote :

That's really weird. I'm 100% confident that I did move the Admin network in the UI in the 6.1 nightly build of 2015-06-13, but I can't reproduce it on the nightly build of 2015-06-19: it does restrict moving the network, as you said. Maybe some special edge case or condition meant the restriction wasn't applied in the UI. Feel free to mark this as Invalid until I can reproduce it with the steps listed above.

Alexander Gordeev (a-gordeev) wrote :
tags: added: customer-found feature-image-based
Alexander Gordeev (a-gordeev) wrote :

Here are the steps to reproduce the issue:

Create 2 slave nodes, both of them with 4 NICs.
With the following network connectivity:
Slave1:
eth0, eth1 - the same models, connect them to public network
eth2, eth3 - the same models, connect them to admin network
Slave2:
eth0, eth1 - the same models, connect them to admin network
eth2, eth3 - the same models, connect them to public network

So, the NIC order should be quite different (this step can easily be done on VMs). One node has its first 2 NICs connected to public, the other node has its first 2 NICs connected to the admin network.

Create a cluster with those 2 slave nodes and assign the controller role to both of them.

Go to network settings and create a bond for one pair of interfaces (it doesn't matter which NICs are in the bond). Leave the other pair untouched.

Start deployment. One node will fail to provision with the same traceback as was reported.

Looks like it's not a provisioning issue. The root cause may be hidden somewhere in the interaction between nailgun and the UI. I bet it's nailgun.

Alexander Gordeev (a-gordeev) wrote :
tags: added: feature
Mike Scherbakov (mihgen) wrote :

Sam, can you please comment on how common this use case is?

Sam Stoelinga (sammiestoel) wrote : Re: [Bug 1466148] Re: Using different interface for Admin/PXE fails provisioning

I think it's pretty uncommon. It was an environment where they didn't
listen to my instructions to configure the network correctly.

Mike Scherbakov (mihgen) wrote :

Thanks, Sam. As this is not a very common use case, I'm lowering the priority to Medium and moving it to 8.0. We've passed SCF and need to focus on more common use cases and other High/Critical bugs.

Changed in fuel:
milestone: 7.0 → 8.0
no longer affects: fuel/8.0.x
fatih nar (fenar) wrote :

Hi,
We are experiencing exactly the same issue with HP pizza-box servers, which PXE boot (only) from the embedded NICs labelled eth3/eth4, while the Fuel GUI puts the Admin (PXE) interface on eth0, which is the optional PCI NIC card on the server and is reserved for OpenStack networks. We kindly request a way to configure Fuel to allow moving the Admin (PXE) interface, so we can move it to the interface the server actually booted from in the first place.
Regards
  Fatih E.

Andrew Woodward (xarses) wrote :

@fenar,

Does this occur when using multiple nodegroups (`fuel nodegroup --env <id>`)?

I was only able to reproduce this when the discovered node's admin interface is not a member of the default fuelweb_admin network (network of id=1)

Andrew Woodward (xarses) wrote :

customer found on 6.1 with multiple node groups.

tags: added: feature-nodegroup
Changed in fuel:
importance: Medium → High
Andrew Woodward (xarses) wrote :

[root@nailgun ~]# manage.py shell
2015-09-04 06:48:07.872 DEBUG [7f595c19e700] (settings) Looking for settings.yaml package config using old style __file__
2015-09-04 06:48:07.872 DEBUG [7f595c19e700] (settings) Trying to read config file /usr/lib/python2.6/site-packages/nailgun/settings.yaml
2015-09-04 06:48:08.052 DEBUG [7f595c19e700] (settings) Trying to read config file /etc/nailgun/settings.yaml
2015-09-04 06:48:08.067 DEBUG [7f595c19e700] (settings) Trying to read config file /etc/fuel/version.yaml
Python 2.6.6 (r266:84292, Jan 22 2014, 09:42:36)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from nailgun.objects import Node, Cluster
>>> c = Cluster.get_by_uid(3)
>>> n = Node.get_by_uid(2)
>>> nm = Cluster.get_network_manager(c)
>>>
>>> Node.add_into_cluster(n,3)
2015-09-04 06:48:09.954 DEBUG [7f595c19e700] (manager) Cannot find interface with assigned admin network group on Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)
2015-09-04 06:48:09.961 WARNING [7f595c19e700] (manager) Cannot find admin interface for node return first interface: "Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)"
2015-09-04 06:48:09.961 DEBUG [7f595c19e700] (manager) Cannot find interface with assigned admin network group on Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)
2015-09-04 06:48:09.969 WARNING [7f595c19e700] (manager) Cannot find admin interface for node return first interface: "Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)"
2015-09-04 06:48:09.969 DEBUG [7f595c19e700] (manager) Cannot find interface with assigned admin network group on Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)
2015-09-04 06:48:09.976 WARNING [7f595c19e700] (manager) Cannot find admin interface for node return first interface: "Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)"
2015-09-04 06:48:09.977 DEBUG [7f595c19e700] (manager) Cannot find interface with assigned admin network group on Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)
2015-09-04 06:48:09.985 WARNING [7f595c19e700] (manager) Cannot find admin interface for node return first interface: "Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)"
2015-09-04 06:48:09.985 DEBUG [7f595c19e700] (manager) Cannot find interface with assigned admin network group on Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)
2015-09-04 06:48:09.993 WARNING [7f595c19e700] (manager) Cannot find admin interface for node return first interface: "Untitled (a7:2f) (id=2, mac=64:36:1e:29:a7:2f)"
2015-09-04 06:48:09.993 WARNING [7f595c19e700] (manager) Cannot assign all networks appropriately fornode u'Untitled (a7:2f)'. Set all unassigned networks to theinterface u'eth0'
/usr/lib64/python2.6/site-packages/sqlalchemy/sql/default_comparator.py:35: SAWarning: The IN-predicate on "network_groups.id" was invoked with an empty sequence. This results in a contradiction, which nonetheless can be expensive to evaluate. Consider alternative strategies for improved performance.
  return o[0](self, self.expr, op, *(other + o[1:]), **kwargs)
2015-09-04 06:48:10.005 DEBUG [7f595c19e700] (cluster) New pending changes in environment 3: interfaces node_id=2

Andrew Woodward (xarses) wrote :

bug appears to be from

diff --git a/nailgun/nailgun/network/manager.py b/nailgun/nailgun/network/manager.py
index 9f036ce..09653a6 100644
--- a/nailgun/nailgun/network/manager.py
+++ b/nailgun/nailgun/network/manager.py
@@ -874,7 +874,7 @@ class NetworkManager(object):
                          'network group on %s', node.full_name)

         for interface in node.nic_interfaces:
-            if cls.is_ip_belongs_to_admin_subnet(interface.ip_addr):
+            if cls.is_ip_belongs_to_admin_subnet(interface.ip_addr, node.id):
                 return interface

         logger.warning(u'Cannot find admin interface for node '

>>> for i in n.nic_interfaces: print nm.is_ip_belongs_to_admin_subnet(i.ip_addr, n.id), i.name
...
False eth0
False eth1
False eth2
False eth3
True eth4
>>> for i in n.nic_interfaces: print nm.is_ip_belongs_to_admin_subnet(i.ip_addr), i.name
...
False eth0
False eth1
False eth2
False eth3
False eth4
>>>

But after adding the node id to is_ip_belongs_to_admin_subnet, it still didn't assign the network to the right interface.
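
To illustrate why the extra `node.id` argument could change the result in the session above, here is a minimal sketch of a node-group-aware subnet check. It assumes that, without a node id, only the default admin network is consulted, and that with a node id the admin network of that node's group is used; that is a reading of the output above, not confirmed nailgun behaviour. The CIDRs, the `ADMIN_CIDRS` and `NODE_GROUP` tables, and the function itself are hypothetical, and the sketch uses Python 3's `ipaddress` module rather than the Python 2.6 stack shown in the session.

```python
import ipaddress

# Hypothetical data: admin CIDR per node group; group 1 is the default.
ADMIN_CIDRS = {1: "10.20.0.0/24", 2: "10.30.0.0/24"}
# Hypothetical mapping of node id to node group id.
NODE_GROUP = {2: 2}
DEFAULT_GROUP = 1


def ip_belongs_to_admin_subnet(ip_addr, node_id=None):
    """Return True if ip_addr lies in the admin subnet relevant to the node.

    Without node_id only the default admin network is checked, which is the
    assumed cause of the all-False result for a node in another group.
    """
    if ip_addr is None:
        return False
    group = NODE_GROUP.get(node_id, DEFAULT_GROUP)
    network = ipaddress.ip_network(ADMIN_CIDRS[group])
    return ipaddress.ip_address(ip_addr) in network


if __name__ == "__main__":
    nics = {"eth0": None, "eth4": "10.30.0.7"}
    for name, ip in nics.items():
        print(ip_belongs_to_admin_subnet(ip, node_id=2),
              ip_belongs_to_admin_subnet(ip), name)
```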

Dmitry Pyzhov (dpyzhov) wrote :

Is the bug reproducible on 7.0?

Dmitry Pyzhov (dpyzhov) wrote :

According to Alexander Gordeev bug exists in 7.0.

no longer affects: fuel/8.0.x
Alexei Sheplyakov (asheplyakov) wrote :

Why don't we rename the boot NIC to eth0 in the bootstrap (using udev rules) so there's no need to reassign the admin network?

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/220741

Changed in fuel:
status: Confirmed → In Progress
Aleksey Kasatkin (alekseyk-ru) wrote :

Alexei, the mapping of networks to interfaces would then change unpredictably. And we agreed to move in the opposite direction: to use persistent NIC names (not in 7.0, though).
In the current model of network-to-interface assignment, I propose to:
1. Fix Nailgun to remap the Admin network when the PXE interface is changed.
2. Fix fuel-agent so that it does not rely on the initial interface configuration being stable. This seems to be the second priority, as the agent still reports the correct PXE MAC and Admin IP.
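
As a rough illustration of proposal 1, the remapping could look like the sketch below. It is not the patch under review: the interface dictionaries, the `assigned_networks` key, and the `remap_admin_network` helper are assumptions made for this example; only the idea of a `pxe` flag and the `fuelweb_admin` network name come from this thread.

```python
ADMIN_NET = "fuelweb_admin"


def remap_admin_network(interfaces):
    """Move the admin network onto the interface flagged as PXE.

    interfaces: list of dicts with 'name', 'pxe' (bool) and
    'assigned_networks' (list of network names). Returns True if the
    mapping was changed. Hypothetical structure, for illustration only.
    """
    pxe_ifaces = [i for i in interfaces if i.get("pxe")]
    if len(pxe_ifaces) != 1:
        raise ValueError("expected exactly one PXE interface")
    pxe_iface = pxe_ifaces[0]

    if ADMIN_NET in pxe_iface["assigned_networks"]:
        return False  # already consistent, nothing to do

    # Detach the admin network from whichever NIC currently holds it.
    for iface in interfaces:
        if ADMIN_NET in iface["assigned_networks"]:
            iface["assigned_networks"].remove(ADMIN_NET)
    pxe_iface["assigned_networks"].append(ADMIN_NET)
    return True


if __name__ == "__main__":
    nics = [
        {"name": "eth0", "pxe": False,
         "assigned_networks": ["fuelweb_admin", "public"]},
        {"name": "eth1", "pxe": True, "assigned_networks": []},
    ]
    print(remap_admin_network(nics), nics)
```

Other networks stay where they are; only the admin network follows the PXE interface, so the MAC handed to provisioning and the NIC that actually holds the admin IP stay in agreement.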

Aleksey Kasatkin (alekseyk-ru) wrote :

Fix to Nailgun is proposed: https://review.openstack.org/#/c/220741/

Aleksey Kasatkin (alekseyk-ru) wrote :

Workaround: remove the node from the environment and add it again.

tags: added: tricky
Changed in fuel:
assignee: Aleksey Kasatkin (alekseyk-ru) → Ivan Kliuk (ivankliuk)
Alexander Gordeev (a-gordeev) wrote :

Regarding the fix for fuel-agent:

According to the attached log https://bugs.launchpad.net/fuel/7.0.x/+bug/1466148/+attachment/4423541/+files/fuel-agent.log

it throws KeyError because the admin MAC, 00:0e:1e:8f:18:20, points to eth2:

https://github.com/stackforge/fuel-web/blob/master/nailgun/nailgun/orchestrator/provisioning_serializers.py#L104

but nailgun has allocated ip address only for eth0:

u'eth2': {u'static': u'0', u'mac_address': u'00:0e:1e:8f:18:20'},

u'eth0': {u'dns_name': u'node-27.domain.tld', u'netmask': u'255.255.254.0', u'mac_address': u'54:9f:35:23:b3:ec', u'ip_address': u'10.9.26.15', u'static': u'0'}

so, there's nothing to fix in fuel-agent. It's definitely wrong input data.
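
A small check along these lines would flag such inconsistent input before provisioning starts. It is a sketch only: `find_admin_interface` and `check_admin_interface` are hypothetical helpers, not part of nailgun or fuel-agent; the sample values are the ones quoted from the log above.

```python
def find_admin_interface(interfaces, admin_mac):
    """Return (name, data) of the interface whose MAC matches admin_mac."""
    for name, data in interfaces.items():
        if data.get("mac_address", "").lower() == admin_mac.lower():
            return name, data
    raise LookupError("no interface with MAC %s" % admin_mac)


def check_admin_interface(interfaces, admin_mac):
    """Return a list of problems with the provisioning input (empty if OK)."""
    name, data = find_admin_interface(interfaces, admin_mac)
    problems = []
    if "ip_address" not in data:
        with_ip = sorted(n for n, d in interfaces.items() if "ip_address" in d)
        problems.append(
            "admin MAC %s maps to %s, which has no ip_address; "
            "interfaces with an address: %s"
            % (admin_mac, name, ", ".join(with_ip) or "none")
        )
    return problems


if __name__ == "__main__":
    # Values copied from the fuel-agent.log excerpt quoted above.
    interfaces = {
        "eth2": {"static": "0", "mac_address": "00:0e:1e:8f:18:20"},
        "eth0": {
            "dns_name": "node-27.domain.tld",
            "netmask": "255.255.254.0",
            "mac_address": "54:9f:35:23:b3:ec",
            "ip_address": "10.9.26.15",
            "static": "0",
        },
    }
    for problem in check_admin_interface(interfaces, "00:0e:1e:8f:18:20"):
        print(problem)
```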

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/220741
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=ef2ac6551af54a69a4e1387a46ba6b8a8e8a268f
Submitter: Jenkins
Branch: master

commit ef2ac6551af54a69a4e1387a46ba6b8a8e8a268f
Author: Aleksey Kasatkin <email address hidden>
Date: Sat Sep 5 12:14:01 2015 -0500

    Remap Admin network when PXE interface is changed

    1. Check whether PXE interface was changed and remap Admin-pxe network accordingly.
    2. Fix _get_pxe_iface_name to check all meta info related to PXE interface.
    3. Update node's IP and MAC before checking it against interfaces meta.
    4. Fix 'pxe' flag assignment in tests.

    Change-Id: I6bec445dbc56c9caad12926a2733c59e82e58e36
    Partial-bug: #1466148

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/7.0)

Fix proposed to branch: stable/7.0
Review: https://review.openstack.org/221235

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/7.0)

Reviewed: https://review.openstack.org/221235
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=93477f9b42c5a5e0506248659f40bebc9ac23943
Submitter: Jenkins
Branch: stable/7.0

commit 93477f9b42c5a5e0506248659f40bebc9ac23943
Author: Aleksey Kasatkin <email address hidden>
Date: Sat Sep 5 12:14:01 2015 -0500

    Remap Admin network when PXE interface is changed

    1. Check whether PXE interface was changed and remap Admin-pxe network accordingly.
    2. Fix _get_pxe_iface_name to check all meta info related to PXE interface.
    3. Update node's IP and MAC before checking it against interfaces meta.
    4. Fix 'pxe' flag assignment in tests.

    Change-Id: I6bec445dbc56c9caad12926a2733c59e82e58e36
    Partial-bug: #1466148

Vitalii Myhal (xmig)
tags: added: on-verification
Vitalii Myhal (xmig) wrote :

verified on fuel-7.0-292-2015-09-12_17-07-49.iso

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Dmitry Pyzhov (dpyzhov)
tags: added: area-python
Rodion Tikunov (rtikunov) wrote :

Won't Fix for 6.1 because the patch that fixes this issue changes code that is not present in 6.1.

tags: added: wontfix-risky