PXE interface flapping

Bug #1493412 reported by Dmitry Klenov
This bug affects 2 people
Affects             Status        Importance  Assigned to        Milestone
Fuel for OpenStack  Fix Released  Critical    Alexei Sheplyakov
7.0.x               Fix Released  Critical    Dmitry Klenov
8.0.x               Fix Released  Critical    Alexei Sheplyakov

Bug Description

Steps to reproduce:

1) Boot the node into bootstrap.
2) eth4 is the admin interface.
3) Deploy the environment.
4) Reset the environment.
5) The node boots into bootstrap again.
6) eth0 is now the admin interface.

It is expected that the admin interface name does not change.

Dmitry Klenov (dklenov) wrote :

User impact:

The issue is seen in ~25% of cases on the hardware that is planned for production use in the customer environment. When the issue happens, it breaks deployment.

Changed in fuel:
importance: Undecided → Critical
assignee: nobody → Alexei Sheplyakov (asheplyakov)
milestone: none → 7.0
Changed in fuel:
assignee: Alexei Sheplyakov (asheplyakov) → nobody
Changed in fuel:
assignee: nobody → Alexei Sheplyakov (asheplyakov)
status: New → Confirmed
Igor Marnat (imarnat) wrote :
Alexander Gordeev (a-gordeev) wrote :

Folks, this report is far too incomplete.

At least, we need to know:
1) the exact version of Fuel on which the customer has run into this issue;
2) log files. It's not clear whether eth4 and eth0 have different MACs or eth4 was just renamed to eth0 by applying udev's mapping from nailgun.

Could the original issue be related to https://bugs.launchpad.net/fuel/+bug/1455610 or https://bugs.launchpad.net/fuel/+bug/1466148 ?

Alexei Sheplyakov (asheplyakov) wrote :

@Aleksandr,

I think the exact version of Fuel is not very useful. Instead, we need the hardware data (lspci -vvv, dmidecode) and boot logs (dmesg).

Igor Marnat (imarnat) wrote :

Correction to comment #2: it's possible that the suggested fix is only partial and does not fully help. Still investigating.

Alexei Sheplyakov (asheplyakov) wrote :

The bug has nothing to do with Ceph; I don't quite understand why it has been assigned to me.

Changed in fuel:
assignee: Alexei Sheplyakov (asheplyakov) → nobody
assignee: nobody → Fuel Python Team (fuel-python)
Alexei Sheplyakov (asheplyakov) wrote :

Perhaps nailgun should maintain the state (in particular the network interface naming) for the bootstrap nodes too.
Alternatively, it can use stateless interface naming, for instance http://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames
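
As a rough illustration of the stateless naming idea: a path-based name is derived from the PCI address of the NIC, so it does not depend on the order in which drivers are loaded. The Python sketch below is a simplified approximation of that scheme, not the systemd implementation. The function name `pci_path_name` and the sample PCI addresses are hypothetical, and firmware/slot index names, USB paths, and `dev_port` suffixes are ignored.

```python
import re

# Matches a sysfs-style PCI address: domain:bus:slot.function (hex fields).
_PCI_ADDR = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def pci_path_name(pci_addr, multifunction=False, prefix="en"):
    """Build a path-based interface name such as 'enp3s0f1'.

    Simplified sketch: bus and slot are printed in decimal, the domain is
    added only when it is non-zero, and the function suffix is added only
    for function > 0 or when the caller says the device is multifunction.
    """
    match = _PCI_ADDR.match(pci_addr)
    if match is None:
        raise ValueError("not a PCI address: %r" % pci_addr)
    domain, bus, slot = (int(field, 16) for field in match.group(1, 2, 3))
    function = int(match.group(4))

    name = prefix
    if domain > 0:
        name += "P%d" % domain
    name += "p%ds%d" % (bus, slot)
    if function > 0 or multifunction:
        name += "f%d" % function
    return name


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    for addr in ("0000:03:00.1", "0000:0a:00.0"):
        print(addr, "->", pci_path_name(addr))
```

With names of this kind, the same physical port gets the same name on every boot regardless of whether igb or ixgbe is probed first, as long as the hardware layout does not change.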

Alexei Sheplyakov (asheplyakov) wrote :

> The fix for the issue: https://review.openstack.org/#/c/220728

Nope, that patch is supposed to fix a different problem.

Aleksey Zvyagintsev (azvyagintsev) wrote :

1) The exact version of Fuel on which the customer ran into this issue.
A: Reproduces at least on fuel7.0-286.

2) It's not clear whether eth4 and eth0 have different MACs or eth4 was just renamed to eth0 by applying udev's mapping from nailgun.
A: eth4 was renamed to eth0; the MAC address stays the same (same cobbler DHCP address, same node ID).

3) lspci -vvv, dmidecode and logs.
A: Will be provided after tests.
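
Since the MAC address stays the same while the name changes, the admin interface can be located by MAC rather than by name. A minimal Python sketch, assuming the standard Linux sysfs layout (`/sys/class/net/<iface>/address`); the helper name `find_iface_by_mac` is hypothetical, and the sketch does not handle bonded interfaces, where several interfaces may report the same MAC.

```python
import os


def find_iface_by_mac(mac, sysfs_net="/sys/class/net"):
    """Return the name of the interface whose MAC equals `mac`, or None.

    `sysfs_net` is a parameter only so the lookup can be pointed at a
    test directory; on a real node the default is what you want.
    """
    wanted = mac.strip().lower()
    for name in sorted(os.listdir(sysfs_net)):
        addr_file = os.path.join(sysfs_net, name, "address")
        try:
            with open(addr_file) as handle:
                if handle.read().strip().lower() == wanted:
                    return name
        except OSError:
            # Entry is not an interface directory or has no address file.
            continue
    return None


if __name__ == "__main__":
    # MAC taken from the udev rules quoted later in this report.
    print(find_iface_by_mac("ec:f4:bb:cd:45:54"))
```

Running this before deployment and again after the environment reset would show directly whether the same NIC moved from eth4 to eth0.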

Alexander Gordeev (a-gordeev) wrote :

Ok, Thanks for the answers.

@Aleksey Zvyagintsev (azvyagintsev), could you additionally provide the contents of /etc/udev/rules.d ?

Changed in fuel:
importance: Critical → High
status: Confirmed → Incomplete
tags: added: feature
Andrew Woodward (xarses)
tags: removed: feature
Aleksey Kasatkin (alekseyk-ru) wrote :

Any logs, please!

slava valyavskiy (slava-val-al) wrote :

In our current case, this bug leads to misconfiguration of ALL bond interfaces, not only the admin interface.

slava valyavskiy (slava-val-al) wrote :

I have attached the nailgun log files from the master node.

Ksenia Svechnikova (kdemina) wrote :

Reproduced in 1 of 20 attempts:

ISO 288
Output of ps auxf on this node, where eth4 should be the PXE interface (note that dhclient is running on eth0):

root 2423 0.0 0.0 9172 612 ? Ss 11:50 0:00 /sbin/dhclient -1 -q -lf /var/lib/dhclient/dhclient-eth0.leases -pf /var/run/dhclient-eth0.pid eth0
root 2781 0.0 0.0 249344 1560 ? Sl 11:54 0:00 /sbin/rsyslogd -i /var/run/syslogd.pid -c 5
root 2845 0.0 0.0 66268 1128 ? Ss 11:54 0:00 /usr/sbin/sshd
root 14824 0.1 0.0 68828 3512 ? Ss 12:31 0:00 \_ sshd: root@pts/0
root 14826 0.1 0.0 108352 1792 pts/0 Ss 12:31 0:00 \_ -bash
root 15096 0.0 0.0 110796 1656 pts/0 R+ 12:31 0:00 \_ ps auxf

Bartosz Kupidura (zynzel) wrote :

The main reason for this bug is probably that two kinds of NICs are installed: the node uses both the igb and ixgbe drivers.
So, depending on which module is loaded first, the admin interface can end up as eth0 or eth4.

70-persistent-net.rules before reboot:
# PCI device 0x8086:0x1521 (igb)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="ec:f4:bb:cd:45:54", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"

70-persistent-net.rules after reboot:
# PCI device 0x8086:0x1521 (igb)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="ec:f4:bb:cd:45:54", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
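
The before/after comparison of the rules files can be automated. A minimal Python sketch, assuming the rules follow the `ATTR{address}=="..."` / `NAME="..."` layout shown in the two excerpts; the function names `parse_net_rules` and `renamed_interfaces` are hypothetical and not part of Fuel or nailgun.

```python
import re

# Captures the MAC address and the assigned name from one udev rule line.
_RULE = re.compile(
    r'ATTR\{address\}=="([0-9a-fA-F:]{17})".*?NAME="([^"]+)"'
)


def parse_net_rules(text):
    """Return a {mac: interface_name} mapping from persistent-net rules text."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _RULE.search(line)
        if match:
            mapping[match.group(1).lower()] = match.group(2)
    return mapping


def renamed_interfaces(before_text, after_text):
    """Return {mac: (old_name, new_name)} for MACs whose name changed."""
    before = parse_net_rules(before_text)
    after = parse_net_rules(after_text)
    return {
        mac: (before[mac], after[mac])
        for mac in before
        if mac in after and before[mac] != after[mac]
    }


if __name__ == "__main__":
    rule = ('SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '
            'ATTR{address}=="ec:f4:bb:cd:45:54", ATTR{type}=="1", '
            'KERNEL=="eth*", NAME="%s"')
    print(renamed_interfaces(rule % "eth4", rule % "eth0"))
```

Applied to the two excerpts above, the comparison reports that the NIC with MAC ec:f4:bb:cd:45:54 went from eth4 to eth0, which matches the symptom described in this report.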

Eugene Bogdanov (ebogdanov) wrote :

Raising priority to Critical as agreed with Dmitry Klenov.

Eugene Bogdanov (ebogdanov) wrote :

Moving back to 7.0 scope: this is a blocker for the Telco team.

Igor Marnat (imarnat) wrote :

JFYI, Albert is working on a fix which _partially_ relates to the problem with hardcoded interface naming: https://review.openstack.org/#/c/223939

Dmitry Klenov (dklenov) wrote :

The bug should be fixed by a commit made for https://bugs.launchpad.net/fuel/+bug/1466148. Commit: https://review.openstack.org/#/c/221235/.

So moving the issue to Fix Released.

Dmitry Klenov (dklenov) wrote :

Sorry, to Fix Committed.

Ksenia Svechnikova (kdemina) wrote :

The issue was not reproduced with ISO #295.

Vitaly Sedelnik (vsedelnik) wrote :

Assigned back to Alexei Sheplyakov for 8.0. I assigned it to Alexey Shtokolov by mistake, sorry for the confusion.

Mike Scherbakov (mihgen) wrote :

Folks,
how can it be fixed for 7.0 but not for 8.0? Are we following our process for backporting fixes?

Aleksey Kasatkin (alekseyk-ru) wrote :

But https://review.openstack.org/#/c/221235/ is the backport of https://review.openstack.org/#/c/220741/ from 8.0 master, so it should be fixed in 8.0 as well.

Dmitry Pyzhov (dpyzhov)
tags: added: mos-linux
Dmitry Pyzhov (dpyzhov)
tags: added: area-mos
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 7.0-updates → 8.0