StarlingX

Changing MTU to jumbo frame on Data network causes reboot cycling

Bug #1798442 reported by Elio Martinez on 2018-10-17

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
StarlingX	Fix Released	High	Steven Webster

Bug Description

Title
-----

Changing MTU value on Data network causes reboot cycling

Brief Description
-----------------

On a Multinode Local Storage configuration with a data network, the MTU value was changed from 1500 to 3000 on a locked compute node. After the unlock action, the compute node boots normally until the terminal shows the login console, and then enters a reboot cycle.

Severity
--------
Major

Steps to Reproduce
------------------
description: change the mtu value of the data interface using cli
step-1: lock a compute node
step-2: use the system host-if-modify command to specify the interface and the new mtu value on the node
$ system host-if-modify -m 3000 compute-0 eth1000
step-3: unlock the node
step-4: repeat the above steps on each compute node
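
The steps above can be sketched as a small shell helper. This is a minimal sketch, not a tested procedure: the function name `mtu_change_cmds` is hypothetical, and it only prints the `system` commands named in this report so they can be reviewed before being run on a controller. It assumes `system host-lock` and `system host-unlock` take the host name as their only argument.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands for the steps above.
# Usage: mtu_change_cmds <mtu> <interface> <host>...
mtu_change_cmds() {
    mtu=$1
    ifname=$2
    shift 2
    for host in "$@"; do
        echo "system host-lock $host"                       # step-1
        echo "system host-if-modify -m $mtu $host $ifname"  # step-2
        echo "system host-unlock $host"                     # step-3
    done
}

# Values taken from step-2 of this report.
mtu_change_cmds 3000 eth1000 compute-0
```

When the printed commands are run by hand, each one has to complete (for example, the host must actually be locked) before the next is issued.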

Expected Behavior
-----------------
The compute node should reach the unlocked/active state without issues.

Actual Behavior
---------------

The compute node reboots as soon as the console prompts for the password.

The last messages in dmesg show:
[ 94.543418] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[ 105.314803] cgroup: new mount options do not match the existing superblock, will be ignored
[ 112.198930] watchdog watchdog0: watchdog did not stop!
[ 113.456691] nfsd: last server has exited, flushing export cache
[ 118.958720] device ovs-netdev left promiscuous mode
[ 120.218603] watchdog watchdog0: watchdog did not stop!

Reproducibility
---------------
100%

System Configuration
--------------------
Bare Metal Multinode Local Storage Configuration 2 controllers 2 computes

ISO
---

Timestamp/Logs
--------------
/var/log/daemon.log:

The ovs-vswitchd.log shows |ERR| failed to create memory pool for netdev eth0 with MTU 3000 on socket 0: Invalid argument.

puppet.log shows

/Stage[main]/Platform::Vswitch::Ovs/Platform::Vswitch::Ovs::Port[eth0]/Exec[ovs-add-port: eth0]/returns: ovs-vsctl: Error detected while setting up 'eth0': Error attaching device

iso : stx-2018-10-03-11-r-2018.10.iso

Branch/Pull Time/Commit
-----------------------
stx-tools

Attaching dmesg output

Elio Martinez (elio1979) wrote on 2018-10-17:

log.failure (244.8 KiB, text/plain)

Elio Martinez (elio1979) on 2018-10-17

description:	updated

Ricardo Perez (richomx) wrote on 2018-10-17:

This is also visible in the following:

System Configuration
--------------------
Bare Metal Multinode External (CEPH) Storage Configuration 2 controllers 2 computes 2 Storages (CEPH).

The signature of the failure is the same as described by @Elio.

Ghada Khalil (gkhalil) on 2018-10-17

tags:	added: stx.networking

Ghada Khalil (gkhalil) wrote on 2018-10-18:

Jumbo frames (mtu > 1500) have not been tested with ovs-dpdk and we expect there are issues. For now, please do not run this test-case on the data network. We are planning a more thorough test activity shortly.

Targeting stx.2019.03

tags:	added: stx.2019.03
summary:	- Changing MTU value on Data network causes reboot cycling
	+ Changing MTU to jumbo frame on Data network causes reboot cycling
Changed in starlingx:
importance:	Undecided → High
status:	New → Triaged

Ghada Khalil (gkhalil) on 2018-10-24

Changed in starlingx:
assignee:	nobody → Steven Webster (swebster-wr)

Ken Young (kenyis) on 2019-01-18

tags:	added: stx.2019.05
	removed: stx.2019.03

Ghada Khalil (gkhalil) wrote on 2019-02-04:

This should be addressed by:
https://storyboard.openstack.org/#!/story/2004472
as of master 2018-01-24

This story makes the vswitch and hugepage size configurable to allow users to allocate enough memory if they intend to use jumbo frames

Ghada Khalil (gkhalil) wrote on 2019-02-04:

correction: The above story merged as of 2019-01-24 (not 2018)

Steven Webster (swebster-wr) wrote on 2019-02-05:

Vswitch hugepages can now be configured via the system command line or the Horizon GUI in the Host Detail -> Memory tab.

For example, to enable the vswitch with 4 1G hugepages, one could issue the commands:

system host-lock <host name>
system host-memory-modify -f vswitch -1G 4 <host name> <numa node>
system host-unlock <host name>

After the unlock, the system will reboot and apply the worker manifest. The worker manifest will detect a change to the 1G hugepages, update grub, and reboot again. Once the system comes up, the memory configuration can be confirmed by issuing the command:

system host-memory-list <host name>
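
The sequence above can be sketched as a shell helper that prints the commands for a given host. This is a minimal sketch: the function name `vswitch_hugepage_cmds` is hypothetical, and the host name `compute-0` and NUMA node `0` in the example call are illustrative assumptions, not values from this report. It prints rather than runs the commands, since the host reboots twice between the unlock and the final check.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the vswitch hugepage commands above.
# Usage: vswitch_hugepage_cmds <host name> <numa node> <number of 1G pages>
vswitch_hugepage_cmds() {
    host=$1
    numa_node=$2
    pages_1g=$3
    echo "system host-lock $host"
    echo "system host-memory-modify -f vswitch -1G $pages_1g $host $numa_node"
    echo "system host-unlock $host"
    # To be run only once the host is back up after the second reboot.
    echo "system host-memory-list $host"
}

# Example: 4 1G hugepages, as in the comment above (host and node are assumed).
vswitch_hugepage_cmds compute-0 0 4
```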