Changing MTU to jumbo frame on Data network causes reboot cycling

Bug #1798442 reported by Elio Martinez on 2018-10-17
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Steven Webster

Bug Description

Title
-----

Changing MTU value on Data network causes reboot cycling

Brief Description
-----------------

On a multinode local-storage configuration with a data network, change the MTU value of the data interface from 1500 to 3000 on a locked compute node. After the unlock, the compute node boots normally until the console shows the login prompt, then reboots again and keeps cycling.

Severity
--------
Major

Steps to Reproduce
------------------
description: Change the MTU value of the data interface using the CLI.
step-1: Lock a compute node.
step-2: Use the system host-if-modify command to specify the interface and the new MTU value on the node:
$ system host-if-modify -m 3000 compute-0 eth1000
step-3: Unlock the node.
step-4: Repeat the above steps on each compute node.
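
The steps above can be sketched as a small dry-run script. This is only an illustration, not the script used in the test: the `run` and `change_mtu` helpers and the host name `compute-1` are hypothetical, and `system host-lock` / `system host-unlock` are taken from a later comment in this report. With `DRY_RUN=1` (the default here) the commands are printed, not executed. A real run would also have to wait for each host to finish locking before modifying the interface, which this sketch does not do.

```shell
#!/bin/sh
# Sketch: change the MTU of a data interface on each compute node.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: change_mtu <host> <interface> <mtu>
change_mtu() {
    run system host-lock "$1"                     # step-1
    run system host-if-modify -m "$3" "$1" "$2"   # step-2
    run system host-unlock "$1"                   # step-3
}

# step-4: repeat for every compute node (compute-1 is an assumed name).
for host in compute-0 compute-1; do
    change_mtu "$host" eth1000 3000
done
```

In dry-run mode the script prints three `system` commands per host, which makes it easy to review the sequence before running it against a lab system.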

Expected Behavior
-----------------
The compute node should reach the unlocked, active state without issues.

Actual Behavior
---------------

The compute node reboots as soon as the console prompts for a password.

The last messages in dmesg show:
[ 94.543418] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
[ 105.314803] cgroup: new mount options do not match the existing superblock, will be ignored
[ 112.198930] watchdog watchdog0: watchdog did not stop!
[ 113.456691] nfsd: last server has exited, flushing export cache
[ 118.958720] device ovs-netdev left promiscuous mode
[ 120.218603] watchdog watchdog0: watchdog did not stop!

Reproducibility
---------------
100%

System Configuration
--------------------
Bare Metal Multinode Local Storage Configuration 2 controllers 2 computes

ISO
---

Timestamp/Logs
--------------
/var/log/daemon.log:

The ovs-vswitchd.log shows |ERR| failed to create memory pool for netdev eth0 with MTU 3000 on socket 0: Invalid argument.

puppet.log shows

 /Stage[main]/Platform::Vswitch::Ovs/Platform::Vswitch::Ovs::Port[eth0]/Exec[ovs-add-port: eth0]/returns: ovs-vsctl: Error detected while setting up 'eth0': Error attaching device

iso : stx-2018-10-03-11-r-2018.10.iso

Branch/Pull Time/Commit
-----------------------
stx-tools

Attaching dmesg output

Elio Martinez (elio1979) wrote :
Elio Martinez (elio1979) on 2018-10-17
description: updated
Ricardo Perez (richomx) wrote :

This is also visible in the following:

System Configuration
--------------------
Bare Metal Multinode External (CEPH) Storage Configuration 2 controllers 2 computes 2 Storages (CEPH).

The signature of the failure is the same as described by @Elio.

Ghada Khalil (gkhalil) on 2018-10-17
tags: added: stx.networking
Ghada Khalil (gkhalil) wrote :

Jumbo frames (mtu > 1500) have not been tested with ovs-dpdk and we expect there are issues. For now, please do not run this test-case on the data network. We are planning a more thorough test activity shortly.

Targeting stx.2019.03

tags: added: stx.2019.03
summary: - Changing MTU value on Data network causes reboot cycling
+ Changing MTU to jumbo frame on Data network causes reboot cycling
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
Ghada Khalil (gkhalil) on 2018-10-24
Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
Ken Young (kenyis) on 2019-01-18
tags: added: stx.2019.05
removed: stx.2019.03
Ghada Khalil (gkhalil) wrote :

This should be addressed by:
https://storyboard.openstack.org/#!/story/2004472
as of master 2018-01-24

This story makes the vswitch and hugepage size configurable to allow users to allocate enough memory if they intend to use jumbo frames

Ghada Khalil (gkhalil) wrote :

correction: The above story merged as of 2019-01-24 (not 2018)

Steven Webster (swebster-wr) wrote :

Vswitch pages can now be configured via the system command line or horizon GUI in the Host Detail -> Memory tab.

For example, to enable the vswitch with 4 1G hugepages, one could issue the commands:

system host-lock <host name>
system host-memory-modify -f vswitch -1G 4 <host name> <numa node>
system host-unlock <host name>

After the unlock, the system will reboot and apply the worker manifest. The worker manifest will detect a change to the 1G hugepages, update grub, and reboot again. Once the system comes up, the memory configuration can be confirmed by issuing the command:

system host-memory-list <host name>
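
The sequence above could be wrapped in a short dry-run script, sketched here under stated assumptions: the `run` and `set_vswitch_hugepages` helpers, the host name `compute-0`, and NUMA node `0` are hypothetical placeholders, while the `system` subcommands and flags are copied from this comment. With `DRY_RUN=1` (the default here) nothing is executed. In a real run, the final `host-memory-list` check is only meaningful after the host has completed both reboots described above.

```shell
#!/bin/sh
# Sketch: assign 1G hugepages to the vswitch on one host.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: set_vswitch_hugepages <host> <numa node> <1G page count>
set_vswitch_hugepages() {
    run system host-lock "$1"
    run system host-memory-modify -f vswitch -1G "$3" "$1" "$2"
    run system host-unlock "$1"
}

# Assumed values: host compute-0, NUMA node 0, four 1G pages.
set_vswitch_hugepages compute-0 0 4

# Confirm the memory configuration once the host is back up.
run system host-memory-list compute-0
```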

Changed in starlingx:
status: Triaged → Fix Committed
Ghada Khalil (gkhalil) on 2019-02-06
Changed in starlingx:
status: Fix Committed → Fix Released
Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05