compute node keeps offline after unlock due to vswitch error, caused by a hugepage allocation failure

Bug #1829403 reported by Peng Peng
This bug affects 5 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Austin Sun

Bug Description

Brief Description
-----------------
During initial lab commissioning, after unlocking the compute nodes, one of the compute nodes stays "offline".

Severity
--------
Major

Steps to Reproduce
------------------
- Commission lab
- unlock worker/compute nodes

Expected Behavior
------------------
- All compute nodes are enabled without issues

Actual Behavior
----------------
- One compute node remains offline

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190515T220331Z

Last Pass
---------
2019-05-09_16-05-20

Timestamp/Logs
--------------
[2019-05-16 08:01:40,938] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-list --nowrap'
[2019-05-16 08:01:42,478] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | degraded |
| 5 | compute-0 | worker | locked | disabled | online |
| 6 | compute-1 | worker | locked | disabled | online |
| 7 | compute-2 | worker | locked | disabled | online |
| 8 | compute-3 | worker | locked | disabled | online |
| 9 | compute-4 | worker | locked | disabled | online |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-16 08:02:30,918] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-3'

[2019-05-16 08:25:49,329] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-list --nowrap'
[2019-05-16 08:25:50,848] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | disabled | offline |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Test Activity
-------------
lab setup

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

daemon.log (compute-3)

2019-05-16T08:06:47.366 compute-3 systemd[1]: info Starting Open vSwitch...
2019-05-16T08:06:47.377 compute-3 systemd[1]: info Started Open vSwitch.
2019-05-16T08:06:47.395 compute-3 systemd[1]: info Reloading.
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-hugepage-dir=/mnt/huge-1048576kB
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=6
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=0,0
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-extra=-n 4"
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=7
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
2019-05-16T08:06:47.000 compute-3 ovs-vswitchd: err ovs|00018|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-05-16T08:06:47.000 compute-3 ovs-vswitchd: err ovs|00019|dpdk|ERR|EAL: Invalid 'command line' arguments.
2019-05-16T08:06:47.000 compute-3 ovs-vswitchd: alert ovs|00020|dpdk|EMER|Unable to initialize DPDK: Invalid argument
2019-05-16T08:06:48.380 compute-3 ovs-ctl[37096]: info 2019-05-16T08:06:48Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.36875.ctl
2019-05-16T08:06:48.000 compute-3 ovs-appctl: warning ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.36875.ctl
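
For triage on a similar failure, the DPDK settings behind those EAL arguments can be read back from the OVS database (assuming ovsdb-server is still running):

   sudo ovs-vsctl get Open_vSwitch . other_config

This should echo the values set above, e.g. dpdk-socket-mem="0,0", showing that the invalid socket memory comes from configuration rather than from DPDK itself.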

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

ovs-vswitchd.log
compute-3:/var/log/openvswitch$ cat ovs-vswitchd.log | grep ERR
2019-05-16T08:06:47.595Z|00018|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-05-16T08:06:47.595Z|00019|dpdk|ERR|EAL: Invalid 'command line' arguments.

Ghada Khalil (gkhalil)
tags: added: stx.networking
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote : Re: compute node keeps offline after unlock due to vswitch error

@Peng, It says the issue is reproducible. How many times was this issue seen? Is it always on the same hardware node or different hardware nodes?

summary: - compute node keeps offline after unlock
+ compute node keeps offline after unlock due to vswitch error
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The same issue is also reported in https://bugs.launchpad.net/starlingx/+bug/1829390 which indicates a 30-50% frequency. @Peng, please clarify the frequency you are seeing.

Revision history for this message
Peng Peng (ppeng) wrote :

I saw it twice recently

Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → ChenjieXu (midone)
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Is this bug always on the same machine or different machines?

A similar issue (https://bugs.launchpad.net/starlingx/+bug/1829390) is caused by hugepages not being allocated. Could you please run the following commands on compute-3 to check the hugepages and attach the outputs?
   sudo find /sys -name "nr_huge*"
   sudo find /sys -name "nr_huge*" | xargs -L1 grep -E "^"
   sudo find /sys -name "free_hugepages*"
   sudo find /sys -name "free_hugepages" | xargs -L1 grep -E "^"
   mount | grep -i huge
   cat /proc/cmdline
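
A compact equivalent that prints each counter next to its path (a one-liner sketch relying on bash brace expansion):
   sudo grep -H . /sys/devices/system/node/node*/hugepages/hugepages-*/{nr,free}_hugepages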

If hugepages are not allocated, could you please try to allocate them and then restart ovs-vswitchd, as I described in comment #2 of https://bugs.launchpad.net/starlingx/+bug/1829390?

Revision history for this message
Peng Peng (ppeng) wrote :

Both occurrences happened on the same lab.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

This time it happened on a different node.

See the command output below; the OVS log is attached.

[wrsroot@controller-0 tmp(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | disabled | offline |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 tmp(keystone_admin)]$

on compute-2:
compute-2:~$ sudo find /sys -name "free_hugepages*"
Password:
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
/sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
compute-2:~$ sudo find /sys -name "free_hugepages" | xargs -L1 grep -E "^"
0
0
0
0
0
0
compute-2:~$ mount | grep -i huge
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
none on /dev/huge-1048576kB type hugetlbfs (rw,relatime,pagesize=1048576kB)
none on /mnt/huge-2048kB type hugetlbfs (rw,relatime,pagesize=2048kB)
none on /dev/huge-2048kB type hugetlbfs (rw,relatime,pagesize=2048kB)
none on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
none on /mnt/huge-1048576kB type hugetlbfs (rw,relatime,pagesize=1048576kB)
compute-2:~$ sudo echo 3 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
compute-2:~$ sudo echo 3 > sudo tee /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
compute-2:~$ sudo echo 5000 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
compute-2:~$ sudo echo 5000 > sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
compute-2:~$ sudo cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
0
compute-2:~$ sudo cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
0
compute-2:~$ sudo cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
0
compute-2:~$ sudo cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
0
compute-2:~$ systemctl s...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Thank you for your response! Based on your logs, this bug is the same as issue https://bugs.launchpad.net/starlingx/+bug/1829390: there are no hugepages available, and OVS-DPDK needs hugepages to start.

Last time I asked you to allocate hugepages manually, but the commands executed were wrong. Hugepages should be allocated with the following command:
   sudo echo 3 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
not with the following command:
   sudo echo 3 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
"nr_hugepages" is used to allocate hugepages; "free_hugepages" is used to check how many hugepages are available.

Please execute the following commands (some have been updated) to allocate hugepages manually and then restart ovs-vswitchd; a consolidated script follows the steps below:
1. allocate hugepages on each numa node:
   sudo bash
   echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
2. make sure hugepages have been allocated by checking nr_hugepages and free_hugepages:
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
3. Make sure ovsdb-server is running
   systemctl status ovsdb-server
4. restart ovs-vswitchd
   systemctl status ovs-vswitchd
   sudo systemctl restart ovs-vswitchd
   systemctl status ovs-vswitchd
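
The same four steps can be wrapped into one script (a sketch assuming the two-NUMA-node layout above; the page counts are the values requested in this thread, not universal defaults):
   #!/bin/bash
   # Run as root (sudo bash): allocate hugepages on both NUMA nodes,
   # verify the counts, then restart ovs-vswitchd.
   set -e
   for node in node0 node1; do
      echo 3    > /sys/devices/system/node/$node/hugepages/hugepages-1048576kB/nr_hugepages
      echo 5000 > /sys/devices/system/node/$node/hugepages/hugepages-2048kB/nr_hugepages
   done
   # nr_hugepages shows what was actually allocated; free_hugepages shows what is unused.
   grep -H . /sys/devices/system/node/node*/hugepages/hugepages-*/{nr,free}_hugepages
   # ovsdb-server must be active before ovs-vswitchd is restarted (set -e aborts otherwise).
   systemctl is-active ovsdb-server
   systemctl restart ovs-vswitchd
   systemctl status ovs-vswitchd --no-pager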

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue with memory allocation for ovs-dpdk. High priority as the issue has been seen multiple times.

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2.0
Revision history for this message
Peng Peng (ppeng) wrote :

WCP_113-121: computes 3 and 4 went offline again.

on compute-3:
compute-3:~$ sudo bash
Password:
compute-3:/home/wrsroot# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
compute-3:/home/wrsroot# echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
compute-3:/home/wrsroot# echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
compute-3:/home/wrsroot# echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
5000
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
5000
compute-3:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2019-05-27 13:40:07 UTC; 3h 43min ago
  Process: 36858 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 37349 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 36637 (code=dumped, signal=ABRT)
compute-3:/home/wrsroot# sudo systemctl restart ovs-vswitchd
Job for ovs-vswitchd.service failed because the control process exited with error code. See "systemctl status ovs-vswitchd.service" and "journalctl -xe" for details.
compute-3:/home/wrsroot# systemctl status ovs-vswitchd.service
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2019-05-27 17:23:41 UTC; 27s ago
  Process: 36858 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 175968 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 36637 (code=dumped, signal=ABRT)
compute-3:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2019-05-27 17:23:41 UTC; 5min ago
  Process: 3685...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,
According to your response, the hugepages can be allocated manually, but ovs-vswitchd still fails to start. The following lines in "ovs-vswitchd.log" explain why ovs-vswitchd can't start:

2019-05-27T17:23:40.222Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 7 --huge-dir /mnt/huge-1048576kB --socket-mem 0,0 --socket-limit 0,0.
2019-05-27T17:23:40.223Z|00013|dpdk|INFO|EAL: Detected 88 lcore(s)
2019-05-27T17:23:40.223Z|00014|dpdk|INFO|EAL: Detected 2 NUMA nodes
2019-05-27T17:23:40.223Z|00015|dpdk|ERR|EAL: invalid parameters for --socket-mem

The socket memory has been set to 0 for both NUMA nodes, but the value should be 1024,1024. You can set it with the following commands:
sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-limit="1024,1024"

Could you please change the socket memory and then restart ovs-vswitchd again?
1. allocate hugepages on each numa node:
   sudo bash
   echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
2. make sure hugepages have been allocated by checking nr_hugepages and free_hugepages:
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
3. Make sure ovsdb-server is running
   systemctl status ovsdb-server
4. change the socket memory from 0 to 1024MB for 2 numa nodes
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-limit="1024,1024"
5. restart ovs-vswitchd
   systemctl status ovs-vswitchd
   sudo systemctl restart ovs-vswitchd
   systemctl status ovs-vswitchd

Revision history for this message
Peng Peng (ppeng) wrote :

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | disabled | offline |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

compute-0:~$ sudo bash
Password:
compute-0:/home/wrsroot# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
compute-0:/home/wrsroot# echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
compute-0:/home/wrsroot# echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
compute-0:/home/wrsroot# echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
5000
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
5000
compute-0:/home/wrsroot# systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2019-05-28 13:32:52 UTC; 22min ago
 Main PID: 32828 (ovsdb-server)
    Tasks: 1
   Memory: 7.8M
   CGroup: /system.slice/ovsdb-server.service
           └─32828 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=d...
compute-0:/home/wrsroot# sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
> "
compute-0:/home/wrsroot#...


Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Could you please attach the logs "ovs-vswitchd.log" and "ovsdb-server.log"?

Could you please also execute the following commands and attach the outputs:
   cd /tmp
   grep -rn "vswitch::dpdk::socket_mem"
   cd /opt
   grep -rn "vswitch::dpdk::socket_mem"

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

The following commands for step 4 are wrong:
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-limit="1024,1024"
The commands should be:
   sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
   sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="1024,1024"
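
Before restarting ovs-vswitchd, the values can be read back to confirm they took effect (assuming ovsdb-server is running):
   sudo ovs-vsctl get Open_vSwitch . other_config:dpdk-socket-mem
   sudo ovs-vsctl get Open_vSwitch . other_config:dpdk-socket-limit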

Sorry for typing the wrong commands! Could you please try again?

Revision history for this message
Peng Peng (ppeng) wrote :

compute-2:~$ sudo bash
Password:
compute-2:/home/wrsroot# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
compute-2:/home/wrsroot# echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
compute-2:/home/wrsroot# echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
compute-2:/home/wrsroot# echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
5000
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
5000
compute-2:/home/wrsroot# systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2019-06-06 07:43:09 UTC; 6h ago
 Main PID: 33097 (ovsdb-server)
    Tasks: 1
   Memory: 7.8M
   CGroup: /system.slice/ovsdb-server.service
           └─33097 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=d...
compute-2:/home/wrsroot# sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
compute-2:/home/wrsroot# sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="1024,1024"
compute-2:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2019-06-06 07:43:47 UTC; 6h ago
  Process: 37193 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
compute-2:/home/wrsroot# sudo systemctl restart ovs-vswitchd
compute-2:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2019-06-06 13:52:25 UTC; 4s ago
    Tasks: 99
   Memory: 811.7M
   CGroup: /system.slice/ovs-vswitchd.service
           └─275814 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openv...
compute-2:/home/wrsr...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Abraham Arce (xe1gyq) wrote :

This "Compute Node Keeps Offline After Unlock Due To Vswitch Error" issue has not been seen while deploying a Bare Metal Dedicated Storage 2+2+2 with ISO image:

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190523T013000Z"

It was also not seen while testing the following CEPH scenarios:

 https://github.com/xe1gyq/starlingx/blob/master/validation/STOR_CORE_014.md
 https://github.com/xe1gyq/starlingx/blob/master/validation/STOR_CORE_015.md
 https://github.com/xe1gyq/starlingx/blob/master/validation/STOR_CORE_016.md

And it was not seen while retesting this bug:

  https://bugs.launchpad.net/starlingx/+bug/1797187

Several unlocks were executed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Abraham, what vswitch type are you using in your testing? This was only seen when using ovs-dpdk. The default configuration is ovs.

Revision history for this message
Abraham Arce (xe1gyq) wrote :

Ghada, I deployed according to the instructions from our current official Wiki Containers Dedicated Storage:
  https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandardStorage

I executed the section "Configure the vswitch type (optional)" only through step 1; step 2 did not complete because I got confused by the reference to controller-0. For the full log of the deployment, see the same section "Configure the vswitch type (optional)", where my error is reflected, under:
  https://github.com/xe1gyq/starlingx/blob/master/deployment/baremetal/dedicatedstorage.md

Let me know if the following steps are a valid use case for configuring compute-0 even though controller-0 has already been unlocked:

  $ system host-lock compute-0
  $ system host-cpu-modify -f vswitch -p0 1 compute-0
  $ system host-unlock compute-0

If the above is not a valid use case, should it be retested with a new deployment?
I will take this learning and reflect the changes on our docs.starlingx.io site.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Abraham, I guess I got confused as to why you are updating this bug report. Is there a reason you are trying to reproduce this issue? As already noted, it is not 100% reproducible on all systems, so the fact that you cannot reproduce it doesn't mean there is no issue. The issue is also specific to ovs-dpdk deployments (hence my first comment to you, as it wasn't clear which vswitch backend you are using).

Anyway, to answer your question, yes you can change the cpu assignment for vswitch using a lock/unlock.

Revision history for this message
Abraham Arce (xe1gyq) wrote :

Thanks for the answer, Ghada. I thought it was a good idea to retest (based on tag and date of submission) some of the open bugs with a couple of recently deployed dedicated storage systems. Let me know your preferred way for me to help with this bug-fixing activity :)

Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Based on your previous log, OVS-DPDK can run correctly once hugepages are available. So the problem is that hugepages are not allocated at boot time.

For now, StarlingX allocates 1G hugepages for each NUMA node at boot time. But if no contiguous 1G areas of memory exist, the 1G hugepage allocation will fail. We need help from a memory expert to figure out why there sometimes isn't enough contiguous memory for allocating hugepages in StarlingX.
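
A quick way to check whether enough contiguous memory is still available at runtime is to request the pages and compare how many the kernel actually granted (a diagnostic sketch only, not part of any fix; the node path and count are illustrative):
   #!/bin/bash
   # Run as root. Request 'want' 1G hugepages on node0 and report any shortfall.
   node=/sys/devices/system/node/node0/hugepages/hugepages-1048576kB
   want=3
   echo $want > $node/nr_hugepages
   got=$(cat $node/nr_hugepages)
   if [ "$got" -lt "$want" ]; then
      echo "only $got of $want 1G pages allocated: memory likely too fragmented"
   else
      echo "all $want 1G pages allocated"
   fi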

Revision history for this message
ChenjieXu (midone) wrote :

Hi Abraham,

Thank you for your testing. Based on your steps, you are using OVS-DPDK.

OVS running in a container is used by default. You can change to OVS-DPDK with the following commands:
   system modify --vswitch_type ovs-dpdk
   system host-cpu-modify -f vswitch -p0 1 controller-0
To change back to OVS running in a container:
   system modify --vswitch_type none

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Chenjie, so the next step for this bug is to have Tao Liu or Austin Sun look at it, as they are familiar with huge page allocation on StarlingX. Is that correct?

Revision history for this message
ChenjieXu (midone) wrote :

Hi Ghada,

Yes, it's correct.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Austin, please investigate this.

Changed in starlingx:
status: Incomplete → Triaged
assignee: ChenjieXu (midone) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

From compute-3's /etc/platform/worker_reserved.conf, the worker did not reserve any vswitch hugepages. What is the result of "system host-memory-list compute-3"? And could you share /opt/platform/puppet/19.01/hieradata/ from the active controller?

Revision history for this message
ChenjieXu (midone) wrote :

Hi Austin,

The following bug is the same bug as this one:
https://bugs.launchpad.net/starlingx/+bug/1829390

And the reporter has attached the result of "system host-memory-list". Hope this can be useful.

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks, ChenJie.
@Peng Peng, could you help collect the output of "/sys/devices/system/node/node0/meminfo" and "/sys/devices/system/node/node1/meminfo" for a compute that hits this issue? I think this info is not included in the collect output.
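
For reference, the hugepage-related lines of those files can be pulled in one command (a sketch; the full meminfo output is still preferred):
   grep -H -i huge /sys/devices/system/node/node*/meminfo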

Austin Sun (sunausti)
Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Bill Zvonar (billzvonar) wrote :

Hi Austin - do you still consider this LP incomplete? Are you still waiting on ChenJie for the info requested above?

Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Bill:
I'm waiting for Peng Peng to reproduce this issue and provide more info, as discussed offline.

Ghada Khalil (gkhalil)
summary: - compute node keeps offline after unlock due to vswitch error
+ compute node keeps offline after unlock due to vswitch error, caused by
+ a hugepage allocation failure
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This issue is related to huge pages. It's not really a networking issue. Removing the stx.networking tag and adding stx.config instead. This is already being investigated by Austin.

tags: added: stx.config
removed: stx.networking
Revision history for this message
Chris Winnicki (chriswinnicki) wrote :

The same failure was seen again in another Wind River lab (yow-cgcs-wildcat-99-103):

Info below as requested by Austin (comment #35):
(A fresh set of logs is attached, as generated by collect all.)
(In this system, compute-0 experienced the issue described above.)

Contents of /opt/platform/puppet/19.01/hieradata/* attached as opt_platform_puppet_19.01_hieradata.tar.gz

[sysadmin@controller-0 ~(keystone_admin)]$ system host-memory-list compute-0
(table columns unwrapped for readability)
processor 0: mem_total=5390 MiB, mem_platform=8000 MiB, mem_avail=5390 MiB, hp_configured=True, vs_hp_size=1024 MiB, vs_hp_total=0, vs_hp_avail=0, vs_hp_reqd=None, app_total_4K=1379840, app_hp_total_2M=0, app_hp_avail_2M=0, app_hp_pending_2M=None, app_hp_total_1G=0, app_hp_avail_1G=0, app_hp_pending_1G=None, app_hp_use_1G=True
processor 1: mem_total=6023 MiB, mem_platform=2000 MiB, mem_avail=6023 MiB, hp_configured=True, vs_hp_size=1024 MiB, vs_hp_total=0, vs_hp_avail=0, vs_hp_reqd=None, app_total_4K=1541888, app_hp_total_2M=0, app_hp_avail_2M=0, app_hp_pending_2M=None, app_hp_total_1G=0, app_hp_avail_1G=0, app_hp_pending_1G=None, app_hp_use_1G=True

[sysadmin@controller-0 ~(keystone_admin)]$ ssh compute-0 cat /etc/platform/worker_reserved.conf
sysadmin@compute-0's password:
################################################################################
# Copyright (c) 2018 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
# - This file is managed by Puppet. DO NOT EDIT.
################################################################################
# WORKER Node configuration parameters for reserved memory and physical cores
# used by Base software and VSWITCH. These are resources that libvirt cannot use.
#

################################################################################
#
# List of logical CPU instances available in the system. This value is used
# for auditing purposes so that the current configuration can be checked for
# validity against the actual number of logical CPU instances in the system.
#
################################################################################
WORKER_CPU_LIST="0-27"

################################################################################
#
# List of logical CPU instances that reserved for platform applications.
#
###################...


Revision history for this message
Chris Winnicki (chriswinnicki) wrote :
Revision history for this message
Chris Winnicki (chriswinnicki) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Austin Sun (sunausti) wrote :

compute-0
(DB columns unwrapped for readability)
uuid=7c8406ea-f7e3-4c30-a40f-0d222dd9aa5d: memtotal_mib=5390, memavail_mib=5390, platform_reserved_mib=8000, hugepages_configured=t, vswitch_hugepages_size_mib=1024, vswitch_hugepages_reqd=\N, vswitch_hugepages_nr=0, vswitch_hugepages_avail=0, capabilities=\N, forihostid=2, forinodeid=3
uuid=81760ce4-f59b-4b32-a547-bfe619678233: memtotal_mib=6023, memavail_mib=6023, platform_reserved_mib=2000, hugepages_configured=t, vswitch_hugepages_size_mib=1024, vswitch_hugepages_reqd=\N, vswitch_hugepages_nr=0, vswitch_hugepages_avail=0, capabilities=\N, forihostid=2, forinodeid=4

The memtotal and memavail values are wrong; we need to check where the error comes from.

Revision history for this message
Austin Sun (sunausti) wrote :

from controller-0 sysinv.log.1
2019-07-19 20:50:01.036 103738 INFO sysinv.api.controllers.v1.host [-] Memory: Total=62908 MiB, Allocated=8000 MiB, 2M: 31454 pages None pages pending, 1G: 61 pages None pages pending
2019-07-19 20:50:01.064 103738 INFO sysinv.api.controllers.v1.host [-] Memory: Total=63239 MiB, Allocated=2000 MiB, 2M: 31619 pages None pages pending, 1G: 61 pages None pages pending
2019-07-19 20:50:01.085 103738 INFO sysinv.api.controllers.v1.host [-] host(compute-0) node(3): vm_mem_mib=53884,vm_mem_mib_possible (from agent) = 62908
2019-07-19 20:50:01.085 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(3): {'vm_hugepages_nr_4K': 1379840, 'vm_hugepages_nr_2M': 24247, 'vswitch_hugepages_nr': 1}
2019-07-19 20:50:01.187 103738 INFO sysinv.api.controllers.v1.host [-] host(compute-0) node(4): vm_mem_mib=60215,vm_mem_mib_possible (from agent) = 63238
2019-07-19 20:50:01.188 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(4): {'vm_hugepages_nr_4K': 1541888, 'vm_hugepages_nr_2M': 27096, 'vswitch_hugepages_nr': 1}

So at least at the time the unlock action was performed, the DB was correct: vswitch_hugepages_nr was 1.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Chris,
Thank you for reproducing this issue. It seems the first unlock of compute-0 was on July 19, but the puppet log was discarded, so it is hard to find any clue about that first unlock. If your environment is still available, could you delete compute-0 and re-install it? If the issue is still there, please collect the compute-0 logs again.

Thanks.

Revision history for this message
Tao Liu (tliu88) wrote :

Hi Austin,

I took a look at yow-cgcs-wildcat-99-103 yesterday. The database and hiera data were not updated on either compute-0 or compute-2, although there were logs from 2019-07-19 showing that both the database and the hiera data had been updated. There are two possible sources of this failure:
One, the database update failed without error logs, resulting in the hiera data not being updated. Two, the sysinv-agent reported the default memory inventory prior to the hiera data update, resetting the vswitch huge pages to 0 in the database.

To recover, I locked/unlocked compute-0 to re-populate the vswitch huge pages, and it was successful.

# compute-2
2019-07-19 20:49:47.430 103746 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-2) node(5): {'vm_hugepages_nr_4K': 1379328, 'vm_hugepages_nr_2M': 24246, 'vswitch_hugepages_nr': 1}
2019-07-19 20:49:47.459 103738 INFO sysinv.api.controllers.v1.host [-] compute-1 1. delta_handle ['uptime', 'task']
2019-07-19 20:49:47.530 103746 INFO sysinv.api.controllers.v1.host [-] host(compute-2) node(6): vm_mem_mib=60218,vm_mem_mib_possible (from agent) = 63242
2019-07-19 20:49:47.530 103746 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-2) node(6): {'vm_hugepages_nr_4K': 1541632, 'vm_hugepages_nr_2M': 27098, 'vswitch_hugepages_nr': 1}
2019-07-19 20:49:50.612 102604 INFO sysinv.puppet.puppet [req-df405ec9-1017-4b2d-861a-e4790ad819d6 admin admin] Updating hiera for host: compute-2 with config_uuid: None

# compute-0
2019-07-19 20:50:01.085 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(3): {'vm_hugepages_nr_4K': 1379840, 'vm_hugepages_nr_2M': 24247, 'vswitch_hugepages_nr': 1}
2019-07-19 20:50:01.187 103738 INFO sysinv.api.controllers.v1.host [-] host(compute-0) node(4): vm_mem_mib=60215,vm_mem_mib_possible (from agent) = 63238
2019-07-19 20:50:01.188 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(4): {'vm_hugepages_nr_4K': 1541888, 'vm_hugepages_nr_2M': 27096, 'vswitch_hugepages_nr': 1}
2019-07-19 20:50:05.532 102604 INFO sysinv.puppet.puppet [req-3de697a6-37d1-4c78-bc0f-6005f67b27b9 admin admin] Updating hiera for host: compute-0 with config_uuid: None

Revision history for this message
Tao Liu (tliu88) wrote :

I added additional logging to a designer load and re-installed it on yow-cgcs-wildcat-99-103 using the deployment manager. The problem was not present.

It looks like a timing issue in the previous install (the second possible source above). The sysinv-agent memory report came in after the database was updated but before the hiera data was updated. At that point, before the first unlock, the 'vswitch_hugepages_nr', 'vm_hugepages_nr_2M' and 'vm_hugepages_nr_1G' fields in the agent report were set to the default 0, which reset the vswitch huge pages to 0 in the database. The hiera data was then set to 0 by reading from the database.

I think we can add a check on the host state: if the host is not provisioned, the conductor could ignore those fields in the agent memory report.

Another option is to make a change in the sysinv agent so that it does not include 'vswitch_hugepages_nr', 'vm_hugepages_nr_2M' and 'vm_hugepages_use_1G' in the initial report prior to the first unlock.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672634

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Austin Sun (sunausti) wrote :

Hi Tao, thanks a lot for your help investigating.
This time the issue was not the same as the one Peng found on 2019-05-09, where the DB data was correct. But I think your analysis is probably right.
But I think your analysis was perhaps right.

from controller-0 sysinv.log.1
compute-1:
2019-07-19 20:49:38.371 compute-1 ihost check_unlock_worker
2019-07-19 20:49:42.437 Updating hiera for host: compute-1 with config_uuid
The compute-1 agent report is scheduled roughly every minute, around 20:39:27.891; it did not fall between 38s and 42s.

compute-0:
2019-07-19 20:50:00.666 compute-0 ihost check_unlock_worker
2019-07-19 20:50:05.532 Updating hiera for host: compute-0 with config_uuid: None
The compute-0 agent report is scheduled roughly every minute, around 20:37:00.718; it fell between 00s and 05s.

compute-2:
2019-07-19 20:49:47.163 compute-2 ihost check_unlock_worker
2019-07-19 20:49:50.612 Updating hiera for host: compute-2 with config_uuid: None
The compute-2 agent report is scheduled roughly every minute, around 20:44:47.016, so it fell between 47s and 50s.
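
The two events being correlated here can be pulled from the controller in one pass (a sketch; the log path follows the usual StarlingX layout and may differ):
   grep -E "check_unlock_worker|Updating hiera for host" /var/log/sysinv.log*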

I made a change to ignore the agent memory report while unlocking; you can review it.
Thanks.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue reproduced on
Lab: WCP_63_66
Load: 20190726T013000Z

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+

More logs collected.

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Seeing the issue on another lab (HW PV-0): after a clean install, compute-1 and compute-2 did not unlock.

BUILD_ID="20190724T013000Z"

see logs below:

Compute-1
Ovs-vswitchd.log
2019-07-26T15:21:37.256Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-26T15:21:37.278Z|00002|ovs_numa|INFO|Discovered 20 CPU cores on NUMA node 0
2019-07-26T15:21:37.278Z|00003|ovs_numa|INFO|Discovered 20 CPU cores on NUMA node 1
2019-07-26T15:21:37.278Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 40 CPU cores
2019-07-26T15:21:37.278Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2019-07-26T15:21:37.278Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2019-07-26T15:21:37.279Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2019-07-26T15:21:37.280Z|00008|dpif_netlink|INFO|The kernel module does not support meters.
2019-07-26T15:21:37.286Z|00009|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-26T15:21:37.587Z|00010|dpdk|INFO|Using DPDK 18.11.0
2019-07-26T15:21:37.587Z|00011|dpdk|INFO|DPDK Enabled - initializing...
2019-07-26T15:21:37.587Z|00012|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2019-07-26T15:21:37.587Z|00013|dpdk|INFO|IOMMU support for vhost-user-client disabled.
2019-07-26T15:21:37.587Z|00014|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-26T15:21:37.587Z|00015|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 7 --huge-dir /mnt/huge-1048576kB --socket-mem 0,0 --socket-limit 0,0.
2019-07-26T15:21:37.591Z|00016|dpdk|INFO|EAL: Detected 40 lcore(s)
2019-07-26T15:21:37.591Z|00017|dpdk|INFO|EAL: Detected 2 NUMA nodes
2019-07-26T15:21:37.591Z|00018|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-07-26T15:21:37.591Z|00019|dpdk|ERR|EAL: Invalid 'command line' arguments.
2019-07-26T15:21:37.591Z|00020|dpdk|EMER|Unable to initialize DPDK: Invalid argument

Compute-2
2019-07-26T15:19:41.329Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-26T15:19:41.340Z|00002|ovs_numa|INFO|Discovered 28 CPU cores on NUMA node 0
2019-07-26T15:19:41.340Z|00003|ovs_numa|INFO|Discovered 28 CPU cores on NUMA node 1
2019-07-26T15:19:41.340Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 56 CPU cores
2019-07-26T15:19:41.340Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2019-07-26T15:19:41.340Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2019-07-26T15:19:41.341Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2019-07-26T15:19:41.342Z|00008|dpif_netlink|INFO|The kernel module does not support meters.
2019-07-26T15:19:41.346Z|00009|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-26T15:19:41.689Z|00010|dpdk|INFO|Using DPDK 18.11.0
2019-07-26T15:19:41.689Z|00011|dpdk|INFO|DPDK Enabled - initializing...
2019-07-26T15:19:41.689Z|00012|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2019-07-26T15:19:41.689Z|00013|dpdk|INFO|IOMMU support for vhost-user-client disabled.
2019-07-26T15:19:41.689Z|00014|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-26T15:19:41.689Z|00015|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 7 --huge...


Revision history for this message
Ghada Khalil (gkhalil) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672634
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=85e34657c59df9b2f18c694bf6c7ba187e8f062a
Submitter: Zuul
Branch: master

commit 85e34657c59df9b2f18c694bf6c7ba187e8f062a
Author: Sun Austin <email address hidden>
Date: Thu Jul 25 14:39:38 2019 +0800

    Avoid agent mem update during unlocking

    sysinv-agent might report memory after unlocking action was performing,
    The DB was updated, but hiera data has not been updated. During this
    time, memory report from agent will set ‘vswitch_hugepages_nr’,
    ‘vm_hugepages_nr_2M’ and ‘vm_hugepages_nr_1G’ values to default 0.

    adding protect to ignore agent mem report during unlocking (host is
    locked state and ihost_action is 'unlock' or 'force-unlock')

    Closes-Bug: 1829403
    Signed-off-by: Sun Austin <email address hidden>

    Change-Id: I3438809782560e90248a3e63e51aa0315fcf49d3

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced recently.

tags: removed: stx.retestneeded