compute node keeps offline after unlock due to vswitch error, caused by a hugepage allocation failure

Bug #1829403 reported by Peng Peng
This bug affects 5 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Austin Sun

Bug Description

Brief Description
-----------------
During initial lab commissioning, after unlocking the compute nodes, one of the compute nodes stays "offline".

Severity
--------
Major

Steps to Reproduce
------------------
- Commission lab
- unlock worker/compute nodes

Expected Behavior
------------------
- All compute nodes are enabled without issues

Actual Behavior
----------------
- One compute node remains offline

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190515T220331Z

Last Pass
---------
2019-05-09_16-05-20

Timestamp/Logs
--------------
[2019-05-16 08:01:40,938] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-list --nowrap'
[2019-05-16 08:01:42,478] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | degraded |
| 5 | compute-0 | worker | locked | disabled | online |
| 6 | compute-1 | worker | locked | disabled | online |
| 7 | compute-2 | worker | locked | disabled | online |
| 8 | compute-3 | worker | locked | disabled | online |
| 9 | compute-4 | worker | locked | disabled | online |
+----+--------------+-------------+----------------+-------------+--------------+

[2019-05-16 08:02:30,918] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-3'

[2019-05-16 08:25:49,329] 262 DEBUG MainThread ssh.send :: Send 'system --os-endpoint-type internalURL --os-region-name RegionOne host-list --nowrap'
[2019-05-16 08:25:50,848] 387 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | disabled | offline |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

Test Activity
-------------
lab setup

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

daemon.log (compute-3)

2019-05-16T08:06:47.366 compute-3 systemd[1]: info Starting Open vSwitch...
2019-05-16T08:06:47.377 compute-3 systemd[1]: info Started Open vSwitch.
2019-05-16T08:06:47.395 compute-3 systemd[1]: info Reloading.
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-hugepage-dir=/mnt/huge-1048576kB
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:pmd-cpu-mask=6
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem=0,0
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-extra=-n 4"
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-lcore-mask=7
2019-05-16T08:06:47.000 compute-3 ovs-vsctl: notice ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
2019-05-16T08:06:47.000 compute-3 ovs-vswitchd: err ovs|00018|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-05-16T08:06:47.000 compute-3 ovs-vswitchd: err ovs|00019|dpdk|ERR|EAL: Invalid 'command line' arguments.
2019-05-16T08:06:47.000 compute-3 ovs-vswitchd: alert ovs|00020|dpdk|EMER|Unable to initialize DPDK: Invalid argument
2019-05-16T08:06:48.380 compute-3 ovs-ctl[37096]: info 2019-05-16T08:06:48Z|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.36875.ctl
2019-05-16T08:06:48.000 compute-3 ovs-appctl: warning ovs|00001|unixctl|WARN|failed to connect to /var/run/openvswitch/ovs-vswitchd.36875.ctl
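
For triage on a similar failure, the DPDK settings behind those EAL arguments can be read back from the OVS database (assuming ovsdb-server is still running):

   sudo ovs-vsctl get Open_vSwitch . other_config

This should echo the values set above, e.g. dpdk-socket-mem="0,0", showing that the invalid socket memory comes from configuration rather than from DPDK itself.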

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

ovs-vswitchd.log
compute-3:/var/log/openvswitch$ cat ovs-vswitchd.log | grep ERR
2019-05-16T08:06:47.595Z|00018|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-05-16T08:06:47.595Z|00019|dpdk|ERR|EAL: Invalid 'command line' arguments.

Ghada Khalil (gkhalil)
tags: added: stx.networking
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote : Re: compute node keeps offline after unlock due to vswitch error

@Peng, It says the issue is reproducible. How many times was this issue seen? Is it always on the same hardware node or different hardware nodes?

summary: - compute node keeps offline after unlock
+ compute node keeps offline after unlock due to vswitch error
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The same issue is also reported in https://bugs.launchpad.net/starlingx/+bug/1829390 which indicates a 30-50% frequency. @Peng, please clarify the frequency you are seeing.

Revision history for this message
Peng Peng (ppeng) wrote :

I saw it twice recently

Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → ChenjieXu (midone)
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Is this bug always on the same machine or different machines?

A similar issue (https://bugs.launchpad.net/starlingx/+bug/1829390) is caused by hugepages not being allocated. Could you please run the following commands on compute-3 to check the hugepages and attach the outputs?
   sudo find /sys -name "nr_huge*"
   sudo find /sys -name "nr_huge*" | xargs -L1 grep -E "^"
   sudo find /sys -name "free_hugepages*"
   sudo find /sys -name "free_hugepages" | xargs -L1 grep -E "^"
   mount | grep -i huge
   cat /proc/cmdline
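
A compact equivalent that prints each counter next to its path (a one-liner sketch relying on bash brace expansion):
   sudo grep -H . /sys/devices/system/node/node*/hugepages/hugepages-*/{nr,free}_hugepages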

If hugepages are not allocated, could you please try to allocate them and then restart ovs-vswitchd, as I described in comment #2 of https://bugs.launchpad.net/starlingx/+bug/1829390?

Revision history for this message
Peng Peng (ppeng) wrote :

Both occurrences happened on the same lab.

Changed in starlingx:
status: New → Incomplete
Revision history for this message
Peng Peng (ppeng) wrote :

This time it happened on a different node.

See the command output below; the OVS log is attached.

[wrsroot@controller-0 tmp(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | enabled | available |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | disabled | offline |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@controller-0 tmp(keystone_admin)]$

on compute-2:
compute-2:~$ sudo find /sys -name "free_hugepages*"
Password:
/sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
/sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
/sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
/sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
/sys/kernel/mm/hugepages/hugepages-1048576kB/free_hugepages
/sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages
compute-2:~$ sudo find /sys -name "free_hugepages" | xargs -L1 grep -E "^"
0
0
0
0
0
0
compute-2:~$ mount | grep -i huge
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
none on /dev/huge-1048576kB type hugetlbfs (rw,relatime,pagesize=1048576kB)
none on /mnt/huge-2048kB type hugetlbfs (rw,relatime,pagesize=2048kB)
none on /dev/huge-2048kB type hugetlbfs (rw,relatime,pagesize=2048kB)
none on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M)
none on /mnt/huge-1048576kB type hugetlbfs (rw,relatime,pagesize=1048576kB)
compute-2:~$ sudo echo 3 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
compute-2:~$ sudo echo 3 > sudo tee /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
compute-2:~$ sudo echo 5000 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
compute-2:~$ sudo echo 5000 > sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
compute-2:~$ sudo cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
0
compute-2:~$ sudo cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
0
compute-2:~$ sudo cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
0
compute-2:~$ sudo cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
0
compute-2:~$ systemctl s...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Thank you for your response! Based on your logs, this bug is the same as issue https://bugs.launchpad.net/starlingx/+bug/1829390: there are no hugepages available, and OVS-DPDK needs hugepages to start.

Last time I asked you to allocate hugepages manually, but the commands executed were wrong. Hugepages should be allocated with the following command:
   sudo echo 3 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
not with the following command:
   sudo echo 3 > sudo tee /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
"nr_hugepages" is used to allocate hugepages; "free_hugepages" is used to check how many hugepages are available.

Please execute the following commands (some have been updated) to allocate hugepages manually and then restart ovs-vswitchd; a consolidated script follows the steps below:
1. allocate hugepages on each numa node:
   sudo bash
   echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
2. make sure hugepages have been allocated by checking nr_hugepages and free_hugepages:
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
3. Make sure ovsdb-server is running
   systemctl status ovsdb-server
4. restart ovs-vswitchd
   systemctl status ovs-vswitchd
   sudo systemctl restart ovs-vswitchd
   systemctl status ovs-vswitchd
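
The same four steps can be wrapped into one script (a sketch assuming the two-NUMA-node layout above; the page counts are the values requested in this thread, not universal defaults):
   #!/bin/bash
   # Run as root (sudo bash): allocate hugepages on both NUMA nodes,
   # verify the counts, then restart ovs-vswitchd.
   set -e
   for node in node0 node1; do
      echo 3    > /sys/devices/system/node/$node/hugepages/hugepages-1048576kB/nr_hugepages
      echo 5000 > /sys/devices/system/node/$node/hugepages/hugepages-2048kB/nr_hugepages
   done
   # nr_hugepages shows what was actually allocated; free_hugepages shows what is unused.
   grep -H . /sys/devices/system/node/node*/hugepages/hugepages-*/{nr,free}_hugepages
   # ovsdb-server must be active before ovs-vswitchd is restarted (set -e aborts otherwise).
   systemctl is-active ovsdb-server
   systemctl restart ovs-vswitchd
   systemctl status ovs-vswitchd --no-pager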

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue with memory allocation for ovs-dpdk. High priority as the issue has been seen multiple times.

Changed in starlingx:
importance: Undecided → High
tags: added: stx.2.0
Revision history for this message
Peng Peng (ppeng) wrote :

WCP_113-121: computes 3 and 4 went offline again.

on compute-3:
compute-3:~$ sudo bash
Password:
compute-3:/home/wrsroot# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
compute-3:/home/wrsroot# echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
compute-3:/home/wrsroot# echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
compute-3:/home/wrsroot# echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
3
compute-3:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
5000
compute-3:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
5000
compute-3:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2019-05-27 13:40:07 UTC; 3h 43min ago
  Process: 36858 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 37349 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 36637 (code=dumped, signal=ABRT)
compute-3:/home/wrsroot# sudo systemctl restart ovs-vswitchd
Job for ovs-vswitchd.service failed because the control process exited with error code. See "systemctl status ovs-vswitchd.service" and "journalctl -xe" for details.
compute-3:/home/wrsroot# systemctl status ovs-vswitchd.service
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2019-05-27 17:23:41 UTC; 27s ago
  Process: 36858 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
  Process: 175968 ExecStart=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server --no-monitor --system-id=random ${OVSUSER} start $OPTIONS (code=exited, status=1/FAILURE)
 Main PID: 36637 (code=dumped, signal=ABRT)
compute-3:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: failed (Result: start-limit) since Mon 2019-05-27 17:23:41 UTC; 5min ago
  Process: 3685...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,
According to your response, the hugepages can be allocated manually, but ovs-vswitchd still fails to start. The following lines in "ovs-vswitchd.log" explain why ovs-vswitchd can't start:

2019-05-27T17:23:40.222Z|00012|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 7 --huge-dir /mnt/huge-1048576kB --socket-mem 0,0 --socket-limit 0,0.
2019-05-27T17:23:40.223Z|00013|dpdk|INFO|EAL: Detected 88 lcore(s)
2019-05-27T17:23:40.223Z|00014|dpdk|INFO|EAL: Detected 2 NUMA nodes
2019-05-27T17:23:40.223Z|00015|dpdk|ERR|EAL: invalid parameters for --socket-mem

The socket memory has been set to 0 for both NUMA nodes, but the value should be 1024,1024. You can set it with the following commands:
sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-limit="1024,1024"

Could you please change the socket memory and then restart ovs-vswitchd again?
1. allocate hugepages on each numa node:
   sudo bash
   echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
2. make sure hugepages have been allocated by checking nr_hugepages and free_hugepages:
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
   cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
   cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
3. Make sure ovsdb-server is running
   systemctl status ovsdb-server
4. change the socket memory from 0 to 1024MB for 2 numa nodes
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-limit="1024,1024"
5. restart ovs-vswitchd
   systemctl status ovs-vswitchd
   sudo systemctl restart ovs-vswitchd
   systemctl status ovs-vswitchd

Revision history for this message
Peng Peng (ppeng) wrote :

[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | storage-0 | storage | unlocked | enabled | available |
| 4 | storage-1 | storage | unlocked | enabled | available |
| 5 | compute-0 | worker | unlocked | disabled | offline |
| 6 | compute-1 | worker | unlocked | enabled | available |
| 7 | compute-2 | worker | unlocked | enabled | available |
| 8 | compute-3 | worker | unlocked | enabled | available |
| 9 | compute-4 | worker | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

compute-0:~$ sudo bash
Password:
compute-0:/home/wrsroot# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
compute-0:/home/wrsroot# echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
compute-0:/home/wrsroot# echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
compute-0:/home/wrsroot# echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
3
compute-0:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
5000
compute-0:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
5000
compute-0:/home/wrsroot# systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: disabled)
   Active: active (running) since Tue 2019-05-28 13:32:52 UTC; 22min ago
 Main PID: 32828 (ovsdb-server)
    Tasks: 1
   Memory: 7.8M
   CGroup: /system.slice/ovsdb-server.service
           └─32828 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=d...
compute-0:/home/wrsroot# sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
> "
compute-0:/home/wrsroot#...


Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Could you please attach the logs "ovs-vswitchd.log" and "ovsdb-server.log"?

Could you please also execute the following commands and attach the outputs:
   cd /tmp
   grep -rn "vswitch::dpdk::socket_mem"
   cd /opt
   grep -rn "vswitch::dpdk::socket_mem"

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

The following commands for step 4 are wrong:
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-mem="1024,1024"
   sudo ovs-vsctl --no-wait set Open_vSwitch . "other_config:dpdk-socket-limit="1024,1024"
The commands should be:
   sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
   sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="1024,1024"
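
Before restarting ovs-vswitchd, the values can be read back to confirm they took effect (assuming ovsdb-server is running):
   sudo ovs-vsctl get Open_vSwitch . other_config:dpdk-socket-mem
   sudo ovs-vsctl get Open_vSwitch . other_config:dpdk-socket-limit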

Sorry for typing the wrong commands! Could you please try again?

Revision history for this message
Peng Peng (ppeng) wrote :

compute-2:~$ sudo bash
Password:
compute-2:/home/wrsroot# echo 3 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
compute-2:/home/wrsroot# echo 3 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
compute-2:/home/wrsroot# echo 5000 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
compute-2:/home/wrsroot# echo 5000 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
5000
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/free_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/free_hugepages
3
compute-2:/home/wrsroot# cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages
5000
compute-2:/home/wrsroot# cat /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
5000
compute-2:/home/wrsroot# systemctl status ovsdb-server
● ovsdb-server.service - Open vSwitch Database Unit
   Loaded: loaded (/usr/lib/systemd/system/ovsdb-server.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2019-06-06 07:43:09 UTC; 6h ago
 Main PID: 33097 (ovsdb-server)
    Tasks: 1
   Memory: 7.8M
   CGroup: /system.slice/ovsdb-server.service
           └─33097 ovsdb-server /etc/openvswitch/conf.db -vconsole:emer -vsyslog:err -vfile:info --remote=punix:/var/run/openvswitch/db.sock --private-key=db:Open_vSwitch,SSL,private_key --certificate=d...
compute-2:/home/wrsroot# sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-mem="1024,1024"
compute-2:/home/wrsroot# sudo ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-socket-limit="1024,1024"
compute-2:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: inactive (dead) since Thu 2019-06-06 07:43:47 UTC; 6h ago
  Process: 37193 ExecStop=/usr/share/openvswitch/scripts/ovs-ctl --no-ovsdb-server stop (code=exited, status=0/SUCCESS)
compute-2:/home/wrsroot# sudo systemctl restart ovs-vswitchd
compute-2:/home/wrsroot# systemctl status ovs-vswitchd
● ovs-vswitchd.service - Open vSwitch Forwarding Unit
   Loaded: loaded (/usr/lib/systemd/system/ovs-vswitchd.service; static; vendor preset: disabled)
   Active: active (running) since Thu 2019-06-06 13:52:25 UTC; 4s ago
    Tasks: 99
   Memory: 811.7M
   CGroup: /system.slice/ovs-vswitchd.service
           └─275814 ovs-vswitchd unix:/var/run/openvswitch/db.sock -vconsole:emer -vsyslog:err -vfile:info --mlockall --no-chdir --log-file=/var/log/openvswitch/ovs-vswitchd.log --pidfile=/var/run/openv...
compute-2:/home/wrsr...


Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Abraham Arce (xe1gyq) wrote :

This "Compute Node Keeps Offline After Unlock Due To Vswitch Error" issue has not been seen while deploying a Bare Metal Dedicated Storage 2+2+2 with ISO image:

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20190523T013000Z"

It was also not seen while testing the following CEPH scenarios:

 https://github.com/xe1gyq/starlingx/blob/master/validation/STOR_CORE_014.md
 https://github.com/xe1gyq/starlingx/blob/master/validation/STOR_CORE_015.md
 https://github.com/xe1gyq/starlingx/blob/master/validation/STOR_CORE_016.md

And it was not seen while retesting this bug:

  https://bugs.launchpad.net/starlingx/+bug/1797187

Several unlocks were executed.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Abraham, what vswitch type are you using in your testing? This was only seen when using ovs-dpdk. The default configuration is ovs.

Revision history for this message
Abraham Arce (xe1gyq) wrote :

Ghada, I deployed according to the instructions from our current official Wiki Containers Dedicated Storage:
  https://wiki.openstack.org/wiki/StarlingX/Containers/InstallationOnStandardStorage

I executed the section "Configure the vswitch type (optional)" only through step 1; step 2 did not complete because I got confused by the reference to controller-0. For the full log of the deployment, see the same section "Configure the vswitch type (optional)", where my error is reflected, under:
  https://github.com/xe1gyq/starlingx/blob/master/deployment/baremetal/dedicatedstorage.md

Let me know if the following steps are a valid use case for configuring compute-0 even though controller-0 has already been unlocked:

  $ system host-lock compute-0
  $ system host-cpu-modify -f vswitch -p0 1 compute-0
  $ system host-unlock compute-0

If the above is not a valid use case, should it be retested with a new deployment?
I will take this learning and reflect the changes on our docs.starlingx.io site.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Abraham, I guess I got confused as to why you are updating this bug report. Is there a reason you are trying to reproduce this issue? As already noted, it is not 100% reproducible on all systems, so the fact that you cannot reproduce it doesn't mean there is no issue. The issue is also specific to ovs-dpdk deployments (hence my first comment to you, as it wasn't clear which vswitch backend you are using).

Anyway, to answer your question, yes you can change the cpu assignment for vswitch using a lock/unlock.

Revision history for this message
Abraham Arce (xe1gyq) wrote :

Thanks for the answer, Ghada. I thought it was a good idea to retest (based on tag and date of submission) some of the open bugs with a couple of recently deployed dedicated storage systems. Let me know your preferred way for me to help with this bug-fixing activity :)

Revision history for this message
ChenjieXu (midone) wrote :

Hi Peng,

Based on your previous log, OVS-DPDK can run correctly once hugepages are available. So the problem is that hugepages are not allocated at boot time.

For now, StarlingX allocates 1G hugepages for each NUMA node at boot time. But if no contiguous 1G areas of memory exist, the 1G hugepage allocation will fail. We need help from a memory expert to figure out why there sometimes isn't enough contiguous memory for allocating hugepages in StarlingX.
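
A quick way to check whether enough contiguous memory is still available at runtime is to request the pages and compare how many the kernel actually granted (a diagnostic sketch only, not part of any fix; the node path and count are illustrative):
   #!/bin/bash
   # Run as root. Request 'want' 1G hugepages on node0 and report any shortfall.
   node=/sys/devices/system/node/node0/hugepages/hugepages-1048576kB
   want=3
   echo $want > $node/nr_hugepages
   got=$(cat $node/nr_hugepages)
   if [ "$got" -lt "$want" ]; then
      echo "only $got of $want 1G pages allocated: memory likely too fragmented"
   else
      echo "all $want 1G pages allocated"
   fi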

Revision history for this message
ChenjieXu (midone) wrote :

Hi Abraham,

Thank you for your testing. Based on your steps, you are using OVS-DPDK.

OVS running in a container is used by default. You can change to OVS-DPDK with the following commands:
   system modify --vswitch_type ovs-dpdk
   system host-cpu-modify -f vswitch -p0 1 controller-0
To change back to OVS running in a container:
   system modify --vswitch_type none

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Chenjie, so the next step for this bug is to have Tao Liu or Austin Sun look at it, as they are familiar with huge page allocation on StarlingX. Is that correct?

Revision history for this message
ChenjieXu (midone) wrote :

Hi Ghada,

Yes, it's correct.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Austin, please investigate this.

Changed in starlingx:
status: Incomplete → Triaged
assignee: ChenjieXu (midone) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

From compute-3's /etc/platform/worker_reserved.conf, the worker did not reserve any vswitch hugepages. What is the result of "system host-memory-list compute-3"? And could you share /opt/platform/puppet/19.01/hieradata/ from the active controller?

Revision history for this message
ChenjieXu (midone) wrote :

Hi Austin,

The following bug is the same bug as this one:
https://bugs.launchpad.net/starlingx/+bug/1829390

And the reporter has attached the result of "system host-memory-list". Hope this can be useful.

Revision history for this message
Austin Sun (sunausti) wrote :

Thanks, ChenJie.
@Peng Peng, could you help collect the output of "/sys/devices/system/node/node0/meminfo" and "/sys/devices/system/node/node1/meminfo" for a compute that hits this issue? I think this info is not included in the collect output.
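
For reference, the hugepage-related lines of those files can be pulled in one command (a sketch; the full meminfo output is still preferred):
   grep -H -i huge /sys/devices/system/node/node*/meminfo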

Austin Sun (sunausti)
Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Bill Zvonar (billzvonar) wrote :

Hi Austin - do you still consider this LP incomplete? Are you still waiting on ChenJie for the info requested above?

Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Bill:
I'm waiting for Peng Peng to reproduce this issue and provide more info, as discussed offline.

Ghada Khalil (gkhalil)
summary: - compute node keeps offline after unlock due to vswitch error
+ compute node keeps offline after unlock due to vswitch error, caused by
+ a hugepage allocation failure
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This issue is related to huge pages. It's not really a networking issue. Removing the stx.networking tag and adding stx.config instead. This is already being investigated by Austin.

tags: added: stx.config
removed: stx.networking
Revision history for this message
Chris Winnicki (chriswinnicki) wrote :

The same failure was seen again in another Wind River lab (yow-cgcs-wildcat-99-103):

Info below as requested by Austin (comment #35):
(A fresh set of logs is attached, as generated by collect all.)
(In this system, compute-0 experienced the issue described above.)

Contents of /opt/platform/puppet/19.01/hieradata/* attached as opt_platform_puppet_19.01_hieradata.tar.gz

[sysadmin@controller-0 ~(keystone_admin)]$ system host-memory-list compute-0
(table columns unwrapped for readability)
processor 0: mem_total=5390 MiB, mem_platform=8000 MiB, mem_avail=5390 MiB, hp_configured=True, vs_hp_size=1024 MiB, vs_hp_total=0, vs_hp_avail=0, vs_hp_reqd=None, app_total_4K=1379840, app_hp_total_2M=0, app_hp_avail_2M=0, app_hp_pending_2M=None, app_hp_total_1G=0, app_hp_avail_1G=0, app_hp_pending_1G=None, app_hp_use_1G=True
processor 1: mem_total=6023 MiB, mem_platform=2000 MiB, mem_avail=6023 MiB, hp_configured=True, vs_hp_size=1024 MiB, vs_hp_total=0, vs_hp_avail=0, vs_hp_reqd=None, app_total_4K=1541888, app_hp_total_2M=0, app_hp_avail_2M=0, app_hp_pending_2M=None, app_hp_total_1G=0, app_hp_avail_1G=0, app_hp_pending_1G=None, app_hp_use_1G=True

[sysadmin@controller-0 ~(keystone_admin)]$ ssh compute-0 cat /etc/platform/worker_reserved.conf
sysadmin@compute-0's password:
################################################################################
# Copyright (c) 2018 Wind River Systems, Inc.
#
# SPDX-License-Identifier: Apache-2.0
#
# - This file is managed by Puppet. DO NOT EDIT.
################################################################################
# WORKER Node configuration parameters for reserved memory and physical cores
# used by Base software and VSWITCH. These are resources that libvirt cannot use.
#

################################################################################
#
# List of logical CPU instances available in the system. This value is used
# for auditing purposes so that the current configuration can be checked for
# validity against the actual number of logical CPU instances in the system.
#
################################################################################
WORKER_CPU_LIST="0-27"

################################################################################
#
# List of logical CPU instances that reserved for platform applications.
#
###################...


Revision history for this message
Chris Winnicki (chriswinnicki) wrote :
Revision history for this message
Chris Winnicki (chriswinnicki) wrote :
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Incomplete → Confirmed
Revision history for this message
Austin Sun (sunausti) wrote :

compute-0
(DB columns unwrapped for readability)
uuid=7c8406ea-f7e3-4c30-a40f-0d222dd9aa5d: memtotal_mib=5390, memavail_mib=5390, platform_reserved_mib=8000, hugepages_configured=t, vswitch_hugepages_size_mib=1024, vswitch_hugepages_reqd=\N, vswitch_hugepages_nr=0, vswitch_hugepages_avail=0, capabilities=\N, forihostid=2, forinodeid=3
uuid=81760ce4-f59b-4b32-a547-bfe619678233: memtotal_mib=6023, memavail_mib=6023, platform_reserved_mib=2000, hugepages_configured=t, vswitch_hugepages_size_mib=1024, vswitch_hugepages_reqd=\N, vswitch_hugepages_nr=0, vswitch_hugepages_avail=0, capabilities=\N, forihostid=2, forinodeid=4

The memtotal and memavail values are wrong; we need to check where the error comes from.

Revision history for this message
Austin Sun (sunausti) wrote :

from controller-0 sysinv.log.1
2019-07-19 20:50:01.036 103738 INFO sysinv.api.controllers.v1.host [-] Memory: Total=62908 MiB, Allocated=8000 MiB, 2M: 31454 pages None pages pending, 1G: 61 pages None pages pending
2019-07-19 20:50:01.064 103738 INFO sysinv.api.controllers.v1.host [-] Memory: Total=63239 MiB, Allocated=2000 MiB, 2M: 31619 pages None pages pending, 1G: 61 pages None pages pending
2019-07-19 20:50:01.085 103738 INFO sysinv.api.controllers.v1.host [-] host(compute-0) node(3): vm_mem_mib=53884,vm_mem_mib_possible (from agent) = 62908
2019-07-19 20:50:01.085 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(3): {'vm_hugepages_nr_4K': 1379840, 'vm_hugepages_nr_2M': 24247, 'vswitch_hugepages_nr': 1}
2019-07-19 20:50:01.187 103738 INFO sysinv.api.controllers.v1.host [-] host(compute-0) node(4): vm_mem_mib=60215,vm_mem_mib_possible (from agent) = 63238
2019-07-19 20:50:01.188 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(4): {'vm_hugepages_nr_4K': 1541888, 'vm_hugepages_nr_2M': 27096, 'vswitch_hugepages_nr': 1}

So at least at the time the unlock action was performed, the DB was correct: vswitch_hugepages_nr was 1.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Chris,
Thank you for reproducing this issue. It seems the first unlock of compute-0 was on July 19, but the puppet log was discarded, so it is hard to find any clue about that first unlock. If your environment is still available, could you delete compute-0 and re-install it? If the issue is still there, please collect the compute-0 logs again.

Thanks.

Revision history for this message
Tao Liu (tliu88) wrote :

Hi Austin,

I took a look at yow-cgcs-wildcat-99-103 yesterday. The database and hiera data were not updated on either compute-0 or compute-2, although there were logs from 2019-07-19 showing that both the database and the hiera data had been updated. There are two possible sources of this failure:
One, the database update failed without error logs, resulting in the hiera data not being updated. Two, the sysinv-agent reported the default memory inventory prior to the hiera data update, resetting the vswitch huge pages to 0 in the database.

To recover, I locked/unlocked compute-0 to re-populate the vswitch huge pages, and it was successful.

# compute-2
2019-07-19 20:49:47.430 103746 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-2) node(5): {'vm_hugepages_nr_4K': 1379328, 'vm_hugepages_nr_2M': 24246, 'vswitch_hugepages_nr': 1}
2019-07-19 20:49:47.459 103738 INFO sysinv.api.controllers.v1.host [-] compute-1 1. delta_handle ['uptime', 'task']
2019-07-19 20:49:47.530 103746 INFO sysinv.api.controllers.v1.host [-] host(compute-2) node(6): vm_mem_mib=60218,vm_mem_mib_possible (from agent) = 63242
2019-07-19 20:49:47.530 103746 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-2) node(6): {'vm_hugepages_nr_4K': 1541632, 'vm_hugepages_nr_2M': 27098, 'vswitch_hugepages_nr': 1}
2019-07-19 20:49:50.612 102604 INFO sysinv.puppet.puppet [req-df405ec9-1017-4b2d-861a-e4790ad819d6 admin admin] Updating hiera for host: compute-2 with config_uuid: None

# compute-0
2019-07-19 20:50:01.085 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(3): {'vm_hugepages_nr_4K': 1379840, 'vm_hugepages_nr_2M': 24247, 'vswitch_hugepages_nr': 1}
2019-07-19 20:50:01.187 103738 INFO sysinv.api.controllers.v1.host [-] host(compute-0) node(4): vm_mem_mib=60215,vm_mem_mib_possible (from agent) = 63238
2019-07-19 20:50:01.188 103738 INFO sysinv.api.controllers.v1.host [-] Updating mem values of host(compute-0) node(4): {'vm_hugepages_nr_4K': 1541888, 'vm_hugepages_nr_2M': 27096, 'vswitch_hugepages_nr': 1}
2019-07-19 20:50:05.532 102604 INFO sysinv.puppet.puppet [req-3de697a6-37d1-4c78-bc0f-6005f67b27b9 admin admin] Updating hiera for host: compute-0 with config_uuid: None

Revision history for this message
Tao Liu (tliu88) wrote :

I added additional logging to a designer load and re-installed it on yow-cgcs-wildcat-99-103 using the deployment manager. The problem was not present.

It looks like a timing issue in the previous install (the second possible source above). The sysinv-agent memory report came in after the database was updated but before the hiera data was updated. At that point, before the first unlock, the 'vswitch_hugepages_nr', 'vm_hugepages_nr_2M' and 'vm_hugepages_nr_1G' fields in the agent report were set to the default 0, which reset the vswitch huge pages to 0 in the database. The hiera data was then set to 0 by reading from the database.

I think we can add a check on the host state: if the host is not provisioned, the conductor could ignore those fields in the agent memory report.

Another option is to make a change in the sysinv agent so that it does not include 'vswitch_hugepages_nr', 'vm_hugepages_nr_2M' and 'vm_hugepages_use_1G' in the initial report prior to the first unlock.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672634

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
Austin Sun (sunausti) wrote :

Hi Tao, thanks a lot for your help investigating.
This time the issue was not the same as the one Peng found on 2019-05-09, where the DB data was correct. But I think your analysis is probably right.
But I think your analysis was perhaps right.

from controller-0 sysinv.log.1
compute-1:
2019-07-19 20:49:38.371 compute-1 ihost check_unlock_worker
2019-07-19 20:49:42.437 Updating hiera for host: compute-1 with config_uuid
The compute-1 agent report is scheduled roughly every minute, around 20:39:27.891; it did not fall between 38s and 42s.

compute-0:
2019-07-19 20:50:00.666 compute-0 ihost check_unlock_worker
2019-07-19 20:50:05.532 Updating hiera for host: compute-0 with config_uuid: None
The compute-0 agent report is scheduled roughly every minute, around 20:37:00.718; it fell between 00s and 05s.

compute-2:
2019-07-19 20:49:47.163 compute-2 ihost check_unlock_worker
2019-07-19 20:49:50.612 Updating hiera for host: compute-2 with config_uuid: None
The compute-2 agent report is scheduled roughly every minute, around 20:44:47.016, so it fell between 47s and 50s.
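
The two events being correlated here can be pulled from the controller in one pass (a sketch; the log path follows the usual StarlingX layout and may differ):
   grep -E "check_unlock_worker|Updating hiera for host" /var/log/sysinv.log*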

I made a change to ignore the agent memory report while unlocking; you can review it.
Thanks.

Revision history for this message
Peng Peng (ppeng) wrote :

Issue reproduced on
Lab: WCP_63_66
Load: 20190726T013000Z

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | controller-1 | controller | unlocked | enabled | available |
| 3 | compute-0 | worker | unlocked | enabled | available |
| 4 | compute-1 | worker | unlocked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+

More logs collected.

Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Peng Peng (ppeng) wrote :
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Seeing the issue on another lab (HW PV-0): after a clean install, compute-1 and compute-2 did not unlock.

BUILD_ID="20190724T013000Z"

see logs below:

Compute-1
Ovs-vswitchd.log
2019-07-26T15:21:37.256Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-26T15:21:37.278Z|00002|ovs_numa|INFO|Discovered 20 CPU cores on NUMA node 0
2019-07-26T15:21:37.278Z|00003|ovs_numa|INFO|Discovered 20 CPU cores on NUMA node 1
2019-07-26T15:21:37.278Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 40 CPU cores
2019-07-26T15:21:37.278Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2019-07-26T15:21:37.278Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2019-07-26T15:21:37.279Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2019-07-26T15:21:37.280Z|00008|dpif_netlink|INFO|The kernel module does not support meters.
2019-07-26T15:21:37.286Z|00009|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-26T15:21:37.587Z|00010|dpdk|INFO|Using DPDK 18.11.0
2019-07-26T15:21:37.587Z|00011|dpdk|INFO|DPDK Enabled - initializing...
2019-07-26T15:21:37.587Z|00012|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2019-07-26T15:21:37.587Z|00013|dpdk|INFO|IOMMU support for vhost-user-client disabled.
2019-07-26T15:21:37.587Z|00014|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-26T15:21:37.587Z|00015|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 7 --huge-dir /mnt/huge-1048576kB --socket-mem 0,0 --socket-limit 0,0.
2019-07-26T15:21:37.591Z|00016|dpdk|INFO|EAL: Detected 40 lcore(s)
2019-07-26T15:21:37.591Z|00017|dpdk|INFO|EAL: Detected 2 NUMA nodes
2019-07-26T15:21:37.591Z|00018|dpdk|ERR|EAL: invalid parameters for --socket-mem
2019-07-26T15:21:37.591Z|00019|dpdk|ERR|EAL: Invalid 'command line' arguments.
2019-07-26T15:21:37.591Z|00020|dpdk|EMER|Unable to initialize DPDK: Invalid argument

Compute-2
2019-07-26T15:19:41.329Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovs-vswitchd.log
2019-07-26T15:19:41.340Z|00002|ovs_numa|INFO|Discovered 28 CPU cores on NUMA node 0
2019-07-26T15:19:41.340Z|00003|ovs_numa|INFO|Discovered 28 CPU cores on NUMA node 1
2019-07-26T15:19:41.340Z|00004|ovs_numa|INFO|Discovered 2 NUMA nodes and 56 CPU cores
2019-07-26T15:19:41.340Z|00005|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connecting...
2019-07-26T15:19:41.340Z|00006|reconnect|INFO|unix:/var/run/openvswitch/db.sock: connected
2019-07-26T15:19:41.341Z|00007|dpdk|INFO|DPDK Disabled - Use other_config:dpdk-init to enable
2019-07-26T15:19:41.342Z|00008|dpif_netlink|INFO|The kernel module does not support meters.
2019-07-26T15:19:41.346Z|00009|bridge|INFO|ovs-vswitchd (Open vSwitch) 2.11.0
2019-07-26T15:19:41.689Z|00010|dpdk|INFO|Using DPDK 18.11.0
2019-07-26T15:19:41.689Z|00011|dpdk|INFO|DPDK Enabled - initializing...
2019-07-26T15:19:41.689Z|00012|dpdk|INFO|No vhost-sock-dir provided - defaulting to /var/run/openvswitch
2019-07-26T15:19:41.689Z|00013|dpdk|INFO|IOMMU support for vhost-user-client disabled.
2019-07-26T15:19:41.689Z|00014|dpdk|INFO|Per port memory for DPDK devices disabled.
2019-07-26T15:19:41.689Z|00015|dpdk|INFO|EAL ARGS: ovs-vswitchd -n 4 -c 7 --huge...


Revision history for this message
Ghada Khalil (gkhalil) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672634
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=85e34657c59df9b2f18c694bf6c7ba187e8f062a
Submitter: Zuul
Branch: master

commit 85e34657c59df9b2f18c694bf6c7ba187e8f062a
Author: Sun Austin <email address hidden>
Date: Thu Jul 25 14:39:38 2019 +0800

    Avoid agent mem update during unlocking

    sysinv-agent might report memory after unlocking action was performing,
    The DB was updated, but hiera data has not been updated. During this
    time, memory report from agent will set ‘vswitch_hugepages_nr’,
    ‘vm_hugepages_nr_2M’ and ‘vm_hugepages_nr_1G’ values to default 0.

    adding protect to ignore agent mem report during unlocking (host is
    locked state and ihost_action is 'unlock' or 'force-unlock')

    Closes-Bug: 1829403
    Signed-off-by: Sun Austin <email address hidden>

    Change-Id: I3438809782560e90248a3e63e51aa0315fcf49d3

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

The issue was not reproduced recently.

tags: removed: stx.retestneeded