both controllers remained in degraded status and out-of-config when configured with static IPv4 addressing for mgmt network

Bug #1798836 reported by mhg
This bug affects 1 person
Affects    Status   Importance  Assigned to  Milestone
StarlingX  Invalid  Medium      Austin Sun

Bug Description

Brief Description
-----------------
After installing a two-node system with static addressing for the management network, both nodes eventually went into degraded status. The following alarms were raised:
controller-1 Configuration is out-of-date
controller-0 Configuration is out-of-date
Service group controller-services degraded

The drbd sync had completed (confirmed by waiting and checking with drbd-overview).
Lock/unlock controller-1 did not clear the out-of-date alarm for controller-1.
Swact to controller-1 got rejected.

Severity
--------
Major

Steps to Reproduce
------------------
1) install controller-0 (while keeping controller-1 powered off)
2) configure the system with static IPv4 addressing for management network by setting:
DYNAMIC_ALLOCATION = N
IP_START_ADDRESS=192.168.204.12
IP_END_ADDRESS=192.168.204.99
in the configuration file passed to config_controller
3) add the second node (controller-1) to sysinv via a host-bulk-add file:
...
    <host>
        <personality>controller</personality>
        <hostname>controller-1</hostname>
        <mgmt_mac>c8:1f:66:e1:5f:17</mgmt_mac>
        <mgmt_ip>192.168.204.13</mgmt_ip>
        <power_on/>
        <bm_type>bmc</bm_type>
        <bm_ip>128.224.64.223</bm_ip>
        <bm_username>root</bm_username>
        <bm_password>root</bm_password>
        <install_output>text</install_output>
    </host>
...
4) pxe-boot controller-1 from controller-0
5) run lab_setup and unlock controller-1
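
As a sanity check on the values in steps 2 and 3, the following is a minimal sketch (not part of the original report) that reads mgmt_ip from a host entry and tests whether it lies inside the configured static range. The helper names `ip_in_range` and `mgmt_ip_of` are hypothetical, and the XML is a trimmed copy of the host entry above containing only the fields the check needs.

```python
import ipaddress
import xml.etree.ElementTree as ET

# Trimmed copy of the host-bulk-add entry from step 3 (illustrative only).
HOST_XML = """
<host>
    <personality>controller</personality>
    <hostname>controller-1</hostname>
    <mgmt_ip>192.168.204.13</mgmt_ip>
</host>
"""


def ip_in_range(ip, start, end):
    """Return True if ip lies in the inclusive range [start, end]."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end)


def mgmt_ip_of(host_xml):
    """Extract the text of the <mgmt_ip> element from a single <host> entry."""
    return ET.fromstring(host_xml.strip()).findtext("mgmt_ip")


if __name__ == "__main__":
    ip = mgmt_ip_of(HOST_XML)
    # Range values are the IP_START_ADDRESS / IP_END_ADDRESS from step 2.
    print(ip, ip_in_range(ip, "192.168.204.12", "192.168.204.99"))
```

With the values as reported, the controller-1 address falls inside the range. The same check can be rerun with whatever range the actual configuration file contains.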

Expected Behavior
------------------
1) both controllers in 'available' status after drbd-sync finished, and system is working without any issue.

Actual Behavior
----------------
1) both controllers in degraded status
2) there were alarms showing:
controller-1 Configuration is out-of-date
controller-0 Configuration is out-of-date
Service group controller-services degraded
3) Lock/unlock of controller-1 did not clear the out-of-date alarm for controller-1.
Swact to controller-1 was rejected.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Two node system

Branch/Pull Time/Commit
-----------------------
StarlingX_18.10 as of 2018-10-16_01-52-00

Timestamp/Logs
--------------
2018-10-18T18:41:15

Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.2019.03 - most deployments use dynamic IP, so not required for stx.2018.10

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Bruce Jones (brucej)
tags: added: stx.2019.03 stx.config stx.networking
Revision history for this message
mhg (marvinhg) wrote :
Revision history for this message
mhg (marvinhg) wrote :
Revision history for this message
Bruce Jones (brucej) wrote :

Cindy please assign an engineer to work this bug, thanks!

Changed in starlingx:
assignee: Bruce Jones (brucej) → Cindy Xie (xxie1)
Austin Sun (sunausti)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Austin Sun (sunausti)
Austin Sun (sunausti)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Austin Sun (sunausti) wrote :

Hi mhg,
   I cannot reproduce this issue in my setup.
In your comment you wrote:
IP_START_ADDRESS=192.168.204.12
IP_END_ADDRESS=192.168.204.99

But in the attached config, the IP range starts at 192.168.204.2. Is that a typo?

Revision history for this message
mhg (marvinhg) wrote :

Hi Austin,

That's probably a typo.

Is your setup with the same version?

Revision history for this message
Austin Sun (sunausti) wrote :

Yes, I'm using the 2018.10 version with a VM setup. Are you using a VM setup or a bare metal setup?

Revision history for this message
mhg (marvinhg) wrote :

I was using load StarlingX_18.10 as of 2018-10-16_01-52-00 with a VM setup.

I was also using a non-default IP range.

Revision history for this message
Austin Sun (sunausti) wrote :

Is it possible to share the logs from controller-1 and the files under /etc/platform/?
Thanks.

Revision history for this message
mhg (marvinhg) wrote :

The tarball containing the contents of /etc/platform from controller-1 is attached as etc.platform.tgz.
By the way, the test was done on a 2-node hardware lab (not installed with Bare Metal though).

Revision history for this message
Austin Sun (sunausti) wrote :

Hi mhg,
   Thanks. I was not clear earlier: could you provide /var/log/* from controller-1? And if you are using KVM, could you share the KVM XML files?

Revision history for this message
mhg (marvinhg) wrote :

Hi Austin,

Here are the files from controller-1:/var/log/: controller-1-var.log.tgz.

Because the test failed at a very early stage (installation), no VMs had been created yet, and I did not check the state of the kvm processes or module at that time.

By the way, in our tests the first node was named controller-0 and the second node controller-1.

Hope these can be helpful.

If the issue could not be reproduced, that's great.

Revision history for this message
Austin Sun (sunausti) wrote :

mhg,
Thanks a lot. It seems ovs-dpdk is not working properly on controller-1. Could you share the output of the following commands on controller-1?
1) "sudo ovs-vsctl show" (checks the OVS status)
2) '/usr/share/openvswitch/scripts/dpdk-devbind.py --status' (shows the NIC list)

From ovs-vswitchd.log on controller-1:
2018-10-18T18:30:54.208Z|00013|dpdk|INFO|EAL ARGS: ovs-vswitchd -c 51 --huge-dir /mnt/huge-1048576kB --socket-mem 1024,1024 -n 4
2018-10-18T18:30:54.209Z|00014|dpdk|INFO|EAL: Detected 20 lcore(s)
2018-10-18T18:30:54.224Z|00015|dpdk|INFO|EAL: 22550 hugepages of size 2097152 reserved, but no mounted hugetlbfs found for that size

2018-10-19T00:13:48.781Z|00120|bridge|INFO|bridge br-phy0: using datapath ID 0000c6430341a240
2018-10-19T00:13:48.781Z|00121|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connecting...
2018-10-19T00:13:48.781Z|00122|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-10-19T00:13:48.781Z|00123|rconn|INFO|br-int<->tcp:127.0.0.1:6633: waiting 2 seconds before reconnect
2018-10-19T00:13:48.781Z|00124|rconn|INFO|br-phy0<->tcp:127.0.0.1:6633: connecting...
2018-10-19T00:13:48.781Z|00125|rconn|WARN|br-phy0<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-10-19T00:13:48.781Z|00126|rconn|INFO|br-phy0<->tcp:127.0.0.1:6633: waiting 2 seconds before reconnect
2018-10-19T00:13:48.787Z|00127|bridge|INFO|bridge br-phy0: deleted interface lldp7db201dc-4d on port 2
2018-10-19T00:13:48.806Z|00128|bridge|INFO|bridge br-phy0: deleted interface phy-br-phy0 on port 3
2018-10-19T00:13:48.811Z|00129|bridge|INFO|bridge br-phy0: deleted interface br-phy0 on port 65534
2018-10-19T00:13:48.916Z|00130|poll_loop|INFO|wakeup due to [POLLIN] on fd 11 (<->/var/run/openvswitch/db.sock) at lib/stream-fd.c:157 (83% CPU usage)
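
For scanning a longer ovs-vswitchd.log, a minimal sketch (not from the original thread) can split each line on the pipe-separated layout visible in the excerpt above (timestamp, sequence number, module, level, message) and keep only entries at WARN or higher. The names `LogRecord`, `parse_line`, and `warnings_in` are hypothetical, and the sample lines are copied from the excerpt.

```python
from collections import namedtuple

# Field layout assumed from the excerpt: timestamp|seq|module|level|message
LogRecord = namedtuple("LogRecord", "timestamp seq module level message")

SAMPLE = [
    "2018-10-18T18:30:54.224Z|00015|dpdk|INFO|EAL: 22550 hugepages of size 2097152 reserved, but no mounted hugetlbfs found for that size",
    "2018-10-19T00:13:48.781Z|00121|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connecting...",
    "2018-10-19T00:13:48.781Z|00122|rconn|WARN|br-int<->tcp:127.0.0.1:6633: connection failed (Connection refused)",
    "2018-10-19T00:13:48.781Z|00125|rconn|WARN|br-phy0<->tcp:127.0.0.1:6633: connection failed (Connection refused)",
]


def parse_line(line):
    """Split one log line into a LogRecord, or return None if it does not fit."""
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    return LogRecord(*parts)


def warnings_in(lines, levels=("WARN", "ERR", "EMER")):
    """Return the parsed records whose level is one of the given levels."""
    records = (parse_line(line) for line in lines)
    return [r for r in records if r is not None and r.level in levels]


if __name__ == "__main__":
    for rec in warnings_in(SAMPLE):
        print(rec.timestamp, rec.module, rec.message)
```

Note that a level filter alone would miss the hugetlbfs message in the excerpt, which is logged at INFO, so a keyword search on the message text may be a useful complement.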

Revision history for this message
mhg (marvinhg) wrote :

It's great that you found a root cause.

Unfortunately the test environment is gone. The lab has been reinstalled many times since then, typically with dynamic IP addressing, for other tests. I cannot dig into it any further.

Revision history for this message
Austin Sun (sunausti) wrote :

Hi mhg,
   Thanks for your support. Is it OK to close this bug now? If you hit the same issue again, we can re-open it with new logs (controller-0 and controller-1 logs, NIC types, etc.).

Revision history for this message
mhg (marvinhg) wrote :

Hi Austin,

I agree that we can close the bug as cannot-reproduce.
Thanks for your investigation and analysis.

Changed in starlingx:
status: In Progress → Invalid
Revision history for this message
mhg (marvinhg) wrote :

I changed the status to 'Invalid', which is the closest status to 'cannot reproduce' that I am able to set.
If you can close it another way, please go ahead.

Ken Young (kenyis)
tags: added: stx.2019.05
removed: stx.2019.03
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05