nodes access to board management module has failed

Bug #1826421 reported by Peng Peng
This bug affects 3 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Yi Wang

Bug Description

Brief Description
-----------------
After the system was installed, the alarm "200.010 node access to board management module has failed." was raised.

Severity
--------
Major

Steps to Reproduce
------------------
As description
....
TC-name:

Expected Behavior
------------------
no 200.010 alarm

Actual Behavior
----------------
200.010 alarm raised

Reproducibility
---------------
Reproducible

System Configuration
--------------------
Multi-node system

Lab-name:
WCP_99-103

Branch/Pull Time/Commit
-----------------------
stx master as of 20190425T013000Z

Last Pass
---------
20190410T013000Z

Timestamp/Logs
--------------
[2019-04-25 08:37:19,812] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-04-25 08:37:21,068] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+----------------------------------------------------------------------+--------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+----------------------------------------------------------------------+--------------------------------------+----------+----------------------------+
| e9382b26-d4cc-4674-9b40-e51c6b0f1af9 | 100.114 | NTP address 167.114.156.48 is not a valid or a reachable NTP server. | host=controller-0.ntp=167.114.156.48 | minor | 2019-04-25T07:09:02.366190 |
| f2a7d010-43a0-44ae-99d9-3a7373663060 | 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-04-25T07:08:02.580267 |
| a7d08307-4b57-4eb4-88d5-bc8b28e996a6 | 200.010 | compute-1 access to board management module has failed. | host=compute-1 | warning | 2019-04-25T07:08:00.686273 |
| f5ef9130-1a45-43de-ad78-7eeb97aff8be | 200.010 | compute-0 access to board management module has failed. | host=compute-0 | warning | 2019-04-25T07:07:58.241273 |
| 324b84fe-30af-45b5-82d9-9a5733e6a55a | 200.010 | controller-1 access to board management module has failed. | host=controller-1 | warning | 2019-04-25T07:07:56.898248 |
| c556aaf2-5d59-4f59-a546-13121538e948 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-04-25T07:07:53.901273 |
+--------------------------------------+----------+----------------------------------------------------------------------+--------------------------------------+----------+----------------------------+
[wrsroot@controller-0 ~(keystone_admin)]$
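
For anyone re-checking a build, the table above can be filtered mechanically. The following is a minimal Python sketch, not part of the test framework that produced this log: the function names `parse_alarm_table` and `hosts_with_alarm` are hypothetical, and the sample rows are copied from the output above with the border lines shortened.

```python
def parse_alarm_table(text):
    """Parse an ASCII table like the one printed by `fm alarm-list` into dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip borders, shell prompts, and log prefixes
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first "|" row holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def hosts_with_alarm(rows, alarm_id="200.010"):
    """Return the sorted host names that currently carry the given alarm ID."""
    return sorted(
        row["Entity ID"].split("=", 1)[1]
        for row in rows
        if row["Alarm ID"] == alarm_id
    )


# Sample rows copied from the log above (borders shortened for readability).
SAMPLE = """\
+------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+------+
| e9382b26-d4cc-4674-9b40-e51c6b0f1af9 | 100.114 | NTP address 167.114.156.48 is not a valid or a reachable NTP server. | host=controller-0.ntp=167.114.156.48 | minor | 2019-04-25T07:09:02.366190 |
| f2a7d010-43a0-44ae-99d9-3a7373663060 | 200.010 | compute-2 access to board management module has failed. | host=compute-2 | warning | 2019-04-25T07:08:02.580267 |
| c556aaf2-5d59-4f59-a546-13121538e948 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-04-25T07:07:53.901273 |
+------+
"""

if __name__ == "__main__":
    print(hosts_with_alarm(parse_alarm_table(SAMPLE)))
```

An empty result for alarm ID 200.010 would correspond to the expected behavior stated in this report.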

Test Activity
-------------
Sanity

Revision history for this message
Frank Miller (sensfan22) wrote :

Assigning to Eric to perform initial triage and determine if this is a lab network issue or a software bug.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Eric MacDonald (rocksolidmtce)
status: Triaged → Incomplete
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I saw this very same issue in the LO (wc35-60 today) and found that even the Linux ping command to the same address yielded the same error from the kernel.

2019-04-26T20:44:36.211 [100186.14611] controller-0 mtcAgent --- msgClass.cpp ( 734) write :Error : Failed to send with errno=1
2019-04-26T20:44:36.211 [100186.14612] controller-0 mtcAgent acc pingUtil.cpp ( 244) pingUtil_send : Warn : compute-1 ping 128.224.64.141 send failed (rc:-1) (1:Operation not permitted)
2019-04-26T20:44:36.211 [100186.14613] controller-0 mtcAgent acc pingUtil.cpp ( 665) pingUtil_acc_monitor :Error : compute-1 failed to send bmc ping
2019-04-26T20:44:36.216 [100186.14614] controller-0 mtcAgent --- msgClass.cpp (1058) msgClassTx : Info : Creating socket on port 0 with address: 128.224.64.141

and then ...

controller-0:~$ ping 128.224.64.141
PING 128.224.64.141 (128.224.64.141) 56(84) bytes of data.
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted
ping: sendmsg: Operation not permitted

From cgts4, however, the ping is OK, with a redirect from the 144 subnet to the 64 subnet as expected.

[emacdona@yow-cgts4-lx:~/emacdona/development/RedFish/Emulator/commands ] $ ping 128.224.64.141
PING 128.224.64.141 (128.224.64.141) 56(84) bytes of data.
From 128.224.144.1: icmp_seq=1 Redirect Host(New nexthop: 128.224.144.75)
From 128.224.144.1 icmp_seq=1 Redirect Host64 bytes from 128.224.64.141: icmp_seq=1 ttl=63 time=19.0 ms
From 128.224.144.1: icmp_seq=2 Redirect Host(New nexthop: 128.224.144.75)
From 128.224.144.1 icmp_seq=2 Redirect Host64 bytes from 128.224.64.141: icmp_seq=2 ttl=63 time=3.38 ms
From 128.224.144.1: icmp_seq=3 Redirect Host(New nexthop: 128.224.144.75)
From 128.224.144.1 icmp_seq=3 Redirect Host64 bytes from 128.224.64.141: icmp_seq=3 ttl=63 time=0.848 ms
From 128.224.144.1: icmp_seq=4 Redirect Host(New nexthop: 128.224.144.75)
From 128.224.144.1 icmp_seq=4 Redirect Host64 bytes from 128.224.64.141: icmp_seq=4 ttl=63 time=0.664 ms
From 128.224.144.1: icmp_seq=5 Redirect Host(New nexthop: 128.224.144.75)
From 128.224.144.1 icmp_seq=5 Redirect Host64 bytes from 128.224.64.141: icmp_seq=5 ttl=63 time=0.726 ms

I think this is a networking or firewall rule issue.

Revision history for this message
Frank Miller (sensfan22) wrote :

Based on Eric's triage, this appears to be an issue related to firewalls. Assigning to Yi to determine if this is an issue introduced by the Calico SB https://storyboard.openstack.org/#!/story/2005066

Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Yi Wang (wangyi4)
tags: added: stx.2.0
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as release gating; firewall functionality appears to be broken

Revision history for this message
Matt Peters (mpeters-wrs) wrote :

All egress TCP and UDP traffic should be permitted through the firewall. Can you confirm that ICMP is required, or was that just to test connectivity?

If it is required, ICMP will need to be added to the supported egress protocols in the calico_oam_if_gnp.yaml.erb template.
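
To illustrate the point, here is a toy Python model of an egress allow-list. It is only a sketch: it assumes that anything not explicitly listed is dropped, the class `EgressPolicy` is hypothetical, and it does not reproduce the contents of calico_oam_if_gnp.yaml.erb.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressPolicy:
    """Toy allow-list: a protocol leaves the host only if it is listed."""

    allowed_protocols: frozenset

    def permits(self, protocol: str) -> bool:
        return protocol.upper() in self.allowed_protocols


# Assumed state before the change: only TCP and UDP egress is permitted.
before_fix = EgressPolicy(frozenset({"TCP", "UDP"}))
# Assumed state after adding ICMP to the supported egress protocols.
after_fix = EgressPolicy(frozenset({"TCP", "UDP", "ICMP"}))

if __name__ == "__main__":
    for name, policy in (("before", before_fix), ("after", after_fix)):
        print(name, {p: policy.permits(p) for p in ("tcp", "udp", "icmp")})
```

Under that assumption, a ping sent from the controller is rejected before the change even though TCP and UDP traffic passes, which would be consistent with the "Operation not permitted" errors shown earlier.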

Revision history for this message
Yi Wang (wangyi4) wrote :

Matt, all egress TCP and UDP traffic is already permitted on OAM interfaces.
"Can you confirm that ICMP is required, or was that just to test connectivity?"
Where can I check this?

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Maintenance has a BMC reachability audit that uses ping towards the provisioned BMC IP.
While that ping does not succeed, maintenance raises a BMC access alarm and goes into a background periodic audit, retrying the ping operation until it succeeds.

While the ping test does not succeed, maintenance does not enable power control or sensor monitoring.
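
The behaviour described above can be sketched roughly as follows. This is an illustrative Python model, not the mtcAgent implementation: the names `BmcAudit` and `audit_until_reachable` are hypothetical, the probe is injected so the example runs without sending real pings, and clearing the alarm on the first successful ping is an assumption.

```python
import time
from typing import Callable, Optional


class BmcAudit:
    """Tracks alarm state and BMC accessibility from a ping-style probe."""

    def __init__(self, probe: Callable[[], bool]):
        self.probe = probe            # returns True when the BMC answers
        self.alarm_raised = False     # stands in for the BMC access alarm
        self.bmc_accessible = False   # gates power control and sensor monitoring

    def run_once(self) -> bool:
        """Run one audit pass and update the alarm and accessibility flags."""
        reachable = self.probe()
        self.alarm_raised = not reachable
        self.bmc_accessible = reachable
        return reachable


def audit_until_reachable(
    audit: BmcAudit,
    interval_s: float = 0.0,
    max_attempts: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry the audit periodically; return the number of attempts made."""
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if audit.run_once():
            break
        sleep(interval_s)
    return attempts


if __name__ == "__main__":
    # Simulated probe: two failed pings, then a success.
    results = iter([False, False, True])
    audit = BmcAudit(lambda: next(results))
    print(audit_until_reachable(audit, max_attempts=5), audit.alarm_raised)
```

In this model a firewall that drops every ping keeps `alarm_raised` set and `bmc_accessible` unset indefinitely, which is the state this bug report describes.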

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

So yes, ICMP is required unless we make a significant change to how maintenance verifies basic connectivity to the BMC.

Frank Miller (sensfan22)
Changed in starlingx:
status: Incomplete → Triaged
Revision history for this message
Matt Peters (mpeters-wrs) wrote :

Thanks Eric for confirming. It isn't a problem to add ICMP to the egress policy, I just wanted to make sure that it was required.

Revision history for this message
Yi Wang (wangyi4) wrote : Re: [Bug 1826421] Re: nodes access to board management module has failed

Thanks Eric and Matt for triaging it! I will make a patch to fix this issue.

BR
Yi

Revision history for this message
Yi Wang (wangyi4) wrote :

Eric, I am going to verify my patch. How do I provision the BMC IP?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/657301

Changed in starlingx:
status: Triaged → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.networking
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/657301
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=ce41d7e01b4176f51a814989a1edf0dd13b8390f
Submitter: Zuul
Branch: master

commit ce41d7e01b4176f51a814989a1edf0dd13b8390f
Author: Yi Wang <email address hidden>
Date: Tue May 7 09:50:25 2019 +0800

    Fix board management module access bug

    Add egress icmp support in calico global policy because maintenance
    has a BMC reachability audit that uses ping towards the provisioned
    BMC IP.

    Change-Id: I911a4da23a5b4ac7c0302bb9465cfe7cdab4c34c
    Closes-Bug: #1826421
    Signed-off-by: Yi Wang <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

No 200.010 alarm found in
BUILD_ID="20190508T013000Z"

tags: removed: stx.retestneeded