Storage-0 went offline because the NIC driver continuously failed to allocate memory

Bug #1924209 reported by Zhixiong Chi
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Zhixiong Chi

Bug Description

Brief Description
-----------------
storage-0 went offline suddenly because its 'bnx2x' NIC driver continuously failed to allocate memory, logging 'page allocation failure: order:2, mode:0x104020'. Since the Mgmnt and Infra networks use the same NICs on that node, both heartbeats were lost and controller-0 rebooted storage-0. Afterwards, the '%wait' CPU usage of some VMs became very high, and after storage-0 came back online some Ceph I/O requests became slow.

Severity
--------
Major.

Steps to Reproduce
------------------
Happened suddenly on a customer system.
Could not be reproduced internally.

Expected Behavior
------------------
Storage-0 continues to work normally, even after the failure has occurred a couple of times.

Actual Behavior
----------------
Storage-0 went offline because the memory allocations failed, and after it came back online some Ceph I/O requests became slow.

Reproducibility
---------------
Occurred a couple of times

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
any

Last Pass
---------
unknown

Timestamp/Logs
--------------
At about 2020-12-10 15:40, the 'bnx2x' NIC driver on storage-0 continuously failed to allocate memory, logging 'page allocation failure: order:2, mode:0x104020'. Since the Mgmnt and Infra networks use the same NICs on that node, both heartbeats were lost and controller-0 rebooted storage-0.

2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829312] warn_alloc_failed: 515 callbacks suppressed
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829316] ms_pipe_read: page allocation failure: order:2, mode:0x104020
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829319] CPU: 0 PID: 84434 Comm: ms_pipe_read Tainted: G O ------------ 3.10.0-693.21.1.el7.39.tis.x86_64 #1
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829320] Hardware name: Dell Inc. PowerEdge FC430/00TXH1, BIOS 2.8.0 05/23/2018
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829321] Call Trace:
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829323] <IRQ> [<ffffffff817826ff>] dump_stack+0x19/0x1b
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829334] [<ffffffff8115d600>] warn_alloc_failed+0x110/0x180
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829337] [<ffffffff8177e360>] __alloc_pages_slowpath+0x69c/0x70a
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829340] [<ffffffff811616cb>] __alloc_pages_nodemask+0x43b/0x460
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829345] [<ffffffff811a4cc8>] alloc_pages_current+0x98/0x110
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829347] [<ffffffff8115c50e>] __get_free_pages+0xe/0x40
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829349] [<ffffffff811b02ee>] kmalloc_order_trace+0x2e/0xe0
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829351] [<ffffffff811b3c7f>] __kmalloc+0x1ff/0x290
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829391] [<ffffffffc04a95ba>] bnx2x_frag_alloc.isra.62+0x2a/0x40 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829399] [<ffffffffc04a9b14>] bnx2x_alloc_rx_data.isra.70+0x54/0x1d0 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829406] [<ffffffffc04aaaeb>] bnx2x_rx_int+0x86b/0x19b0 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829410] [<ffffffff816586e9>] ? consume_skb+0x39/0xd0
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829415] [<ffffffff81544d56>] ? ata_qc_complete+0xc6/0x340
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829429] [<ffffffffc04adacd>] bnx2x_poll+0x1dd/0x260 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829449] [<ffffffff8166d89c>] net_rx_action+0x18c/0x3d0
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829453] [<ffffffff8107b68a>] __do_softirq+0x11a/0x320
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829459] [<ffffffff8179428c>] call_softirq+0x1c/0x30
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829462] [<ffffffff8101de15>] do_softirq+0x65/0xa0
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829469] [<ffffffff8107ba6d>] irq_exit+0xed/0x120
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829476] [<ffffffff817952a6>] do_IRQ+0x56/0xf0
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829487] [<ffffffff817902a2>] common_interrupt+0x162/0x162
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829493] <EOI> [<ffffffff81160d4a>] ? get_page_from_freelist+0x3fa/0x940
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829500] [<ffffffff8116140e>] __alloc_pages_nodemask+0x17e/0x460
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829504] [<ffffffff8116554b>] ? pagevec_lru_move_fn+0xeb/0x110
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829508] [<ffffffff811a7a3a>] alloc_pages_vma+0x9a/0x170
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829512] [<ffffffff811863d4>] handle_mm_fault+0x994/0xdc0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829517] [<ffffffff810580de>] __do_page_fault+0x14e/0x3a0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829520] [<ffffffff810583a7>] do_page_fault+0x37/0xa0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829523] [<ffffffff81790648>] page_fault+0x28/0x30
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829527] [<ffffffff81331519>] ? copy_user_enhanced_fast_string+0x9/0x20
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829531] [<ffffffff81337d5a>] ? memcpy_toiovec+0x4a/0x90
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829534] [<ffffffff8165f5ba>] skb_copy_datagram_iovec+0x11a/0x2a0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829538] [<ffffffff816beada>] tcp_recvmsg+0x24a/0xb50
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829543] [<ffffffff816edc54>] inet_recvmsg+0xa4/0xe0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829548] [<ffffffff816509b5>] sock_recvmsg+0xc5/0x100
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829550] [<ffffffff8116140e>] ? __alloc_pages_nodemask+0x17e/0x460
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829553] [<ffffffff81650b28>] SYSC_recvfrom+0xe8/0x160
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829559] [<ffffffff810d5112>] ? ktime_get_ts64+0x52/0xf0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829562] [<ffffffff8165130e>] SyS_recvfrom+0xe/0x10
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829565] [<ffffffff81791015>] system_call_fastpath+0x1c/0x21

2020-12-10T15:40:33.200 [69138.11065] controller-0 hbsAgent — nodeClass.cpp (8193) lost_pulses : Info : storage-0 Mgmnt Pulse Miss ( 2) (max: 1)
2020-12-10T15:40:33.200 [69138.11066] controller-0 hbsAgent — nodeClass.cpp (8193) lost_pulses : Info : storage-0 Infra Pulse Miss ( 2) (max: 1)
2020-12-10T15:40:33.306 [69138.11067] controller-0 hbsAgent — nodeClass.cpp (8193) lost_pulses : Info : storage-0 Mgmnt Pulse Miss ( 3) (max: 2)
......
2020-12-10T15:40:36.369 [69138.11099] controller-0 hbsAgent — nodeClass.cpp (8156) lost_pulses : Info : storage-0 Mgmnt Pulse Miss (10) (log throttled to every 4095)
2020-12-10T15:40:36.369 [69138.11100] controller-0 hbsAgent — nodeClass.cpp (8282) lost_pulses :Error : storage-0 Mgmnt -> FAILED
2020-12-10T15:40:36.369 [69138.11101] controller-0 hbsAgent — nodeClass.cpp (8284) lost_pulses :Error : storage-0 Mgmnt *** Heartbeat Loss ***
......
2020-12-10T15:40:36.381 [69138.11103] controller-0 hbsAgent — nodeClass.cpp (8156) lost_pulses : Info : storage-0 Infra Pulse Miss (10) (log throttled to every 4095)
2020-12-10T15:40:36.381 [69138.11104] controller-0 hbsAgent — nodeClass.cpp (8282) lost_pulses :Error : storage-0 Infra -> FAILED
2020-12-10T15:40:36.381 [69138.11105] controller-0 hbsAgent — nodeClass.cpp (8284) lost_pulses :Error : storage-0 Infra *** Heartbeat Loss ***

2020-12-10T15:40:33.412 [67760.63906] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 132) mnfa_add_host : Info : storage-0 MNFA new candidate (Mgmnt) 15 enabled hosts (threshold:2) (1:Mgmnt:16) (0:Infra:11)
2020-12-10T15:40:33.523 [67760.63907] controller-0 mtcAgent — nodeClass.cpp (4326) hbs_minor_clear : Info : storage-0 MNFA removed from pool
2020-12-10T15:40:33.523 [67760.63908] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 168) mnfa_recover_host : Info : storage-0 graceful recovery from MNFA
2020-12-10T15:40:33.523 [67760.63909] controller-0 mtcAgent — nodeClass.cpp (4380) hbs_minor_clear : Warn : storage-0 MNFA host tally (Mgmnt:0)
2020-12-10T15:40:33.523 [67760.63910] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1545) recovery_handler : Info : storage-0 MNFA Graceful Recovery (uptime was 31652216)
2020-12-10T15:40:33.523 [67760.63911] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 792) send_hbs_command : Info : storage-0 sending 'stop' to heartbeat service
2020-12-10T15:40:33.523 [67760.63912] controller-0 mtcAgent inv mtcInvApi.cpp ( 341) mtcInvApi_update_task : Info : storage-0 Task: Graceful Recovery (seq:186962)
2020-12-10T15:40:33.529 [67760.63913] controller-0 mtcAgent |-| nodeClass.cpp (2424) start_offline_handler : Info : storage-0 starting offline handler (unlocked-enabled-available) (stage:0)
2020-12-10T15:40:33.529 [67760.63914] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1625) recovery_handler : Info : storage-0 requesting mtcAlive with 5 sec timeout
2020-12-10T15:40:33.529 [67760.63915] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (3135) offline_handler : Info : storage-0 starting 100 msec offline audit (enabled-available)
2020-12-10T15:40:33.529 [67760.63916] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1643) recovery_handler : Info : storage-0 got requested mtcAlive
2020-12-10T15:40:33.529 [67760.63917] controller-0 mtcAgent |-| nodeClass.cpp (2437) stop_offline_handler : Info : storage-0 stopping offline handler (unlocked-enabled-available) (stage:3)
2020-12-10T15:40:33.529 [67760.63918] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1674) recovery_handler : Warn : storage-0 Connectivity Recovered ; host did not reset
2020-12-10T15:40:33.529 [67760.63919] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1675) recovery_handler : Warn : storage-0 ... continuing with MNFA graceful recovery
2020-12-10T15:40:33.529 [67760.63920] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1676) recovery_handler : Warn : storage-0 ... with no affect to host services
2020-12-10T15:40:33.529 [67760.63921] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 796) send_hbs_command : Info : storage-0 sending 'start' to heartbeat service
2020-12-10T15:40:33.529 [67760.63922] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (2379) recovery_handler : Info : storage-0 Starting 11 sec Heartbeat Soak (with ready event)
2020-12-10T15:40:35.734 [67760.63923] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 132) mnfa_add_host : Info : storage-0 MNFA new candidate (Mgmnt) 15 enabled hosts (threshold:2) (1:Mgmnt:17) (0:Infra:11)
2020-12-10T15:40:35.945 [67760.63924] controller-0 mtcAgent — nodeClass.cpp (4608) manage_heartbeat_degrade: Warn : storage-0 Mgmnt *** Heartbeat Miss ***
2020-12-10T15:40:35.945 [67760.63925] controller-0 mtcAgent — nodeClass.cpp (4447) manage_service_degrade : Warn : storage-0 'Mgmnt' degrade assert
2020-12-10T15:40:35.946 [67760.63926] controller-0 mtcAgent inv mtcInvApi.cpp (1086) mtcInvApi_update_state : Info : storage-0 degraded (seq:186963)
2020-12-10T15:40:35.946 [67760.63927] controller-0 mtcAgent — nodeClass.cpp (4608) manage_heartbeat_degrade: Warn : storage-0 Infra *** Heartbeat Miss ***
2020-12-10T15:40:35.946 [67760.63928] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 132) mnfa_add_host : Info : storage-0 MNFA new candidate (Infra) 15 enabled hosts (threshold:2) (1:Mgmnt:17) (1:Infra:12)
2020-12-10T15:40:35.946 [67760.63929] controller-0 mtcAgent — nodeClass.cpp (4447) manage_service_degrade : Warn : storage-0 'Infra' degrade assert
......
2020-12-10T15:40:36.407 [67760.63976] controller-0 mtcAgent |-| nodeClass.cpp (2424) start_offline_handler : Info : storage-0 starting offline handler (unlocked-disabled-failed) (stage:0)
2020-12-10T15:40:36.407 [67760.63977] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (3135) offline_handler : Info : storage-0 starting 100 msec offline audit (disabled-failed)
2020-12-10T15:40:36.408 [67760.63978] controller-0 mtcAgent |-| mtcCmdHdlr.cpp ( 368) cmd_handler : Info : storage-0 Performing REBOOT (mgmnt network)
2020-12-10T15:40:36.408 [67760.63979] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 636) send_mtc_cmd : Info : storage-0 sending 'reboot' request (Mgmnt network)
2020-12-10T15:40:36.408 [67760.63980] controller-0 mtcAgent |-| mtcCmdHdlr.cpp ( 381) cmd_handler : Info : storage-0 Performing REBOOT (infra network)
2020-12-10T15:40:36.408 [67760.63981] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 636) send_mtc_cmd : Info : storage-0 sending 'reboot' request (Infra network)
2020-12-10T15:40:36.408 [67760.63982] controller-0 mtcAgent cmd mtcCmdHdlr.cpp ( 399) cmd_handler : Info : storage-0 waiting for REBOOT ACK
2020-12-10T15:40:36.409 [67760.63983] controller-0 mtcAgent |-| mtcCmdHdlr.cpp ( 433) cmd_handler : Info : storage-0 REBOOT Request Succeeded

After storage-0 came back online, some Ceph I/O requests became slow, which is likely why the '%wait' CPU usage of some VMs became so high (see the OSD slow-request warnings below; a diagnostic sketch follows the log excerpt).

2020-12-10T15:44:59.366 [67760.64015] controller-0 mtcAgent vim mtcVimApi.cpp ( 259) mtcVimApi_state_change : Info : storage-0 sending 'host' state change to vim (enabled)
2020-12-10T15:44:59.366 [67760.64016] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1476) enable_handler : Info : storage-0 is ENABLED
2020-12-10T15:44:59.372 [67760.64017] controller-0 mtcAgent inv mtcInvApi.cpp (1086) mtcInvApi_update_state : Info : storage-0 available (seq:186985)

2020-12-10 15:45:39.941555 osd.9 172.16.2.192:6828/38606 2734 : cluster [WRN] 4 slow requests, 4 included below; oldest blocked for > 30.415315 secs
2020-12-10 15:45:39.941572 osd.9 172.16.2.192:6828/38606 2735 : cluster [WRN] slow request 30.415315 seconds old, received at 2020-12-10 15:45:09.525436: osd_op(client.556459574.0:1949887 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1552384~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for subops from 2
2020-12-10 15:45:39.941581 osd.9 172.16.2.192:6828/38606 2736 : cluster [WRN] slow request 30.415272 seconds old, received at 2020-12-10 15:45:09.525479: osd_op(client.556459574.0:1949888 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1601536~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for rw locks
2020-12-10 15:45:39.941590 osd.9 172.16.2.192:6828/38606 2737 : cluster [WRN] slow request 30.415140 seconds old, received at 2020-12-10 15:45:09.525611: osd_op(client.556459574.0:1949889 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1605632~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for rw locks
2020-12-10 15:45:39.941594 osd.9 172.16.2.192:6828/38606 2738 : cluster [WRN] slow request 30.415085 seconds old, received at 2020-12-10 15:45:09.525666: osd_op(client.556459574.0:1949890 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1613824~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for rw locks
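
For reference, blocked requests like these can also be inspected directly on the cluster. A minimal diagnostic sketch (assuming admin access, with the OSD admin-socket commands run on the host where osd.9 lives; osd.9 is taken from the warnings above):

  # Show current health warnings, including slow/blocked request details
  ceph health detail

  # Query the affected OSD's admin socket for in-flight and recent slow ops
  ceph daemon osd.9 dump_ops_in_flight
  ceph daemon osd.9 dump_historic_ops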

1) Kernel error 'page allocation failure: order:2, mode:0x104020'. It has been occurring quite regularly since 2019-06-06, and most of the stack traces also mention the bnx2x NIC. We should find out whether there is a defect that needs to be fixed, such as a memory leak, or whether there is a solution/workaround to avoid it (see the fragmentation check sketched after the process tally below).

Performing a tally of which process experienced this:
grep -oE "[^ ]* page allocation failure" Downloads/kern.log.storage-0.kacbhi1 |sort|uniq -c|sort -n
1 jbd2/dm-1-8: page allocation failure
1 kworker/0:0: page allocation failure
2 hbsClient: page allocation failure
2 systemd-journal: page allocation failure
2 tp_osd_recov: page allocation failure
10 journal_write: page allocation failure
10 kworker/0:1: page allocation failure
10 ms_pipe_write: page allocation failure
12 tp_fstore_op: page allocation failure
14 tp_osd_tp: page allocation failure
22 ms_accepter: page allocation failure
41 journal_wrt_fin: page allocation failure
57 syslog-ng: page allocation failure
120 ksoftirqd/0: page allocation failure
356 ms_pipe_read: page allocation failure
540 swapper/0: page allocation failure
569 kswapd0: page allocation failure
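
An order:2 failure means the kernel could not find four physically contiguous pages (16 KiB) for a non-sleeping allocation in the network receive path, which points at low-memory fragmentation rather than a plain out-of-memory condition. A quick check on the node could look like this (a diagnostic sketch, not taken from the collected logs):

  # Free page counts per order (columns: order 0, 1, 2, ...);
  # persistently low counts in the order >= 2 columns match these failures
  cat /proc/buddyinfo

  # Current reserve the kernel keeps free for atomic/low-memory allocations
  cat /proc/sys/vm/min_free_kbytes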

Test Activity
-------------

Workaround
----------

Changed in starlingx:
assignee: nobody → Zhixiong Chi (zhixiongchi)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/786354

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.distro.other
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: marking as low; this is a highly intermittent issue and therefore will not gate the stx.5.0 release which is imminent. A fix in stx master should be sufficient.

Changed in starlingx:
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/786354
Committed: https://opendev.org/starlingx/stx-puppet/commit/94ec35ff2d5363d3816f6d267a77a4efba6c6aa8
Submitter: "Zuul (22348)"
Branch: master

commit 94ec35ff2d5363d3816f6d267a77a4efba6c6aa8
Author: Zhixiong Chi <email address hidden>
Date: Wed Apr 14 23:28:03 2021 -0400

    Increase min_free_kbytes to 256M for storage to avoid OOM issue

    Helps prevent the OOM issue where memory allocation fails with the
    error message 'page allocation failure: order:2, mode:0x104020'.

    As the min_free_kbytes in the linux documentation shows:
    This is used to force the Linux VM to keep a minimum number
    of kilobytes free. The VM uses this number to compute a
    watermark[WMARK_MIN] value for each lowmem zone in the system.
    Each lowmem zone gets a number of reserved free pages based
    proportionally on its size.

    Keeping more memory free in those zones means that the os itself is
    less likely to run out of memory during high memory pressure and high
    allocation events.

    Since the issue has so far only occurred on storage nodes, we only
    update the value for the storage node.

    Closes-Bug: #1924209

    Change-Id: Iae2e5a0787f69c62ba5da53663371fd2be148e15
    Signed-off-by: Zhixiong Chi <email address hidden>
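
For reference, the merged change raises vm.min_free_kbytes to 256 MiB on storage nodes through the stx-puppet sysctl configuration (see the review linked above). A runtime equivalent, sketched here for illustration rather than as the exact puppet resource (the drop-in file name is hypothetical), would be:

  # Apply immediately (non-persistent); 256 MiB = 262144 kB
  sysctl -w vm.min_free_kbytes=262144

  # Or persist it via a sysctl drop-in, then reload
  echo 'vm.min_free_kbytes = 262144' > /etc/sysctl.d/90-platform-storage.conf
  sysctl --system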

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792013

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792018

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792029

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792029
Committed: https://opendev.org/starlingx/stx-puppet/commit/2b026190a3cb6d561b6ec4a46dfb3add67f1fa69
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 3e3940824dfb830ebd39fd93265b983c6a22fc51
Author: Dan Voiculeasa <email address hidden>
Date: Thu May 13 18:03:45 2021 +0300

    Enable kubelet support for pod pid limit

    Enable limiting the number of pids inside of pods.

    Add a default value to protect against a missing value.
    Default to 750 pids limit to align with service parameter default
    value for most resource consuming StarlingX optional app (openstack).
    In fact any value above service parameter minimum value is good for the
    default.

    Closes-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I10c1684fe3145e0a46b011f8e87f7a23557ddd4a

commit 0c16d288fbc483103b7ba5dad7782e97f59f4e17
Author: Jessica Castelino <email address hidden>
Date: Tue May 11 10:21:57 2021 -0400

    Safe restart of the etcd SM service in etcd upgrade runtime class

    While upgrading the central cloud of a DC system, activation failed
    because there was an unexpected SWACT to controller-1. This was due
    to the etcd upgrade script. Part of this script runs the etcd
    manifest. This triggers a reload/restart of the etcd service. As this
    is done outside of the sm, sm saw the process failure and triggered
    the SWACT.

    This commit modifies platform::etcd::upgrade::runtime puppet class
    to do a safe restart of the etcd SM service and thus, solve the
    issue.

    Change-Id: I3381b6976114c77ee96028d7d96a00302ad865ec
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1928135

commit eec3008f600aeeb69a42338ed44332228a862d11
Author: Mihnea Saracin <email address hidden>
Date: Mon May 10 13:09:52 2021 +0300

    Serialize updates to global_filter in the AIO manifest

    Right now, looking at the aio manifest:
    https://review.opendev.org/c/starlingx/stx-puppet/+/780600/15/puppet-manifests/src/manifests/aio.pp
    there are 3 classes that update
    in parallel the lvm global_filter:
    - include ::platform::lvm::controller
    - include ::platform::worker::storage
    - include ::platform::lvm::compute
    And this generates some errors.

    We fix this by adding dependencies between the above classes
    in order to update the global_filter in a serial mode.

    Closes-Bug: 1927762
    Signed-off-by: Mihnea Saracin <email address hidden>
    Change-Id: If6971e520454cdef41138b2f29998c036d8307ff

commit 97371409b9b2ae3f0db6a6a0acaeabd74927160e
Author: Steven Webster <email address hidden>
Date: Fri May 7 15:33:43 2021 -0400

    Add SR-IOV rate-limit dependency

    Currently, the binding of an SR-IOV virtual function (VF) to a
    driver has a dependency on platform::networking. This is needed
    to ensure that SR-IOV is enabled (VFs created) before actually
    doing the bind.

    This dependency does not exist for configuring the VF rate-limits
    however. There is a cha...

tags: added: in-f-centos8