Activity log for bug #1924209

Date Who What changed Old value New value Message
2021-04-15 03:20:18 Zhixiong Chi bug added bug
2021-04-15 03:20:33 Zhixiong Chi starlingx: assignee Zhixiong Chi (zhixiongchi)
2021-04-15 04:33:51 OpenStack Infra starlingx: status New In Progress
2021-04-16 14:02:59 Ghada Khalil tags stx.distro.other
2021-04-16 14:03:49 Ghada Khalil description
Old value: same as the new value below, except that "Branch/Pull Time/Commit" and "Last Pass" were blank.
New value:

Brief Description
-----------------
storage-0 suddenly went offline because the 'bnx2x' NIC driver on storage-0 continuously failed to allocate memory ("page allocation failure: order:2, mode:0x104020"). As a result, storage-0's Mgmnt and Infra heartbeats were lost (both networks use the same NICs on that host) and storage-0 was rebooted by controller-0; after that, the CPU usage '%wait' of some VMs became very high. After storage-0 came back online, some Ceph I/O requests became slow.

Severity
--------
Major.

Steps to Reproduce
------------------
Happened suddenly on the customer's board; could not be reproduced internally.

Expected Behavior
-----------------
storage-0 continues to work well, even after a couple of such occurrences.

Actual Behavior
---------------
storage-0 went offline due to the memory allocation failures; after storage-0 came back online, some Ceph I/O requests became slow.

Reproducibility
---------------
Occurred a couple of times.

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
any

Last Pass
---------
unknown

Timestamp/Logs
--------------
At about 2020-12-10 15:40, the 'bnx2x' NIC driver on storage-0 continuously failed to allocate memory ("page allocation failure: order:2, mode:0x104020"). This caused storage-0's Mgmnt and Infra heartbeats to be lost and storage-0 to be rebooted by controller-0, because both networks use the same NICs on that host.

2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829312] warn_alloc_failed: 515 callbacks suppressed
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829316] ms_pipe_read: page allocation failure: order:2, mode:0x104020
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829319] CPU: 0 PID: 84434 Comm: ms_pipe_read Tainted: G O ------------ 3.10.0-693.21.1.el7.39.tis.x86_64 #1
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829320] Hardware name: Dell Inc. PowerEdge FC430/00TXH1, BIOS 2.8.0 05/23/2018
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829321] Call Trace:
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829323] <IRQ> [<ffffffff817826ff>] dump_stack+0x19/0x1b
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829334] [<ffffffff8115d600>] warn_alloc_failed+0x110/0x180
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829337] [<ffffffff8177e360>] __alloc_pages_slowpath+0x69c/0x70a
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829340] [<ffffffff811616cb>] __alloc_pages_nodemask+0x43b/0x460
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829345] [<ffffffff811a4cc8>] alloc_pages_current+0x98/0x110
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829347] [<ffffffff8115c50e>] __get_free_pages+0xe/0x40
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829349] [<ffffffff811b02ee>] kmalloc_order_trace+0x2e/0xe0
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829351] [<ffffffff811b3c7f>] __kmalloc+0x1ff/0x290
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829391] [<ffffffffc04a95ba>] bnx2x_frag_alloc.isra.62+0x2a/0x40 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829399] [<ffffffffc04a9b14>] bnx2x_alloc_rx_data.isra.70+0x54/0x1d0 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829406] [<ffffffffc04aaaeb>] bnx2x_rx_int+0x86b/0x19b0 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829410] [<ffffffff816586e9>] ? consume_skb+0x39/0xd0
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829415] [<ffffffff81544d56>] ? ata_qc_complete+0xc6/0x340
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829429] [<ffffffffc04adacd>] bnx2x_poll+0x1dd/0x260 [bnx2x]
2020-12-10T15:40:32.990 storage-0 kernel: warning [31652469.829449] [<ffffffff8166d89c>] net_rx_action+0x18c/0x3d0
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829453] [<ffffffff8107b68a>] __do_softirq+0x11a/0x320
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829459] [<ffffffff8179428c>] call_softirq+0x1c/0x30
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829462] [<ffffffff8101de15>] do_softirq+0x65/0xa0
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829469] [<ffffffff8107ba6d>] irq_exit+0xed/0x120
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829476] [<ffffffff817952a6>] do_IRQ+0x56/0xf0
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829487] [<ffffffff817902a2>] common_interrupt+0x162/0x162
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829493] <EOI> [<ffffffff81160d4a>] ? get_page_from_freelist+0x3fa/0x940
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829500] [<ffffffff8116140e>] __alloc_pages_nodemask+0x17e/0x460
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829504] [<ffffffff8116554b>] ? pagevec_lru_move_fn+0xeb/0x110
2020-12-10T15:40:32.991 storage-0 kernel: warning [31652469.829508] [<ffffffff811a7a3a>] alloc_pages_vma+0x9a/0x170
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829512] [<ffffffff811863d4>] handle_mm_fault+0x994/0xdc0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829517] [<ffffffff810580de>] __do_page_fault+0x14e/0x3a0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829520] [<ffffffff810583a7>] do_page_fault+0x37/0xa0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829523] [<ffffffff81790648>] page_fault+0x28/0x30
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829527] [<ffffffff81331519>] ? copy_user_enhanced_fast_string+0x9/0x20
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829531] [<ffffffff81337d5a>] ? memcpy_toiovec+0x4a/0x90
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829534] [<ffffffff8165f5ba>] skb_copy_datagram_iovec+0x11a/0x2a0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829538] [<ffffffff816beada>] tcp_recvmsg+0x24a/0xb50
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829543] [<ffffffff816edc54>] inet_recvmsg+0xa4/0xe0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829548] [<ffffffff816509b5>] sock_recvmsg+0xc5/0x100
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829550] [<ffffffff8116140e>] ? __alloc_pages_nodemask+0x17e/0x460
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829553] [<ffffffff81650b28>] SYSC_recvfrom+0xe8/0x160
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829559] [<ffffffff810d5112>] ? ktime_get_ts64+0x52/0xf0
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829562] [<ffffffff8165130e>] SyS_recvfrom+0xe/0x10
2020-12-10T15:40:32.992 storage-0 kernel: warning [31652469.829565] [<ffffffff81791015>] system_call_fastpath+0x1c/0x21

2020-12-10T15:40:33.200 [69138.11065] controller-0 hbsAgent — nodeClass.cpp (8193) lost_pulses : Info : storage-0 Mgmnt Pulse Miss ( 2) (max: 1)
2020-12-10T15:40:33.200 [69138.11066] controller-0 hbsAgent — nodeClass.cpp (8193) lost_pulses : Info : storage-0 Infra Pulse Miss ( 2) (max: 1)
2020-12-10T15:40:33.306 [69138.11067] controller-0 hbsAgent — nodeClass.cpp (8193) lost_pulses : Info : storage-0 Mgmnt Pulse Miss ( 3) (max: 2)
......
2020-12-10T15:40:36.369 [69138.11099] controller-0 hbsAgent — nodeClass.cpp (8156) lost_pulses : Info : storage-0 Mgmnt Pulse Miss (10) (log throttled to every 4095)
2020-12-10T15:40:36.369 [69138.11100] controller-0 hbsAgent — nodeClass.cpp (8282) lost_pulses :Error : storage-0 Mgmnt -> FAILED
2020-12-10T15:40:36.369 [69138.11101] controller-0 hbsAgent — nodeClass.cpp (8284) lost_pulses :Error : storage-0 Mgmnt *** Heartbeat Loss ***
......
2020-12-10T15:40:36.381 [69138.11103] controller-0 hbsAgent — nodeClass.cpp (8156) lost_pulses : Info : storage-0 Infra Pulse Miss (10) (log throttled to every 4095)
2020-12-10T15:40:36.381 [69138.11104] controller-0 hbsAgent — nodeClass.cpp (8282) lost_pulses :Error : storage-0 Infra -> FAILED
2020-12-10T15:40:36.381 [69138.11105] controller-0 hbsAgent — nodeClass.cpp (8284) lost_pulses :Error : storage-0 Infra *** Heartbeat Loss ***

2020-12-10T15:40:33.412 [67760.63906] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 132) mnfa_add_host : Info : storage-0 MNFA new candidate (Mgmnt) 15 enabled hosts (threshold:2) (1:Mgmnt:16) (0:Infra:11)
2020-12-10T15:40:33.523 [67760.63907] controller-0 mtcAgent — nodeClass.cpp (4326) hbs_minor_clear : Info : storage-0 MNFA removed from pool
2020-12-10T15:40:33.523 [67760.63908] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 168) mnfa_recover_host : Info : storage-0 graceful recovery from MNFA
2020-12-10T15:40:33.523 [67760.63909] controller-0 mtcAgent — nodeClass.cpp (4380) hbs_minor_clear : Warn : storage-0 MNFA host tally (Mgmnt:0)
2020-12-10T15:40:33.523 [67760.63910] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1545) recovery_handler : Info : storage-0 MNFA Graceful Recovery (uptime was 31652216)
2020-12-10T15:40:33.523 [67760.63911] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 792) send_hbs_command : Info : storage-0 sending 'stop' to heartbeat service
2020-12-10T15:40:33.523 [67760.63912] controller-0 mtcAgent inv mtcInvApi.cpp ( 341) mtcInvApi_update_task : Info : storage-0 Task: Graceful Recovery (seq:186962)
2020-12-10T15:40:33.529 [67760.63913] controller-0 mtcAgent |-| nodeClass.cpp (2424) start_offline_handler : Info : storage-0 starting offline handler (unlocked-enabled-available) (stage:0)
2020-12-10T15:40:33.529 [67760.63914] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1625) recovery_handler : Info : storage-0 requesting mtcAlive with 5 sec timeout
2020-12-10T15:40:33.529 [67760.63915] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (3135) offline_handler : Info : storage-0 starting 100 msec offline audit (enabled-available)
2020-12-10T15:40:33.529 [67760.63916] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1643) recovery_handler : Info : storage-0 got requested mtcAlive
2020-12-10T15:40:33.529 [67760.63917] controller-0 mtcAgent |-| nodeClass.cpp (2437) stop_offline_handler : Info : storage-0 stopping offline handler (unlocked-enabled-available) (stage:3)
2020-12-10T15:40:33.529 [67760.63918] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1674) recovery_handler : Warn : storage-0 Connectivity Recovered ; host did not reset
2020-12-10T15:40:33.529 [67760.63919] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1675) recovery_handler : Warn : storage-0 ... continuing with MNFA graceful recovery
2020-12-10T15:40:33.529 [67760.63920] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (1676) recovery_handler : Warn : storage-0 ... with no affect to host services
2020-12-10T15:40:33.529 [67760.63921] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 796) send_hbs_command : Info : storage-0 sending 'start' to heartbeat service
2020-12-10T15:40:33.529 [67760.63922] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (2379) recovery_handler : Info : storage-0 Starting 11 sec Heartbeat Soak (with ready event)
2020-12-10T15:40:35.734 [67760.63923] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 132) mnfa_add_host : Info : storage-0 MNFA new candidate (Mgmnt) 15 enabled hosts (threshold:2) (1:Mgmnt:17) (0:Infra:11)
2020-12-10T15:40:35.945 [67760.63924] controller-0 mtcAgent — nodeClass.cpp (4608) manage_heartbeat_degrade: Warn : storage-0 Mgmnt *** Heartbeat Miss ***
2020-12-10T15:40:35.945 [67760.63925] controller-0 mtcAgent — nodeClass.cpp (4447) manage_service_degrade : Warn : storage-0 'Mgmnt' degrade assert
2020-12-10T15:40:35.946 [67760.63926] controller-0 mtcAgent inv mtcInvApi.cpp (1086) mtcInvApi_update_state : Info : storage-0 degraded (seq:186963)
2020-12-10T15:40:35.946 [67760.63927] controller-0 mtcAgent — nodeClass.cpp (4608) manage_heartbeat_degrade: Warn : storage-0 Infra *** Heartbeat Miss ***
2020-12-10T15:40:35.946 [67760.63928] controller-0 mtcAgent — mtcNodeMnfa.cpp ( 132) mnfa_add_host : Info : storage-0 MNFA new candidate (Infra) 15 enabled hosts (threshold:2) (1:Mgmnt:17) (1:Infra:12)
2020-12-10T15:40:35.946 [67760.63929] controller-0 mtcAgent — nodeClass.cpp (4447) manage_service_degrade : Warn : storage-0 'Infra' degrade assert
......
2020-12-10T15:40:36.407 [67760.63976] controller-0 mtcAgent |-| nodeClass.cpp (2424) start_offline_handler : Info : storage-0 starting offline handler (unlocked-disabled-failed) (stage:0)
2020-12-10T15:40:36.407 [67760.63977] controller-0 mtcAgent hdl mtcNodeHdlrs.cpp (3135) offline_handler : Info : storage-0 starting 100 msec offline audit (disabled-failed)
2020-12-10T15:40:36.408 [67760.63978] controller-0 mtcAgent |-| mtcCmdHdlr.cpp ( 368) cmd_handler : Info : storage-0 Performing REBOOT (mgmnt network)
2020-12-10T15:40:36.408 [67760.63979] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 636) send_mtc_cmd : Info : storage-0 sending 'reboot' request (Mgmnt network)
2020-12-10T15:40:36.408 [67760.63980] controller-0 mtcAgent |-| mtcCmdHdlr.cpp ( 381) cmd_handler : Info : storage-0 Performing REBOOT (infra network)
2020-12-10T15:40:36.408 [67760.63981] controller-0 mtcAgent msg mtcCtrlMsg.cpp ( 636) send_mtc_cmd : Info : storage-0 sending 'reboot' request (Infra network)
2020-12-10T15:40:36.408 [67760.63982] controller-0 mtcAgent cmd mtcCmdHdlr.cpp ( 399) cmd_handler : Info : storage-0 waiting for REBOOT ACK
2020-12-10T15:40:36.409 [67760.63983] controller-0 mtcAgent |-| mtcCmdHdlr.cpp ( 433) cmd_handler : Info : storage-0 REBOOT Request Succeeded

After storage-0 went online again, some Ceph I/O requests became slow; this is likely why the CPU usage '%wait' of some VMs became very high.

2020-12-10T15:44:59.366 [67760.64015] controller-0 mtcAgent vim mtcVimApi.cpp ( 259) mtcVimApi_state_change : Info : storage-0 sending 'host' state change to vim (enabled)
2020-12-10T15:44:59.366 [67760.64016] controller-0 mtcAgent |-| mtcNodeHdlrs.cpp (1476) enable_handler : Info : storage-0 is ENABLED
2020-12-10T15:44:59.372 [67760.64017] controller-0 mtcAgent inv mtcInvApi.cpp (1086) mtcInvApi_update_state : Info : storage-0 available (seq:186985)
2020-12-10 15:45:39.941555 osd.9 172.16.2.192:6828/38606 2734 : cluster [WRN] 4 slow requests, 4 included below; oldest blocked for > 30.415315 secs
2020-12-10 15:45:39.941572 osd.9 172.16.2.192:6828/38606 2735 : cluster [WRN] slow request 30.415315 seconds old, received at 2020-12-10 15:45:09.525436: osd_op(client.556459574.0:1949887 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1552384~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for subops from 2
2020-12-10 15:45:39.941581 osd.9 172.16.2.192:6828/38606 2736 : cluster [WRN] slow request 30.415272 seconds old, received at 2020-12-10 15:45:09.525479: osd_op(client.556459574.0:1949888 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1601536~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for rw locks
2020-12-10 15:45:39.941590 osd.9 172.16.2.192:6828/38606 2737 : cluster [WRN] slow request 30.415140 seconds old, received at 2020-12-10 15:45:09.525611: osd_op(client.556459574.0:1949889 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1605632~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for rw locks
2020-12-10 15:45:39.941594 osd.9 172.16.2.192:6828/38606 2738 : cluster [WRN] slow request 30.415085 seconds old, received at 2020-12-10 15:45:09.525666: osd_op(client.556459574.0:1949890 1.34060a0a rbd_data.56bbea349dcfa8.000000000000001c [stat,write 1613824~4096] snapc a3a=[a3a,a39,a38,a37,a36,a35,a34,a33,a32,a31,a30,a2f,a2e,a2d,a2c,a2b,a2a,a29,a28] ack+ondisk+write+known_if_redirected e4594) currently waiting for rw locks

1) Kernel error 'page allocation failure: order:2, mode:0x104020'. It has been occurring quite regularly since 2019-06-06, and most of the stack traces also mention the bnx2x NIC. We should find out whether there is a defect that needs to be fixed (for example a memory leak), or whether there is a solution/workaround to avoid it.

Performing a tally of which process experienced this:

grep -oE "[^ ]* page allocation failure" Downloads/kern.log.storage-0.kacbhi1 | sort | uniq -c | sort -n
      1 jbd2/dm-1-8: page allocation failure
      1 kworker/0:0: page allocation failure
      2 hbsClient: page allocation failure
      2 systemd-journal: page allocation failure
      2 tp_osd_recov: page allocation failure
     10 journal_write: page allocation failure
     10 kworker/0:1: page allocation failure
     10 ms_pipe_write: page allocation failure
     12 tp_fstore_op: page allocation failure
     14 tp_osd_tp: page allocation failure
     22 ms_accepter: page allocation failure
     41 journal_wrt_fin: page allocation failure
     57 syslog-ng: page allocation failure
    120 ksoftirqd/0: page allocation failure
    356 ms_pipe_read: page allocation failure
    540 swapper/0: page allocation failure
    569 kswapd0: page allocation failure

Test Activity
-------------

Workaround
----------

(A related diagnostic sketch appears after the activity log, below.)
2021-04-17 02:28:00 Ghada Khalil starlingx: importance Undecided Low
2021-05-05 21:47:19 OpenStack Infra starlingx: status In Progress Fix Released
2021-06-03 18:36:22 OpenStack Infra tags stx.distro.other in-f-centos8 stx.distro.other
2021-06-03 18:36:23 OpenStack Infra cve linked 2020-15705
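
Related to the page-allocation-failure tally in the 2021-04-16 description change above, the following is a minimal shell sketch for further triage. It assumes the same kern.log file referenced in the report (the path is illustrative) and a standard Linux /proc and sysctl interface on the affected node; the vm.min_free_kbytes adjustment mentioned at the end is a commonly cited mitigation for failed higher-order allocations in the network receive path, not a fix confirmed for this bug.

# Assumption: LOG points at the kern.log referenced in the report (path is illustrative)
LOG=Downloads/kern.log.storage-0.kacbhi1

# Per-process tally (same command as in the description)
grep -oE "[^ ]* page allocation failure" "$LOG" | sort | uniq -c | sort -n

# Per-order/mode tally, to confirm the failures are all order-2 requests like the one above
grep -oE "page allocation failure: order:[0-9]+, mode:0x[0-9a-f]+" "$LOG" | sort | uniq -c | sort -n

# Snapshot of free-memory fragmentation on the affected node; near-zero counts in the
# higher-order columns mean few order>=2 (16 KiB+) contiguous blocks are free
cat /proc/buddyinfo

# Commonly cited mitigation (assumption, not verified for this system): reserve more free
# memory so higher-order allocations from the receive path are less likely to fail, e.g.
#   sysctl -w vm.min_free_kbytes=<larger-than-current value>
sysctl vm.min_free_kbytes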