Here are traces displayed on the KVM host console before it hung:
================================================================
[29282.720336] mlx5_core 0000:01:03.3: assert_var[0] 0x0000004b [29282.720454] mlx5_core 0000:01:03.3: assert_var[1] 0x00004b81 [29282.720574] mlx5_core 0000:01:03.3: assert_var[2] 0x00000000 [29282.720693] mlx5_core 0000:01:03.3: assert_var[3] 0x00000000 [29282.720813] mlx5_core 0000:01:03.3: assert_var[4] 0x00000000 [29282.720933] mlx5_core 0000:01:03.3: assert_exit_ptr 0x00602a78 [29282.721052] mlx5_core 0000:01:03.3: assert_callra 0x00602b84 [29282.721172] mlx5_core 0000:01:03.3: fw_ver 12.17.1010 [29282.721268] mlx5_core 0000:01:03.3: hw_id 0x00000209 [29282.721365] mlx5_core 0000:01:03.3: irisc_index 3 [29282.721429] mlx5_core 0000:01:03.3: synd 0xe: Invalid EQ refrenced [29282.721499] mlx5_core 0000:01:03.3: ext_synd 0x0001 [29282.721967] mlx5_core 0000:01:0c.0: handling bad device here [29282.722041] mlx5_core 0000:01:0c.0: 0000:01:0c.0:mlx5_enter_error_state:115:(pid 48269): start [29282.722136] mlx5_core 0000:01:0c.0: 0000:01:0c.0:mlx5_enter_error_state:120:(pid 48269): end [29282.729501] mlx5_core 0000:01:02.7: handling bad device here [29282.729578] mlx5_core 0000:01:02.7: 0000:01:02.7:mlx5_enter_error_state:115:(pid 48296): start [29282.729675] mlx5_core 0000:01:02.7: 0000:01:02.7:mlx5_enter_error_state:120:(pid 48296): end [29282.731331] mlx5_core 0000:01:03.3: handling bad device here [29282.731459] mlx5_core 0000:01:03.3: 0000:01:03.3:mlx5_enter_error_state:115:(pid 48315): start [29282.731633] mlx5_core 0000:01:03.3: 0000:01:03.3:mlx5_enter_error_state:120:(pid 48315): end [29282.969491] mlx5_core 0000:01:09.7: device's health compromised - reached miss count [29282.969641] mlx5_core 0000:01:09.7: NIC was disabled [29282.969698] mlx5_core 0000:01:09.7: assert_var[0] 0x0000004b [29282.969764] mlx5_core 0000:01:09.7: assert_var[1] 0x00004b81 [29282.969831] mlx5_core 0000:01:09.7: assert_var[2] 0x00000000 [29282.969897]
mlx5_core 0000:01:09.7: assert_var[3] 0x00000000 [29282.969963] mlx5_core 0000:01:09.7: assert_var[4] 0x00000000 [29282.970029] mlx5_core 0000:01:09.7: assert_exit_ptr 0x00602a78 [29282.970095] mlx5_core 0000:01:09.7: assert_callra 0x00602b84 [29282.970164] mlx5_core 0000:01:09.7: fw_ver 12.17.1010 [29282.970218] mlx5_core 0000:01:09.7: hw_id 0x00000209 [29282.970271] mlx5_core 0000:01:09.7: irisc_index 3 [29282.970327] mlx5_core 0000:01:09.7: synd 0xe: Invalid EQ refrenced [29282.970394] mlx5_core 0000:01:09.7: ext_synd 0x0001 [29282.970459] mlx5_core 0000:01:0a.4: device's health compromised - reached miss count [29282.970464] mlx5_core 0000:01:09.7: handling bad device here [29282.970471] mlx5_core 0000:01:09.7: 0000:01:09.7:mlx5_enter_error_state:115:(pid 48321): start [29282.970488] mlx5_core 0000:01:09.7: 0000:01:09.7:mlx5_enter_error_state:120:(pid 48321): end [29282.970771] mlx5_core 0000:01:0a.4: NIC was disabled [29282.970827] mlx5_core 0000:01:0a.4: assert_var[0] 0x0000004b [29282.970893] mlx5_core 0000:01:0a.4: assert_var[1] 0x00004b81 [29282.970958] mlx5_core 0000:01:0a.4: assert_var[2] 0x00000000 [29282.971022] mlx5_core 0000:01:0a.4: assert_var[3] 0x00000000 [29282.971088] mlx5_core 0000:01:0a.4: assert_var[4] 0x00000000 [29282.971153] mlx5_core 0000:01:0a.4: assert_exit_ptr 0x00602a78 [29282.971218] mlx5_core 0000:01:0a.4: assert_callra 0x00602b84 [29282.971284] mlx5_core 0000:01:0a.4: fw_ver 12.17.1010 [29282.971337] mlx5_core 0000:01:0a.4: hw_id 0x00000209 [29282.971390] mlx5_core 0000:01:0a.4: irisc_index 3 [29282.971446] mlx5_core 0000:01:0a.4: synd 0xe: Invalid EQ refrenced [29282.971511] mlx5_core 0000:01:0a.4: ext_synd 0x0001 [29282.971568] mlx5_core 0000:01:08.6: device's health compromised - reached miss count [29282.971577] mlx5_core 0000:01:0a.4: handling bad device here [29282.971584] mlx5_core 0000:01:0a.4: 0000:01:0a.4:mlx5_enter_error_state:115:(pid 48318): start [29282.971597] mlx5_core 0000:01:0a.4: 
0000:01:0a.4:mlx5_enter_error_state:120:(pid 48318): end [29282.971882] mlx5_core 0000:01:08.6: NIC was disabled [29282.971938] mlx5_core 0000:01:08.6: assert_var[0] 0x0000004b [29282.972003] mlx5_core 0000:01:08.6: assert_var[1] 0x00004b81 [29282.972069] mlx5_core 0000:01:08.6: assert_var[2] 0x00000000 [29282.972135] mlx5_core 0000:01:08.6: assert_var[3] 0x00000000 [29282.972200] mlx5_core 0000:01:08.6: assert_var[4] 0x00000000 [29282.972265] mlx5_core 0000:01:08.6: assert_exit_ptr 0x00602a78 [29282.972330] mlx5_core 0000:01:08.6: assert_callra 0x00602b84 [29282.972396] mlx5_core 0000:01:08.6: fw_ver 12.17.1010 [29282.972448] mlx5_core 0000:01:08.6: hw_id 0x00000209 [29282.972501] mlx5_core 0000:01:08.6: irisc_index 3 [29282.972580] mlx5_core 0000:01:08.6: synd 0xe: Invalid EQ refrenced [29282.972698] mlx5_core 0000:01:08.6: ext_synd 0x0001 [29282.972804] mlx5_core 0000:01:08.6: handling bad device here [29282.972935] mlx5_core 0000:01:08.6: 0000:01:08.6:mlx5_enter_error_state:115:(pid 48316): start [29282.973102] mlx5_core 0000:01:08.6: 0000:01:08.6:mlx5_enter_error_state:120:(pid 48316): end [29283.229473] mlx5_core 0000:01:04.1: device's health compromised - reached miss count [29283.229604] mlx5_core 0000:01:04.1: NIC was disabled [29283.229653] mlx5_core 0000:01:04.1: assert_var[0] 0x0000004b [29283.229711] mlx5_core 0000:01:04.1: assert_var[1] 0x00004b81 [29283.229768] mlx5_core 0000:01:04.1: assert_var[2] 0x00000000 [29283.229827] mlx5_core 0000:01:04.1: assert_var[3] 0x00000000 [29283.229884] mlx5_core 0000:01:04.1: assert_var[4] 0x00000000 [29283.229942] mlx5_core 0000:01:04.1: assert_exit_ptr 0x00602a78 [29283.229998] mlx5_core 0000:01:04.1: assert_callra 0x00602b84 [29283.230055] mlx5_core 0000:01:04.1: fw_ver 12.17.1010 [29283.230102] mlx5_core 0000:01:04.1: hw_id 0x00000209 [29283.230148] mlx5_core 0000:01:04.1: irisc_index 3 [29283.230196] mlx5_core 0000:01:04.1: synd 0xe: Invalid EQ refrenced [29283.230254] mlx5_core 0000:01:04.1: ext_synd 0x0001 
[29283.230304] mlx5_core 0000:01:02.5: device's health compromised - reached miss count [29283.230313] mlx5_core 0000:01:04.1: handling bad device here [29283.230322] mlx5_core 0000:01:04.1: 0000:01:04.1:mlx5_enter_error_state:115:(pid 48185): start [29283.230342] mlx5_core 0000:01:04.1: 0000:01:04.1:mlx5_enter_error_state:120:(pid 48185): end [29283.230582] mlx5_core 0000:01:02.5: NIC was disabled [29283.230630] mlx5_core 0000:01:02.5: assert_var[0] 0x0000004b [29283.230686] mlx5_core 0000:01:02.5: assert_var[1] 0x00004b81 [29283.230743] mlx5_core 0000:01:02.5: assert_var[2] 0x00000000 [29283.230800] mlx5_core 0000:01:02.5: assert_var[3] 0x00000000 [29283.230856] mlx5_core 0000:01:02.5: assert_var[4] 0x00000000 [29283.230913] mlx5_core 0000:01:02.5: assert_exit_ptr 0x00602a78 [29283.230970] mlx5_core 0000:01:02.5: assert_callra 0x00602b84 [29283.231028] mlx5_core 0000:01:02.5: fw_ver 12.17.1010 [29283.231073] mlx5_core 0000:01:02.5: hw_id 0x00000209 [29283.231119] mlx5_core 0000:01:02.5: irisc_index 3 [29283.231169] mlx5_core 0000:01:02.5: synd 0xe: Invalid EQ refrenced [29283.231226] mlx5_core 0000:01:02.5: ext_synd 0x0001 [29283.231288] mlx5_core 0000:01:02.5: handling bad device here [29283.231362] mlx5_core 0000:01:02.5: 0000:01:02.5:mlx5_enter_error_state:115:(pid 48320): start [29283.231455] mlx5_core 0000:01:02.5: 0000:01:02.5:mlx5_enter_error_state:120:(pid 48320): end [29283.481490] mlx5_core 0000:01:0b.4: device's health compromised - reached miss count [29283.481587] mlx5_core 0000:01:0b.4: NIC was disabled [29283.481625] mlx5_core 0000:01:0b.4: assert_var[0] 0x0000004b [29283.481698] mlx5_core 0000:01:0b.4: assert_var[1] 0x00004b81 [29283.481790] mlx5_core 0000:01:0b.4: assert_var[2] 0x00000000 [29283.481856] mlx5_core 0000:01:0b.4: assert_var[3] 0x00000000 [29283.481955] mlx5_core 0000:01:0b.4: assert_var[4] 0x00000000 [29283.482067] mlx5_core 0000:01:0b.4: assert_exit_ptr 0x00602a78 [29283.482147] mlx5_core 0000:01:0b.4: assert_callra 0x00602b84 
[29283.482250] mlx5_core 0000:01:0b.4: fw_ver 12.17.1010 [29283.482340] mlx5_core 0000:01:0b.4: hw_id 0x00000209 [29283.482440] mlx5_core 0000:01:0b.4: irisc_index 3 [29283.482542] mlx5_core 0000:01:0b.4: synd 0xe: Invalid EQ refrenced [29283.482639] mlx5_core 0000:01:0b.4: ext_synd 0x0001 [29283.482701] mlx5_core 0000:01:0b.7: device's health compromised - reached miss count [29283.482707] mlx5_core 0000:01:0b.4: handling bad device here [29283.482744] mlx5_core 0000:01:0b.4: 0000:01:0b.4:mlx5_enter_error_state:115:(pid 48270): start [29283.482761] mlx5_core 0000:01:0b.4: 0000:01:0b.4:mlx5_enter_error_state:120:(pid 48270): end [29283.483043] mlx5_core 0000:01:0b.7: NIC was disabled [29283.483148] mlx5_core 0000:01:0b.7: assert_var[0] 0x0000004b [29283.483254] mlx5_core 0000:01:0b.7: assert_var[1] 0x00004b81 [29283.483354] mlx5_core 0000:01:0b.7: assert_var[2] 0x00000000 [29283.483464] mlx5_core 0000:01:0b.7: assert_var[3] 0x00000000 [29283.483579] mlx5_core 0000:01:0b.7: assert_var[4] 0x00000000 [29283.483691] mlx5_core 0000:01:0b.7: assert_exit_ptr 0x00602a78 [29283.483809] mlx5_core 0000:01:0b.7: assert_callra 0x00602b84 [29283.483931] mlx5_core 0000:01:0b.7: fw_ver 12.17.1010 [29283.484042] mlx5_core 0000:01:0b.7: hw_id 0x00000209 [29283.484157] mlx5_core 0000:01:0b.7: irisc_index 3 [29283.484272] mlx5_core 0000:01:0b.7: synd 0xe: Invalid EQ refrenced [29283.484388] mlx5_core 0000:01:0b.7: ext_synd 0x0001 [29283.484444] mlx5_core 0000:01:03.2: device's health compromised - reached miss count [29283.484454] mlx5_core 0000:01:0b.7: handling bad device here [29283.484503] mlx5_core 0000:01:0b.7: 0000:01:0b.7:mlx5_enter_error_state:115:(pid 48323): start [29283.484517] mlx5_core 0000:01:0b.7: 0000:01:0b.7:mlx5_enter_error_state:120:(pid 48323): end [29283.484973] mlx5_core 0000:01:03.2: NIC was disabled [29283.485088] mlx5_core 0000:01:03.2: assert_var[0] 0x0000004b [29283.485205] mlx5_core 0000:01:03.2: assert_var[1] 0x00004b81 [29283.485322] mlx5_core 
0000:01:03.2: assert_var[2] 0x00000000 [29283.485441] mlx5_core 0000:01:03.2: assert_var[3] 0x00000000 [29283.485571] mlx5_core 0000:01:03.2: assert_var[4] 0x00000000 [29283.485689] mlx5_core 0000:01:03.2: assert_exit_ptr 0x00602a78 [29283.485809] mlx5_core 0000:01:03.2: assert_callra 0x00602b84 [29283.485929] mlx5_core 0000:01:03.2: fw_ver 12.17.1010 [29283.486026] mlx5_core 0000:01:03.2: hw_id 0x00000209 [29283.486159] mlx5_core 0000:01:03.2: irisc_index 3 [29283.486282] mlx5_core 0000:01:03.2: synd 0xe: Invalid EQ refrenced [29283.486399] mlx5_core 0000:01:03.2: ext_synd 0x0001 [29283.489019] mlx5_core 0000:01:03.2: handling bad device here [29283.489097] mlx5_core 0000:01:03.2: 0000:01:03.2:mlx5_enter_error_state:115:(pid 48317): start [29283.489198] mlx5_core 0000:01:03.2: 0000:01:03.2:mlx5_enter_error_state:120:(pid 48317): end [29283.737502] mlx5_core 0000:01:0a.6: device's health compromised - reached miss count [29283.737653] mlx5_core 0000:01:0a.6: NIC was disabled [29283.737710] mlx5_core 0000:01:0a.6: assert_var[0] 0x0000004b [29283.737776] mlx5_core 0000:01:0a.6: assert_var[1] 0x00004b81 [29283.737844] mlx5_core 0000:01:0a.6: assert_var[2] 0x00000000 [29283.737910] mlx5_core 0000:01:0a.6: assert_var[3] 0x00000000 [29283.737976] mlx5_core 0000:01:0a.6: assert_var[4] 0x00000000 [29283.738042] mlx5_core 0000:01:0a.6: assert_exit_ptr 0x00602a78 [29283.738108] mlx5_core 0000:01:0a.6: assert_callra 0x00602b84 [29283.738175] mlx5_core 0000:01:0a.6: fw_ver 12.17.1010 [29283.738231] mlx5_core 0000:01:0a.6: hw_id 0x00000209 [29283.738284] mlx5_core 0000:01:0a.6: irisc_index 3 [29283.738340] mlx5_core 0000:01:0a.6: synd 0xe: Invalid EQ refrenced [29283.738406] mlx5_core 0000:01:0a.6: ext_synd 0x0001 [29283.738473] mlx5_core 0000:01:0a.6: handling bad device here [29283.738550] mlx5_core 0000:01:0a.6: 0000:01:0a.6:mlx5_enter_error_state:115:(pid 48319): start [29283.738649] mlx5_core 0000:01:0a.6: 0000:01:0a.6:mlx5_enter_error_state:120:(pid 48319): end 
[29283.993504] mlx5_core 0000:01:0a.5: device's health compromised - reached miss count [29283.993669] mlx5_core 0000:01:0a.5: NIC was disabled [29283.993735] mlx5_core 0000:01:0a.5: assert_var[0] 0x0000004b [29283.993802] mlx5_core 0000:01:0a.5: assert_var[1] 0x00004b81 [29283.993874] mlx5_core 0000:01:0a.5: assert_var[2] 0x00000000 [29283.993948] mlx5_core 0000:01:0a.5: assert_var[3] 0x00000000 [29283.994021] mlx5_core 0000:01:0a.5: assert_var[4] 0x00000000 [29283.994096] mlx5_core 0000:01:0a.5: assert_exit_ptr 0x00602a78 [29283.994173] mlx5_core 0000:01:0a.5: assert_callra 0x00602b84 [29283.994244] mlx5_core 0000:01:0a.5: fw_ver 12.17.1010 [29283.994302] mlx5_core 0000:01:0a.5: hw_id 0x00000209 [29283.994365] mlx5_core 0000:01:0a.5: irisc_index 3 [29283.994425] mlx5_core 0000:01:0a.5: synd 0xe: Invalid EQ refrenced [29283.994499] mlx5_core 0000:01:0a.5: ext_synd 0x0001 [29283.994567] mlx5_core 0000:01:0a.5: handling bad device here [29283.994644] mlx5_core 0000:01:0a.5: 0000:01:0a.5:mlx5_enter_error_state:115:(pid 48324): start [29283.994741] mlx5_core 0000:01:0a.5: 0000:01:0a.5:mlx5_enter_error_state:120:(pid 48324): end [29284.253519] mlx5_core 0000:01:09.0: device's health compromised - reached miss count [29284.253671] mlx5_core 0000:01:09.0: NIC was disabled [29284.253730] mlx5_core 0000:01:09.0: assert_var[0] 0x0000004b [29284.253796] mlx5_core 0000:01:09.0: assert_var[1] 0x00004b81 [29284.253861] mlx5_core 0000:01:09.0: assert_var[2] 0x00000000 [29284.253927] mlx5_core 0000:01:09.0: assert_var[3] 0x00000000 [29284.253993] mlx5_core 0000:01:09.0: assert_var[4] 0x00000000 [29284.254058] mlx5_core 0000:01:09.0: assert_exit_ptr 0x00602a78 [29284.254125] mlx5_core 0000:01:09.0: assert_callra 0x00602b84 [29284.254191] mlx5_core 0000:01:09.0: fw_ver 12.17.1010 [29284.254244] mlx5_core 0000:01:09.0: hw_id 0x00000209 [29284.254297] mlx5_core 0000:01:09.0: irisc_index 3 [29284.254353] mlx5_core 0000:01:09.0: synd 0xe: Invalid EQ refrenced [29284.254419] mlx5_core 
0000:01:09.0: ext_synd 0x0001 [29284.254477] mlx5_core 0000:01:02.2: device's health compromised - reached miss count [29284.254486] mlx5_core 0000:01:09.0: handling bad device here [29284.254494] mlx5_core 0000:01:09.0: 0000:01:09.0:mlx5_enter_error_state:115:(pid 48172): start [29284.254511] mlx5_core 0000:01:09.0: 0000:01:09.0:mlx5_enter_error_state:120:(pid 48172): end [29284.254794] mlx5_core 0000:01:02.2: NIC was disabled [29284.254849] mlx5_core 0000:01:02.2: assert_var[0] 0x0000004b [29284.254915] mlx5_core 0000:01:02.2: assert_var[1] 0x00004b81 [29284.254980] mlx5_core 0000:01:02.2: assert_var[2] 0x00000000 [29284.255046] mlx5_core 0000:01:02.2: assert_var[3] 0x00000000 [29284.255111] mlx5_core 0000:01:02.2: assert_var[4] 0x00000000 [29284.255176] mlx5_core 0000:01:02.2: assert_exit_ptr 0x00602a78 [29284.255243] mlx5_core 0000:01:02.2: assert_callra 0x00602b84 [29284.255308] mlx5_core 0000:01:02.2: fw_ver 12.17.1010 [29284.255361] mlx5_core 0000:01:02.2: hw_id 0x00000209 [29284.255416] mlx5_core 0000:01:02.2: irisc_index 3 [29284.255471] mlx5_core 0000:01:02.2: synd 0xe: Invalid EQ refrenced [29284.255536] mlx5_core 0000:01:02.2: ext_synd 0x0001 [29284.256758] mlx5_core 0000:01:02.2: handling bad device here [29284.256886] mlx5_core 0000:01:02.2: 0000:01:02.2:mlx5_enter_error_state:115:(pid 48193): start [29284.257057] mlx5_core 0000:01:02.2: 0000:01:02.2:mlx5_enter_error_state:120:(pid 48193): end [29342.617462] mlx5_core 0000:01:0b.5: starting health recovery flow [29342.617463] mlx5_core 0000:01:0b.2: starting health recovery flow [29342.617476] mlx5_core 0000:01:02.3: starting health recovery flow [29342.617477] mlx5_core 0000:01:0b.0: starting health recovery flow [29342.621517] mlx5_core 0000:01:03.6: starting health recovery flow [29344.665463] mlx5_core 0000:01:0b.3: starting health recovery flow [29344.665481] mlx5_core 0000:01:08.5: starting health recovery flow [29344.665533] mlx5_core 0000:01:03.7: starting health recovery flow [29344.667132] 
mlx5_core 0000:01:03.4: starting health recovery flow [29344.669461] mlx5_core 0000:01:02.4: starting health recovery flow [29346.713453] mlx5_core 0000:01:02.6: starting health recovery flow [29356.953457] mlx5_core 0000:01:00.1: starting health recovery flow [29360.617458] mlx5_core 0000:01:0b.5: mlx5_pci_slot_reset: wait_vital timed out [29360.621485] mlx5_core 0000:01:03.6: mlx5_pci_slot_reset: wait_vital timed out [29360.825476] mlx5_core 0000:01:0b.2: mlx5_pci_slot_reset: wait_vital timed out [29360.833475] mlx5_core 0000:01:0b.0: mlx5_pci_slot_reset: wait_vital timed out [29360.877462] mlx5_core 0000:01:02.3: mlx5_pci_slot_reset: wait_vital timed out [29362.821467] mlx5_core 0000:01:0b.3: mlx5_pci_slot_reset: wait_vital timed out [29362.825456] mlx5_core 0000:01:03.7: mlx5_pci_slot_reset: wait_vital timed out [29362.897438] mlx5_core 0000:01:03.4: mlx5_pci_slot_reset: wait_vital timed out [29362.973426] mlx5_core 0000:01:08.5: mlx5_pci_slot_reset: wait_vital timed out [29363.149452] mlx5_core 0000:01:02.4: mlx5_pci_slot_reset: wait_vital timed out [29365.101457] mlx5_core 0000:01:02.6: mlx5_pci_slot_reset: wait_vital timed out [29375.077442] mlx5_core 0000:01:00.1: mlx5_pci_slot_reset: wait_vital timed out [29483.933710] INFO: task kworker/u128:1:47189 blocked for more than 120 seconds. [29483.933792] Tainted: G OE 4.8.0-17-generic #19 [29483.933802] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.933971] INFO: task kworker/u128:2:48059 blocked for more than 120 seconds. [29483.934046] Tainted: G OE 4.8.0-17-generic #19 [29483.934108] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.934289] INFO: task kworker/u128:0:48158 blocked for more than 120 seconds. [29483.934363] Tainted: G OE 4.8.0-17-generic #19 [29483.934424] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.934602] INFO: task kworker/u128:3:48169 blocked for more than 120 seconds. 
[29483.934676] Tainted: G OE 4.8.0-17-generic #19 [29483.934738] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.934915] INFO: task kworker/u128:4:48172 blocked for more than 120 seconds. [29483.934988] Tainted: G OE 4.8.0-17-generic #19 [29483.935050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.935226] INFO: task kworker/u128:5:48184 blocked for more than 120 seconds. [29483.935300] Tainted: G OE 4.8.0-17-generic #19 [29483.935360] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.935538] INFO: task kworker/u128:6:48185 blocked for more than 120 seconds. [29483.935612] Tainted: G OE 4.8.0-17-generic #19 [29483.935672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.935849] INFO: task kworker/u128:7:48193 blocked for more than 120 seconds. [29483.935923] Tainted: G OE 4.8.0-17-generic #19 [29483.935984] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.936162] INFO: task kworker/u128:11:48239 blocked for more than 120 seconds. [29483.936236] Tainted: G OE 4.8.0-17-generic #19 [29483.936296] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [29483.936472] INFO: task kworker/u128:12:48240 blocked for more than 120 seconds. [29483.936546] Tainted: G OE 4.8.0-17-generic #19 [29483.936607] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [48452.457095] kernel BUG at /root/linux-4.8.0/drivers/pci/msi.c:371! 
[48452.457204] Oops: Exception in kernel mode, sig: 5 [#1] [48452.457249] SMP NR_CPUS=2048 NUMA PowerNV [48452.457295] Modules linked in: vhost_net vhost macvtap macvlan vfio_pci irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) vmx_crypto bridge stp llc ipmi_powernv ipmi_msghandler powernv_rng leds_powernv uio_pdrv_genirq ibmpowernv powernv_op_panel uio binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables x_tables autofs4 ses enclosure scsi_transport_sas mlx4_en(OE) uas usb_storage lpfc bnx2x mdio libcrc32c ipr crc32c_vpmsum mlx4_core(OE) scsi_transport_fc be2net mlx5_core(OE) mlx_compat(OE) scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [48452.458485] CPU: 40 PID: 48240 Comm: kworker/u128:12 Tainted: G OE 4.8.0-17-generic #19 [48452.458574] Workqueue: mlx5_health0000:01:0a.1 health_care [mlx5_core] [48452.458642] task: c000002fd27a2600 task.stack: c000006ae224c000 [48452.458698] NIP: c000000000677350 LR: c000000000677340 CTR: 000000003001b314 [48452.458765] REGS: c000006ae224f7c0 TRAP: 0700 Tainted: G OE (4.8.0-17-generic) [48452.458843] MSR: 900000010282b033 CR: 42242422 XER: 20000000 [48452.459080] CFAR: c0000000001493c8 SOFTE: 1 GPR00: c000000000677340 c000006ae224fa40 c0000000014d5e00 c000002f0eed3000 GPR04: 0000000000000037 0000000000000000 0000000000000000 c000002f0eed3000 GPR08: c000002f0eed3000 0000000000000001 c000006ae176a280 c000002ff0001d40 GPR12: 0000000000000040 c00000000fb96800 c0000000000fd038 c000002f865f4a80 GPR16: 0000000000000000 0000000000000000 
0000000000000000 0000000000000001 GPR20: d0000000335afb08 c000002f092763a0 0000000000000001 0000000000000002 GPR24: c000002f09275b58 c000002f09275b20 0000000000000000 c000002f09240000 GPR28: c00000000280b288 c00000000280b000 c000006a913b68a0 0000000000000000 [48452.459938] NIP [c000000000677350] free_msi_irqs+0x90/0x200 [48452.459983] LR [c000000000677340] free_msi_irqs+0x80/0x200 [48452.460029] Call Trace: [48452.460052] [c000006ae224fa40] [c000000000677340] free_msi_irqs+0x80/0x200 (unreliable) [48452.460136] [c000006ae224fa90] [d000000033531a60] mlx5_unload_one+0x1d8/0x420 [mlx5_core] [48452.460293] [c000006ae224fb50] [d000000033531d1c] mlx5_pci_err_detected+0x74/0x100 [mlx5_core] [48452.460480] [c000006ae224fbd0] [d00000003353f328] health_care+0xa0/0x180 [mlx5_core] [48452.460638] [c000006ae224fc50] [c0000000000f4018] process_one_work+0x2a8/0x5a0 [48452.460797] [c000006ae224fce0] [c0000000000f43b8] worker_thread+0xa8/0x650 [48452.460930] [c000006ae224fd80] [c0000000000fd140] kthread+0x110/0x130[48452.460985] mlx5_core 0000:01:0a.7: 0000:01:0a.7:wait_func:938:(pid 48242): DESTROY_CQ(0x401) timeout. 
Will cause a leak of a command resource [48452.461262] [48452.461330] [c000006ae224fe30] [c0000000000098f0] ret_from_kernel_thread+0x5c/0x6c [48452.461487] Instruction dump: [48452.461554] 2f890000 419e0044 3be00000 48000008 807e0010 7c7f1a14 78630020 4bad204d [48452.461775] 60000000 e9430158 312affff 7d295110 <0b090000> 813e0014 3bff0001 7f89f840 [48452.462000] ---[ end trace 0cae433f9c8a44ba ]--- [48452.473807] [48452.474241] Unable to handle kernel paging request for data at address 0xffffffffffffffd8 [48452.474310] Faulting instruction address: 0xc0000000000fd9e8 [48452.474367] Oops: Kernel access of bad area, sig: 11 [#2] [48452.474412] SMP NR_CPUS=2048 NUMA PowerNV [48452.474458] Modules linked in: vhost_net vhost macvtap macvlan vfio_pci irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) vmx_crypto bridge stp llc ipmi_powernv ipmi_msghandler powernv_rng leds_powernv uio_pdrv_genirq ibmpowernv powernv_op_panel uio binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables x_tables autofs4 ses enclosure scsi_transport_sas mlx4_en(OE) uas usb_storage lpfc bnx2x mdio libcrc32c ipr crc32c_vpmsum mlx4_core(OE) scsi_transport_fc be2net mlx5_core(OE) mlx_compat(OE) scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [48452.475635] CPU: 40 PID: 48240 Comm: kworker/u128:12 Tainted: G D OE 4.8.0-17-generic #19 [48452.475721] task: c000002fd27a2600 task.stack: c000006ae224c000 [48452.475777] NIP: c0000000000fd9e8 LR: c0000000000f4f38 CTR: c000000000120680 [48452.475845] REGS: 
c000006ae224f070 TRAP: 0300 Tainted: G D OE (4.8.0-17-generic) [48452.475924] MSR: 900000010280b033 CR: 58242323 XER: 80000000 [48452.476162] CFAR: c000000000008750 DAR: ffffffffffffffd8 DSISR: 40000000 SOFTE: 0 GPR00: c0000000000f4f38 c000006ae224f2f0 c0000000014d5e00 c000002fd27a2600 GPR04: 00000000ffffffff 0000000000000003 c000002fd27a26a8 0000000000000001 GPR08: 0000000000000020 0000000000000000 0000000000000003 0000000000000000 GPR12: 0000000038242329 c00000000fb96800 c0000000000fd038 c000002f865f4a80 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000001 GPR20: d0000000335afb08 c000002f092763a0 0000000000000000 0000002ff1a00000 GPR24: c000002ff2a23100 c000000000b88fb8 c000000001023100 0000000000000000 GPR28: c00000000150aae0 c000002fd27a2600 c000002ff2a23100 c000002fd27a2600 [48452.477058] NIP [c0000000000fd9e8] kthread_data+0x28/0x40 [48452.477104] LR [c0000000000f4f38] wq_worker_sleeping+0x28/0xf0 [48452.477160] Call Trace: [48452.477183] [c000006ae224f2f0] [c000002ff2a23100] 0xc000002ff2a23100 (unreliable) [48452.477262] [c000006ae224f320] [c0000000000f4f38] wq_worker_sleeping+0x28/0xf0 [48452.477342] [c000006ae224f350] [c000000000b88ce8] __schedule+0x728/0x9b0 [48452.477410] [c000006ae224f430] [c000000000b88fb8] schedule+0x48/0xc0 [48452.477478] [c000006ae224f460] [c0000000000d3a9c] do_exit+0x79c/0xce0 [48452.477547] [c000006ae224f530] [c000000000025a84] die+0x314/0x470 [48452.477615] [c000006ae224f5c0] [c000000000025e14] _exception+0x1b4/0x1e0 [48452.477683] [c000006ae224f750] [c000000000006208] program_check_common+0x108/0x180 [48452.477762] --- interrupt: 700 at free_msi_irqs+0x90/0x200 [48452.477762] LR = free_msi_irqs+0x80/0x200 [48452.477858] [c000006ae224fa90] [d000000033531a60] mlx5_unload_one+0x1d8/0x420 [mlx5_core] [48452.477942] [c000006ae224fb50] [d000000033531d1c] mlx5_pci_err_detected+0x74/0x100 [mlx5_core] [48452.478037] [c000006ae224fbd0] [d00000003353f328] health_care+0xa0/0x180 [mlx5_core] [48452.478116] 
[c000006ae224fc50] [c0000000000f4018] process_one_work+0x2a8/0x5a0 [48452.478195] [c000006ae224fce0] [c0000000000f43b8] worker_thread+0xa8/0x650 [48452.478263] [c000006ae224fd80] [c0000000000fd140] kthread+0x110/0x130 [48452.478331] [c000006ae224fe30] [c0000000000098f0] ret_from_kernel_thread+0x5c/0x6c [48452.478410] Instruction dump: [48452.478443] 60000000 4bfffec8 3c4c013e 38428440 7c0802a6 fbe1fff8 f8010010 f821ffd1 [48452.478557] 7c7f1b78 60000000 60000000 e93f0690 38210030 e8010010 ebe1fff8 [48452.478714] ---[ end trace 0cae433f9c8a44bb ]--- [48452.489475] [48452.489500] Fixing recursive fault but reboot is needed! [48473.460924] INFO: rcu_sched detected stalls on CPUs/tasks: [48473.461031] 40-...: (0 ticks this GP) idle=781/140000000000000/0 softirq=6425666/6425666 fqs=3 [48473.461122] (detected by 48, t=5252 jiffies, g=2231511, c=2231510, q=878) [48473.461188] Task dump for CPU 40: [48473.461223] kworker/u128:12 D 0000000000000000 0 48240 0 0x00000800 [48473.461303] Call Trace: [48473.461328] rcu_sched kthread starved for 5244 jiffies! 
g2231511 c2231510 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 [48473.461435] rcu_sched S 0000000000000000 0 9 2 0x00000800 [48473.461504] Call Trace: [48473.461531] [c000006aeca2b870] [c00000000150ef90] sysctl_sched_migration_cost+0x0/0x4 (unreliable) [48473.461622] [c000006aeca2b8b0] [c000000000126eec] load_balance+0x33c/0xa90 [48473.461690] [c000006aeca2b9f0] [c000000000127894] pick_next_task_fair+0x254/0x6a0 [48473.461770] [c000006aeca2baa0] [c000000000b88728] __schedule+0x168/0x9b0 [48473.461838] [c000006aeca2bb80] [c000000000b88fb8] schedule+0x48/0xc0 [48473.461906] [c000006aeca2bbb0] [c000000000b8d5ac] schedule_timeout+0x25c/0x500 [48473.461986] [c000006aeca2bcb0] [c00000000015dfa4] rcu_gp_kthread+0x634/0xb20 [48473.462054] [c000006aeca2bd80] [c0000000000fd140] kthread+0x110/0x130 [48473.462122] [c000006aeca2be30] [c0000000000098f0] ret_from_kernel_thread+0x5c/0x6c [48476.304917] NMI watchdog: BUG: soft lockup - CPU#56 stuck for 23s! [qemu-system-ppc:16678] [48476.305020] Modules linked in: vhost_net vhost macvtap macvlan vfio_pci irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) vmx_crypto bridge stp llc ipmi_powernv ipmi_msghandler powernv_rng leds_powernv uio_pdrv_genirq ibmpowernv powernv_op_panel uio binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables x_tables autofs4 ses enclosure scsi_transport_sas mlx4_en(OE) uas usb_storage lpfc bnx2x mdio libcrc32c ipr crc32c_vpmsum mlx4_core(OE) scsi_transport_fc be2net mlx5_core(OE) mlx_compat(OE) scsi_dh_emc scsi_dh_rdac 
scsi_dh_alua dm_multipath [48476.306209] CPU: 56 PID: 16678 Comm: qemu-system-ppc Tainted: G D OE 4.8.0-17-generic #19 [48476.306287] task: c000006aa282ec00 task.stack: c000006a99f88000 [48476.306343] NIP: c0000000001868e4 LR: c0000000001868a4 CTR: c000000000076340 [48476.306410] REGS: c000006a99f8b7d0 TRAP: 0901 Tainted: G D OE (4.8.0-17-generic) [48476.306488] MSR: 900000010280b033 CR: 44424824 XER: 20000000 [48476.306724] CFAR: c0000000001868f0 SOFTE: 1 GPR00: c000000000186884 c000006a99f8ba50 c0000000014d5e00 0000000000000000 GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000001 GPR08: 0000000000000001 0000000000000003 c000002ff2026060 c00000000150ec20 GPR12: c000000000076340 c00000000fb9f800 [48476.307140] NIP [c0000000001868e4] smp_call_function_many+0x344/0x3e0 [48476.307196] LR [c0000000001868a4] smp_call_function_many+0x304/0x3e0 [48476.307252] Call Trace: [48476.307276] [c000006a99f8ba50] [c000000000186884] smp_call_function_many+0x2e4/0x3e0 (unreliable) [48476.307366] [c000006a99f8bad0] [c000000000186aec] kick_all_cpus_sync+0x3c/0x50 [48476.307445] [c000006a99f8baf0] [c00000000005a4d8] hash__pmdp_huge_get_and_clear+0xb8/0x100 [48476.307524] [c000006a99f8bb30] [c000000000300df0] change_huge_pmd+0x1b0/0x280 [48476.307592] [c000006a99f8bba0] [c0000000002b6dac] change_protection_range+0xc1c/0xd90 [48476.307671] [c000006a99f8bcd0] [c0000000002dfe00] change_prot_numa+0x50/0xd0 [48476.307739] [c000006a99f8bd20] [c000000000117ce4] task_numa_work+0x2c4/0x3c0 [48476.307807] [c000006a99f8bdb0] [c0000000000f9a50] task_work_run+0x140/0x1a0 [48476.307875] [c000006a99f8be00] [c00000000001c7f4] do_notify_resume+0xc4/0xd0 [48476.307942] [c000006a99f8be30] [c000000000009b44] ret_from_except_lite+0x70/0x74 [48476.308020] Instruction dump: [48476.308056] 409dfdb0 3d020003 78691f24 39484ce0 7d2a482a e95e0000 7d4a4a14 812a0018 [48476.308169] 71270001 4182001c 60420000 7c210b78 <7c421378> 812a0018 71280001 4082fff0 [48508.284845] NMI watchdog: BUG: soft 
lockup - CPU#32 stuck for 22s! [qemu-system-ppc:7582] [48508.284973] Modules linked in: vhost_net vhost macvtap macvlan vfio_pci irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) vmx_crypto bridge stp llc ipmi_powernv ipmi_msghandler powernv_rng leds_powernv uio_pdrv_genirq ibmpowernv powernv_op_panel uio binfmt_misc nfsd auth_rpcgss nfs_acl lockd grace sunrpc knem(OE) ip_tables x_tables autofs4 ses enclosure scsi_transport_sas mlx4_en(OE) uas usb_storage lpfc bnx2x mdio libcrc32c ipr crc32c_vpmsum mlx4_core(OE) scsi_transport_fc be2net mlx5_core(OE) mlx_compat(OE) scsi_dh_emc scsi_dh_rdac scsi_dh_alua dm_multipath [48508.286143] CPU: 32 PID: 7582 Comm: qemu-system-ppc Tainted: G D OEL 4.8.0-17-generic #19 [48508.286222] task: c000006ab3725200 task.stack: c000006ab3a44000 [48508.286279] NIP: c0000000001868e4 LR: c0000000001868a4 CTR: c000000000076340 [48508.286346] REGS: c000006ab3a477d0 TRAP: 0901 Tainted: G D OEL (4.8.0-17-generic) [48508.286424] MSR: 900000010280b033 CR: 44424844 XER: 20000000 [48508.286661] CFAR: c0000000001868f0 SOFTE: 1 GPR00: c000000000186884 c000006ab3a47a50 c0000000014d5e00 0000000000000000 GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000001 GPR08: 0000000000000001 0000000000000003 c000002ff2025d60 c00000000150ec20 GPR12: c000000000076340 c00000000fb92000 [48508.287077] NIP [c0000000001868e4] smp_call_function_many+0x344/0x3e0 [48508.287134] LR [c0000000001868a4] smp_call_function_many+0x304/0x3e0 [48508.287189] Call Trace: [48508.287213] 
[c000006ab3a47a50] [c000000000186884] smp_call_function_many+0x2e4/0x3e0 (unreliable) [48508.287304] [c000006ab3a47ad0] [c000000000186aec] kick_all_cpus_sync+0x3c/0x50 [48508.287383] [c000006ab3a47af0] [c00000000005a4d8] hash__pmdp_huge_get_and_clear+0xb8/0x100 [48508.287462] [c000006ab3a47b30] [c000000000300df0] change_huge_pmd+0x1b0/0x280 [48508.287529] [c000006ab3a47ba0] [c0000000002b6dac] change_protection_range+0xc1c/0xd90 [48508.287608] [c000006ab3a47cd0] [c0000000002dfe00] change_prot_numa+0x50/0xd0 [48508.287677] [c000006ab3a47d20] [c000000000117ce4] task_numa_work+0x2c4/0x3c0 [48508.287744] [c000006ab3a47db0] [c0000000000f9a50] task_work_run+0x140/0x1a0 [48508.287813] [c000006ab3a47e00] [c00000000001c7f4] do_notify_resume+0xc4/0xd0 [48508.287880] [c000006ab3a47e30] [c000000000009b44] ret_from_except_lite+0x70/0x74 [48508.287958] Instruction dump: [48508.287993] 409dfdb0 3d020003 78691f24 39484ce0 7d2a482a e95e0000 7d4a4a14 812a0018 [48508.288106] 71270001 4182001c 60420000 7c210b78 <7c421378> 812a0018 71280001 4082fff0 [48536.480780] INFO: rcu_sched detected stalls on CPUs/tasks: [48536.480872] 40-...: (0 ticks this GP) idle=781/140000000000000/0 softirq=6425666/6425666 fqs=3 [48536.480950] (detected by 56, t=21007 jiffies, g=2231511, c=2231510, q=879) [48536.481016] Task dump for CPU 40: [48536.481050] kworker/u128:12 D 0000000000000000 0 48240 0 0x00000800 [48536.481126] Call Trace: [48536.481150] rcu_sched kthread starved for 20999 jiffies! 
g2231511 c2231510 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1 [48536.481239] rcu_sched S 0000000000000000 0 9 2 0x00000800 [48536.481307] Call Trace: [48536.481331] [c000006aeca2b870] [c00000000150ef90] sysctl_sched_migration_cost+0x0/0x4 (unreliable) [48536.481438] [c000006aeca2b8b0] [c000000000126eec] load_balance+0x33c/0xa90 [48536.481506] [c000006aeca2b9f0] [c000000000127894] pick_next_task_fair+0x254/0x6a0 [48536.481585] [c000006aeca2baa0] [c000000000b88728] __schedule+0x168/0x9b0 [48536.481653] [c000006aeca2bb80] [c000000000b88fb8] schedule+0x48/0xc0 [48536.481721] [c000006aeca2bbb0] [c000000000b8d5ac] schedule_timeout+0x25c/0x500 [48536.481799] [c000006aeca2bcb0] [c00000000015dfa4] rcu_gp_kthread+0x634/0xb20 [48536.481867] [c000006aeca2bd80] [c0000000000fd140] kthread+0x110/0x130 [48536.481935] [c000006aeca2be30] [c0000000000098f0] ret_from_kernel_thread+0x5c/0x6c