qlcnic firmware hang detected kvm ganeti

Bug #1723482 reported by Dani García
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Expired  Undecided   Unassigned

Bug Description

1) Ubuntu release:

Description: Ubuntu 16.04.3 LTS
Release: 16.04

2) Package version:

* linux-image-extra-4.4.0-96-generic (4.4.0-96.119)
* Also with HWE kernel (4.10.x)

3) What I expect:

I have a 10G network card (HP NC523SFP 10Gb 2-port) in an HP ProLiant DL380p Gen8, BIOS P70 07/01/2015. The card uses the qlcnic driver and its two ports appear as ens2f0 and ens2f1. VLANs are also configured on them.

I have installed Ganeti and created bridges on top of those interfaces: br-dmz over ens2f0 and br-str over ens2f1.

Everything should work without connectivity loss.

4) What happened instead:

The interface loses connectivity from time to time. It recovers by itself, but logs the following error:

Oct 12 18:23:14 mazinger kernel: [107906.678468] qlcnic 0000:07:00.1: Pause control frames disabled on all ports
Oct 12 18:23:14 mazinger kernel: [107906.678470] qlcnic 0000:07:00.0: Pause control frames disabled on all ports
Oct 12 18:23:14 mazinger kernel: [107906.678475] qlcnic 0000:07:00.0: firmware hang detected
Oct 12 18:23:14 mazinger kernel: [107906.678482] qlcnic 0000:07:00.0: Dumping hw/fw registers
Oct 12 18:23:14 mazinger kernel: [107906.678482] PEG_HALT_STATUS1: 0x40001502, PEG_HALT_STATUS2: 0x3e1f80,
Oct 12 18:23:14 mazinger kernel: [107906.678482] PEG_NET_0_PC: 0x6d920, PEG_NET_1_PC: 0x6d976,
Oct 12 18:23:14 mazinger kernel: [107906.678482] PEG_NET_2_PC: 0x149, PEG_NET_3_PC: 0x6edbe,
Oct 12 18:23:14 mazinger kernel: [107906.678482] PEG_NET_4_PC: 0x1e2f3
Oct 12 18:23:14 mazinger kernel: [107906.680107] qlcnic 0000:07:00.1: firmware hang detected
Oct 12 18:23:14 mazinger kernel: [107906.680385] qlcnic 0000:07:00.1: Dumping hw/fw registers
Oct 12 18:23:14 mazinger kernel: [107906.680385] PEG_HALT_STATUS1: 0x40001502, PEG_HALT_STATUS2: 0x3e1f80,
Oct 12 18:23:14 mazinger kernel: [107906.680385] PEG_NET_0_PC: 0x6d920, PEG_NET_1_PC: 0x6d976,
Oct 12 18:23:14 mazinger kernel: [107906.680385] PEG_NET_2_PC: 0x149, PEG_NET_3_PC: 0x6edbe,
Oct 12 18:23:14 mazinger kernel: [107906.680385] PEG_NET_4_PC: 0x1e2f3
Oct 12 18:23:14 mazinger kernel: [107906.695571] br-dmz: port 1(ens2f0.2) entered disabled state
Oct 12 18:23:15 mazinger kernel: [107907.690629] br-str: port 1(ens2f1.10) entered disabled state
Oct 12 18:23:16 mazinger kernel: [107908.706988] qlcnic 0000:07:00.1: Detected state change from DEV_NEED_RESET, skipping ack check
Oct 12 18:23:17 mazinger kernel: [107909.423713] qlcnic 0000:07:00.0 ens2f0: Dump data 15044136 bytes captured, dump data address = ffffc900334c3000, template header size 36864 bytes, template address = ffffc900193da000
Oct 12 18:23:21 mazinger kernel: [107912.800338] qlcnic 0000:07:00.0: loading firmware from flash
Oct 12 18:23:27 mazinger kernel: [107919.137580] qlcnic 0000:07:00.0: Driver v5.3.63, firmware v4.20.1
Oct 12 18:23:27 mazinger kernel: [107919.501555] qlcnic 0000:07:00.1: Driver v5.3.63, firmware v4.20.1
Oct 12 18:23:28 mazinger kernel: [107920.425737] qlcnic 0000:07:00.0 ens2f0: Rx Context[0] Created, state 0x2
Oct 12 18:23:28 mazinger kernel: [107920.435780] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x8000] Created, state 0x2
Oct 12 18:23:28 mazinger kernel: [107920.453103] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x8008] Created, state 0x2
Oct 12 18:23:29 mazinger kernel: [107921.598651] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x800a] Created, state 0x2
Oct 12 18:23:29 mazinger kernel: [107921.615752] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x800c] Created, state 0x2
Oct 12 18:23:30 mazinger kernel: [107922.196706] qlcnic 0000:07:00.1 ens2f1: Rx Context[1] Created, state 0x2
Oct 12 18:23:30 mazinger kernel: [107922.406680] qlcnic 0000:07:00.1 ens2f1: Tx Context[0x8001] Created, state 0x2
Oct 12 18:23:30 mazinger kernel: [107922.422646] qlcnic 0000:07:00.1 ens2f1: Tx Context[0x8009] Created, state 0x2
Oct 12 18:23:30 mazinger kernel: [107922.439890] qlcnic 0000:07:00.1 ens2f1: Tx Context[0x800b] Created, state 0x2
Oct 12 18:23:30 mazinger kernel: [107922.456417] qlcnic 0000:07:00.1 ens2f1: Tx Context[0x800d] Created, state 0x2
Oct 12 18:23:31 mazinger kernel: [107923.500128] qlcnic 0000:07:00.0 ens2f0: NIC Link is up
Oct 12 18:23:31 mazinger kernel: [107923.500360] br-dmz: port 1(ens2f0.2) entered forwarding state
Oct 12 18:23:31 mazinger kernel: [107923.500375] br-dmz: port 1(ens2f0.2) entered forwarding state
Oct 12 18:23:31 mazinger kernel: [107923.500680] qlcnic 0000:07:00.1 ens2f1: NIC Link is up
Oct 12 18:23:31 mazinger kernel: [107923.500971] br-str: port 1(ens2f1.10) entered forwarding state
Oct 12 18:23:31 mazinger kernel: [107923.500985] br-str: port 1(ens2f1.10) entered forwarding state
---------------
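
The length of an outage like the one above can be estimated from the kernel timestamps. The following is a minimal sketch and not part of the original report: it assumes syslog-formatted kernel lines like those shown, and the function name `outage_seconds` and the shortened sample are illustrative only. It takes the first "firmware hang detected" line from any device as the start, because that message carries only the PCI address and not the interface name.

```python
import re

# Kernel uptime stamp plus message, e.g.
# "[107906.678475] qlcnic 0000:07:00.0: firmware hang detected"
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)$")


def outage_seconds(lines, iface):
    """Seconds between the first 'firmware hang detected' line and the
    next 'NIC Link is up' line for iface; None if either is missing."""
    start = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        if start is None and "firmware hang detected" in msg:
            start = stamp
        elif start is not None and f"{iface}: NIC Link is up" in msg:
            return stamp - start
    return None


# Three lines copied from the log above (hypothetical shortened sample).
sample = [
    "Oct 12 18:23:14 mazinger kernel: [107906.678475] qlcnic 0000:07:00.0: firmware hang detected",
    "Oct 12 18:23:21 mazinger kernel: [107912.800338] qlcnic 0000:07:00.0: loading firmware from flash",
    "Oct 12 18:23:31 mazinger kernel: [107923.500128] qlcnic 0000:07:00.0 ens2f0: NIC Link is up",
]
print(outage_seconds(sample, "ens2f0"))
```

For the event logged above, this gives roughly 16.8 seconds between the first hang message and ens2f0 reporting link up.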

Sometimes it also produces kernel warnings, and the server needs to be rebooted to recover connectivity:

Oct 9 14:36:41 mazinger kernel: [262273.497512] ------------[ cut here ]------------
Oct 9 14:36:41 mazinger kernel: [262273.497821] WARNING: CPU: 6 PID: 0 at /build/linux-z2ccW0/linux-4.4.0/net/sched/sch_generic.c:306 dev_watchdog+0x237/0x240()
Oct 9 14:36:41 mazinger kernel: [262273.498083] NETDEV WATCHDOG: ens2f0 (qlcnic): transmit queue 0 timed out
Oct 9 14:36:41 mazinger kernel: [262273.498579] Modules linked in: joydev binfmt_misc hpwdt ipmi_ssif bridge intel_rapl x86_pkg_temp_thermal input_leds intel_powerclamp serio_raw sb_edac edac_core lpc_ich 8250_fintek hpilo ioatdma shpchp ipmi_si ipmi_msghandler mac_hid kvm_intel kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi 8021q garp mrp stp llc coretemp drbd lru_cache autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper qlcnic hid_generic tg3 igb dca hpsa vxlan cryptd usbhid ptp psmouse ip6_udp_tunnel pata_acpi hid i2c_algo_bit scsi_transport_sas pps_core udp_tunnel wmi fjes
Oct 9 14:36:41 mazinger kernel: [262273.498651] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.4.0-96-generic #119-Ubuntu
Oct 9 14:36:41 mazinger kernel: [262273.498652] Hardware name: HP ProLiant DL380p Gen8, BIOS P70 07/01/2015
Oct 9 14:36:41 mazinger kernel: [262273.498654] 0000000000000286 fc090740aa4761f7 ffff881fbf783d98 ffffffff813fabd3
Oct 9 14:36:41 mazinger kernel: [262273.498666] ffff881fbf783de0 ffffffff81d715f8 ffff881fbf783dd0 ffffffff810812e2
Oct 9 14:36:41 mazinger kernel: [262273.498668] 0000000000000000 ffff881fade31b00 0000000000000006 ffff881fade30000
Oct 9 14:36:41 mazinger kernel: [262273.498681] Call Trace:
Oct 9 14:36:41 mazinger kernel: [262273.498683] <IRQ> [<ffffffff813fabd3>] dump_stack+0x63/0x90
Oct 9 14:36:41 mazinger kernel: [262273.498691] [<ffffffff810812e2>] warn_slowpath_common+0x82/0xc0
Oct 9 14:36:41 mazinger kernel: [262273.498693] [<ffffffff8108137c>] warn_slowpath_fmt+0x5c/0x80
Oct 9 14:36:41 mazinger kernel: [262273.498697] [<ffffffff8175eca7>] dev_watchdog+0x237/0x240
Oct 9 14:36:41 mazinger kernel: [262273.498700] [<ffffffff8175ea70>] ? qdisc_rcu_free+0x40/0x40
Oct 9 14:36:41 mazinger kernel: [262273.498705] [<ffffffff810ed035>] call_timer_fn+0x35/0x120
Oct 9 14:36:41 mazinger kernel: [262273.498708] [<ffffffff8175ea70>] ? qdisc_rcu_free+0x40/0x40
Oct 9 14:36:41 mazinger kernel: [262273.498711] [<ffffffff810ed9ea>] run_timer_softirq+0x23a/0x2f0
Oct 9 14:36:41 mazinger kernel: [262273.498714] [<ffffffff81085dc1>] __do_softirq+0x101/0x290
Oct 9 14:36:41 mazinger kernel: [262273.498717] [<ffffffff810860c3>] irq_exit+0xa3/0xb0
Oct 9 14:36:41 mazinger kernel: [262273.498721] [<ffffffff81845d22>] smp_apic_timer_interrupt+0x42/0x50
Oct 9 14:36:41 mazinger kernel: [262273.498724] [<ffffffff81843fe2>] apic_timer_interrupt+0x82/0x90
Oct 9 14:36:41 mazinger kernel: [262273.498726] <EOI> [<ffffffff816d680e>] ? cpuidle_enter_state+0x10e/0x2b0
Oct 9 14:36:41 mazinger kernel: [262273.498731] [<ffffffff816d69e7>] cpuidle_enter+0x17/0x20
Oct 9 14:36:41 mazinger kernel: [262273.498735] [<ffffffff810c47c2>] call_cpuidle+0x32/0x60
Oct 9 14:36:41 mazinger kernel: [262273.498737] [<ffffffff816d69c3>] ? cpuidle_select+0x13/0x20
Oct 9 14:36:41 mazinger kernel: [262273.498739] [<ffffffff810c4a80>] cpu_startup_entry+0x290/0x350
Oct 9 14:36:41 mazinger kernel: [262273.498743] [<ffffffff810517b4>] start_secondary+0x154/0x190
Oct 9 14:36:41 mazinger kernel: [262273.498749] ---[ end trace 6388d35f388918bc ]---
Oct 9 14:36:41 mazinger kernel: [262273.498765] qlcnic 0000:07:00.0 ens2f0: rds_ring=0 crb_rcv_producer=3113 producer=3114 num_desc=4096
Oct 9 14:36:41 mazinger kernel: [262273.498773] qlcnic 0000:07:00.0 ens2f0: rds_ring=1 crb_rcv_producer=1023 producer=0 num_desc=1024
Oct 9 14:36:41 mazinger kernel: [262273.498781] qlcnic 0000:07:00.0 ens2f0: sds_ring=0 crb_sts_consumer=659 consumer=659 crb_intr_mask=0 num_desc=4096
Oct 9 14:36:41 mazinger kernel: [262273.498788] qlcnic 0000:07:00.0 ens2f0: sds_ring=1 crb_sts_consumer=2894 consumer=2894 crb_intr_mask=0 num_desc=4096
Oct 9 14:36:41 mazinger kernel: [262273.498792] qlcnic 0000:07:00.0 ens2f0: sds_ring=2 crb_sts_consumer=3092 consumer=3092 crb_intr_mask=0 num_desc=4096
Oct 9 14:36:41 mazinger kernel: [262273.498796] qlcnic 0000:07:00.0 ens2f0: sds_ring=3 crb_sts_consumer=570 consumer=570 crb_intr_mask=0 num_desc=4096
Oct 9 14:36:41 mazinger kernel: [262273.498798] qlcnic 0000:07:00.0 ens2f0: Tx ring=0 Context Id=0x8000
Oct 9 14:36:41 mazinger kernel: [262273.498800] qlcnic 0000:07:00.0 ens2f0: xmit_finished=161917485, xmit_called=161920455, xmit_on=0, xmit_off=2
Oct 9 14:36:41 mazinger kernel: [262273.498802] qlcnic 0000:07:00.0 ens2f0: crb_intr_mask=0
Oct 9 14:36:41 mazinger kernel: [262273.498805] qlcnic 0000:07:00.0 ens2f0: hw_producer=481, sw_producer=481 sw_consumer=491, hw_consumer=491
Oct 9 14:36:41 mazinger kernel: [262273.498807] qlcnic 0000:07:00.0 ens2f0: Total desc=1024, Available desc=10
Oct 9 14:36:41 mazinger kernel: [262273.498809] qlcnic 0000:07:00.0 ens2f0: Tx ring=1 Context Id=0x8008
Oct 9 14:36:41 mazinger kernel: [262273.498811] qlcnic 0000:07:00.0 ens2f0: xmit_finished=152057037, xmit_called=152059997, xmit_on=0, xmit_off=2
Oct 9 14:36:41 mazinger kernel: [262273.498813] qlcnic 0000:07:00.0 ens2f0: crb_intr_mask=0
Oct 9 14:36:41 mazinger kernel: [262273.498816] qlcnic 0000:07:00.0 ens2f0: hw_producer=81, sw_producer=81 sw_consumer=91, hw_consumer=91
Oct 9 14:36:41 mazinger kernel: [262273.498818] qlcnic 0000:07:00.0 ens2f0: Total desc=1024, Available desc=10
Oct 9 14:36:41 mazinger kernel: [262273.498819] qlcnic 0000:07:00.0 ens2f0: Tx ring=2 Context Id=0x800a
Oct 9 14:36:41 mazinger kernel: [262273.498821] qlcnic 0000:07:00.0 ens2f0: xmit_finished=133645903, xmit_called=133648936, xmit_on=0, xmit_off=2
Oct 9 14:36:41 mazinger kernel: [262273.498824] qlcnic 0000:07:00.0 ens2f0: crb_intr_mask=0
Oct 9 14:36:41 mazinger kernel: [262273.498827] qlcnic 0000:07:00.0 ens2f0: hw_producer=572, sw_producer=572 sw_consumer=582, hw_consumer=582
Oct 9 14:36:41 mazinger kernel: [262273.498828] qlcnic 0000:07:00.0 ens2f0: Total desc=1024, Available desc=10
Oct 9 14:36:41 mazinger kernel: [262273.498830] qlcnic 0000:07:00.0 ens2f0: Tx ring=3 Context Id=0x800c
Oct 9 14:36:41 mazinger kernel: [262273.498836] qlcnic 0000:07:00.0 ens2f0: xmit_finished=162932700, xmit_called=162935603, xmit_on=0, xmit_off=2
Oct 9 14:36:41 mazinger kernel: [262273.498843] qlcnic 0000:07:00.0 ens2f0: crb_intr_mask=0
Oct 9 14:36:41 mazinger kernel: [262273.498850] qlcnic 0000:07:00.0 ens2f0: hw_producer=568, sw_producer=568 sw_consumer=578, hw_consumer=578
Oct 9 14:36:41 mazinger kernel: [262273.498857] qlcnic 0000:07:00.0 ens2f0: Total desc=1024, Available desc=10
Oct 9 14:36:41 mazinger kernel: [262273.498863] qlcnic 0000:07:00.0 ens2f0: Tx timeout, reset adapter context.
Oct 9 14:36:43 mazinger kernel: [262275.251864] qlcnic 0000:07:00.0: CDRP command failed: [7]
Oct 9 14:36:43 mazinger kernel: [262275.252143] qlcnic 0000:07:00.0: Host MBX regs(2)
Oct 9 14:36:43 mazinger kernel: [262275.252146] 00000039
Oct 9 14:36:43 mazinger kernel: [262275.252148] 00050032 <6>[262275.252150]
Oct 9 14:36:43 mazinger kernel: [262275.252153] qlcnic 0000:07:00.0: FW MBX regs(3)
Oct 9 14:36:43 mazinger kernel: [262275.252155] 00000007
Oct 9 14:36:43 mazinger kernel: [262275.252156] 00000000 00000000
Oct 9 14:36:43 mazinger kernel: [262275.252158]
Oct 9 14:36:43 mazinger kernel: [262275.252166] qlcnic 0000:07:00.0 ens2f0: Failed to Delete interrupts 7
Oct 9 14:36:43 mazinger kernel: [262275.279376] br-dmz: port 1(ens2f0.2) entered disabled state
Oct 9 14:36:43 mazinger kernel: [262275.447095] qlcnic 0000:07:00.0 ens2f0: Rx Context[0] Created, state 0x2
Oct 9 14:36:43 mazinger kernel: [262275.493365] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x8000] Created, state 0x2
Oct 9 14:36:43 mazinger kernel: [262275.509816] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x800e] Created, state 0x2
Oct 9 14:36:43 mazinger kernel: [262275.527651] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x8010] Created, state 0x2
Oct 9 14:36:43 mazinger kernel: [262275.543852] qlcnic 0000:07:00.0 ens2f0: Tx Context[0x8012] Created, state 0x2
Oct 9 14:36:43 mazinger kernel: [262275.545966] qlcnic 0000:07:00.0 ens2f0: qlcnic_reset_hw_context: soft reset complete
-----------

What I have tried to fix it:

- I have upgraded the interface firmware to the latest version provided by HP:

# ethtool -i ens2f0
driver: qlcnic
version: 5.3.63
firmware-version: 4.20.1
expansion-rom-version:
bus-info: 0000:07:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

- I have opened a case with HP. Following their recommendations, I have upgraded the server firmware to the latest version. After capturing an AHS (Active Health System) log, they told me that there isn't a hardware problem and that it should be a software issue.

- I have tried the HWE kernel (version 4.10.x), which comes with a newer version of the qlcnic module (5.3.65), but it didn't solve the problem.

- After reading about some problems with TSO in virtual environments, I have disabled TSO/GSO and other offloads on the interfaces:

auto <iface>
iface <iface> inet manual
    pre-up /sbin/ethtool --offload <iface> gso off tso off sg off gro off
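
To confirm that the offloads really are disabled once the interface is up, the output of `ethtool -k <iface>` can be checked. This is a minimal sketch and not from the original report: it parses text in the `name: on|off` form that `ethtool -k` prints, the helper names and the sample text are illustrative, and the feature list simply mirrors the gso/tso/sg/gro flags used above.

```python
import subprocess

# Long feature names that `ethtool -k` prints for the tso/gso/sg/gro flags.
WANTED_OFF = (
    "tcp-segmentation-offload",
    "generic-segmentation-offload",
    "scatter-gather",
    "generic-receive-offload",
)


def parse_features(text):
    """Map feature name -> True/False from 'name: on|off [fixed]' lines."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        words = value.split()
        if sep and words and words[0] in ("on", "off"):
            features[name] = words[0] == "on"
    return features


def still_enabled(text):
    """Return the wanted-off features that are still reported as on."""
    features = parse_features(text)
    return [name for name in WANTED_OFF if features.get(name, False)]


def check_iface(iface):
    """Run `ethtool -k` on a real interface (needs ethtool installed)."""
    out = subprocess.run(["ethtool", "-k", iface], capture_output=True,
                         text=True, check=True).stdout
    return still_enabled(out)


# Hypothetical output, made up for illustration (GSO left on by mistake).
sample = """Features for ens2f0:
scatter-gather: off
tcp-segmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
"""
print(still_enabled(sample))
```

An empty list from `check_iface("ens2f0")` would mean all four offloads are reported as off.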

Searching online, I have found reports of similar problems, but all of them were solved by applying one or more of the steps above. The issue seems to be related to this type of interface when it is used in virtual environments.

Juhani Numminen (jsonic)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1723482

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired