In-kernel ixgbevf driver locks up on AWS M4 instances (among others)

Bug #1510315 reported by Jarrod Petz
This bug affects 2 people
Affects      | Status     | Importance | Assigned to  | Milestone
cloud-images | Incomplete | Undecided  | Stefan Bader |

Bug Description

Creating this bug to separate the issue from this bug report https://bugs.launchpad.net/ubuntu-on-ec2/+bug/1254930.

In that bug report people called out possible issues with the ixgbevf driver included in Ubuntu 14.04. I came across it while researching a customer issue, hoping to find known issues with the driver.

The issue is that, after operating normally for some time, the AWS instance seems to lose all network connectivity. The console log shows no errors, but after rebooting the instance we noted that DHCP had been failing to renew its lease (multiple attempts are visible in the logs under /var/log). That would explain the connectivity loss: no IP address once the lease expires.

Unfortunately I do not have access to the customer's AMI/image or a working reproduction of this. AWS is predominantly an IaaS provider, so this is typically the case for most issues I work on with our customers, unless the customer is willing to provide me a copy of their data/image. Even when they would be willing, I usually don't ask if I suspect the issue is workload related, because simulating a customer workload is often complex and difficult. In this case I suspected it was workload related or workload triggered, given that the instances failed randomly after having worked fine for hours.

Regardless, the evidence of a problem was clear. The customer never had these problems on older instance types, which did not provide Intel SR-IOV, using the identical image/AMI. After moving to M4 instances they started to get random instance failures in their Auto Scaling groups (ASGs). They had multiple ASGs, and after working with them the pattern was clear: only the ASGs using M4 instances had these random failures, and in the console log the main difference I noted was the change of network driver on the M4 instances. I provided them with detailed instructions for compiling and installing the latest out-of-tree driver (at the time this was 2.16.1, https://sourceforge.net/projects/e1000/files/ixgbevf%20stable/2.16.1/, with a patch to make it compile on Ubuntu). After the customer set up this driver and baked a new image with it for their ASGs, the problem was gone. They had been seeing 1-2 instances per day fail in the same way, so we waited a week to verify the fix. The customer contacted me a week later to confirm that it had fixed the problem.

Given that the in-kernel driver now has additional fixes in the upstream source, the only thing I can suggest is looking there for clues about what was done to address this in later releases.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/log/drivers/net/ethernet/intel/ixgbevf

Specifically, I think the issue is fixed in kernels that have driver version 2.12.1-k or later (OSes running this seem fine). As per my post at the time, Ubuntu 14.04 was running 2.11.3-k. The commit linked below is the change from version 2.11.3-k to 2.12.1-k in the Linux tree. Whether the fix is between these two versions or in one of the changes that have landed since (without a version bump), I'm not sure.
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/net/ethernet/intel/ixgbevf?id=86f359f6b8a303d4cc99e889a3481d88cae1bec2
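
To see which side of that version change a given instance is on, the driver name and version can be read from the output of `ethtool -i <interface>`. Below is a minimal Python sketch of that check. The helper names (`parse_driver_info`, `version_tuple`, `is_suspect`) are made up for illustration, and the 2.12.1-k threshold is only the guess stated above, not a confirmed fix boundary.

```python
import re

# Assumed threshold: the first in-kernel version that appeared unaffected
# according to the report above. A guess, not a confirmed fix boundary.
FIRST_GOOD = (2, 12, 1)


def parse_driver_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i <iface>`."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI
        # addresses (which contain colons) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn '2.11.3-k' into (2, 11, 3); the '-k' suffix is ignored."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised driver version: {version!r}")
    return tuple(int(part) for part in match.groups())


def is_suspect(ethtool_output):
    """True if the driver is ixgbevf and older than the assumed threshold."""
    info = parse_driver_info(ethtool_output)
    if info.get("driver") != "ixgbevf":
        return False
    return version_tuple(info.get("version", "")) < FIRST_GOOD
```

On a live instance the input could come from something like `subprocess.run(["ethtool", "-i", "eth0"], capture_output=True, text=True).stdout`, where the interface name `eth0` is an assumption. Because later fixes may have landed without a version bump, a result of "not suspect" says little on its own.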

If the in-kernel driver version number tracked the out-of-tree driver version, this would probably be easier to pinpoint. People from Intel told me the in-kernel driver is at parity with the latest out-of-tree version, but the version number does not reflect this, which is a little confusing.

Robert C Jennings (rcj) wrote :

smb, Here is the new bug for the EC2 M4 work issue while using the in-tree ixgbevf driver.

Robert C Jennings (rcj) wrote :

s/work/network/ too quick on the keys tonight.

Robert C Jennings (rcj)
Changed in ubuntu-on-ec2:
assignee: nobody → Stefan Bader (smb)
Stefan Bader (smb) wrote :

When I looked at the kernel source to see when the change from version 2.11.3-k to 2.12.1-k happened, it turned out to be part of 3.14. If I read the comment right, that would mean all kernel versions >= 3.14 are OK, and for us that any release after Trusty should be OK. Can we get confirmation of that from those affected?
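
For anyone affected who wants to report which side of the 3.14 boundary they are on, here is a rough Python sketch that classifies a kernel release string. The function names are made up for illustration, and the boundary itself is only the hypothesis above, pending confirmation. Distribution kernels also carry backports, so the upstream version number is only a rough guide.

```python
import re


def kernel_version(release):
    """Extract (major, minor) from a release string such as '3.13.0-66-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def predates_3_14(release):
    """True if the kernel is older than 3.14 (the unconfirmed boundary)."""
    return kernel_version(release) < (3, 14)
```

On a running system the release string is what `uname -r` prints, which Python exposes as `platform.release()`.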

If that were true, it would at least simplify the effort to figure out a fix, since the delta between 3.13 and 3.14 (without checking whether any of those changes might already be part of stable) is only around 14 changes instead of around 75. Even better would be if Intel could be bothered to make the upstream Linux driver at least follow the standalone driver...
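
The 3.13 to 3.14 delta for the driver can be listed from a kernel git checkout with `git log` restricted to the driver directory. This is a small Python sketch of that. The function names and the `repo_dir` parameter are assumptions for illustration, and the tags `v3.13` and `v3.14` are the upstream release tags.

```python
import subprocess

# Driver directory as it appears in the upstream tree linked in the description.
DRIVER_PATH = "drivers/net/ethernet/intel/ixgbevf"


def git_log_command(old_tag, new_tag, path=DRIVER_PATH):
    """Build the `git log` argv for commits in old_tag..new_tag that touch path."""
    return ["git", "log", "--oneline", f"{old_tag}..{new_tag}", "--", path]


def list_commits(repo_dir, old_tag="v3.13", new_tag="v3.14"):
    """Run the command inside a kernel checkout; returns one line per commit."""
    result = subprocess.run(
        git_log_command(old_tag, new_tag),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `list_commits("/path/to/linux")` (the path is a placeholder) yields the candidate commits to review or bisect. Whether any of them were already backported to a stable or distribution kernel still has to be checked separately.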

Mathew Hodson (mhodson)
affects: ubuntu-on-ec2 → cloud-images
Joshua Powers (powersj) wrote :

Hi,

Going through old bugs to check on status. I am going to mark this Incomplete, as Stefan looked into this above but never heard back. Additionally, Trusty has reached end of standard support.

Thanks!

Changed in cloud-images:
status: New → Incomplete