[Hyper-V] Docker failures with linux-azure 4.11.0-1011

Bug #1719045 reported by Joshua R. Poulson on 2017-09-23
This bug affects 6 people
Affects: linux-azure (Ubuntu), Importance: High, Assigned to: Marcelo Cerri
Affects: linux-azure (Ubuntu Xenial), Importance: High, Assigned to: Marcelo Cerri

Bug Description

Users of docker on the linux-azure kernel are seeing errors similar to "fatal error: workbuf is empty" or "fatal error: unaligned sysUnused". Hugepage configuration is the same for the 4.4 and 4.11 kernels, so there may be a page alignment bug in the Go runtime, the linux-azure kernel, or a dependent library.

Multiple users have reported this issue and we are investigating.

Joshua R. Poulson (jrp) wrote :

https://github.com/golang/go/blob/964639cc338db650ccadeafb7424bc8ebb2c0f6c/src/runtime/mem_linux.go has the message around line 138:

  if uintptr(v)&(physPageSize-1) != 0 || n&(physPageSize-1) != 0 {
      // madvise will round this to any physical page
      // *covered* by this range, so an unaligned madvise
      // will release more memory than intended.
      throw("unaligned sysUnused")
  }

Changed in linux-azure (Ubuntu):
status: New → Confirmed
Joshua R. Poulson (jrp) wrote :

https://golang.org/src/runtime/mgcwork.go around line 306

  // workbuf factory routines. These funcs are used to manage the
  // workbufs.
  // If the GC asks for some work these are the only routines that
  // make wbufs available to the GC.

  func (b *workbuf) checknonempty() {
      if b.nobj == 0 {
          throw("workbuf is empty")
      }
  }

  func (b *workbuf) checkempty() {
      if b.nobj != 0 {
          throw("workbuf is not empty")
      }
  }

Joshua R. Poulson (jrp) wrote :

Trying to build the Go tests, but I am having trouble building it:
$ GOOS=linux GOARCH=amd64 ./bootstrap.bash
#### Copying to ../../go-linux-amd64-bootstrap

#### Cleaning ../../go-linux-amd64-bootstrap

#### Building ../../go-linux-amd64-bootstrap

##### Building Go bootstrap tool.
cmd/dist

##### Building Go toolchain using /usr/lib/go-1.6.
bootstrap/cmd/internal/src
bootstrap/cmd/internal/objabi
bootstrap/cmd/internal/sys
bootstrap/cmd/asm/internal/flags
bootstrap/cmd/internal/dwarf
bootstrap/cmd/internal/bio
bootstrap/math/bits
bootstrap/math/big
bootstrap/cmd/compile/internal/syntax
bootstrap/cmd/internal/gcprog
bootstrap/debug/pe
bootstrap/cmd/asm/internal/lex
bootstrap/cmd/internal/obj
bootstrap/cmd/link/internal/ld
bootstrap/cmd/internal/obj/arm
bootstrap/cmd/internal/obj/arm64
bootstrap/cmd/internal/obj/mips
bootstrap/cmd/internal/obj/ppc64
bootstrap/cmd/internal/obj/s390x
bootstrap/cmd/internal/obj/x86
bootstrap/cmd/compile/internal/types
bootstrap/cmd/asm/internal/arch
bootstrap/cmd/compile/internal/ssa
bootstrap/cmd/asm/internal/asm
bootstrap/cmd/asm
bootstrap/cmd/link/internal/amd64
bootstrap/cmd/link/internal/arm
bootstrap/cmd/link/internal/arm64
bootstrap/cmd/link/internal/mips
bootstrap/cmd/link/internal/mips64
bootstrap/cmd/link/internal/ppc64
bootstrap/cmd/link/internal/s390x
bootstrap/cmd/link/internal/x86
bootstrap/cmd/link
bootstrap/cmd/compile/internal/gc
bootstrap/cmd/compile/internal/amd64
bootstrap/cmd/compile/internal/arm
bootstrap/cmd/compile/internal/arm64
bootstrap/cmd/compile/internal/mips
bootstrap/cmd/compile/internal/mips64
bootstrap/cmd/compile/internal/ppc64
bootstrap/cmd/compile/internal/s390x
bootstrap/cmd/compile/internal/x86
bootstrap/cmd/compile
go tool dist: FAILED: /usr/lib/go-1.6/bin/go install -gcflags=-l -tags=math_big_pure_go -v bootstrap/cmd/...: signal: segmentation fault (core dumped)

Dexuan Cui (decui) wrote :

Let me add 2 more symptoms:

1) dockerd can exit due to SIGSEGV. The issue does NOT always happen: after such a failure, sometimes a retry can succeed.

# systemctl start docker; systemctl status docker
Job for docker.service failed because a fatal signal was delivered to the control process. See "systemctl status docker.service" and "journalctl -xe" for details.
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: failed (Result: signal) since Mon 2017-09-25 23:22:04 UTC; 4ms ago
     Docs: https://docs.docker.com
  Process: 24853 ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS (code=killed, signal=SEGV)
 Main PID: 24853 (code=killed, signal=SEGV)

Sep 25 23:22:02 decui-u-nested systemd[1]: Starting Docker Application Container Engine...
Sep 25 23:22:02 decui-u-nested dockerd[24853]: time="2017-09-25T23:22:02.027039570Z" level=info msg="libcontainerd: new containerd process, pid: 24867"
Sep 25 23:22:04 decui-u-nested systemd[1]: docker.service: Main process exited, code=killed, status=11/SEGV
Sep 25 23:22:04 decui-u-nested systemd[1]: Failed to start Docker Application Container Engine.
Sep 25 23:22:04 decui-u-nested systemd[1]: docker.service: Unit entered failed state.
Sep 25 23:22:04 decui-u-nested systemd[1]: docker.service: Failed with result 'signal'.

2) "code=exited, status=2"

# systemctl start docker; systemctl status docker

Job for docker.service failed because the control process exited with error code. See "systemctl status docker.service" and "journalctl -xe" for details.
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2017-09-25 23:32:37 UTC; 5ms ago
     Docs: https://docs.docker.com
  Process: 26237 ExecStart=/usr/bin/dockerd -H fd:// $DOCKER_OPTS (code=exited, status=2)
 Main PID: 26237 (code=exited, status=2)

Sep 25 23:32:37 decui-u-nested dockerd[26237]: runtime.gcBgMarkWorker(0xc82001e000)
Sep 25 23:32:37 decui-u-nested dockerd[26237]: /usr/lib/go-1.6/src/runtime/mgc.go:1433 +0x25f fp=0xc820483fb8 sp=0xc820483f40
Sep 25 23:32:37 decui-u-nested dockerd[26237]: runtime.goexit()
Sep 25 23:32:37 decui-u-nested dockerd[26237]: /usr/lib/go-1.6/src/runtime/asm_amd64.s:1998 +0x1 fp=0xc820483fc0 sp=0xc820483fb8
Sep 25 23:32:37 decui-u-nested dockerd[26237]: created by runtime.gcBgMarkStartWorkers
Sep 25 23:32:37 decui-u-nested dockerd[26237]: /usr/lib/go-1.6/src/runtime/mgc.go:1330 +0x92
Sep 25 23:32:37 decui-u-nested systemd[1]: docker.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Sep 25 23:32:37 decui-u-nested systemd[1]: Failed to start Docker Application Container Engine.
Sep 25 23:32:37 decui-u-nested systemd[1]: docker.service: Unit entered failed state.
Sep 25 23:32:37 decui-u-nested systemd[1]: docker.service: Failed with result 'exit-code'.

It looks like the issues are more likely caused by dockerd itself.

Dexuan Cui (decui) wrote :

BTW, issue #2 in the last comment doesn't always happen either. Sometimes a retry would succeed.

Joshua R. Poulson (jrp) wrote :

I'm also getting a report that this does not reproduce if everything is recompiled with golang-1.9.

Joshua R. Poulson (jrp) wrote :

If I install golang-1.4 and golang-1.9 and remove golang-1.6 I can build go from source (1.4 is needed for the bootstrapping) and everything runs without error. If I have golang-1.6 installed, I get segfaults.

Marcelo Cerri (mhcerri) on 2017-09-26
Changed in linux-azure (Ubuntu):
assignee: nobody → Marcelo Cerri (mhcerri)
importance: Undecided → High
Marcelo Cerri (mhcerri) wrote :

Does it happen on some specific instance type? I restarted docker more than 50 times without any problem on a Standard DS15 v2 instance (20 vcpus, 140 GB memory) with the linux-azure kernel.

Dexuan Cui (decui) wrote :

I got the issue (comment #4) with Standard_D16_v3 (16 vcpus, 64 GB memory).

I can easily reproduce the issue on an NC6 (GPU) VM: roughly 25% of docker installations fail.

Joshua R. Poulson (jrp) wrote :

It certainly seems to happen much more easily on GPU-based instances.

Joshua R. Poulson (jrp) wrote :

I would test with SR-IOV as well; I'm wondering whether this is related to PCI passthrough.

Marcelo Cerri (mhcerri) wrote :

I was able to reproduce it on a Standard_D16_v3 instance when I create some containers that are configured to restart with dockerd. It doesn't happen very often though, and sometimes I can't reproduce it for some time. I will try to test it on a GPU-based instance.

Jason Greathouse (jgreat) wrote :

I'm seeing this pretty often on Standard_D2s_v3 instances. About 1/3 of my new VM instances are failing. Please let me know if I can provide logs or testing.

Michal Koziorowski (mkozioro) wrote :

On all of my Azure machines using Standard_DS2_v2 (North Europe), telegraf crashed with that error. I'm not using docker.

root@frontvm-fas427000000:/# uname -a
Linux frontvm-fas427000000 4.11.0-1011-azure #11-Ubuntu SMP Tue Sep 19 19:03:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Telegraf: 1.4.0-1

Telegraf works properly on instances with a non-azure kernel.

I'm attaching a syslog with all the information.

Marcelo Cerri (mhcerri) wrote :

The 4.11 base kernel for linux-azure doesn't show the problem, and the same is true for linux-azure-edge (linux-azure-edge causes the failure while its 4.13 base kernel doesn't).

I'm bisecting the delta between them to identify the commit that introduced the problem.

Changed in linux-azure (Ubuntu):
importance: High → Critical
Joshua R. Poulson (jrp) wrote :

It's interesting that telegraf is also made with golang.

Marcelo Cerri (mhcerri) wrote :

Yes. Probably something in the golang runtime is triggering the bug.

The last good version is 4.11.0-1004.4 and the problem still happens with 4.11.0-1005.5. 1005 comprises only Bug #1707285, which introduced several changes. I'm restricting the bisect to those.

Marcelo Cerri (mhcerri) wrote :

Reverting the following commits seems to avoid the problem:

47755dae5ca9 UBUNTU: SAUCE: tracing/hyper-v: trace hyperv_mmu_flush_tlb_others()
0aa1395238f7 UBUNTU: SAUCE: x86/hyper-v: support extended CPU ranges for TLB flush hypercalls
5e61aa7ca654 UBUNTU: SAUCE: x86/hyper-v: use hypercall for remote TLB flush

Josh, can you confirm if that works? I built the latest linux-azure with those patches reverted and placed the deb packages at:

http://kernel.ubuntu.com/~mhcerri/azure/golang/002/

The upstream versions of those commits are significantly different. I will backport them and check if the problem still happens.

Joshua R. Poulson (jrp) wrote :

We will check out the kernel as soon as possible. Thanks!

Marcelo Cerri (mhcerri) wrote :

Backporting the changes from upstream still causes the problem to happen. I'm building the mainline kernel to test it.

Changed in linux-azure (Ubuntu):
status: Confirmed → In Progress
Marcelo Cerri (mhcerri) on 2017-09-28
Changed in linux-azure (Ubuntu):
importance: Critical → High
Marcelo Cerri (mhcerri) wrote :

Mainline doesn't cause docker to fail. I will start checking whether linux-azure has an issue in the TLB code or in the helpers for Hyper-V hypercalls.

Marcelo Cerri (mhcerri) wrote :

Also, the problem only happens when using HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX hypercalls.

Joshua R. Poulson (jrp) wrote :

The test kernel seems to be pretty good so far.

Here are the tests we've done so far in Azure:
1. InfiniBand tests: PASSING
2. GPU tests: PASSING
3. GPU tests on an RDMA-loaded VM (NC24r): PASSING
4. Docker tests: PASSING
   a. Note that we're using the official script to install and verify docker:
      https://get.docker.com/
   b. Verified Docker version: 17.09.0-ce
   c. GoLang version: go1.8.3

Docker tests for the following scenarios are also done:
1. SR-IOV enabled VMs
2. InfiniBand enabled VMs
3. Nvidia GPU driver loaded VMs
No issues found so far.

Also a Docker loop test across 100 Standard_D2s_v3 VMs.
Current progress: 10%. No issues so far.

Our basic validation tests also all pass.

Marcelo Cerri (mhcerri) wrote :

Thanks for all the testing, Josh. I reverted the paravirtualized TLB flushing patches in the test kernel. Do you think we should spin a new kernel without them while we try to find the root cause of the problem?

I enabled the tracepoints available in arch/x86/hyperv/mmu.c for both the mainline and linux-azure kernels and got some interesting information. The mainline kernel never calls flush_tlb_others with TLB_FLUSH_ALL, while the 4.11 and 4.13 linux-azure kernels do so very often.

I'm attaching the tracing files for both kernels. You can check that TLB_FLUSH_ALL is given to flush_tlb_others when `end` is equal to "ffffffffffffffff" (-1ULL).

Also, if I force hyperv_flush_tlb_others_ex() to do a native flush when `end` is equal to TLB_FLUSH_ALL, the problem does not occur. That is an alternative for a temporary fix.

I believe the mainline kernel is carrying the same bug as linux-azure but the problematic path (end == TLB_FLUSH_ALL) is not being executed.

Marcelo Cerri (mhcerri) wrote :

Josh, is there any documentation available for the Hyper-V hypercalls?

Joshua R. Poulson (jrp) wrote :

Yes, let's spin one into proposed with KVM enabled and these patches reverted for SRU.

You might be interested in Vitaly's upstream submission of tracing today; we'll work that into a future PR. As far as hypercalls go, I can always take questions back to the team for detailed answers.

Joshua R. Poulson (jrp) wrote :

You may also not be surprised to see this in a run I did with upstream 4.14-rc2:

[ 27.274499] general protection fault: 0000 [#1] SMP
[ 27.274499] Modules linked in: mlx4_en mlx4_core devlink sb_edac crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hv_utils pcbc hyperv_keyboard ptp snd_pcm aesni_intel hyperv_fb pps_core snd_timer sg aes_x86_64 pci_hyperv serio_raw hv_balloon crypto_simd glue_helper cryptd snd soundcore pcspkr evdev joydev button intel_rapl_perf ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 fscrypto sd_mod hid_generic sr_mod cdrom ata_generic ata_piix libata hv_storvsc hid_hyperv scsi_transport_fc hv_netvsc hid crc32c_intel psmouse scsi_mod i2c_piix4 hv_vmbus floppy
[ 27.274499] CPU: 8 PID: 1030 Comm: ntpd Not tainted 4.14.0-rc2 #1
[ 27.274499] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 01/06/2017
[ 27.274499] task: ffff9bf72f3aa0c0 task.stack: ffffa99322e78000
[ 27.274499] RIP: 0010:0xffffa9930c489000
[ 27.274499] RSP: 0000:ffffa99322e7bc60 EFLAGS: 00000046
[ 27.274499] RAX: 0000000000000000 RBX: ffffc9817f803000 RCX: 0000000100060014
[ 27.274499] RDX: 00002dadff803000 RSI: 0000000100060014 RDI: 0000000000000002
[ 27.274499] RBP: ffffa99322e7bca8 R08: 0000000000000000 R09: 00007f60f0f47000
[ 27.274499] R10: ffffc9817f803020 R11: 0000000000000000 R12: 0000000100060014
[ 27.274499] R13: ffffa99322e7bcb8 R14: 0000000000000001 R15: ffff9be5ae658540
[ 27.274499] FS: 00007f60f2900700(0000) GS:ffff9be5b9000000(0000) knlGS:0000000000000000
[ 27.274499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.274499] CR2: 00007f60f0f46e28 CR3: 000000121c34c000 CR4: 00000000001406e0
[ 27.274499] Call Trace:
[ 27.274499] ? hyperv_flush_tlb_others_ex+0x1b0/0x440
[ 27.274499] flush_tlb_mm_range+0xad/0x110
[ 27.274499] ptep_clear_flush+0x39/0x40
[ 27.274499] wp_page_copy+0x27f/0x630
[ 27.274499] do_wp_page+0x151/0x4f0
[ 27.274499] __handle_mm_fault+0x80c/0x1140
[ 27.274499] ? do_mmap+0x445/0x510
[ 27.274499] handle_mm_fault+0xce/0x1c0
[ 27.274499] __do_page_fault+0x23e/0x4e0
[ 27.274499] do_page_fault+0x2e/0xd0
[ 27.274499] page_fault+0x22/0x30
[ 27.274499] RIP: 0033:0x7f60f26f4f1b
[ 27.274499] RSP: 002b:00007f60f28fed50 EFLAGS: 00010202
[ 27.274499] RAX: 00007f60f0f46e20 RBX: 00007f60ec000ac0 RCX: 0000000000000028
[ 27.274499] RDX: 00007f60f0f46f60 RSI: 00007f60ec000b00 RDI: 000000006fffffff
[ 27.274499] RBP: 00007f60f28fee90 R08: 00007f60f0d42000 R09: 0000000070000021
[ 27.274499] R10: 0000000000000031 R11: 000000006ffffdff R12: 00007f60f28fef78
[ 27.274499] R13: 000000037ffff1a0 R14: 0000000000000801 R15: 00007f60f28fef30
[ 27.274499] Code: <0f> 01 c1 c3 8b c8 b8 11 00 00 00 0f 01 c1 c3 48 8b c1 48 c7 c1 11
[ 27.274499] RIP: 0xffffa9930c489000 RSP: ffffa99322e7bc60
[ 27.274499] ---[ end trace f506aa4e20966c27 ]---

Dexuan Cui (decui) wrote :

@mhcerri: I believe the hypercalls are documented in "Hypervisor Top Level Functional Specification v5.0b.pdf" (https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs).

Marcelo Cerri (mhcerri) wrote :

Including the rationale in the commit messages:

https://lists.ubuntu.com/archives/kernel-team/2017-October/087342.html

Marcelo Cerri (mhcerri) on 2017-10-02
Changed in linux-azure (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Marcelo Cerri (mhcerri)
status: In Progress → Fix Committed
Changed in linux-azure (Ubuntu):
status: In Progress → Fix Committed
Marcelo Cerri (mhcerri) wrote :

Thanks, Dexuan.

Reading the specification, I believe I found the problem. The original implementation assumes the entire hv_vp_set structure is variable and passes `nr_bank + 2` as the variable header size to hv_do_rep_hypercall().

However the specification states:

"Since the fixed header size is implicit, instead of supplying the total header size, only the variable portion is supplied in the input controls."

Based on that, the first two fields of hv_vp_set should not be counted toward the variable header size, and `nr_bank` should be used instead of `nr_bank + 2`.

I'm attaching the backports for the 3 upstream commits plus a fourth patch with that fix. I also built a kernel with those patches for testing purposes; it can be downloaded from:

http://kernel.ubuntu.com/~mhcerri/azure/lp1719045-fix/

I was able to restart dockerd with this kernel hundreds of times without any problem.

Let me know if the solution seems reasonable to you. Once it is tested, I will submit the fix upstream.

Marcelo Cerri (mhcerri) wrote :

The kernel with the reverted changes is already in the pipeline to be published to the archive. So we can test this fix at a normal pace.

tags: added: patch
Marcelo Cerri (mhcerri) wrote :

linux-azure 4.11.0.1013.13 (with the reverted changes) just landed in -proposed. Can you please confirm the issue does not happen with this version?

Joshua R. Poulson (jrp) wrote :

Yes, I've already kicked off the call for testing with the local teams and people experiencing the issue.

Joshua R. Poulson (jrp) wrote :

Early results from @Chris

The proposed kernel looks good from our BVT testing on Hyper-V.
Besides the BVT, @Ionut has also done the following scenarios:
- Compiled go v1.6 and telegraf v1.2 from sources
- Compiled go v1.8.3 and telegraf v1.5 from sources

Both series worked fine, with no segfault errors.

BVT covers:
- Core tests and time sync
- Networking tests
- Memory: DM & runtime memory results

Joshua R. Poulson (jrp) wrote :

Later results also look good. IB, BVT, Deployment, Perf, GPU, LTP, kdump, Docker, and Telegraf tests all passed. Please publish to updates.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-azure - 4.11.0-1013.13

---------------
linux-azure (4.11.0-1013.13) xenial; urgency=low

  * linux-azure: 4.11.0-1013.13 -proposed tracker (LP: #1720818)

  * [Hyper-V] Docker failures with linux-azure 4.11.0-1011 (LP: #1719045)
    - Revert "UBUNTU: SAUCE: tracing/hyper-v: trace hyperv_mmu_flush_tlb_others()"
    - Revert "UBUNTU: SAUCE: x86/hyper-v: support extended CPU ranges for TLB
      flush hypercalls"
    - Revert "UBUNTU: SAUCE: x86/hyper-v: use hypercall for remote TLB flush"

linux-azure (4.11.0-1012.12) xenial; urgency=low

  * linux-azure: 4.11.0-1012.12 -proposed tracker (LP: #1718763)

  * linux-azure: KVM nested virtualization is disabled (LP: #1718740)
    - [Config] azure: enable KVM support

 -- Marcelo Henrique Cerri <email address hidden> Mon, 02 Oct 2017 12:20:26 -0300

Changed in linux-azure (Ubuntu Xenial):
status: Fix Committed → Fix Released
Marcelo Cerri (mhcerri) wrote :

@jrp, should we include the following upstream fixes into the azure kernel?

ab7ff471aa5d x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing
60d73a7c9660 x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structures
a3b742439292 x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUs

I can build a test kernel with those changes if you prefer.

Joshua R. Poulson (jrp) wrote :

@mhcerri Yes, we should queue these up, but we'd like to switch our focus to 4.13 soon.

Thanks, --jrp
