Ubuntu on IBM z Systems

juju-db fails to start -- WiredTiger reports Input/output error

Bug #1632030 reported by Vance Morris on 2016-10-10

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Invalid	Critical	Alexis Bruemmer
Ubuntu on IBM z Systems	Invalid	Critical	Unassigned

Bug Description

$ juju --version
2.0-rc3-xenial-s390x

$ lxd --version
2.0.4

Controller bootstrapped into clean LXD/local environment 3 days ago. Multiple models were created and deleted on the first day, and then the system sat idle over the weekend.

Coming in today, most "juju X" commands simply hang with no output. Running with the --debug switch shows that the wss connection to the API is failing:

$ juju switch default --debug
13:13:00 INFO juju.cmd supercommand.go:63 running juju [2.0-rc3 gc go1.6.2]
13:13:00 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "switch", "default", "--debug"}
13:13:00 INFO juju.juju api.go:72 connecting to API addresses: [10.113.186.232:17070]
13:13:00 INFO juju.api apiclient.go:507 dialing "wss://10.113.186.232:17070/api"
13:13:02 INFO juju.api apiclient.go:507 dialing "wss://10.113.186.232:17070/api"
^C

Attempts to restart jujud manually are successful, but I noticed mongodb connection errors in the log:

2016-10-10 17:18:03 WARNING juju.mongo open.go:134 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection refused
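
The "connection refused" in the warning above suggests that nothing is listening on the mongod port. As a minimal sketch (assuming Python 3 is available on the controller; `port_is_open` is a hypothetical helper name, not part of juju), the port can be probed directly:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection resolves the address and performs the TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `port_is_open("127.0.0.1", 37017)` with the address from the warning would be expected to return False for as long as juju-db is down.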

Checking the logs for juju-db, it looks bad:
https://gist.github.com/vmorris/7750b8f9d3dfaa14238df39f7628ea3a

Further issues found in dmesg:
https://gist.github.com/vmorris/f81217815059c6fc748eaba8cc1b5318

I've attached the full mongodb.log.

Vance Morris (vmorris) wrote on 2016-10-10:

mongodb.log (1.7 MiB, text/plain)

Anastasia (anastasia-macmood) on 2016-10-10

Changed in juju:
status:	New → Triaged
importance:	Undecided → High
milestone:	none → 2.0.1

Michael Hudson-Doyle (mwhudson) wrote on 2016-10-10:

Hi, I ported the s390x support from master back to the 3.2 branch for juju-mongodb, so it's possible I've missed something; I'm not really a deep expert on mongodb or s390x. My patches do pass mongodb's own tests, which are reasonably comprehensive, but I don't really understand what is going on here...

The first error is this:

Oct 08 14:42:57 juju-84a348-0 mongod.37017[17241]: [thread1] WiredTiger (52) [1475937777:172958][17241:0x3ff977ff910], file:collection-27-3785058392042379666.wt, WT_SESSION.checkpoint: /var/lib/juju/db/collection-27-3785058392042379666.wt: handle-write: pwrite: failed to write 8192 bytes at offset 1044480: Invalid exchange
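
For anyone triaging similar logs, the line above can be picked apart mechanically. This is only a sketch: `parse_wt_write_error` and its regular expression are hypothetical, built from this single log line rather than from any WiredTiger specification, and it assumes the number in parentheses after "WiredTiger" is the errno value. The name lookup uses Python's standard errno module, so the result depends on the platform the script runs on.

```python
import errno
import re
from typing import Optional

# Pattern derived from the one log line quoted above; other WiredTiger
# messages may be shaped differently and will simply not match.
WT_WRITE_ERROR = re.compile(
    r"WiredTiger \((?P<code>\d+)\).*?"
    r"(?P<path>/\S+): handle-write: pwrite: "
    r"failed to write (?P<size>\d+) bytes at offset (?P<offset>\d+): "
    r"(?P<message>.+)$"
)


def parse_wt_write_error(line: str) -> Optional[dict]:
    """Extract errno, file path, size, offset and message from a log line."""
    match = WT_WRITE_ERROR.search(line.strip())
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "errno": code,
        # Symbolic name as known to the local platform; None if unknown.
        "errno_name": errno.errorcode.get(code),
        "path": match.group("path"),
        "size": int(match.group("size")),
        "offset": int(match.group("offset")),
        "message": match.group("message"),
    }
```

Run against the line above, this yields errno 52, a write size of 8192 bytes at offset 1044480, and the message "Invalid exchange".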

"Invalid exchange" means EBADE and grepping the kernel suggests you are using dasd storage? Is it possible your disk has gone bad or something (although I guess that sort of thing is less likely on big iron). Is there anything in dmesg or so on?

Anastasia (anastasia-macmood) on 2016-10-10

Changed in juju:
status:	Triaged → Incomplete
milestone:	2.0.1 → none
importance:	High → Undecided

Antonio Rosales (arosales) wrote on 2016-10-12:

@Vance,

Thanks for the bug report. I am trying to reproduce this in our environment. In doing so I wanted to confirm what s390x environment you are working in. Specifically:
- Ubuntu on LPAR
- Ubuntu on VM
- CPU allocation
- Memory allocation
- Disk/Storage "/" allocation

-thanks,
Antonio

Vance Morris (vmorris) wrote on 2016-10-12:

Hi Antonio,

I'm running Ubuntu 16.04.1 on LPAR - 32 CPU and 40G RAM

I've got 3 DASD (ECKD) combined to 150G in an LVM / (root) with boot separated out in its own partition.

A 4th 50G DASD is used for ZFS pool.

I was unable to recover the logs requested in comment #2; unfortunately the z13 I was working on had to be POR'd (power-on reset) yesterday. (Just FYI, I work in a test environment and there are always interesting things failing ;))

After I was able to get back into the LPAR yesterday afternoon, I unfortunately was hasty in restarting the workload, and blew away the container that contained the logs.

At this time, I've had juju bootstrapped and the openstack-on-lxd bundle deployed for about 18 hours. No issue with juju itself yet.

Frank Heimes (fheimes) on 2016-10-12

Changed in ubuntu-z-systems:
status:	New → Incomplete

Vance Morris (vmorris) wrote on 2016-10-13:

After starting the workload (openstack-on-lxd bundle) and letting it sit idle for some time, the ceph-radosgw unit went into an error state, with the machine reporting down.

Looking into the container, the jujud-machine.service is failing to start, and I'm detecting the following messages in dmesg output:

[134426.946494] User process fault: interruption code 003b ilc:2 in beam.smp[2aa02780000+289000]
[134426.946510] failing address: 0000000000000000 TEID: 0000000000000400
[134426.946512] Fault in primary space mode while using user ASCE.
[134426.946515] AS:00000001a4fb81c7 R3:0000000000000024
[134426.946519] CPU: 0 PID: 124954 Comm: async_10 Tainted: P O 4.4.0-38-generic #57-Ubuntu
[134426.946521] task: 000000067565e270 ti: 00000001ed4f4000 task.ti: 00000001ed4f4000
[134426.946522] User PSW : 0705000180000000 000002aa0292026e
[134426.946524] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 EA:3
                User GPRS: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
[134426.946526] 000002aa02920268 0000000000000100 000003ff7c972f10 000003ff7d720090
[134426.946527] 000003ff00000000 000002aa02a6a240 000003ff0000001b 000003ff852c0078
[134426.946529] 000003ff84f9c000 000002aa029b26c8 000002aa02920268 000003ff7c972db0
[134426.946538] User Code: 000002aa02920262: a7290018 lghi %r2,24
                           000002aa02920266: 0de1 basr %r14,%r1
                          #000002aa02920268: e31070280004 lg %r1,40(%r7)
                          >000002aa0292026e: 50a01008 st %r10,8(%r1)
                           000002aa02920272: b24f0010 ear %r1,%a0
                           000002aa02920276: e32070280004 lg %r2,40(%r7)
                           000002aa0292027c: 5080200c st %r8,12(%r2)
                           000002aa02920280: eb110020000d sllg %r1,%r1,32
[134426.946549] Last Breaking-Event-Address:
[134426.946551] [<000002aa028069e6>] 0x2aa028069e6

Anastasia (anastasia-macmood) on 2016-10-14

Changed in juju:
status:	Incomplete → Triaged
importance:	Undecided → Critical
milestone:	none → 2.0.1

Frank Heimes (fheimes) on 2016-10-14

Changed in ubuntu-z-systems:
status:	Incomplete → Triaged

Alexis Bruemmer (alexis-bruemmer) on 2016-10-19

Changed in juju:
assignee:	nobody → Alexis Bruemmer (alexis-bruemmer)

Frank Heimes (fheimes) on 2016-10-20

Changed in ubuntu-z-systems:
importance:	Undecided → Critical

Curtis Hovey (sinzui) on 2016-10-28

Changed in juju:
milestone:	2.0.1 → none

cargonza (cargonza) wrote on 2016-11-15:

Hi, what are the next steps on this item? Have we determined the cause of the process not restarting? Thank you!

Vance Morris (vmorris) wrote on 2016-11-15:

I was able to duplicate the issue once, but haven't made any attempts to do so in weeks.

Anastasia (anastasia-macmood) on 2016-11-16

Changed in juju:
milestone:	none → 2.1.0

Alexis Bruemmer (alexis-bruemmer) wrote on 2016-12-12:

Based on comment #2 and the lack of replication, this looks to be an issue with the environment/hardware; marking invalid.

Changed in juju:
status:	Triaged → Invalid

Anastasia (anastasia-macmood) on 2016-12-12

Changed in juju:
milestone:	2.1.0 → none

Frank Heimes (fheimes) on 2016-12-13

Changed in ubuntu-z-systems:
status:	Triaged → Invalid
