juju-db fails to start -- WiredTiger reports Input/output error

Bug #1632030 reported by Vance Morris
This bug affects 2 people
Affects                 | Status  | Importance | Assigned to     | Milestone
Canonical Juju          | Invalid | Critical   | Alexis Bruemmer |
Ubuntu on IBM z Systems | Invalid | Critical   | Unassigned      |

Bug Description

$ juju --version
2.0-rc3-xenial-s390x

$ lxd --version
2.0.4

Controller bootstrapped into clean LXD/local environment 3 days ago. Multiple models were created and deleted on the first day, and then the system sat idle over the weekend.

Coming in today, most "juju X" commands simply hang with no output. Running with --debug shows that the wss connection to the API server never completes:

$ juju switch default --debug
13:13:00 INFO juju.cmd supercommand.go:63 running juju [2.0-rc3 gc go1.6.2]
13:13:00 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "switch", "default", "--debug"}
13:13:00 INFO juju.juju api.go:72 connecting to API addresses: [10.113.186.232:17070]
13:13:00 INFO juju.api apiclient.go:507 dialing "wss://10.113.186.232:17070/api"
13:13:02 INFO juju.api apiclient.go:507 dialing "wss://10.113.186.232:17070/api"
^C

Attempts to restart jujud manually are successful, but I noticed mongodb connection errors in the log:

2016-10-10 17:18:03 WARNING juju.mongo open.go:134 mongodb connection failed, will retry: dial tcp 127.0.0.1:37017: getsockopt: connection refused
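
A "connection refused" on 127.0.0.1:37017 means nothing is listening on the juju-db port inside the controller container. As a minimal sketch (the helper name `port_is_open` is hypothetical and is not part of juju), the same condition can be checked from inside the container with a plain TCP connect:

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        # create_connection raises OSError (e.g. ConnectionRefusedError)
        # when nothing is listening or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 37017 is the mongod port taken from the warning above.
    print(port_is_open("127.0.0.1", 37017))
```

A result of False here is consistent with juju-db (mongod) not running, which matches the warning logged by jujud.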

Checking the logs for juju-db, it looks bad:
https://gist.github.com/vmorris/7750b8f9d3dfaa14238df39f7628ea3a

Further issues found in dmesg:
https://gist.github.com/vmorris/f81217815059c6fc748eaba8cc1b5318

I've attached the full mongodb.log.

Vance Morris (vmorris) wrote :
Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.0.1
Michael Hudson-Doyle (mwhudson) wrote :

Hi, I ported the s390x support from master back to the 3.2 branch for juju-mongodb, so it's possible I've missed something; I'm not really a deep expert on mongodb or s390x. But my patches pass mongodb's own tests, which are reasonably comprehensive, and I don't really understand what is going on here...

The first error is this:

Oct 08 14:42:57 juju-84a348-0 mongod.37017[17241]: [thread1] WiredTiger (52) [1475937777:172958][17241:0x3ff977ff910], file:collection-27-3785058392042379666.wt, WT_SESSION.checkpoint: /var/lib/juju/db/collection-27-3785058392042379666.wt: handle-write: pwrite: failed to write 8192 bytes at offset 1044480: Invalid exchange

"Invalid exchange" means EBADE and grepping the kernel suggests you are using dasd storage? Is it possible your disk has gone bad or something (although I guess that sort of thing is less likely on big iron). Is there anything in dmesg or so on?
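
The mapping between the message text and the errno name can be checked with a few lines of Python. This is only a sketch: the helper name `describe_errno` is hypothetical, and it assumes that the "(52)" in the WiredTiger line is the errno value, which is consistent with EBADE being 52 on Linux. The symbolic name and message come from the local platform, so the result is only meaningful when run on Linux.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    # errno.errorcode maps numeric values to names such as 'EBADE'.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    # 52 is the number shown in parentheses in the WiredTiger log line.
    print(describe_errno(52))
```

On a Linux system with glibc this is expected to report EBADE with the text "Invalid exchange", matching the end of the pwrite failure message.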

Changed in juju:
status: Triaged → Incomplete
milestone: 2.0.1 → none
importance: High → Undecided
Antonio Rosales (arosales) wrote :

@Vance,

Thanks for the bug report. I am trying to reproduce this in our environment. In doing so I wanted to confirm what s390x environment you are working in. Specifically:
- Ubuntu on LPAR
- Ubuntu on VM
- CPU allocation
- Memory allocation
- Disk/Storage "/" allocation

-thanks,
Antonio

Vance Morris (vmorris) wrote :

Hi Antonio,

I'm running Ubuntu 16.04.1 on LPAR - 32 CPU and 40G RAM

I've got 3 DASD (ECKD) combined to 150G in an LVM / (root) with boot separated out in its own partition.

A 4th 50G DASD is used for ZFS pool.

I was unable to recover the logs requested in comment #2; unfortunately, the z13 I was working on had to go through a POR yesterday. (Just FYI, I work in a test environment and there are always interesting things failing ;))

After I was able to get back into the LPAR yesterday afternoon, I was unfortunately hasty in restarting the workload and blew away the container that held the logs.

At this time, I've had juju bootstrapped and the openstack-on-lxd bundle deployed for about 18 hours. No issue with juju itself yet.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Incomplete
Vance Morris (vmorris) wrote :

After starting the workload (openstack-on-lxd bundle) and letting it sit idle for some time, the ceph-radosgw unit went into an error state, with the machine reporting down.

Looking into the container, the jujud-machine.service is failing to start, and I'm detecting the following messages in dmesg output:

[134426.946494] User process fault: interruption code 003b ilc:2 in beam.smp[2aa02780000+289000]
[134426.946510] failing address: 0000000000000000 TEID: 0000000000000400
[134426.946512] Fault in primary space mode while using user ASCE.
[134426.946515] AS:00000001a4fb81c7 R3:0000000000000024
[134426.946519] CPU: 0 PID: 124954 Comm: async_10 Tainted: P O 4.4.0-38-generic #57-Ubuntu
[134426.946521] task: 000000067565e270 ti: 00000001ed4f4000 task.ti: 00000001ed4f4000
[134426.946522] User PSW : 0705000180000000 000002aa0292026e
[134426.946524] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:1 AS:0 CC:0 PM:0 EA:3
                User GPRS: 0000000000000001 0000000000000000 0000000000000000 0000000000000000
[134426.946526] 000002aa02920268 0000000000000100 000003ff7c972f10 000003ff7d720090
[134426.946527] 000003ff00000000 000002aa02a6a240 000003ff0000001b 000003ff852c0078
[134426.946529] 000003ff84f9c000 000002aa029b26c8 000002aa02920268 000003ff7c972db0
[134426.946538] User Code: 000002aa02920262: a7290018 lghi %r2,24
                           000002aa02920266: 0de1 basr %r14,%r1
                          #000002aa02920268: e31070280004 lg %r1,40(%r7)
                          >000002aa0292026e: 50a01008 st %r10,8(%r1)
                           000002aa02920272: b24f0010 ear %r1,%a0
                           000002aa02920276: e32070280004 lg %r2,40(%r7)
                           000002aa0292027c: 5080200c st %r8,12(%r2)
                           000002aa02920280: eb110020000d sllg %r1,%r1,32
[134426.946549] Last Breaking-Event-Address:
[134426.946551] [<000002aa028069e6>] 0x2aa028069e6

Changed in juju:
status: Incomplete → Triaged
importance: Undecided → Critical
milestone: none → 2.0.1
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Incomplete → Triaged
Changed in juju:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → Critical
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0.1 → none
cargonza (cargonza) wrote :

Hi, what are the next steps on this item? Have we determined the cause of the process not restarting? Thank you!

Vance Morris (vmorris) wrote :

I was able to duplicate the issue once, but haven't made any attempts to do so in weeks.

Changed in juju:
milestone: none → 2.1.0
Alexis Bruemmer (alexis-bruemmer) wrote :

Based on comment #2 and the lack of replication, this looks to be an issue with the environment/hardware; marking invalid.

Changed in juju:
status: Triaged → Invalid
Changed in juju:
milestone: 2.1.0 → none
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → Invalid