juju controllers stuck with juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped

Bug #1785623 reported by Junien Fridrick on 2018-08-06
Affects    Importance  Assigned to
juju       High        John A Meinel
juju 2.3   Medium      John A Meinel
juju 2.4   Medium      John A Meinel

Bug Description

Hi,

The following happened on a 2.3.8 HA controller with 3 nodes.

At the start of all this, machine 1 is the mongodb primary. It all starts with two jujud processes getting OOM-killed on machine 2 (root cause unknown):

Aug 5 10:06:23 machine-2 kernel: [13712355.881318] Out of memory: Kill process 17322 (jujud) score 552 or sacrifice child
Aug 5 10:06:23 machine-2 kernel: [13712355.888758] Out of memory: Kill process 21610 (jujud) score 552 or sacrifice child

systemd then restarts jujud-machine-2. Right after this, a new mongodb primary election takes place, presumably because the primary on machine 1 is overloaded:

Aug 5 10:06:47 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Connecting to <machine-1>:37017
Aug 5 10:06:48 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Successfully connected to <machine-1>:37017, took 75ms (2 connections now open to <machine-1>:37017)
Aug 5 10:07:38 machine-2 mongod.37017[11143]: [ReplicationExecutor] syncing from: <machine-1>:37017
Aug 5 10:07:48 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host <machine-1>:37017 because the pool meets constraints; 1 connections to that host remain open
Aug 5 10:08:13 machine-2 mongod.37017[11143]: [ReplicationExecutor] Not starting an election, since we are not electable due to: Not standing for election because member is not caught up enough to the most up-to-date member to call for priority takeover - must be within 2 seconds (mask 0x400)
Aug 5 10:08:14 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Ending connection to host <machine-1>:37017 due to bad connection status; 0 connections to that host remain open
Aug 5 10:08:14 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Operation timed out
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [rsBackgroundSync] Socket recv() timeout <machine-1>:37017
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [rsBackgroundSync] sync producer problem: 9001 socket exception [RECV_TIMEOUT] server [<machine-1>:37017]
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [ReplicationExecutor] could not find member to sync from
Aug 5 10:08:16 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] conducting a dry run election to see if we could be elected
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] dry election run succeeded, running for election
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] election succeeded, assuming primary role in term 652
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [ReplicationExecutor] transition to PRIMARY
Aug 5 10:08:23 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:24 machine-2 mongod.37017[11143]: [rsSync] transition to primary complete; database writes are now permitted
Aug 5 10:08:26 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Couldn't get a connection within the time limit
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to <machine-1>:37017 due to failed operation on a connection
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - HostUnreachable: short read
Aug 5 10:08:32 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to close stream: Transport endpoint is not connected
Aug 5 10:08:36 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - ExceededTimeLimit: Operation timed out
Aug 5 10:08:38 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Couldn't get a connection within the time limit
Aug 5 10:08:38 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Failed to heartbeat to :27017 - HostUnreachable: short read
Aug 5 10:08:38 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-BGSync-0] Dropping all pooled connections to <machine-1>:37017 due to failed operation on a connection
Aug 5 10:08:50 machine-2 mongod.37017[11143]: [ReplicationExecutor] Error in heartbeat request to <machine-1>:37017; ExceededTimeLimit: Couldn't get a connection within the time limit
Aug 5 10:08:52 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Failed to connect to <machine-1>:37017 - ExceededTimeLimit: Operation timed out
Aug 5 10:08:52 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:08:53 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Successfully connected to <machine-1>:37017, took 705ms (1 connections now open to <machine-1>:37017)
Aug 5 10:08:53 machine-2 mongod.37017[11143]: [ReplicationExecutor] Member <machine-1>:37017 is now in state SECONDARY
Aug 5 10:08:59 machine-2 mongod.37017[11143]: [ReplicationExecutor] dropping unhealthy pooled connection to <machine-1>:37017
Aug 5 10:08:59 machine-2 mongod.37017[11143]: [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
Aug 5 10:08:59 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Connecting to <machine-1>:37017
Aug 5 10:09:03 machine-2 mongod.37017[11143]: [NetworkInterfaceASIO-Replication-0] Successfully connected to <machine-1>:37017, took 4507ms (1 connections now open to <machine-1>:37017)
Aug 5 10:09:03 machine-2 mongod.37017[11143]: [ReplicationExecutor] Member <machine-1>:37017 is now in state ROLLBACK
Aug 5 10:09:10 machine-2 mongod.37017[11143]: [ReplicationExecutor] Member <machine-1>:37017 is now in state SECONDARY

On machine 1, nothing explains why mongodb is overloaded; the only relevant logs are:

Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] Member <machine-2>:37017 is now in state PRIMARY
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:24 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:26 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:26 machine-1 mongod.37017[23313]: message repeated 3 times: [ [ReplicationExecutor] stepping down from primary, because a new term has begun: 652]
Aug 5 10:08:28 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:28 machine-1 mongod.37017[23313]: message repeated 3 times: [ [ReplicationExecutor] stepping down from primary, because a new term has begun: 652]
Aug 5 10:08:28 machine-1 mongod.37017[23313]: [ReplicationExecutor] Member <machine-0>:37017 is now in state SECONDARY
Aug 5 10:08:30 machine-1 mongod.37017[23313]: [ReplicationExecutor] stepping down from primary, because a new term has begun: 652
Aug 5 10:08:30 machine-1 mongod.37017[23313]: message repeated 3 times: [ [ReplicationExecutor] stepping down from primary, because a new term has begun: 652]
Aug 5 10:08:31 machine-1 mongod.37017[23313]: [replExecDBWorker-1] transition to SECONDARY

That said, some findAndModify commands were already taking ~10s to complete before this happened.

Anyway, it's around that time that the controllers on machines 0 and 2 stopped logging anything but:

2018-08-05 10:08:55 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:01 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:07 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:14 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:20 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:26 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped
2018-08-05 10:09:32 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped

Also, the machine agents for 0 and 2 are in status "down":
Machine  State    DNS      Inst id  Series  AZ    Message
0        down     1.1.1.1  <uuid>   xenial  nova  ACTIVE
1        started  1.1.1.2  <uuid>   xenial  nova  ACTIVE
2        down     1.1.1.3  <uuid>   xenial  nova  ACTIVE

The goroutine dumps are available for machine 0 (https://pastebin.canonical.com/p/WjBKPbwBc2/) and machine 2 (https://pastebin.canonical.com/p/V6jnQxVvZP/).

Please let us know quickly if you need anything else from us: we need to restart the controllers on machines 0 and 2 to revive them and to spread the load across all three controllers.

Thanks!

Junien Fridrick (axino) on 2018-08-06
tags: added: canonical-is
John A Meinel (jameinel) wrote :

The one thing google-fu turns up is that
  Failed to connect to <machine-1>:37017 - HostUnreachable: short read

might be occurring if one machine is configured for SSL and the other is not. However, AFAIK we always run mongod with --ssl.

As for:
2018-08-05 10:08:55 ERROR juju.worker.dependency engine.go:551 "api-caller" manifold worker returned unexpected error: cannot open api: try was stopped

I was able to track down that it occurs when we start trying to connect and something causes us to abort (that error is given during tomb.Dying(), which means something called try.Kill()).

However, it isn't clear why we would have called try.Kill(). The connect function, dialWebsocketMulti, calls Kill whenever it returns (via defer try.Kill()), but it should only exit with either an error or a success. I wonder whether "try was stopped" means we are suppressing the actual error.
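To illustrate the suppression hypothesis, here is a minimal toy sketch (not the real parallel.Try or dialWebsocketMulti code; the type and function names are invented for illustration) of how killing a try after a failed attempt can replace the informative dial error with the generic "try was stopped":

```go
package main

import (
	"errors"
	"fmt"
)

// errStopped mirrors the unhelpful message from the bug.
var errStopped = errors.New("try was stopped")

// try is a toy stand-in: it remembers the last attempt's error but,
// once killed, reports only the generic stopped error.
type try struct {
	killed  bool
	lastErr error
}

func (t *try) Kill()          { t.killed = true }
func (t *try) note(err error) { t.lastErr = err }

func (t *try) Result() error {
	if t.killed {
		return errStopped // the informative lastErr is suppressed here
	}
	return t.lastErr
}

// dial sketches the suspected flow: an attempt fails with a concrete
// error, then the whole try is killed before anything succeeds.
func dial() error {
	t := &try{}
	t.note(errors.New("dial tcp 1.1.1.2:17070: connection refused"))
	t.Kill() // e.g. a deferred try.Kill() on an early return
	return t.Result()
}

func main() {
	// Prints "try was stopped"; the connection-refused cause is lost.
	fmt.Println(dial())
}
```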

John A Meinel (jameinel) on 2018-08-09
Changed in juju:
status: New → In Progress
importance: Undecided → High
assignee: nobody → John A Meinel (jameinel)
John A Meinel (jameinel) wrote :

Sorry, I had done some debugging but failed to hit submit yesterday. I'll at least put up a PR to check the errors.

John A Meinel (jameinel) wrote :

Trying to reproduce this, I started with this test:
    ctx := context.TODO()
    ctx, cancel := context.WithCancel(ctx)
    started := make(chan struct{})
    go func() {
        select {
        case <-started:
        case <-time.After(jtesting.LongWait):
            c.Fatalf("timed out waiting %s for started", jtesting.LongWait)
        }
        <-time.After(10 * time.Millisecond)
        if cancel != nil {
            c.Logf("cancelling")
            cancel()
        }
    }()
    listen, err := net.Listen("tcp4", ":0")
    c.Assert(err, jc.ErrorIsNil)
    defer listen.Close()
    addr := listen.Addr().String()
    c.Logf("listening at: %s", addr)
    // Note that we Listen, but we never Accept
    close(started)
    info := &Info{
        Addrs: []string{addr},
    }
    opts := DialOpts{
        DialAddressInterval: 1 * time.Millisecond,
        RetryDelay:          1 * time.Millisecond,
        Timeout:             10 * time.Millisecond,
        DialTimeout:         5 * time.Millisecond,
    }
    // uncomment to get "try was stopped"
    // listen.Close()
    _, err = dialAPI(ctx, info, opts)
    c.Assert(err, jc.ErrorIsNil)

Some notes:

1) If you connect to a socket whose server calls Listen but never Accept, the client hangs indefinitely.
This *might* be what we're seeing with agents that end up hung. I don't know how this would look on the server side, but it matches the symptom of "client tries to dial but never interrupts to retry".

2) With listen.Close() uncommented, the dial does progress, and it fails with "try was stopped", which is certainly not a helpful error. At the very least, knowing whether it was something like "exceeded 2s trying to connect" would have been much more useful, ideally along with the address we were trying to connect to.

John A Meinel (jameinel) wrote :

https://github.com/juju/juju/pull/9066 reports a better error against 2.3.

Changed in juju:
milestone: none → 2.5-beta1
John A Meinel (jameinel) on 2018-11-13
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released