Comment 3 for bug 1785623

John A Meinel (jameinel) wrote :

Trying to reproduce this, I started with this test:
 ctx := context.TODO()
 ctx, cancel := context.WithCancel(ctx)
 started := make(chan struct{})
 go func() {
  select {
  case <-started:
  case <-time.After(jtesting.LongWait):
   c.Fatalf("timed out waiting %s for started", jtesting.LongWait)
  }
  <-time.After(10 * time.Millisecond)
  if cancel != nil {
   c.Logf("cancelling")
   cancel()
  }
 }()
 listen, err := net.Listen("tcp4", ":0")
 c.Assert(err, jc.ErrorIsNil)
 defer listen.Close()
 addr := listen.Addr().String()
 c.Logf("listening at: %s", addr)
 // Note that we Listen, but we never Accept
 close(started)
 info := &Info{
  Addrs: []string{addr},
 }
 opts := DialOpts{
  DialAddressInterval: 1 * time.Millisecond,
  RetryDelay: 1 * time.Millisecond,
  Timeout: 10 * time.Millisecond,
  DialTimeout: 5 * time.Millisecond,
 }
 // Uncomment the next line to get "try was stopped":
 // listen.Close()
 _, err = dialAPI(ctx, info, opts)
 c.Assert(err, jc.ErrorIsNil)

Some notes:

1) If the client connects to a socket whose server has called Listen but never calls Accept, the client hangs indefinitely.
This *might* be what we're seeing with Agents that end up hung. I don't know how this would look on the server side, but it matches the symptom "client tries to dial but never interrupts the attempt to retry".

2) With listen.Close() uncommented, the dial does progress, but it fails with the error "try was stopped", which is certainly not a helpful error. At the very least, an error along the lines of "exceeded 2s trying to connect" would have been much more useful, possibly also including the address that we were trying to connect to.