Trying to reproduce this, I started with this test:
    ctx := context.TODO()
    ctx, cancel := context.WithCancel(ctx)
    started := make(chan struct{})
    go func() {
        select {
        case <-started:
        case <-time.After(jtesting.LongWait):
            // Errorf rather than Fatalf: this is not the test goroutine,
            // so Fatalf would only end this goroutine, not the test.
            c.Errorf("timed out waiting %s for started", jtesting.LongWait)
            return
        }
        <-time.After(10 * time.Millisecond)
        c.Logf("cancelling")
        cancel()
    }()
    listen, err := net.Listen("tcp4", ":0")
    c.Assert(err, jc.ErrorIsNil)
    defer listen.Close()
    addr := listen.Addr().String()
    c.Logf("listening at: %s", addr)
    // Note that we Listen, but we never Accept
    close(started)
    info := &Info{
        Addrs: []string{addr},
    }
    opts := DialOpts{
        DialAddressInterval: 1 * time.Millisecond,
        RetryDelay:          1 * time.Millisecond,
        Timeout:             10 * time.Millisecond,
        DialTimeout:         5 * time.Millisecond,
    }
    // uncomment to get "try was stopped"
    // listen.Close()
    _, err = dialAPI(ctx, info, opts)
    c.Assert(err, jc.ErrorIsNil)
Some notes:
1) If the client connects to a socket where the server has called Listen but never calls Accept, the client hangs indefinitely.
This *might* be what we're seeing with Agents that end up hung. I don't know how this would look on the server side, but it matches the symptom of "client tries to dial but never interrupts to retry".
2) With listen.Close() uncommented, the dial does make progress, but it fails with "try was stopped", which is not a helpful error. Something like "exceeded 2s trying to connect" would be much more useful, ideally along with the address we were trying to connect to.