Exclude Ubuntu Fan addresses

Bug #1942804 reported by Haw Loeung
Affects: Canonical Juju
Status: Triaged
Importance: Medium
Assigned to: Joseph Phillips
Milestone: none

Bug Description

Hi,

For dedicated Juju 2.x controllers, agent.conf has the following:

apiaddresses:
- 10.131.X.Y:17070
- 252.170.0.1:17070
- 10.131.X.Y:17070
- 252.248.0.1:17070
- 10.131.X.Y:17070
- 252.109.0.1:17070

From the above, the 252.X.Y.Z addresses are all from Ubuntu Fan Networking[1][2]. These are unreachable in setups where the Juju controllers are in separate environments. I think Juju should exclude Fan addresses both when advertising the list of controller IPs and on the agent side when discovering Juju controllers.

Looking at the ubuntu-fan package source, in particular debian/ubuntu-fan.postinst, it seems these are the netblocks used:

250.0.0.0/8
251.0.0.0/8
252.0.0.0/8
253.0.0.0/8
254.0.0.0/8

[1]https://ubuntu.com/blog/fan-networking
[2]https://wiki.ubuntu.com/FanNetworking
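
For illustration, here is a minimal sketch in Go of the filtering being asked for, assuming the netblocks listed above; isFanAddr, fanNetblocks and the sample address list are made-up names for this sketch, not Juju's actual code:

package main

import (
	"fmt"
	"net"
	"strings"
)

// Fan netblocks as listed in debian/ubuntu-fan.postinst (see above).
var fanNetblocks = []string{
	"250.0.0.0/8",
	"251.0.0.0/8",
	"252.0.0.0/8",
	"253.0.0.0/8",
	"254.0.0.0/8",
}

// isFanAddr reports whether a "host:port" API address falls inside one of
// the Fan netblocks.
func isFanAddr(hostPort string) bool {
	host, _, err := net.SplitHostPort(hostPort)
	if err != nil {
		host = hostPort // bare IP with no port
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, block := range fanNetblocks {
		_, cidr, err := net.ParseCIDR(block)
		if err == nil && cidr.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// Sample apiaddresses, mirroring the agent.conf snippet above.
	apiAddresses := []string{
		"10.131.4.170:17070",
		"252.170.0.1:17070",
		"252.109.0.1:17070",
	}
	var kept []string
	for _, a := range apiAddresses {
		if !isFanAddr(a) {
			kept = append(kept, a)
		}
	}
	fmt.Println(strings.Join(kept, ", ")) // prints: 10.131.4.170:17070
}

Dropping every address in those /8 ranges from the advertised apiaddresses would stop agents from dialling endpoints they can never reach.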

Haw Loeung (hloeung)
summary: - Exclude Ubuntu FAN addresses
+ Exclude Ubuntu Fan addresses
Revision history for this message
John A Meinel (jameinel) wrote :

Just to confirm, Haw, is this something that causes actual issues, or just a case of Juju not taking routing into account when supplying addresses? (I would assume the logs include entries indicating that it tries and fails to use those addresses, but that it successfully connects to the other addresses.)

I'm loath to just add a special case that strips them, because I would like to have a clearer model of networking that includes routing, one that would naturally say "you're not on the same fan as the controller, therefore those addresses are not routable, and should thus not be included."

In a practical sense, we know that Fan isn't inherently better than normal addresses for this use case (we won't be putting controllers into LXD containers any time soon). So we could just strip them.

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Haw Loeung (hloeung) wrote :

It's trying to connect to the controllers via the advertised Fan addresses, but does eventually time out. Example snippet from a machine provisioned in Taipei using Juju controllers in PS5:

| 2021-09-09 08:40:07 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
| 2021-09-09 08:41:11 INFO juju.worker.logger logger.go:136 logger worker stopped
| 2021-09-09 08:41:17 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: read tcp 211.75.xxx.xxx:37388->10.131.4.170:17070: read: connection reset by peer
| 2021-09-09 08:41:24 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.170.0.1:17070: i/o timeout
| 2021-09-09 08:41:31 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:41:40 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.170.0.1:17070: i/o timeout
| 2021-09-09 08:41:49 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:41:59 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:42:10 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:42:22 INFO juju.api apiclient.go:639 connection established to "wss://10.131.4.170:17070/model/093b6e8e-e00d-4c85-8052-52789d9e0132/api"
| 2021-09-09 08:42:22 INFO juju.worker.apicaller connect.go:158 [093b6e] "machine-1" successfully connected to "10.131.4.170:17070"
| 2021-09-09 08:42:25 INFO juju.worker.logger logger.go:120 logger worker started

(We're also doing something weird with DNAT'ing the 10.xxx.xxx.xxx addresses that the Juju controllers are using, due to our weird and complex setup. This will change once we get routing between the two DCs fixed, but that's separate from this.)

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.9.35
Revision history for this message
Haw Loeung (hloeung) wrote :
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

Just a note that we've had to revert this filtering on account of:
https://bugs.launchpad.net/juju/+bug/1993137

The approach taken was the wrong one and will have to be revisited.

Changed in juju:
status: Fix Released → Triaged
milestone: 2.9.35 → none
Revision history for this message
John A Meinel (jameinel) wrote :

So I think the delay in coming back up in the original issue isn't about Fan addresses. Specifically, we see these lines:
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent

These fairly clearly indicate that the controller itself just died with a 'txn watcher sync error', which generally means a hard restart of a lot of the controller internals (essentially we lost sync with the database, so we start over to make sure that our current state is back in sync).

I wouldn't be surprised if it really did take 60s for the controller to come back up, and in that time it was just a while before the genuine endpoints 10.131.4.170:17070 and 10.131.4.109:17070 became available.
We actually try multiple addresses simultaneously (we are trying .170 as well as .109 at the same time), and it happens that the Fan addresses fail faster than the real addresses, which is why you see those messages in the logs.

We should probably downgrade those failures to WARNING rather than ERROR, and it wouldn't hurt for us to have a clearer INFO level message about what addresses we are trying to connect to, so you can see when we start attempting all the possible addresses.
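
As a rough illustration of the "multiple addresses simultaneously" behaviour described above (this is not Juju's actual connection code; dialFirst is a hypothetical helper), all candidates are dialled in parallel and the first successful connection wins, so an unreachable Fan address only produces log noise rather than delay:

package main

import (
	"fmt"
	"net"
	"time"
)

// dialFirst dials all candidate addresses concurrently and returns the first
// connection that succeeds. Unreachable Fan addresses simply lose the race
// (they time out), while a reachable address wins as soon as it answers.
func dialFirst(addrs []string, timeout time.Duration) (net.Conn, error) {
	type result struct {
		addr string
		conn net.Conn
		err  error
	}
	results := make(chan result, len(addrs))
	for _, addr := range addrs {
		go func(a string) {
			c, err := net.DialTimeout("tcp", a, timeout)
			results <- result{a, c, err}
		}(addr)
	}
	var lastErr error
	for range addrs {
		r := <-results
		if r.err == nil {
			// A real implementation would also close any later winners;
			// omitted here to keep the sketch short.
			return r.conn, nil
		}
		lastErr = fmt.Errorf("dial %s: %w", r.addr, r.err)
	}
	return nil, lastErr
}

func main() {
	addrs := []string{"10.131.4.170:17070", "252.170.0.1:17070", "252.109.0.1:17070"}
	conn, err := dialFirst(addrs, 30*time.Second)
	if err != nil {
		fmt.Println("all dials failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}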

Revision history for this message
Haw Loeung (hloeung) wrote : Re: [Bug 1942804] Re: Exclude Ubuntu Fan addresses

On Wed, Feb 22, 2023 at 03:53:19PM -0000, John A Meinel wrote:
> So I think the delay in coming back up in the original issue isn't about Fan addresses. Specifically, we see these lines:
>
> | 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
> | 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
>
> These fairly clearly indicate that the controller itself just died
> with a 'txn watcher sync error', which generally means a hard restart
> of a lot of the controller internals (essentially we lost sync with
> the database, so we start over to make sure that our current state is
> back in sync).
>
> I wouldn't be surprised if it really did take 60s for the controller to come back up, and in that time it was just a while before the genuine endpoints 10.131.4.170:17070 and 10.131.4.109:17070 became available.
> We actually try multiple addresses simultaneously (we are trying .170 as well as .109 at the same time), and it happens that the Fan addresses fail faster than the real addresses, which is why you see those messages in the logs.
>

Oh! Thank you for the explanation here. That is highly likely it, and the logging is just noise then.

Changed in juju:
status: Triaged → Won't Fix
Revision history for this message
Junien F (axino) wrote :

@manadart, is there a chance to fix this? Noisy logs aren't good and make debugging things harder. Thanks.

Haw Loeung (hloeung)
Changed in juju:
status: Won't Fix → Triaged