Exclude Ubuntu Fan addresses

Bug #1942804 reported by Haw Loeung
Affects: Canonical Juju
Status: Triaged
Importance: Medium
Assigned to: Joseph Phillips
Milestone: none

Bug Description

Hi,

For dedicated Juju 2.x controllers, agent.conf has the following:

apiaddresses:
- 10.131.X.Y:17070
- 252.170.0.1:17070
- 10.131.X.Y:17070
- 252.248.0.1:17070
- 10.131.X.Y:17070
- 252.109.0.1:17070

From the above, the 252.X.Y.Z addresses are all from Ubuntu Fan Networking[1][2]. These are unreachable in setups where the Juju controllers are in separate environments. I think Juju should exclude Fan addresses both when advertising the list of controller IPs and on the agent side when discovering Juju controllers.

Looking at the ubuntu-fan package source, in particular debian/ubuntu-fan.postinst, it seems these are the netblocks used:

250.0.0.0/8
251.0.0.0/8
252.0.0.0/8
253.0.0.0/8
254.0.0.0/8

[1]https://ubuntu.com/blog/fan-networking
[2]https://wiki.ubuntu.com/FanNetworking
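
For illustration, here is a minimal sketch in Go of the filtering being asked for, assuming the netblocks listed above; isFanAddr, fanNetblocks and the sample address list are made-up names for this sketch, not Juju's actual code:

package main

import (
	"fmt"
	"net"
	"strings"
)

// Fan netblocks as listed in debian/ubuntu-fan.postinst (see above).
var fanNetblocks = []string{
	"250.0.0.0/8",
	"251.0.0.0/8",
	"252.0.0.0/8",
	"253.0.0.0/8",
	"254.0.0.0/8",
}

// isFanAddr reports whether a "host:port" API address falls inside one of
// the Fan netblocks.
func isFanAddr(hostPort string) bool {
	host, _, err := net.SplitHostPort(hostPort)
	if err != nil {
		host = hostPort // bare IP with no port
	}
	ip := net.ParseIP(host)
	if ip == nil {
		return false
	}
	for _, block := range fanNetblocks {
		_, cidr, err := net.ParseCIDR(block)
		if err == nil && cidr.Contains(ip) {
			return true
		}
	}
	return false
}

func main() {
	// Sample apiaddresses, mirroring the agent.conf snippet above.
	apiAddresses := []string{
		"10.131.4.170:17070",
		"252.170.0.1:17070",
		"252.109.0.1:17070",
	}
	var kept []string
	for _, a := range apiAddresses {
		if !isFanAddr(a) {
			kept = append(kept, a)
		}
	}
	fmt.Println(strings.Join(kept, ", ")) // prints: 10.131.4.170:17070
}

Dropping every address in those /8 ranges from the advertised apiaddresses would stop agents from dialling endpoints they can never reach.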

Haw Loeung (hloeung)
summary: - Exclude Ubuntu FAN addresses
+ Exclude Ubuntu Fan addresses
Revision history for this message
John A Meinel (jameinel) wrote :

Just to confirm, Haw, is this something that causes actual issues, or just a case of Juju not taking routing into account when supplying addresses? (I would assume the logs include entries indicating that it tries and fails to use those addresses, but that it successfully connects to the other addresses.)

I'm loath to just add a special case that strips them, because I would like to have a clearer model of networking that includes routing, one that would naturally say "you're not on the same fan as the controller, therefore those addresses are not routable, and should thus not be included."

In a practical sense, we know that Fan isn't inherently better than normal addresses for this use case (we won't be putting controllers into LXD containers any time soon). So we could just strip them.

Changed in juju:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Haw Loeung (hloeung) wrote :

It's trying to connect to the controllers via the advertised Fan addresses, but does eventually time out. Example snippet from a machine provisioned in Taipei using Juju controllers in PS5:

| 2021-09-09 08:40:07 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: api connection broken unexpectedly
| 2021-09-09 08:41:11 INFO juju.worker.logger logger.go:136 logger worker stopped
| 2021-09-09 08:41:17 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: read tcp 211.75.xxx.xxx:37388->10.131.4.170:17070: read: connection reset by peer
| 2021-09-09 08:41:24 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.170.0.1:17070: i/o timeout
| 2021-09-09 08:41:31 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:41:40 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.170.0.1:17070: i/o timeout
| 2021-09-09 08:41:49 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:41:59 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:42:10 ERROR juju.worker.dependency engine.go:671 "api-caller" manifold worker returned unexpected error: [093b6e] "machine-1" cannot open api: unable to connect to API: dial tcp 252.109.0.1:17070: i/o timeout
| 2021-09-09 08:42:22 INFO juju.api apiclient.go:639 connection established to "wss://10.131.4.170:17070/model/093b6e8e-e00d-4c85-8052-52789d9e0132/api"
| 2021-09-09 08:42:22 INFO juju.worker.apicaller connect.go:158 [093b6e] "machine-1" successfully connected to "10.131.4.170:17070"
| 2021-09-09 08:42:25 INFO juju.worker.logger logger.go:120 logger worker started

(We're also doing something weird with DNAT'ing the 10.xxx.xxx.xxx addresses that the Juju controllers are using, due to our weird and complex setup. This will change once we get routing between the two DCs fixed, but that's separate from this.)

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
Changed in juju:
status: Triaged → Fix Committed
milestone: none → 2.9.35
Revision history for this message
Haw Loeung (hloeung) wrote :
Changed in juju:
status: Fix Committed → Fix Released
Revision history for this message
Joseph Phillips (manadart) wrote (last edit):

Just a note that we've had to revert this filtering on account of:
https://bugs.launchpad.net/juju/+bug/1993137

The approach taken was the wrong one and will have to be revisited.

Changed in juju:
status: Fix Released → Triaged
milestone: 2.9.35 → none
Revision history for this message
John A Meinel (jameinel) wrote :

So I think the delay in coming back up in the original issue isn't about Fan addresses. Specifically, we see these lines:
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
| 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent

These fairly clearly indicate that the controller itself just died with a 'txn watcher sync error', which generally means a hard restart of a lot of the controller internals (essentially we lost sync with the database, so we start over to make sure that our current state is back in sync).

I wouldn't be surprised if it really did take 60s for the controller to come back up, and in that time it was just a while before the genuine endpoints 10.131.4.170:17070 and 10.131.4.109:17070 became available.
We actually try multiple addresses simultaneously (we are trying .170 as well as .109 at the same time), and it happens that the Fan addresses fail faster than the real addresses, which is why you see those messages in the logs.

We should probably downgrade those failures to WARNING rather than ERROR, and it wouldn't hurt for us to have a clearer INFO level message about what addresses we are trying to connect to, so you can see when we start attempting all the possible addresses.
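
As a rough illustration of the "multiple addresses simultaneously" behaviour described above (this is not Juju's actual connection code; dialFirst is a hypothetical helper), all candidates are dialled in parallel and the first successful connection wins, so an unreachable Fan address only produces log noise rather than delay:

package main

import (
	"fmt"
	"net"
	"time"
)

// dialFirst dials all candidate addresses concurrently and returns the first
// connection that succeeds. Unreachable Fan addresses simply lose the race
// (they time out), while a reachable address wins as soon as it answers.
func dialFirst(addrs []string, timeout time.Duration) (net.Conn, error) {
	type result struct {
		addr string
		conn net.Conn
		err  error
	}
	results := make(chan result, len(addrs))
	for _, addr := range addrs {
		go func(a string) {
			c, err := net.DialTimeout("tcp", a, timeout)
			results <- result{a, c, err}
		}(addr)
	}
	var lastErr error
	for range addrs {
		r := <-results
		if r.err == nil {
			// A real implementation would also close any later winners;
			// omitted here to keep the sketch short.
			return r.conn, nil
		}
		lastErr = fmt.Errorf("dial %s: %w", r.addr, r.err)
	}
	return nil, lastErr
}

func main() {
	addrs := []string{"10.131.4.170:17070", "252.170.0.1:17070", "252.109.0.1:17070"}
	conn, err := dialFirst(addrs, 30*time.Second)
	if err != nil {
		fmt.Println("all dials failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}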

Revision history for this message
Haw Loeung (hloeung) wrote : Re: [Bug 1942804] Re: Exclude Ubuntu Fan addresses

On Wed, Feb 22, 2023 at 03:53:19PM -0000, John A Meinel wrote:
> So I think the delay in coming back up in the original issue isn't about Fan addresses. Specifically, we see these lines:
>
> | 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "migration-minion" manifold worker returned unexpected error: txn watcher sync error
> | 2021-09-09 08:41:11 ERROR juju.worker.dependency engine.go:671 "log-sender" manifold worker returned unexpected error: cannot send log message: websocket: close sent
>
> These fairly clearly indicate that the controller itself just died
> with a 'txn watcher sync error', which generally means a hard restart
> of a lot of the controller internals (essentially we lost sync with
> the database, so we start over to make sure that our current state is
> back in sync).
>
> I wouldn't be surprised if it really did take 60s for the controller to come back up, and in that time it was just a while before the genuine endpoints 10.131.4.170:17070 and 10.131.4.109:17070 became available.
> We actually try multiple addresses simultaneously (we are trying .170 as well as .109 at the same time), and it happens that the Fan addresses fail faster than the real addresses, which is why you see those messages in the logs.
>

Oh! Thank you for the explanation here. That is highly likely it, and the logging is just noise then.

Changed in juju:
status: Triaged → Won't Fix
Revision history for this message
Junien F (axino) wrote :

@manadart, is there a chance to fix this? Noisy logs aren't good and make debugging things harder. Thanks.

Haw Loeung (hloeung)
Changed in juju:
status: Won't Fix → Triaged