cleanup VIPs in network-get and relation-data after upgrade

Bug #1944758 reported by Rodrigo Barbieri
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips
Milestone: 2.9.26

Bug Description

Following up on bug #1897261: a customer running Juju 2.8.8 had hit that issue and had the following network-get and relation data:

VIP: x.16.211.10
keystone/0: x.16.211.55
keystone/1: x.16.211.66

Before upgrade:

keystone network-get internal (same as shared-db and mysql-router's db-router)
- Stdout: |
    bind-addresses:
    - macaddress: redacted
      interfacename: eth1
      addresses:
      - hostname: ""
        address: x.16.211.55
        cidr: x.16.211.0/24
      - hostname: ""
        address: x.16.211.10
        cidr: x.16.211.0/24
    egress-subnets:
    - x.16.211.10/32
    ingress-addresses:
    - x.16.211.10
    - x.16.211.55
  UnitId: keystone/0

+ juju run --unit mysql-innodb-cluster/0 'relation-get -r db-router:152 - keystone-mysql-router/0'
MRUP_database: keystone
MRUP_hostname: x.16.211.55
MRUP_username: keystone
egress-subnets: x.16.211.10/32
ingress-address: x.16.211.10
mysqlrouter_hostname: x.16.211.55
mysqlrouter_username: mysqlrouteruser
private-address: x.16.211.10

mysql-innodb-cluster logs:

/var/log/juju/unit-mysql-innodb-cluster-1.log:3261306:2021-08-20 00:53:39 DEBUG juju-log db-router:152: Grant does NOT exist for host 'x.16.211.10' on db 'keystone'

We restarted corosync on keystone/0 and observed the problem move to keystone/1, while the network-get and relation-data problems for keystone/0 resolved on their own.

We then upgraded to Juju 2.9.11 and observed the following (on keystone/1):

After upgrade:

keystone network-get internal (same as shared-db and mysql-router's db-router)
- Stdout: |
    bind-addresses:
    - mac-address: redacted
      interface-name: eth1
      addresses:
      - hostname: ""
        address: x.16.211.66
        cidr: x.16.211.0/24
      - hostname: ""
        address: x.16.211.10
        cidr: x.16.211.0/24
      macaddress: redacted
      interfacename: eth1
    egress-subnets:
    - x.16.211.66/32
    ingress-addresses:
    - x.16.211.66
    - x.16.211.10
  UnitId: keystone/1

+ juju run --unit mysql-innodb-cluster/0 'relation-get -r db-router:152 - keystone-mysql-router/1'
MRUP_database: keystone
MRUP_hostname: x.16.211.66
MRUP_username: keystone
egress-subnets: x.16.211.10/32
ingress-address: x.16.211.10
mysqlrouter_hostname: x.16.211.66
mysqlrouter_username: mysqlrouteruser
private-address: x.16.211.10

/var/log/juju/unit-mysql-innodb-cluster-2.log:3800396:2021-08-23 01:58:00 DEBUG juju-log Grant does NOT exist for host 'x.16.211.10' on db 'keystone'

In order to fix the issue we had to perform the following:

db.ip.addresses.find({ $and: [ {_id: /.*x.16.211.10.*/}, {"machine-id": "y/lxd/z"} ] } )
db.ip.addresses.remove({ $and: [ {_id: /.*x.16.211.10.*/}, {"machine-id": "y/lxd/z"} ] } )
systemctl restart jujud-machine-y-lxd-z.service
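
For reference, one common way to reach the controller's MongoDB to run the queries above is sketched below; the paths, credential keys, port and mongo client are assumptions that vary by Juju version and controller setup (on snap-based controllers the client is typically juju-db.mongo), so treat it as a sketch rather than an exact recipe:

# Sketch only: open a mongo shell on the Juju controller (paths/client vary by version).
juju ssh -m controller 0
# ...then, on the controller machine:
user=$(sudo grep '^tag:' /var/lib/juju/agents/machine-*/agent.conf | cut -d' ' -f2)
password=$(sudo grep '^statepassword:' /var/lib/juju/agents/machine-*/agent.conf | cut -d' ' -f2)
mongo --ssl --sslAllowInvalidCertificates --authenticationDatabase admin \
  -u "$user" -p "$password" localhost:37017/juju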

The network-get and relation-data changed to:

keystone network-get internal (same as shared-db and mysql-router's db-router)
- Stdout: |
    bind-addresses:
    - mac-address: redacted
      interface-name: eth1
      addresses:
      - hostname: ""
        address: x.16.211.66
        cidr: x.16.211.0/24
      macaddress: redacted
      interfacename: eth1
    egress-subnets:
    - x.16.211.66/32
    ingress-addresses:
    - x.16.211.66
  UnitId: keystone/1

+ juju run --unit mysql-innodb-cluster/0 'relation-get -r db-router:152 - keystone-mysql-router/1'
MRUP_database: keystone
MRUP_hostname: x.16.211.66
MRUP_username: keystone
egress-subnets: x.16.211.66/32
ingress-address: x.16.211.66
mysqlrouter_hostname: x.16.211.66
mysqlrouter_username: mysqlrouteruser
private-address: x.16.211.66

I have not confirmed the mysql-innodb-cluster db-router messages in the logs yet, but I'm 99.99% positive it is fixed. I will post an update when I get the data.

The reason I'm opening this bug is that I feel this is an upgrade issue that should be tackled: Juju should be able to identify VIPs in the network-get and relation data and fix itself. The upgraded Juju no longer includes any VIP in those outputs, so if it can detect the VIPs, it should also detect the discrepancies and correct them itself, rather than requiring MongoDB surgery. Removing a stale IP is something Juju should be able to do on its own.

Tags: sts
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I got the mysql-innodb-cluster log data and confirmed that the Grant error messages are no longer displayed when an endpoint-changed hook of the db-router relation runs.

Revision history for this message
Joseph Phillips (manadart) wrote :

I would have liked to see the output from when you ran:

db.ip.addresses.find({ $and: [ {_id: /.*x.16.211.10.*/}, {"machine-id": "y/lxd/z"} ] } )

That would tell us the "source" of the address.

We do have logic to handle this (a quick query sketch follows this list):
- There are two sources of addresses: the machine agent and the instance-poller (which asks the provider).
- Each address records a source to indicate which of the two is responsible for it.
- There is a precedence here: if the instance-poller doesn't see an address, it updates it to be machine-sourced, and the machine agent can only remove addresses that it is responsible for.
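
For illustration, which source currently owns an address can be checked directly in the controller DB; the field names below match the ip.addresses document pasted later in this bug, while origin values other than "machine" (e.g. "provider" for the instance-poller) are assumed rather than taken from this bug:

db.ip.addresses.find(
    { value: "x.16.211.10" },
    { value: 1, origin: 1, "machine-id": 1, "device-name": 1 }
)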

For safety there is an upgrade step accompanying the patches that fixed persistent VIPs. It set the source of all addresses to be the instance-poller.
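
The real step lives in Juju's upgrade machinery; expressed very roughly in mongo-shell terms it amounts to something like the sketch below (illustrative only, not the actual upgrade code, and the "provider" origin value for the instance-poller is assumed):

// Sketch only - do not run this against a live controller.
db.ip.addresses.update({}, { $set: { origin: "provider" } }, { multi: true })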

I think there *may* have been a situation here where these processes had not *yet* run in the desired order - the instance-poller did not relinquish the address before the machine agent updated network config, so the machine agent had not yet deleted the address.

I'm going to mark this one as incomplete, but I'm happy to revisit it with more info. The logs should have entries when network configuration is updated, and the output of the DB query would tell us the currently assigned source for the address.
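
As a starting point for gathering the log side of that evidence (the log file name and message text are assumptions that vary by Juju version):

# Machine agent log entries around network config updates (message text varies by version).
sudo grep -i 'network config' /var/log/juju/machine-y-lxd-z.log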

Changed in juju:
status: New → Incomplete
assignee: nobody → Joseph Phillips (manadart)
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote (last edit):

Hi Joseph, see below the output of the MongoDB queries:

juju:PRIMARY> db.ip.addresses.find({ $and: [ {_id: /.*x.16.211.10.*/}, {"machine-id": "y/lxd/z"} ] } )
{ "_id" : "c0105cfc-ef76-406d-80b9-8820dc6879b7:m#y/lxd/z#d#eth1#ip#x.16.211.10", "model-uuid" : "c0105cfc-ef76-406d-80b9-8820dc6879b7", "device-name" : "eth1", "machine-id" : "y/lxd/z", "subnet-cidr" : "x.16.211.0/24", "config-method" : "static", "value" : "x.16.211.10", "origin" : "machine", "txn-revno" : NumberLong(3), "txn-queue" : [ ], "is-secondary" : true }
juju:PRIMARY>
juju:PRIMARY> db.ip.addresses.find( {_id: /.*x.16.211.10.*/} )
{ "_id" : "c0105cfc-ef76-406d-80b9-8820dc6879b7:m#y/lxd/z#d#eth1#ip#x.16.211.10", "model-uuid" : "c0105cfc-ef76-406d-80b9-8820dc6879b7", "device-name" : "eth1", "machine-id" : "y/lxd/z", "subnet-cidr" : "x.16.211.0/24", "config-method" : "static", "value" : "x.16.211.10", "origin" : "machine", "txn-revno" : NumberLong(3), "txn-queue" : [ ], "is-secondary" : true }
juju:PRIMARY>

Changed in juju:
status: Incomplete → New
John A Meinel (jameinel)
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
tags: added: sts
Changed in juju:
status: Triaged → In Progress
importance: Medium → High
Changed in juju:
status: In Progress → Fix Committed
milestone: none → 2.9.26
Revision history for this message
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Joseph, is there a pull request associated with this fix?

Revision history for this message
Joseph Phillips (manadart) wrote :
Changed in juju:
status: Fix Committed → Fix Released