Non-leader units stuck on Waiting on relations: db

Bug #2019086 reported by Diko Parvanov
This bug affects 5 people
Affects: Landscape Charm
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Using latest/stable revision 87, the leader is active/idle but the non-leader units are stuck:

landscape-server/0* active idle 0 10.5.1.130 Unit is ready
landscape-server/1 waiting idle 1 10.5.1.16 Waiting on relations: db
landscape-server/2 waiting idle 2 10.5.2.250 Waiting on relations: db

Removing and re-adding the db relation towards postgresql (using 14/stable rev 288) doesn't change anything.

Removing the 2 waiting units and adding them again leads to the same situation. With a single landscape-server unit there is no problem.

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

We are seeing this with SQA too, except that the leader is also in the `waiting on relations: db` state. We collected some crashdumps here: https://oil-jenkins.canonical.com/artifacts/a02a6b88-187c-42dc-91bc-38fb06db1e85/generated/generated/lma-maas/juju-crashdump-lma-maas-2023-06-03-17.19.42.tar.gz

Changed in landscape-charm:
status: New → Triaged
importance: Undecided → High
Andre Ruiz (andre-ruiz) wrote :

I'm also seeing this in a deployment.

In my case, all three units (including the leader) are waiting for db.

Also, removing and re-adding the relation does not help. In fact it makes things worse: the three units become blocked with the message "Failed to update database schema":

ubuntu@infra01:~$ juju status landscape-server landscape-postgresql
Model     Controller        Cloud/Region  Version  SLA          Timestamp
lma-maas  foundations-maas  maas_cloud    2.9.44   unsupported  20:24:07Z

App                   Version  Status   Scale  Charm             Channel        Rev  Exposed  Message
landscape-postgresql  12.16    active       2  postgresql        latest/stable  316  no       Live master (12.16)
landscape-server               blocked      3  landscape-server  stable          87  no       Failed to update database schema

Unit                     Workload  Agent  Machine  Public address  Ports     Message
landscape-postgresql/0*  active    idle   4        10.3.4.130      5432/tcp  Live master (12.16)
landscape-postgresql/1   active    idle   10       10.3.4.117      5432/tcp  Live secondary (12.16)
landscape-server/0*      blocked   idle   3        10.3.4.132                Failed to update database schema
landscape-server/1       blocked   idle   9        10.3.4.129                Failed to update database schema
landscape-server/2       blocked   idle   16       10.3.4.119                Failed to update database schema

Machine  State    Address     Inst id         Series  AZ     Message
3        started  10.3.4.132  landscape-1     focal   zone1  Deployed
4        started  10.3.4.130  landscapesql-1  focal   zone1  Deployed
9        started  10.3.4.129  landscape-2     focal   zone3  Deployed
10       started  10.3.4.117  landscapesql-2  focal   zone3  Deployed
16       started  10.3.4.119  landscape-3     focal   zone5  Deployed

Andre Ruiz (andre-ruiz) wrote :

I redeployed and watched the progression of the states. For a long time landscape stays in a blocked state with a status message similar to "failed to upgrade database schema", and at the very end status changes to "waiting on relations: db", while state switches from blocked back to waiting.

Then I see:

landscape-server/0 waiting idle 3 10.3.4.128 Waiting on relations: db
landscape-server/1 waiting idle 9 10.3.4.118 Waiting on relations: db
landscape-server/2* waiting idle 16 10.3.4.139 Waiting on relations: db

But in the debug-log you can see the error:

unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed gpg: keybox '/root/.gnupg/pubring.kbx' created
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed Traceback (most recent call last):
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed File "/opt/canonical/landscape/canonical/landscape/scripts/schema.py", line 226, in handle_existing_licenses
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed output = check_output(["gpg", "--skip-verify", "--output", "-", license_file])
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed File "/usr/lib/python3.8/subprocess.py", line 516, in run
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed raise CalledProcessError(retcode, process.args,
unit-landscape-server-1: 20:50:34 WARNING unit.landscape-server/1.db-relation-changed subprocess.CalledProcessError: Command '['gpg', '--skip-verify', '--output', '-', '/etc/landscape/license.txt']' returned non
unit-landscape-server-1: 20:50:35 INFO juju.worker.uniter.operation ran "db-relation-changed" hook (via hook dispatching script: dispatch)
unit-landscape-server-1: 20:50:49 WARNING unit.landscape-server/1.db-relation-changed gpg: WARNING: no command supplied. Trying to guess what you mean ...
unit-landscape-server-1: 20:50:49 WARNING unit.landscape-server/1.db-relation-changed gpg: invalid armor header: iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI\n
unit-landscape-server-1: 20:50:49 WARNING unit.landscape-server/1.db-relation-changed Traceback (most recent call last):
unit-landscape-server-1: 20:50:49 WARNING unit.landscape-server/1.db-relation-changed File "/usr/bin/landscape-schema", line 12, in <module>
unit-landscape-server-1: 20:50:49 WARNING unit.landscape-server/1.db-relation-changed canonical.landscape.scripts.schema.run()
unit-landscape-server-1: 20:50:49 WARNING unit.landscape-server/1.db-relation-changed File "/opt/canonical/landscape/cano...


Andre Ruiz (andre-ruiz) wrote :

This is weird because the license file appears to be correct, contents do not seem to be corrupted.

Also, it was given to the charm in base64 format, and it ended up correctly in plain text inside the units at /etc/landscape/license.txt. The contents there match exactly the contents of the original file before it was passed to the charm option in encoded form.
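A round-trip comparison of this kind can be done mechanically. The following is only a minimal sketch, assuming the charm option holds standard base64 of the license file; the function name `decoded_matches` and the sample bytes are hypothetical and not taken from the charm.

```python
import base64


def decoded_matches(encoded_option: str, original: bytes) -> bool:
    """Return True if decoding the base64 option reproduces the original bytes exactly."""
    return base64.b64decode(encoded_option) == original


# Hypothetical sample: a blank line follows the BEGIN line and must survive the round trip.
sample = b"-----BEGIN PGP SIGNATURE-----\n\nAAAA\n"
encoded = base64.b64encode(sample).decode("ascii")

# Compare the decoded option against the exact bytes that were encoded.
print(decoded_matches(encoded, sample))
```

Comparing bytes rather than decoded text means that whitespace differences, including missing blank lines, count as a mismatch.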

Also, the part "gpg: WARNING: no command supplied. Trying to guess what you mean ..." seems unexpected.

If I manually run that gpg command, the result is the same:

root@landscape-2:/etc/landscape# gpg --skip-verify --output - /etc/landscape/license.txt
gpg: WARNING: no command supplied. Trying to guess what you mean ...
gpg: invalid armor header: iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI\n

This is the (obscured) content of the license file as it is inside the unit, just for reference:

============================8<-----------------------------
root@landscape-2:/etc/landscape# cat license.txt
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

{
  "licensee":"xxxxxxxxxxxxxxxx",
  "licenses":[
    {
      "role":"BasicFeatures",
      "seats":yyy,
      "license_key":"1INxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxGTzv",
      "expires":"202x-xx-xx"
    }
  ]
}
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI
(deleted lines...)
uhnsOoORDPDQ27zANORG4C/rrFFGoQHk+Q3c4ynnF96qeKLVa1mL/WdxLahP+xA4
ETKttDoP45r0uXhpzmvuvETsyUrWgg==
=71TG
-----END PGP SIGNATURE-----
============================8<-----------------------------

Andre Ruiz (andre-ruiz) wrote :

Ok, I think I understood it.

My landscape license was recently re-generated (prior one expired) -- it is an internal field engineering temporary license for customer installation, btw -- and the new one is not correctly formatted (first time I'm using this one).

It is missing a header inside the signature part.

Where it now reads:

===========================8<---------------------------------------
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI
===========================8<---------------------------------------

It should read something like:

===========================8<---------------------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (MingW32)

iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI
===========================8<---------------------------------------

(see the line with version and operating system). I added a random header (the one shown above) and now gpg works fine and I don't have the charm errors anymore.

@dparv, I suspect you are using that same file.

Andre Ruiz (andre-ruiz) wrote :

What could be improved is that failure of license extraction with gpg should not show up as a database issue in the status line (be it the initial failure to update database schema, or later the missing db relation status).
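As an illustration of that suggestion only, here is a minimal sketch of mapping a gpg failure to a license-specific status instead of a database one. It is not the charm's code: the function name `license_status`, the injectable `run` parameter, and the message wording are all hypothetical. Only the gpg command line is taken from the traceback above.

```python
import subprocess


def license_status(license_file, run=subprocess.check_output):
    """Return a (status, message) pair describing whether gpg can read the license.

    `run` defaults to subprocess.check_output and is injectable so the
    function can be exercised without gpg installed (hypothetical design).
    """
    command = ["gpg", "--skip-verify", "--output", "-", license_file]
    try:
        run(command)
    except subprocess.CalledProcessError as exc:
        # gpg ran but exited non-zero, e.g. because of an invalid armor header.
        return ("blocked", f"Failed to read license file {license_file} "
                           f"(gpg exit code {exc.returncode})")
    except OSError as exc:
        # gpg could not be started at all (for example, binary not found).
        return ("blocked", f"Could not run gpg: {exc}")
    return ("active", "License file is readable")
```

With something like this, a unit hitting the armor header error would report a license problem in its status line rather than "Failed to update database schema".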

Andre Ruiz (andre-ruiz) wrote :

Another update on this.

It turns out that the header text is not so important, but the header line is, even if it is just a blank line between the BEGIN line and the first signature line. In my case, the blank lines were somehow stripped between wget-ing the file and copying it to the remote machine where fce was running. That was the source of my problems, since the first line of the signature was then read as the header.

I tested again and made sure that the exact downloaded file was used and I don't see the problem anymore. So this was a red herring for me and I'm not sure my problem was the same as the OP's problem.

Just to summarize.

This DOES NOT WORK:
===========================8<---------------------------------------
-----BEGIN PGP SIGNATURE-----
iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI
===========================8<---------------------------------------

This works (and is how the downloaded file is, actually):
===========================8<---------------------------------------
-----BEGIN PGP SIGNATURE-----

iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI
===========================8<---------------------------------------

This also works (real header instead of just a blank line):
===========================8<---------------------------------------
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (MingW32)
iQEzBAEBCgAdFiEENTODksx68VxYHaPbcqZmZcXIjuYFAmTlPXQACgkQcqZmZcXI
===========================8<---------------------------------------

Maybe the original license file from Landscape could have a proper header there to avoid this kind of confusion (or errors, since blank lines can be stripped, as happened to me), but it is not broken.
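The three cases summarized above can be checked before handing a license file to the charm. This is a minimal sketch, not part of Landscape or gpg: the function name `signature_header_ok` and the header regular expression are assumptions, and the check only looks at the single line after the BEGIN marker, which is what the observations in this comment describe.

```python
import re

BEGIN_SIGNATURE = "-----BEGIN PGP SIGNATURE-----"
# Assumed shape of an armor header line such as "Version: GnuPG v1.4.13 (MingW32)".
ARMOR_HEADER = re.compile(r"^[A-Za-z][A-Za-z0-9-]*: .*$")


def signature_header_ok(license_text: str) -> bool:
    """Return True if the line after the BEGIN marker is blank or a 'Key: value' header."""
    lines = [line.rstrip() for line in license_text.splitlines()]
    try:
        start = lines.index(BEGIN_SIGNATURE)
    except ValueError:
        return False  # no signature block at all
    if start + 1 >= len(lines):
        return False  # nothing follows the BEGIN marker
    following = lines[start + 1]
    return following == "" or ARMOR_HEADER.match(following) is not None


# Signature data directly after the marker: the case that does not work.
stripped = BEGIN_SIGNATURE + "\niQEzBAEBCgAdFiEENTODksx68VxYHaPb\n"
# Blank line after the marker: how the downloaded file looks.
intact = BEGIN_SIGNATURE + "\n\niQEzBAEBCgAdFiEENTODksx68VxYHaPb\n"

print(signature_header_ok(stripped), signature_header_ok(intact))
```

A check like this would catch a file whose blank lines were stripped in transit before it produces the "invalid armor header" error inside the unit.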

Sorry for the noise. This may be unrelated to original issue.

Vern Hart (vern) wrote :

I've encountered this and can't seem to get past it. My leader unit is active/idle but the two other landscape-server units are "Waiting on relations: db".

The last few lines of the landscape-server unit logs on the waiting units:

2023-09-22 08:14:13 INFO juju.worker.uniter.operation runhook.go:159 ran "amqp-relation-changed" hook (via hook dispatching script: dispatch)
2023-09-22 08:14:14 INFO juju.worker.uniter.operation runhook.go:159 ran "amqp-relation-changed" hook (via hook dispatching script: dispatch)
2023-09-22 08:14:28 INFO unit.landscape-server/1.juju-log server.go:316 db:51: landscape-server/1 not in allowed_units
2023-09-22 08:14:28 INFO juju.worker.uniter.operation runhook.go:159 ran "db-relation-changed" hook (via hook dispatching script: dispatch)
2023-09-22 08:15:24 INFO unit.landscape-server/1.juju-log server.go:316 db:51: landscape-server/1 not in allowed_units
2023-09-22 08:15:24 INFO juju.worker.uniter.operation runhook.go:159 ran "db-relation-changed" hook (via hook dispatching script: dispatch)
2023-09-22 08:18:04 INFO juju.worker.uniter.operation runhook.go:159 ran "update-status" hook (via hook dispatching script: dispatch)
2023-09-22 08:23:57 INFO juju.worker.uniter.operation runhook.go:159 ran "update-status" hook (via hook dispatching script: dispatch)
2023-09-22 08:29:31 INFO juju.worker.uniter.operation runhook.go:159 ran "update-status" hook (via hook dispatching script: dispatch)
2023-09-22 08:34:24 INFO juju.worker.uniter.operation runhook.go:159 ran "update-status" hook (via hook dispatching script: dispatch)

However, when I look at the relation data, I see landscape-server/1 in the allowed-units:

$ juju show-unit landscape-server/1 --endpoint db
landscape-server/1:
  machine: "9"
  opened-ports: []
  public-address: 172.24.15.179
  charm: ch:amd64/jammy/landscape-server-93
  leader: true
  life: alive
  relation-info:
  - relation-id: 51
    endpoint: db
    related-endpoint: db-admin
    application-data: {}
    related-units:
      landscape-postgresql/0:
        in-scope: true
        data:
          allowed-subnets: 172.24.15.171/32,172.24.15.179/32,172.24.15.181/32
          allowed-units: landscape-server/0,landscape-server/1,landscape-server/2
          egress-subnets: 172.24.15.167/32
          host: 172.24.15.168
          ingress-address: 172.24.15.167
          master: dbname=landscape-server fallback_application_name=landscape-server
            host=172.24.15.168 password=xx port=5432 user=relation-51
          port: "5432"
          private-address: 172.24.15.167
          standbys: dbname=landscape-server fallback_application_name=landscape-server
            host=172.24.15.167 password=xx port=5432 user=relation-51
          state: standby
          user: relation-51
      landscape-postgresql/1:
        in-scope: true
        data:
          allowed-subnets: 172.24.15.171/32,172.24.15.179/32,172.24.15.181/32
          allowed-units: landscape-server/0,landscape-server/1,landscape-server/2
          database: landscape-server
          egress-subnets: 172.24.15.168/32
          host: 172.24.15.168
          ingress-address: 172.24.15.168
          master: dbname=landscape-server fallback...
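The membership check done by eye above can also be scripted. This is a minimal sketch, assuming the `related-units` portion of the `juju show-unit` output has already been loaded into a plain dict (loading it is left out); the function name `units_missing_from_allowed` is hypothetical, and the sample data is abbreviated from the output above.

```python
def units_missing_from_allowed(related_units: dict, unit: str) -> list:
    """Return the related units whose allowed-units value does not list `unit`."""
    missing = []
    for name, info in related_units.items():
        raw = info.get("data", {}).get("allowed-units", "")
        allowed = [entry.strip() for entry in raw.split(",") if entry.strip()]
        if unit not in allowed:
            missing.append(name)
    return missing


# Abbreviated from the relation data shown above.
related_units = {
    "landscape-postgresql/0": {
        "in-scope": True,
        "data": {
            "allowed-units": "landscape-server/0,landscape-server/1,landscape-server/2",
        },
    },
    "landscape-postgresql/1": {
        "in-scope": True,
        "data": {
            "allowed-units": "landscape-server/0,landscape-server/1,landscape-server/2",
        },
    },
}

print(units_missing_from_allowed(related_units, "landscape-server/1"))
```

An empty list means every related postgresql unit lists the given unit, which matches what the relation data shows here even though the hook logged "not in allowed_units".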


tags: added: cdo-qa foundations-enginer
tags: added: foundations-engine
removed: foundations-enginer