node fails to join cluster, fails to list disks

Bug #2063223 reported by Marian Gasparovic
This bug affects 1 person
Affects: OpenStack Snap
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

2024-04-23-01:44:13 root ERROR [localhost] Command failed: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null solqa-lab1-server-33.nosilo.lab1.solutionsqa -- sunbeam cluster join --token eyJuYW1lIjoic29scWEtbGFiMS1zZXJ2ZXItMzMubm9zaWxvLmxhYjEuc29sdXRpb25zcWEiLCJzZWNyZXQiOiI2MTY1YWVlMjFiNzJhYWQ0NDRjYjA3ODljNmE1MDdmNWUzMGZkNzdhZmJjYjUwMGIzY2Q3Yjg2NTdjNWFhZDg4IiwiZmluZ2VycHJpbnQiOiJhNjFiYTRiZDUwODRiNzhiZWRiZjg4M2FlMjM2YWY0ZWU3NDI0Y2E5ZjU0NWQ2ZGE2MWQ0YTQ2YTU0ODcyZGExIiwiam9pbl9hZGRyZXNzZXMiOlsiMTAuMjQ2LjE2NC4yMjE6NzAwMCJdfQ== --role control --role compute --role storage
2024-04-23-01:44:13 root ERROR 1[localhost] STDOUT follows:
b''
2024-04-23-01:44:13 root ERROR 2[localhost] STDERR follows:
Warning: Permanently added 'solqa-lab1-server-33.nosilo.lab1.solutionsqa' (ED25519) to the list of known hosts.
Error: Unable to list disks
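
When debugging a failed join it can help to check which address the token points at. The join tokens in this report appear to be base64-encoded JSON with the fields name, secret, fingerprint and join_addresses; the token above decodes to a join address of 10.246.164.221:7000. A minimal sketch for inspecting one follows. The helper names `decode_join_token` and `join_addresses` are hypothetical and not part of sunbeam, and the sample token is made up for illustration.

```python
import base64
import json


def decode_join_token(token: str) -> dict:
    """Decode a join token, assuming it is base64-encoded JSON."""
    # Re-add '=' padding in case it was lost when copying the token.
    padded = token + "=" * (-len(token) % 4)
    return json.loads(base64.b64decode(padded))


def join_addresses(token: str) -> list:
    """Return the join_addresses list from a token, or [] if absent."""
    return decode_join_token(token).get("join_addresses", [])


# Hypothetical token built only for this example (not a real cluster secret).
sample = base64.b64encode(
    json.dumps({"name": "node-x", "join_addresses": ["192.0.2.10:7000"]}).encode()
).decode()

print(join_addresses(sample))  # ['192.0.2.10:7000']
```

Note that a token also carries the join secret, so decoded output should be treated as sensitive.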

The job that prepares the disks (it removes existing partitions) ran before this as usual. Its log shows:

2024-04-23-00:55:53 root DEBUG [localhost]: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null solqa-lab1-server-33.nosilo.lab1.solutionsqa -- sudo umount '/dev/sdb?*' '||' exit 0
2024-04-23-00:55:54 root DEBUG [localhost]: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null solqa-lab1-server-33.nosilo.lab1.solutionsqa -- sudo wipefs -a /dev/sdb
2024-04-23-00:55:54 root INFO [localhost]: /dev/sdb: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/sdb: 8 bytes were erased at offset 0x37e4895e00 (gpt): 45 46 49 20 50 41 52 54
/dev/sdb: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/sdb: calling ioctl to re-read partition table: Success

The machine journal shows that as well:

Apr 23 00:55:54 solqa-lab1-server-33 systemd[1]: Started Session 4 of User ubuntu.
Apr 23 00:55:54 solqa-lab1-server-33 sudo[1871]: ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/partprobe /dev/sdb
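
The preparation steps visible in the log and journal above (unmount any partitions, wipe the signatures, re-read the partition table) can be sketched as follows. This is only an illustration, not the actual job: the function name `disk_prep_commands` and its `partitions` parameter are hypothetical, and the sketch only builds the command lines without running anything.

```python
import glob
import shlex


def disk_prep_commands(device, partitions=None):
    """Build the command lines for wiping one disk (nothing is executed)."""
    if partitions is None:
        # Same pattern as in the log: /dev/sdb?* matches /dev/sdb1, /dev/sdb2, ...
        partitions = sorted(glob.glob(device + "?*"))
    commands = []
    for part in partitions:
        # The logged job ignores umount failures ("|| exit 0").
        commands.append(["sudo", "umount", part])
    # Erase all filesystem and partition-table signatures on the disk.
    commands.append(["sudo", "wipefs", "-a", device])
    # Ask the kernel to re-read the now empty partition table.
    commands.append(["sudo", "partprobe", device])
    return commands


# Print the commands for an assumed single-partition disk.
for argv in disk_prep_commands("/dev/sdb", partitions=["/dev/sdb1"]):
    print(shlex.join(argv))
```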

The manifest has the correct disk as well:

microceph_config:
    solqa-lab1-server-31.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-32.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-33.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-34.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-35.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-36.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb

Logs and artifacts: https://oil-jenkins.canonical.com/artifacts/b7b759bf-c94e-4155-a1a3-b7de6dbcc1cd/index.html

Tags: cdo-qa
Andre Ruiz (andre-ruiz) wrote :

I also just hit this with 2024.1/edge

Attaching logs.

Changed in snap-openstack:
status: New → Confirmed
James Page (james-page) wrote :

@andre-ruiz

unit-microceph-1: 12:34:57 ERROR unit.microceph/1.juju-log peers:1: Failed executing cmd: ['microceph', 'cluster', 'join', 'eyJuYW1lIjoibWljcm9jZXBoLzEiLCJzZWNyZXQiOiIzM2I0MzBkMWRiYzA2N2RiNWE0NmI5OGVjZDI3YzU2MzEzNGZmNDkxMjQwYjg4ZjFkMmI3YjVhZmM0NmI4NGEyIiwiZmluZ2VycHJpbnQiOiJmM2Q4YmQ3ZDgxYmFmZDRhN2Y1ZDE1MDJmNzM2MDAwOWI2ZmJhM2RlZmI0ZDVkMDE0Y2VhMjcwNzExZDgwODU0Iiwiam9pbl9hZGRyZXNzZXMiOlsiMTQ3Ljc1LjU1LjIxOTo3NDQzIl19'], error: Error: failed to generate the configuration: failed to locate IP on public network 147.75.55.219/31: no IP belongs to provided subnet 147.75.55.219/31

unit-microceph-1: 12:34:57 WARNING unit.microceph/1.juju-log peers:1: Error: failed to generate the configuration: failed to locate IP on public network 147.75.55.219/31: no IP belongs to provided subnet 147.75.55.219/31

It looks like the microceph unit failed to join the cluster, but maybe the charm did not detect that.

James Page (james-page) wrote :

Actually it did:

microceph/1 blocked idle 1 147.75.55.139 (workload) Error in charm (see logs): Command '['microceph', 'cluster', 'join', 'eyJuYW1lIjoibWljcm9jZXBoLzEiLCJzZWNy...

Maybe listing the disks is simply the first step that trips after this failure.
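
The "no IP belongs to provided subnet" message can be checked by hand with Python's ipaddress module. The sketch below assumes the unit's only relevant address is the 147.75.55.139 shown in the status line above; the machine may have other interfaces that are not visible here. The helper name `addresses_in_subnet` is hypothetical.

```python
import ipaddress


def addresses_in_subnet(addresses, subnet):
    """Return the addresses that fall inside the given subnet."""
    # strict=False accepts a host address with a prefix, e.g. 147.75.55.219/31.
    net = ipaddress.ip_network(subnet, strict=False)
    return [a for a in addresses if ipaddress.ip_address(a) in net]


net = ipaddress.ip_network("147.75.55.219/31", strict=False)
print(net)                        # 147.75.55.218/31
print([str(ip) for ip in net])    # ['147.75.55.218', '147.75.55.219']

# Address reported for microceph/1 in the status output (assumed to be its only one).
print(addresses_in_subnet(["147.75.55.139"], "147.75.55.219/31"))  # []
```

A /31 covers only two addresses, so a joining unit at 147.75.55.139 cannot have an IP on that public network, which is consistent with the error text.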

summary: - cluster join fails on Error: Unable to list disks
+ microceph unit fails to join cluster, fails to list disks
summary: - microceph unit fails to join cluster, fails to list disks
+ node fails to join cluster, fails to list disks
James Page (james-page) wrote :

@andre - I think your issue is different, but it has the same symptom.

You can see in the status for the original bug report that all three units are active:

microk8s/0* active idle 0 10.246.164.221 16443/tcp
microk8s/1 active idle 1 10.246.167.161 16443/tcp
microk8s/2 active idle 2 10.246.164.223 16443/tcp

Andre Ruiz (andre-ruiz) wrote :

OK, I have a separate bug report for the case where the microceph unit fails to join the cluster because of that "no IP belongs to provided subnet" error: LP: #2056218. Nobuto also filed a similar one, LP: #2065700.

Andre Ruiz (andre-ruiz) wrote :

I had two more cases today that look exactly like this one. Looking at the logs, though, it might be that the cluster was not formed correctly, that this error is just a consequence of that, and that the installation script did not catch it soon enough.

The output on the console is:

16:28:32 + sunbeam cluster join --role compute --role control --role storage --token eyJuYW1lIjoib2I3Ni1ub2RlNS5tYWFzIiwic2VjcmV0IjoiNDNmYTA4YmYwNDQ0Y2QxNTg0Mjg4MmY1MTMzZTgxMDIyMmMzZDMwZmVlMzY2MWQyMDc3YzZmNmNjZjYyZDQ2OSIsImZpbmdlcnByaW50IjoiOTk5ZDk2MWI0NDU4NTM4MDAyZTQxOGZiNzc5YjQ3ZGNjMmZiYjU2Y2QzNTNhY2NiOTRjOTI5NGRmZTBmZmEzYiIsImpvaW5fYWRkcmVzc2VzIjpbIjE3Mi4yNy43Ni4xNjI6NzAwMCJdfQ==
16:28:34 > Checking for host configuration of minimum 4 core and 16G RAM ...
16:28:34 > Checking for presence of Juju ...
16:28:34 > Checking for presence of ssh-keys interface ...
16:28:34 > Checking if user ubuntu is member of group snap_daemon ...
16:28:34 > Checking for ~/.local/share directory ...
16:28:34 > Checking if join token looks valid ...
16:28:34 > Checking if Hypervisor Hostname is same as FQDN ...
16:28:34 > Authenticating with Juju controller ...
16:28:34 > Adding node to Sunbeam cluster ...
16:28:34 > Saving machine user ob76-node5.maas for local usage ...
16:28:41 > Registering machine user ob76-node5.maas using token ...
16:29:27 > Adding machine to Juju model ...
16:29:27 > Updating node info in cluster database ...
16:29:31 > Adding Sunbeam Machine unit to machine(s) ...
16:32:38 > Adding MicroK8S unit to machine ...
16:37:29 > Adding MicroCeph unit to machine ...
16:37:29 > Configuring MicroCeph storage ...
16:38:15 Error: Unable to list disks
16:38:15 DEBUG: execute return code is 1

I'm attaching both logs just in case.
