node fails to join cluster, fails to list disks

Bug #2063223 reported by Marian Gasparovic
This bug affects 1 person
Affects: OpenStack Snap
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

2024-04-23-01:44:13 root ERROR [localhost] Command failed: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null solqa-lab1-server-33.nosilo.lab1.solutionsqa -- sunbeam cluster join --token eyJuYW1lIjoic29scWEtbGFiMS1zZXJ2ZXItMzMubm9zaWxvLmxhYjEuc29sdXRpb25zcWEiLCJzZWNyZXQiOiI2MTY1YWVlMjFiNzJhYWQ0NDRjYjA3ODljNmE1MDdmNWUzMGZkNzdhZmJjYjUwMGIzY2Q3Yjg2NTdjNWFhZDg4IiwiZmluZ2VycHJpbnQiOiJhNjFiYTRiZDUwODRiNzhiZWRiZjg4M2FlMjM2YWY0ZWU3NDI0Y2E5ZjU0NWQ2ZGE2MWQ0YTQ2YTU0ODcyZGExIiwiam9pbl9hZGRyZXNzZXMiOlsiMTAuMjQ2LjE2NC4yMjE6NzAwMCJdfQ== --role control --role compute --role storage
2024-04-23-01:44:13 root ERROR 1[localhost] STDOUT follows:
b''
2024-04-23-01:44:13 root ERROR 2[localhost] STDERR follows:
Warning: Permanently added 'solqa-lab1-server-33.nosilo.lab1.solutionsqa' (ED25519) to the list of known hosts.
Error: Unable to list disks
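For what it is worth, the join token is just base64-encoded JSON (the eyJ prefix gives it away), so it can be decoded to confirm the join address it carries; a minimal sketch, assuming nothing beyond the standard library:

import base64
import json

# Join token copied verbatim from the failing `sunbeam cluster join` command above
token = "eyJuYW1lIjoic29scWEtbGFiMS1zZXJ2ZXItMzMubm9zaWxvLmxhYjEuc29sdXRpb25zcWEiLCJzZWNyZXQiOiI2MTY1YWVlMjFiNzJhYWQ0NDRjYjA3ODljNmE1MDdmNWUzMGZkNzdhZmJjYjUwMGIzY2Q3Yjg2NTdjNWFhZDg4IiwiZmluZ2VycHJpbnQiOiJhNjFiYTRiZDUwODRiNzhiZWRiZjg4M2FlMjM2YWY0ZWU3NDI0Y2E5ZjU0NWQ2ZGE2MWQ0YTQ2YTU0ODcyZGExIiwiam9pbl9hZGRyZXNzZXMiOlsiMTAuMjQ2LjE2NC4yMjE6NzAwMCJdfQ=="

# Pad to a multiple of 4 in case the trailing '=' padding was stripped while copying
padded = token + "=" * (-len(token) % 4)
print(json.dumps(json.loads(base64.b64decode(padded)), indent=2))
# Prints name, secret, fingerprint and join_addresses (["10.246.164.221:7000"]),
# i.e. the token itself points at the expected first node.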

The job which prepares the disks (it removes existing partitions) ran before this as usual; it shows:

2024-04-23-00:55:53 root DEBUG [localhost]: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null solqa-lab1-server-33.nosilo.lab1.solutionsqa -- sudo umount '/dev/sdb?*' '||' exit 0
2024-04-23-00:55:54 root DEBUG [localhost]: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null solqa-lab1-server-33.nosilo.lab1.solutionsqa -- sudo wipefs -a /dev/sdb
2024-04-23-00:55:54 root INFO [localhost]: /dev/sdb: 8 bytes were erased at offset 0x00000200 (gpt): 45 46 49 20 50 41 52 54
/dev/sdb: 8 bytes were erased at offset 0x37e4895e00 (gpt): 45 46 49 20 50 41 52 54
/dev/sdb: 2 bytes were erased at offset 0x000001fe (PMBR): 55 aa
/dev/sdb: calling ioctl to re-read partition table: Success

and the machine journal shows that as well:

Apr 23 00:55:54 solqa-lab1-server-33 systemd[1]: Started Session 4 of User ubuntu.
Apr 23 00:55:54 solqa-lab1-server-33 sudo[1871]: ubuntu : PWD=/home/ubuntu ; USER=root ; COMMAND=/usr/sbin/partprobe /dev/sdb
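For reference, the "45 46 49 20 50 41 52 54" bytes that wipefs reports erasing are the GPT header signature and "55 aa" is the protective-MBR boot signature, so the wipe itself appears to have completed; a quick check of what those hex bytes spell:

print(bytes.fromhex("45 46 49 20 50 41 52 54").decode("ascii"))  # -> 'EFI PART', the GPT magic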

The manifest has the correct disk as well:

microceph_config:
    solqa-lab1-server-31.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-32.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-33.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-34.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-35.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
    solqa-lab1-server-36.nosilo.lab1.solutionsqa:
      osd_devices: /dev/sdb
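As a diagnostic sketch only (assuming microceph_config sits at the top level of the manifest, here saved locally as manifest.yaml, and that PyYAML is installed), the entry for the failing host can be cross-checked against what that host actually exposes:

import pathlib
import yaml  # PyYAML, assumed to be available

# Hypothetical local copy of the deployment manifest
manifest = yaml.safe_load(pathlib.Path("manifest.yaml").read_text())

host = "solqa-lab1-server-33.nosilo.lab1.solutionsqa"
device = pathlib.Path(manifest["microceph_config"][host]["osd_devices"])
print(f"manifest expects {host} to use {device}")

# When run on that host, confirm the device node is present and is a whole block device
print("exists:", device.exists())
print("block device:", device.is_block_device())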

Logs and artifacts: https://oil-jenkins.canonical.com/artifacts/b7b759bf-c94e-4155-a1a3-b7de6dbcc1cd/index.html

Tags: cdo-qa
Revision history for this message
Andre Ruiz (andre-ruiz) wrote :

I also just hit this with 2024.1/edge

Attaching logs.

Changed in snap-openstack:
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

@andre-ruiz

unit-microceph-1: 12:34:57 ERROR unit.microceph/1.juju-log peers:1: Failed executing cmd: ['microceph', 'cluster', 'join', 'eyJuYW1lIjoibWljcm9jZXBoLzEiLCJzZWNyZXQiOiIzM2I0MzBkMWRiYzA2N2RiNWE0NmI5OGVjZDI3YzU2MzEzNGZmNDkxMjQwYjg4ZjFkMmI3YjVhZmM0NmI4NGEyIiwiZmluZ2VycHJpbnQiOiJmM2Q4YmQ3ZDgxYmFmZDRhN2Y1ZDE1MDJmNzM2MDAwOWI2ZmJhM2RlZmI0ZDVkMDE0Y2VhMjcwNzExZDgwODU0Iiwiam9pbl9hZGRyZXNzZXMiOlsiMTQ3Ljc1LjU1LjIxOTo3NDQzIl19'], error: Error: failed to generate the configuration: failed to locate IP on public network 147.75.55.219/31: no IP belongs to provided subnet 147.75.55.219/31

unit-microceph-1: 12:34:57 WARNING unit.microceph/1.juju-log peers:1: Error: failed to generate the configuration: failed to locate IP on public network 147.75.55.219/31: no IP belongs to provided subnet 147.75.55.219/31

Looks like the microceph unit failed to join the cluster, but maybe that was not detected by the charm.
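For illustration only, a minimal check of why that join fails: the unit's own address (147.75.55.139, visible in the status line quoted in the next comment) cannot fall inside a /31 built around 147.75.55.219:

import ipaddress

# Public network subnet reported in the error above
public_net = ipaddress.ip_network("147.75.55.219/31", strict=False)
# Address actually configured on the joining unit (from the juju status below)
unit_ip = ipaddress.ip_address("147.75.55.139")

print(list(public_net.hosts()))  # only 147.75.55.218 and 147.75.55.219
print(unit_ip in public_net)     # False -> "no IP belongs to provided subnet"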

Revision history for this message
James Page (james-page) wrote :

Actually it did:

microceph/1 blocked idle 1 147.75.55.139 (workload) Error in charm (see logs): Command '['microceph', 'cluster', 'join', 'eyJuYW1lIjoibWljcm9jZXBoLzEiLCJzZWNy...

maybe the point where things trip up after this is the disk listing

summary: - cluster join fails on Error: Unable to list disks
+ microceph unit fails to join cluster, fails to list disks
summary: - microceph unit fails to join cluster, fails to list disks
+ node fails to join cluster, fails to list disks
Revision history for this message
James Page (james-page) wrote :

@andre - I think your issue is different, but it shows the same symptom.

You can see in the status for the original bug report that all three units are active:

microk8s/0* active idle 0 10.246.164.221 16443/tcp
microk8s/1 active idle 1 10.246.167.161 16443/tcp
microk8s/2 active idle 2 10.246.164.223 16443/tcp
