landscape-server 503 error when landscape-scalable is deployed on localhost

Bug #2022982 reported by Rajan Patel
This bug affects 1 person
Affects         Status      Importance  Assigned to      Milestone
Canonical Juju  Incomplete  Undecided   Joseph Phillips
lxd             New         Undecided   Unassigned

Bug Description

If somebody follows the Juju installation instructions at:
https://discourse.ubuntu.com/t/landscape-beta-juju-installation/31538

then within 2 weeks the Landscape Server unit loses its agent and its machine is reported as down:

```
ubuntu@landscape-beta:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
landscape-model landscape-controller localhost/localhost 3.1.2 unsupported 23:23:45Z

App Version Status Scale Charm Channel Rev Exposed Message
haproxy active 1 haproxy stable 66 yes Unit is ready
landscape-server active 0/1 landscape-server stable 79 no Unit is ready
postgresql 12.15 active 1 postgresql latest/stable 270 no Live master (12.15)
rabbitmq-server 3.8.2 active 1 rabbitmq-server stable 123 no Unit is ready

Unit Workload Agent Machine Public address Ports Message
haproxy/0* active idle 0 10.251.146.186 80,443/tcp Unit is ready
landscape-server/0 unknown lost 1 10.251.146.75 agent lost, see 'juju show-status-log landscape-server/0'
postgresql/0* active idle 2 10.251.146.217 5432/tcp Live master (12.15)
rabbitmq-server/0* active idle 3 10.251.146.94 5672/tcp Unit is ready

Machine State Address Inst id Base AZ Message
0 started 10.251.146.186 juju-acf4dc-0 ubuntu@20.04 Running
1 down 10.251.146.75 juju-acf4dc-1 ubuntu@22.04 Running
2 started 10.251.146.217 juju-acf4dc-2 ubuntu@20.04 Running
3 started 10.251.146.94 juju-acf4dc-3 ubuntu@20.04 Running
ubuntu@landscape-beta:~$ juju --version
3.2.0-genericlinux-arm64
ubuntu@landscape-beta:~$ lxc --version
5.14
```

This happens on Landscape Beta and Landscape Stable.

Tags: landscape
Joseph Phillips (manadart) wrote :

Can you run:
- "lxc list" to see if the container in question is up?
- If it's up, "lxc exec juju-acf4dc-1 bash" to get on to it directly.
- Then look at the machine's log at /var/log/juju/machine-1.log to see if it yields anything.

Changed in juju:
status: New → Incomplete
Rajan Patel (rajannpatel) wrote :

`lxc list` shows that the LXD container does not have an IPv4 address:

ubuntu@landscape-beta:~$ lxc list
+---------------+---------+-----------------------+------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-0f524d-0 | RUNNING | 10.251.146.20 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-0 | RUNNING | 10.251.146.186 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-1 | RUNNING | | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-2 | RUNNING | 10.251.146.217 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-3 | RUNNING | 10.251.146.94 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+

Running `lxc exec juju-acf4dc-1 bash` seems to hang.
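
The same condition can be spotted without reading the table by eye. The following is only a rough sketch: `instances_without_ipv4` is a hypothetical helper name, and it assumes its input is `name,state,ipv4` CSV lines, such as `lxc list --format csv -c ns4` would be expected to print for instances with at most one IPv4 address.

```shell
# Sketch: print the names of RUNNING instances that report no IPv4 address.
# Assumption: stdin holds CSV lines of the form "name,state,ipv4".
# instances_without_ipv4 is a hypothetical helper, not part of LXD or Juju.
instances_without_ipv4() {
  awk -F, '$2 == "RUNNING" && $3 == "" { print $1 }'
}
```

It could be used as `lxc list --format csv -c ns4 | instances_without_ipv4`, which for the listing above would be expected to name only juju-acf4dc-1.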

Changed in juju:
status: Incomplete → New
Joseph Phillips (manadart) wrote :

Hmm.

First thing to try is bouncing the container with "lxc restart juju-acf4dc-1".

If that doesn't get it back in the game, look in /var/snap/lxd/common/lxd/logs.
There might be something in lxd.log, or in the instance's directory, that is illuminating.
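
To narrow lxd.log down to the entries most likely to matter, something like the sketch below may help. It assumes the `level=...` key/value log format and that problems are logged at `warning` or `error` level; `lxd_log_problems` is a hypothetical helper name.

```shell
# Sketch: keep only warning- and error-level lines from an LXD log on stdin.
# Assumption: lines carry a level=<name> field; other levels are dropped.
# lxd_log_problems is a hypothetical helper, not an LXD command.
lxd_log_problems() {
  # grep exits non-zero when nothing matches; treat that as a clean result.
  grep -E 'level=(warning|error)' || true
}
```

For example: `sudo cat /var/snap/lxd/common/lxd/logs/lxd.log | lxd_log_problems`.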

Changed in juju:
status: New → Incomplete
Rajan Patel (rajannpatel) wrote :

ubuntu@landscape-beta:~$ lxc restart juju-acf4dc-1
Error: Failed shutting down instance, status is "Running": context deadline exceeded
Try `lxc info --show-log juju-acf4dc-1` for more info

root@landscape-beta:/var/snap/lxd/common/lxd/logs# cat lxd.log
time="2023-05-31T17:49:38Z" level=warning msg=" - Couldn't find the CGroup network priority controller, network priority will be ignored"
time="2023-05-31T17:49:38Z" level=warning msg="Instance type not operational" driver=qemu err="KVM support is missing (no /dev/kvm)" type=virtual-machine

Changed in juju:
status: Incomplete → New
Rajan Patel (rajannpatel) wrote :

This output may be more readable: https://pastebin.canonical.com/p/hh3vZ7Tff2/

Joseph Phillips (manadart) wrote :

Nothing here illuminates why the container is black-holed.

I've added LXD to the bug.

I'll need to reach out to them explicitly to solicit help because they don't use LP for their bug tracking.

Thomas Parrott (tomparrott) wrote :

Are the containers restartable using `lxc restart -f`?

Thomas Parrott (tomparrott) wrote :

Please also provide `sudo ps auxf` output for the lxd host.

Changed in juju:
assignee: nobody → Joseph Phillips (manadart)
Rajan Patel (rajannpatel) wrote :

@Thomas the `sudo ps auxf` output can be found here: https://pastebin.canonical.com/p/nYxTY7HFZd/

Rajan Patel (rajannpatel) wrote :

@Thomas - apologies, the `sudo ps auxf` output in comment #9 above is from a different machine (also exhibiting the same issue). The output you requested on the machine we are discussing in this ticket can be found here: https://pastebin.canonical.com/p/mN8Fy7RJmb/

Rajan Patel (rajannpatel) wrote :

@Thomas - `lxc restart -f` worked. Output shown below:

ubuntu@landscape-beta:~$ lxc list
+---------------+---------+-----------------------+------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-0f524d-0 | RUNNING | 10.251.146.20 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-0 | RUNNING | 10.251.146.186 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-1 | RUNNING | | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-2 | RUNNING | 10.251.146.217 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-3 | RUNNING | 10.251.146.94 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
ubuntu@landscape-beta:~$ lxc restart juju-acf4dc-1 -f
ubuntu@landscape-beta:~$ lxc list
+---------------+---------+-----------------------+------+-----------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-0f524d-0 | RUNNING | 10.251.146.20 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-0 | RUNNING | 10.251.146.186 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-1 | RUNNING | 10.251.146.75 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-2 | RUNNING | 10.251.146.217 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+
| juju-acf4dc-3 | RUNNING | 10.251.146.94 (eth0) | | CONTAINER | 0 |
+---------------+---------+-----------------------+------+-----------+-----------+

After running: `lxc exec juju-acf4dc-1 bash`:
- For `cat /var/log/juju/machine-1.log` I see: https://pastebin.canonical.com/p/SWY7H3NMgg/
- For `cat /var/log/juju/machine-lock.log` I see: https://pastebin.canonical.com/p/9nCGBBk6T2/
- For `cat /var/log/juju/unit-landscape-server-0.log` I see: https://pastebin.canonical.com/p/244hzjndzj/

It's worth noting that this had the same result as rebooting the entire machine. Landscape no longer produces a 503 error, and the dashboard is served successfully by the LXD juju-acf4dc-1 instance. However, I fully expect this issue to recur after several weeks.

Rajan Patel (rajannpatel) wrote :

@Thomas if you want a `sudo ps auxf` output after performing the `lxc restart` step, it can be found here: https://pastebin.canonical.com/p/qhwb9WyWKp/

Thomas Parrott (tomparrott) wrote :

Thanks for those.

So to summarise so far:

1. After a while the container loses its IP.
2. `lxc restart` doesn't work because the container's init system is not responding to (or not finishing) the shutdown request that LXD sends.
3. `lxc exec` still works?
4. `lxc restart -f` works - showing that LXD itself is still functioning correctly and can control the container from outside.
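
Points 2 and 4 suggest a restart sequence along the following lines. This is a sketch rather than a recommended procedure: `restart_with_fallback` is a hypothetical helper, the 60-second timeout is an arbitrary choice, and forcing a restart stops the container without a clean shutdown.

```shell
# Sketch: try a clean restart first, then fall back to a forced one.
# restart_with_fallback is a hypothetical helper; the 60 s timeout is arbitrary.
restart_with_fallback() {
  instance="$1"
  if lxc restart "$instance" --timeout 60; then
    echo "clean restart of $instance succeeded"
  elif lxc restart "$instance" --force; then
    # --force skips the shutdown request that the container's init ignored.
    echo "forced restart of $instance succeeded"
  else
    echo "could not restart $instance" >&2
    return 1
  fi
}
```

Usage would be `restart_with_fallback juju-acf4dc-1`.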

Looking at your ps output, does it strike you as odd that there are many many `/usr/sbin/CRON -f -P` processes running inside that problem `juju-acf4dc-1` container?
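
To put a number on that, the saved ps output can be counted. A minimal sketch, assuming the output was written to a file first (piping `ps` straight into `grep` would also count the grep process itself); `count_cron_processes` is a hypothetical helper name.

```shell
# Sketch: count lines in saved `ps auxf` output that show a forked CRON child.
# Assumption: the children appear exactly as "/usr/sbin/CRON -f -P".
# count_cron_processes is a hypothetical helper reading ps output on stdin.
count_cron_processes() {
  # -F matches the string literally; grep -c prints 0 but exits 1 on no match.
  grep -c -F -e '/usr/sbin/CRON -f -P' || true
}
```

For example: `sudo ps auxf > ps.txt` followed by `count_cron_processes < ps.txt`. Note that this counts matches across the whole host, not per container.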

The next time it happens, if `lxc exec` is working, please can you run `lxc exec <instance> -- ps auxf` so we can see things from the container's perspective.

Also, it would be good to get the output of `lxc exec <instance> -- systemctl` to see the state of the systemd units.

Finally, it would be good to see the output of the last few hundred lines of the container's journalctl entries, so something like:

```
lxc exec <instance> -- journalctl -b -r -n 200
```
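
The three requests above could be gathered in one pass with something like the sketch below. It is an illustration only: `collect_container_diagnostics` and the default output directory are hypothetical, and the `--no-pager` flags are an addition so that output goes to the files rather than to a pager.

```shell
# Sketch: save ps, systemctl and journalctl output from inside a container.
# collect_container_diagnostics and ./lxd-diagnostics are hypothetical names.
collect_container_diagnostics() {
  instance="$1"
  outdir="${2:-./lxd-diagnostics}"
  mkdir -p "$outdir" || return 1
  # Each command runs inside the container; stderr is kept alongside stdout.
  lxc exec "$instance" -- ps auxf > "$outdir/ps.txt" 2>&1
  lxc exec "$instance" -- systemctl --no-pager > "$outdir/systemctl.txt" 2>&1
  lxc exec "$instance" -- journalctl -b -r -n 200 --no-pager > "$outdir/journal.txt" 2>&1
  echo "wrote diagnostics for $instance to $outdir"
}
```

Usage would be `collect_container_diagnostics juju-acf4dc-1`. Since `lxc exec` was reported to hang on the affected container, wrapping each call in coreutils `timeout` may be worth considering.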

Changed in juju:
status: New → Triaged
Rajan Patel (rajannpatel) wrote :

#3 - `lxc exec` doesn't work. For example: Running `lxc exec juju-acf4dc-1 bash` seems to hang.

#4 - agree, `lxc restart` times out (exact error message included above) but using `-f` succeeds.

The `sudo ps auxf` output was from the host, not from inside the container.

I have a separate machine where this same issue is reproduced (and it has not been restarted). Would you folks like SSH access there so you can log in and tinker? If yes, which SSH key(s) should I add?

Thomas Parrott (tomparrott) wrote :

Yes that would be great thanks.

Thomas Parrott (tomparrott) wrote :
Changed in juju:
status: Triaged → Incomplete