Kubernetes-core on lxd-cluster breaks lxd database connection

Bug #1852138 reported by Camille Rodriguez
Affects: Charmed Kubernetes Bundles
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

I deployed the kubernetes-core bundle on a lxd cluster. My lxd cluster is composed of 2 nodes, and is set up with a fan network.
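
For context, the cluster and fan network were set up along these lines (a minimal sketch from memory; the node roles, network name, and underlay subnet are illustrative, not the exact values used):

```
# Node 1: initialise LXD and answer "yes" to clustering when prompted
sudo lxd init

# Node 2: run the same command and choose to join the existing cluster,
# pointing at node 1's address and supplying the join token / trust password
sudo lxd init

# Create a fan overlay network spanning the cluster
# (on a cluster this may first need "lxc network create lxdfan0 --target <node>" per node)
lxc network create lxdfan0 bridge.mode=fan fan.underlay_subnet=10.0.0.0/24
```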

When I deploy the kubernetes-core bundle on top of it, at some point something breaks the LXD database connection. Juju isn't able to connect to it anymore, and LXD does not respond to simple commands:

```
ubuntu@kubecon04:~$ sudo lxc info
Error: cannot fetch node config from database: driver: bad connection
```
The status of the service seems fine, but if I restart it, it does not come back up:
```
ubuntu@kubecon04:~$ systemctl status snap.lxd.daemon.unix.socket
Failed to dump process list, ignoring: No such file or directory
● snap.lxd.daemon.unix.socket - Socket unix for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.unix.socket; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2019-11-11 21:07:19 UTC; 40min ago
   Listen: /var/snap/lxd/common/lxd/unix.socket (Stream)
   CGroup: /system.slice/snap.lxd.daemon.unix.socket
ubuntu@kubecon04:~$ systemctl restart snap.lxd.daemon.unix.socket
Failed to restart snap.lxd.daemon.unix.socket: The name org.freedesktop.PolicyKit1 was not provided by any .service files
See system logs and 'systemctl status snap.lxd.daemon.unix.socket' for details.
ubuntu@kubecon04:~$ sudo systemctl restart snap.lxd.daemon.unix.socket
Job for snap.lxd.daemon.unix.socket failed.
See "systemctl status snap.lxd.daemon.unix.socket" and "journalctl -xe" for details.
ubuntu@kubecon04:~$
ubuntu@kubecon04:~$
ubuntu@kubecon04:~$ systemctl status snap.lxd.daemon.unix.socket
● snap.lxd.daemon.unix.socket - Socket unix for snap application lxd.daemon
   Loaded: loaded (/etc/systemd/system/snap.lxd.daemon.unix.socket; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2019-11-11 21:48:07 UTC; 2s ago
   Listen: /var/snap/lxd/common/lxd/unix.socket (Stream)
```
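
For what it's worth, with the snap package the daemon itself is normally restarted through its service unit or through snapd rather than the socket unit; something like the following (not from the original session, and it may well fail the same way here):

```
# Restart the LXD daemon shipped by the snap, not just the unix socket unit
sudo systemctl restart snap.lxd.daemon
# or equivalently via snapd
sudo snap restart lxd
```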

Here's a dump of the LXD local database:

```
ubuntu@kubecon04:~$ sudo sqlite3 /var/snap/lxd/common/lxd/database/local.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE schema (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    version INTEGER NOT NULL,
    updated_at DATETIME NOT NULL,
    UNIQUE (version)
);
INSERT INTO schema VALUES(1,38,1573498533);
CREATE TABLE config (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    key VARCHAR(255) NOT NULL,
    value TEXT,
    UNIQUE (key)
);
INSERT INTO config VALUES(1,'cluster.https_address','139.178.67.178:8443');
INSERT INTO config VALUES(3,'core.https_address','139.178.67.178:8443');
CREATE TABLE patches (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    name VARCHAR(255) NOT NULL,
    applied_at DATETIME NOT NULL,
    UNIQUE (name)
);
INSERT INTO patches VALUES(1,'shrink_logs_db_file',1573498533);
INSERT INTO patches VALUES(2,'invalid_profile_names',1573498533);
INSERT INTO patches VALUES(3,'leftover_profile_config',1573498533);
INSERT INTO patches VALUES(4,'network_permissions',1573498533);
INSERT INTO patches VALUES(5,'storage_api',1573498533);
INSERT INTO patches VALUES(6,'storage_api_v1',1573498533);
INSERT INTO patches VALUES(7,'storage_api_dir_cleanup',1573498533);
INSERT INTO patches VALUES(8,'storage_api_lvm_keys',1573498533);
INSERT INTO patches VALUES(9,'storage_api_keys',1573498533);
INSERT INTO patches VALUES(10,'storage_api_update_storage_configs',1573498533);
INSERT INTO patches VALUES(11,'storage_api_lxd_on_btrfs',1573498533);
INSERT INTO patches VALUES(12,'storage_api_lvm_detect_lv_size',1573498533);
INSERT INTO patches VALUES(13,'storage_api_insert_zfs_driver',1573498533);
INSERT INTO patches VALUES(14,'storage_zfs_noauto',1573498533);
INSERT INTO patches VALUES(15,'storage_zfs_volume_size',1573498533);
INSERT INTO patches VALUES(16,'network_dnsmasq_hosts',1573498533);
INSERT INTO patches VALUES(17,'storage_api_dir_bind_mount',1573498533);
INSERT INTO patches VALUES(18,'fix_uploaded_at',1573498533);
INSERT INTO patches VALUES(19,'storage_api_ceph_size_remove',1573498533);
INSERT INTO patches VALUES(20,'devices_new_naming_scheme',1573498533);
INSERT INTO patches VALUES(21,'storage_api_permissions',1573498533);
INSERT INTO patches VALUES(22,'container_config_regen',1573498533);
INSERT INTO patches VALUES(23,'lvm_node_specific_config_keys',1573498533);
INSERT INTO patches VALUES(24,'candid_rename_config_key',1573498533);
INSERT INTO patches VALUES(25,'move_backups',1573498533);
INSERT INTO patches VALUES(26,'storage_api_rename_container_snapshots_dir',1573498533);
INSERT INTO patches VALUES(27,'storage_api_rename_container_snapshots_links',1573498533);
INSERT INTO patches VALUES(28,'fix_lvm_pool_volume_names',1573498533);
INSERT INTO patches VALUES(29,'storage_api_rename_container_snapshots_dir_again',1573498533);
INSERT INTO patches VALUES(30,'storage_api_rename_container_snapshots_links_again',1573498533);
INSERT INTO patches VALUES(31,'storage_api_rename_container_snapshots_dir_again_again',1573498533);
INSERT INTO patches VALUES(32,'clustering_add_roles',1573498533);
CREATE TABLE raft_nodes (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    address TEXT NOT NULL,
    UNIQUE (address)
);
INSERT INTO raft_nodes VALUES(1,'139.178.67.178:8443');
INSERT INTO raft_nodes VALUES(2,'139.178.67.242:8443');
DELETE FROM sqlite_sequence;
INSERT INTO sqlite_sequence VALUES('schema',1);
INSERT INTO sqlite_sequence VALUES('patches',32);
INSERT INTO sqlite_sequence VALUES('config',3);
INSERT INTO sqlite_sequence VALUES('raft_nodes',2);
COMMIT;
```
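
Note that local.db only carries node-local settings; the clustered state lives in LXD's dqlite "global" database, which can only be inspected while the daemon is answering, with something along these lines:

```
# Query the clustered dqlite database; requires a responsive LXD daemon
sudo lxd sql global "SELECT * FROM nodes;"
# Higher-level view of cluster membership
lxc cluster list
```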

juju status
```
ubuntu@kubecon04:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
kubernetes-core lxd-remote-default lxd-remote/default 2.6.10 unsupported 21:51:15Z

App Version Status Scale Charm Store Rev OS Notes
containerd waiting 0 containerd jujucharms 33 ubuntu
easyrsa 3.0.1 active 1 easyrsa jujucharms 278 ubuntu
etcd 3.2.10 active 1 etcd jujucharms 460 ubuntu
flannel waiting 0 flannel jujucharms 450 ubuntu
kubernetes-master 1.16.2 waiting 0/1 kubernetes-master jujucharms 754 ubuntu exposed
kubernetes-worker 1.16.2 blocked 0/1 kubernetes-worker jujucharms 590 ubuntu exposed

Unit Workload Agent Machine Public address Ports Message
easyrsa/0* active idle 0/lxd/0 10.98.184.17 Certificate Authority connected.
etcd/0* active idle 0 240.67.178.105 2379/tcp Healthy with 1 known peer
kubernetes-master/0* waiting failed 0 240.67.178.105 Waiting for kube-api-endpoint relation
kubernetes-worker/0* blocked failed 1 240.67.242.56 Connect a container runtime.

Machine State DNS Inst id Series AZ Message
0 started 240.67.178.105 juju-4aa30b-0 bionic Running
0/lxd/0 started 10.98.184.17 juju-4aa30b-0-lxd-0 bionic Container started
1 started 240.67.242.56 juju-4aa30b-1 bionic Running
```

I will attach the debug-log in the first comment of this bug.

From the lxd debug command, I get this error:
```
EROR[11-11|21:52:14] Cannot listen on https socket, skipping... err="listen tcp 139.178.67.178:8443: bind: address already in use"
```
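
The "address already in use" suggests another process (or a stale LXD instance) was still bound to 8443; a quick check, not part of the original report, would be:

```
# Show which process is holding the LXD cluster port
sudo ss -tlnp | grep 8443
```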

Destroying the model doesn't work either...
```
ubuntu@kubecon04:~$ juju destroy-model kubernetes-core
WARNING! This command will destroy the "kubernetes-core" model.
This includes all machines, applications, data and other resources.

Continue [y/N]? y
Destroying modelUnable to get the model status from the API: getting storage provider registry: Get https://139.178.67.178:8443/1.0: Unable to connect to: 139.178.67.178:8443.
```
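
When the backing cloud is unreachable like this, the model generally has to be removed forcefully; roughly as follows (the --force/--no-wait flags exist in recent Juju releases, though I have not verified them against this exact 2.6.10 setup):

```
# Tear the model down without waiting on the broken LXD API
juju destroy-model kubernetes-core --force --no-wait -y
```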

This might be more of an LXD bug than a Kubernetes bug, but in any case it is an incompatibility between the two.

no longer affects: charm-aws-iam
Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Well, the upload of the debug-log keeps failing in Launchpad, so here's a short snippet of it (the end):

```
/provides.py:42:joined:client
tracer: ++ queue handler reactive/easyrsa.py:231:message
tracer: ++ queue handler reactive/easyrsa.py:240:send_ca
tracer: ++ queue handler reactive/easyrsa.py:277:publish_global_client_cert
tracer: ++ queue handler reactive/easyrsa.py:90:set_easyrsa_version
unit-easyrsa-0: 21:31:33 INFO unit.easyrsa/0.juju-log Invoking reactive handler: reactive/easyrsa.py:90:set_easyrsa_version
unit-easyrsa-0: 21:31:33 INFO unit.easyrsa/0.juju-log Invoking reactive handler: reactive/easyrsa.py:231:message
unit-easyrsa-0: 21:31:33 INFO unit.easyrsa/0.juju-log Invoking reactive handler: reactive/easyrsa.py:240:send_ca
unit-kubernetes-master-0: 21:31:34 ERROR juju.worker.uniter resolver loop error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-master-0: 21:31:34 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-easyrsa-0: 21:31:34 INFO unit.easyrsa/0.juju-log Invoking reactive handler: reactive/easyrsa.py:277:publish_global_client_cert
unit-easyrsa-0: 21:31:35 INFO unit.easyrsa/0.juju-log Invoking reactive handler: hooks/relations/tls-certificates/provides.py:42:joined:client
unit-etcd-0: 21:31:35 INFO unit.etcd/0.juju-log Invoking reactive handler: reactive/etcd.py:429:process_snapd_timer
unit-etcd-0: 21:31:36 INFO unit.etcd/0.juju-log Get config refresh.timer for snap core
unit-etcd-0: 21:31:36 INFO unit.etcd/0.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:79:joined:certificates
unit-kubernetes-master-0: 21:33:02 ERROR juju.worker.uniter resolver loop error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-master-0: 21:33:02 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-worker-0: 21:33:05 ERROR juju.worker.uniter resolver loop error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-worker-0: 21:33:05 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-master-0: 21:34:46 ERROR juju.worker.uniter resolver loop error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-master-0: 21:34:46 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-worker-0: 21:35:00 ERROR juju.worker.uniter resolver loop error: getting storage provider registry: cannot fetch node config from database: driver: bad connection
unit-kubernetes-worker-0: 21:35:00 ERROR juju.worker.dependency "uniter" man...
```

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

I tried one more time: wiped LXD entirely, reinstalled it as a cluster on 3 nodes (roughly as sketched below), and then deployed CDK.
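
The wipe and reinstall on each node was roughly the following (a sketch; the snap channel and the lxd init answers are not recorded):

```
# On each of the three nodes: remove LXD together with its data, then reinstall
sudo snap remove --purge lxd   # --purge requires a reasonably recent snapd
sudo snap install lxd
# Re-initialise, answering "yes" to clustering (join nodes 2 and 3 to node 1)
sudo lxd init
```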

It fails with the error "Failed to get current cluster nodes: failed to begin transaction: not an error" pretty quickly into the deployment of CDK.

Here are the logs from /var/snap/lxd/common/lxd/logs/lxd.log between the start of the deployment and the failure:

```
t=2019-11-12T22:00:29+0000 lvl=info msg="Creating container" ephemeral=false name=juju-c54a63-0 project=default
t=2019-11-12T22:00:29+0000 lvl=info msg="Created container" ephemeral=false name=juju-c54a63-0 project=default
t=2019-11-12T22:00:32+0000 lvl=info msg="Starting container" action=start created=2019-11-12T22:00:29+0000 ephemeral=false name=juju-c54a63-0 project=default stateful=false used=1970-01-01T00:00:00+0000
t=2019-11-12T22:00:33+0000 lvl=info msg="Started container" action=start created=2019-11-12T22:00:29+0000 ephemeral=false name=juju-c54a63-0 project=default stateful=false used=1970-01-01T00:00:00+0000
t=2019-11-12T22:01:44+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00490: use of closed network connection"
t=2019-11-12T22:01:44+0000 lvl=warn msg="Dqlite server proxy Unix -> TLS: read unix @->@00490: use of closed network connection"
t=2019-11-12T22:01:44+0000 lvl=warn msg="Excluding offline node from refresh: {ID:1 Address:139.178.67.162:8443 RaftID:1 Raft:true LastHeartbeat:2019-11-12 22:00:41.65550536 +0000 UTC Online:false updated:false}"
t=2019-11-12T22:01:44+0000 lvl=eror msg="Failed to load instances for snapshot expiry" err="failed to begin transaction: not an error"
t=2019-11-12T22:01:44+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
t=2019-11-12T22:01:44+0000 lvl=eror msg="Failed to load containers for scheduled snapshots" err="failed to begin transaction: not an error"
t=2019-11-12T22:01:44+0000 lvl=eror msg="Error refreshing forkdns: failed to begin transaction: not an error"
t=2019-11-12T22:01:45+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
t=2019-11-12T22:01:45+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
t=2019-11-12T22:01:46+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
t=2019-11-12T22:01:47+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
t=2019-11-12T22:01:48+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
t=2019-11-12T22:01:49+0000 lvl=warn msg="Failed to get current cluster nodes: failed to begin transaction: not an error"
```

On the juju side, the latest logs:

```
machine-0: 22:07:36 ERROR juju.worker.dependency "machiner" manifold worker returned unexpected error: cannot update observed network config: cannot get provider network config: failed to construct a model from config: failed to begin transaction: not an error
machine-0: 22:07:39 ERROR juju.worker.dependency "broker-tracker" manifold worker returned unexpected error: cannot load machine machine-0 from state: failed to begin transaction: not an error
...
```


Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Oh, by the way, for my last test in comment #2, I actually deployed CDK on a 3-node cluster, not kubernetes-core as the title suggests.

no longer affects: cdk8s
Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the report. Unfortunately, I don't think there's much we can do with this issue without more details about your environment and how to reproduce it. We regularly test Charmed Kubernetes on LXD and have not encountered this.

It's not clear where the issue lies. Charmed Kubernetes uses a fairly permissive LXD profile, so it's possible something inside one of the containers was conflicting with LXD outside of it. It could have also been a bug in LXD or a bug in Juju.

I know it's been a while since you reported this, but if you still have the information, please share details about your environment (host OS, LXD version, Juju version) and how you configured LXD.

Changed in charmed-kubernetes-bundles:
status: New → Incomplete
Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Hi George. Indeed, I no longer have access to the environment.
However, I would like to point out that this bug occurred when deploying CDK on **clustered** LXD, not plain LXD. At the time of reporting the bug, I recall talking with both the LXD team and the K8s team, and both told me that this was not a tested architecture. I am pretty sure that a simple test of 1) setting up an LXD cluster with 3+ bare-metal nodes, then 2) deploying CDK on top of it, would trigger the bug (a rough sketch follows below). It just depends on whether this is a scenario worth investigating.
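
A rough sketch of those two steps, assuming the LXD cluster has already been formed and is then registered with Juju as a remote cloud (the cloud name and bundle name are illustrative):

```
# 1) LXD cluster already set up across 3+ bare-metal nodes (see the lxd init sketches above)
# 2) Register the cluster with Juju, bootstrap, and deploy CDK on top of it
juju add-cloud lxd-remote        # interactive: cloud type "lxd", endpoint https://<node1>:8443
juju add-credential lxd-remote   # interactive: certificate / trust password for the cluster
juju bootstrap lxd-remote
juju deploy charmed-kubernetes   # the CDK bundle (kubernetes-core for the smaller bundle)
```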

Cheers,
Camille

Revision history for this message
George Kraft (cynerva) wrote :

Ah, thanks. Looks like you said clustered LXD clearly in the original report, but I misinterpreted it. My bad!

Yes, while we do test on LXD, we do not test on clustered LXD. I think that should be enough info for us to try reproducing this.

Changed in charmed-kubernetes-bundles:
importance: Undecided → Medium
status: Incomplete → Triaged