MAAS 2.8 production mode sometimes loses connection when finished downloading an image

Bug #1882155 reported by Bill Wear
This bug affects 2 people
Affects: MAAS
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Sometimes, when running MAAS 2.8 (2.8/candidate) in production mode in an LXD container, MAAS blanks the lower part of the screen and shows a "Connection lost, reconnecting..." message. Sometimes running "maas status" takes a very long time but brings the connection back. This usually happens right after the first image finishes downloading, following installation and configuration.

In these cases, "maas status" eventually returns:

unix:///var/snap/maas/6827/supervisord/sock no such file

"snap stop maas" and "snap restart maas" eventually return:

error: cannot communicate with server: timeout exceeded while waiting for response

"ps -ef | grep maas" shows only these items:

root@maas-2-8-t2:~# ps -ef | grep maas
root 1177 1 0 22:14 ? 00:00:00 snapfuse /var/lib/snapd/snaps/maas-cli_13.snap /snap/maas-cli/13 -o ro,nodev,allow_other,suid
root 1387 1 0 22:16 ? 00:10:21 snapfuse /var/lib/snapd/snaps/maas_6827.snap /snap/maas/6827 -o ro,nodev,allow_other,suid
root 20528 380 0 22:38 ? 00:00:00 grep --color=auto maas
root@maas-2-8-t2:~#

Running "reboot" in the container takes a very long time and does not actually reboot it; the container remains STOPPED.

Attempting to restart the container with "lxc start" produces this error:

Error: Common start logic: saving config file for the container failed
Try `lxc info --show-log maas-2-8-t2` for more info

"lxc info --show-log maas-2-8-t2" produces exactly this, not including the separator line below:

Name: maas-2-8-t2
Location: none
Remote: unix://
Architecture: x86_64
Created: 2020/06/04 22:12 UTC
Status: Stopped
Type: container
Profiles: maas

Log:

----

"lxc start maas-2-8-t2 --debug" produces the following long output, not including the separator line below:

DBUG[06-04|17:46:56] Connecting to a local LXD over a Unix socket
DBUG[06-04|17:46:56] Sending request to LXD method=GET url=http://unix.socket/1.0 etag=
DBUG[06-04|17:46:56] Got response struct from LXD
DBUG[06-04|17:46:56]
 {
  "config": {},
  "api_extensions": [
   "storage_zfs_remove_snapshots",
   "container_host_shutdown_timeout",
   "container_stop_priority",
   "container_syscall_filtering",
   "auth_pki",
   "container_last_used_at",
   "etag",
   "patch",
   "usb_devices",
   "https_allowed_credentials",
   "image_compression_algorithm",
   "directory_manipulation",
   "container_cpu_time",
   "storage_zfs_use_refquota",
   "storage_lvm_mount_options",
   "network",
   "profile_usedby",
   "container_push",
   "container_exec_recording",
   "certificate_update",
   "container_exec_signal_handling",
   "gpu_devices",
   "container_image_properties",
   "migration_progress",
   "id_map",
   "network_firewall_filtering",
   "network_routes",
   "storage",
   "file_delete",
   "file_append",
   "network_dhcp_expiry",
   "storage_lvm_vg_rename",
   "storage_lvm_thinpool_rename",
   "network_vlan",
   "image_create_aliases",
   "container_stateless_copy",
   "container_only_migration",
   "storage_zfs_clone_copy",
   "unix_device_rename",
   "storage_lvm_use_thinpool",
   "storage_rsync_bwlimit",
   "network_vxlan_interface",
   "storage_btrfs_mount_options",
   "entity_description",
   "image_force_refresh",
   "storage_lvm_lv_resizing",
   "id_map_base",
   "file_symlinks",
   "container_push_target",
   "network_vlan_physical",
   "storage_images_delete",
   "container_edit_metadata",
   "container_snapshot_stateful_migration",
   "storage_driver_ceph",
   "storage_ceph_user_name",
   "resource_limits",
   "storage_volatile_initial_source",
   "storage_ceph_force_osd_reuse",
   "storage_block_filesystem_btrfs",
   "resources",
   "kernel_limits",
   "storage_api_volume_rename",
   "macaroon_authentication",
   "network_sriov",
   "console",
   "restrict_devlxd",
   "migration_pre_copy",
   "infiniband",
   "maas_network",
   "devlxd_events",
   "proxy",
   "network_dhcp_gateway",
   "file_get_symlink",
   "network_leases",
   "unix_device_hotplug",
   "storage_api_local_volume_handling",
   "operation_description",
   "clustering",
   "event_lifecycle",
   "storage_api_remote_volume_handling",
   "nvidia_runtime",
   "container_mount_propagation",
   "container_backup",
   "devlxd_images",
   "container_local_cross_pool_handling",
   "proxy_unix",
   "proxy_udp",
   "clustering_join",
   "proxy_tcp_udp_multi_port_handling",
   "network_state",
   "proxy_unix_dac_properties",
   "container_protection_delete",
   "unix_priv_drop",
   "pprof_http",
   "proxy_haproxy_protocol",
   "network_hwaddr",
   "proxy_nat",
   "network_nat_order",
   "container_full",
   "candid_authentication",
   "backup_compression",
   "candid_config",
   "nvidia_runtime_config",
   "storage_api_volume_snapshots",
   "storage_unmapped",
   "projects",
   "candid_config_key",
   "network_vxlan_ttl",
   "container_incremental_copy",
   "usb_optional_vendorid",
   "snapshot_scheduling",
   "container_copy_project",
   "clustering_server_address",
   "clustering_image_replication",
   "container_protection_shift",
   "snapshot_expiry",
   "container_backup_override_pool",
   "snapshot_expiry_creation",
   "network_leases_location",
   "resources_cpu_socket",
   "resources_gpu",
   "resources_numa",
   "kernel_features",
   "id_map_current",
   "event_location",
   "storage_api_remote_volume_snapshots",
   "network_nat_address",
   "container_nic_routes",
   "rbac",
   "cluster_internal_copy",
   "seccomp_notify",
   "lxc_features",
   "container_nic_ipvlan",
   "network_vlan_sriov",
   "storage_cephfs",
   "container_nic_ipfilter",
   "resources_v2",
   "container_exec_user_group_cwd",
   "container_syscall_intercept",
   "container_disk_shift",
   "storage_shifted",
   "resources_infiniband",
   "daemon_storage",
   "instances",
   "image_types",
   "resources_disk_sata",
   "clustering_roles",
   "images_expiry",
   "resources_network_firmware",
   "backup_compression_algorithm",
   "ceph_data_pool_name",
   "container_syscall_intercept_mount",
   "compression_squashfs",
   "container_raw_mount",
   "container_nic_routed",
   "container_syscall_intercept_mount_fuse",
   "container_disk_ceph",
   "virtual-machines",
   "image_profiles",
   "clustering_architecture",
   "resources_disk_id",
   "storage_lvm_stripes",
   "vm_boot_priority",
   "unix_hotplug_devices",
   "api_filtering",
   "instance_nic_network",
   "clustering_sizing",
   "firewall_driver",
   "projects_limits",
   "container_syscall_intercept_hugetlbfs",
   "limits_hugepages",
   "container_nic_routed_gateway",
   "projects_restrictions",
   "custom_volume_snapshot_expiry",
   "volume_snapshot_scheduling",
   "trust_ca_certificates",
   "snapshot_disk_usage",
   "clustering_edit_roles",
   "container_nic_routed_host_address",
   "container_nic_ipvlan_gateway",
   "resources_usb_pci",
   "resources_cpu_threads_numa",
   "resources_cpu_core_die",
   "api_os",
   "container_nic_routed_host_table",
   "container_nic_ipvlan_host_table",
   "container_nic_ipvlan_mode",
   "resources_system",
   "images_push_relay"
  ],
  "api_status": "stable",
  "api_version": "1.0",
  "auth": "trusted",
  "public": false,
  "auth_methods": [
   "tls"
  ],
  "environment": {
   "addresses": [],
   "architectures": [
    "x86_64",
    "i686"
   ],
   "certificate": "-----BEGIN CERTIFICATE-----\nMIICNzCCAb2gAwIBAgIQK33PA5wM5k6IIUnMeEG/QTAKBggqhkjOPQQDAzA9MRww\nGgYDVQQKExNsaW51eGNvbnRhaW5lcnMub3JnMR0wGwYDVQQDDBRyb290QHN0b3Jt\ncmlkZXIteW9nYTAeFw0xOTExMjkxOTUzMDRaFw0yOTExMjYxOTUzMDRaMD0xHDAa\nBgNVBAoTE2xpbnV4Y29udGFpbmVycy5vcmcxHTAbBgNVBAMMFHJvb3RAc3Rvcm1y\naWRlci15b2dhMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEpwoyjfbphJ/yfCD2I8de\nXRoDkyuFi9TASt+55ciDMr5qO3lVFVV2CftmxHk1r34QZ5AJxHZHzYwhs9O14/ui\nEzQvKdahKPFGsjt+f607Kjpg0kAj1652/jzRjoihlvjYo4GBMH8wDgYDVR0PAQH/\nBAQDAgWgMBMGA1UdJQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwSgYDVR0R\nBEMwQYIPc3Rvcm1yaWRlci15b2dhhwTAqCtEhxAmABANsUmdcyWtxwOV5dUThxAm\nABANsUmdc5FXF4GQCLVghwTAqHoBMAoGCCqGSM49BAMDA2gAMGUCMQCcQ3ar/fzW\n5SORmQLz7S39OzBW/9fLMs+ipyJOYF6t3559hWkHz/wIgdRBQPvLxQsCMF8OMUP4\nsldJbD1DhigkxXi8w8jvsWiYx/7MPP+K8wx+6vUNXeM92HHmuaAAa/kYjQ==\n-----END CERTIFICATE-----\n",
   "certificate_fingerprint": "71606ed3a8c20ee717c2b05d23471a15c863bbccd69fbdff7eb6d0bd19e57ad4",
   "driver": "lxc",
   "driver_version": "4.0.2",
   "firewall": "xtables",
   "kernel": "Linux",
   "kernel_architecture": "x86_64",
   "kernel_features": {
    "netnsid_getifaddrs": "true",
    "seccomp_listener": "true",
    "seccomp_listener_continue": "true",
    "shiftfs": "false",
    "uevent_injection": "true",
    "unpriv_fscaps": "true"
   },
   "kernel_version": "5.3.0-24-generic",
   "lxc_features": {
    "cgroup2": "true",
    "mount_injection_file": "true",
    "network_gateway_device_route": "true",
    "network_ipvlan": "true",
    "network_l2proxy": "true",
    "network_phys_macvlan_mtu": "true",
    "network_veth_router": "true",
    "pidfd": "true",
    "seccomp_notify": "true"
   },
   "os_name": "Ubuntu",
   "os_version": "19.10",
   "project": "default",
   "server": "lxd",
   "server_clustered": false,
   "server_name": "stormrider-yoga",
   "server_pid": 2773,
   "server_version": "4.1",
   "storage": "zfs",
   "storage_version": "0.8.1-1ubuntu14"
  }
 }
DBUG[06-04|17:46:56] Sending request to LXD method=GET url=http://unix.socket/1.0/instances/maas-2-8-t2 etag=
DBUG[06-04|17:46:56] Got response struct from LXD
DBUG[06-04|17:46:56]
 {
  "architecture": "x86_64",
  "config": {
   "image.architecture": "amd64",
   "image.description": "ubuntu 18.04 LTS amd64 (release) (20191114)",
   "image.label": "release",
   "image.os": "ubuntu",
   "image.release": "bionic",
   "image.serial": "20191114",
   "image.type": "squashfs",
   "image.version": "18.04",
   "volatile.base_image": "028d045b1cfcfc8a69cc68674557bd86e015c0ba4bb5c3d6851043f785963728",
   "volatile.eth0.host_name": "vethc4bd7bec",
   "volatile.eth0.hwaddr": "00:16:3e:6a:f7:31",
   "volatile.idmap.base": "0",
   "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
   "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
   "volatile.last_state.idmap": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
   "volatile.last_state.power": "RUNNING",
   "volatile.virbr0.host_name": "vethf0880655",
   "volatile.virbr0.hwaddr": "00:16:3e:00:29:4a",
   "volatile.virbr0.name": "eth1"
  },
  "devices": {},
  "ephemeral": false,
  "profiles": [
   "maas"
  ],
  "stateful": false,
  "description": "",
  "created_at": "2020-06-04T17:12:05.600604273-05:00",
  "expanded_config": {
   "image.architecture": "amd64",
   "image.description": "ubuntu 18.04 LTS amd64 (release) (20191114)",
   "image.label": "release",
   "image.os": "ubuntu",
   "image.release": "bionic",
   "image.serial": "20191114",
   "image.type": "squashfs",
   "image.version": "18.04",
   "volatile.base_image": "028d045b1cfcfc8a69cc68674557bd86e015c0ba4bb5c3d6851043f785963728",
   "volatile.eth0.host_name": "vethc4bd7bec",
   "volatile.eth0.hwaddr": "00:16:3e:6a:f7:31",
   "volatile.idmap.base": "0",
   "volatile.idmap.current": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
   "volatile.idmap.next": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
   "volatile.last_state.idmap": "[{\"Isuid\":true,\"Isgid\":false,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000},{\"Isuid\":false,\"Isgid\":true,\"Hostid\":1000000,\"Nsid\":0,\"Maprange\":1000000000}]",
   "volatile.last_state.power": "RUNNING",
   "volatile.virbr0.host_name": "vethf0880655",
   "volatile.virbr0.hwaddr": "00:16:3e:00:29:4a",
   "volatile.virbr0.name": "eth1"
  },
  "expanded_devices": {
   "eth0": {
    "name": "eth0",
    "nictype": "bridged",
    "parent": "lxdbr0",
    "type": "nic"
   },
   "root": {
    "path": "/",
    "pool": "default",
    "type": "disk"
   },
   "virbr0": {
    "nictype": "bridged",
    "parent": "virbr0",
    "type": "nic"
   }
  },
  "name": "maas-2-8-t2",
  "status": "Stopped",
  "status_code": 102,
  "last_used_at": "2020-06-04T17:12:29.904500739-05:00",
  "location": "none",
  "type": "container"
 }
DBUG[06-04|17:46:56] Connected to the websocket: ws://unix.socket/1.0/events
DBUG[06-04|17:46:56] Sending request to LXD method=PUT url=http://unix.socket/1.0/instances/maas-2-8-t2/state etag=
DBUG[06-04|17:46:56]
 {
  "action": "start",
  "timeout": 0,
  "force": false,
  "stateful": false
 }
DBUG[06-04|17:46:56] Got operation from LXD
DBUG[06-04|17:46:56]
 {
  "id": "8391e635-6cef-4c49-92a3-0f48099e7f66",
  "class": "task",
  "description": "Starting container",
  "created_at": "2020-06-04T17:46:56.600010764-05:00",
  "updated_at": "2020-06-04T17:46:56.600010764-05:00",
  "status": "Running",
  "status_code": 103,
  "resources": {
   "containers": [
    "/1.0/containers/maas-2-8-t2"
   ]
  },
  "metadata": null,
  "may_cancel": false,
  "err": "",
  "location": "none"
 }
DBUG[06-04|17:46:56] Sending request to LXD method=GET url=http://unix.socket/1.0/operations/8391e635-6cef-4c49-92a3-0f48099e7f66 etag=
DBUG[06-04|17:46:56] Got response struct from LXD
DBUG[06-04|17:46:56]
 {
  "id": "8391e635-6cef-4c49-92a3-0f48099e7f66",
  "class": "task",
  "description": "Starting container",
  "created_at": "2020-06-04T17:46:56.600010764-05:00",
  "updated_at": "2020-06-04T17:46:56.600010764-05:00",
  "status": "Running",
  "status_code": 103,
  "resources": {
   "containers": [
    "/1.0/containers/maas-2-8-t2"
   ]
  },
  "metadata": null,
  "may_cancel": false,
  "err": "",
  "location": "none"
 }
Error: Common start logic: saving config file for the container failed
Try `lxc info --show-log maas-2-8-t2` for more info

----

The only solution is to delete the container and create a new one.

I currently have one working production container and one production container in the failed state described above.

Bill Wear (billwear)
description: updated
description: updated
Bill Wear (billwear) wrote :

Note: I have determined that the container restart failure is related to the disk space available to LXD. By deleting enough other containers, I can get the container back to a running state, though I should note that I don't have many containers (about 6 or 7) running at any given time. The other issues still seem to occur consistently, especially the need to run "maas status" to get MAAS to reconnect when the connection drops.
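
Since the restart failure appears to be tied to disk space, checking free space before starting a container might catch the condition early. The following is a minimal Python sketch, not part of MAAS or LXD: the function names, the 10% threshold, and the "/" path are illustrative assumptions. The debug output above reports a zfs storage backend, so the path that actually matters depends on how the local storage pool is set up.

```python
import shutil


def free_fraction(path):
    """Return the fraction of free space (0.0 to 1.0) on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Avoid dividing by zero on unusual filesystems that report no size.
        return 0.0
    return usage.free / usage.total


def is_low_on_space(path, min_free=0.10):
    """Return True when less than min_free of the filesystem is free.

    min_free=0.10 is an arbitrary illustrative threshold, not a value
    recommended by LXD or MAAS.
    """
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # "/" is a placeholder (assumption); substitute the mount point that
    # backs the LXD storage pool on your host.
    path = "/"
    print(f"{path}: {free_fraction(path):.1%} free, low={is_low_on_space(path)}")
```

This only looks at the filesystem that holds the given path; it says nothing about why the space was consumed.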

Bill Wear (billwear)
Changed in maas:
status: New → Triaged
Jerzy Husakowski (jhusakowski) wrote :

More diagnostic information is needed: the standard MAAS diagnostic information, plus the host LXD logs.
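
One way to gather command output without getting stuck on commands that hang (as "maas status" and "snap restart maas" did in the description) is to run each command with a time limit. The sketch below is only an illustration: `run_with_timeout`, `collect`, and the 60-second limit are assumptions made for this example, and it is not the official MAAS diagnostic procedure.

```python
import subprocess


def run_with_timeout(cmd, timeout=60):
    """Run cmd (a list of arguments) and return (status, output).

    status is the exit code, "timeout" if the command did not finish in
    time, or "not found" if the executable is missing.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture errors together with normal output
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout", ""
    except FileNotFoundError:
        return "not found", ""
    return proc.returncode, proc.stdout


def collect(commands, timeout=60):
    """Run each command and return a list of (command line, status, output)."""
    return [(" ".join(cmd), *run_with_timeout(cmd, timeout)) for cmd in commands]
```

For example, `collect([["maas", "status"]])` could be run inside the container, and `collect([["lxc", "info", "--show-log", "maas-2-8-t2"]])` on the host; both commands are the ones quoted in this report, with the container name replaced as needed.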

Changed in maas:
status: Triaged → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired