Getting message "vip not yet configured" on all OpenStack cluster-based services

Bug #1911909 reported by kashif nawaz
This bug affects 4 people
Affects                      Status    Importance  Assigned to  Milestone
Canonical Juju               Triaged   High        Unassigned
OpenStack HA Cluster Charm   Invalid   Undecided   Unassigned

Bug Description

I am trying to deploy a charm-based OpenStack and Contrail cluster, but each time the deployment gets stuck with "vip not yet configured" on all of the OpenStack clustered services.

  glance-hacluster/0* blocked idle 172.30.204.152 Resource: res_glance_2a7ea9f_vip not running
glance/1 active idle 1/lxd/2 172.30.204.169 9292/tcp Unit is ready
  glance-hacluster/1 waiting idle 172.30.204.169 Resource: res_glance_2a7ea9f_vip not yet configured
glance/2 active idle 2/lxd/2 172.30.204.157 9292/tcp Unit is ready
  glance-hacluster/2 waiting idle 172.30.204.157 Resource: res_glance_2a7ea9f_vip not yet configured
heat/0* active idle 0/kvm/3 172.30.204.149 8000/tcp,8004/tcp Unit is ready
  contrail-openstack/3 active idle 172.30.204.149 Unit is ready
  heat-hacluster/0* blocked idle 172.30.204.149 Resource: res_heat_411abe0_vip not running
  ntp/6 active idle 172.30.204.149 123/udp chrony: Ready
heat/1 active idle 1/kvm/3 172.30.204.172 8000/tcp,8004/tcp Unit is ready
  contrail-openstack/7 active idle 172.30.204.172 Unit is ready
  heat-hacluster/2 waiting idle 172.30.204.172 Resource: res_heat_411abe0_vip not yet configured
  ntp/12 active idle 172.30.204.172 123/udp chrony: Ready
heat/2 active idle 2/kvm/3 172.30.204.164 8000/tcp,8004/tcp Unit is ready
  contrail-openstack/6 active idle 172.30.204.164 Unit is ready
  heat-hacluster/1 waiting idle 172.30.204.164 Resource: res_heat_411abe0_vip not yet configured
  ntp/11 active idle 172.30.204.164 123/udp chrony: Ready
keystone/0* active idle 0/lxd/4 172.30.204.155 5000/tcp Unit is ready
  keystone-hacluster/0* blocked idle 172.30.204.155 Resource: res_ks_1daef6e_vip not running
keystone/1 active idle 1/lxd/3 172.30.204.165 5000/tcp Unit is ready
  keystone-hacluster/1 waiting idle 172.30.204.165 Resource: res_ks_1daef6e_vip not yet configured
keystone/2 active idle 2/lxd/3 172.30.204.160 5000/tcp Unit is ready
  keystone-hacluster/2 waiting idle 172.30.204.160 Resource: res_ks_1daef6e_vip not yet configured
memcached/0* active idle 0/lxd/5 172.30.205.178 11211/tcp Unit is ready and clustered
memcached/1 active idle 1/lxd/4 172.30.205.163 11211/tcp Unit is ready and clustered
memcached/2 active idle 2/lxd/4 172.30.205.206 11211/tcp Unit is ready and clustered
mysql/0* active idle 0/lxd/6 172.30.205.197 3306/tcp Unit is ready
  mysql-hacluster/0* active idle 172.30.205.197 Unit is ready and clustered
mysql/1 active idle 1/lxd/5 172.30.205.216 3306/tcp Unit is ready
  mysql-hacluster/1 active idle 172.30.205.216 Unit is ready and clustered
mysql/2 active idle 2/lxd/5 172.30.205.169 3306/tcp Unit is ready
  mysql-hacluster/2 active idle 172.30.205.169 Unit is ready and clustered
neutron-api/0* active idle 0/kvm/4 172.30.204.153 9696/tcp Unit is ready
  contrail-openstack/2 active idle 172.30.204.153 Unit is ready
  neutron-hacluster/0* blocked idle 172.30.204.153 Resource: res_neutron_945a966_vip not running
  ntp/5 active idle 172.30.204.153 123/udp chrony: Ready
neutron-api/1 active idle 1/kvm/4 172.30.204.171 9696/tcp Unit is ready
  contrail-openstack/5 active idle 172.30.204.171 Unit is ready
  neutron-hacluster/2 waiting idle 172.30.204.171 Resource: res_neutron_945a966_vip not yet configured
  ntp/10 active idle 172.30.204.171 123/udp chrony: Ready
neutron-api/2 active idle 2/kvm/4 172.30.204.162 9696/tcp Unit is ready
  contrail-openstack/4 active idle 172.30.204.162 Unit is ready
  neutron-hacluster/1 waiting idle 172.30.204.162 Resource: res_neutron_945a966_vip not yet configured
  ntp/8 active idle 172.30.204.162 123/udp chrony: Ready
nova-cloud-controller/0* active idle 0/lxd/7 172.30.204.151 8774/tcp,8778/tcp Unit is ready
  ncc-hacluster/0* blocked idle 172.30.204.151 Resource: res_nova_0d01fbb_vip not running
nova-cloud-controller/1 active idle 1/lxd/6 172.30.204.167 8774/tcp,8778/tcp Unit is ready
  ncc-hacluster/2 waiting idle 172.30.204.167 Resource: res_nova_0d01fbb_vip not yet configured
nova-cloud-controller/2 active idle 2/lxd/6 172.30.204.158 8774/tcp,8778/tcp Unit is ready
  ncc-hacluster/1 waiting idle 172.30.204.158 Resource: res_nova_0d01fbb_vip not yet configured
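
For anyone triaging a unit stuck in this state, the pacemaker view can be checked directly on a blocked unit. A minimal triage sketch, reusing the resource and unit names from the status output above:

$ juju ssh glance-hacluster/0 'sudo crm status'             # is res_glance_2a7ea9f_vip defined and started?
$ juju ssh glance-hacluster/0 'sudo crm configure show'     # inspect the rendered resource definitions
$ juju ssh glance-hacluster/0 'sudo corosync-quorumtool -s' # check corosync membership and quorum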

Tags: cdo-qa
kashif nawaz (knawaz) wrote:

series: bionic
variables:
  # https://wiki.ubuntu.com/OpenStack/CloudArchive
  # packages for an LTS release come in a form of SRUs
  # do not use cloud:<pocket> for an LTS version as
  # installation hooks will fail. Example:
  openstack-origin: &openstack-origin distro
  #openstack-origin: &openstack-origin cloud:bionic-rocky

  openstack-region: &openstack-region RegionOne

  # !> Important <!
  # configure this value for the API services; if they spawn
  # too many workers you will get inconsistent failures
  # due to CPU overcommit
  worker-multiplier: &worker-multiplier 0.25

  # Number of MySQL connections in the env. The charm default is not
  # enough for an environment of this size, so the bundle declares
  # 2000. There's hardly a case for going higher than this.
  mysql-connections: &mysql-connections 2000

  # MySQL tuning level. Charm default is "safest", this however
  # impacts performance. For spinning platters consider setting this
  # to "fast"
  mysql-tuning-level: &mysql-tuning-level safest

  # Configure RAM allocation params for nova. For hyperconverged
  # nodes, we need plenty of reserve for service containers,
  # Ceph OSDs, and swift-storage daemons. Those processes not only
  # allocate RAM directly but also indirectly via pagecache, file
  # system caches, and system buffer usage. Adjust appropriately
  # for higher-density clouds, e.g. a high OSD/host ratio or when
  # running >2 service containers per host.
  reserved-host-memory: &reserved-host-memory 16384
  ram-allocation-ratio: &ram-allocation-ratio 0.999999 # XXX bug 1613839
  cpu-allocation-ratio: &cpu-allocation-ratio 4.0

  # This is Management network, unrelated to OpenStack and other applications
  # OAM - Operations, Administration and Maintenance
  oam-space: &oam-space oam-space

  # This is OpenStack Admin network; for adminURL endpoints
  admin-space: &admin-space oam-space

  # This is OpenStack Public network; for publicURL endpoints
  public-space: &public-space external-space

  # This is OpenStack Internal network; for internalURL endpoints
  internal-space: &internal-space oam-space

  # CEPH configuration
  # CEPH access network
  ceph-public-space: &ceph-public-space ceph-access-space

  # CEPH replication network
  ceph-cluster-space: &ceph-cluster-space ceph-replica-space
  sdn-transport: &sdn-transport sdn-transport

  # Workaround for 'only one default binding supported'
  oam-space-constr: &oam-space-constr spaces=oam-space
  ceph-access-constr: &ceph-access-constr spaces=ceph-access-space
  combi-access-constr: &combi-access-constr spaces=ceph-access-space,oam-space

  # Various VIPs
  aodh-vip: &aodh-vip "172.30.204.132 172.30.205.132"
  cinder-vip: &cinder-vip "172.30.204.133 172.30.205.133"
  dashboard-vip: &dashboard-vip "172.30.205.144"
  glance-vip: &glance-vip "172.30.204.134 172.30.205.134"
  gnocchi-vip: &gnocchi-vip "172.30.204.135 172.30.205.135"
  heat-vip: &heat-vip "172.30.204.136 172.30.205.136"
 ...
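
As a quick sanity check that the VIP anchors above reached the deployed applications, the effective charm config can be queried; a sketch using the application names from the status output in the description:

$ juju config glance vip   # expect "172.30.204.134 172.30.205.134"
$ juju config heat vip     # expect "172.30.204.136 172.30.205.136"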

kashif nawaz (knawaz) wrote:

ubuntu@jumphost:~$ maas admin subnets read
Success.
Machine-readable output follows:
[
    {
        "name": "oam",
        "description": "",
        "vlan": {
            "vid": 0,
            "mtu": 1500,
            "dhcp_on": true,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "untagged",
            "secondary_rack": null,
            "id": 5001,
            "space": "oam-space",
            "primary_rack": "4ds8d6",
            "fabric": "fabric-0",
            "fabric_id": 0,
            "resource_uri": "/MAAS/api/2.0/vlans/5001/"
        },
        "cidr": "172.30.205.128/25",
        "rdns_mode": 2,
        "gateway_ip": "172.30.205.129",
        "dns_servers": [],
        "allow_dns": true,
        "allow_proxy": true,
        "active_discovery": false,
        "managed": true,
        "id": 1,
        "space": "oam-space",
        "resource_uri": "/MAAS/api/2.0/subnets/1/"
    },
    {
        "name": "external",
        "description": "",
        "vlan": {
            "vid": 0,
            "mtu": 1500,
            "dhcp_on": false,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "untagged",
            "secondary_rack": null,
            "id": 5003,
            "space": "external-space",
            "primary_rack": null,
            "fabric": "fabric2",
            "fabric_id": 2,
            "resource_uri": "/MAAS/api/2.0/vlans/5003/"
        },
        "cidr": "172.30.204.128/26",
        "rdns_mode": 2,
        "gateway_ip": null,
        "dns_servers": [],
        "allow_dns": true,
        "allow_proxy": true,
        "active_discovery": false,
        "managed": true,
        "id": 3,
        "space": "external-space",
        "resource_uri": "/MAAS/api/2.0/subnets/3/"
    },
    {
        "name": "overlay",
        "description": "",
        "vlan": {
            "vid": 300,
            "mtu": 9000,
            "dhcp_on": false,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "overlay",
            "secondary_rack": null,
            "id": 5005,
            "space": "sdn-transport",
            "primary_rack": null,
            "fabric": "fabric3",
            "fabric_id": 3,
            "resource_uri": "/MAAS/api/2.0/vlans/5005/"
        },
        "cidr": "192.168.254.0/24",
        "rdns_mode": 2,
        "gateway_ip": null,
        "dns_servers": [],
        "allow_dns": true,
        "allow_proxy": true,
        "active_discovery": false,
        "managed": true,
        "id": 4,
        "space": "sdn-transport",
        "resource_uri": "/MAAS/api/2.0/subnets/4/"
    },
    {
        "name": "appformix",
        "description": "",
        "vlan": {
            "vid": 0,
            "mtu": 1500,
            "dhcp_on": false,
            "external_dhcp": null,
            "relay_vlan": null,
            "name": "untagged",
            "secondary_rack": null,
            "id": 5002,
            "space": "appformix-space",
            "primary_rack": null,
            "fabric": "fabric-1",
            "fabric_id": 1,
            "resource_uri": "/MAAS/api/2.0/vlans/5002/"
        },
        "cidr": "172...


Bas de Bruijne (basdbruijne) wrote:

We have similar issues with SQA, except that just one HA-cluster is showing this error:
https://oil-jenkins.canonical.com/job/fce_build/140//console

All the occurrences can be found here:
https://solutions.qa.canonical.com/bugs/bugs/bug/1911909

Under the testrun ID, at the bottom of the page, there is a link to the `full artifacts repository` for the testrun, where crashdumps can be downloaded.

Billy Olsen (billy-olsen) wrote:

Looking through the logs for the glance units from one of the runs identified in comment #3, I have a strong suspicion that this is related to bug https://bugs.launchpad.net/charm-hacluster/+bug/1874719. A patch has been proposed and merged at https://review.opendev.org/c/openstack/charm-hacluster/+/834034 and is available in charm revision 93 in the latest/edge channel.

Can you please try using the latest/edge channel for focal+ deployments?
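
For reference, moving an already-deployed hacluster application to that channel would look roughly like this (a sketch, assuming juju 2.9+, where `juju refresh` supersedes `upgrade-charm`):

$ juju refresh glance-hacluster --channel latest/edge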

Supporting evidence below.

I see the following:

2022-04-03 18:04:03 ERROR unit.hacluster-glance/1.juju-log server.go:327 Pacemaker is down. Please manually start it. Pacemaker or Corosync are still not fully up after waiting for 12 retries. This looks like lp:1874719. Last output: node1(1): member
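
That retry-exhaustion message suggests pacemaker never came up on the unit; a quick check would be (sketch):

$ juju ssh hacluster-glance/1 'systemctl status corosync pacemaker --no-pager'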

With pacemaker showing the following errors in the syslog:

Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemakerd[33227]: notice: Caught 'Terminated' signal
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemakerd[33227]: notice: Shutting down Pacemaker
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemakerd[33227]: notice: Stopping pacemaker-controld
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Caught 'Terminated' signal
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Shutting down cluster resource manager
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-attrd[33245]: notice: Setting shutdown[juju-f975de-0-lxd-4]: (unset) -> 1649009637
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: error: Resource start-up disabled since no STONITH resources have been defined
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: error: Either configure some or disable STONITH with the stonith-enabled option
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: Delaying fencing operations until there are resources to manage
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: Scheduling shutdown of node juju-f975de-0-lxd-4
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: warning: Node node1 is unclean!
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: * Shutdown juju-f975de-0-lxd-4
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: warning: Calculated transition 2 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-2.bz2
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-schedulerd[33246]: notice: Configuration errors found during scheduler processing, please run "crm_verify -L" to identify issues
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Transition 2 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Complete
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Disconnected from the executor
Apr 3 18:13:57 juju-f975de-0-lxd-4 pacemaker-controld[33247]: notice: Disconnected from Corosync
Apr ...


Bas de Bruijne (basdbruijne) wrote:

Unfortunately, we are now also seeing this on latest/edge, e.g. in this testrun:
https://solutions.qa.canonical.com/testruns/testRun/24a134bd-d912-4bfa-9bc8-bf61dbe4fc45

Looking at the logs https://oil-jenkins.canonical.com/artifacts/24a134bd-d912-4bfa-9bc8-bf61dbe4fc45/generated/generated/openstack/juju-crashdump-openstack-2022-04-16-16.29.24.tar.gz on unit 1/lxd/9, nothing stands out regarding the hacluster. I also can't see the messages you refer to in comment #4.

Felipe Reyes (freyes) wrote:

This piece of the pacemaker log seems relevant:

Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Node juju-d527c3-1-lxd-9 state is now member
Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Defaulting to uname -n for the local corosync node name
Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Pacemaker controller successfully started and accepting connections
Apr 16 12:51:36 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: State transition S_STARTING -> S_PENDING
Apr 16 12:51:37 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: Fencer successfully connected
Apr 16 12:51:57 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: warning: Input I_DC_TIMEOUT received in state S_PENDING from crm_timer_popped
Apr 16 12:51:57 juju-d527c3-1-lxd-9 pacemaker-controld[35722]: notice: State transition S_ELECTION -> S_INTEGRATION
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: error: Resource start-up disabled since no STONITH resources have been defined
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: error: Either configure some or disable STONITH with the stonith-enabled option
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: notice: Delaying fencing operations until there are resources to manage
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: notice: Calculated transition 0, saving inputs in /var/lib/pacemaker/pengine/pe-input-2.bz2
Apr 16 12:51:58 juju-d527c3-1-lxd-9 pacemaker-schedulerd[35721]: notice: Configuration errors found during scheduler processing, please run "crm_verify -L" to identify issues
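
The scheduler itself names the next diagnostic step; following up on the affected container (1/lxd/9, per comment #5) would look like this sketch:

$ juju ssh 1/lxd/9 'sudo crm_verify -L -V'                          # list the configuration errors
$ juju ssh 1/lxd/9 'sudo crm configure show | grep stonith-enabled' # has STONITH been disabled yet?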

Billy Olsen (billy-olsen) wrote:

Looking through the crashdump, this does not appear to be related to LP#1874719.

The charm is simply reporting that the resource does not exist in the local set of resources reported by pacemaker. It would be ideal to have the contents of /var/log/pacemaker as well as /etc/corosync/corosync.conf so we can see the rendered configuration file.

Some oddities that are present, which *may* explain a faulty rendered configuration file:

hacluster-placement/1 never sees relation-joined hooks from hacluster-placement/2, whereas hacluster-placement/{0,2} both see the relation-joined hooks from all the units.

Unfortunately, we'll need more information to see what's going on here.
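
Gathering the requested files from one of the affected units would look roughly like this (a sketch; the unit name is illustrative):

$ juju ssh hacluster-placement/1 'sudo cat /etc/corosync/corosync.conf'
$ juju ssh hacluster-placement/1 'sudo tar czf /home/ubuntu/pacemaker-logs.tgz /var/log/pacemaker && sudo chown ubuntu: /home/ubuntu/pacemaker-logs.tgz'
$ juju scp hacluster-placement/1:pacemaker-logs.tgz .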

Felipe Reyes (freyes) wrote:

Yeah, I believe this is the issue:

hacluster-placement/1 didn't get to see all the peers joining the hanode relation:

$ grep hacluster-placement machine-lock.log | grep -v update-status | grep 'unit: hacluster'
2022-04-16 12:54:44 unit-hacluster-placement-1: hacluster-placement/1 uniter (run relation-joined (25; unit: hacluster-placement/0) hook), waited 23s, held 5s
2022-04-16 12:55:14 unit-hacluster-placement-1: hacluster-placement/1 uniter (run relation-changed (25; unit: hacluster-placement/0) hook), waited 24s, held 6s
2022-04-16 12:56:38 unit-placement-1: placement/1 uniter (run relation-joined (249; unit: hacluster-placement/1) hook), waited 19s, held 5s
2022-04-16 12:57:13 unit-hacluster-placement-1: hacluster-placement/1 uniter (run relation-changed (25; unit: hacluster-placement/0) hook), waited 12s, held 5s
2022-04-16 12:59:48 unit-placement-1: placement/1 uniter (run relation-changed (249; unit: hacluster-placement/1) hook), waited 9s, held 8s
2022-04-16 13:01:12 unit-nrpe-42: nrpe/42 uniter (run relation-joined (250; unit: hacluster-placement/1) hook), waited 13s, held 11s
2022-04-16 13:01:35 unit-nrpe-42: nrpe/42 uniter (run relation-changed (250; unit: hacluster-placement/1) hook), waited 12s, held 11s

Also this unit wasn't the leader:

2022-04-16 12:46:59 DEBUG juju.worker.uniter.relation statetracker.go:221 unit "hacluster-placement/1" (leader=false) entered scope for relation "hacluster-placement:hanode"

This prevents this piece of code[0] from running:

...
    if configure_corosync():
        try_pcmk_wait()
        if is_leader():
            run_initial_setup() #<---!!
...

The function run_initial_setup() is the one in charge of disabling STONITH[1], and due to the peer-relation gap described above this unit was being configured as a single-node cluster:

$ journalctl --file ../journal/660a5f12c3b64778a026dc895e3d6c09/system.journal | grep 'adding new UDPU member'
abr 16 08:46:15 juju-d527c3-1-lxd-9 corosync[27385]: [TOTEM ] adding new UDPU member {10.246.165.57}
abr 16 08:51:34 juju-d527c3-1-lxd-9 corosync[35698]: [TOTEM ] adding new UDPU member {10.246.165.57}

^ that's the journal file for machine 1/lxd/9; in comparison, this is how the same grep looks for nova-cloud-controller/0:

$ journalctl --file journal/c5b4fcd7022748cda9f41e1f6f6df7b9/system.journal | grep 'adding new UDPU member'
abr 16 08:46:23 juju-d527c3-0-lxd-7 corosync[30678]: [TOTEM ] adding new UDPU member {10.246.164.247}
abr 16 08:51:50 juju-d527c3-0-lxd-7 corosync[39528]: [TOTEM ] adding new UDPU member {10.246.164.247}
abr 16 08:56:48 juju-d527c3-0-lxd-7 corosync[48186]: [TOTEM ] adding new UDPU member {10.246.167.82}
abr 16 08:56:48 juju-d527c3-0-lxd-7 corosync[48186]: [TOTEM ] adding new UDPU member {10.246.165.194}
abr 16 08:56:48 juju-d527c3-0-lxd-7 corosync[48186]: [TOTEM ] adding new UDPU member {10.246.164.247}

So I believe this is not a charm issue but a juju issue. Not necessarily a bug; it could be related to the hooks still being processed in the queue.
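
On a live deployment, that theory can be checked by comparing the relation membership as each peer sees it; a sketch (relation id 25 is taken from the machine-lock.log above):

$ juju run --application hacluster-placement 'relation-ids hanode'
$ juju run --application hacluster-placement 'relation-list -r hanode:25'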

[0] https://github.com/openstack/charm-hacluster/blob/master/hooks/hooks.py#L218-L221
[1] https://github.com/openstack/charm-hacluster/blob...


Billy Olsen (billy-olsen) wrote:

Having reviewed Felipe's latest analysis, combined with the previous look, I agree that this appears to be a juju issue. There could be additional hooks left in the queue, but the juju devs should take a look to validate or invalidate that.

Changed in charm-hacluster:
status: New → Invalid
Rodrigo Barbieri (rodrigo-barbieri2010) wrote:

I have done significant work refactoring hacluster to overcome that issue. There are several specific conditions that result in "vip not yet configured". Maybe the issue can be addressed by the following patches:

https://review.opendev.org/c/openstack/charm-hacluster/+/818996
https://review.opendev.org/c/openstack/charm-hacluster/+/815755

The two patches are ideally used together; neither is a hard dependency of the other, but they complement each other.

Liam Young (gnuoy) wrote:

In the deployment I looked at, two VIPs were requested:

juju config aodh vip
10.246.172.210 10.246.168.210

aodh translated this into a request for two different resources (from the ha relation data):

"res_aodh_8486566_vip": " params ip=\"10.246.168.210\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op monitor timeout=\"20s\"
  interval=\"10s\" depth=\"0\"",
"res_aodh_eth0_vip": " params ip=\"10.246.172.210\" nic=\"eth0\" cidr_netmask=\"24\" meta migration-threshold=\"INFINITY\" failure-timeout=\"5s\" op
  monitor timeout=\"20s\" interval=\"10s\" depth=\"0\"",
...

The hacluster charm then creates the new resources *but* with a new name for res_aodh_eth0_vip:

$ sudo crm status
Cluster Summary:
  * Stack: corosync
  * Current DC: juju-e310e9-1-lxd-0 (version 2.1.2-ada5c3b36e2) - partition with quorum
  * Last updated: Fri Sep 2 14:57:12 2022
  * Last change: Thu Sep 1 02:22:01 2022 by hacluster via crmd on juju-e310e9-1-lxd-0
  * 3 nodes configured
  * 5 resource instances configured

Node List:
  * Online: [ juju-e310e9-0-lxd-0 juju-e310e9-1-lxd-0 juju-e310e9-2-lxd-0 ]

Full List of Resources:
  * Resource Group: grp_aodh_vips:
    * res_aodh_272179f_vip (ocf:heartbeat:IPaddr2): Started juju-e310e9-1-lxd-0
    * res_aodh_8486566_vip (ocf:heartbeat:IPaddr2): Started juju-e310e9-1-lxd-0
  * Clone Set: cl_res_aodh_haproxy [res_aodh_haproxy]:
    * Started: [ juju-e310e9-0-lxd-0 juju-e310e9-1-lxd-0 juju-e310e9-2-lxd-0 ]

So both vips are configured and working but the resource name has changed.

$ ping -c1 10.246.172.210
PING 10.246.172.210 (10.246.172.210) 56(84) bytes of data.
64 bytes from 10.246.172.210: icmp_seq=1 ttl=62 time=1.04 ms

--- 10.246.172.210 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.041/1.041/1.041/0.000 ms

$ ping -c1 10.246.168.210
PING 10.246.168.210 (10.246.168.210) 56(84) bytes of data.
64 bytes from 10.246.168.210: icmp_seq=1 ttl=63 time=0.613 ms

--- 10.246.168.210 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.613/0.613/0.613/0.000 ms

I think the problem is that the charm sets vip_iface to eth0 by default, and that causes this renaming. TBH I thought the vip_iface option had been removed long ago.
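
Checking that theory is straightforward; a sketch (vip_iface and the related vip_cidr are the legacy options in question):

$ juju config aodh vip_iface   # if this reports eth0, the res_aodh_eth0_vip request above is explained
$ juju config aodh vip_cidr    # likewise would explain cidr_netmask="24" in the resource params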

Liam Young (gnuoy) wrote:

On reflection I think comments 7 and 8 are correct and it's a juju bug. Looking into it showed that unit hacluster-aodh/1 was missing from the hanode relation with hacluster-aodh/0, but only from hacluster-aodh/0's point of view. Inspecting the relation from hacluster-aodh/1's point of view correctly shows both peer units (hacluster-aodh/0 and hacluster-aodh/2).

juju version: 2.9.33-ubuntu-amd64

$ juju run --application hacluster-aodh "relation-ids hanode"
- Stdout: |
    hanode:8
  UnitId: hacluster-aodh/1
- Stdout: |
    hanode:8
  UnitId: hacluster-aodh/2
- Stdout: |
    hanode:8
  UnitId: hacluster-aodh/0

$ juju run --application hacluster-aodh "relation-list -r hanode:8"
- Stdout: |
    hacluster-aodh/2
  UnitId: hacluster-aodh/0
- Stdout: |
    hacluster-aodh/0
    hacluster-aodh/2
  UnitId: hacluster-aodh/1
- Stdout: |
    hacluster-aodh/0
    hacluster-aodh/1
  UnitId: hacluster-aodh/2

$ juju status aodh
Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud/default 2.9.33 unsupported 08:46:42Z

SAAS Status Store URL
grafana active foundations-maas admin/lma-maas.grafana
graylog active foundations-maas admin/lma-maas.graylog
nagios active foundations-maas admin/lma-maas.nagios
prometheus active foundations-maas admin/lma-maas.prometheus

App Version Status Scale Charm Channel Rev Exposed Message
aodh 14.0.0 active 3 aodh yoga/stable 77 no Unit is ready
aodh-mysql-router 8.0.30 active 3 mysql-router 8.0/stable 35 no Unit is ready
filebeat 6.8.23 active 3 filebeat candidate 38 no Filebeat ready.
hacluster-aodh waiting 3 hacluster edge 109 no Resource: res_aodh_272179f_vip not yet configured
logrotated active 3 logrotated candidate 7 no Unit is ready.
nrpe active 3 nrpe candidate 94 no Ready
prometheus-grok-exporter maintenance 3 prometheus-grok-exporter candidate 8 no Installing software
public-policy-routing active 3 advanced-routing candidate 11 no Unit is ready
telegraf active 3 telegraf candidate 54 no Monitoring ceph-osd/2 (source version/commit 76901fd)

Unit Workload Agent Machine Public address Ports Message
aodh/0* active idle 0/lxd/0 10.246.165.92 8042/tcp Unit is ready
  aodh-mysql-router/0* active idle 10.246.165.92 Unit is ready
  filebeat/30 active idle 10.246.165.92 Filebeat ready.
  hacluster-aodh/0* waiting idle 10.246.165.92 Resource: res_aodh_272179f_vip not yet configured
  logrotated/24 active idle 10.246.165.92 Unit is ready.
  nrpe/36 active idle 10.246.165.92 icmp,5666/tcp Ready
  prometheus-grok-exporter/31 active idle 10.246.165.92 9144/tcp Unit is ready
  public-policy-routing/13 active idle 10.246.165.92 Unit is ready
  telegraf/29 active idle 10.246.165.92 9103/tcp Monitoring aodh/0 (source version/commit 76901fd)
aodh/1 active idle 1/lxd/0 10.246.166.208 8042/tcp Unit is ready
  aodh-mysql-router/1 active idle 10.246.166.208 Unit is ready
  filebeat/31 active idle 10.246.166.208 Filebeat ready.
  hacluster-aodh/1 waiting idle 10.246.166.208 Resource: res_aodh_272179f_vip not yet configured
  logrotated/25 active idle 10.246.166.208 Unit is ready.
  nrpe/37 active idle 10.246.166.208 icmp,5666/tcp Ready
  prometheus-grok-exporter/30 active idle 10.246.166...


Ian Booth (wallyworld) wrote:

Ideally we'd get the following info:

juju dump-db
juju show-status-log -n 100 hacluster-aodh/N
juju show-status-log -n 100 aodh/N
juju status --relations --format yaml

We need the db dump to look at the applications, relations, units, relationscopes, unitstates collections. Maybe also settings.

Without the above it will be difficult to start trying to understand what's going on.

Bas de Bruijne (basdbruijne) wrote (last edit):

@Ian, the last few occurrences have almost all of this info.

Testrun https://solutions.qa.canonical.com/testruns/testRun/e896684f-db39-4f59-b6ff-2453fd45cf08 has a juju dump-db here:
https://oil-jenkins.canonical.com/artifacts/e896684f-db39-4f59-b6ff-2453fd45cf08/generated/generated/openstack/juju-dump-db-openstack-2022-09-04-18.40.11.tar.gz

And juju crashdump here:
https://oil-jenkins.canonical.com/artifacts/e896684f-db39-4f59-b6ff-2453fd45cf08/generated/generated/openstack/juju-crashdump-openstack-2022-09-04-18.40.11.tar.gz

The crashdump includes the show-status-log under the machine directory (hacluster-octavia/2 on 1/lxd/8 in this case). We do not collect `juju status --relations`, but we do have `juju_introspection/juju_engine_report-unit...`, which seems to include info about the relations.

An overview of all the collected logs and configs for this testrun can be found here:
https://oil-jenkins.canonical.com/artifacts/e896684f-db39-4f59-b6ff-2453fd45cf08/index.html

Changed in juju:
status: New → Triaged
importance: Undecided → High
tags: added: cdo-qa
Jeffrey Chang (modern911) wrote:

SQA has had ~10 occurrences recently, and hacluster-manila-ganesha contributes most of them. We have removed manila-ganesha for now.

The crashdump link for one of them, https://oil-jenkins.canonical.com/artifacts/134a886d-1270-443f-9962-b9326879f887/generated/generated/openstack/juju-crashdump-openstack-2023-07-17-08.13.53.tar.gz

hacluster-manila-ganesha blocked 3 hacluster 2.0.3/stable 113 no Resource: res_ganesha_e817d5c_vip not running
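
For completeness, the workaround mentioned amounts to removing the application pair (a sketch, assuming SQA's application names):

$ juju remove-application hacluster-manila-ganesha
$ juju remove-application manila-ganesha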
