Controller-0 did not become active after installation unlock

Bug #1871638 reported by Difu Hu
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Paul-Ionut Vaduva

Bug Description

Brief Description
-----------------
During a fresh installation, controller-0 did not become active after it was unlocked.

Severity
--------
Major

Steps to Reproduce
------------------
DC installation
ssh to floating ip
source /etc/platform/openrc

Expected Behavior
------------------
controller-0 becomes the active controller

Actual Behavior
----------------
controller-0 did not become the active controller

Reproducibility
---------------
Not sure; suspected to be intermittent, because installation and configuration passed on a different DC system with a similar load.

System Configuration
--------------------
DC system
Lab-name: wp_18_21 (DC-3)

Branch/Pull Time/Commit
-----------------------
2020-04-07_04-10-00

Last Pass
---------
Same system: 2020-04-05_04-10-00
DC-1: 2020-04-07_00-10-00

Timestamp/Logs
--------------
controller-0:~$ source /etc/platform/openrc
Openstack Admin credentials can only be loaded from the active controller.

controller-0:~$ sudo sm-dump
Password:
/var/run/sm/sm.db not available.

Test Activity
-------------
installation

Yang Liu (yliu12)
description: updated
Yang Liu (yliu12)
description: updated
Revision history for this message
Difu Hu (difuhu) wrote :
Revision history for this message
Bart Wensley (bartwensley) wrote :

After the controller is first unlocked, the controller manifest fails to apply. The failure happens because the docker daemon cannot be started:

2020-04-08T06:55:57.588 Debug: 2020-04-08 06:55:57 +0000 Exec[perform systemctl daemon reload for docker proxy](provider=posix): Executing 'systemctl daemon-reload'
2020-04-08T06:55:57.590 Debug: 2020-04-08 06:55:57 +0000 Executing: 'systemctl daemon-reload'
2020-04-08T06:55:57.592 Notice: 2020-04-08 06:55:57 +0000 /Stage[main]/Platform::Docker::Config/Exec[perform systemctl daemon reload for docker proxy]: Triggered 'refresh' from 1 events
2020-04-08T06:55:57.594 Info: 2020-04-08 06:55:57 +0000 /Stage[main]/Platform::Docker::Config/Exec[perform systemctl daemon reload for docker proxy]: Scheduling refresh of Service[docker]
2020-04-08T06:55:57.596 Debug: 2020-04-08 06:55:57 +0000 /Stage[main]/Platform::Docker::Config/Exec[perform systemctl daemon reload for docker proxy]: The container Class[Platform::Docker::Config] will propagate my refresh event
2020-04-08T06:55:57.598 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl is-active docker'
2020-04-08T06:55:57.600 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl is-enabled docker'
2020-04-08T06:55:57.602 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl unmask docker'
2020-04-08T06:55:57.643 Debug: 2020-04-08 06:55:57 +0000 Executing: '/usr/bin/systemctl start docker'
2020-04-08T06:55:57.816 Debug: 2020-04-08 06:55:57 +0000 Runing journalctl command to get logs for systemd start failure: journalctl -n 50 --since '5 minutes ago' -u docker --no-pager
2020-04-08T06:55:57.819 Debug: 2020-04-08 06:55:57 +0000 Executing: 'journalctl -n 50 --since '5 minutes ago' -u docker --no-pager'
2020-04-08T06:55:57.828 Error: 2020-04-08 06:55:57 +0000 Systemd start for docker failed!

daemon.log shows dockerd and containerd being restarted, but dockerd then gets stuck, emitting the following messages thousands of times:

2020-04-08T06:55:57.472 controller-0 dockerd[100339]: info time="2020-04-08T06:55:57.472875804Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" module=libcontainerd namespace=moby
2020-04-08T06:55:57.472 controller-0 dockerd[100339]: info time="2020-04-08T06:55:57.472870255Z" level=error msg="failed to get event" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\"" module=libcontainerd namespace=plugins.moby
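A hedged diagnostic sketch (not part of the original comment): since dockerd is repeatedly retrying the containerd socket, the commands below would confirm that containerd itself is down; the socket path is taken from the dockerd errors above.

# Check whether containerd is running and inspect its recent logs.
sudo systemctl status containerd --no-pager
ls -l /run/containerd/containerd.sock
sudo journalctl -u containerd -n 50 --no-pager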

The containerd/dockerd startup was modified recently:
https://review.opendev.org/#/c/716911
https://review.opendev.org/#/c/715593
https://review.opendev.org/#/c/717044

Paul or Bob should take a look at this LP. Note that the puppet exec that is failing is ...


Revision history for this message
Ghada Khalil (gkhalil) wrote :

stx.4.0 / high priority - as per above analysis, this looks like it may have been introduced by recent code changes

Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0 stx.containers
Changed in starlingx:
assignee: nobody → Frank Miller (sensfan22)
Revision history for this message
Yang Liu (yliu12) wrote :

A similar issue was seen on a duplex subcloud on DC-1 with the 2020-04-07_00-10-00 load. New logs added to:
https://files.starlingx.kube.cengn.ca/launchpad/1871638

Revision history for this message
Frank Miller (sensfan22) wrote :

The dockerd failure occurs when containerd dies during dockerd startup. It looks like containerd cannot bind to an address: [2620:10a:a001:a103::1065]:0

2020-04-08T06:55:57.789 controller-0 dockerd[101117]: info time="2020-04-08T06:55:57.789513836Z" level=info msg="Loading containers: start."
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801433786Z" level=info msg="Start event monitor"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801482117Z" level=info msg="Start snapshots syncer"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801493162Z" level=info msg="Start streaming server"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801571334Z" level=error msg="Failed to start streaming server" error="listen tcp [2620:10a:a001:a103::1065]:0: bind: cannot assign requested address"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801635792Z" level=info msg="Stop CRI service"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801687544Z" level=info msg="Event monitor stopped"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801723120Z" level=info msg="Stream server stopped"
2020-04-08T06:55:57.801 controller-0 containerd[101102]: info time="2020-04-08T06:55:57.801742976Z" level=fatal msg="Failed to run CRI service" error="stream server error: listen tcp [2620:10a:a001:a103::1065]:0: bind: cannot assign requested address"
2020-04-08T06:55:57.806 controller-0 dockerd[101117]: info time="2020-04-08T06:55:57.806166493Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42096e1e0, TRANSIENT_FAILURE" module=grpc
2020-04-08T06:55:57.806 controller-0 dockerd[101117]: info time="2020-04-08T06:55:57.806192558Z" level=info msg="pickfirstBalancer: HandleSubConnStateChange: 0xc42096e1e0, CONNECTING" module=grpc
2020-04-08T06:55:57.806 controller-0 systemd[1]: notice containerd.service: main process exited, code=exited, status=1/FAILURE

The next step is to determine why containerd cannot bind to that address.
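A hedged diagnostic sketch (not part of the original comment): an IPv6 address that is still in Duplicate Address Detection "tentative" state cannot be bound, which would produce exactly this "cannot assign requested address" error. Checking the flag on the OAM address would confirm it:

# If the address is still flagged "tentative", DAD has not finished and
# bind() will fail until the flag clears.
ip -6 addr show | grep -F '2620:10a:a001:a103::1065'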

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Frank Miller (sensfan22) → Paul-Ionut Vaduva (pvaduva)
Revision history for this message
Bin Qian (bqian20) wrote :

The issue was reproduced in a recent test, with identical error messages in puppet.log and daemon.log.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/720205

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/720205
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=9a18b7086035062bd326a279aea47c23c3c3f96e
Submitter: Zuul
Branch: master

commit 9a18b7086035062bd326a279aea47c23c3c3f96e
Author: Paul Vaduva <email address hidden>
Date: Wed Apr 15 09:56:42 2020 -0400

    Introduce a wait until network interfaces are ready

    The DAD (Duplicate Address Detection) mechanism keeps
    ipv6 network interface in tentative state until it finishes.
    During this time no binding to this interface address is
    possible and networking dependent services fail to start

    Change-Id: I9cfa604a0d75400f6d3c7172b3b973b0d50c3578
    Closes-bug: 1871638
    Signed-off-by: Paul Vaduva <email address hidden>
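A minimal sketch of the kind of wait the commit describes, assuming a generic interface name; the actual change is in the review linked above:

# Illustrative only, not the stx-puppet code. IFACE and the 30-second timeout
# are assumptions; poll until no address on the interface is still "tentative".
IFACE=oam0
for i in $(seq 1 30); do
    ip -6 addr show dev "$IFACE" | grep -q tentative || break
    sleep 1
done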

Changed in starlingx:
status: In Progress → Fix Released
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Yang Liu (yliu12) wrote :

This issue was not seen in the 2 most recent DC installations with the 04-19 load.

tags: removed: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/729825

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (f/centos8)

Reviewed: https://review.opendev.org/729825
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=d4617fbad74a05f2af81ee85a47565083991e6f8
Submitter: Zuul
Branch: f/centos8

commit 4134023ab84d8a635b118d5e3ff26ade3bbe535b
Author: Sharath Kumar K <email address hidden>
Date: Thu May 7 10:08:11 2020 +0200

    Tox and Zuul job for the bandit code scan in stx/stx-puppet

    Setting up the bandit tool for the scanning of HIGH severity issues
    in the python codes under Starlingx/stx-puppet folder.
    Expecting this merge will enable zuul job for CI/CD of bandit scan.

    Configuration files:
    1. tox.ini for adding bandit environment and command.
    2. test-requirements.txt for adding bandit version.
    3. .zuul.yaml file for adding bandit job and configuring under
       check job to run code scan every time before code commit.

    Test:
    Run tox -e bandit command inside the fault folder to validate the
    bandit scan and result.

    Story: 2007541
    Task: 39687
    Depends-On: https://review.opendev.org/#/c/721294/

    Change-Id: I2982268db2b5e75feeb287bc95420fedc9b0d816
    Signed-off-by: Sharath Kumar K <email address hidden>

commit 65daac29e4635f32a57e80cd18f96fd59dc8ebe0
Author: Bin Qian <email address hidden>
Date: Tue May 12 22:39:21 2020 -0400

    DC cert manifest should only apply to controller nodes

    DC cert manifest should only apply to controller nodes on system
    controller.
    This fix is for DC with worker nodes in central cloud.

    Change-Id: I4233509a6f0afb3013c01e81dea6f655d9e15371
    Closes-Bug: 1878260
    Signed-off-by: Bin Qian <email address hidden>

commit 04a3cb8cbad9b1700286c5de67aa5d974cf54400
Author: Elena Taivan <email address hidden>
Date: Wed Apr 29 08:44:13 2020 +0000

    Changing permissions for conversion folder

    Adding writing permissions to '/opt/conversion' mountpoint
    so openstack image conversion can happen there.

    Change-Id: Id1a91db6570dcbed3b8068e79e72f5bb800f24ad
    Partial-bug: 1819688
    Signed-off-by: Elena Taivan <email address hidden>

commit 4e9153cf234e714e4bbc9a9eb3d9b55b2828145a
Author: Tao Liu <email address hidden>
Date: Mon May 4 14:30:30 2020 -0500

    Move subcloud audit to separate process

    Subcloud audit is being removed from the dcmanager-manager
    process and it is running in dcmanager-audit process.

    This update adds associated puppet config.

    Story: 2007267
    Task: 39640
    Depends-On: https://review.opendev.org/#/c/725627/

    Change-Id: Idd2e675126a01d6113597646ddd9eb4a0bc5be44
    Signed-off-by: Tao Liu <email address hidden>

commit b793518f65ae932f3974ff85b797f505b5ef1c2a
Author: Robert Church <email address hidden>
Date: Wed Apr 29 12:49:04 2020 -0400

    Ensure containerd binds to the loopback interface

    Set the stream_server_address to bind to the loopback interface with a
    value of "127.0.0.1" for IPv4 and "::1" for IPv6.

    Without setting the stream_server_address in config.toml, containerd was
    binding to the OAM interface. Under most situations this resulted in
    containe...
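A hedged illustration of the setting this commit describes (the exact TOML section name depends on the containerd version, so treat it as a sketch):

# Verify the configured stream server address after the change; per the commit
# above it should be "::1" on an IPv6 system and "127.0.0.1" on IPv4.
sudo grep -n 'stream_server_address' /etc/containerd/config.toml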

tags: added: in-f-centos8