IPv6: lock host stuck in ceph get_monitors_status check

Bug #1843082 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea
Milestone: --

Bug Description

Brief Description
-----------------
Unable to lock any host: the "system host-lock" command hung and never returned. This was observed during a patch install test on an IPv6 lab (wolfpass-03-07). Prior to this test the lab was healthy and all hosts could be locked and unlocked. Test scenario as below:
Launched 50 pods using the resource-consumer image; pods were active.
Uploaded the test patch successfully.
Applied the test patch successfully.
Created a sw-patch orchestration strategy via Horizon successfully.
Applied the sw-patch orchestration strategy successfully.
Orchestration locked standby controller-1 and swacted successfully, then failed while locking the new standby controller.
A manual lock attempt on the new standby also failed and the prompt was not returned. The same behaviour was seen when locking other hosts, and persisted after deleting the resource-consumer pods.

sudo sw-patch show 2019-09-04_00-10-00_RR_ALLNODES
Password:
2019-09-04_00-10-00_RR_ALLNODES:
    Release: 19.09
    Patch State: Partial-Apply
    RR: Y
    Summary: Patch to /etc/init.d/logmgmt
    Contents:
                    RR_ALLNODES-1.0-2.tis.x86_64.rpm
                    logmgmt-1.0-6.tis.x86_64.rpm
                    logmgmt-wheels-1.0-6.tis.x86_64.rpm

sw-patch query-hosts
Hostname      IP Address                 Patch Current  Reboot Required  Release  State
============  =========================  =============  ===============  =======  =====
compute-0     face::fb1:f2dc:4b40:2ed4   No             Yes              19.09    idle
compute-1     face::5455:e33e:5332:120c  No             Yes              19.09    idle
compute-2     face::75a5:57f:709b:1c3e   No             Yes              19.09    idle
controller-0  face::3                    No             Yes              19.09    idle
controller-1  face::4                    Yes            No               19.09    idle

$ sw-patch query
           Patch ID RR Release Patch State
=============================== == ======= =============
2019-09-04_00-10-00_RR_ALLNODES Y 19.09 Partial-Apply

Sysinv logs from around the same time (relevance unclear):
2019-09-06 15:51:56.996 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.036 110951 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.141 110951 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.181 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.289 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.292 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.389 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.397 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:54:31.492 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:55:13.765 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:55:28.128 282169 INFO sysinv.cmd.dnsmasq_lease_update [-] Called 'old' for mac '3c:fd:fe:af:fc:ec' with ip 'face::fb1:f2dc:4b40:2ed4'
2019-09-06 15:56:03.015 286216 INFO sysinv.cmd.dnsmasq_lease_update [-] Called 'old' for mac '3c:fd:fe:af:fc:e8' with ip 'face::75a5:57f:709b:1c3e'
2019-09-06 15:56:14.259 287377 INFO sysinv.cmd.dnsmasq_lease_update [-] Called 'old' for mac '3c:fd:fe:af:fb:84' with ip 'face::5455:e33e:5332:120c'
2019-09-06 15:56:15.769 110953 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:56:15.886 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:57:24.770 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:59:31.052 110951 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 15:59:32.760 110951 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:00:33.771 110951 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:00:33.776 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:00:33.812 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:00:33.820 110960 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:01:40.769 110951 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:04:31.220 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []
2019-09-06 16:04:31.325 110957 INFO sysinv.api.controllers.v1.host [-] Provisioned storage node(s) []

Severity
--------
Major
Steps to Reproduce
------------------
1. Create 50 pods using resource-consumer:
kubectl run resource-consumer --image=gcr.io/kubernetes-e2e-test-images/resource-consumer:1.4 --expose --service-overrides='{ "spec": { "type": "LoadBalancer" } }' --port 8080 --requests='cpu=500m,memory=256Mi'
kubectl get services resource-consumer
kubectl scale deploy/resource-consumer --replicas 50

2. Upload the test patch
3. Apply the test patch
4. Create a patch strategy using Horizon orchestration
5. Apply the patch strategy and monitor for failures. The standby host lock and swact were successful; the lock of the new standby host failed.
6. Tried host-lock manually; it failed as described above.

Expected Behavior
------------------
host-lock should lock the host and return the prompt. If there is an error, it should report the error and return the prompt.
Actual Behavior
----------------
As per the description, host-lock fails and the command does not return.

Reproducibility
---------------
Once the lab was in this state, host-lock failed every time. The test scenario above was attempted 3 times and the issue was seen once.
System Configuration
--------------------
Regular system, IPv6

Load
----
Build date 2019-09-05 00:13:38

Last Pass
---------
Not available
Timestamp/Logs
--------------
Manual lock
2019-09-06T16:01:52.000 controller-1 -sh: info HISTORY: PID=323823 UID=42425 date;system host-lock controller-0
Test Activity
-------------
Regression test

Revision history for this message
Ghada Khalil (gkhalil) wrote : Re: IPv6: lock host stuck during patch orchestration

Marking as stx.3.0 - maybe an issue specific to IPv6 (scenario not tested previously).

summary: - IPV6 lab unable lock host and prompt was not returning
+ IPv6: lock host stuck during patch orchestration
tags: added: stx.3.0 stx.config
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
importance: High → Medium
Revision history for this message
Ghada Khalil (gkhalil) wrote :

John Kung is investigating...

Changed in starlingx:
assignee: nobody → John Kung (john-kung)
description: updated
description: updated
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Above issue was reproduced without patch orchestration.

Revision history for this message
John Kung (john-kung) wrote :

The following host-lock attempt (pid 106416) did not progress past the Ceph get_monitors_status() check:

2019-09-06 20:38:50.965 106416 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-09-06-20-38-50 patch
2019-09-06 20:38:50.965 106416 INFO sysinv.api.controllers.v1.host [-] controller-0 1. delta_handle ['action']
2019-09-06 20:38:50.965 106416 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost check_lock
2019-09-06 20:38:50.966 106416 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost check_lock_controller
2019-09-06 20:38:51.183 106416 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0', u'controller-1', u'compute-0'] << This is the last log for this pid, inside get_monitors_status().

The pid appears stuck inside get_monitors_status(). The next expected log for pid 106416, emitted after it calls self._osd_quorum_names() ("Active ceph monitors in ceph cluster"), never appears.
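
For orientation, a minimal sketch of the code path the logs point at. Only get_monitors_status(), _osd_quorum_names() and the two log messages come from the logs and analysis above; the class name, helper, parameter and return shape are assumptions for illustration, not the actual sysinv source:

import logging

LOG = logging.getLogger(__name__)

class CephApiOperator(object):                    # class name assumed
    def get_monitors_status(self, db_api):        # signature assumed
        # Step 1: read the configured monitors from the sysinv inventory.
        # This step completed: it produced the last log seen for pid 106416
        # ("Active ceph monitors in inventory = ...").
        expected_mons = self._get_expected_monitors(db_api)   # hypothetical helper
        LOG.info("Active ceph monitors in inventory = %s", expected_mons)

        # Step 2: ask the Ceph cluster which monitors are actually in quorum.
        # This goes through the ceph-mgr restful API; if that request never
        # returns, execution blocks here and the follow-up log below is never
        # emitted -- matching the observed hang.
        quorum_mons = self._osd_quorum_names()     # <-- host-lock stuck here

        LOG.info("Active ceph monitors in ceph cluster = %s", quorum_mons)
        return expected_mons, quorum_mons          # return shape assumed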

Meanwhile, ceph status on controller-1 at the time of the host-lock attempt shows:
[sysadmin@controller-1 sysinv(keystone_admin)]$ ceph -s
  cluster:
    id: 10100cb2-2e80-4dd5-a759-68de2ac873fc
    health: HEALTH_WARN
            Reduced data availability: 32 pgs stale

  services:
    mon: 3 daemons, quorum controller-0,controller-1,compute-0
    mgr: controller-0(active), standbys: controller-1
    osd: 2 osds: 2 up, 2 in

  data:
    pools: 1 pools, 64 pgs
    objects: 0 objects, 0 B
    usage: 217 MiB used, 892 GiB / 892 GiB avail
    pgs: 32 active+clean
             32 stale+active+clean

Investigation is required into the Ceph API (and its reachability) to determine whether it could be blocking or failing.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As discussed with Frank Miller, assigning to Daniel Badea to investigate

Changed in starlingx:
assignee: John Kung (john-kung) → Daniel Badea (daniel.badea)
summary: - IPv6: lock host stuck during patch orchestration
+ IPv6: lock host stuck in ceph get_monitors_status check
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Daniel Badea (daniel.badea) wrote :

Logs archive is corrupted. Can't extract controller-1 logs.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

The issue is caused by ceph-mgr binding to the floating address face::5. After a swact the floating address moves to the new active controller, but because ceph-mgr on the new active controller does not switch its role to active (it is still in standby), there is no service listening on face::5 and "ceph restful list-keys" hangs forever.
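
As a hedged illustration of the hang (not the sysinv client code): an HTTP request issued with no client-side timeout can block indefinitely when the remote side never responds, whereas an explicit timeout turns the hang into a fast, visible failure. The URL and port below are made up for the example:

import requests

# Hypothetical endpoint for illustration only; not the actual ceph-mgr
# restful URL or port used by StarlingX.
MGR_RESTFUL_URL = "https://[face::5]:5001/request"

def list_keys_hanging():
    # With no timeout, requests waits indefinitely if the remote side never
    # responds -- the caller (here, the host-lock code path) appears hung.
    return requests.get(MGR_RESTFUL_URL, verify=False)

def list_keys_bounded():
    # An explicit timeout surfaces requests.exceptions.Timeout after 10
    # seconds instead of silently blocking the caller.
    return requests.get(MGR_RESTFUL_URL, verify=False, timeout=10)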

Fix: explicitly configure ceph-mgr to bind to the controller-specific management address.
For example:

# /etc/ceph/ceph.conf on controller-0
[mgr.controller-0]
public_addr = face::3

# /etc/ceph/ceph.conf on controller-1
[mgr.controller-1]
public_addr = face::4
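
A quick, hedged way to sanity-check the resulting configuration on a controller; the file path and section/key names follow the example above, and the helper itself is only an illustration, not part of the fix:

import configparser
import socket

# Floating management address from this bug report; assumed constant for
# the illustration.
FLOATING_MGMT_ADDR = "face::5"

def check_mgr_public_addr(conf_path="/etc/ceph/ceph.conf"):
    # Read ceph.conf and look for the per-host [mgr.<hostname>] section
    # shown in the fix above.
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)
    section = "mgr.%s" % socket.gethostname()      # e.g. "mgr.controller-0"
    addr = cfg.get(section, "public_addr", fallback=None)
    if addr is None:
        return "no explicit public_addr in [%s]; ceph-mgr may pick the floating IP" % section
    if addr == FLOATING_MGMT_ADDR:
        return "public_addr is the floating address %s (bug condition)" % addr
    return "ceph-mgr pinned to host address %s" % addr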

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681507

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/681507
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=ceb1b1566992b0161e59b86c3d46d0ac9a51c783
Submitter: Zuul
Branch: master

commit ceb1b1566992b0161e59b86c3d46d0ac9a51c783
Author: Daniel Badea <email address hidden>
Date: Wed Sep 11 14:30:27 2019 +0000

    ceph: explicitly bind ceph-mgr to controller address

    ceph-mgr has no corresponding sections in ceph.conf
    When the service starts it binds and advertise a
    local address.

    In the case of #1843082 ceph-mgr on controller-0 binds
    to the floating management IP address [face::5]. When
    controller swacts ceph-client API requests are sent
    to [face::5] but there is no service listening on that
    endpoint and host-lock command hangs.

    Fix: use ceph.conf to explicitly bind ceph-mgr to a
    host-specific controller address.

    Change-Id: I811bfa7410770eb91eb4519cfab3e6febf381e6d
    Closes-bug: 1843082
    Signed-off-by: Daniel Badea <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681531

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/681531
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=edc7f8495db315d654a5844c525024adffc0439d
Submitter: Zuul
Branch: master

commit edc7f8495db315d654a5844c525024adffc0439d
Author: Daniel Badea <email address hidden>
Date: Wed Sep 11 16:24:41 2019 +0000

    ceph: mgr-restful-plugin set ceph-mgr config file path

    Explicitly set ceph-mgr configuration file path to
    /etc/ceph/ceph.conf to avoid surprises. ceph-mon
    and ceph-osd are also started with '-c' (--conf)
    pointing to /etc/ceph/ceph.conf.

    Change-Id: I4915952f17b4d96a8fce3b4b96335693f9b6c76b
    Closes-bug: 1843082
    Signed-off-by: Daniel Badea<email address hidden>

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified in build 2019-10-29 20:02:14.

tags: removed: stx.retestneeded