AIO-DX Application apply aborted Unexpected process termination while application-apply was in progress

Bug #1838101 reported by Anujeyan Manokeran
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Bob Church

Bug Description

Brief Description
-----------------
       Application apply was failed when standby controller was locked and unlocked on AIO-DX. Failure as below in sysinv.log. Prior to this system had issues CPU https://bugs.launchpad.net/starlingx/+bug/1837426

2019-07-26T18:32:33.000 controller-1 -sh: info HISTORY: PID=2055436 UID=42425 system host-unlock 1
system application-show stx-openstack
+---------------+-------------------------------------------------------------------------+
| Property | Value |
+---------------+-------------------------------------------------------------------------+
| active | True |
| app_version | 1.0-17-centos-stable-versioned |
| created_at | 2019-07-26T15:28:37.144643+00:00 |
| manifest_file | stx-openstack.yaml |
| manifest_name | armada-manifest |
| name | stx-openstack |
| progress | Unexpected process termination while application-apply was in progress. |
| | The application status has changed from 'applying' to 'apply-failed'. |
| status | apply-failed |
| updated_at | 2019-07-26T18:47:55.634276+00:00 |
+---------------+-------------------------------------------------------------------------+

7-26 18:46:53.194 20389 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 21 seconds.
2019-07-26 18:47:14.210 20389 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2019-07-26 18:47:14.216 20389 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2019-07-26 18:47:14.222 20389 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 23 seconds.
2019-07-26 18:47:14.222 20389 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 23 seconds.
2019-07-26 18:47:37.240 20389 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2019-07-26 18:47:37.246 20389 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.204.2:5672
2019-07-26 18:47:37.252 20389 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 25 seconds.
2019-07-26 18:47:37.252 20389 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on 192.168.204.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 25 seconds.
2019-07-26 18:47:48.391 186024 INFO sysinv.cmd.dnsmasq_lease_update [-] Called 'old' for mac '24:8a:07:58:d0:d0' with ip '192.168.204.4'
2019-07-26 18:47:52.189 186219 WARNING ceph_client [-] skip checking server certificate
2019-07-26 18:47:52.509 187128 INFO sysinv.cmd.dnsmasq_lease_update [-] Called 'old' for mac '24:8a:07:58:d0:d0' with ip '192.168.204.4'
2019-07-26 18:47:54.602 185734 WARNING ceph_client [-] skip checking server certificate
2019-07-26 18:47:55.053 186219 WARNING ceph_client [-] skip checking server certificate
2019-07-26 18:47:55.123 186219 WARNING ceph_client [-] skip checking server certificate
2019-07-26 18:47:55.128 186219 WARNING ceph_client [-] skip checking server certificate
2019-07-26 18:47:55.141 186219 WARNING oslo_config.cfg [-] Deprecated: Option "auth_uri" from group "keystone_authtoken" is deprecated for removal (The auth_uri option is deprecated in favor of www_authenticate_uri and will be removed in the S release.). Its value may be silently ignored in the future.
2019-07-26 18:47:55.146 186219 WARNING oslo_config.cfg [-] Deprecated: Option "auth_uri" from group "keystone_authtoken" is deprecated. Use option "www_authenticate_uri" from group "keystone_authtoken".
2019-07-26 18:47:55.148 186219 INFO keystonemiddleware.auth_token [-] Starting Keystone auth_token middleware
2019-07-26 18:47:55.157 186219 WARNING keystonemiddleware.auth_token [-] AuthToken middleware is set with keystone_authtoken.service_token_roles_required set to False. This is backwards compatible but deprecated behaviour. Please set this to True.
2019-07-26 18:47:55.168 186219 INFO oslo.service.wsgi [-] sysinv_api listening on 192.168.204.2:6385
2019-07-26 18:47:55.168 186219 INFO oslo_service.service [-] Starting 1 workers
2019-07-26 18:47:55.179 186219 INFO keystonemiddleware.auth_token [-] Starting Keystone auth_token middleware
2019-07-26 18:47:55.183 186219 WARNING keystonemiddleware._common.config [-] The option "user_domain_name" in conf is not known to auth_token
2019-07-26 18:47:55.184 186219 WARNING keystonemiddleware._common.config [-] The option "auth_url" in conf is not known to auth_token
2019-07-26 18:47:55.184 186219 WARNING keystonemiddleware._common.config [-] The option "project_domain_name" in conf is not known to auth_token
2019-07-26 18:47:55.184 186219 WARNING keystonemiddleware._common.config [-] The option "username" in conf is not known to auth_token
2019-07-26 18:47:55.184 186219 WARNING keystonemiddleware._common.config [-] The option "project_name" in conf is not known to auth_token
2019-07-26 18:47:55.185 186219 WARNING keystonemiddleware._common.config [-] The option "password" in conf is not known to auth_token
2019-07-26 18:47:55.187 186219 WARNING keystonemiddleware.auth_token [-] AuthToken middleware is set with keystone_authtoken.service_token_roles_required set to False. This is backwards compatible but deprecated behaviour. Please set this to True.

47:55.184 186219 WARNING keystonemiddleware._common.config [-] The option "username" in conf is not known to auth_token
2019-07-26 18:47:55.184 186219 WARNING keystonemiddleware._common.config [-] The option "project_name" in conf is not known to auth_token
2019-07-26 18:47:55.185 186219 WARNING keystonemiddleware._common.config [-] The option "password" in conf is not known to auth_token
2019-07-26 18:47:55.187 186219 WARNING keystonemiddleware.auth_token [-] AuthToken middleware is set with keystone_authtoken.service_token_roles_required set to False. This is backwards compatible but deprecated behaviour. Please set this to True.
2019-07-26 18:47:55.190 186219 INFO oslo.service.wsgi [-] sysinv_api_pxe listening on 169.254.202.2:6385
2019-07-26 18:47:55.192 186219 INFO oslo_service.service [-] Starting 1 workers
2019-07-26 18:47:55.631 185734 INFO sysinv.conductor.kube_app [-] Resetting status of app stx-openstack from 'applying' to 'apply-failed'
2019-07-26 18:47:55.654 185734 WARNING urllib3.connectionpool [-] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb81bc5f190>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock
2019-07-26 18:47:55.655 185734 WARNING urllib3.connectionpool [-] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb81bc5f2d0>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock
2019-07-26 18:47:55.655 185734 WARNING urllib3.connectionpool [-] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb81bc5f410>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',)': /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock
2019-07-26 18:47:55.657 185734 ERROR sysinv.common.kubernetes [-] Failed to delete custom object, Namespace kube-system: HTTPSConnectionPool(host='192.168.206.2', port=6443): Max retries exceeded with url: /apis/armada.process/v1/namespaces/kube-system/locks/locks.armada.process.lock (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fb81bc5f390>: Failed to establish a new connection: [Errno 111] ECONNREFUSED',))
2019-07-26 18:47:55.657 185734 WARNING ceph_client [-] skip checking server certificate
2019-07-26 18:47:57.478 18573
Severity
--------
Major
Steps to Reproduce
------------------
1. Lock and unlock standby controller
2. Once the controller is up application was failed

System Configuration
--------------------
AIO DX system

Expected Behavior
------------------
No failure or abort on application apply on stx-openstack
Actual Behavior
----------------
Failure or abort on application apply on stx-openstack
Reproducibility
---------------
System Configuration
--------------------
Regular system
Load
----

Last Pass
---------
Timestamp/Logs
--------------
2019-07-26T18:32:33.000
Test Activity
-------------
Regression test

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Please update the bug report to indicate which load.
Also the system configuration in the report indicates "Regular system" but the title indicates "AIO-DX". Please correct the bug report template.

Ghada Khalil (gkhalil)
tags: added: stx.containers
tags: added: stx.regression
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 / medium priority for now -- expected that this is an intermittent issue

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Bob Church (rchurch)
tags: added: stx.2.0
Revision history for this message
Bob Church (rchurch) wrote :
Download full text (29.0 KiB)

* Event Summary:
   - Unlock of standby controller
   - Application apply due to the unlock
   - ceph-mon, ceph-radosgw, dbmon start failing their audits.
     - Observing brief communication alarms on all interfaces
     - Ceph health check fails to respond with 10s timeout in this window
   - Uncontrolled swact occurs
     - Noticed some sm not scheduling on time messages, usually indicative of hitting platform resource limits
   - Monitor outage during the swact causes the mgr-restful-plugin to fail and restart
   - Total monitor outage from kill on active controller + swact + start on active controller is ~5m 37s
      2019-07-26 18:41:58.119 7fb018724700 -1 mon.controller@0(leader) e1 *** Got Signal Terminated ***
      2019-07-26 18:47:35.743 7fd370d83140 0 ceph version 13.2.2 (00071a2d9e839b95f9439daaccd4677c5d15eaa6) mimic (stable), process ceph-mon, pid 182505
   - Application is marked failed after abort

* Further actions:
   - Evaluate ceph health audit logic
     - This recent update might help: https://opendev.org/starlingx/integ/commit/12f604b4dd86663910fa152ea50265a2ae56d932
   - Evalute and optimize the monitor outage during swacts ~5m 30s minutes is excessive
     - SM dependencies on a swact: can we stop ceph-mon last on shutdown and first after startup?

2019-07-26 18:32:38.751 1799045 INFO sysinv.api.controllers.v1.host [-] controller-0 apply ihost_val {'ihost_action': 'unlock', 'task': u'Unlocking'}
2019-07-26T18:39:41.876 | 1 | node-scn | controller-0 | unknown-unknown | unlocked-unknown | customer action

2019-07-26T18:40:27.000 controller-1 sm: debug time[12467.330] log<15> INFO: sm_alarm[97894]: sm_alarm_thread.c(638): Raising alarm (communication-failure) for node (controller-1) domain () entity (oam).
2019-07-26T18:40:27.000 controller-1 sm: debug time[12467.335] log<16> INFO: sm_alarm[97894]: sm_alarm_thread.c(1083): Raised alarm (communication-failure) for node (controller-1) domain () entity (oam), fm_uuid=07de968b-ab45-4133-a100-03e34d12cb07.
2019-07-26T18:40:27.000 controller-1 sm: debug time[12467.335] log<17> INFO: sm_alarm[97894]: sm_alarm_thread.c(638): Raising alarm (communication-failure) for node (controller-1) domain () entity (mgmt).
2019-07-26T18:40:27.000 controller-1 sm: debug time[12467.430] log<18> INFO: sm_alarm[97894]: sm_alarm_thread.c(1083): Raised alarm (communication-failure) for node (controller-1) domain () entity (mgmt), fm_uuid=59862cfb-0054-4007-a2ae-618814a6fa27.

2019-07-26 18:40:30.108 1798378 INFO sysinv.conductor.kube_app [-] All docker images for application stx-openstack were successfully downloaded in 441 seconds
2019-07-26 18:40:30.150 1798378 INFO sysinv.conductor.kube_app [-] Starting Armada service...
2019-07-26 18:40:31.140 1798378 INFO sysinv.conductor.kube_app [-] Starting progress monitoring thread for app stx-openstack

2019-07-26T18:40:40.000 controller-1 sm: debug time[12479.845] log<21080> INFO: sm[97872]: sm_service_audit.c(175): Action (audit-enabled) timeout with result (failed), state (unknown), status (unknown), and condition (unknown) for service (ceph-mon), reason_text=, exit_co...

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
Bob Church (rchurch) wrote :

Marking this as a duplicate of bug #1837750. This apply restructuring will avoid the failure scenario by conditionally applying the application after the unlock of the controller is complete.

Additionally, there have been performance improvements for the AIO-DX which will improve sm responsiveness.

No additional change are required to address this.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Duplicate bug was fixed by:
https://review.opendev.org/677845
https://review.opendev.org/677847
Merged on 2019-08-22

Marked as Fix Released

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

verified in load 2019-10-27_20-00-00.

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.