Activity log for bug #1890330

Date Who What changed Old value New value Message
2020-08-04 18:49:29 Dmitrii Shcherbakov bug added bug
2020-08-04 18:52:29 Dmitrii Shcherbakov description

Old value:

tests/bundles/bionic-ussuri.yaml observed in https://review.opendev.org/#/c/743212/

There is an issue with the manila-share service post-startup where it suddenly gets SIGTERM after a few minutes of operation. It properly reads its config, connects to Ceph mons/CephFS, and runs periodic share status tasks while it is alive.

The behavior matches what systemd does by default when stopping services:
https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=
"Processes will first be terminated via SIGTERM (unless the signal to send is changed via KillSignal= or RestartKillSignal=)."

Based on the log entries, something stops manila-share via systemd:

18:17:31.274360 <--- journal log entry about stopping the manila-share service
18:17:31.278 <--- caught SIGTERM

Meanwhile, Pacemaker doesn't seem to do anything at that time:

grep -iP 'stop' /var/log/pacemaker/pacemaker.log | grep 18:17 ; echo $?
1

crm resource status
 res_manila_share_manila_share (systemd:manila-share): Started
 res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): Started
 Resource Group: grp_ganesha_vips
     res_ganesha_a3b980a_vip (ocf::heartbeat:IPaddr2): Started

The resources are listed as started.

More information below:

1) The NFS share resource is in the error state:

ubuntu@dmitriis-bastion:~$ openstack share list
+--------------------------------------+---------------+------+-------------+--------+-----------+-----------------+------+-------------------+
| ID                                   | Name          | Size | Share Proto | Status | Is Public | Share Type Name | Host | Availability Zone |
+--------------------------------------+---------------+------+-------------+--------+-----------+-----------------+------+-------------------+
| ec095c71-f8f2-4aa0-bdd1-ed1f47bd3cce | cephnfsshare1 | 1    | NFS         | error  | False     | cephfsnfstype   |      | None              |
+--------------------------------------+---------------+------+-------------+--------+-----------+-----------------+------+-------------------+

2) manila-share services are DOWN:

ubuntu@dmitriis-bastion:~$ manila service-list
+----+------------------+--------------------------------------------+------+---------+-------+----------------------------+
| Id | Binary           | Host                                       | Zone | Status  | State | Updated_at                 |
+----+------------------+--------------------------------------------+------+---------+-------+----------------------------+
| 1  | manila-scheduler | juju-111c97-zaza-1335ec68e4d0-13           | nova | enabled | up    | 2020-08-04T14:57:38.000000 |
| 2  | manila-data      | juju-111c97-zaza-1335ec68e4d0-13           | nova | enabled | up    | 2020-08-04T14:57:43.000000 |
| 3  | manila-share     | juju-111c97-zaza-1335ec68e4d0-5@cephfsnfs1 | nova | enabled | down  | 2020-08-04T13:36:25.000000 |
| 4  | manila-share     | juju-111c97-zaza-1335ec68e4d0-4@cephfsnfs1 | nova | enabled | down  | 2020-08-04T13:40:57.000000 |
| 5  | manila-share     | juju-111c97-zaza-1335ec68e4d0-3@cephfsnfs1 | nova | enabled | down  | 2020-08-04T13:56:55.000000 |
+----+------------------+--------------------------------------------+------+---------+-------+----------------------------+

3) /var/log/manila/manila-share.log on the manila-ganesha units contains the following:

2020-08-04 18:13:43.190 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Starting 1 workers
...
2020-08-04 18:13:43.875 214156 DEBUG manila.service [req-e89377c0-b153-462f-91db-1b1912a38223 - - - - -] Creating RPC server for service manila-share. start /usr/lib/python3/dist-packages/manila/service.py:121
2020-08-04 18:13:43.918 214156 DEBUG manila.share.manager [req-e89377c0-b153-462f-91db-1b1912a38223 - - - - -] Start initialization of driver: 'CephFSDriver@juju-111c97-zaza-1335ec68e4d0-4@cephfsnfs1' _driver_setup /usr/lib/python3/dist-packages/manila/share/manager.py:309
2020-08-04 18:13:43.919 214156 INFO manila.share.drivers.cephfs.driver [req-e89377c0-b153-462f-91db-1b1912a38223 - - - - -] [CEPHFSNFS1}] Ceph client found, connecting...
...
2020-08-04 18:17:30.935 214156 INFO manila.share.manager [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Updating share status
2020-08-04 18:17:30.941 214156 DEBUG manila.share.driver [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Updating share stats. _update_share_stats /usr/lib/python3/dist-packages/manila/share/driver.py:1271
2020-08-04 18:17:30.941 214156 DEBUG oslo_service.periodic_task [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Running periodic task ShareManager.periodic_share_status_update run_periodic_tasks /usr/lib/python3/dist-packages/oslo_service/periodic_task.py:216
2020-08-04 18:17:30.942 214156 DEBUG manila.share.manager [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Updating status of share instances. periodic_share_status_update /usr/lib/python3/dist-packages/manila/share/manager.py:4515
2020-08-04 18:17:31.278 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Caught SIGTERM, stopping children
2020-08-04 18:17:31.279 214098 DEBUG oslo_concurrency.lockutils [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Acquired lock "singleton_lock" lock /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:265
2020-08-04 18:17:31.279 214098 DEBUG oslo_concurrency.lockutils [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Releasing lock "singleton_lock" lock /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:281
2020-08-04 18:17:31.280 214098 DEBUG oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Stop services. stop /usr/lib/python3/dist-packages/oslo_service/service.py:699
2020-08-04 18:17:31.280 214098 INFO manila.coordination [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Stopped Coordinator (Agent ID: 73b71760-242a-40e4-a54f-108702fb28ef, prefix: manila-)
2020-08-04 18:17:31.280 214098 DEBUG oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Killing children. stop /usr/lib/python3/dist-packages/oslo_service/service.py:704
2020-08-04 18:17:31.286 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Waiting on 1 children to exit
2020-08-04 18:17:31.294 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Child 214156 killed by signal 15

journalctl -o short-precise -u manila-share.service

Aug 04 18:13:41.971356 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Started OpenStack Manila Share.
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: /usr/lib/python3/dist-packages/pymysql/cursors.py:170: Warning: (3719, "'utf8' is curren>
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: result = self._query(query)
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: Exception ignored in: <function _after_fork at 0x7f35f2a6c670>
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: Traceback (most recent call last):
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: File "/usr/lib/python3.8/threading.py", line 1454, in _after_fork
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: assert len(_active) == 1
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: AssertionError:
Aug 04 18:13:44.604497 juju-111c97-zaza-1335ec68e4d0-4 sudo[214188]: manila : TTY=unknown ; PWD=/var/lib/manila ; USER=root ; COMMAND=/usr/bin/manila-rootwrap /etc>
Aug 04 18:13:44.605673 juju-111c97-zaza-1335ec68e4d0-4 sudo[214188]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 04 18:13:44.816512 juju-111c97-zaza-1335ec68e4d0-4 sudo[214188]: pam_unix(sudo:session): session closed for user root
Aug 04 18:17:31.274360 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Stopping OpenStack Manila Share...
Aug 04 18:17:31.331146 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: manila-share.service: Succeeded.
Aug 04 18:17:31.331676 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Stopped OpenStack Manila Share.

----

Earlier errors in the Pacemaker log indicate that stop and start operations time out:

Aug 04 17:42:09 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of start operation for res_nfs_ganesha_nfs_ganesha on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=211 key=res_nfs_ganesha_nfs_ganesha_start_0 timeout=20000ms
Aug 04 17:42:16 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of start operation for res_manila_share_manila_share on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=214 key=res_manila_share_manila_share_start_0 timeout=20000ms
Aug 04 17:42:19 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of stop operation for res_manila_share_manila_share on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=216 key=res_manila_share_manila_share_stop_0 timeout=20000ms
Aug 04 17:42:19 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of stop operation for res_nfs_ganesha_nfs_ganesha on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=217 key=res_nfs_ganesha_nfs_ganesha_stop_0 timeout=20000ms
New value:

Near-verbatim repeat of the old value above. The only differences are one reworded sentence ("Meanwhile, Pacemaker was not doing anything at that time by the looks of it" in place of "Meanwhile, Pacemaker doesn't seem to do anything at that time") and whitespace in the "crm resource status" output; all log excerpts, tables, and Pacemaker timeout entries are unchanged.
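The description above leans on systemd's default kill behaviour (KillMode=control-group, KillSignal=SIGTERM). As a minimal sketch, the unit's effective kill settings and any drop-in overrides could be confirmed on an affected unit with standard systemd tooling (the commands below are illustrative and not taken from the report):

    # Effective stop/kill settings for the unit
    systemctl show manila-share.service -p KillMode -p KillSignal -p TimeoutStopUSec

    # Unit file plus any drop-ins that might override them
    systemctl cat manila-share.service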
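Since the Pacemaker log shows nothing at 18:17 while the journal records "Stopping OpenStack Manila Share...", the open question is what asked systemd to stop the unit. One possible way to catch the requester on a reproduction, assuming debug logging is acceptable on the node (a sketch only; the exact debug output differs between systemd versions):

    # Make PID 1 log incoming stop requests (newer systemd uses "log-level" instead of "set-log-level")
    sudo systemd-analyze set-log-level debug

    # Watch D-Bus calls into systemd (e.g. StopUnit) and note the sender
    sudo busctl monitor org.freedesktop.systemd1

    # Cross-check cluster activity in the same window
    journalctl -o short-precise -u pacemaker -u corosync --since "2020-08-04 18:15" --until "2020-08-04 18:20"
    grep -i 'res_manila_share' /var/log/pacemaker/pacemaker.log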
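"crm resource status" reporting the resources as Started while "manila service-list" shows the manila-share services down suggests the cluster's view may be stale. A sketch of how the state could be re-probed with crmsh (resource names are taken from the report; whether this is appropriate depends on the investigation):

    # Re-probe the resources so Pacemaker refreshes its view of them
    sudo crm resource refresh res_manila_share_manila_share
    sudo crm resource refresh res_nfs_ganesha_nfs_ganesha

    # Clear recorded operation failures so the cluster retries
    sudo crm resource cleanup res_manila_share_manila_share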
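The 20000ms figures in the Pacemaker errors appear to be the default operation timeout, which systemd-class resources that are slow to start or stop can easily exceed. A sketch of inspecting and raising the op timeouts with crmsh (the 120s/60s values are purely illustrative):

    # Current definition, including op timeouts
    sudo crm configure show res_manila_share_manila_share

    # Example shape of a definition with longer timeouts:
    #   primitive res_manila_share_manila_share systemd:manila-share \
    #       op start timeout=120s interval=0 \
    #       op stop timeout=120s interval=0 \
    #       op monitor interval=60s timeout=60s

    # Edit the live definition interactively
    sudo crm configure edit res_manila_share_manila_share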
2020-08-04 19:22:11 Dmitrii Shcherbakov attachment added juju-crashdump-c12a93b3-7d61-41d9-b52b-abae6ca44ab7.tar.xz https://bugs.launchpad.net/charm-manila-ganesha/+bug/1890330/+attachment/5398754/+files/juju-crashdump-c12a93b3-7d61-41d9-b52b-abae6ca44ab7.tar.xz
2020-08-04 22:37:28 Dmitrii Shcherbakov charm-manila-ganesha: status In Progress Triaged
2020-08-04 22:37:30 Dmitrii Shcherbakov charm-manila-ganesha: importance Undecided Critical
2020-08-04 22:37:32 Dmitrii Shcherbakov charm-manila-ganesha: assignee Dmitrii Shcherbakov (dmitriis)
2020-08-05 00:10:13 OpenStack Infra charm-manila-ganesha: status Triaged In Progress
2021-01-15 14:12:00 Dmitrii Shcherbakov charm-manila-ganesha: status In Progress Fix Committed
2021-05-03 17:44:49 Alex Kavanagh charm-manila-ganesha: milestone 21.04
2021-05-03 17:48:05 Alex Kavanagh charm-manila-ganesha: status Fix Committed Fix Released