[bionic/ussuri] manila-share service exits a few minutes after startup with "Caught SIGTERM, stopping children... Child <pid> killed by signal 15"

Bug #1890330 reported by Dmitrii Shcherbakov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Manila-Ganesha Charm
Fix Released
Critical
Dmitrii Shcherbakov

Bug Description

tests/bundles/bionic-ussuri.yaml observed in https://review.opendev.org/#/c/743212/

There is an issue with the manila-share service post-startup where it suddenly gets SIGTERM after a few minutes of operation. It properly reads its config, connects to Ceph mons CephFS, runs periodic share status tasks while being alive.

The behavior matches what systemd does by default when stopping services:
https://www.freedesktop.org/software/systemd/man/systemd.kill.html#KillMode=
"Processes will first be terminated via SIGTERM (unless the signal to send is changed via KillSignal= or RestartKillSignal=)."

Based on the log entries, something stops manila-share via systemd

18:17:31.274360 <--- journal log entry about stopping the manila-share service
18:17:31.278 <--- caught SIGTERM

Meanwhile, Pacemaker was not doing anything at that time by the looks of it:

grep -iP 'stop' /var/log/pacemaker/pacemaker.log | grep 18:17 ; echo $?
1

crm resource status
 res_manila_share_manila_share (systemd:manila-share): Started
 res_nfs_ganesha_nfs_ganesha (systemd:nfs-ganesha): Started
 Resource Group: grp_ganesha_vips
     res_ganesha_a3b980a_vip (ocf::heartbeat:IPaddr2): Started

The resources are listed as started

More information below:

1) the NFS share resource is in the error state:

ubuntu@dmitriis-bastion:~$ openstack share list
+--------------------------------------+---------------+------+-------------+--------+-----------+-----------------+------+-------------------+
| ID | Name | Size | Share Proto | Status | Is Public | Share Type Name | Host | Availability Zone |
+--------------------------------------+---------------+------+-------------+--------+-----------+-----------------+------+-------------------+
| ec095c71-f8f2-4aa0-bdd1-ed1f47bd3cce | cephnfsshare1 | 1 | NFS | error | False | cephfsnfstype | | None |
+--------------------------------------+---------------+------+-------------+--------+-----------+-----------------+------+-------------------+

2) manila-share services are DOWN:

ubuntu@dmitriis-bastion:~$ manila service-list
+----+------------------+--------------------------------------------+------+---------+-------+----------------------------+
| Id | Binary | Host | Zone | Status | State | Updated_at |
+----+------------------+--------------------------------------------+------+---------+-------+----------------------------+
| 1 | manila-scheduler | juju-111c97-zaza-1335ec68e4d0-13 | nova | enabled | up | 2020-08-04T14:57:38.000000 |
| 2 | manila-data | juju-111c97-zaza-1335ec68e4d0-13 | nova | enabled | up | 2020-08-04T14:57:43.000000 |
| 3 | manila-share | juju-111c97-zaza-1335ec68e4d0-5@cephfsnfs1 | nova | enabled | down | 2020-08-04T13:36:25.000000 |
| 4 | manila-share | juju-111c97-zaza-1335ec68e4d0-4@cephfsnfs1 | nova | enabled | down | 2020-08-04T13:40:57.000000 |
| 5 | manila-share | juju-111c97-zaza-1335ec68e4d0-3@cephfsnfs1 | nova | enabled | down | 2020-08-04T13:56:55.000000 |
+----+------------------+--------------------------------------------+------+---------+-------+----------------------------+

3) /var/log/manila/manila-share.log at manila-ganesha units contains the following:

2020-08-04 18:13:43.190 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Starting 1 workers

...

2020-08-04 18:13:43.875 214156 DEBUG manila.service [req-e89377c0-b153-462f-91db-1b1912a38223 - - - - -] Creating RPC server for service manila-share. start /usr/lib/python3/dist-packages/manila/service.py:121
2020-08-04 18:13:43.918 214156 DEBUG manila.share.manager [req-e89377c0-b153-462f-91db-1b1912a38223 - - - - -] Start initialization of driver: 'CephFSDriver@juju-111c97-zaza-1335ec68e4d0-4@cephfsnfs1' _driver_setup /usr/lib/python3/dist-packages/manila/share/manager.py:309
2020-08-04 18:13:43.919 214156 INFO manila.share.drivers.cephfs.driver [req-e89377c0-b153-462f-91db-1b1912a38223 - - - - -] [CEPHFSNFS1}] Ceph client found, connecting...

...

2020-08-04 18:17:30.935 214156 INFO manila.share.manager [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Updating share status
2020-08-04 18:17:30.941 214156 DEBUG manila.share.driver [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Updating share stats. _update_share_stats /usr/lib/python3/dist-packages/manila/share/driver.py:1271
2020-08-04 18:17:30.941 214156 DEBUG oslo_service.periodic_task [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Running periodic task ShareManager.periodic_share_status_update run_periodic_tasks /usr/lib/python3/dist-packages/oslo_service/periodic_task.py:216
2020-08-04 18:17:30.942 214156 DEBUG manila.share.manager [req-b8b9a3fe-59f0-4666-94ad-3ffd6f43bb32 - - - - -] Updating status of share instances. periodic_share_status_update /usr/lib/python3/dist-packages/manila/share/manager.py:4515
2020-08-04 18:17:31.278 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Caught SIGTERM, stopping children
2020-08-04 18:17:31.279 214098 DEBUG oslo_concurrency.lockutils [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Acquired lock "singleton_lock" lock /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:265
2020-08-04 18:17:31.279 214098 DEBUG oslo_concurrency.lockutils [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Releasing lock "singleton_lock" lock /usr/lib/python3/dist-packages/oslo_concurrency/lockutils.py:281
2020-08-04 18:17:31.280 214098 DEBUG oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Stop services. stop /usr/lib/python3/dist-packages/oslo_service/service.py:699
2020-08-04 18:17:31.280 214098 INFO manila.coordination [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Stopped Coordinator (Agent ID: 73b71760-242a-40e4-a54f-108702fb28ef, prefix: manila-)
2020-08-04 18:17:31.280 214098 DEBUG oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Killing children. stop /usr/lib/python3/dist-packages/oslo_service/service.py:704
2020-08-04 18:17:31.286 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Waiting on 1 children to exit
2020-08-04 18:17:31.294 214098 INFO oslo_service.service [req-19fc1843-0bdf-4c8a-8b66-d190a192e4ee - - - - -] Child 214156 killed by signal 15

journalctl -o short-precise -u manila-share.service

Aug 04 18:13:41.971356 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Started OpenStack Manila Share.
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: /usr/lib/python3/dist-packages/pymysql/cursors.py:170: Warning: (3719, "'utf8' is curren>
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: result = self._query(query)
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: Exception ignored in: <function _after_fork at 0x7f35f2a6c670>
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: Traceback (most recent call last):
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: File "/usr/lib/python3.8/threading.py", line 1454, in _after_fork
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: assert len(_active) == 1
Aug 04 18:13:44.588087 juju-111c97-zaza-1335ec68e4d0-4 manila-share[214188]: AssertionError:
Aug 04 18:13:44.604497 juju-111c97-zaza-1335ec68e4d0-4 sudo[214188]: manila : TTY=unknown ; PWD=/var/lib/manila ; USER=root ; COMMAND=/usr/bin/manila-rootwrap /etc>
Aug 04 18:13:44.605673 juju-111c97-zaza-1335ec68e4d0-4 sudo[214188]: pam_unix(sudo:session): session opened for user root by (uid=0)
Aug 04 18:13:44.816512 juju-111c97-zaza-1335ec68e4d0-4 sudo[214188]: pam_unix(sudo:session): session closed for user root
Aug 04 18:17:31.274360 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Stopping OpenStack Manila Share...
Aug 04 18:17:31.331146 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: manila-share.service: Succeeded.
Aug 04 18:17:31.331676 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Stopped OpenStack Manila Share.

----

Earlier errors in the Pacemaker log indicate that stop and start operations time out:

Aug 04 17:42:09 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of start operation for res_nfs_ganesha_nfs_ganesha on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=211 key=res_nfs_ganesha_nfs_ganesha_start_0 timeout=20000ms
Aug 04 17:42:16 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of start operation for res_manila_share_manila_share on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=214 key=res_manila_share_manila_share_start_0 timeout=20000ms
Aug 04 17:42:19 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of stop operation for res_manila_share_manila_share on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=216 key=res_manila_share_manila_share_stop_0 timeout=20000ms
Aug 04 17:42:19 juju-111c97-zaza-1335ec68e4d0-4 pacemaker-controld [30769] (process_lrm_event) error: Result of stop operation for res_nfs_ganesha_nfs_ganesha on juju-111c97-zaza-1335ec68e4d0-4: Timed Out | call=217 key=res_nfs_ganesha_nfs_ganesha_stop_0 timeout=20000ms

description: updated
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

As a sanity test:

* systemctl stop pacemaker;
* systemctl start manila-share.

Results in the same behavior - the manila-share service gets stopped in ~ 3 minutes.

So it's definitely not Pacemaker that stops manila-share.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :
Download full text (4.6 KiB)

I noticed that the stop time varies between somewhere around 30 seconds and 6 minutes.

While trying to catch the reason for I used this:

while true ; do pgrep -af manila-share >> out.txt ; echo -e "`date --rfc-3339=ns`\n"'------------' >> out.txt ; sleep 0.1 ; done

https://paste.ubuntu.com/p/jcsJF4WVmH/
------------
302430 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
302463 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
304498 watch --color -n1 systemctl status manila-share
2020-08-04 20:39:28.652242839+00:00
------------
302430 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
302463 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
304498 watch --color -n1 systemctl status manila-share
305575 systemctl disable manila-share
305576 /bin/sh /lib/systemd/systemd-sysv-install disable manila-share
305578 /usr/bin/perl /usr/sbin/update-rc.d manila-share defaults
2020-08-04 20:39:28.847391469+00:00
------------
302430 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
302463 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
304498 watch --color -n1 systemctl status manila-share
305575 systemctl disable manila-share
305576 /bin/sh /lib/systemd/systemd-sysv-install disable manila-share
305578 /usr/bin/perl /usr/sbin/update-rc.d manila-share defaults
305609 watch --color -n1 systemctl status manila-share
305610 sh -c systemctl status manila-share
305611 systemctl status manila-share
2020-08-04 20:39:29.040207067+00:00
------------
302430 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
302463 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
304498 watch --color -n1 systemctl status manila-share
305575 systemctl disable manila-share
305576 /bin/sh /lib/systemd/systemd-sysv-install disable manila-share
2020-08-04 20:39:29.265223464+00:00
------------
302430 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
302463 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
304498 watch --color -n1 systemctl status manila-share
305575 systemctl disable manila-share
305576 /bin/sh /lib/systemd/systemd-sysv-install disable manila-share
305619 /usr/bin/perl /usr/sbin/update-rc.d manila-share disable
2020-08-04 20:39:29.472259353+00:00
------------
302430 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
302463 /usr/bin/python3 /usr/bin/manila-share --config-file=/etc/manila/manila.conf --log-file=/var/log/manila/manila-share.log
304498 watch --color -n1 systemctl status manila-share
305575...

Read more...

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Based on #3, I found that the execution of update-status apparently calls `/lib/systemd/systemd-sysv-install disable manila-share`:

ubuntu@juju-111c97-zaza-1335ec68e4d0-4:~$ sudo grep -Ri '20:39:28' /var/log/
/var/log/juju/unit-manila-ganesha-1.log:2020-08-04 20:39:28 DEBUG update-status Synchronizing state of manila-share.service with SysV service script with /lib/systemd/systemd-sysv-install.
/var/log/juju/unit-manila-ganesha-1.log:2020-08-04 20:39:28 DEBUG update-status Executing: /lib/systemd/systemd-sysv-install disable manila-share
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: /lib/systemd/system/nfs-ganesha-lock.service:18: .include directives are deprecated, and support for them will be removed in a future version of systemd. Please use drop-in files instead.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Reloading.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update the unit file accordingly.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Configuration file /etc/systemd/system/manila-ganesha-mysql-router.service is marked executable. Please remove executable permission bits. Proceeding anyway.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: /lib/systemd/system/rpc-statd.service:16: PIDFile= references a path below legacy directory /var/run/, updating /var/run/rpc.statd.pid → /run/rpc.statd.pid; please update the unit file accordingly.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: /lib/systemd/system/nfs-ganesha-lock.service:18: .include directives are deprecated, and support for them will be removed in a future version of systemd. Please use drop-in files instead.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: /lib/systemd/system/rpc-statd.service:16: PIDFile= references a path below legacy directory /var/run/, updating /var/run/rpc.statd.pid → /run/rpc.statd.pid; please update the unit file accordingly.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: Reloading.
/var/log/syslog:Aug 4 20:39:28 juju-111c97-zaza-1335ec68e4d0-4 systemd[1]: /lib/systemd/system/dbus.socket:5: ListenStream= references a path below legacy directory /var/run/, updating /var/run/dbus/system_bus_socket → /run/dbus/system_bus_socket; please update the unit file accordingly.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The code that disables the manila-share and nfs-ganesha services is not guarded by `when_not('ha.available')` which is why update-status stops those services on every update-status hook:

https://opendev.org/openstack/charm-manila-ganesha/src/commit/7d804802302e696c1c9fbc623c951d3ac578c4e1/src/reactive/manila_ganesha.py#L108-L122
@reactive.when('ha.connected', 'ganesha-pool-configured',
               'config.rendered')
def cluster_connected(hacluster):
    """Configure HA resources in corosync"""
    with charm.provide_charm_instance() as this_charm:
        this_charm.configure_ha_resources(hacluster)
        for service in ['nfs-ganesha', 'manila-share']:
            ch_core.host.service('disable', service)
            ch_core.host.service('stop', service)

sudo juju-run manila-ganesha/1 'charms.reactive get_flags'
amqp.available
amqp.connected
ceph.available
ceph.connected
ceph.pools.available
charm.installed
charms.openstack.do-default-amqp.connected
charms.openstack.do-default-certificates.available
charms.openstack.do-default-charm.installed
charms.openstack.do-default-cluster.available
charms.openstack.do-default-config.changed
charms.openstack.do-default-shared-db.connected
charms.openstack.do-default-update-status
charms.openstack.do-default-upgrade-charm
cluster.available
cluster.connected
config.rendered
ganesha-installed
ganesha-pool-configured
ha.available
ha.connected
haproxy.stat.password
identity-service.available
identity-service.available.auth
identity-service.connected
manila-plugin.available
manila-plugin.changed
manila-plugin.connected
shared-db.available
shared-db.connected

sudo juju-run manila-ganesha/1 'charms.reactive get_flags' | grep -P 'ha.connected|ganesha-pool-configured|config.rendered'

config.rendered
ganesha-pool-configured
ha.connected

Changed in charm-manila-ganesha:
status: In Progress → Triaged
importance: Undecided → Critical
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
Changed in charm-manila-ganesha:
status: Triaged → In Progress
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :
Changed in charm-manila-ganesha:
status: In Progress → Fix Committed
Revision history for this message
Vern Hart (vern) wrote :

Dmitrii mentioned that cluster_connected disables manila-share and nfs-ganesha. When I look at that method in the current charm, I see that it no longer disables those services. I think that means this bug is Fix-Released. Right?

Changed in charm-manila-ganesha:
milestone: none → 21.04
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.