amulet _get_proc_start_time has a race which causes service restart checks to fail

Bug #1474030 reported by Ryan Beisner
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Charm Helpers
Fix Released
Medium
Ryan Beisner
cinder (Juju Charms Collection)
Fix Released
Medium
Ryan Beisner
glance (Juju Charms Collection)
Invalid
Medium
Ryan Beisner
heat (Juju Charms Collection)
Invalid
Medium
Ryan Beisner
neutron-api (Juju Charms Collection)
Fix Released
Medium
Ryan Beisner
neutron-gateway (Juju Charms Collection)
Fix Released
Medium
Ryan Beisner
neutron-openvswitch (Juju Charms Collection)
Fix Released
Medium
Ryan Beisner
openstack-dashboard (Juju Charms Collection)
Fix Released
Medium
Ryan Beisner

Bug Description

Although most frequently observed in the openstack-dashboard, neutron-api and neutron-gateway charms, this affects all OpenStack charm amulet tests which utilize a service restart check.

The test is to make a charm config change, then inspect the unit for process start time, and compare that start time with file modification time of the relevant config file. In some cases, that arbitrary sleep results in the _get_proc_start_time checking for a service start time before the service is started, yielding OSError: [Errno 2] No such file or directory: '/proc/1242'.

2015-07-13 14:57:39,934 test_900_restart_on_config_change DEBUG: Making config change on neutron-api...
2015-07-13 14:57:40,527 test_900_restart_on_config_change DEBUG: Checking that service restarted: neutron-server
2015-07-13 14:59:29,632 _get_proc_start_time DEBUG: CMDout: (u'1242', 0)
2015-07-13 14:59:29,632 _get_proc_start_time DEBUG: Pid for neutron-server 1242
Traceback (most recent call last):
  File "tests/019-basic-vivid-kilo", line 9, in <module>
    deployment.run_tests()
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/deployment.py", line 93, in run_tests
    getattr(self, test)()
  File "/home/ubuntu/bzr/next/neutron-api/tests/basic_deployment.py", line 516, in test_900_restart_on_config_change
    pgrep_full=True):
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 290, in service_restarted
    if (self._get_proc_start_time(sentry_unit, service, pgrep_full) >=
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 279, in _get_proc_start_time
    return self._get_dir_mtime(sentry_unit, proc_dir)
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 260, in _get_dir_mtime
    return sentry_unit.directory_stat(directory)['mtime']
  File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 88, in directory_stat
    return self._fs_data(path)
  File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 75, in _fs_data
    return self._run_unit_script("filesystem_data.py {}".format(path))
  File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 126, in _run_unit_script
    raise IOError(output)
IOError: Traceback (most recent call last):
  File "/tmp/amulet/filesystem_data.py", line 7, in <module>
    s = os.stat(sys.argv[1])
OSError: [Errno 2] No such file or directory: '/proc/1242'
ERROR subprocess encountered error code 1

real 17m21.555s

Related branches

Ryan Beisner (1chb1n)
no longer affects: neutron-gateway (Ubuntu)
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Switching from service_restarted to validate_service_config_changed, the race in this bug still exists, as the process may not be started at the moment that it is checked. I think it needs a try/retry loop with a timeout threshold.

2015-08-24 16:27:24,641 test_900_restart_on_config_change DEBUG: Making config change on neutron-api...
DEBUG:amulet-logger:Making config change on neutron-api...
2015-08-24 16:27:26,794 test_900_restart_on_config_change DEBUG: Checking that service restarted: neutron-server
DEBUG:amulet-logger:Checking that service restarted: neutron-server
2015-08-24 16:27:26,796 validate_service_config_changed DEBUG: Checking neutron-server restarted since 1440433643.0
DEBUG:amulet-logger:Checking neutron-server restarted since 1440433643.0
2015-08-24 16:28:26,856 service_restarted_since DEBUG: Checking neutron-server restarted since 1440433643.0
DEBUG:amulet-logger:Checking neutron-server restarted since 1440433643.0
2015-08-24 16:29:15,571 _get_proc_start_time DEBUG: CMDout: (u'15786', 0)
DEBUG:amulet-logger:CMDout: (u'15786', 0)
2015-08-24 16:29:15,571 _get_proc_start_time DEBUG: Pid for neutron-server 15786
DEBUG:amulet-logger:Pid for neutron-server 15786
Traceback (most recent call last):
  File "tests/019-basic-vivid-kilo", line 9, in <module>
    deployment.run_tests()
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/deployment.py", line 93, in run_tests
    getattr(self, test)()
  File "/home/ubuntu/bzr/next/neutron-api/tests/basic_deployment.py", line 548, in test_900_restart_on_config_change
    sleep_time=sleep_time):
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 411, in validate_service_config_changed
    retry_count=retry_count)
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 327, in service_restarted_since
    pgrep_full)
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 289, in _get_proc_start_time
    return self._get_dir_mtime(sentry_unit, proc_dir)
  File "/home/ubuntu/bzr/next/neutron-api/tests/charmhelpers/contrib/amulet/utils.py", line 270, in _get_dir_mtime
    return sentry_unit.directory_stat(directory)['mtime']
  File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 88, in directory_stat
    return self._fs_data(path)
  File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 75, in _fs_data
    return self._run_unit_script("filesystem_data.py {}".format(path))
  File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 126, in _run_unit_script
    raise IOError(output)
IOError: Traceback (most recent call last):
  File "/tmp/amulet/filesystem_data.py", line 7, in <module>
    s = os.stat(sys.argv[1])
OSError: [Errno 2] No such file or directory: '/proc/15786'
ERROR subprocess encountered error code 1

Revision history for this message
Adam Collard (adam-collard) wrote :

I hit this same traceback but diagnosed it differently.

When pgrep_full is used, it looks for the script name in the full command line of all processes. Unfortunately "juju run pgrep foo" is also running, so when we pgrep -a for foo we'll find the juju process. In charm-helpers, there's a "| grep -v pgrep" which I assume is trying to work around this case. Given that the main call to pgrep is only configured to output PIDs though, it's something which will never get excluded. The fix I was going to propose added -a to the pgrep command and then processed the results accordingly.

We should still have handling for when pgrep finds a PID but then fails to look at the mtime of /proc/$pid though since the process might well go away between the two commands.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I'm not sure I completely follow. What I do know, is I have not experienced the race when using the proposed changes, where it was actually fairly prevalent with a few of these charms. I'd sure entertain further clarification or an example though. :-)

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Note that pgrep on precise doesn't take have the -a option. :-/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Shifted gears to use pidof instead of pgrep, via an existing helper. Manual tests against keystone, neutron-api and openstack-dashboard are OK. That encompasses charm tests which had previously used pgrep_full True and pgrep_full False. So, functionally backwards-compatible in that cross-section of checks.

James Page (james-page)
Changed in neutron-api (Juju Charms Collection):
importance: Undecided → Medium
Changed in neutron-gateway (Juju Charms Collection):
importance: Undecided → Medium
Changed in openstack-dashboard (Juju Charms Collection):
importance: Undecided → Medium
status: New → In Progress
Changed in neutron-gateway (Juju Charms Collection):
status: New → In Progress
Changed in neutron-api (Juju Charms Collection):
status: New → In Progress
Changed in openstack-dashboard (Juju Charms Collection):
milestone: none → 15.10
Changed in neutron-gateway (Juju Charms Collection):
milestone: none → 15.10
Changed in neutron-api (Juju Charms Collection):
milestone: none → 15.10
assignee: nobody → Ryan Beisner (1chb1n)
Changed in neutron-gateway (Juju Charms Collection):
assignee: nobody → Ryan Beisner (1chb1n)
Changed in openstack-dashboard (Juju Charms Collection):
assignee: nobody → Ryan Beisner (1chb1n)
James Page (james-page)
Changed in neutron-api (Juju Charms Collection):
status: In Progress → Fix Committed
Changed in charm-helpers:
status: New → Fix Released
Changed in openstack-dashboard (Juju Charms Collection):
status: In Progress → Fix Committed
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Similar race, nearby place, will propose something more resilient in the form of an update to contrib/amulet/utils.py. This is happening when the conf file mtime is pinged, and the file is apparently in the middle of a write.

13:29:22 Traceback (most recent call last):
13:29:22 File "tests/014-basic-precise-icehouse", line 12, in <module>
13:29:22 deployment.run_tests()
13:29:22 File "/var/lib/jenkins/checkout/nova-cloud-controller/tests/charmhelpers/contrib/amulet/deployment.py", line 93, in run_tests
13:29:22 getattr(self, test)()
13:29:22 File "/var/lib/jenkins/checkout/nova-cloud-controller/tests/basic_deployment.py", line 459, in test_z_restart_on_config_change
13:29:22 pgrep_full=True, sleep_time=time):
13:29:22 File "/var/lib/jenkins/checkout/nova-cloud-controller/tests/charmhelpers/contrib/amulet/utils.py", line 301, in service_restarted
13:29:22 self._get_file_mtime(sentry_unit, filename)):
13:29:22 File "/var/lib/jenkins/checkout/nova-cloud-controller/tests/charmhelpers/contrib/amulet/utils.py", line 266, in _get_file_mtime
13:29:22 return sentry_unit.file_stat(filename)['mtime']
13:29:22 File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 77, in file_stat
13:29:22 return self._fs_data(filename)
13:29:22 File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 74, in _fs_data
13:29:22 return self._run_unit_script("filesystem_data.py {}".format(path))
13:29:22 File "/usr/lib/python2.7/dist-packages/amulet/sentry.py", line 146, in _run_unit_script
13:29:22 raise IOError(output)
13:29:22 IOError: Warning: Permanently added '172.17.105.93' (ECDSA) to the list of known hosts.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Neutron-gateway & nova-cloud-controller file race fixes underway, linked to this bug.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Status update: we fixed the process-didnt-exist-at-that-moment race; but not the conf-file-didnt-exist-at-that-moment race, which is what https://code.launchpad.net/~1chb1n/charm-helpers/amulet-file-race/+merge/271417 aims to do.

Changed in neutron-gateway (Juju Charms Collection):
status: In Progress → Fix Committed
James Page (james-page)
Changed in neutron-api (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in neutron-gateway (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in openstack-dashboard (Juju Charms Collection):
status: Fix Committed → Fix Released
Ryan Beisner (1chb1n)
Changed in neutron-openvswitch (Ubuntu):
assignee: nobody → Ryan Beisner (1chb1n)
Changed in heat (Ubuntu):
assignee: nobody → Ryan Beisner (1chb1n)
Changed in glance (Ubuntu):
assignee: nobody → Ryan Beisner (1chb1n)
Changed in cinder (Juju Charms Collection):
assignee: nobody → Ryan Beisner (1chb1n)
status: New → In Progress
Changed in glance (Ubuntu):
status: New → In Progress
Changed in heat (Ubuntu):
status: New → In Progress
Changed in neutron-openvswitch (Ubuntu):
status: New → In Progress
Ryan Beisner (1chb1n)
Changed in glance (Ubuntu):
status: In Progress → Confirmed
Changed in heat (Ubuntu):
status: In Progress → Confirmed
Changed in cinder (Juju Charms Collection):
milestone: none → 16.01
no longer affects: neutron-openvswitch (Ubuntu)
no longer affects: heat (Ubuntu)
no longer affects: glance (Ubuntu)
Changed in heat (Juju Charms Collection):
status: New → Confirmed
milestone: none → 16.01
Changed in glance (Juju Charms Collection):
status: New → Confirmed
milestone: none → 16.01
Changed in neutron-openvswitch (Juju Charms Collection):
status: New → In Progress
milestone: none → 16.01
assignee: nobody → Ryan Beisner (1chb1n)
Ryan Beisner (1chb1n)
Changed in cinder (Juju Charms Collection):
status: In Progress → Fix Committed
Changed in neutron-openvswitch (Juju Charms Collection):
status: In Progress → Fix Committed
Ryan Beisner (1chb1n)
Changed in heat (Juju Charms Collection):
assignee: nobody → Ryan Beisner (1chb1n)
importance: Undecided → Medium
James Page (james-page)
Changed in heat (Juju Charms Collection):
milestone: 16.01 → 16.04
Changed in cinder (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in neutron-openvswitch (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in glance (Juju Charms Collection):
milestone: 16.01 → 16.04
James Page (james-page)
Changed in heat (Juju Charms Collection):
milestone: 16.04 → 16.07
Changed in glance (Juju Charms Collection):
milestone: 16.04 → 16.07
Liam Young (gnuoy)
Changed in heat (Juju Charms Collection):
milestone: 16.07 → 16.10
James Page (james-page)
Changed in glance (Juju Charms Collection):
milestone: 16.07 → 16.10
James Page (james-page)
Changed in heat (Juju Charms Collection):
milestone: 16.10 → 17.01
Liam Young (gnuoy)
Changed in neutron-openvswitch (Juju Charms Collection):
importance: Undecided → Medium
Changed in glance (Juju Charms Collection):
importance: Undecided → Medium
Changed in charm-helpers:
importance: Undecided → Medium
Changed in cinder (Juju Charms Collection):
importance: Undecided → Medium
Ryan Beisner (1chb1n)
Changed in charm-helpers:
assignee: nobody → Ryan Beisner (1chb1n)
Changed in glance (Juju Charms Collection):
assignee: nobody → Ryan Beisner (1chb1n)
James Page (james-page)
Changed in charm-glance:
assignee: nobody → Ryan Beisner (1chb1n)
importance: Undecided → Medium
status: New → Confirmed
Changed in glance (Juju Charms Collection):
status: Confirmed → Invalid
Changed in charm-heat:
assignee: nobody → Ryan Beisner (1chb1n)
importance: Undecided → Medium
status: New → Confirmed
Changed in heat (Juju Charms Collection):
status: Confirmed → Invalid
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Ryan, is this still an issue? I know there've been a lot of amulet fixes in the past year+ and this is an old bug. Can we close it out?

Ryan Beisner (1chb1n)
no longer affects: charm-heat
no longer affects: charm-glance
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.