cinder volume service keeps retrying even with code exception

Bug #1167841 reported by YangLei
This bug affects 3 people
Affects          Status        Importance  Assigned to
Cinder           Fix Released  Medium      Yang Yu
Cinder (Grizzly) Fix Released  Medium      Yang Yu

Bug Description

When wrong credentials exist in cinder.conf for the storwize_svc driver, cinder-volume keeps starting a new child process again and again and only records the exception in the log, but no meaningful error is reported to the user.

------ps result:
# ps -ef | grep cinder-
cinder 6873 1 0 02:30 ? 00:00:00 /usr/bin/python /usr/bin/cinder-api --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/api.log
cinder 6888 1 0 02:30 ? 00:00:08 /usr/bin/python /usr/bin/cinder-scheduler --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/scheduler.log
cinder 9216 1 4 03:43 ? 00:03:43 /usr/bin/python /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/volume.log
cinder 11320 9216 7 05:02 ? 00:00:00 /usr/bin/python /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/volume.log
root 11322 3296 0 05:02 pts/2 00:00:00 grep cinder-
# ps -ef | grep cinder-
cinder 6873 1 0 02:30 ? 00:00:00 /usr/bin/python /usr/bin/cinder-api --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/api.log
cinder 6888 1 0 02:30 ? 00:00:08 /usr/bin/python /usr/bin/cinder-scheduler --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/scheduler.log
cinder 9216 1 4 03:43 ? 00:03:43 /usr/bin/python /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/volume.log
cinder 11324 9216 4 05:03 ? 00:00:00 /usr/bin/python /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/volume.log
root 11326 3296 0 05:03 pts/2 00:00:00 grep cinder-
#
------ log showing the child process being started again and again:
2013-04-11 03:18:56 INFO [cinder.service] Starting 1 workers
2013-04-11 03:18:56 INFO [cinder.service] Started child 8510
2013-04-11 03:18:56 AUDIT [cinder.service] Starting cinder-volume node (version 2013.1)
2013-04-11 03:18:56 DEBUG [cinder.volume.drivers.storwize_svc] enter: do_setup
2013-04-11 03:18:56 DEBUG [paramiko.transport] starting thread (client mode): 0x34fdfc10L
2013-04-11 03:18:56 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_5.1)
2013-04-11 03:18:56 DEBUG [paramiko.transport] kex algos:['diffie-hellman-group-exchange-sha256', 'diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'arcfour128', 'arcfour256', 'arcfour', 'aes192-cbc', 'aes256-cbc', '<email address hidden>', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr'] server encrypt:['aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'arcfour128', 'arcfour256', 'arcfour', 'aes192-cbc', 'aes256-cbc', '<email address hidden>', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr'] client mac:['hmac-md5', 'hmac-sha1', '<email address hidden>', 'hmac-ripemd160', '<email address hidden>', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', '<email address hidden>', 'hmac-ripemd160', '<email address hidden>', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', '<email address hidden>'] server compress:['none', '<email address hidden>'] client lang:[''] server lang:[''] kex follows?False
2013-04-11 03:18:56 DEBUG [paramiko.transport] Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-04-11 03:18:56 DEBUG [paramiko.transport] using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-04-11 03:18:57 DEBUG [paramiko.transport] Switch to new keys ...
2013-04-11 03:18:57 DEBUG [paramiko.transport] Adding ssh-rsa host key for olyv7000.rch.stglabs.ibm.com: b3dcf0402edb424ba041164ac8874352
2013-04-11 03:18:57 DEBUG [paramiko.transport] Trying SSH key c8788e03d27d9ab7523d8f8608e43881
2013-04-11 03:18:57 DEBUG [paramiko.transport] userauth is OK
2013-04-11 03:18:57 DEBUG [paramiko.transport] Debug msg: Adding to environment: SSH_LABEL_ID=5
2013-04-11 03:18:58 DEBUG [paramiko.transport] Debug msg: Adding to environment: SSH_EPOCH=3
2013-04-11 03:18:58 INFO [paramiko.transport] Authentication (publickey) successful!
2013-04-11 03:18:58 DEBUG [cinder.utils] Running cmd (SSH): lsmdiskgrp -delim ! -nohdr
2013-04-11 03:18:58 DEBUG [paramiko.transport] [chan 1] Max packet in: 34816 bytes
2013-04-11 03:18:58 DEBUG [paramiko.transport] [chan 1] Max packet out: 32768 bytes
2013-04-11 03:18:58 INFO [paramiko.transport] Secsh channel 1 opened.
2013-04-11 03:18:58 DEBUG [paramiko.transport] [chan 1] Sesch channel 1 request ok
2013-04-11 03:18:58 DEBUG [paramiko.transport] [chan 1] EOF received (1)
2013-04-11 03:18:58 DEBUG [paramiko.transport] [chan 1] Unhandled channel request "<email address hidden>"
2013-04-11 03:18:58 DEBUG [paramiko.transport] [chan 1] EOF sent (1)
2013-04-11 03:18:58 DEBUG [cinder.utils] Result was 1
2013-04-11 03:18:58 ERROR [cinder.volume.drivers.san.san] Unexpected error while running command.
Command: lsmdiskgrp -delim ! -nohdr
Exit code: 1
Stdout: 'CMMVC7016E Authorization has failed because the private key is not valid for the user name that you have specified.\n\n'
Stderr: ''
2013-04-11 03:18:59 ERROR [cinder.volume.drivers.san.san] Error running SSH command: lsmdiskgrp -delim ! -nohdr
2013-04-11 03:18:59 ERROR [cinder.service] Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 227, in _start_child
    self._child_process(wrap.server)
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 204, in _child_process
    launcher.run_server(server)
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 98, in run_server
    server.start()
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 345, in start
    self.manager.init_host()
  File "/usr/lib/python2.6/site-packages/cinder/volume/manager.py", line 145, in init_host
    self.driver.do_setup(ctxt)
  File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/storwize_svc.py", line 190, in do_setup
    out, err = self._run_ssh(ssh_cmd)
  File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/san/san.py", line 152, in _run_ssh
    raise e
ProcessExecutionError: Unexpected error while running command.
Command: lsmdiskgrp -delim ! -nohdr
Exit code: 1
Stdout: 'CMMVC7016E Authorization has failed because the private key is not valid for the user name that you have specified.\n\n'
Stderr: ''
2013-04-11 03:18:59 INFO [cinder.service] Child 8510 exited with status 2
2013-04-11 03:18:59 INFO [cinder.service] Started child 8512
2013-04-11 03:18:59 AUDIT [cinder.service] Starting cinder-volume node (version 2013.1)
2013-04-11 03:18:59 DEBUG [cinder.volume.drivers.storwize_svc] enter: do_setup
2013-04-11 03:18:59 DEBUG [paramiko.transport] starting thread (client mode): 0x34fdfd10L
2013-04-11 03:19:00 INFO [paramiko.transport] Connected (version 2.0, client OpenSSH_5.1)
2013-04-11 03:19:00 DEBUG [paramiko.transport] kex algos:['diffie-hellman-group-exchange-sha256', 'diffie-hellman-group-exchange-sha1', 'diffie-hellman-group14-sha1', 'diffie-hellman-group1-sha1'] server key:['ssh-rsa', 'ssh-dss'] client encrypt:['aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'arcfour128', 'arcfour256', 'arcfour', 'aes192-cbc', 'aes256-cbc', '<email address hidden>', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr'] server encrypt:['aes128-cbc', '3des-cbc', 'blowfish-cbc', 'cast128-cbc', 'arcfour128', 'arcfour256', 'arcfour', 'aes192-cbc', 'aes256-cbc', '<email address hidden>', 'aes128-ctr', 'aes192-ctr', 'aes256-ctr'] client mac:['hmac-md5', 'hmac-sha1', '<email address hidden>', 'hmac-ripemd160', '<email address hidden>', 'hmac-sha1-96', 'hmac-md5-96'] server mac:['hmac-md5', 'hmac-sha1', '<email address hidden>', 'hmac-ripemd160', '<email address hidden>', 'hmac-sha1-96', 'hmac-md5-96'] client compress:['none', '<email address hidden>'] server compress:['none', '<email address hidden>'] client lang:[''] server lang:[''] kex follows?False
2013-04-11 03:19:00 DEBUG [paramiko.transport] Ciphers agreed: local=aes128-ctr, remote=aes128-ctr
2013-04-11 03:19:00 DEBUG [paramiko.transport] using kex diffie-hellman-group1-sha1; server key type ssh-rsa; cipher: local aes128-ctr, remote aes128-ctr; mac: local hmac-sha1, remote hmac-sha1; compression: local none, remote none
2013-04-11 03:19:00 DEBUG [paramiko.transport] Switch to new keys ...
2013-04-11 03:19:00 DEBUG [paramiko.transport] Adding ssh-rsa host key for olyv7000.rch.stglabs.ibm.com: b3dcf0402edb424ba041164ac8874352
2013-04-11 03:19:00 DEBUG [paramiko.transport] Trying SSH key c8788e03d27d9ab7523d8f8608e43881
2013-04-11 03:19:01 DEBUG [paramiko.transport] userauth is OK
2013-04-11 03:19:01 DEBUG [paramiko.transport] Debug msg: Adding to environment: SSH_LABEL_ID=5
2013-04-11 03:19:01 DEBUG [paramiko.transport] Debug msg: Adding to environment: SSH_EPOCH=3
2013-04-11 03:19:01 INFO [paramiko.transport] Authentication (publickey) successful!
2013-04-11 03:19:01 DEBUG [cinder.utils] Running cmd (SSH): lsmdiskgrp -delim ! -nohdr
2013-04-11 03:19:01 DEBUG [paramiko.transport] [chan 1] Max packet in: 34816 bytes
2013-04-11 03:19:02 DEBUG [paramiko.transport] [chan 1] Max packet out: 32768 bytes
2013-04-11 03:19:02 INFO [paramiko.transport] Secsh channel 1 opened.
2013-04-11 03:19:02 DEBUG [paramiko.transport] [chan 1] Sesch channel 1 request ok
2013-04-11 03:19:02 DEBUG [paramiko.transport] [chan 1] EOF received (1)
2013-04-11 03:19:02 DEBUG [paramiko.transport] [chan 1] Unhandled channel request "<email address hidden>"
2013-04-11 03:19:02 DEBUG [paramiko.transport] [chan 1] EOF sent (1)
2013-04-11 03:19:02 DEBUG [cinder.utils] Result was 1
2013-04-11 03:19:02 ERROR [cinder.volume.drivers.san.san] Unexpected error while running command.
Command: lsmdiskgrp -delim ! -nohdr
Exit code: 1
Stdout: 'CMMVC7016E Authorization has failed because the private key is not valid for the user name that you have specified.\n\n'
Stderr: ''
2013-04-11 03:19:03 ERROR [cinder.volume.drivers.san.san] Error running SSH command: lsmdiskgrp -delim ! -nohdr
2013-04-11 03:19:03 ERROR [cinder.service] Unhandled exception
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 227, in _start_child
    self._child_process(wrap.server)
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 204, in _child_process
    launcher.run_server(server)
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 98, in run_server
    server.start()
  File "/usr/lib/python2.6/site-packages/cinder/service.py", line 345, in start
    self.manager.init_host()
  File "/usr/lib/python2.6/site-packages/cinder/volume/manager.py", line 145, in init_host
    self.driver.do_setup(ctxt)
  File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/storwize_svc.py", line 190, in do_setup
    out, err = self._run_ssh(ssh_cmd)
  File "/usr/lib/python2.6/site-packages/cinder/volume/drivers/san/san.py", line 152, in _run_ssh
    raise e
ProcessExecutionError: Unexpected error while running command.
Command: lsmdiskgrp -delim ! -nohdr
Exit code: 1
Stdout: 'CMMVC7016E Authorization has failed because the private key is not valid for the user name that you have specified.\n\n'
Stderr: ''
2013-04-11 03:19:03 INFO [cinder.service] Child 8512 exited with status 2
2013-04-11 03:19:03 INFO [cinder.service] Started child 8513

Revision history for this message
YangLei (yanglyy) wrote :

The function wait() will call _start_child() again and again while "self.running and len(wrap.children) < wrap.workers" is true,
and it seems that this condition is always true.

cinder\cinder\service.py
    def wait(self):
        """Loop waiting on children to die and respawning as necessary."""
        while self.running:
            wrap = self._wait_child()
            if not wrap:
                # Yield to other threads if no children have exited
                # Sleep for a short time to avoid excessive CPU usage
                # (see bug #1095346)
                eventlet.greenthread.sleep(.01)
                continue

            while self.running and len(wrap.children) < wrap.workers:
                self._start_child(wrap)

        if self.sigcaught:
            signame = {signal.SIGTERM: 'SIGTERM',
                       signal.SIGINT: 'SIGINT'}[self.sigcaught]
            LOG.info(_('Caught %s, stopping children'), signame)

        for pid in self.children:
            try:
                os.kill(pid, signal.SIGTERM)
            except OSError as exc:
                if exc.errno != errno.ESRCH:
                    raise

        # Wait for children to die
        if self.children:
            LOG.info(_('Waiting on %d children to exit'), len(self.children))
            while self.children:
                self._wait_child()

Revision history for this message
YangLei (yanglyy) wrote :

If the following line is added in the function _wait_child():
self.running = False
then the whole loop in the function wait() will finish.

To better understand the current logic flow,
I am searching this file's history to find out who
designed/added it.

-------------
cinder\cinder\service.py
    def _wait_child(self):
        try:
            # Don't block if no child processes have exited
            pid, status = os.waitpid(0, os.WNOHANG)
            if not pid:
                return None
        except OSError as exc:
            if exc.errno not in (errno.EINTR, errno.ECHILD):
                raise
            return None

        if os.WIFSIGNALED(status):
            sig = os.WTERMSIG(status)
            LOG.info(_('Child %(pid)d killed by signal %(sig)d'), locals())
        else:
            code = os.WEXITSTATUS(status)
            LOG.info(_('Child %(pid)d exited with status %(code)d'), locals())
            self.running = False

Revision history for this message
YangLei (yanglyy) wrote :

Michael Basnight,

Could you please take a look at this bug?
which is related to this commit:
https://github.com/openstack/cinder/commit/6c708d12f58eb20fce6733f1f6fd08d978570775

 Update cinder-volume to enable multi volume support

* Added the service wrapper for multi binscripts
* Added a service_name to manager
* Added the configuration wrapper so the impls do not need
  to know if they are accessing config values from a option
  group or from the DEFAULT group
* Updated drivers to use the new configuration wrapper
* Added an example config set for multi backend
* Modified service to comply w/ common
* Added get_volume_stats to iscsi
* Added config for volume_backend_name

implements blueprint multi-volume-backends

Change-Id: Idb86ac5f1c06b9cd5061bef6a24f17727e9a58c1

Revision history for this message
YangLei (yanglyy) wrote :

Michael Basnight,

Is the proposal in comment #2 a candidate solution for this bug?
Any suggestions on this are highly appreciated.

Of course, any solution from you would be great.

Yang Yu (yuyangbj)
Changed in cinder:
assignee: nobody → Yang Yu (yuyangbj)
Changed in cinder:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/27040
Committed: http://github.com/openstack/cinder/commit/2cee8fc3a4899097801ad93b5d6e6a4f373ad8fe
Submitter: Jenkins
Branch: master

commit 2cee8fc3a4899097801ad93b5d6e6a4f373ad8fe
Author: Yang Yu <email address hidden>
Date: Wed Apr 17 16:00:07 2013 +0800

    cinder volume service keeps retrying even code exception

    Make sure that ProcessLauncher does not try to restart the failed
    service. And make sure the ProcessLauncher exits, when the number
    of the failed services equal the number of total services.

    Fixes: bug 1167841
    Change-Id: I818c754534c18a62d292e21f0c98086571fe0565
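
The idea described in the commit message can be sketched as follows (a hypothetical FailureAwareLauncher, not the actual patch): count children that exit with a failure status, and stop respawning once the number of failed services equals the number of configured services.

```python
# Hedged sketch of the fix's logic; class and method names are hypothetical.

class FailureAwareLauncher:
    def __init__(self, workers):
        self.workers = workers
        self.failed = 0        # children that exited with nonzero status
        self.running = True
        self.restarts = 0

    def child_exited(self, status):
        if status != 0:
            self.failed += 1
        # Exit once every configured worker has failed, instead of
        # respawning forever.
        if self.failed >= self.workers:
            self.running = False

    def wait(self, exit_statuses):
        # exit_statuses stands in for the stream of child exit codes.
        for status in exit_statuses:
            if not self.running:
                break
            self.restarts += 1
            self.child_exited(status)


launcher = FailureAwareLauncher(workers=1)
# Every spawned child fails with status 2, as in the log above.
launcher.wait(exit_statuses=[2, 2, 2, 2, 2])
print(launcher.restarts)  # 1 -- the launcher gives up after the failure
```

With this check, a driver that cannot complete do_setup() causes the service to exit with an error instead of looping, which is what surfaces the misconfiguration to the operator.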

Changed in cinder:
status: In Progress → Fix Committed
Revision history for this message
Avishay Traeger (avishay-il) wrote : AUTO: Avishay Traeger is out of the office (returning 05/12/2013)

I am out of the office until 05/12/2013.

For technical issues regarding the Storwize/SVC Cinder driver, please
contact: Jie Ping Wu <email address hidden>, Li Min Liu <email address hidden>,
Ronen Kat <email address hidden>
For all other issue, please contact my manager, Dalit Naor
<email address hidden>

Note: This is an automated response to your message "[Bug 1167841] Re:
cinder volume service keeps retrying even with code exception" sent on
26/04/2013 20:46:40.

This is the only notification you will receive while this person is away.

YangLei (yanglyy)
tags: added: grizzly-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/28658

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/grizzly)

Reviewed: https://review.openstack.org/28658
Committed: http://github.com/openstack/cinder/commit/839a6a70211e7b175a5e8f03f2f47b5893120377
Submitter: Jenkins
Branch: stable/grizzly

commit 839a6a70211e7b175a5e8f03f2f47b5893120377
Author: Yang Yu <email address hidden>
Date: Wed Apr 17 16:00:07 2013 +0800

    cinder volume service keeps retrying even code exception

    Make sure that ProcessLauncher does not try to restart the failed
    service. And make sure the ProcessLauncher exits, when the number
    of the failed services equal the number of total services.

    Fixes: bug 1167841
    Change-Id: I818c754534c18a62d292e21f0c98086571fe0565
    (cherry picked from commit 2cee8fc3a4899097801ad93b5d6e6a4f373ad8fe)

tags: added: in-stable-grizzly
Thierry Carrez (ttx)
Changed in cinder:
milestone: none → havana-1
status: Fix Committed → Fix Released
tags: removed: grizzly-backport-potential
Alan Pevec (apevec)
Changed in cinder:
importance: Undecided → Medium
tags: removed: in-stable-grizzly
Thierry Carrez (ttx)
Changed in cinder:
milestone: havana-1 → 2013.2