Comment 3 for bug 1844929

Matt Riedemann (mriedem) wrote:

The confusing thing is that it looks like grenade restarts the scheduler:

Sep 22 00:37:32.126839 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Stopped Devstack <email address hidden>.

Sep 22 00:45:55.359862 ubuntu-bionic-ovh-gra1-0011664420 systemd[1]: Started Devstack <email address hidden>.
...
Sep 22 00:45:59.927792 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO oslo_service.service [None req-d45bc5b0-cd18-49d3-afe4-a9a40f37aefd None None] Starting 2 workers
...
Sep 22 00:46:00.068988 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.230765 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [-] Starting scheduler node (version 19.1.0)
...
Sep 22 00:46:00.568873 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: INFO nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Updating service version for nova-scheduler on ubuntu-bionic-ovh-gra1-0011664420 from 37 to 41
...
Sep 22 00:46:00.599841 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Creating RPC server for service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:183}}
...
Sep 22 00:46:00.656407 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] Join ServiceGroup membership for this service scheduler {{(pid=19220)
Sep 22 00:46:00.657550 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-331424f4-2ae8-421d-8a43-8d569d999469 None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19220) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}
...
Sep 22 00:46:00.694504 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.service [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] Join ServiceGroup membership for this service scheduler {{(pid=19219) start /opt/stack/new/nova/nova/service.py:201}}
Sep 22 00:46:00.695624 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.servicegroup.drivers.db [None req-5738c33c-8cc8-472c-b93b-de6664557bbb None None] DB_Driver: join new ServiceGroup member ubuntu-bionic-ovh-gra1-0011664420 to the scheduler group, service = <Service: host=ubuntu-bionic-ovh-gra1-0011664420, binary=nova-scheduler, manager_class_name=nova.scheduler.manager.SchedulerManager> {{(pid=19219) join /opt/stack/new/nova/nova/servicegroup/drivers/db.py:47}}

And then I see this run, which connects to the cell1 database to pull the compute nodes and instances:

Sep 22 00:46:00.734968 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.scheduler.host_manager [None req-ba4a446d-5a73-4994-8077-366f115fd4ba None None] Total number of compute nodes: 1 {{(pid=18043) _async_init_instance_info /opt/stack/new/nova/nova/scheduler/host_manager.py:428}}

Sep 22 00:46:00.783135 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.scheduler.host_manager [None req-ba4a446d-5a73-4994-8077-366f115fd4ba None None] Adding 2 instances for hosts 10-20 {{(pid=18043) _async_init_instance_info /opt/stack/new/nova/nova/scheduler/host_manager.py:448}}

That's coming from this method in the scheduler HostManager on restart:

https://github.com/openstack/nova/blob/597b34cd87ac349c0f3702a872630f3c830b1483/nova/scheduler/host_manager.py#L413

Since this is a single-node grenade job, there is 1 compute node record in the cell1 database, and there are 2 instances on that node from the old side of grenade, where we create some resources to verify after the upgrade on the new side.
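
To make that concrete, the startup pattern behind those two debug lines is roughly the following (a minimal sketch only; the cell_db object, the helper names and the chunk size are illustrative stand-ins, not the actual code at the link above):

    CHUNK_SIZE = 10  # illustrative; the real chunking is in host_manager.py

    def init_instance_info(cell_db):
        """Sketch of what the HostManager's async init does after a restart.

        cell_db is a hypothetical stand-in for the cell1 database connection.
        """
        # Pull every compute node record from the cell database.
        compute_nodes = cell_db.get_all_compute_nodes()
        print("Total number of compute nodes: %d" % len(compute_nodes))

        # Then load the instances for those hosts in chunks, which is what
        # produces the "Adding N instances for hosts X-Y" debug messages.
        hosts = [node.host for node in compute_nodes]
        for start in range(0, len(hosts), CHUNK_SIZE):
            chunk = hosts[start:start + CHUNK_SIZE]
            instances = cell_db.get_instances_by_hosts(chunk)
            print("Adding %d instances for hosts %d-%d"
                  % (len(instances), start, start + CHUNK_SIZE))
            # ... populate the per-host instance info cache here ...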

Then the first scheduling request starts here:

Sep 22 00:49:53.890357 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: DEBUG nova.scheduler.manager [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Starting to schedule for instances: ['aac35009-5eda-41f7-bb2e-893cc4f868e2'] {{(pid=19220) select_destinations /opt/stack/new/nova/nova/scheduler/manager.py:133}}

That's nearly 4 minutes later, and by then we've lost the connection to cell1:

Sep 22 00:50:54.174385 ubuntu-bionic-ovh-gra1-0011664420 nova-scheduler[18043]: WARNING nova.context [None req-1929039e-1517-4326-9700-738d4b570ba6 tempest-AttachInterfacesUnderV243Test-2009753731 tempest-AttachInterfacesUnderV243Test-2009753731] Timed out waiting for response from cell 8acfb79b-2e40-4e1c-bc3d-d404dac6db90
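
For context, the shape of the cross-cell query behind that warning is roughly the scatter-gather-with-timeout pattern sketched below (a hedged illustration only; the names and the timeout value are assumptions, not the actual nova code):

    import queue
    import threading
    import time

    CELL_QUERY_TIMEOUT = 60  # seconds; illustrative value

    def scatter_gather_cells(cells, query_fn):
        """Query each cell DB in parallel and wait a bounded time for answers."""
        results = {}
        responses = queue.Queue()

        def _worker(cell):
            # Each cell database is queried from its own worker.
            responses.put((cell.uuid, query_fn(cell)))

        for cell in cells:
            threading.Thread(target=_worker, args=(cell,), daemon=True).start()

        deadline = time.monotonic() + CELL_QUERY_TIMEOUT
        for _ in cells:
            remaining = deadline - time.monotonic()
            try:
                cell_uuid, result = responses.get(timeout=max(remaining, 0))
            except queue.Empty:
                # This is the case behind the WARNING above: the cell database
                # never answered within the timeout.
                print("Timed out waiting for response from cell")
                break
            results[cell_uuid] = result
        return results

If the real code uses a fixed per-query timeout along these lines, the roughly one-minute gap between the select_destinations call at 00:49:53 and the warning at 00:50:54 would line up with it.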

So what happened between 00:46:00 and 00:49:53 that caused us to lose the connection to the cell1 DB?