Instance sometimes (inconsistently) fails on the second boot when multiple ironic conductors are used

Bug #1397438 reported by jsb
Affects: Ironic
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

OpenStack Juno release
Ironic Juno - 2014.2 tarball

Env:
one controller node - all OpenStack controller services and the ironic-api service
one network node
one compute node - nova-compute service and an ironic-conductor service with a TFTP service
2 ironic-conductor nodes - run only the ironic-conductor service with a TFTP service
[in total: one nova-compute service and 3 ironic-conductor services]

1) Launched an instance on a node.
2) It got scheduled to one conductor.
3) That conductor deployed the deploy ramdisk/kernel and copied the user image successfully.
4) The conductor finished deploying, logging "Deployment to node ... done".
5) The node got rebooted.
6) Now another conductor reserved the node and started sending it the deploy ramdisk/kernel again, but tgt failed because it was not able to get the token ("token not found" server error for tgt). However, since the node had completed its second reboot, the Nova instance was set to the Active state. ssh still does not work. [The node gets stuck where it tries to get authorized from Keystone but does not send enough information, resulting in an "authentication required" error.]

The same behavior can sometimes (inconsistently) be observed if a successfully launched node is either rebooted from its own command line or soft-rebooted via the nova command. The node becomes unreachable.

Another interesting observation is that the server-ip-address and tftp-server values keep changing in neutron dnsmasq's opts file (details are provided at the end of this report). The same port f6f5db91-a621-443b-b307-915eb00df8fa alternates between tftp-server and server-ip-address values of 10.10.44.17 and 10.10.44.18 (the addresses where the ironic-conductors run their TFTP services).

Looks like there is a race among multiple conductor services.
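The suspected mechanism can be sketched with a toy consistent hash ring. This is an illustrative sketch only, not Ironic's actual code (the real Juno implementation lives in ironic.common.hash_ring and differs in detail); the class, replica count, and conductor hostnames here are invented for the example:

```python
import bisect
import hashlib


class HashRing(object):
    """Minimal consistent hash ring, for illustration only.

    Each node UUID is hashed onto a ring of conductor replicas; the
    first replica clockwise from the node's hash owns the node. When
    membership changes, some nodes remap to a different conductor.
    """

    def __init__(self, hosts, replicas=64):
        self._ring = {}
        self._keys = []
        for host in hosts:
            for i in range(replicas):
                key = self._hash('%s-%d' % (host, i))
                self._ring[key] = host
                bisect.insort(self._keys, key)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode('utf-8')).hexdigest(), 16)

    def get_host(self, node_uuid):
        idx = bisect.bisect(self._keys, self._hash(node_uuid)) % len(self._keys)
        return self._ring[self._keys[idx]]


conductors = ['conductor1', 'conductor2', 'conductor3']
ring = HashRing(conductors)
node = '14131dce-a5dd-489e-84f6-f9b1b8b56800'
owner = ring.get_host(node)

# If the owning conductor leaves (service stop/crash), the ring remaps
# the node to a surviving conductor, which then initiates take-over and
# rewrites the node's PXE/DHCP settings to point at itself.
survivors = HashRing([c for c in conductors if c != owner])
new_owner = survivors.get_host(node)
```

If such a remap happens mid-deploy, the new owner repointing the TFTP/DHCP options would match the flapping described above.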

Log from the ironic-conductor that successfully deployed the node, up to the second reboot:

2014-11-28 18:05:09.145 4178 DEBUG ironic.common.utils [-] Execution completed, command line is "parted -a optimal -s /dev/disk/by-path/ip-10.10.44.52:3260-iscsi-iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800-lun-1 -- unit MiB mklabel msdos mkpart primary 1 61441" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2014-11-28 18:05:09.153 4178 DEBUG ironic.common.utils [-] Command stdout is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2014-11-28 18:05:09.157 4178 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:86
2014-11-28 18:05:09.622 4178 DEBUG ironic.common.utils [-] Execution completed, command line is "fuser /dev/disk/by-path/ip-10.10.44.52:3260-iscsi-iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800-lun-1" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2014-11-28 18:05:09.630 4178 DEBUG ironic.common.utils [-] Command stdout is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2014-11-28 18:05:09.633 4178 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:86
2014-11-28 18:09:33.690 4178 DEBUG ironic.common.utils [-] Execution completed, command line is "dd if=/var/lib/ironic/images/14131dce-a5dd-489e-84f6-f9b1b8b56800/disk of=/dev/disk/by-path/ip-10.10.44.52:3260-iscsi-iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800-lun-1-part1 bs=1M oflag=direct" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2014-11-28 18:09:33.700 4178 DEBUG ironic.common.utils [-] Command stdout is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2014-11-28 18:09:33.701 4178 DEBUG ironic.common.utils [-] Command stderr is: "1375+1 records in
1375+1 records out
1441923072 bytes (1.4 GB) copied, 263.634 s, 5.5 MB/s
" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:86
2014-11-28 18:09:33.846 4178 DEBUG ironic.common.utils [-] Execution completed, command line is "blkid -s UUID -o value /dev/disk/by-path/ip-10.10.44.52:3260-iscsi-iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800-lun-1-part1" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2014-11-28 18:09:33.847 4178 DEBUG ironic.common.utils [-] Command stdout is: "1a2c7ada-adc0-4765-b205-ae525d501fad
" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2014-11-28 18:09:33.847 4178 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:86
2014-11-28 18:09:34.437 4178 DEBUG ironic.common.utils [-] Execution completed, command line is "iscsiadm -m node -p 10.10.44.52:3260 -T iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800 --logout" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2014-11-28 18:09:34.438 4178 DEBUG ironic.common.utils [-] Command stdout is: "Logging out of session [sid: 8, target: iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800, portal: 10.10.44.52,3260]
Logout of [sid: 8, target: iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800, portal: 10.10.44.52,3260] successful.
" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2014-11-28 18:09:34.439 4178 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:86
2014-11-28 18:09:34.502 4178 DEBUG ironic.common.utils [-] Execution completed, command line is "iscsiadm -m node -p 10.10.44.52:3260 -T iqn-14131dce-a5dd-489e-84f6-f9b1b8b56800 -o delete" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:84
2014-11-28 18:09:34.503 4178 DEBUG ironic.common.utils [-] Command stdout is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:85
2014-11-28 18:09:34.503 4178 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/local/lib/python2.7/dist-packages/ironic/common/utils.py:86
2014-11-28 18:09:34.505 4178 DEBUG ironic.drivers.modules.image_cache [-] Starting clean up for master image cache /var/lib/ironic/master_images clean_up /usr/local/lib/python2.7/dist-packages/ironic/drivers/modules/image_cache.py:169
2014-11-28 18:09:37.509 4178 INFO ironic.drivers.modules.pxe [-] Deployment to node 14131dce-a5dd-489e-84f6-f9b1b8b56800 done

Log from ironic-conductor 2, which took over the same node right after the first conductor rebooted it:

2014-11-28 18:10:41.342 4409 DEBUG ironic.conductor.manager [-] Conductor trustycompute2.example.com taking over node 14131dce-a5dd-489e-84f6-f9b1b8b56800 _do_takeover /usr/local/lib/python2.7/dist-packages/ironic/conductor/manager.py:878
2014-11-28 18:10:41.356 4409 DEBUG ironic.common.pxe_utils [-] Building PXE config for node 14131dce-a5dd-489e-84f6-f9b1b8b56800 create_pxe_config /usr/local/lib/python2.7/dist-packages/ironic/common/pxe_utils.py:182
2014-11-28 18:10:41.366 4409 DEBUG ironic.drivers.modules.pxe [-] Fetching kernel and ramdisk for node 14131dce-a5dd-489e-84f6-f9b1b8b56800 _cache_ramdisk_kernel /usr/local/lib/python2.7/dist-packages/ironic/drivers/modules/pxe.py:212
2014-11-28 18:10:41.367 4409 DEBUG ironic.common.glance_service.base_image_service [-] Getting image metadata from glance. Image: e4e64019-0827-4ee1-a66f-b04e7b84f357 _show /usr/local/lib/python2.7/dist-packages/ironic/common/glance_service/base_image_service.py:186
2014-11-28 18:10:41.370 4409 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.198.10
2014-11-28 18:10:41.372 4409 DEBUG urllib3.connectionpool [-] Setting read timeout to <object object at 0x7fbbda5db0c0> _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:375
2014-11-28 18:10:41.493 4409 DEBUG urllib3.connectionpool [-] "HEAD /v1/images/e4e64019-0827-4ee1-a66f-b04e7b84f357 HTTP/1.1" 200 0 _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:415
2014-11-28 18:10:41.497 4409 DEBUG urllib3.connectionpool [-] Setting read timeout to <object object at 0x7fbbda5db0c0> _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:375
2014-11-28 18:10:41.549 4409 DEBUG urllib3.connectionpool [-] "HEAD /v1/images/e4e64019-0827-4ee1-a66f-b04e7b84f357 HTTP/1.1" 200 0 _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:415
2014-11-28 18:10:41.552 4409 DEBUG ironic.common.glance_service.base_image_service [-] Getting image metadata from glance. Image: 81fcbc70-457d-46df-9f4f-5bf32483a36d _show /usr/local/lib/python2.7/dist-packages/ironic/common/glance_service/base_image_service.py:186
2014-11-28 18:10:41.554 4409 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 192.168.198.10
2014-11-28 18:10:41.557 4409 DEBUG urllib3.connectionpool [-] Setting read timeout to <object object at 0x7fbbda5db0c0> _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:375
2014-11-28 18:10:41.620 4409 DEBUG urllib3.connectionpool [-] "HEAD /v1/images/81fcbc70-457d-46df-9f4f-5bf32483a36d HTTP/1.1" 200 0 _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:415
2014-11-28 18:10:41.624 4409 DEBUG urllib3.connectionpool [-] Setting read timeout to <object object at 0x7fbbda5db0c0> _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:375
2014-11-28 18:10:41.650 4409 DEBUG urllib3.connectionpool [-] "HEAD /v1/images/81fcbc70-457d-46df-9f4f-5bf32483a36d HTTP/1.1" 200 0 _make_request /usr/lib/python2.7/dist-packages/urllib3/connectionpool.py:415
2014-11-28 18:10:41.654 4409 DEBUG ironic.common.glance_service.base_image_service [-] Getting image metadata from glance. Image: 6b180bf0-aac6-4fa5-bdd7-86ac96e077e8 _show /usr/local/lib/python2.7/dist-packages/ironic/common/glance_service/base_image_service.py:186

Network node:
neutron dnsmasq's options file:

cat /var/lib/neutron/dhcp/3e1b6943-8639-48f3-ae16-0828a7daa9cb/opts
tag:tag0,option:classless-static-route,169.254.169.254/32,10.10.44.31,0.0.0.0/0,10.10.44.15
tag:tag0,249,169.254.169.254/32,10.10.44.31,0.0.0.0/0,10.10.44.15
tag:tag0,option:router,10.10.44.15
tag:b19eea1a-6182-4c8f-ab7d-598b76f61106,option:bootfile-name,pxelinux.0
tag:b19eea1a-6182-4c8f-ab7d-598b76f61106,option:server-ip-address,10.10.44.17
tag:b19eea1a-6182-4c8f-ab7d-598b76f61106,option:tftp-server,10.10.44.17
tag:6b896b2f-2c44-46b0-a4d5-b16de2375f77,option:tftp-server,10.10.44.17
tag:6b896b2f-2c44-46b0-a4d5-b16de2375f77,option:bootfile-name,pxelinux.0
tag:6b896b2f-2c44-46b0-a4d5-b16de2375f77,option:server-ip-address,10.10.44.17
tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:bootfile-name,pxelinux.0
tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:server-ip-address,10.10.44.18
tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:tftp-server,10.10.44.18
tag:tag0,option:dns-server,10.10.44.31

cat /var/lib/neutron/dhcp/3e1b6943-8639-48f3-ae16-0828a7daa9cb/opts
tag:tag0,option:classless-static-route,169.254.169.254/32,10.10.44.31,0.0.0.0/0,10.10.44.15
tag:tag0,249,169.254.169.254/32,10.10.44.31,0.0.0.0/0,10.10.44.15
tag:tag0,option:router,10.10.44.15
tag:b19eea1a-6182-4c8f-ab7d-598b76f61106,option:bootfile-name,pxelinux.0
tag:b19eea1a-6182-4c8f-ab7d-598b76f61106,option:server-ip-address,10.10.44.17
tag:b19eea1a-6182-4c8f-ab7d-598b76f61106,option:tftp-server,10.10.44.17
tag:6b896b2f-2c44-46b0-a4d5-b16de2375f77,option:tftp-server,10.10.44.17
tag:6b896b2f-2c44-46b0-a4d5-b16de2375f77,option:bootfile-name,pxelinux.0
tag:6b896b2f-2c44-46b0-a4d5-b16de2375f77,option:server-ip-address,10.10.44.17
tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:bootfile-name,pxelinux.0
tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:server-ip-address,10.10.44.17
tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:tftp-server,10.10.44.17
tag:tag0,option:dns-server,10.10.44.31
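The two captures above can be diffed per port tag to confirm the flapping. A minimal sketch (the helper name is invented for illustration; the field layout follows the dumps above, where lines look like `tag:<port-id>,option:<name>,<value>`):

```python
def parse_opts(text):
    """Parse dnsmasq opts lines into {(port_tag, option_name): value}.

    Numeric-option lines such as "tag:tag0,249,..." are skipped, since
    they carry no "option:" name; values containing commas (e.g. the
    classless-static-route option) are rejoined intact.
    """
    opts = {}
    for line in text.strip().splitlines():
        fields = line.split(',')
        tag = fields[0].split(':', 1)[1]
        if fields[1].startswith('option:'):
            name = fields[1].split(':', 1)[1]
            opts[(tag, name)] = ','.join(fields[2:])
    return opts


# Two captures of the affected port's options, taken from the report above.
first = parse_opts(
    "tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:tftp-server,10.10.44.18")
second = parse_opts(
    "tag:f6f5db91-a621-443b-b307-915eb00df8fa,option:tftp-server,10.10.44.17")

# Keys whose value changed between captures: the flapping ports/options.
flapped = {k: (first[k], second[k]) for k in first if first[k] != second.get(k)}
```

Running this over full consecutive captures would show exactly which ports' tftp-server and server-ip-address entries are being rewritten.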

John Stafford (john-stafford) wrote:

Is this still an active bug?

aeva black (tenbrae) wrote:

Conductors should only initiate take-over when a conductor joins or leaves the cluster. In other words, when any conductor service starts or stops, it may trigger a re-balancing; while the other conductors should "heal", there could be a race leading to this behaviour as a result of the cluster changing.

Was this behaviour being observed even when no ironic-conductor services had been started or stopped for some time?

Changed in ironic:
status: New → Incomplete
jsb (jay-s-b) wrote: Re: [Bug 1397438] Re: Instance fails sometimes(inconsistently) in the second boot when multiple ironic conductors are used

When this behavior was observed, the multiple ironic conductors had already been running for some time.

Thanks,
Jay


Launchpad Janitor (janitor) wrote:

[Expired for Ironic because there has been no activity for 60 days.]

Changed in ironic:
status: Incomplete → Expired