confused baremetal instance thinks its off, is clearly operational

Bug #1178378 reported by Robert Collins
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Critical
aeva black
tripleo
Fix Released
Critical
aeva black

Bug Description

| 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45 | foo | SHUTOFF | None | Shutdown |
                  ctlplane=192.0.2.34 |
but
64 bytes from 192.0.2.34: icmp_req=1 ttl=64 time=0.491 ms
64 bytes from 192.0.2.34: icmp_req=2 ttl=64 time=0.390 ms

and I'm also connected to the console and it's running as it was ~ 8 hours ago when I went to bed.
 uptime
 14:12:21 up 9:50, 1 user, load average: 0.00, 0.01, 0.05

I've marked this critical because once the confusion happens, the next time IPMI polling works, this is logged:
[instance: bfb6957b-d23d-4716-bc9d-96d2878b738d] Instance is not stopped. Calling the stop
                  API.

and nova promptly powers the machine off via IPMI, destroying any state on the machine.

It can be powered back on via 'nova start $UUID'.

mysql> select * from instances where uuid='7b12b6aa-ea62-4165-82b5-68c1ebdfbb45';
+---------------------+---------------------+------------+----+-------------+----------------------------------+----------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------+----------+----------+-------------+----------+-----------+-------+----------+--------+-----------+----------------+---------------------+---------------------+---------------+--------------+---------------------+-------------------+--------+---------+-------------+------------------+---------+--------------------------------------+--------------+------------------+--------------+--------------+--------------+------------+--------------------------+---------------------+----------+------------------+--------------------+-------------------+---------+--------------+-----------+--------------------------------------+---------+
| created_at | updated_at | deleted_at | id | internal_id | user_id | project_id | image_ref | kernel_id | ramdisk_id | launch_index | key_name | key_data | power_state | vm_state | memory_mb | vcpus | hostname | host | user_data | reservation_id | scheduled_at | launched_at | terminated_at | display_name | display_description | availability_zone | locked | os_type | launched_on | instance_type_id | vm_mode | uuid | architecture | root_device_name | access_ip_v4 | access_ip_v6 | config_drive | task_state | default_ephemeral_device | default_swap_device | progress | auto_disk_config | shutdown_terminate | disable_terminate | root_gb | ephemeral_gb | cell_name | node | deleted |
+---------------------+---------------------+------------+----+-------------+----------------------------------+----------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------+----------+----------+-------------+----------+-----------+-------+----------+--------+-----------+----------------+---------------------+---------------------+---------------+--------------+---------------------+-------------------+--------+---------+-------------+------------------+---------+--------------------------------------+--------------+------------------+--------------+--------------+--------------+------------+--------------------------+---------------------+----------+------------------+--------------------+-------------------+---------+--------------+-----------+--------------------------------------+---------+
| 2013-05-09 08:48:15 | 2013-05-09 09:05:26 | NULL | 57 | NULL | baa113f6f7994ddd9c7d86945768616e | 0d4df5d4fee24f18b8b5f425eb81e0c4 | 0b6e64a9-4867-4742-a964-f52951c93123 | 615727b1-07d2-4071-b5f4-6ee432972df9 | 9e5d7470-f3ed-4604-8fe7-894822d86016 | 0 | NULL | NULL | 4 | stopped | 512 | 1 | foo | ubuntu | NULL | r-kunwlpa2 | 2013-05-09 08:48:15 | 2013-05-09 08:54:59 | NULL | foo | foo | NULL | 0 | NULL | ubuntu | 6 | NULL | 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45 | NULL | NULL | NULL | NULL | | NULL | NULL | NULL | 0 | NULL | 0 | 0 | 10 | 0 | NULL | 96deccd5-0ad9-4bb5-979b-009bebac52fc | 0 |
+---------------------+---------------------+------------+----+-------------+----------------------------------+----------------------------------+--------------------------------------+--------------------------------------+--------------------------------------+--------------+----------+----------+-------------+----------+-----------+-------+----------+--------+-----------+----------------+---------------------+---------------------+---------------+--------------+---------------------+-------------------+--------+---------+-------------+------------------+---------+--------------------------------------+--------------+------------------+--------------+--------------+--------------+------------+--------------------------+---------------------+----------+------------------+--------------------+-------------------+---------+--------------+-----------+--------------------------------------+---------+

select * from nova_bm.bm_nodes where instance_uuid='7b12b6aa-ea62-4165-82b5-68c1ebdfbb45';
+---------------------+---------------------+------------+---------+----+------+-----------+----------+-------------+---------------+-------------+--------------+------------------+--------------------------------------+------------+---------------+------------------------------------------------+-------------------------------------------------------+----------------------------------+---------+---------+--------------------------------------+---------------+
| created_at | updated_at | deleted_at | deleted | id | cpus | memory_mb | local_gb | pm_address | pm_user | pm_password | service_host | prov_mac_address | instance_uuid | task_state | terminal_port | image_path | pxe_config_path | deploy_key | root_mb | swap_mb | uuid | instance_name |
+---------------------+---------------------+------------+---------+----+------+-----------+----------+-------------+---------------+-------------+--------------+------------------+--------------------------------------+------------+---------------+------------------------------------------------+-------------------------------------------------------+----------------------------------+---------+---------+--------------------------------------+---------------+
| 2013-05-09 00:41:48 | 2013-05-09 08:54:59 | NULL | 0 | 15 | 24 | 98304 | 2048 | 10.10.16.45 | Administrator | 42560532 | ubuntu | NULL | 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45 | active | NULL | /var/lib/nova/instances/instance-00000039/disk | /tftpboot/7b12b6aa-ea62-4165-82b5-68c1ebdfbb45/config | RDC5IT9ZQL0DPTYM0KZF5VCI22Q2K4GE | 10240 | 1 | 96deccd5-0ad9-4bb5-979b-009bebac52fc | foo |
+---------------------+---------------------+------------+---------+----+------+-----------+----------+-------------+---------------+-------------+--------------+------------------+--------------------------------------+------------+---------------+------------------------------------------------+-------------------------------------------------------+----------------------------------+---------+---------+--------------------------------------+---------------+

select * from compute_nodes where compute_nodes.hypervisor_hostname='96deccd5-0ad9-4bb5-979b-009bebac52fc';
+---------------------+---------------------+------------+----+------------+-------+-----------+----------+------------+----------------+---------------+-----------------+--------------------+---------------+----------------------+-------------+--------------+------------------+-------------+--------------------------------------+---------+
| created_at | updated_at | deleted_at | id | service_id | vcpus | memory_mb | local_gb | vcpus_used | memory_mb_used | local_gb_used | hypervisor_type | hypervisor_version | cpu_info | disk_available_least | free_ram_mb | free_disk_gb | current_workload | running_vms | hypervisor_hostname | deleted |
+---------------------+---------------------+------------+----+------------+-------+-----------+----------+------------+----------------+---------------+-----------------+--------------------+---------------+----------------------+-------------+--------------+------------------+-------------+--------------------------------------+---------+
| 2013-05-09 00:42:45 | 2013-05-09 18:46:06 | NULL | 16 | 5 | 24 | 98304 | 2048 | 1 | 512 | 10 | baremetal | 1 | baremetal cpu | NULL | 97792 | 2038 | 0 | 1 | 96deccd5-0ad9-4bb5-979b-009bebac52fc | 0 |
+---------------------+---------------------+------------+----+------------+-------+-----------+----------+------------+----------------+---------------+-----------------+--------------------+---------------+----------------------+-------------+--------------+------------------+-------------+--------------------------------------+---------+
1 row in set (0.00 sec)

 grep 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45 /var/log/upstart/nova-compute.log
2013-05-09 08:48:15,392.392 19736 AUDIT nova.compute.manager [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Starting instance...
2013-05-09 08:48:15,924.924 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Attempting claim: memory 512 MB, disk 10 GB, VCPUs 1
2013-05-09 08:48:15,924.924 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Total Memory: 98304 MB, used: 0 MB
2013-05-09 08:48:15,925.925 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Memory limit: 98304 MB, free: 98304 MB
2013-05-09 08:48:15,925.925 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Total Disk: 2048 GB, used: 0 GB
2013-05-09 08:48:15,925.925 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Disk limit not specified, defaulting to unlimited
2013-05-09 08:48:15,926.926 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Total CPU: 24 VCPUs, used: 0 VCPUs
2013-05-09 08:48:15,926.926 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] CPU limit not specified, defaulting to unlimited
2013-05-09 08:48:15,926.926 19736 AUDIT nova.compute.claims [req-9671f69f-b656-4bf7-8acf-5186d721b75e baa113f6f7994ddd9c7d86945768616e 0d4df5d4fee24f18b8b5f425eb81e0c4] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Claim successful
2013-05-09 08:53:51,373.373 19736 INFO nova.virt.baremetal.pxe [-] PXE deploy started for instance 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45
2013-05-09 08:54:59,414.414 19736 INFO nova.virt.baremetal.pxe [-] PXE deploy completed for instance 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45
2013-05-09 09:05:25,370.370 19736 WARNING nova.compute.manager [-] [instance: 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45] Instance shutdown by itself. Calling the stop API.

# zgrep 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45 /var/log/upstart/nova-scheduler.log.2.gz
# zgrep 7b12b6aa-ea62-4165-82b5-68c1ebdfbb45 /var/log/upstart/nova-scheduler.log.1.gz
#

Revision history for this message
Robert Collins (lifeless) wrote :

The machines ILO believes power to be on, FWIW.

Revision history for this message
aeva black (tenbrae) wrote :

The warning was triggered during _sync_power_states periodic task getting vm_power_state of SHUTDOWN from the baremetal IPMI driver. This will happen if the IPMI driver is not able to check the power state of the node, because get_info() defaults to believing the power_state is SHUTDOWN.

https://github.com/openstack/nova/blob/stable/grizzly/nova/virt/baremetal/driver.py#L362

    def get_info(self, instance):
        node = _get_baremetal_node_by_instance_uuid(inst_uuid)
        pm = get_power_manager(node=node, instance=instance)
        ps = power_state.SHUTDOWN
        if pm.is_power_on():
            ps = power_state.RUNNING
        return {'state': ps,
                      ...

Based on the behaviour of _sync_instance_power_state,
  https://github.com/openstack/nova/blob/stable/grizzly/nova/compute/manager.py#L3773

I believe the correct default returned when the IPMI driver isn't able to determine the power state is NOSTATE, which would lead to a warning being logged, and nothing else. SHUTDOWN is clearly the wrong default.

Revision history for this message
Robert Collins (lifeless) wrote :

+1

aeva black (tenbrae)
tags: added: low-hanging-fruit
Changed in nova:
importance: Medium → Low
importance: Low → Critical
description: updated
Anita Kuno (anteaya)
Changed in nova:
assignee: nobody → Anita Kuno (akuno)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/29031

Changed in nova:
assignee: Anita Kuno (akuno) → Devananda van der Veen (devananda)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/29031
Committed: http://github.com/openstack/nova/commit/6337d704096b48970f7080c1eafc52e053de0e56
Submitter: Jenkins
Branch: master

commit 6337d704096b48970f7080c1eafc52e053de0e56
Author: Devananda van der Veen <email address hidden>
Date: Mon May 13 19:13:36 2013 -0700

    Handle IPMI transient failures better.

    Sometimes, IPMI "power status" doesn't return either "on" or "off".
    When this happens, the IPMI driver shouldn't assume what the state is,
    and the baremetal driver shouldn't assume that NOSTATE means SHUTDOWN.

    This patch also improves the doc strings for the IPMI driver.

    Fix bug 1178378.

    Change-Id: I54a5e6309f68ab8e5601e65039366d7d87c03478

Changed in nova:
status: In Progress → Fix Committed
Changed in tripleo:
status: Triaged → Fix Committed
assignee: nobody → Devananda van der Veen (devananda)
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-1 → 2013.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.