nova-compute status is "XXX" in nova-manage service list

Bug #1439145 reported by dshimo
Affects: Juniper Openstack
Status: New
Importance: Undecided
Assigned to: Sanju Abraham

Bug Description

Customer contrail version: 2.0-22 (Icehouse)
Customer Host OS: Ubuntu 14.04.1 LTS

The customer tried to create a new service-instance from the Contrail GUI on this setup, but the creation failed and the instance was left in the spawning state.
The customer then deleted this service-instance from the Contrail GUI; the deletion completed.

They then checked the status of Contrail/OpenStack (including Nova) on the control and compute nodes:
# openstack-status
# contrail-status
# nova-manage service list

They found some errors and a core file in the status outputs.

<--- Control node:

1)
ERROR and NOSTATE for default-domain__demo__L2-L3-service-chain__1
~~~~~
root@B-9-BL460C-06:~# nova list
+--------------------------------------+---------------------------------------------------------+-----------+------------+-------------+----------------------------------------------------------------------------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+---------------------------------------------------------+-----------+------------+-------------+----------------------------------------------------------------------------------------------------------+
| 82b1de52-ee1d-42c6-b7ef-0de26caec77d | default-domain__demo__L2-L3-service-chain__1 | ERROR | - | NOSTATE | |
| 7f3d6150-24c8-45f7-b610-13620b808cd4 | default-domain__demo__NAT-L3-FVP-JUN-1__1 | ACTIVE | - | Running | mgmt=10.0.9.30; L3-FVP-JUN-1-inside-vn=172.22.183.4; Internet=10.0.4.24 |
...<SNIP>...
+--------------------------------------+---------------------------------------------------------+-----------+------------+-------------+----------------------------------------------------------------------------------------------------------+
~~~~~

2)
nova-compute State is XXX in nova-manage service list (below).
~~~~~
root@B-9-BL460C-06:~# nova-manage service list
Binary Host Zone Status State Updated_At
nova-scheduler B-9-BL460C-06 internal enabled :-) 2015-03-24 10:10:51
nova-console B-9-BL460C-06 internal enabled :-) 2015-03-24 10:10:50
nova-consoleauth B-9-BL460C-06 internal enabled :-) 2015-03-24 10:10:55
nova-conductor B-9-BL460C-06 internal enabled :-) 2015-03-24 10:10:56
nova-compute B-9-BL460C-05 nova enabled XXX 2015-03-22 12:44:47 <--- HERE
~~~~~
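
For reference, nova-manage reports a service as XXX when its Updated_At heartbeat is older than nova's service_down_time; here the nova-compute heartbeat stopped on 2015-03-22. The relevant nova.conf options are sketched below with the upstream defaults, not the customer's confirmed values:
~~~~~
# /etc/nova/nova.conf (upstream defaults shown; customer values not confirmed)
[DEFAULT]
# How often each service writes its heartbeat to the database (seconds).
report_interval = 10
# A service whose last heartbeat is older than this is reported as down ("XXX").
service_down_time = 60
~~~~~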

<--- Compute node:

3)
A core file was created on the compute node
~~~~~
root@B-9-BL460C-05:~# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

========Run time service failures=============
/var/crashes/core.contrail-vroute.17893.B-9-BL460C-05.1427082091 <--- this is newer
/var/crashes/core.contrail-vroute.1643.B-9-BL460C-05.1426214768
/var/crashes/core.contrail-vroute.2981.B-9-BL460C-05.1426154350
~~~~~
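
For reference, a backtrace could be pulled from the newest core with something like the following (the binary path is an assumption about the usual install location, and debug symbols would be needed for a readable trace):
~~~~~
# on the compute node B-9-BL460C-05
gdb -batch -ex bt /usr/bin/contrail-vrouter-agent \
    /var/crashes/core.contrail-vroute.17893.B-9-BL460C-05.1427082091
~~~~~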

Their setup was recovered by the following step.
service nova-compute restart
~~~~~ AFTER
root@B-9-BL460C-06:~# nova-manage service list
Binary Host Zone Status State Updated_At
nova-scheduler B-9-BL460C-06 internal enabled :-) 2015-03-24 10:18:21
nova-console B-9-BL460C-06 internal enabled :-) 2015-03-24 10:18:21
nova-consoleauth B-9-BL460C-06 internal enabled :-) 2015-03-24 10:18:15
nova-conductor B-9-BL460C-06 internal enabled :-) 2015-03-24 10:18:17
nova-compute B-9-BL460C-05 nova enabled :-) 2015-03-24 10:18:19
~~~~~

nova delete 82b1de52-ee1d-42c6-b7ef-0de26caec77d
~~~~~ AFTER
The ERROR-state service-instance was deleted:
root@B-9-BL460C-06:~# nova list
+--------------------------------------+---------------------------------------------------------+-----------+------------+-------------+----------------------------------------------------------------------------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+---------------------------------------------------------+-----------+------------+-------------+----------------------------------------------------------------------------------------------------------+
| 7f3d6150-24c8-45f7-b610-13620b808cd4 | default-domain__demo__NAT-L3-FVP-JUN-1__1 | ACTIVE | - | Running | mgmt=10.0.9.30; L3-FVP-JUN-1-inside-vn=172.22.183.4; Internet=10.0.4.24 |
...<SNIP>...
~~~~~

The attached file contains the contrail/nova logs from the customer's nodes.

Tags: nova openstack
dshimo (dshimo)
information type: Proprietary → Public
tags: added: nova openstack
Revision history for this message
dshimo (dshimo) wrote :

Hi team,
The customer wants to know the root cause ASAP (if possible, by next week).

Revision history for this message
Vedamurthy Joshi (vedujoshi) wrote :

Could you attach the logs from /var/log/nova from both B-9-BL460C-06 and B-9-BL460C-05?

Revision history for this message
dshimo (dshimo) wrote :

I have attached the logs from /var/log/nova/ and the core file.
20150402_log.zip, which contains:
B-9-BL460C-06_control_nova.zip
B-9-BL460C-05_compute_nova.zip
B-9-BL460C-06_control_terminal.log
B-9-BL460C-05_compute_terminal.log
core.contrail-vroute.17893.B-9-BL460C-05.1427082091

Revision history for this message
Nagabhushana R (bhushana) wrote :

We need to know a few things:

1-> Did rabbitmq restart at any time?
2-> Network connectivity to RMQ from nova-compute. Was it stable?
3-> Are they running HA?

I see the following in the log messages:

2015-03-24 18:35:14.320 2972 ERROR nova.openstack.common.periodic_task [-] Error during ComputeManager.update_available_resource: Timed out waiting for a reply to message ID 446d5968d5ff469ea71c84a85d9f2b6d
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task Traceback (most recent call last):
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/nova/openstack/common/periodic_task.py", line 182, in run_periodic_tasks
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task task(self, context)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 5460, in update_available_resource
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task rt.update_available_resource(context)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/nova/openstack/common/lockutils.py", line 249, in inner
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task return f(*args, **kwargs)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 315, in update_available_resource
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task context, self.host, self.nodename)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/nova/objects/base.py", line 110, in wrapper
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task args, kwargs)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 425, in object_class_action
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task objver=objver, args=args, kwargs=kwargs)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/oslo/messaging/rpc/client.py", line 150, in call
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task wait_for_reply=True, timeout=timeout)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/oslo/messaging/transport.py", line 90, in _send
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task timeout=timeout)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 412, in send
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task return self._send(target, ctxt, message, wait_for_reply, timeout)
2015-03-24 18:35:14.320 2972 TRACE nova.openstack.common.periodic_task File "/usr/lib/python2.7/dist-packages/oslo/messaging/_drivers/amqpdriver.py", line 403, in _send
2015-03-24 18...
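
For reference, the MessagingTimeout above points at the AMQP path between nova-compute and the broker; connectivity could be sanity-checked with something like the following (the rabbit_host value has to be read from the customer's nova.conf, and 5672 is only the default AMQP port):
~~~~~
# on the compute node: find which broker nova-compute is configured to use
grep -E 'rabbit_host|rabbit_port' /etc/nova/nova.conf

# check TCP reachability to that broker (replace <rabbit_host> accordingly)
nc -zv <rabbit_host> 5672

# on the node running RabbitMQ: broker health and currently open connections
rabbitmqctl status
rabbitmqctl list_connections
~~~~~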


Revision history for this message
dshimo (dshimo) wrote :

1-> Did rabbitmq restart at any time?
The customer has not restarted rabbitmq at any time.

2-> Network connectivity to RMQ from nova-compute. Was it stable?
nova-compute was restarted on 24 March, and the system has been stable since then.

3-> Are they running HA?
No, they are not running HA.

Revision history for this message
dshimo (dshimo) wrote :

> Could you please ask them to attach /var/log/contrail/ha/rmq-monitor.log. This will help us in checking if RMQ was stable.
The rmq-monitor.log (and the /var/log/contrail/ha/ directory itself) does not exist on the customer's system.

Revision history for this message
dshimo (dshimo) wrote :

Hi Team,
Do we need any other logs for root cause analysis?
I apologize for rushing you.

Revision history for this message
dshimo (dshimo) wrote :

Hi Team,

Is there any update or progress on this?
Regards,

Revision history for this message
Sanju Abraham (asanju) wrote :

Hi Daisuke-san,

The issue is related to the server 40.0.0.1 being down OR the AMQP service on that node being down. Could you please provide the node uptime for the server with IP 40.0.0.1, and also attach /var/log/rabbitmq/*.log from that node.

Thanks,
Sanju

Changed in juniperopenstack:
assignee: nobody → Sanju Abraham (asanju)
Revision history for this message
dshimo (dshimo) wrote :

Hi Sanju-san,

Thanks for taking this report.
I will ask the customer for /var/log/rabbitmq/*.log and the uptime of the compute node.
Should it be acquired with the uptime command?
=====
e.g. root@dell-pe-630:~# uptime
 19:41:42 up 44 min, 2 users, load average: 0.00, 0.01, 0.05
=====

Regards,
- Daisuke

Revision history for this message
dshimo (dshimo) wrote :

Hi Sanju-san,

Sorry for my late update.
I got the RabbitMQ log from the customer's server and have attached it.
Regards,

Revision history for this message
dshimo (dshimo) wrote :

Hi,
Is there any update or progress on this?
Regards,
