Nodes go OFFLINE during OSTF tests

Bug #1592570 reported by Sergii Turivnyi
This bug affects 1 person
Affects: Mirantis OpenStack
Status: Invalid
Importance: Critical
Assigned to: Sergii Turivnyi
Milestone: --

Bug Description

Detailed bug description:
Controller nodes go offline while the OSTF tests are running. The environment uses Neutron DVR, Sahara, Ceilometer, and Ironic.
Several OSTF tests fail:

HA tests. Duration 30 sec - 8 min
Check data replication over mysql
-- Time limit exceeded while waiting for detect mysql node to finish. Please refer to OpenStack logs for more details.

Check if amount of tables in databases is the same on each node
-- Time limit exceeded while waiting for get amount of tables for each database to finish. Please refer to OpenStack logs for more details.

Check galera environment state
-- Time limit exceeded while waiting for get status from galera node to finish. Please refer to OpenStack logs for more details.

RabbitMQ availability
-- Time limit exceeded while waiting for to finish. Please refer to OpenStack logs for more details.
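
When all four HA checks time out like this, the underlying Galera and RabbitMQ state can be probed by hand. A minimal sketch, assuming root shell access on a controller node (the log-in path and credentials are assumptions, not taken from this report):

```shell
# Run on any controller node. A healthy cluster reports the full
# cluster size (3 controllers here) and a 'Synced' local state;
# rabbitmqctl should list every controller under running_nodes.
size=$(mysql -Ne "SHOW STATUS LIKE 'wsrep_cluster_size';" | awk '{print $2}')
state=$(mysql -Ne "SHOW STATUS LIKE 'wsrep_local_state_comment';" | awk '{print $2}')
echo "cluster_size=$size local_state=$state"
rabbitmqctl cluster_status
```

If any of these hang or report a partitioned cluster, the OSTF timeouts above are a symptom rather than the root cause.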

Platform services functional tests. Duration 3 min - 60 min
Ceilometer test to check notifications from Sahara
-- Correctly registered image to create Sahara cluster not found.

Sahara test for launching a simple Vanilla2 cluster
-- Authorization failure. Please provide the valid credentials for your OpenStack environment, and reattempt.

Steps to reproduce:
1. Get ISO: http://srv52-bud.infra.mirantis.net/fuelweb-iso/fuel-9.0-mos-465-2016-06-09_22-51-38.iso.torrent
2. Nodes = 6
    Controller + Mongo = 3
    Compute + Cinder = 2
    Ironic = 1
    Neutron DVR
    Nova quotas
    Cinder LVM over iSCSI for volumes
    OpenStack debug logging
    Install Sahara
    Install Ceilometer and Aodh
    Install Ironic
3. Deploy
    -- Deployment is successful
4. Run all OSTF tests

Expected results:
Environment in operational state
All OSTF tests pass

Actual result:
Several OSTF tests fail
2 controller nodes are offline

Reproducibility:
--

Workaround:
--

Impact:
--

Description of the environment:
see attachments

Additional information:
Snapshot: https://drive.google.com/a/mirantis.com/file/d/0B8hkiEm94sEtVE0ycHVETG5rdjA/view?usp=sharing

See attachments

Tags: area-sahara
Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :
tags: added: area-sahara
Changed in mos:
status: New → Confirmed
importance: High → Critical
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Could you please attach the diagnostic snapshot as well? The PNG screenshot is not very informative.

Access to the environment where the bug was found would be of great help.

Revision history for this message
Sergii Turivnyi (sturivnyi) wrote : Re: [Bug 1592570] Re: Nodes triggers to OFFLINE during OSTF tests

Environment: http://qa-servers.vm.mirantis.net/runs/244

I'll attach the diagnostic snapshot ASAP.


Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

Diagnostic Snapshot:
https://drive.google.com/a/mirantis.com/file/d/0B8hkiEm94sEtVE0ycHVETG5rdjA/view?usp=sharing


Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :
Changed in mos:
assignee: nobody → MOS Sahara (mos-sahara)
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

(venv-mos)mmalchuk@srv146-bud:~$ virsh list --all | grep sturivnyi_465_bug_1592019
 263 sturivnyi_465_bug_1592019_admin running
 264 sturivnyi_465_bug_1592019_slave-01 running
 266 sturivnyi_465_bug_1592019_slave-03 running
 267 sturivnyi_465_bug_1592019_slave-04 running
 268 sturivnyi_465_bug_1592019_slave-05 running
 270 sturivnyi_465_bug_1592019_ironic-slave-01 running
 - sturivnyi_465_bug_1592019_slave-02 shut off
 - sturivnyi_465_bug_1592019_slave-06 shut off

These two nodes did not start after the snapshot was reverted.
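
The shut-off domains can be brought back in one pass. A sketch, assuming the `sturivnyi_465_bug_1592019` prefix from the listing above and that `virsh` talks to the same hypervisor:

```shell
#!/bin/sh
# Start every shut-off libvirt domain whose name begins with a prefix.
# Shut-off domains show '-' in the Id column of 'virsh list --all'.
prefix=sturivnyi_465_bug_1592019

virsh list --all |
  awk -v p="$prefix" '$1 == "-" && index($2, p) == 1 { print $2 }' |
  while read -r dom; do
    virsh start "$dom"
  done
```

This is equivalent to the manual `virsh start` in the next comment, just applied to every matching domain.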

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

After manually starting the slave, it appears in the list as online:

(venv-mos)mmalchuk@srv146-bud:~$ virsh start sturivnyi_465_bug_1592019_slave-06
Domain sturivnyi_465_bug_1592019_slave-06 started

(venv-mos)mmalchuk@srv146-bud:~$ ssh root@10.109.11.2
root@10.109.11.2's password:
Last login: Wed Jun 15 08:51:42 2016 from 10.109.11.1
[root@nailgun ~]# fuel node | grep 10.109.11.8
 2 | ready | Untitled (f8:14) | 1 | 10.109.11.8 | 46:9a:2f:27:fb:30 | controller, mongo | | 1 | 1

Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

It looks like the root cause of this issue is a lack of RAM on the server.
I'll try to deploy one more time on another server.

Revision history for this message
Vitalii Gridnev (vgridnev) wrote :

It's not clear to me why this is a Sahara problem at all. Most probably it is a configuration issue or unstable operation of the environment. From ostf.log and the Sahara logs I can see that Sahara successfully performed CRUD operations on cluster templates and node group templates; the VanillaTwoTemplatesTest seems to have passed:

2016-06-14 15:21:15 INFO (test_mixins) STEP:8, verify action: 'deleting cluster template'
2016-06-14 15:21:15 DEBUG (session) REQ: curl -g -i --insecure -X DELETE http://10.109.15.4:8386/v1.1/592f206c5b4546b6b61caac81f184de9/cluster-templates/e874307f-ce6e-4ea6-b6f8- 3def1acf27f5 -H "User-Agent: python-saharaclient" -H "X-Auth-Token: {SHA1}16ea324306090130cbb472f2854d565413ed176b"
2016-06-14 15:21:16 DEBUG (__init__) Authenticating user token
2016-06-14 15:21:16 DEBUG (__init__) Received request from user: user_id e0fa1accd35142f6866c727e180aabec, project_id 25016f68cba346449cd9cc879ad75e75, roles admin
2016-06-14 15:21:16 DEBUG (session) RESP: [204] Content-Length: 0 Via: 1.1 apache_api_proxy:8386 Server: Apache Connection: close Date: Tue, 14 Jun 2016 15:21:16 GMT Content-Type: application/json X-Openstack-Request-Id: req-2f0855f3-8368-402c-990d-62eb3e284612

After that I can see some failures of DB operations in sahara-engine that are unrelated to Sahara itself.

Logs about those failures in ostf.log:

http://paste.openstack.org/show/516185/

and sahara-engine confirms that access to the DB can be done successfully:

http://paste.openstack.org/show/516186/

So, in my opinion, we should investigate whether everything was OK with the DB for the whole period of the OSTF test run.
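
One way to check that, sketched here as an assumption (the log paths and message patterns are typical Galera/MySQL defaults, not confirmed from this snapshot): scan the database log on each controller for cluster state transitions during the OSTF window.

```shell
#!/bin/sh
# Look for Galera state changes (donor/desync shifts, loss of the
# primary component, evictions, timeouts) in the MySQL error log.
for log in /var/log/mysqld.log /var/log/mysql/error.log; do
  [ -f "$log" ] || continue
  grep -En 'Shifting|non-[Pp]rim|evict|Timeout' "$log"
done
```

Matches clustered around the OSTF run timestamps would confirm that the DB, not Sahara, was unhealthy.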

Changed in mos:
assignee: MOS Sahara (mos-sahara) → nobody
Revision history for this message
Vitalii Gridnev (vgridnev) wrote :

A misprint in my comment; it should read:

'and sahara-engine confirms that access to DB CANNOT be done successfully:'

Revision history for this message
Dina Belova (dbelova) wrote :

Assigning back to Sergii until news comes about the new run.

Changed in mos:
assignee: nobody → Sergii Turivnyi (sturivnyi)
Revision history for this message
Sergii Turivnyi (sturivnyi) wrote :

Tested on another lab; it works.

Changed in mos:
status: Confirmed → Invalid