[systests] Nodes went offline during the system tests

Bug #1259609 reported by Anastasiia Naboikina
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Medium
Assigned to: Nastya Urlapova
Milestone: not listed

Bug Description

Test shows deployment error:
AssertionError: Task 'deploy' has incorrect status. error != ready

Master-node shows the following error:
[10.108.27.1] Nodes "slave-04_compute (id=1, mac=64:23:9f:8a:b7:cd),slave-02_compute (id=2, mac=64:0b:2e:1e:a2:ac),slave-01_controller (id=3, mac=64:49:14:ec:54:53),slave-03_compute (id=4, mac=64:f2:70:d2:40:e9),slave-05_cinder (id=5, mac=64:96:87:f7:b9:2e)" are offline. Remove them from environment and try again.

After reverting the snapshot and checking the state of the nodes, they are in the active state. It's even possible to SSH to them.

It might also be an issue with devops.

Reproduced on http://jenkins-product.srt.mirantis.net:8080/view/Fuelweb%20system%20tests%204.0/job/fuelmain.system_test.centos.thread_1/42/testReport/(root)/deploy_murano_simple/deploy_murano_simple/

Tags: system-tests
Anastasiia Naboikina (anaboikina) wrote :
Changed in fuel:
milestone: none → 4.0
Anastasiia Naboikina (anaboikina) wrote :
Mike Scherbakov (mihgen)
Changed in fuel:
assignee: Nastya Urlapova (aurlapova) → nobody
Nikolay Fedotov (nfedotov) wrote :

This is caused by performance issues in the system tests. Suspending the slave nodes takes a long time, so the master node may detect the slave nodes as offline (suspended) before it is suspended itself.

Nikolay Markov (nmarkov) wrote :

Tatyana, could you please set the 'Confirmed' status if the bug is successfully reproduced?

Changed in fuel:
assignee: nobody → Tatyana (tatyana-leontovich)
Tatyanka (tatyana-leontovich) wrote :

No, I cannot, because I have not had a chance to reproduce it.

Changed in fuel:
assignee: Tatyana (tatyana-leontovich) → Nikolay Markov (nmarkov)
Nikolay Markov (nmarkov)
Changed in fuel:
status: New → Incomplete
assignee: Nikolay Markov (nmarkov) → nobody
Changed in fuel:
assignee: nobody → Timur Nurlygayanov (tnurlygayanov)
Timur Nurlygayanov (tnurlygayanov) wrote :

This looks like a performance problem. It is reproduced when a user tries to deploy a multinode HA environment (with 4+ nodes) simultaneously. The root of the problem is the hardcoded timeouts for different operations in the Fuel manifests. This needs to be checked on another environment.

summary: - [systests] Nodes went offline on test deploy_murano_simple
+ [systests] Nodes went offline during the system tests
Changed in fuel:
assignee: Timur Nurlygayanov (tnurlygayanov) → nobody
assignee: nobody → Nastya Urlapova (aurlapova)
Timur Nurlygayanov (tnurlygayanov) wrote :

Anastasia, can you please look at this problem with the system tests? Anastasia Naboikina can provide more detailed information about the root of this problem.

Nastya Urlapova (aurlapova) wrote :

This problem was fixed in systests; the test now waits for the nodes to report online status.
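
A minimal sketch of the kind of wait described above, assuming the test can fetch node records that carry an `online` flag. The function name `wait_nodes_online`, the `fetch_nodes` callable, and the default timeout and interval are hypothetical and are not taken from the actual systests code.

```python
import time
from typing import Callable, Iterable, Mapping


def wait_nodes_online(
    fetch_nodes: Callable[[], Iterable[Mapping]],
    timeout: float = 300.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every node returned by fetch_nodes() has online=True.

    fetch_nodes is a hypothetical callable that returns node records
    (for example, parsed from the master node's API) as mappings with
    "name" and "online" keys.
    """
    deadline = clock() + timeout
    while True:
        # Collect the names of nodes that are not yet reported as online.
        offline = [n["name"] for n in fetch_nodes() if not n.get("online")]
        if not offline:
            return
        if clock() >= deadline:
            raise TimeoutError("Nodes still offline: " + ", ".join(offline))
        sleep(interval)
```

Calling such a helper right after a snapshot revert, and before starting the deploy task, would keep the test from racing against stale node status. The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays.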

Changed in fuel:
status: Incomplete → Fix Released
Egor Kotko (ykotko) wrote :

{"build_id": "2014-05-21_01-10-31", "mirantis": "yes", "build_number": "214", "ostf_sha": "353f918197ec53a00127fd28b9151f248a2a2d30", "nailgun_sha": "0b6e8eabaccad2aa29519561ce7cde9df9292964", "production": "docker", "api": "1.0", "fuelmain_sha": "910f262f85e94bef08e0e9b9d6230ad890bf139e", "astute_sha": "9a0d86918724c1153b5f70bdae008dea8572fd3e", "release": "5.0", "fuellib_sha": "3d92142a5643af82596f0450e39282550a45e5db"}

I have the same issue on ISO 214.

Changed in fuel:
status: Fix Released → New
Vladimir Sharshov (vsharshov) wrote :

My suggestion: after reverting the snapshot and synchronizing NTP, there is a large time difference between the last node agent report and the actual time. Also, a background process in Nailgun periodically (every 5 minutes) checks the database and sets the state of nodes to offline if no reports from the nailgun agent arrived between runs.

Because this process runs in the background, we can hit a moment when we start a deployment and, at the same time, at least one node in the cluster is set as offline.

Solutions:
 - after NTP synchronization, run nailgun-agent. It will update the node statuses and prevent the unexpected offline status (preferable);
 - update the datetime field in the nodes table on the master node to the current time, which also prevents this behavior.
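
The timing problem and both proposed fixes can be illustrated with a small model. This is only a sketch under stated assumptions: `is_considered_offline`, `refresh_last_reports`, and the five-minute threshold are hypothetical and do not reproduce Nailgun's actual checker code or database schema.

```python
from datetime import datetime, timedelta
from typing import Dict

# Assumed threshold. The comment above only says the checker runs every
# five minutes; the real cut-off used by Nailgun may differ.
OFFLINE_AFTER = timedelta(minutes=5)


def is_considered_offline(
    last_report: datetime,
    now: datetime,
    threshold: timedelta = OFFLINE_AFTER,
) -> bool:
    """Return True when the last agent report is older than the threshold."""
    return now - last_report > threshold


def refresh_last_reports(
    last_reports: Dict[str, datetime], now: datetime
) -> Dict[str, datetime]:
    """Model either fix: every node ends up with a fresh report timestamp.

    Running the agent on each node and updating the timestamp field
    directly have the same effect in this simplified model.
    """
    return {name: now for name in last_reports}
```

With a last report taken at snapshot time and a clock that jumps forward after NTP synchronization, `is_considered_offline` returns `True` for every node even though the nodes are reachable. After `refresh_last_reports`, the same check returns `False`.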

Changed in fuel:
milestone: 4.0 → 5.0.1
importance: High → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-main (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94629

Changed in fuel:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/94630

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/94630
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=a1f4dc1bd0d1333e52d2db09131fc4d01c23753c
Submitter: Jenkins
Branch: master

commit a1f4dc1bd0d1333e52d2db09131fc4d01c23753c
Author: NastyaUrlapova <email address hidden>
Date: Wed May 21 18:28:29 2014 +0400

    Added nailgun agent run after snapshot revert

    Change-Id: Ib8229b07c3ceb948ca96014044a333b2f82ce30b
    Closes-Bug: #1259609

Changed in fuel:
status: In Progress → Fix Committed
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: 5.0.1 → 5.0
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/94825

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (master)

Reviewed: https://review.openstack.org/94825
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=f0406e7673aad153dadf0c74df845134be04c572
Submitter: Jenkins
Branch: master

commit f0406e7673aad153dadf0c74df845134be04c572
Author: NastyaUrlapova <email address hidden>
Date: Thu May 22 15:14:51 2014 +0400

    Filter nailgun notifications

    Change-Id: I4546b8ab942f7eb7a9a63f00cc3382d71a44a59a
    Related-Bug: #1259609

Changed in fuel:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-main (stable/4.1)

Related fix proposed to branch: stable/4.1
Review: https://review.openstack.org/97883

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-main (stable/4.1)

Reviewed: https://review.openstack.org/97883
Committed: https://git.openstack.org/cgit/stackforge/fuel-main/commit/?id=c17d113a571af1aed52077843ce44b00f120664c
Submitter: Jenkins
Branch: stable/4.1

commit c17d113a571af1aed52077843ce44b00f120664c
Author: NastyaUrlapova <email address hidden>
Date: Thu May 22 15:14:51 2014 +0400

    Filter nailgun notifications

    Change-Id: I4546b8ab942f7eb7a9a63f00cc3382d71a44a59a
    Related-Bug: #1259609

Aleksei Stepanov (penguinolog) wrote :

Another way to reproduce:
https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.filling_root/76/testReport/%28root%29/Case_FillRootPrimaryController__Config_ceph_all_on_neutron_vlan/

Traceback (most recent call last):
File "/usr/lib/python2.7/unittest/case.py", line 331, in run
testMethod()
File "/usr/lib/python2.7/unittest/case.py", line 1043, in runTest
self._testFunc()
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
compatability.capture_type_error(s_func)
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
func()
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
func(test_case.state.get_state())
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/system_test/core/factory.py", line 37, in wrapper
result = func(*args, **kwargs)
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/system_test/helpers/decorators.py", line 40, in wrapper
result = func(*args, **kwargs)
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/system_test/actions/strength_actions.py", line 141, in get_pcs_initial_state
pcs_status = parse_pcs_status_xml(remote)
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/fuelweb_test/helpers/pacemaker.py", line 107, in parse_pcs_status_xml
remote_ip, 'pcs status xml')['stdout_str']
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/fuelweb_test/helpers/ssh_manager.py", line 165, in execute_on_remote
result = self.execute(ip=ip, port=port, cmd=cmd)
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/fuelweb_test/helpers/ssh_manager.py", line 141, in execute
remote = self.getremote(ip=ip, port=port)
File "/home/jenkins/workspace/9.0.system_test.ubuntu.filling_root/fuelweb_test/helpers/ssh_manager.py", line 101, in getremote
private_keys=keys
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 259, in __init__
self.reconnect()
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 302, in reconnect
self.connect()
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/retry.py", line 27, in wrapper
return func(*args, **kwargs)
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 289, in connect
password=self.password, pkey=private_key)
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/paramiko/client.py", line 283, in connect
to_try = list(self._families_and_addresses(hostname, port))
File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/paramiko/client.py", line 187, in _families_and_addresses
addrinfos = socket.getaddrinfo(hostname, port, socket.AF_UNSPEC, socket.SOCK_STREAM)
gaierror: [Errno -2] Name or service not known

Changed in fuel:
status: Fix Released → Confirmed