ntpdate run before p_ntp pacemaker resource is up

Bug #1572521 reported by Alexey Stepanov on 2016-04-20
This bug affects 4 people
Affects                      Importance  Assigned to
Fuel for OpenStack           High        Fuel QA Team
Fuel for OpenStack (7.0.x)   High        Alexander Kurenyshev
Fuel for OpenStack (Mitaka)  High        Fuel QA Team

Bug Description

Steps to reproduce:
1. Create cluster
2. Add 3 nodes with controller and ceph OSD roles
3. Add 1 node with ceph OSD roles
4. Add 2 nodes with compute and ceph OSD roles
5. Deploy the cluster
6. Cold reboot all nodes in the cluster

Expected result:
The cluster is working; all pcs resources are up after the nodes start up

Actual result:
The cluster is broken. All resources are down.
Presumably the main problem is a lack of resources on the VM nodes. When we added memory to the controller nodes, mysql enabled its performance schema and consumed all the free memory again.
Such behavior doesn't appear on the baremetal lab with 64 GB of memory on the controller nodes.

This leads to the test [1] failing on ntpdate, because all resources are down, including the ntp one:

2016-04-20 03:50:42,371 - ERROR decorators.py:126 -- Traceback (most recent call last):
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/tests/tests_strength/test_restart.py", line 165, in ceph_ha_restart
    'slave-04']))
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/models/fuel_web_client.py", line 1860, in cold_restart_nodes
    self.environment.sync_time()
  File "/home/jenkins/workspace/9.0.system_test.ubuntu.thread_3/fuelweb_test/models/environment.py", line 137, in sync_time
    new_time = sync_time(self.d_env, nodes_names, skip_sync)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/retry.py", line 27, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 45, in sync_time
    g_ntp.do_sync_time(g_ntp.other_ntps)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 134, in do_sync_time
    .format(self.report_not_synchronized(ntps)))
TimeoutError: Time on nodes was not set with 'ntpdate':
[(u'slave-04', ['Wed Apr 20 03:46:22 UTC 2016\n'])]

This error couldn't be fixed in the test by smartly waiting for the ntp resource. When we tried to reproduce it, the cluster remained broken even after waiting for one day.

[1] https://github.com/openstack/fuel-qa/blob/391b1219ebfa5f3d136054931076cbd03644c6eb/fuelweb_test/tests/tests_strength/test_restart.py#L90

Dmitry Klenov (dklenov) wrote :

Alexey, please attach snapshot if possible.

tags: added: area-library
Alexey Stepanov (astepanov-m) wrote :
Valeriy Saharov (vsakharov) wrote :

The bug cannot be reproduced on the latest ISO. This error also doesn't appear in the subsequent builds (from 84 to current) of the 9.0.system_test.ubuntu.thread_3 job (https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.thread_3/).
The error appeared if a ceph-osd node didn't reply for more than 10 minutes.
The ceph_health output on node-4 contained a warning about it:
"[10.109.20.4] out: HEALTH_WARN clock skew detected on mon.node-5; too many PGs per OSD (499 > max 300); Monitor clock skew detected".

tags: added: tricky
Dmitry Pyzhov (dpyzhov) wrote :

Please reopen the bug if you reproduce it

no longer affects: fuel/newton
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Valeriy Saharov (vsakharov)

Bug reproduced again; the compute node (slave-04) wasn't synchronized.

Changed in fuel:
status: Invalid → New
Changed in fuel:
status: New → Confirmed
tags: added: swarm-blocker
Dmitry Klenov (dklenov) on 2016-05-19
Changed in fuel:
assignee: Valeriy Saharov (vsakharov) → Michael Polenchuk (mpolenchuk)
Michael Polenchuk (mpolenchuk) wrote :

@sshevorakov, I wonder why this bug is a swarm blocker?
I believe shutting down all the controllers and then booting them back up isn't a common use case.
ETA: unknown

Michael Polenchuk (mpolenchuk) wrote :

Almost all services are down after a "cold" shutdown -> boot.

Michael Polenchuk (mpolenchuk) wrote :

Please make sure the tests properly wait for the services to start.
This is not reproducible by manual actions, so it might be considered "invalid" or re-assigned to the qa-team in order to fix the tests.

Changed in fuel:
assignee: Michael Polenchuk (mpolenchuk) → Fuel QA Team (fuel-qa)
Dmitry Pyzhov (dpyzhov) on 2016-05-20
tags: added: area-qa
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Alexander Kurenyshev (akurenyshev)

This bug is really hard to reproduce; I will try to do that and wait for it on CI.


So, as we can see in the log, fuel-devops couldn't sync time on one node, and this node is not a controller.
The difference between syncing time on controllers and on other nodes is the NTP host. For controllers it is a remote host; for the other nodes it is the NTP server that runs in a virtual network namespace managed by Pacemaker via the vip__vrouter resource.

I tried to sync time on compute:

root@node-5:~# ntpdate -d 10.109.1.8
 1 Jun 10:31:29 ntpdate[32556]: ntpdate 4.2.6p5@1.2349-o Thu Feb 11 18:30:41 UTC 2016 (1)
Looking for host 10.109.1.8 and service ntp
host found : 10.109.1.8
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
transmit(10.109.1.8)
10.109.1.8: Server dropped: no data
server 10.109.1.8, port 123
stratum 0, precision 0, leap 00, trust 000
refid [10.109.1.8], delay 0.00000, dispersion 64.00000
transmitted 4, in filter 4
reference time: 00000000.00000000 Mon, Jan 1 1900 0:00:00.000
originate timestamp: 00000000.00000000 Mon, Jan 1 1900 0:00:00.000
transmit timestamp: daf93807.6a5ce4c3 Wed, Jun 1 2016 10:31:35.415
filter delay: 0.00000 0.00000 0.00000 0.00000
         0.00000 0.00000 0.00000 0.00000
filter offset: 0.000000 0.000000 0.000000 0.000000
         0.000000 0.000000 0.000000 0.000000
delay 0.00000, dispersion 64.00000
offset 0.000000

 1 Jun 10:31:37 ntpdate[32556]: no server suitable for synchronization found

Then I tried to find out what's wrong with NTP in the namespace, on a controller:
root@node-6:~# ip netns exec vrouter netstat -nap|grep :123

So, there is no live NTP instance listening in the namespace.

Pacemaker status:
root@node-6:~# pcs status
Cluster name:
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Wed Jun 1 11:46:43 2016 Last change: Tue May 31 12:27:14 2016 by root via cibadmin on node-2.test.domain.local
Stack: corosync
Current DC: node-6.test.domain.local (version 1.1.14-70404b0) - partition with quorum
3 nodes and 46 resources configured

Online: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]

Full list of resources:

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-1.test.domain.local
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-6.test.domain.local
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-6.test.domain.local
 vip__public (ocf::fuel:ns_IPaddr2): Stopped
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 Clone Set: clone_p_mysqld [p_mysqld]
     Stopped: [ node-1.test.domain.local node-2.test.domain.local node-6.test.domain.local ]
 sysinfo_node-2.test.domain.local (ocf::pacemaker:SysInfo): Started node-2.test.domain.local
 sysinfo_node-6.test.domain.local (ocf::pacemaker:SysInfo): Started node-6.test.domain.local
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-2.test.domain.local ]
     Slaves: [ node-6.test.domain.local ]
     Stopped: [ node-1.test.domain.local ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Slaves: [ ...


Changed in fuel:
assignee: Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)
Changed in fuel:
milestone: 10.0 → 9.0
milestone: 9.0 → 10.0
assignee: Fuel Sustaining (fuel-sustaining-team) → Oleksiy Molchanov (omolchanov)
tags: added: swarm-fail
removed: swarm-blocker
Dmitry Pyzhov (dpyzhov) wrote :

The bug appears only in several automated test cases that revert environment snapshots. We need a live environment in order to investigate it deeper. We cannot commit to fixing the bug by HCF, and I propose to move it out of the release. Marking as Incomplete until we have an environment.

Changed in fuel:
status: Confirmed → Incomplete

This problem reproduces not only after a snapshot revert but also after a cluster restart.
The env is available for the next 3-4 hours.

Changed in fuel:
status: Incomplete → Confirmed
Dmitry Pyzhov (dpyzhov) on 2016-06-06
tags: added: move-to-mu
Oleksiy Molchanov (omolchanov) wrote :

The problem is that you are doing time_sync a short time after the node comes online. It works for controller nodes, but non-controller nodes are trying to reach an ntp server that is managed by pacemaker. It fails because the ntp pacemaker resource is not started yet; it needs more time.

I think you should sync time with the master node for all nodes, or wait until the ntp resource is up, but I prefer the first approach.
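A rough sketch of the first suggestion (all nodes syncing against the master node) might look like this. The master address, retry counts and the injectable `run` callable are hypothetical illustrations, not taken from fuel-qa:

```python
import subprocess
import time

# Hypothetical master node address; the real one comes from the env config.
MASTER_NTP_IP = "10.109.0.2"


def sync_with_master(run=subprocess.call, retries=5, delay=10):
    """Run ntpdate against the master node, retrying while it is unreachable.

    `run` is injectable so the retry logic can be exercised without a real
    master node; by default it shells out via subprocess.call.
    """
    for _ in range(retries):
        # -b steps the clock immediately, -u uses an unprivileged source port
        if run(["ntpdate", "-bu", MASTER_NTP_IP]) == 0:
            return True
        time.sleep(delay)
    return False
```

The point of the sketch is only that the master's ntpd is up long before pacemaker brings up p_ntp, so retrying against the master converges quickly.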

Changed in fuel:
assignee: Oleksiy Molchanov (omolchanov) → Fuel QA Team (fuel-qa)

Moving back to 9.0 release since it is a swarm-blocker.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Alexander Kurenyshev (akurenyshev)
Dmitry (dtsapikov) wrote :

The problem was reproduced on 7.0
fail_error_ha_one_controller_backup_restore-fuel-snapshot-2016-06-08_17-34-06.tar.xz

Vadim Rovachev (vrovachev) wrote :

In addition to comment #18: the 7.0 and 8.0 swarm tests use devops version 2.9.19, so this bug was reproduced on that version as well.

Related fix proposed to branch: master
Change author: Alexander Kurenyshev <email address hidden>
Review: https://review.fuel-infra.org/21894

Guys, the problem is not in devops and its revert mechanism.
The main problem is insufficient resources on the nodes: the CPU load average is 6 and processes start to use swap.
As V. Kuklin and I decided, we have to increase the memory for the slave nodes, and we'll see if this helps.
The patch above does exactly that.

Reviewed: https://review.fuel-infra.org/21894
Submitter: Ivan Remizov <email address hidden>
Branch: master

Commit: 0492e2c99c879b3e1c76b587e794801980ccd668
Author: Alexander Kurenyshev <email address hidden>
Date: Thu Jun 9 12:41:28 2016

Increase slave node memory for destructive ceph

This patch increases memory for slave nodes for
all related jobs from the 7, 8, 9 and 10 SWARM

Change-Id: Ibed77dd9e20264c24907e9e699dfa7a24a5e642c
Related-Bug: #1572521

tags: added: swarm-blocker
removed: swarm-fail

The fix above has solved the problem. For now the test passes.

Changed in fuel:
status: Confirmed → Fix Committed

The issue is still reproduced; logs are attached.

Changed in fuel:
status: Fix Committed → Confirmed

As noted in my comment #22, we've increased memory on all slaves. The test passed on a custom CI, but this didn't help on swarm.
I think we should continue the investigation on the developers' side and fix the performance issues with slow pcs resource startup after a cold restart of the nodes.

Changed in fuel:
assignee: Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)

It seems that the real problem we should fix is:
```The problem is that you are doing time_sync a short time after the node comes online. It works for controller nodes, but non-controller nodes are trying to reach an ntp server that is managed by pacemaker. It fails because the ntp pacemaker resource is not started yet; it needs more time.

I think you should sync time with the master node for all nodes, or wait until the ntp resource is up, but I prefer the first approach.```

Maksim Malchuk (mmalchuk) wrote :

Alexander Kurenyshev, please address the recommendations, this bug should be fixed in the tests.

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Alexander Kurenyshev (akurenyshev)
tags: added: area-python
removed: area-library

Maksim, please tell me what exactly we should fix here.
1. We have a function which is used everywhere to sync time, and it works perfectly.
2. I tried to implement a smart wait for the ntp resources to come up, and it took 60 minutes without success: the resources didn't start and the cluster was dead. I also left my env for a day, and after that the resources were still down.

This is a performance issue, and we can't avoid it in our tests.
V. Kuklin suggested adding another 1.5 GB of memory. Okay, let's see what happens.

Maksim Malchuk (mmalchuk) wrote :

@Alexander Kurenyshev:
1. It didn't work perfectly, because it broke the swarm; I also sometimes see the same issue on debug jobs when trying to revert snapshots.
2. Maybe a smart wait will help, but as a fallback we need to sync the cluster with the master node.

I agree that this is a performance issue, but it reproduces only in tests with fuel-devops, so it should be fixed in the tests.

Nastya Urlapova (aurlapova) wrote :

SWARM on ISO #485 gave only one reproduce of this failure, tag swarm-blocker was removed.

tags: removed: area-qa swarm-blocker

Related fix proposed to branch: master
Change author: Alexander Kurenyshev <email address hidden>
Review: https://review.fuel-infra.org/22098

I had a talk with Dmitry Ilyin and did some testing. We have two approaches for fixing this bug:

1) Non-controller nodes should sync time with the same ntp server as the controllers
2) Wait until the pacemaker resource p_ntp is up, and only then sync time

Personally I prefer the first.

Why can't it be fixed in the library? In your case pacemaker needs a lot of time to bring up all the resources in the cluster, and the ntp resource is the last one in this list. We cannot move it to the top of the list because it has lots of dependencies, like p_vrouter, vip_mgmt and the deployment itself (this resource is created in the post-deployment stage, which is why it is at the bottom). Even if we could make it start first, you would still need some kind of check that it is up, synced and can provide time.
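The second approach (waiting for p_ntp) could be sketched roughly as follows. This parses `pcs status` text like the listing earlier in this thread; the `[p_ntp]` clone header line is an assumption, inferred from the naming of the other Fuel clone sets shown above (clone_p_haproxy, clone_p_mysqld):

```python
def p_ntp_started(pcs_status_output):
    """Return True if the p_ntp clone set lists at least one Started node.

    `pcs status` prints a clone set as a header line followed by a
    "Started: [ ... ]" or "Stopped: [ ... ]" line, as in the listing
    earlier in this thread.
    """
    lines = pcs_status_output.splitlines()
    for i, line in enumerate(lines):
        if "[p_ntp]" in line and i + 1 < len(lines):
            return lines[i + 1].strip().startswith("Started:")
    return False
```

A test would then poll `pcs status` through this check before running ntpdate, with an upper bound on the wait; as the comments above note, on a starved cluster that bound may never be reached.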

Reviewed: https://review.fuel-infra.org/22098
Submitter: Lesya Novaselskaya <email address hidden>
Branch: master

Commit: 00e76af6307135d7b507c80a94264c036c717687
Author: Alexander Kurenyshev <email address hidden>
Date: Wed Jun 15 17:06:31 2016

Increase memory for tests with cold restart

Fix adds 1.5 GB of memory to the tests which use
the cold restart function

Change-Id: Ic05004c3d2a96d4e4f4178d3530112350c981430
Related-Bug:#1572521

Dmitry Pyzhov (dpyzhov) on 2016-06-30
tags: removed: area-python
tags: added: swarm-fail

@Maksim Malchuk, @Oleksiy Molchanov
Guys, are you joking? Do you really think this product failure can be fixed in the tests?

1) A smart wait can't help us, because the cluster resources as a whole are broken and there is no point waiting for the ntp daemon to become ready. I implemented it in the tests; I tried to wait and waited for 2 hours without success.
2) The `Non-controller nodes should sync time with the same ntp-server as controllers` approach can't work for the same reason: the cluster IS BROKEN. There is no difference between servers to sync with, because ntp is not running.

The main problem is the mysql performance schema. We add some amount of memory and it is quickly consumed by mysql, while the other resources are in swap.
We have never faced this error on the baremetal lab with 64 GB of memory per controller.

@Ksenia Svechnikova
I couldn't figure out what happened with that job [1]; I need a live environment. The other, latest failures were not related to this bug: there are problems with the master node setup.

[1] https://product-ci.infra.mirantis.net/view/10.0_swarm/job/10.0.system_test.ubuntu.services_reconfiguration_thread_1/

Changed in fuel:
assignee: Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)
Maksim Malchuk (mmalchuk) wrote :

@Alexander Kurenyshev, are you kidding? Please reread all the comments, including yours:
1. The bug appears only in several automated test cases with reverts of environment snapshots.
2. The bug never appears in a standard deployment on any hardware/virtualization.
3. The bug never appears in the BVT or Smoke_Neutron tests.
According to these points, the issue should be fixed ONLY in those "several automated test cases".
As for the place and method of the fix: indeed, adding timeouts in the tests might not help, so the deployment options for each case need to be reviewed, and the "several automated test cases" shouldn't share the same deployment settings (for example, comment #34 should be addressed).

Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Alexander Kurenyshev (akurenyshev)
summary: - Time on nodes was not set with 'ntpdate'
+ Pacemaker resources are failed to start after the nodes cold reboot
description: updated
Changed in fuel:
assignee: Alexander Kurenyshev (akurenyshev) → Fuel Sustaining (fuel-sustaining-team)

@Aleksandr, I want to comment on this:

---
2) The `Non-controller nodes should sync time with the same ntp-server as controllers` approach can't work for the same reason: the cluster IS BROKEN. There is no difference between servers to sync with, because ntp is not running.
---
For the initial time sync we do not need any cluster to be working. In your test case you are syncing controller nodes with the master and non-controller nodes with the ntp under pacemaker on the controllers. Why don't you sync them with the master too?

What I observed on my env, where the nodes have 3 GB and 1 CPU:

It took ~10 minutes to bring up the ntp resource in pacemaker; after that I could easily run ntpdate pointed at the vrouter VIP.

http://paste.openstack.org/show/538844/

So, my suggestion is still the same:
>>> I think you should sync time with the master node for all nodes, or wait until the ntp resource is up, but I prefer the first approach.

summary: - Pacemaker resources are failed to start after the nodes cold reboot
+ ntpdate run before p_ntp pacemaker resource is up
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel QA Team (fuel-qa)

Similar error:

NTPD on nodes was not synchronized:
[(u'slave-01', 'Node: slave-01, ntpd peers: []', []), (u'slave-02', 'Node: slave-02, ntpd peers: []', []), (u'slave-03', 'Node: slave-03, ntpd peers: []', [])]

SWARMs:

https://mirantis.testrail.com/index.php?/runs/view/16451
https://mirantis.testrail.com/index.php?/tests/view/9752358

ElenaRossokhina (esolomina) wrote :

I confirm the issue is reproduced again in swarm:
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/ntp.py", line 134, in do_sync_time
    .format(self.report_not_synchronized(ntps)))
TimeoutError: Time on nodes was not set with 'ntpdate':
[(u'slave-04', "Execution of the command 'ntpdate -p 4 -t 0.2 -bu 10.109.51.8\n' failed on the node slave-04 with exit_code = 1 : 22 Jul 04:22:18 ntpdate[8534]: no server suitable for synchronization found\n", ['Fri Jul 22 04:22:39 UTC 2016\n'])]
https://product-ci.infra.mirantis.net/job/9.x.system_test.ubuntu.thread_3/3/testReport/(root)/ceph_ha_restart/ceph_ha_restart/

Alex Schultz (alex-schultz) wrote :

We might want to leverage ntp-wait as part of this process, as it will wait for ntpd to stabilize before continuing. The issue is that upon first startup the ntp system has to stabilize before the servers become available; this is just how ntp works. We need to wait until the system has stabilized before we can expect the time to be set.
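ntp-wait polls ntpd until it reports a synchronized state. Where ntp-wait is not installed, roughly the same check can be made by looking for a selected system peer in `ntpq -pn` output (the line ntpq marks with a `*` tally code); a hedged sketch:

```python
def ntpd_synchronized(ntpq_output):
    """True if `ntpq -pn` billboard output contains a selected system peer.

    ntpq prefixes the peer the daemon has synchronized to with a '*'
    tally code in the first column; until ntpd stabilizes, no line
    carries that mark.
    """
    return any(line.startswith("*") for line in ntpq_output.splitlines())
```

Running the stock `ntp-wait` (which exits 0 once ntpd reports synchronization) before the test's ntpdate call would express the same idea without any custom parsing.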

Nastya Urlapova (aurlapova) wrote :

Moved to Incomplete, because this issue hasn't occurred in a week.

Moved to Invalid due to the absence of reproductions.
Feel free to reopen with logs and snapshots if this bug appears again.

Changed in fuel:
status: Confirmed → Invalid