Random ovb-ha ping test failures

Bug #1680195 reported by Ben Nemec on 2017-04-05
Affects: tripleo | Importance: Critical | Assigned to: Unassigned

Bug Description

Ben Nemec (bnemec) wrote :

Bumping to critical. This has happened 74 times this week.

Changed in tripleo:
importance: High → Critical
tags: added: alert promotion-blocker

Moving my comment here, since this bug is marked with "alert" and was escalated too:
===============================================================================

I have this issue reproduced on a CI dev system right now; ping me to get access to the system to debug (@sshnaidm on #tripleo).

The failed VM's networks are located on a non-master HA controller node; maybe it was switched during the creation, or before it.

As far as I can see, the router's gateway port is down:

(overcloud) [jenkins@undercloud tripleo-ci]$ openstack port list --router 71ce23d9-2c47-4286-ad56-f2e5265ee30f
+--------------------------------------+----------------------------------------------+-------------------+------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+----------------------------------------------+-------------------+------------------------------------------------+--------+
| 5597935b-ce81-4002-ba83-1fdd404fab45 | HA port tenant | fa:16:3e:8e:24:c3 | ip_address='169.254.192.7', subnet_id | ACTIVE |
| | 124a61db46a24ebbadda7caa47de29a1 | | ='3af6b6be-972a-4fe5-bdc8-717a24a623a4' | |
| 9e9131b5-31b4-44f3-92e0-7b13de9e7f8d | | fa:16:3e:07:3e:15 | ip_address='192.168.2.1', subnet_id='086fe45e- | ACTIVE |
| | | | 4c20-45ae-bc16-ad7bb370951f' | |
| b72a3c4b-6545-4b34-856d-904b4bd66694 | HA port tenant | fa:16:3e:b3:f3:db | ip_address='169.254.192.3', subnet_id | ACTIVE |
| | 124a61db46a24ebbadda7caa47de29a1 | | ='3af6b6be-972a-4fe5-bdc8-717a24a623a4' | |
| d37fdab9-0490-47a4-a020-a3017c690c2a | HA port tenant | fa:16:3e:22:ca:96 | ip_address='169.254.192.4', subnet_id | ACTIVE |
| | 124a61db46a24ebbadda7caa47de29a1 | | ='3af6b6be-972a-4fe5-bdc8-717a24a623a4' | |
| e6688ece-5936-42bf-affa-028b182f9bf4 | | fa:16:3e:c1:a4:15 | ip_address='10.0.0.101', subnet_id='25b22b42 | DOWN |
| | | | -4b8b-44bc-a0fc-cc512a189d4d' | |
+--------------------------------------+----------------------------------------------+-------------------+------------------------------------------------+--------+
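As a quick triage aid (not part of the original report), the DOWN gateway port can be picked out of that table mechanically. A minimal sketch, run here against an inlined two-row excerpt in the same pipe-separated layout; on a real deployment you would pipe `openstack port list --router <router-id>` instead (or use `-f json` for robust parsing):

```shell
# Hypothetical shortened excerpt in the same table layout as above;
# real usage: openstack port list --router <router-id> | awk ...
ports='| e6688ece-5936-42bf-affa-028b182f9bf4 | | fa:16:3e:c1:a4:15 | 10.0.0.101 | DOWN |
| 9e9131b5-31b4-44f3-92e0-7b13de9e7f8d | | fa:16:3e:07:3e:15 | 192.168.2.1 | ACTIVE |'
# Field 6 is the Status column; print the ID of every DOWN port.
echo "$ports" | awk -F'|' '$6 ~ /DOWN/ {gsub(/ /,"",$2); print $2}'
# → e6688ece-5936-42bf-affa-028b182f9bf4
```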

Errors in the neutron server log about creating the gateway port (e6688ece-5936-42bf-affa-028b182f9bf4):

/var/log/neutron/server.log:2017-05-14 12:32:49.159 127313 INFO neutron.plugins.ml2.plugin [req-15df71a6-ee71-4fe3-90ea-be5b98790b1c - - - - -] Attempt 2 to bind port e6688ece-5936-42bf-affa-028b182f9bf4
/var/log/neutron/server.log:2017-05-14 12:32:53.521 127311 WARNING neutron.plugins.ml2.rpc [req-2956d1a9-251b-4180-8b83-293380d01ce9 - - - - -] Device e6688ece-5936-42bf-affa-028b182f9bf4 requested by agent ovs-agent-overcloud-controller-1.localdomain on network 35f15b00-62ec-4d2d-a033-6cb607c87f59 not bound, vif_type: ovs
/var/log/neutron/server.log:2017-05-14 12:32:58.104 127311 WARNING neutron.plugins.ml2.rpc [req-480a99ee-60f5-498e-b550-735a452a5f05 - - - - -] Device e6688ece-5936-42bf-affa-028b182f9bf4 requested by agent ovs-agent-overcloud-controller-2.localdomain on network 35f15b00-62ec-4d2d-a033-6cb607c87f59 not bound, vif_type: ovs
/var/log/neutron/server.log:2017-05-14 14:00:30.112 127311 WARNING neutron.plugins.ml2.rpc [req-7bae8287-d772-41f5-b657-e71318045e6e - - - - -] Device e6...
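To get a feel for how often binding fails per port, the "not bound" warnings can be tallied by device ID. A sketch against two inlined, shortened sample lines (hypothetical formatting); on a controller you would grep /var/log/neutron/server.log directly:

```shell
# Shortened sample lines; real usage on a controller:
#   grep 'not bound' /var/log/neutron/server.log | grep -o 'Device [0-9a-f-]*' | sort | uniq -c
log='2017-05-14 12:32:53.521 WARNING neutron.plugins.ml2.rpc Device e6688ece-5936-42bf-affa-028b182f9bf4 ... not bound
2017-05-14 12:32:58.104 WARNING neutron.plugins.ml2.rpc Device e6688ece-5936-42bf-affa-028b182f9bf4 ... not bound'
# Count warnings per device ID to see which ports keep failing to bind.
echo "$log" | grep -o 'Device [0-9a-f-]*' | sort | uniq -c
```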


It hasn't been seen on the ovb-updates job, which is "fake" HA with only one controller (nor on ovb-nonha, which has no HA). So it's most likely an issue with HA across several controllers.

Attila Darazs (adarazs) wrote :

This is not a timeout issue. The problem is in the pingtest, where you can see:

    + ping -c 1 10.0.0.103
    PING 10.0.0.103 (10.0.0.103) 56(84) bytes of data.
    From 10.0.0.1 icmp_seq=1 Destination Host Unreachable

This suggests some intermittent HA related networking error.

Here's a recent example: http://logs.openstack.org/83/466183/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/4873ca2/logs/undercloud/home/jenkins/overcloud_validate.log.txt.gz#_2017-05-19_07_31_32

Michele Baldessari (michele) wrote :

So I poked at the environment Sagi gave me, and this does indeed seem to be a neutron-l3-ha issue. Brent Eagles and John Eckersberg did most of the work looking at the env.

You can ping the floating IP from inside the proper namespace on the node with the active l3 router:
[root@overcloud-controller-2 neutron]# ip netns exec qdhcp-760bbb48-b24b-4d6e-8ac8-db477de6019a ping 10.0.0.102
PING 10.0.0.102 (10.0.0.102) 56(84) bytes of data.
64 bytes from 10.0.0.102: icmp_seq=1 ttl=63 time=4.11 ms
64 bytes from 10.0.0.102: icmp_seq=2 ttl=63 time=0.986 ms

But you can't from outside.

Another thing worth noting is how much ceilometer and swift hammer the controllers in general; they monopolize the CPU quite a bit. Namely, ceilometer generates more notifications than it consumes: the incoming notification rate is 4/s while the outgoing rate is 2/s (meaning we receive twice as many messages as we are able to process).

The unconsumed notifications in rabbit number roughly 190k:
sqlite> select name, messages from queues order by messages desc limit 10;
notifications.info|187502
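For a sense of scale (my arithmetic, using the rates quoted above: roughly 4 notifications/s in and 2/s out), a backlog of ~187k messages corresponds to about a day of sustained imbalance:

```shell
# Assumed rates from the comment above: 4 msg/s arriving, 2 msg/s consumed.
rate_in=4; rate_out=2; backlog=187502
secs=$(( backlog / (rate_in - rate_out) ))              # 93751 s
echo "backlog built up over ~$(( secs / 3600 )) hours"  # → ~26 hours
```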

And swift seems to create a huge number of connections towards port 6001 (swift-container):
[root@overcloud-controller-0 ceilometer]# ss -antp dport = 6001 | wc -l
5748

Also, swift-container makes two orders of magnitude more syscalls than any other process on the system:

[root@overcloud-controller-0 ~]# stap -v syscalls_by_pid.stp
Collecting data... Type Ctrl-C to exit and display results
#SysCalls PID
1038113 105149
12662 105160

[root@overcloud-controller-0 ~]# ps -f 105149
UID PID PPID C STIME TTY STAT TIME CMD
swift 105149 1 27 May14 ? Rs 3720:13 /usr/bin/python2 /usr/bin/swift-container-replicator /etc/swift/container-server.conf

While I don't think ceilometer/swift hogging the CPU is the main culprit, we will need to investigate this at some point.

Fix proposed to branch: master
Review: https://review.openstack.org/468031

Changed in tripleo:
assignee: nobody → John Trowbridge (trown)
status: Triaged → In Progress
John Trowbridge (trown) wrote :

I am not entirely sure if https://review.openstack.org/#/c/467430/ is what helped with this issue, or if disabling telemetry has freed up resources for us... but looking on http://status-tripleoci.rhcloud.com/ it seems like the issue is resolved.

Changed in tripleo:
status: In Progress → Fix Released
Alan Pevec (apevec) on 2017-06-08
summary: - Random ovb-ha ping test failures
+ Random OOOQ ovb-ha ping test failures
Changed in tripleo:
milestone: pike-2 → pike-3

Is there a logstash query to see the frequency of ping failures?

@Alan, why did you change it to "OOOQ"? It's not related; ping test failures happen in both oooq and non-oooq HA jobs.

Alan Pevec (apevec) wrote :

@Sagi in comment #11 all listed jobs were OOOQ

summary: - Random OOOQ ovb-ha ping test failures
+ Random ovb-ha ping test failures
John Trowbridge (trown) wrote :

According to http://status-tripleoci.rhcloud.com/ this is resolved.

Changed in tripleo:
status: Triaged → Fix Released
John Trowbridge (trown) wrote :

If this gets reopened as an HA bug in tripleo please remove me as assignee so I stop getting pinged about it every 4 hours ;P.

Ben Nemec (bnemec) wrote :

This has not been fixed either of the times you've closed it. I'm not sure what results you're seeing that suggest the ping test is not spuriously failing in ha jobs. Granted, they aren't all necessarily this bug, but I see instances of it still today.

Unfortunately it doesn't show up in logstash for the quickstart ha jobs so it's hard to get good numbers on how often it's happening. :-(

I am going to drop the alert since it doesn't appear to be happening frequently enough to block anything. It's still something we need to fix though.

Changed in tripleo:
status: Fix Released → Triaged
assignee: John Trowbridge (trown) → nobody
tags: removed: alert

I haven't seen this happen for a long time, especially since we disabled telemetry. So I'm closing this; let's reopen it if the problem comes back.

Changed in tripleo:
status: Triaged → Fix Released