Random ovb-ha ping test failures

Bug #1680195 reported by Ben Nemec on 2017-04-05
Affects: tripleo | Importance: Critical | Assigned to: Unassigned

Bug Description

Ben Nemec (bnemec) wrote :

Bumping to critical. This has happened 74 times this week.

Changed in tripleo:
importance: High → Critical
tags: added: alert promotion-blocker

Moving my comment here, since this bug is marked with "alert" and was escalated too:
===============================================================================

I have this issue reproduced on a CI dev system right now; ping me to get access to the system to debug (@sshnaidm on #tripleo).

The failed VM's networks are located on a non-master HA controller node; maybe it was switched during the creation, or before it.

As far as I can see, the router's gateway port is down:

(overcloud) [jenkins@undercloud tripleo-ci]$ openstack port list --router 71ce23d9-2c47-4286-ad56-f2e5265ee30f
+--------------------------------------+----------------------------------------------+-------------------+------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+----------------------------------------------+-------------------+------------------------------------------------+--------+
| 5597935b-ce81-4002-ba83-1fdd404fab45 | HA port tenant | fa:16:3e:8e:24:c3 | ip_address='169.254.192.7', subnet_id | ACTIVE |
| | 124a61db46a24ebbadda7caa47de29a1 | | ='3af6b6be-972a-4fe5-bdc8-717a24a623a4' | |
| 9e9131b5-31b4-44f3-92e0-7b13de9e7f8d | | fa:16:3e:07:3e:15 | ip_address='192.168.2.1', subnet_id='086fe45e- | ACTIVE |
| | | | 4c20-45ae-bc16-ad7bb370951f' | |
| b72a3c4b-6545-4b34-856d-904b4bd66694 | HA port tenant | fa:16:3e:b3:f3:db | ip_address='169.254.192.3', subnet_id | ACTIVE |
| | 124a61db46a24ebbadda7caa47de29a1 | | ='3af6b6be-972a-4fe5-bdc8-717a24a623a4' | |
| d37fdab9-0490-47a4-a020-a3017c690c2a | HA port tenant | fa:16:3e:22:ca:96 | ip_address='169.254.192.4', subnet_id | ACTIVE |
| | 124a61db46a24ebbadda7caa47de29a1 | | ='3af6b6be-972a-4fe5-bdc8-717a24a623a4' | |
| e6688ece-5936-42bf-affa-028b182f9bf4 | | fa:16:3e:c1:a4:15 | ip_address='10.0.0.101', subnet_id='25b22b42 | DOWN |
| | | | -4b8b-44bc-a0fc-cc512a189d4d' | |
+--------------------------------------+----------------------------------------------+-------------------+------------------------------------------------+--------+
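As a quick triage aid (not part of the original report), the DOWN gateway port can be picked out of that table mechanically. A minimal sketch, run here against an inlined two-row excerpt in the same pipe-separated layout; on a real deployment you would pipe `openstack port list --router <router-id>` instead (or use `-f json` for robust parsing):

```shell
# Hypothetical shortened excerpt in the same table layout as above;
# real usage: openstack port list --router <router-id> | awk ...
ports='| e6688ece-5936-42bf-affa-028b182f9bf4 | | fa:16:3e:c1:a4:15 | 10.0.0.101 | DOWN |
| 9e9131b5-31b4-44f3-92e0-7b13de9e7f8d | | fa:16:3e:07:3e:15 | 192.168.2.1 | ACTIVE |'
# Field 6 is the Status column; print the ID of every DOWN port.
echo "$ports" | awk -F'|' '$6 ~ /DOWN/ {gsub(/ /,"",$2); print $2}'
# → e6688ece-5936-42bf-affa-028b182f9bf4
```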

Errors in the neutron server log about creating the gateway port (e6688ece-5936-42bf-affa-028b182f9bf4):

/var/log/neutron/server.log:2017-05-14 12:32:49.159 127313 INFO neutron.plugins.ml2.plugin [req-15df71a6-ee71-4fe3-90ea-be5b98790b1c - - - - -] Attempt 2 to bind port e6688ece-5936-42bf-affa-028b182f9bf4
/var/log/neutron/server.log:2017-05-14 12:32:53.521 127311 WARNING neutron.plugins.ml2.rpc [req-2956d1a9-251b-4180-8b83-293380d01ce9 - - - - -] Device e6688ece-5936-42bf-affa-028b182f9bf4 requested by agent ovs-agent-overcloud-controller-1.localdomain on network 35f15b00-62ec-4d2d-a033-6cb607c87f59 not bound, vif_type: ovs
/var/log/neutron/server.log:2017-05-14 12:32:58.104 127311 WARNING neutron.plugins.ml2.rpc [req-480a99ee-60f5-498e-b550-735a452a5f05 - - - - -] Device e6688ece-5936-42bf-affa-028b182f9bf4 requested by agent ovs-agent-overcloud-controller-2.localdomain on network 35f15b00-62ec-4d2d-a033-6cb607c87f59 not bound, vif_type: ovs
/var/log/neutron/server.log:2017-05-14 14:00:30.112 127311 WARNING neutron.plugins.ml2.rpc [req-7bae8287-d772-41f5-b657-e71318045e6e - - - - -] Device e6...
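To get a feel for how often binding fails per port, the "not bound" warnings can be tallied by device ID. A sketch against two inlined, shortened sample lines (hypothetical formatting); on a controller you would grep /var/log/neutron/server.log directly:

```shell
# Shortened sample lines; real usage on a controller:
#   grep 'not bound' /var/log/neutron/server.log | grep -o 'Device [0-9a-f-]*' | sort | uniq -c
log='2017-05-14 12:32:53.521 WARNING neutron.plugins.ml2.rpc Device e6688ece-5936-42bf-affa-028b182f9bf4 ... not bound
2017-05-14 12:32:58.104 WARNING neutron.plugins.ml2.rpc Device e6688ece-5936-42bf-affa-028b182f9bf4 ... not bound'
# Count warnings per device ID to see which ports keep failing to bind.
echo "$log" | grep -o 'Device [0-9a-f-]*' | sort | uniq -c
```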


It hasn't been seen on the ovb-updates job, which is "fake" HA with only one controller (nor on ovb-nonha, which has no HA). So it's most likely an issue with HA across several controllers.

Attila Darazs (adarazs) wrote :

This is not a timeout issue. The problem is in the pingtest, where you can see:

    + ping -c 1 10.0.0.103
    PING 10.0.0.103 (10.0.0.103) 56(84) bytes of data.
    From 10.0.0.1 icmp_seq=1 Destination Host Unreachable

This suggests some intermittent HA related networking error.

Here's a recent example: http://logs.openstack.org/83/466183/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-oooq/4873ca2/logs/undercloud/home/jenkins/overcloud_validate.log.txt.gz#_2017-05-19_07_31_32

Michele Baldessari (michele) wrote :

So I poked at the environment Sagi gave me, and this does indeed seem to be a neutron-l3-ha issue. Brent Eagles and John Eckersberg did most of the work looking at the env.

You can ping the floating IP from inside the proper namespace on the node with the active l3 router:
[root@overcloud-controller-2 neutron]# ip netns exec qdhcp-760bbb48-b24b-4d6e-8ac8-db477de6019a ping 10.0.0.102
PING 10.0.0.102 (10.0.0.102) 56(84) bytes of data.
64 bytes from 10.0.0.102: icmp_seq=1 ttl=63 time=4.11 ms
64 bytes from 10.0.0.102: icmp_seq=2 ttl=63 time=0.986 ms

But you can't from outside.

Another thing worth noting is how much ceilometer and swift hammer the controllers in general; they monopolize the CPU quite a bit. Namely, ceilometer generates more notifications than it consumes: the incoming notification rate is 4/s while the outgoing rate is 2/s (meaning we receive twice as many messages as we are able to process).

The unconsumed notifications in rabbit number roughly 190k:
sqlite> select name, messages from queues order by messages desc limit 10;
notifications.info|187502
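For a sense of scale (my arithmetic, using the rates quoted above: roughly 4 notifications/s in and 2/s out), a backlog of ~187k messages corresponds to about a day of sustained imbalance:

```shell
# Assumed rates from the comment above: 4 msg/s arriving, 2 msg/s consumed.
rate_in=4; rate_out=2; backlog=187502
secs=$(( backlog / (rate_in - rate_out) ))              # 93751 s
echo "backlog built up over ~$(( secs / 3600 )) hours"  # → ~26 hours
```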

And swift seems to create a huge number of connections towards port 6001 (swift-container):
[root@overcloud-controller-0 ceilometer]# ss -antp dport = 6001 | wc -l
5748

Also, swift-container makes two orders of magnitude more syscalls than any other process on the system:

[root@overcloud-controller-0 ~]# stap -v syscalls_by_pid.stp
Collecting data... Type Ctrl-C to exit and display results
#SysCalls PID
1038113 105149
12662 105160

[root@overcloud-controller-0 ~]# ps -f 105149
UID PID PPID C STIME TTY STAT TIME CMD
swift 105149 1 27 May14 ? Rs 3720:13 /usr/bin/python2 /usr/bin/swift-container-replicator /etc/swift/container-server.conf

While I don't think ceilometer/swift hogging the CPU is the main culprit, we will need to investigate this at some point.

Fix proposed to branch: master
Review: https://review.openstack.org/468031

Changed in tripleo:
assignee: nobody → John Trowbridge (trown)
status: Triaged → In Progress
John Trowbridge (trown) wrote :

I am not entirely sure if https://review.openstack.org/#/c/467430/ is what helped with this issue, or if disabling telemetry has freed up resources for us... but looking on http://status-tripleoci.rhcloud.com/ it seems like the issue is resolved.

Changed in tripleo:
status: In Progress → Fix Released
Alan Pevec (apevec) on 2017-06-08
summary: - Random ovb-ha ping test failures
+ Random OOOQ ovb-ha ping test failures
Changed in tripleo:
milestone: pike-2 → pike-3

Is there a logstash query to see the frequency of ping failures?

@Alan, why did you change it to "OOOQ"? It's not related; ping test failures happen in both oooq and non-oooq HA jobs.

Alan Pevec (apevec) wrote :

@Sagi in comment #11 all listed jobs were OOOQ

summary: - Random OOOQ ovb-ha ping test failures
+ Random ovb-ha ping test failures
John Trowbridge (trown) wrote :

According to http://status-tripleoci.rhcloud.com/ this is resolved.

Changed in tripleo:
status: Triaged → Fix Released
John Trowbridge (trown) wrote :

If this gets reopened as an HA bug in tripleo please remove me as assignee so I stop getting pinged about it every 4 hours ;P.

Ben Nemec (bnemec) wrote :

This has not been fixed either of the times you've closed it. I'm not sure what results you're seeing that suggest the ping test is not spuriously failing in ha jobs. Granted, they aren't all necessarily this bug, but I see instances of it still today.

Unfortunately it doesn't show up in logstash for the quickstart ha jobs so it's hard to get good numbers on how often it's happening. :-(

I am going to drop the alert since it doesn't appear to be happening frequently enough to block anything. It's still something we need to fix though.

Changed in tripleo:
status: Fix Released → Triaged
assignee: John Trowbridge (trown) → nobody
tags: removed: alert

I haven't seen this happen for a long time, especially since we disabled telemetry. So I'm closing this; let's reopen it if the problem comes back.

Changed in tripleo:
status: Triaged → Fix Released