"15.184.64.1 is not pingable" in multinode jobs

Bug #1680167 reported by Ben Nemec on 2017-04-05
This bug affects 2 people
Affects: tripleo
Importance: Critical
Assigned to: Emilien Macchi

Bug Description

This is happening quite a bit in CI and causes spurious failures when it does. It makes it very difficult to merge patches that need to pass a lot of multinode jobs twice to get through the gate. Something like https://review.openstack.org/#/c/453127 needs to pass 18 multinode jobs between the check and gate queues, so it's next to impossible to merge even though there's nothing wrong with the patch.

Example failure: http://logs.openstack.org/27/453127/1/check/gate-tripleo-ci-centos-7-scenario002-multinode-oooq/d803e64/

I'm guessing this is happening any time a job gets scheduled to an infracloud node. I believe I've seen discussion of the problem in IRC, but I couldn't find an existing bug.

There are 87 hits in logstash for this error: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20*tripleo-ci*%20AND%20build_status%3A%20FAILURE%20AND%20message%3A%20%5C%2215.184.64.1%20is%20not%20pingable.%5C%22 That's overreporting a little because some of the hits are duplicated, but I would estimate there are still at least 60 separate occurrences in that time frame.

wes hayutin (weshayutin) on 2017-04-06
tags: added: alert
Emilien Macchi (emilienm) wrote :

2 hits in 24 hours, removing the alert.

tags: removed: alert
Changed in tripleo:
milestone: pike-2 → pike-3

http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3Agate-tripleo-ci-centos-7-*%20AND%20message%3A%20%5C%2215.184.64.1%20is%20not%20pingable%5C%22

It happens quite often and causes problems both in the gate and in rechecks. From logstash I counted 29 hits in the last day, and that is after all our files started to be indexed by logstash (https://review.openstack.org/#/c/476524/). So I think it's worth bringing back the alert.

I asked about this IP in #openstack-infra, but didn't get an answer.

tags: added: alert
Oliver Walsh (owalsh) wrote :

It's a HP address block: https://apps.db.ripe.net/search/query.html?searchtext=15.184.64.1#resultsAnchor

Network config looks ok to me, netmask looks correct for the gateway...

2017-06-28 13:06:28.946228 | "networks": [
2017-06-28 13:06:28.946249 | {
2017-06-28 13:06:28.946278 | "id": "network0",
2017-06-28 13:06:28.946312 | "ip_address": "15.184.67.60",
2017-06-28 13:06:28.946346 | "link": "tap6088de7e-d9",
2017-06-28 13:06:28.946379 | "netmask": "255.255.224.0",
2017-06-28 13:06:28.946422 | "network_id": "85ba3bb6-1fd9-443e-beb8-e9733147218d",
2017-06-28 13:06:28.946449 | "routes": [
2017-06-28 13:06:28.946473 | {
2017-06-28 13:06:28.946508 | "gateway": "15.184.64.1",
2017-06-28 13:06:28.946542 | "netmask": "0.0.0.0",
2017-06-28 13:06:28.946575 | "network": "0.0.0.0"
2017-06-28 13:06:28.946599 | }
2017-06-28 13:06:28.946622 | ],
2017-06-28 13:06:28.946650 | "type": "ipv4"
2017-06-28 13:06:28.946671 | }
2017-06-28 13:06:28.946690 | ],

I'd guess the gateway just isn't responding to ICMP packets sometimes. High load maybe?
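
The "netmask looks correct for the gateway" observation can be double-checked with a minimal sketch using Python's standard `ipaddress` module. The address, netmask and gateway are copied from the log above; the helper name `gateway_on_link` is made up for illustration and is not part of any TripleO code.

```python
import ipaddress

# Values copied from the network config in the log above.
interface = ipaddress.ip_interface("15.184.67.60/255.255.224.0")
gateway = ipaddress.ip_address("15.184.64.1")


def gateway_on_link(interface, gateway) -> bool:
    """Return True when the gateway falls inside the interface's subnet."""
    return gateway in interface.network


print(interface.network)                    # 15.184.64.0/19
print(gateway_on_link(interface, gateway))  # True
```

A gateway inside the interface's own subnet is reachable without any extra route, which is consistent with the config looking fine.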

Oliver Walsh (owalsh) wrote :

Doubt it's high load actually, we run ping -w 300 -c1, which will continue pinging for 300 seconds or until it gets 1 response. Then we retry 10 times (by default). That looks wrong, surely we don't want to wait 50 minutes?
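
The 50 minute figure is just the deadline multiplied by the number of attempts. A small sketch of that arithmetic, assuming each attempt runs to its full 300 second deadline and there is no pause between attempts (the function name is hypothetical):

```python
from datetime import timedelta


def worst_case_wait(deadline_s: int, attempts: int) -> timedelta:
    """Total time spent when every attempt runs until its deadline expires."""
    return timedelta(seconds=deadline_s * attempts)


# ping -w 300 -c1, tried 10 times
print(worst_case_wait(300, 10))  # 0:50:00
```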

From discussion with @fungi on #openstack-infra, it's a router in the HPE cloud (infracloud). We don't have contacts there and it seems no one will give them to us. I can ping this IP now, but it seems unreliable and may not reply under load.
The bottom line is that this is the situation, and we need another way to check connectivity if this one fails.

@fungi suggested to ping git.openstack.org - it doesn't limit pings and devstack already uses it for icmp. So if default gw is not pingable, we can try to ping git.openstack.org.

Ben Nemec (bnemec) wrote :

The problem is this check is not ci-specific. It's a connectivity validation during deployment, and we can't change it to ping an arbitrary address like git.openstack.org because some people will be deploying in isolated environments where they don't have direct access to the outside world. Not to mention that adding a hard dependency on an external resource like that would be less than ideal.

I wonder if it would work better to reduce the wait time of the ping but add more attempts. I'll try a patch and we can recheck it a bunch of times to see if it hits this bug.

Fix proposed to branch: master
Review: https://review.openstack.org/478981

Changed in tripleo:
assignee: nobody → Ben Nemec (bnemec)
status: Triaged → In Progress
Oliver Walsh (owalsh) wrote :

Ping -w isn't wait time, it's how long to loop for, pinging every second (by default).

Ben Nemec (bnemec) wrote :

Huh, tcpdump says you're right. There's no indication of that in the man page though: "Time to wait for a response, in seconds."

Change abandoned by Ben Nemec (<email address hidden>) on branch: master
Review: https://review.openstack.org/478981
Reason: Per the discussion on the bug, this won't actually help. -W already retries once per second.

Ben Nemec (bnemec) on 2017-06-29
Changed in tripleo:
assignee: Ben Nemec (bnemec) → nobody
Oliver Walsh (owalsh) wrote :

Timeout is -W, deadline is -w.
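
To make the distinction concrete, here is a hedged sketch of a helper that builds a ping command line and keeps the two options apart. In iputils ping, `-w` is the overall deadline after which ping exits, while `-W` is how long to wait for each individual reply. The function and parameter names are invented for this example.

```python
from typing import List, Optional


def ping_argv(host: str, count: int = 1,
              deadline_s: Optional[int] = None,
              timeout_s: Optional[int] = None) -> List[str]:
    """Build an iputils-style ping command line (hypothetical helper)."""
    argv = ["ping", "-c", str(count)]
    if deadline_s is not None:
        # -w: overall deadline; ping keeps sending until it expires
        argv += ["-w", str(deadline_s)]
    if timeout_s is not None:
        # -W: time to wait for a response to each probe
        argv += ["-W", str(timeout_s)]
    return argv + [host]


# The command discussed in this bug: loop for up to 300 seconds, stop on 1 reply.
print(" ".join(ping_argv("15.184.64.1", deadline_s=300)))
```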

Oliver Walsh (owalsh) wrote :

Hmmm, not 100% sure the infrastructure is the culprit...

Looking at the logs here:
http://logs.openstack.org/56/471956/11/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2258e1b/

traceroute succeeds when the job begins:
http://logs.openstack.org/56/471956/11/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2258e1b/console.html#_2017-06-28_13_06_29_028535
2017-06-28 13:06:29.028535 | traceroute to git.openstack.org (104.130.246.128), 30 hops max, 60 byte packets
2017-06-28 13:06:34.035102 | 1 15.184.64.1 9.740 ms 0.747 ms 0.718 ms
...

but ping fails from the controller around 1 hour later: http://logs.openstack.org/56/471956/11/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/2258e1b/logs/subnode-2/var/log/messages.txt.gz#_Jun_28_14_54_25

Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:54:25,155] (heat-config) [INFO] {"deploy_stdout": "Trying to ping 192.168.24.10 for local network 192.168.24.0/24.\nPing to 192.168.24.10 succeeded.\nSUCCESS\nTrying to ping default gateway 15.184.64.1...Ping to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nPing to 15.184.64.1 failed. Retrying...\nFAILURE\n15.184.64.1 is not pingable.\n", "deploy_stderr": "", "deploy_status_code": 1}
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:54:25,158] (heat-config) [DEBUG] [2017-06-28 14:04:24,048] (heat-config) [INFO] ping_test_ips=192.168.24.10 192.168.24.10 192.168.24.10 192.168.24.10 192.168.24.10 192.168.24.10
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,048] (heat-config) [INFO] validate_fqdn=False
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,048] (heat-config) [INFO] deploy_server_id=2012629a-13ca-43e2-9a4f-2818ecc705d9
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,048] (heat-config) [INFO] deploy_action=CREATE
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,049] (heat-config) [INFO] deploy_stack_id=overcloud-ControllerAllNodesValidationDeployment-sjlbjpxkkq3z/fb669aab-6a19-4f56-b156-3c6352b4a928
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,049] (heat-config) [INFO] deploy_resource_name=0
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,049] (heat-config) [INFO] deploy_signal_transport=CFN_SIGNAL
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:04:24,049] (heat-config) [INFO] deploy_signal_id=http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A1a6779e490614d8f8aaee69b08761314%3Astacks%2Fovercloud-ControllerAllNodesValidationDeployment-sjlbjpxkkq3z%2Ffb669aab-6a19-4f56-b156-3c6352b4a928%2Fresources%2F0?Timestamp=2017-06-28T14%3A04%3A18Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=df9ee8c74b564c4986e751999f78e24d&SignatureVersion=2&Signature=tGkZxInloRVRy1Gu%2BPdA9jb%2FrEUc43y8yLKO1l%2BnXLs%3D
Jun 28 14:54:25 localhost os-collect-config: [2017-0...

Oliver Walsh (owalsh) wrote :

Ah, no, ping didn't fail quickly, it took the full 50 minutes:
[2017-06-28 14:04:24,049] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/f109f3ec-80e5-40a2-aa28-2be7541fe69f
Jun 28 14:54:25 localhost os-collect-config: [2017-06-28 14:54:25,145] (heat-config) [INFO] Trying to ping 192.168.24.10 for local network 192.168.24.0/24.

Fix proposed to branch: master
Review: https://review.openstack.org/479406

Changed in tripleo:
assignee: nobody → Ben Nemec (bnemec)
Changed in tripleo:
assignee: Ben Nemec (bnemec) → Emilien Macchi (emilienm)

Reviewed: https://review.openstack.org/479406
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=766de0cacb18171264d2a699ac48cacb8d35a152
Submitter: Jenkins
Branch: master

commit 766de0cacb18171264d2a699ac48cacb8d35a152
Author: Ben Nemec <email address hidden>
Date: Fri Jun 30 14:04:35 2017 -0500

    Disable network validation in multinode jobs

    Sometimes the infracloud gateway refuses to ping even though
    everything else is working fine. Since we have coverage of this
    functionality in the OVB jobs it should be safe to turn it off
    here so it stops spuriously failing our jobs.

    We can't just set the resource to OS::Heat::None because there
    are other resources with dependencies on it. Instead, this adds
    a noop version of the validation software config that always
    returns true.

    Change-Id: I8361bc8be442b45c3ef6bdccdc53598fcb1d9540
    Partial-Bug: 1680167

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/479256
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=6a64a4a9d20a7a43eee6181c7bab738329844eba
Submitter: Jenkins
Branch: master

commit 6a64a4a9d20a7a43eee6181c7bab738329844eba
Author: Oliver Walsh <email address hidden>
Date: Fri Jun 30 11:51:06 2017 +0100

    Tolerate network errors in pingtest retry logic

    We use ping -w <deadline> -c <count>. This will ping every second until
    <count> replies are received, or <deadline> is reached, or a network error occurs.

    With the current retry logic a network error will result in a short tight loop
    instead of waiting for the network to come up.

    This change reduces the deadline to 10s, but sleeps 60s between retries.

    Change-Id: Ib00cff6f843c04a00737b40e3ef3d1560d6e6d2d
    Related-bug: #1680167
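
The retry behaviour described in this commit message can be sketched roughly as follows. This is an illustration in Python, not the actual validation script from tripleo-heat-templates: the 10 second deadline and 60 second sleep come from the commit message, the default of 10 attempts comes from the earlier discussion, and the function names and the injectable `probe`/`sleep` parameters are assumptions made so the loop can be exercised without a network.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str, deadline_s: int = 10) -> bool:
    """Run `ping -w <deadline> -c 1 <host>`; True when ping exits with 0."""
    result = subprocess.run(
        ["ping", "-w", str(deadline_s), "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ping_with_retries(host: str, attempts: int = 10, pause_s: int = 60,
                      probe: Callable[[str], bool] = ping_once,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Probe the host up to `attempts` times, sleeping between failures.

    Sleeping after a failed probe means an immediate network error
    (for example "Network is unreachable") no longer causes a tight loop.
    """
    for attempt in range(1, attempts + 1):
        if probe(host):
            return True
        if attempt < attempts:
            sleep(pause_s)
    return False
```

With these numbers the worst case is roughly 10 x 10s of pinging plus 9 x 60s of sleeping, a little over 10 minutes instead of 50.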

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/482645
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=36a16de28359c91299f1efb4f36f6c6e8631a622
Submitter: Jenkins
Branch: stable/newton

commit 36a16de28359c91299f1efb4f36f6c6e8631a622
Author: Ben Nemec <email address hidden>
Date: Fri Jun 30 14:04:35 2017 -0500

    Disable network validation in multinode jobs

    Sometimes the infracloud gateway refuses to ping even though
    everything else is working fine. Since we have coverage of this
    functionality in the OVB jobs it should be safe to turn it off
    here so it stops spuriously failing our jobs.

    We can't just set the resource to OS::Heat::None because there
    are other resources with dependencies on it. Instead, this adds
    a noop version of the validation software config that always
    returns true.

    Change-Id: I8361bc8be442b45c3ef6bdccdc53598fcb1d9540
    Partial-Bug: 1680167
    (cherry picked from commit 766de0cacb18171264d2a699ac48cacb8d35a152)
    (cherry picked from commit 07d9e9c1a95e06613fc368aff452ae10bac1a555)

Reviewed: https://review.openstack.org/482644
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8458911e514ed4ee88c17e52bf1ea7b82bdcfb66
Submitter: Jenkins
Branch: stable/ocata

commit 8458911e514ed4ee88c17e52bf1ea7b82bdcfb66
Author: Ben Nemec <email address hidden>
Date: Fri Jun 30 14:04:35 2017 -0500

    Disable network validation in multinode jobs

    Sometimes the infracloud gateway refuses to ping even though
    everything else is working fine. Since we have coverage of this
    functionality in the OVB jobs it should be safe to turn it off
    here so it stops spuriously failing our jobs.

    We can't just set the resource to OS::Heat::None because there
    are other resources with dependencies on it. Instead, this adds
    a noop version of the validation software config that always
    returns true.

    Change-Id: I8361bc8be442b45c3ef6bdccdc53598fcb1d9540
    Partial-Bug: 1680167
    (cherry picked from commit 766de0cacb18171264d2a699ac48cacb8d35a152)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/548663
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=f30721482f40b7a010a11ae9242579ef4edc0470
Submitter: Zuul
Branch: stable/ocata

commit f30721482f40b7a010a11ae9242579ef4edc0470
Author: Oliver Walsh <email address hidden>
Date: Fri Jun 30 11:51:06 2017 +0100

    Tolerate network errors in pingtest retry logic

    We use ping -w <deadline> -c <count>. This will ping every second until
    <count> replies are received, or <deadline> is reached, or a network error occurs.

    With the current retry logic a network error will result in a short tight loop
    instead of waiting for the network to come up.

    This change reduces the deadline to 10s, but sleeps 60s between retries.

    Change-Id: Ib00cff6f843c04a00737b40e3ef3d1560d6e6d2d
    Related-bug: #1680167
    (cherry picked from commit 6a64a4a9d20a7a43eee6181c7bab738329844eba)

Reviewed: https://review.openstack.org/548665
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=4de1d93f2fff3a225b1002663e62a071b9cb24e3
Submitter: Zuul
Branch: stable/newton

commit 4de1d93f2fff3a225b1002663e62a071b9cb24e3
Author: Oliver Walsh <email address hidden>
Date: Fri Jun 30 11:51:06 2017 +0100

    Tolerate network errors in pingtest retry logic

    We use ping -w <deadline> -c <count>. This will ping every second until
    <count> replies are received, or <deadline> is reached, or a network error occurs.

    With the current retry logic a network error will result in a short tight loop
    instead of waiting for the network to come up.

    This change reduces the deadline to 10s, but sleeps 60s between retries.

    Change-Id: Ib00cff6f843c04a00737b40e3ef3d1560d6e6d2d
    Related-bug: #1680167
    (cherry picked from commit 6a64a4a9d20a7a43eee6181c7bab738329844eba)
