Nodes in periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master fail to get usable IPs through os-net-config with "Error, some other host (BE:E5:4F:B9:21:B0) already uses address"

Bug #1818060 reported by Gabriele Cerami
This bug affects 1 person
Affects: tripleo
Status: In Progress
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

logs at

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/0398d7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-02-28_01_33_47

show nodes are failing to get a usable IP address.
First, this error appears:

2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [INFO] running ifup on interface: eth1
2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [ERROR] Failure(s) occurred when applying configuration
2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (BE:E5:4F:B9:21:B0) already uses address 172.18.0.79.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | Traceback (most recent call last):
2019-02-28 01:33:47 | File "/bin/os-net-config", line 10, in <module>
2019-02-28 01:33:47 | sys.exit(main())
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 295, in main
2019-02-28 01:33:47 | activate=not opts.no_activate)
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 1696, in apply
2019-02-28 01:33:47 | raise os_net_config.ConfigurationError(message)
2019-02-28 01:33:47 | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Then, possibly as a consequence:

2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] Failure(s) occurred when applying configuration
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth5... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth4... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth3... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth2... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth1... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | Traceback (most recent call last):
2019-02-28 01:33:47 | File "/bin/os-net-config", line 10, in <module>
2019-02-28 01:33:47 | sys.exit(main())
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 295, in main
2019-02-28 01:33:47 | activate=not opts.no_activate)
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 1696, in apply
2019-02-28 01:33:47 | raise os_net_config.ConfigurationError(message)
2019-02-28 01:33:47 | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Tags: ci
Revision history for this message
wes hayutin (weshayutin) wrote :

 openstack server list | grep -i error | wc -l 
184

Revision history for this message
Gabriele Cerami (gcerami) wrote :

Proposing https://review.rdoproject.org/r/19048 to include cleanup of servers in ERROR state in the periodic cleanup.

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Ronelle Landy (rlandy) wrote :

Reopening this bug. We hit the port failures even when the stack list is clean, possibly while one stack is being created as another is being deleted.
Reopening to monitor this.

example: https://logs.rdoproject.org/08/653408/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/5c9b021/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-04-17_13_07_08

Changed in tripleo:
status: Fix Released → In Progress
Revision history for this message
wes hayutin (weshayutin) wrote :

DHCP should NOT distribute an IP that is already allocated. AFAICT the heat stacks are deleted as much as possible during a CI run [1]; additionally, cleanup scripts are running in the background.

[1] https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/roles/ovb-manage/tasks/ovb-delete-stack.yml#L33-L71

IMHO we have a bug in neutron dhcp

Revision history for this message
Brian Haley (brian-haley) wrote :

Wes - do you have a pointer to a recent failure? I've been looking through logs.rdoproject.org and haven't found one yet. I don't know of any issues where neutron DHCP would have allocated the same IP twice; I think the DB would not have allowed it.

Revision history for this message
wes hayutin (weshayutin) wrote :

Brian,

http://logs.rdoproject.org/13/635913/2/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/32ee9e4/logs/bmc-console.log
[ 90.854150] openstackbmc[2648]: socket.error: [Errno 99] Cannot assign requested address
[ 91.617117] openstackbmc[2684]: socket.error: [Errno 99] Cannot assign requested address

Active examples can be found by navigating to http://cistatus.tripleo.org/ (give it 1.5 min to pull data)

1. once http://cistatus.tripleo.org/ renders you should see graphs
2. click on the link tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001
3. In the search field, search for "cannot"

Attaching screenshot.

Revision history for this message
wes hayutin (weshayutin) wrote :

FYI: example filter for finding IP assignment issues

Revision history for this message
wes hayutin (weshayutin) wrote :

I would also like to have my team investigate the possibility that the root cause is something in https://openstack-virtual-baremetal.readthedocs.io/en/latest/ . I'm not 100% convinced it's rdo-cloud.

Revision history for this message
Brian Haley (brian-haley) wrote :

So the error is coming from the ifup-eth script on the instance, which probably looks something like this:

# cat /etc/sysconfig/network-scripts/ifup-eth
if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then
  echo $"Error, some other host already uses address ${IPADDR}."
  exit 1
fi

That's going to do duplicate address detection on the IP.

My first thought was that somehow the distro was using the wrong 'arping' package, as this is iputils-arping syntax (not the standalone arping package), but I think in that case it would have thrown a usage error that would be obvious to spot.

So that leaves two possibilities:

1) Another system does have that IP configured. We can look at the ports on the subnet in question to figure out whether neutron somehow allocated it twice, with something like:

$ openstack port list --fixed-ip subnet=$subnet

2) Something in the network is responding for the IP because it's mis-configured. This is a little harder to track down, but something like:

$ ping -c1 $IP
$ arp -an

Then we'd have a MAC to chase down.
Running tcpdump could also be used to see where the response is coming from.
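
For example, a minimal sketch (interface and IP taken from the failure in the bug description; adjust to match the failing node):

$ sudo tcpdump -lnei eth1 'arp and host 172.18.0.79'

The -e flag prints the link-level header, so the MAC of whatever answers the ARP request shows up directly in the capture.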

Since I don't know much about the BMC, is there an easy way to insert that first command to get more info? Maybe even just in a test patch?

Revision history for this message
Gabriele Cerami (gcerami) wrote :

The arping output is suppressed, and we are using a catchall exception to assume somebody has our address.

I'll start by enhancing the output and error handling of that part; for example, unsuppressed arping output includes the MAC address of the replier directly, so there is no need to explicitly check the ARP table for case 2.
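
A minimal sketch of the kind of change I mean, reusing the arping invocation ifup-eth already has but dropping -q and keeping the output instead of discarding it:

if ! output=$(arping -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} 2>&1) ; then
  echo $"Error, some other host already uses address ${IPADDR}."
  # the unsuppressed output contains a line like "Unicast reply from <ip> [<mac>]"
  echo "arping output: ${output}"
  exit 1
fi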

I'll also increase information gathering, starting with the suggested command for case 1.

In the meantime, I'll try to understand if we can query the forwarding database in the switches we have access to, so we can start tracking down the port with the rogue MAC.

Also, for case 2, is something using proxy ARP somewhere?

Revision history for this message
Brian Haley (brian-haley) wrote :

Gabriele - nothing in neutron should be using proxy ARP, but there could be something misconfigured, or a piece of network equipment doing it.
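
A quick way to rule out the hosts we do have shell access to would be something like this (a sketch; it only covers Linux boxes, not network gear):

$ sysctl -a 2>/dev/null | grep proxy_arp

Any interface reporting net.ipv4.conf.<iface>.proxy_arp = 1 would be worth a closer look.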

Something I've seen before is that when the subnet was created in neutron, a block of IPs maybe wasn't excluded for non-neutron entities. For example, 'openstack subnet create' needs an '--allocation-pool start=10.0.0.15,end=10.0.0.254' since some pieces of gear are going to be using some of the IPs. I don't think this is the case here, but I've seen it happen when a switch was hard-coded with an IP and neutron assigned the same IP to a system.
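
For reference, a minimal sketch of what that looks like at subnet creation time (names and ranges are illustrative, not the ones used in this environment):

$ openstack subnet create my-subnet \
    --network my-network \
    --subnet-range 10.0.0.0/24 \
    --gateway 10.0.0.1 \
    --allocation-pool start=10.0.0.15,end=10.0.0.254

Anything outside the allocation pool is then left for statically configured gear and never handed out by neutron.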

wes hayutin (weshayutin)
tags: removed: promotion-blocker
Revision history for this message
Rafael Folco (rafaelfolco) wrote :

Just adding more info...

This failure happened 5 times in the last 357 jobs. This represents only 1.4%.

| 2019-07-29 20:20 | | 67 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-29 17:50 | | 68 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-26 01:13 | | 82 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-25 19:26 | | 66 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-25 18:03 | | 63 min | | IPMI to nodes failed. Introspection failed, cannot get IP

Revision history for this message
yatin (yatinkarel) wrote :

So today it got reproduced locally in an OVB deployment; some of the info I gathered:

Issue happened on controller-2, interface eth2 (IP: 172.17.0.25):
stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (A6:20:45:A7:BD:56) already uses address 172.17.0.25.

(undercloud) [zuul@undercloud ~]$ openstack server list -c Name -c Networks
+-------------------------+------------------------+
| Name | Networks |
+-------------------------+------------------------+
| overcloud-controller-0 | ctlplane=192.168.24.10 |
| overcloud-controller-2 | ctlplane=192.168.24.14 |
| overcloud-novacompute-0 | ctlplane=192.168.24.9 |
| overcloud-controller-1 | ctlplane=192.168.24.12 |
+-------------------------+------------------------+

Running the arping command on controller-2 manually returns the same MAC address:
[heat-admin@overcloud-controller-2 ~]$ arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [A6:20:45:A7:BD:56] 0.708ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

Now the interesting part: when the arping command is run from the other overcloud nodes or the undercloud for the same IP, the returned MAC address is different:
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.14 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [A6:20:45:A7:BD:56] 0.681ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.12 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [B6:29:F7:A1:64:97] 0.623ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.10 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [0E:D0:C7:58:65:07] 0.721ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.9 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [EE:79:BB:F1:55:37] 0.662ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [3E:87:D5:0A:EE:22] 3.453ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

tcpdump output on the undercloud:
(undercloud) [zuul@undercloud ~]$ arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [3E:87:D5:0A:EE:22] 4.002ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$
(undercloud) [zuul@undercloud ~]$ sudo tcpdump -lni any arp|grep 172.17.0.25
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-t...

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

With Rodolfo and Kieran we did some debugging today on the environment provided by Yatin. Finally we found out that the problem was the "arp_ignore" sysctl option, which was set to 0 (the default value) on the compute nodes.
According to the documentation at https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt, this means the kernel will reply to an ARP request if the requested IP address is on ANY of the interfaces in the system. And indeed we had exactly that: one of the compute nodes, which hosted one of the VMs connected to Yatin's controller-2, had exactly the same IP address configured on one of its interfaces (vlan111X).
When we switched "arp_ignore" to 1, the problem was gone.
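
For reference, checking and flipping that option on the affected compute node looks roughly like this (the sysctl.d file name is just illustrative):

$ sysctl net.ipv4.conf.all.arp_ignore
$ sudo sysctl -w net.ipv4.conf.all.arp_ignore=1
$ echo 'net.ipv4.conf.all.arp_ignore = 1' | sudo tee /etc/sysctl.d/90-arp-ignore.conf

With a value of 1 the kernel only answers ARP requests for addresses configured on the interface the request arrived on, instead of any address on the box.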

Revision history for this message
Marios Andreou (marios-b) wrote :

o/ seeing a similar issue today for periodic centos8 fs39 ... log pointers below. Per comment 14 above (https://bugs.launchpad.net/tripleo/+bug/1818060/comments/14) I also see that the node has net.ipv4.conf.all.arp_ignore = 0 in https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/763fc68/logs/undercloud/var/log/extra/sysctl.txt.gz

--- log pointers for related fs39 periodic fails :

        * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/763fc68/job-output.txt
                * 2020-04-20 02:57:22.439102 | primary | [2020/04/20 02:57:22 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (FA:16:3E:FC:D1:41) already uses address 10.0.0.1.

        * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/9a46184/job-output.txt
                * 2020-04-19 01:38:21.303050 | primary | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration
                * 2020-04-19 01:38:19.599994 | primary | TASK [Add eth2 interface from eth2.conf] ***************************************

Revision history for this message
Marios Andreou (marios-b) wrote :

18:23 < ykarel|away> marios, node meant compute node in rdo cloud not undercloud/overcloud vms
18:23 < marios> ykarel|away: i see so perhaps unrelated then
18:23 < marios> ykarel|away: thanks

18:23 < ykarel|away> marios, so u would need to involve ops to figure out if compute node still has arp ignore set to 0 somehow
18:24 < marios> ykarel|away: must be something new... weren't seeing that last week for fs39
18:24 < marios> ykarel|away: thanks for hints will check

18:24 < ykarel|away> marios, that issue is random
18:24 < ykarel|away> when ip match compute nodes vlan
18:24 < marios> ykarel|away: :/ is saw it on 2 runs though like 19/20th
18:25 < ykarel|away> marios, and there was no pass in that time duration?
18:25 < marios> ykarel|away: nah had some skips in meantime...
18:25 < marios> https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master# ykarel|away
18:26 < ykarel|away> marios, ack have more run, if it's random that bug is likely related

Revision history for this message
yatin (yatinkarel) wrote :

<< * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/763fc68/job-output.txt
                * 2020-04-20 02:57:22.439102 | primary | [2020/04/20 02:57:22 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (FA:16:3E:FC:D1:41) already uses address 10.0.0.1.

This is a different issue, not related to the RDO Cloud infra; it's happening because the 10.0.0.1 IP is associated with one of the overcloud nodes:
2020-04-20 02:50:41.693700 | primary | "network_details": {
2020-04-20 02:50:41.693705 | primary | "baremetal-61789-extra_0": {
2020-04-20 02:50:41.693710 | primary | "id": "178ac4d6-30aa-4ca6-8b54-b17692703396",
2020-04-20 02:50:41.693716 | primary | "ips": {
2020-04-20 02:50:41.693721 | primary | "private": [
2020-04-20 02:50:41.693727 | primary | {
2020-04-20 02:50:41.693732 | primary | "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:f0:54:e8",
2020-04-20 02:50:41.693737 | primary | "version": 4,
2020-04-20 02:50:41.693742 | primary | "addr": "192.168.100.70",
2020-04-20 02:50:41.693748 | primary | "OS-EXT-IPS:type": "fixed"
2020-04-20 02:50:41.693753 | primary | }
2020-04-20 02:50:41.693758 | primary | ],
2020-04-20 02:50:41.693764 | primary | "provision-61789": [
2020-04-20 02:50:41.693769 | primary | {
2020-04-20 02:50:41.693774 | primary | "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:d3:24:df",
2020-04-20 02:50:41.693780 | primary | "version": 4,
2020-04-20 02:50:41.693785 | primary | "addr": "192.0.2.18",
2020-04-20 02:50:41.693793 | primary | "OS-EXT-IPS:type": "fixed"
2020-04-20 02:50:41.693798 | primary | }
2020-04-20 02:50:41.693804 | primary | ],
2020-04-20 02:50:41.693809 | primary | "public-61789": [
2020-04-20 02:50:41.693814 | primary | {
2020-04-20 02:50:41.693819 | primary | "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:fc:d1:41",
2020-04-20 02:50:41.693824 | primary | "version": 4,
2020-04-20 02:50:41.693830 | primary | "addr": "10.0.0.1",
2020-04-20 02:50:41.693835 | primary | "OS-EXT-IPS:type": "fixed"
2020-04-20 02:50:41.693841 | primary | }
2020-04-20 02:50:41.693846 | primary | ]
2020-04-20 02:50:41.693851 | primary | }
2020-04-20 02:50:41.693857 | primary | },

and 10.0.0.1 is hardcoded at https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/playbooks/prepare-slave.yml#L44
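
If we want to confirm which neutron port on the host cloud holds that address, something like this should show it (a sketch):

$ openstack port list --fixed-ip ip-address=10.0.0.1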

Revision history for this message
Marios Andreou (marios-b) wrote :

thanks ykarel ... looks like you were right about this being some kind of race / inconsistent issue. The latest run from today fails on the overcloud deploy, so it looks like it got past that failing task.

I'll dig some more if we continue to see that in fs39, but for now I'll continue chasing the other bug I am seeing there ;)

Revision history for this message
Marios Andreou (marios-b) wrote :

Filed a new bug for the issue discussed in comments 15-18: https://bugs.launchpad.net/tripleo/+bug/1874418
