Nodes periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master fail to get usable IPs through os-net-config with Error, some other host (BE:E5:4F:B9:21:B0) already uses address

Bug #1818060 reported by Gabriele Cerami on 2019-02-28
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

logs at

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/0398d7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-02-28_01_33_47

show nodes failing to get a usable IP address.
First, this error appears:

2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [INFO] running ifup on interface: eth1
2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [ERROR] Failure(s) occurred when applying configuration
2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (BE:E5:4F:B9:21:B0) already uses address 172.18.0.79.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | Traceback (most recent call last):
2019-02-28 01:33:47 | File "/bin/os-net-config", line 10, in <module>
2019-02-28 01:33:47 | sys.exit(main())
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 295, in main
2019-02-28 01:33:47 | activate=not opts.no_activate)
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 1696, in apply
2019-02-28 01:33:47 | raise os_net_config.ConfigurationError(message)
2019-02-28 01:33:47 | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Then, maybe as a consequence:

2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] Failure(s) occurred when applying configuration
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth5... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth4... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth3... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth2... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth1... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | Traceback (most recent call last):
2019-02-28 01:33:47 | File "/bin/os-net-config", line 10, in <module>
2019-02-28 01:33:47 | sys.exit(main())
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 295, in main
2019-02-28 01:33:47 | activate=not opts.no_activate)
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 1696, in apply
2019-02-28 01:33:47 | raise os_net_config.ConfigurationError(message)
2019-02-28 01:33:47 | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Tags: ci
wes hayutin (weshayutin) wrote :

 openstack server list | grep -i error | wc -l 
184

Gabriele Cerami (gcerami) wrote :

Proposing https://review.rdoproject.org/r/19048 to include cleanup of ERROR-state servers in the periodic cleanup.
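
The cleanup amounts to deleting servers stuck in ERROR; a minimal sketch of the idea (the actual change is in the review above, this just uses the stock openstack CLI):

# Print bare IDs of ERROR-state servers and delete them one at a time.
openstack server list --status ERROR -f value -c ID \
  | xargs -r -n1 openstack server delete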

wes hayutin (weshayutin) on 2019-03-06
Changed in tripleo:
status: Triaged → Fix Released
Ronelle Landy (rlandy) wrote :

Reopening this bug. We hit the port failures even when the stack list is clean, possibly when one stack is being created while another is being deleted.
Reopening to monitor this.

example: https://logs.rdoproject.org/08/653408/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/5c9b021/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-04-17_13_07_08
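
One way to confirm the suspected overlap around the failure window, assuming the Heat plugin for the openstack CLI is available, would be a check along these lines:

# Any output here means a stack create and/or delete was still in flight.
openstack stack list | grep -E 'CREATE_IN_PROGRESS|DELETE_IN_PROGRESS'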

Changed in tripleo:
status: Fix Released → In Progress
wes hayutin (weshayutin) wrote :

DHCP should NOT distribute an IP that is already allocated. AFAICT the Heat stacks are deleted as much as possible during a CI run [1]; additionally, cleanup scripts are running in the background.

[1] https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/roles/ovb-manage/tasks/ovb-delete-stack.yml#L33-L71

IMHO we have a bug in neutron dhcp

Brian Haley (brian-haley) wrote :

Wes - do you have a pointer to a recent failure? I've been looking through logs.rdoproject.org and haven't found one yet. I don't know of any issues where neutron DHCP would have allocated the same IP twice; I think the DB would not have allowed it.

wes hayutin (weshayutin) wrote :

Brian,

http://logs.rdoproject.org/13/635913/2/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/32ee9e4/logs/bmc-console.log
[ 90.854150] openstackbmc[2648]: socket.error: [Errno 99] Cannot assign requested address
[ 91.617117] openstackbmc[2684]: socket.error: [Errno 99] Cannot assign requested address

Active examples can be found by navigating to http://cistatus.tripleo.org/ (give it ~1.5 minutes to pull data):

1. Once http://cistatus.tripleo.org/ renders, you should see graphs.
2. Click on the link tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001.
3. In the search field, search for "cannot".

Attaching screenshot.

wes hayutin (weshayutin) wrote :

FYI: example filter for finding IP assignment issues.

wes hayutin (weshayutin) wrote :

I would also like to have my team investigate https://openstack-virtual-baremetal.readthedocs.io/en/latest/ as a possible root cause. I'm not convinced it's 100% an RDO Cloud issue.

Brian Haley (brian-haley) wrote :

So the error is coming from the ifup-eth script on the instance, which probably looks something like this:

# cat /etc/sysconfig/network-scripts/ifup-eth
if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then
  echo $"Error, some other host already uses address ${IPADDR}."
  exit 1
fi

That's going to do duplicate address detection on the IP.

My first thought was that somehow the distro was using the wrong 'arping' package, as this is iputils-arping syntax (not the standalone arping package), but I think in that case it would have thrown an error with a usage message that would be obvious to spot.

So that leaves two possibilities:

1) Another system does have that IP configured. We can look at the ports on the subnet in question to figure out whether neutron somehow allocated it twice, with something like:

$ openstack port list --fixed-ip subnet=$subnet

2) Something in the network is responding for the IP because it's mis-configured. This is a little harder to track down, but something like:

$ ping -c1 $IP
$ arp -an

Then we'd have a MAC to chase down.
Running tcpdump could also be used to see where the response is coming from; a sketch of both checks follows.
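
Putting both together against the failing address (172.18.0.79 and eth1 are taken from the log above; $subnet is a placeholder for whichever subnet the port lives on):

# Case 1: ask neutron which port(s) own the address on this subnet.
# More than one row here would mean neutron allocated it twice.
openstack port list --fixed-ip subnet=$subnet,ip-address=172.18.0.79

# Case 2: provoke a reply from whoever holds the address, then read the MAC.
ping -c1 172.18.0.79
arp -an | grep 172.18.0.79

# Watching ARP traffic directly also shows who answers for the address.
tcpdump -n -e -i eth1 'arp and host 172.18.0.79'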

Since I don't know much about the BMC, is there an easy way to insert that first command to get more info? Maybe even just in a test patch?

Gabriele Cerami (gcerami) wrote :

The arping output is suppressed, and we are using a catchall exception to assume somebody has our address.

I'll start by enhancing the output and error handling of that part. For example, arping's unsuppressed output includes the MAC address of the replier directly, so there is no need to explicitly check the ARP table for case 2.
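
For reference, this is roughly what the unsuppressed check looks like (interface, address and MAC are taken from the log at the top of the bug; the timing is illustrative):

arping -c 2 -w 3 -D -I eth1 172.18.0.79
# Typical output when someone answers (exit status is non-zero in -D mode):
#   ARPING 172.18.0.79 from 0.0.0.0 eth1
#   Unicast reply from 172.18.0.79 [BE:E5:4F:B9:21:B0]  0.6ms
#   Sent 2 probes (2 broadcast(s))
#   Received 1 response(s)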

I'll also increase information gathering, starting with the suggested command for case 1.

In the meantime, I'll try to work out whether we can query the forwarding database on the switches we have access to, so we can start tracking down the port with the rogue MAC.

Also, for case 2: is something using proxy ARP somewhere?

Brian Haley (brian-haley) wrote :

Gabriele - nothing in neutron should be using proxy ARP, but there could be something misconfigured, or a piece of network equipment doing it.

Something I've seen before is that, when the subnet was created in neutron, a block of IPs maybe wasn't set aside for non-neutron entities. For example, 'openstack subnet create' needs a '--allocation-pool start=10.0.0.15,end=10.0.0.254' since some pieces of gear are going to be using some of the IPs. I don't think this is the case here, but I've seen it happen when a switch was hard-coded with an IP and neutron assigned the same IP to a system.
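
A sketch of what that looks like at subnet-creation time (the network name, subnet name and CIDR below are made-up examples, not the values used in this cloud; only the pool boundaries come from the comment above):

# Addresses below 10.0.0.15 stay out of neutron's hands, for switches, BMCs, etc.
openstack subnet create example-subnet \
  --network example-net \
  --subnet-range 10.0.0.0/24 \
  --allocation-pool start=10.0.0.15,end=10.0.0.254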

wes hayutin (weshayutin) on 2019-05-23
tags: removed: promotion-blocker
Rafael Folco (rafaelfolco) wrote :

Just adding more info...

This failure happened 5 times in the last 357 jobs. This represents only 1.4%.

| 2019-07-29 20:20 | 67 min | IPMI to nodes failed. Introspection failed, cannot get IP |
| 2019-07-29 17:50 | 68 min | IPMI to nodes failed. Introspection failed, cannot get IP |
| 2019-07-26 01:13 | 82 min | IPMI to nodes failed. Introspection failed, cannot get IP |
| 2019-07-25 19:26 | 66 min | IPMI to nodes failed. Introspection failed, cannot get IP |
| 2019-07-25 18:03 | 63 min | IPMI to nodes failed. Introspection failed, cannot get IP |
