Nodes in periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master fail to get usable IPs through os-net-config with "Error, some other host (BE:E5:4F:B9:21:B0) already uses address"

Bug #1818060 reported by Gabriele Cerami
This bug affects 1 person
Affects: tripleo
Status: In Progress
Importance: Critical
Assigned to: Gabriele Cerami

Bug Description

logs at

https://logs.rdoproject.org/openstack-periodic/git.openstack.org/openstack-infra/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master/0398d7f/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-02-28_01_33_47

show nodes are failing to get a usable IP address.
First, this error appears:

2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [INFO] running ifup on interface: eth1
2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [ERROR] Failure(s) occurred when applying configuration
2019-02-28 01:33:47 | [2019/02/28 01:31:01 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (BE:E5:4F:B9:21:B0) already uses address 172.18.0.79.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | Traceback (most recent call last):
2019-02-28 01:33:47 | File "/bin/os-net-config", line 10, in <module>
2019-02-28 01:33:47 | sys.exit(main())
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 295, in main
2019-02-28 01:33:47 | activate=not opts.no_activate)
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 1696, in apply
2019-02-28 01:33:47 | raise os_net_config.ConfigurationError(message)
2019-02-28 01:33:47 | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Then, possibly as a consequence:

2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] Failure(s) occurred when applying configuration
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth5... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth4... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth3... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth2... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | [2019/02/28 01:33:41 AM] [ERROR] stdout:
2019-02-28 01:33:47 | Determining IP information for eth1... failed.
2019-02-28 01:33:47 | , stderr:
2019-02-28 01:33:47 | Traceback (most recent call last):
2019-02-28 01:33:47 | File "/bin/os-net-config", line 10, in <module>
2019-02-28 01:33:47 | sys.exit(main())
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/cli.py", line 295, in main
2019-02-28 01:33:47 | activate=not opts.no_activate)
2019-02-28 01:33:47 | File "/usr/lib/python2.7/site-packages/os_net_config/impl_ifcfg.py", line 1696, in apply
2019-02-28 01:33:47 | raise os_net_config.ConfigurationError(message)
2019-02-28 01:33:47 | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration

Tags: ci
Revision history for this message
wes hayutin (weshayutin) wrote :

 openstack server list | grep -i error | wc -l 
184

Revision history for this message
Gabriele Cerami (gcerami) wrote :

Proposing https://review.rdoproject.org/r/19048 to include cleanup of servers in ERROR state in the periodic cleanup.

wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
Ronelle Landy (rlandy) wrote :

Reopening this bug. We hit the port failures even when the stack list is clean, possibly while one stack is being created as another is being deleted.
Reopening to monitor this.

example: https://logs.rdoproject.org/08/653408/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/5c9b021/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-04-17_13_07_08

Changed in tripleo:
status: Fix Released → In Progress
Revision history for this message
wes hayutin (weshayutin) wrote :

DHCP should NOT distribute an IP that is already allocated. AFAICT the heat stacks are deleted as much as possible during a CI run [1]; additionally, cleanup scripts are running in the background.

[1] https://github.com/rdo-infra/review.rdoproject.org-config/blob/master/roles/ovb-manage/tasks/ovb-delete-stack.yml#L33-L71

IMHO we have a bug in neutron dhcp

Revision history for this message
Brian Haley (brian-haley) wrote :

Wes - do you have a pointer to a recent failure? I've been looking through logs.rdoproject.org and haven't found one yet. I don't know of any issues where neutron DHCP would have allocated the same IP twice; I think the DB would not have allowed it.

Revision history for this message
wes hayutin (weshayutin) wrote :

Brian,

http://logs.rdoproject.org/13/635913/2/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/32ee9e4/logs/bmc-console.log
[ 90.854150] openstackbmc[2648]: socket.error: [Errno 99] Cannot assign requested address
[ 91.617117] openstackbmc[2684]: socket.error: [Errno 99] Cannot assign requested address

Active examples can be found by navigating to http://cistatus.tripleo.org/ (give it 1.5 min to pull data)

1. once http://cistatus.tripleo.org/ renders you should see graphs
2. click on the link tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001
3. In the search field, search for "cannot"

Attaching screenshot.

Revision history for this message
wes hayutin (weshayutin) wrote :

FYI: example filter for finding IP assignment issues

Revision history for this message
wes hayutin (weshayutin) wrote :

I would also like to have my team investigate the possibility that the root cause is something in https://openstack-virtual-baremetal.readthedocs.io/en/latest/ . I'm not 100% convinced it's rdo-cloud.

Revision history for this message
Brian Haley (brian-haley) wrote :

So the error is coming from the ifup-eth script on the instance, which probably looks something like this:

# cat /etc/sysconfig/network-scripts/ifup-eth
if ! arping -q -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} ; then
  echo $"Error, some other host already uses address ${IPADDR}."
  exit 1
fi

That's going to do duplicate address detection on the IP.

My first thought was that somehow the distro was using the wrong 'arping' package, as this is iputils-arping syntax (not the standalone arping package), but I think in that case it would have thrown a usage error that would be obvious to spot.

So that leaves two possibilities:

1) Another system does have that IP configured. We can look at the ports on the subnet in question to figure out whether neutron somehow allocated it twice, with something like:

$ openstack port list --fixed-ip subnet=$subnet

2) Something in the network is responding for the IP because it's mis-configured. This is a little harder to track down, but something like:

$ ping -c1 $IP
$ arp -an

Then we'd have a MAC to chase down.
Running tcpdump could also be used to see where the response is coming from.
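
For example, a minimal sketch (interface and IP taken from the failure in the bug description; adjust to match the failing node):

$ sudo tcpdump -lnei eth1 'arp and host 172.18.0.79'

The -e flag prints the link-level header, so the MAC of whatever answers the ARP request shows up directly in the capture.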

Since I don't know much about the BMC, is there an easy way to insert that first command to get more info? Maybe even just in a test patch?

Revision history for this message
Gabriele Cerami (gcerami) wrote :

The arping output is suppressed, and we are using a catchall exception to assume somebody has our address.

I'll start by enhancing the output and error handling of that part; for example, unsuppressed arping output includes the MAC address of the replier directly, so there is no need to explicitly check the ARP table for case 2.
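
A minimal sketch of the kind of change I mean, reusing the arping invocation ifup-eth already has but dropping -q and keeping the output instead of discarding it:

if ! output=$(arping -c 2 -w 3 -D -I ${REALDEVICE} ${IPADDR} 2>&1) ; then
  echo $"Error, some other host already uses address ${IPADDR}."
  # the unsuppressed output contains a line like "Unicast reply from <ip> [<mac>]"
  echo "arping output: ${output}"
  exit 1
fi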

I'll also increase information gathering, starting with the suggested command for case 1.

In the meantime, I'll try to understand if we can query the forwarding database in the switches we have access to, so we can start tracking down the port with the rogue MAC.

Also, for case 2, is something using proxy ARP somewhere?

Revision history for this message
Brian Haley (brian-haley) wrote :

Gabriele - nothing in neutron should be using proxy ARP, but there could be something misconfigured, or a piece of network equipment doing it.
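
A quick way to rule out the hosts we do have shell access to would be something like this (a sketch; it only covers Linux boxes, not network gear):

$ sysctl -a 2>/dev/null | grep proxy_arp

Any interface reporting net.ipv4.conf.<iface>.proxy_arp = 1 would be worth a closer look.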

Something I've seen before is that when the subnet was created in neutron, a block of IPs maybe wasn't excluded for non-neutron entities. For example, 'openstack subnet create' needs an '--allocation-pool start=10.0.0.15,end=10.0.0.254' since some pieces of gear are going to be using some of the IPs. I don't think this is the case here, but I've seen it happen when a switch was hard-coded with an IP and neutron assigned the same IP to a system.
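
For reference, a minimal sketch of what that looks like at subnet creation time (names and ranges are illustrative, not the ones used in this environment):

$ openstack subnet create my-subnet \
    --network my-network \
    --subnet-range 10.0.0.0/24 \
    --gateway 10.0.0.1 \
    --allocation-pool start=10.0.0.15,end=10.0.0.254

Anything outside the allocation pool is then left for statically configured gear and never handed out by neutron.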

wes hayutin (weshayutin)
tags: removed: promotion-blocker
Revision history for this message
Rafael Folco (rafaelfolco) wrote :

Just adding more info...

This failure happened 5 times in the last 357 jobs. This represents only 1.4%.

| 2019-07-29 20:20 | | 67 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-29 17:50 | | 68 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-26 01:13 | | 82 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-25 19:26 | | 66 min | | IPMI to nodes failed. Introspection failed, cannot get IP
| 2019-07-25 18:03 | | 63 min | | IPMI to nodes failed. Introspection failed, cannot get IP

Revision history for this message
yatin (yatinkarel) wrote :

So today it got reproduced locally in an OVB deployment; some of the info I gathered:

Issue happened on controller-2, interface eth2 (IP: 172.17.0.25):
stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (A6:20:45:A7:BD:56) already uses address 172.17.0.25.

(undercloud) [zuul@undercloud ~]$ openstack server list -c Name -c Networks
+-------------------------+------------------------+
| Name | Networks |
+-------------------------+------------------------+
| overcloud-controller-0 | ctlplane=192.168.24.10 |
| overcloud-controller-2 | ctlplane=192.168.24.14 |
| overcloud-novacompute-0 | ctlplane=192.168.24.9 |
| overcloud-controller-1 | ctlplane=192.168.24.12 |
+-------------------------+------------------------+

Running the arping command on controller-2 manually returns the same MAC address:
[heat-admin@overcloud-controller-2 ~]$ arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [A6:20:45:A7:BD:56] 0.708ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

Now the interesting part: when the arping command is run from the other overcloud nodes or the undercloud for the same IP, the returned MAC address is different:
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.14 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [A6:20:45:A7:BD:56] 0.681ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.12 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [B6:29:F7:A1:64:97] 0.623ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.10 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [0E:D0:C7:58:65:07] 0.721ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ ssh heat-admin@192.168.24.9 sudo arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [EE:79:BB:F1:55:37] 0.662ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$ arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [3E:87:D5:0A:EE:22] 3.453ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

tcpdump output on the undercloud:
(undercloud) [zuul@undercloud ~]$ arping -c 2 -w 3 -D -I eth2 172.17.0.25
ARPING 172.17.0.25 from 0.0.0.0 eth2
Unicast reply from 172.17.0.25 [3E:87:D5:0A:EE:22] 4.002ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
(undercloud) [zuul@undercloud ~]$
(undercloud) [zuul@undercloud ~]$ sudo tcpdump -lni any arp|grep 172.17.0.25
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-t...

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

With Rodolfo and Kieran we did some debugging today on the environment provided by Yatin. Finally we found out that the problem was the "arp_ignore" sysctl option, which was set to 0 (the default value) on the compute nodes.
According to the documentation at https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt, this means the kernel will reply to an ARP request if the requested IP address is on ANY of the interfaces in the system. And indeed we had exactly that: one of the compute nodes, which hosted one of the VMs connected to Yatin's controller-2, had exactly the same IP address configured on one of its interfaces (vlan111X).
When we switched "arp_ignore" to 1, the problem was gone.
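
For reference, checking and flipping that option on the affected compute node looks roughly like this (the sysctl.d file name is just illustrative):

$ sysctl net.ipv4.conf.all.arp_ignore
$ sudo sysctl -w net.ipv4.conf.all.arp_ignore=1
$ echo 'net.ipv4.conf.all.arp_ignore = 1' | sudo tee /etc/sysctl.d/90-arp-ignore.conf

With a value of 1 the kernel only answers ARP requests for addresses configured on the interface the request arrived on, instead of any address on the box.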

Revision history for this message
Marios Andreou (marios-b) wrote :

o/ seeing a similar issue today for periodic centos8 fs39 ... log pointers below. Per comment 14 above (https://bugs.launchpad.net/tripleo/+bug/1818060/comments/14) I also see that the node has net.ipv4.conf.all.arp_ignore = 0 in https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/763fc68/logs/undercloud/var/log/extra/sysctl.txt.gz

--- log pointers for related fs39 periodic fails :

        * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/763fc68/job-output.txt
                * 2020-04-20 02:57:22.439102 | primary | [2020/04/20 02:57:22 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (FA:16:3E:FC:D1:41) already uses address 10.0.0.1.

        * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/9a46184/job-output.txt
                * 2020-04-19 01:38:21.303050 | primary | os_net_config.ConfigurationError: Failure(s) occurred when applying configuration
                * 2020-04-19 01:38:19.599994 | primary | TASK [Add eth2 interface from eth2.conf] ***************************************

Revision history for this message
Marios Andreou (marios-b) wrote :

18:23 < ykarel|away> marios, node meant compute node in rdo cloud not undercloud/overcloud vms
18:23 < marios> ykarel|away: i see so perhaps unrelated then
18:23 < marios> ykarel|away: thanks

18:23 < ykarel|away> marios, so u would need to involve ops to figure out if compute node still has arp ignore set to 0 somehow
18:24 < marios> ykarel|away: must be something new... weren't seeing that last week for fs39
18:24 < marios> ykarel|away: thanks for hints will check

18:24 < ykarel|away> marios, that issue is random
18:24 < ykarel|away> when ip match compute nodes vlan
18:24 < marios> ykarel|away: :/ is saw it on 2 runs though like 19/20th
18:25 < ykarel|away> marios, and there was no pass in that time duration?
18:25 < marios> ykarel|away: nah had some skips in meantime...
18:25 < marios> https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master# ykarel|away
18:26 < ykarel|away> marios, ack have more run, if it's random that bug is likely related

Revision history for this message
yatin (yatinkarel) wrote :

<< * https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp_1supp-featureset039-master/763fc68/job-output.txt
                * 2020-04-20 02:57:22.439102 | primary | [2020/04/20 02:57:22 AM] [ERROR] stdout: ERROR : [/etc/sysconfig/network-scripts/ifup-eth] Error, some other host (FA:16:3E:FC:D1:41) already uses address 10.0.0.1.

This is a different issue, not related to the RDO Cloud infra; it's happening because the 10.0.0.1 IP is associated with one of the overcloud nodes:
2020-04-20 02:50:41.693700 | primary | "network_details": {
2020-04-20 02:50:41.693705 | primary | "baremetal-61789-extra_0": {
2020-04-20 02:50:41.693710 | primary | "id": "178ac4d6-30aa-4ca6-8b54-b17692703396",
2020-04-20 02:50:41.693716 | primary | "ips": {
2020-04-20 02:50:41.693721 | primary | "private": [
2020-04-20 02:50:41.693727 | primary | {
2020-04-20 02:50:41.693732 | primary | "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:f0:54:e8",
2020-04-20 02:50:41.693737 | primary | "version": 4,
2020-04-20 02:50:41.693742 | primary | "addr": "192.168.100.70",
2020-04-20 02:50:41.693748 | primary | "OS-EXT-IPS:type": "fixed"
2020-04-20 02:50:41.693753 | primary | }
2020-04-20 02:50:41.693758 | primary | ],
2020-04-20 02:50:41.693764 | primary | "provision-61789": [
2020-04-20 02:50:41.693769 | primary | {
2020-04-20 02:50:41.693774 | primary | "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:d3:24:df",
2020-04-20 02:50:41.693780 | primary | "version": 4,
2020-04-20 02:50:41.693785 | primary | "addr": "192.0.2.18",
2020-04-20 02:50:41.693793 | primary | "OS-EXT-IPS:type": "fixed"
2020-04-20 02:50:41.693798 | primary | }
2020-04-20 02:50:41.693804 | primary | ],
2020-04-20 02:50:41.693809 | primary | "public-61789": [
2020-04-20 02:50:41.693814 | primary | {
2020-04-20 02:50:41.693819 | primary | "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:fc:d1:41",
2020-04-20 02:50:41.693824 | primary | "version": 4,
2020-04-20 02:50:41.693830 | primary | "addr": "10.0.0.1",
2020-04-20 02:50:41.693835 | primary | "OS-EXT-IPS:type": "fixed"
2020-04-20 02:50:41.693841 | primary | }
2020-04-20 02:50:41.693846 | primary | ]
2020-04-20 02:50:41.693851 | primary | }
2020-04-20 02:50:41.693857 | primary | },

and 10.0.0.1 is hardcoded at https://opendev.org/openstack/tripleo-quickstart-extras/src/branch/master/playbooks/prepare-slave.yml#L44
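
If we want to confirm which neutron port on the host cloud holds that address, something like this should show it (a sketch):

$ openstack port list --fixed-ip ip-address=10.0.0.1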

Revision history for this message
Marios Andreou (marios-b) wrote :

thanks ykarel ... looks like you were right about this being some kind of race / inconsistent issue. The latest run from today fails on the overcloud deploy, so it looks like it got past that failing task.

I'll dig some more if we continue to see that in fs39, but for now I'll continue chasing the other bug I am seeing there ;)

Revision history for this message
Marios Andreou (marios-b) wrote :

Filed a new bug for the issue discussed in comments 15-18: https://bugs.launchpad.net/tripleo/+bug/1874418
