Issue when using a chain policy in iptables on master
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo | Fix Released | Medium | Slawek Kaplonski | | |
Bug Description
Hello there,
Context:
We want to move our custom tripleo firewall rules into a dedicated chain, and jump from INPUT to that new TRIPLEO_INPUT chain.
The patch is here: https:/
The reason: it makes things clearer and easier to manage, and it avoids any lockout while the rules are being applied.
In order to make the whole thing more customizable and not really depend on the actual order before hitting a final DROP, we're switching to a chain policy (namely, -P INPUT DROP) instead of the final '-A INPUT -m conntrack --ctstate NEW -m comment --comment "999 drop all ipv4" -j DROP'.
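For readers who want to picture the change, here is a minimal sketch of the layout described above. It is not the content of the patch: the ESTABLISHED,RELATED rule is a hypothetical placeholder, and the `run` helper and `APPLY` variable are names made up for this example. By default the script only prints the commands; set `APPLY=1` as root on a disposable host to actually run them.

```shell
#!/bin/sh
# Sketch only, under stated assumptions: prints the iptables commands
# unless APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

setup_tripleo_input() {
    # Create the dedicated chain and send INPUT traffic through it.
    run iptables -N TRIPLEO_INPUT
    run iptables -A INPUT -j TRIPLEO_INPUT
    # Hypothetical placeholder rule; the real rules come from the patch.
    run iptables -A TRIPLEO_INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT
    # Chain policy replaces the trailing "999 drop all ipv4" rule.
    # It is set last so the accept rules are in place before anything is dropped.
    run iptables -P INPUT DROP
}

setup_tripleo_input
```

Setting the policy last is a deliberate choice in this sketch: it keeps the window in which a remote session could be locked out as small as possible.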
While it seems to be pretty much the same, it isn't: something isn't properly allowed, apparently related to ARP (nftables apparently covers both l2 and l3).
As a consequence, there are issues within the RDO jobs, for instance this one:
periodic-
It can deploy both UC and OC, but during the tempest phase, it tries to ping a virtual router, and fails:
TASK [os_tempest : Ping router ip address] *******
Friday 06 May 2022 12:42:51 -0400 (0:00:00.087) 2:03:29.990 ************
FAILED - RETRYING: Ping router ip address (5 retries left).
FAILED - RETRYING: Ping router ip address (4 retries left).
FAILED - RETRYING: Ping router ip address (3 retries left).
FAILED - RETRYING: Ping router ip address (2 retries left).
FAILED - RETRYING: Ping router ip address (1 retries left).
fatal: [undercloud]: FAILED! => {"attempts": 5, "changed": true, "cmd": "set -e\nping -c2 \"10.0.0.184\"\n", "delta": "0:00:03.118020", "end": "2022-05-06 12:44:10.409759", "msg": "non-zero return code", "rc": 1, "start": "2022-05-06 12:44:07.291739", "stderr": "", "stderr_lines": [], "stdout": "PING 10.0.0.184 (10.0.0.184) 56(84) bytes of data.\nFrom 10.0.0.1 icmp_seq=1 Destination Host Unreachable\nFrom 10.0.0.1 icmp_seq=2 Destination Host Unreachable\n\n--- 10.0.0.184 ping statistics ---\n2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms\npipe 2", "stdout_lines": ["PING 10.0.0.184 (10.0.0.184) 56(84) bytes of data.", "From 10.0.0.1 icmp_seq=1 Destination Host Unreachable", "From 10.0.0.1 icmp_seq=2 Destination Host Unreachable", "", "--- 10.0.0.184 ping statistics ---", "2 packets transmitted, 0 received, +2 errors, 100% packet loss, time 1061ms", "pipe 2"]}
This is fully reproducible in any lab with the mentioned patch - you can get a small env, with one UC and one Controller, create the networks and router, get the router IP, and it will fail to ping.
Fun things:
- If you add an IP in the router network on the right interface of the UC, as well as ensure the route table contains the right line for that same network, it will ping; once you remove this IP and related route, it will STILL ping
- That router IP doesn't answer pings from the Controller either when the patch is applied, while it does answer on plain master
- While the reproducer is easy to get, you'll need to drop everything (both UC and OC) in order to do more tests once it pings
It really, really looks like an l2 issue rather than an l3 one. Pretty sure the "drop" policy within nftables chains is also applied to ARP, which may be the issue - and, if this is the case, we'll need to find a way to tell nftables to allow ARP on some interfaces.
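One way to test the l2 hypothesis is to look at neighbour resolution directly; the "Destination Host Unreachable" replies coming from the local address 10.0.0.1 in the log above are what ping reports when ARP gets no answer. The sketch below is a suggestion, not a confirmed diagnosis: `10.0.0.184` is the router address from the failing job, while the interface name `br-ctlplane` and the `run`/`APPLY` helper are assumptions made for this example. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only, under stated assumptions: prints the diagnostic commands
# unless APPLY=1 is set in the environment.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

ROUTER_IP="${ROUTER_IP:-10.0.0.184}"  # router address taken from the job log
IFACE="${IFACE:-br-ctlplane}"         # assumed interface name; adjust to your env

check_l2() {
    # Neighbour entry for the router: FAILED or INCOMPLETE means ARP got no reply.
    run ip neigh show "$ROUTER_IP"
    # Full nftables view, to see which families and chains carry a drop policy.
    run nft list ruleset
    # Capture up to 10 ARP packets on the interface while pinging from another shell.
    run tcpdump -n -c 10 -i "$IFACE" arp
}

check_l2
```

Running this on both the patched and the unpatched environment and comparing the neighbour state and the ruleset should show whether ARP is really what differs.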
I'm able to provide 2 envs in parallel, one patched, one unpatched, if needed.
Thank you for your help!
Changed in tripleo:
assignee: nobody → Slawek Kaplonski (slaweq)
Some more links: https://review.rdoproject.org/r/c/testproject/+/42613 (check the commit message for the tripleo-ansible patch)
here's a testproject job where we add the policy without playing with other chains:
https:/
here's a testproject with the actual TRIPLEO_INPUT patch: https://review.rdoproject.org/r/c/testproject/+/42344