Autopkgtest systemd fail against dnsmasq 2.86 (22.04)

Bug #1957086 reported by Christian Ehrhardt 
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
dnsmasq (Ubuntu)
Fix Released
Undecided
Unassigned
systemd (Ubuntu)
Fix Released
Undecided
Lukas Märdian

Bug Description

Full log (same on all architectures):
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy/jammy/arm64/s/systemd/20220106_214401_24d29@/log.gz

The test is from systemd:
https://github.com/systemd/systemd/blob/main/test/networkd-test.py#L619

Tail of the log:
```
ERROR: test_resolved_domain_restricted_dns (__main__.DnsmasqClientTest)
resolved: domain-restricted DNS servers
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/autopkgtest.NdIvJq/build.fBQ/src/test/networkd-test.py", line 678, in test_resolved_domain_restricted_dns
    out = subprocess.check_output(['resolvectl', 'query', 'math.lab'])
  File "/usr/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['resolvectl', 'query', 'math.lab']' returned non-zero exit status 1.
```

Tagging update-excuse to be shown in excuses

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Detailed Logs:

Before the test:
dnsmasq.log is identical
dnsmasq-vpn.log as well, except the version number

Good Case:
ubuntu@autopkgtest:~$ resolvectl query math.lab
math.lab: 10.241.3.3 -- link: testvpnrouter
-- Information acquired via protocol DNS in 20.8ms.
-- Data is authenticated: no; Data was acquired via local or encrypted transport: no
-- Data from: network

Bad Case:
ubuntu@autopkgtest:~$ resolvectl query math.lab
math.lab: resolve call failed: Connection timed out

After the test:
- dnsmasq.log is still identical
- dnsmasq-vpn.log is quite different

Entries in Good case:
Jan 11 11:54:18 dnsmasq[7952]: query[A] math.lab from 10.241.3.1
Jan 11 11:54:18 dnsmasq[7952]: config math.lab is 10.241.3.3
Jan 11 11:54:18 dnsmasq[7952]: query[AAAA] math.lab from 10.241.3.1
Jan 11 11:54:18 dnsmasq[7952]: config math.lab is NODATA-IPv6

Entries in Bad case:
Jan 11 11:54:16 dnsmasq[8871]: query[A] math.lab from 10.241.3.1
Jan 11 11:54:16 dnsmasq[8871]: config math.lab is 10.241.3.3
Jan 11 11:54:16 dnsmasq[8871]: query[AAAA] math.lab from 10.241.3.1
Jan 11 11:54:16 dnsmasq[8871]: forwarded math.lab to 127.0.0.53
Jan 11 11:54:16 dnsmasq[8871]: query[AAAA] math.lab from 10.241.3.1
Jan 11 11:54:16 dnsmasq[8871]: forwarded math.lab to 127.0.0.53
... the last two lines repeat until a timeout occurs

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reproducing the case outside the systemd tests

(Commands on Ubuntu 22.04, with root permissions)
jammy-Proposed has dnsmasq 2.86 right now

apt update; apt upgrade -y; apt install dnsmasq-base

systemctl reset-failed systemd-networkd systemd-resolved

mkdir /run/systemd/resolved.conf.d
cat > /run/systemd/resolved.conf.d/test-enable-dnssec.conf << EOF
[Resolve]
DNSSEC=allow-downgrade
LLMNR=no
MulticastDNS=no
DNSOverTLS=no
EOF

ip link add name test_eth42 address de:ad:be:ef:47:11 type veth peer name router_eth42
ip a flush dev router_eth42
ip a add 192.168.5.1/24 dev router_eth42
ip link set router_eth42 up

dnsmasq --no-daemon --log-queries --log-facility=/tmp/dnsmasq.log --conf-file=/dev/null --dhcp-leasefile=/tmp/dnsmasq.leases --bind-interfaces --interface=router_eth42 --except-interface=lo --dhcp-range=192.168.5.10,192.168.5.200 --address=/#/192.168.42.1 &

cat > /run/systemd/network/general.network << EOF
[Match]
Name=test_eth42
[Network]
DHCP=ipv4
IPv6AcceptRA=False
DNSSECNegativeTrustAnchors=search.example.com
EOF

ip link add name testvpnclient type veth peer name testvpnrouter
ip a flush dev testvpnrouter
ip a add 10.241.3.1/24 dev testvpnrouter
ip link set testvpnrouter up

dnsmasq --no-daemon --log-queries --log-facility=/tmp/dnsmasq-vpn.log --conf-file=/dev/null --dhcp-leasefile=/dev/null --bind-interfaces --interface=testvpnrouter --except-interface=lo --address=/math.lab/10.241.3.3 --address=/cantina.company/10.241.4.4 &

cat > /run/systemd/network/vpn.network << EOF
[Match]
Name=testvpnclient
[Network]
IPv6AcceptRA=False
Address=10.241.3.2/24
DNS=10.241.3.1
Domains=~company ~lab
DNSSECNegativeTrustAnchors=company la
EOF

systemctl restart systemd-networkd
/usr/lib/systemd/systemd-networkd-wait-online --interface test_eth42 --interface=testvpnclient --timeout=20
systemctl restart systemd-resolved

# The original test runs "resolvectl query math.lab"
# That would probe everything, do this step by step

#1 ipv4 works and looks pretty much the same result on good/bad case
dig @10.241.3.1 -t A math.lab

root@j-dnsmasq-proposed:~# dig @10.241.3.1 -t A math.lab
dnsmasq: query[A] math.lab from 10.241.3.1
dnsmasq: config math.lab is 10.241.3.3

; <<>> DiG 9.16.15-Ubuntu <<>> @10.241.3.1 -t A math.lab
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11869
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;math.lab. IN A

;; ANSWER SECTION:
math.lab. 0 IN A 10.241.3.3

;; Query time: 0 msec
;; SERVER: 10.241.3.1#53(10.241.3.1)
;; WHEN: Tue Jan 11 13:09:07 UTC 2022
;; MSG SIZE rcvd: 53

#2 ipv6 fails and gets into a loop
dig @10.241.3.1 -t AAAA math.lab

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Result of the repro:

#2 ipv6 fails and gets into a loop
dig @10.241.3.1 -t AAAA math.lab

#2a - Good case (dnsmasq 2.85)
root@j-dnsmasq-release:~# dig @10.241.3.1 -t AAAA math.lab
dnsmasq: query[AAAA] math.lab from 10.241.3.1
dnsmasq: config math.lab is NODATA-IPv6

; <<>> DiG 9.16.15-Ubuntu <<>> @10.241.3.1 -t AAAA math.lab
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48650
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;math.lab. IN AAAA

;; Query time: 0 msec
;; SERVER: 10.241.3.1#53(10.241.3.1)
;; WHEN: Tue Jan 11 13:09:21 UTC 2022
;; MSG SIZE rcvd: 37

#2b - Bad case (dnsmasq 2.86)

root@j-dnsmasq-proposed:~# dig @10.241.3.1 -t AAAA math.lab
dnsmasq: query[AAAA] math.lab from 10.241.3.1
dnsmasq: forwarded math.lab to 127.0.0.53
dnsmasq: query[AAAA] math.lab from 10.241.3.1
dnsmasq: forwarded math.lab to 127.0.0.53
dnsmasq: query[AAAA] math.lab from 10.241.3.1
...
dnsmasq: Maximum number of concurrent DNS queries reached (max: 150)
dnsmasq: config error is REFUSED
... repeats infinitely

There is no ipv6 answer configured, so it is ok to return an empty/bad answer.

But dnsmasq should not fall into a forwarding loop right?
Or does this test make a hard configuration mistake (always did) that now
in 2.86 exposes an issue, but was always wrong?

The original test command of `resolvectl query math.lab` gets into a "wait
until timeout" with the new behavior.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

# FYI cleanup

# Minimal Reset (e.g. when switching dnsmasq)

killall dnsmasq
rm /tmp/dnsmasq*
dnsmasq --no-daemon --log-queries --log-facility=/tmp/dnsmasq.log --conf-file=/dev/null --dhcp-leasefile=/tmp/dnsmasq.leases --bind-interfaces --interface=router_eth42 --except-interface=lo --dhcp-range=192.168.5.10,192.168.5.200 --address=/#/192.168.42.1 &
dnsmasq --no-daemon --log-queries --log-facility=/tmp/dnsmasq-vpn.log --conf-file=/dev/null --dhcp-leasefile=/dev/null --bind-interfaces --interface=testvpnrouter --except-interface=lo --address=/math.lab/10.241.3.3 --address=/cantina.company/10.241.4.4 &
dig @10.241.3.1 -t AAAA math.lab

^^ With this I could confirm that switchign the good case to dnsmasq 2.86 makes
it enter the forwarding loop. Vice versa switching the bad system to dnsmasq
2.85 makes it work (not entering the loop).

# Full cleanup
sudo killall dnsmasq
sudo rm /run/systemd/network/general.network
sudo rm /run/systemd/network/vpn.network
sudo rm /run/systemd/resolved.conf.d/test-enable-dnssec.conf
for i in testvpnrouter testvpnclient test_eth42 router_eth42; do sudo ip link dev $i down; done
for i in testvpnrouter testvpnclient test_eth42 router_eth42; do sudo ip link del dev $i down; done
sudo systemctl restart systemd-resolved
sudo systemctl restart systemd-networkd

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As usual once you know what you search for the results are much better.

It turns out that it might be related to one of those reports:
- https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2022q1/016015.html
- https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2022q1/016069.html

But even better was the finding that Debian at 249-5-2 was affected by the same.
- https://ci.debian.net/packages/s/systemd/unstable/amd64/
- https://ci.debian.net/data/autopkgtest/unstable/amd64/s/systemd/16383273/log.gz

That seems to be fixed since systemd 249.6-1 and checking that made me find

To me there is nothing obvious in regard to this in debian
=> https://salsa.debian.org/systemd-team/systemd/-/commits/debian/master
nor in upstream
=> https://github.com/systemd/systemd-stable/commits/v249.6

But I think this is enough to get the bug reported and maybe eyes more used
to systemd will see the related fix as it seems it must be there somewhere ...

@systemd maintainer (Lukas?) does this ring a bell to you or do you see the change that might have fixed this?
Also is there a PPA with a Ubuntu build of >=249.6-1 to check if it resolves the issue also in Ubuntu?

tags: added: update-excuse
Changed in systemd (Ubuntu):
assignee: nobody → Lukas Märdian (slyon)
tags: added: server-next
Revision history for this message
Lukas Märdian (slyon) wrote :

Thank you for all the investigation, Christian!
I was planning to upgrade to >= v249.6 for Jammy anyway (probably 249.9). Will check the logs/patches for anything suspicious and verify if this solves this issue.

There is a bileto PPA that contains a 249.7 prototype: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4704/+packages

It might be worth a try copying the dnsmasq package into there and see if the tests pass. Let me check that.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I tested the test of 249.5-2ubuntu2 (running autopkgtest against the dsc file of 249.5-2ubuntu2) vs the binary systemd build in your PPA (systemd_249.7-1ubuntu1~slyon0) + the dnsmasq from proposed.

And I got:
resolved: domain-restricted DNS servers ... ok

That means:
1. the fix is not in debian/tests of the new systemd version
2. that your new build works to resolve that issue

Lukas Märdian (slyon)
Changed in systemd (Ubuntu):
status: New → In Progress
Revision history for this message
Michael Biebl (mbiebl) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote (last edit ):

Hi Michael,
it seems suspicious as it is the same test that fails.
But the assertion is different.
In the referred Debian bug it is
  AssertionError: b'megasearch.net: 192.168.42.1' not found in
In the Ubuntu fail it is
  subprocess.CalledProcessError: Command '['resolvectl', 'query', 'math.lab']' returned non-zero exit status 1.

Also we did show that 2.86-1 worked fine against systemd 249.7-1ubuntu1 without a change to dnsmasq.
Finally the new dnsmasq 2.86-1.1 tested against systemd 249.5-2ubuntu4 here
https://autopkgtest.ubuntu.com/results/autopkgtest-jammy/jammy/amd64/s/systemd/20220122_111714_33b38@/log.gz
It still shows the same issue as reported here.

The referenced error OTOH fails in Debian with 247.9-1 which works for this issue here.

So while indeed very close, I doubt it is the same as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995655
Overall I guess eventually we need both fixes anyway, the one in dnsmasq (sync already happened 2.86-1.1 is in j-proposed) and a newer systemd upload.

tags: removed: server-next
Revision history for this message
Lukas Märdian (slyon) wrote :

This should now be solved, as of 249.9-0ubuntu2

Changed in systemd (Ubuntu):
status: In Progress → Fix Committed
Lukas Märdian (slyon)
Changed in systemd (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Bryce Harrington (bryce) wrote :

As well, the sync has successfully migrated now:

 dnsmasq | 2.86-1.1 | jammy
 dnsmasq | 2.86-1.1 | unstable

Changed in dnsmasq (Ubuntu):
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.