dnsmasq on Ubuntu Jammy crashes on neutron-dhcp-agent updates

Bug #2026757 reported by Julia Kreger
This bug affects 3 people
Affects            Status      Importance  Assigned to  Milestone
Ironic             Triaged     Critical    Unassigned
neutron            New         Low         Unassigned
dnsmasq (Ubuntu)   Invalid     Undecided   Unassigned
  Jammy            Incomplete  Undecided   Unassigned
  Kinetic          Won't Fix   Undecided   Unassigned
  Lunar            Invalid     Undecided   Unassigned
  Mantic           Invalid     Undecided   Unassigned

Bug Description

The Ironic project's CI has hit major blocking issues moving to Ubuntu Jammy. After some investigation we isolated the problem to DHCP updates causing dnsmasq to crash on Ubuntu Jammy, which ships dnsmasq 2.86. This sounds similar to an issue already known to the dnsmasq maintainers, in which dnsmasq would crash on updates triggered by a configuration refresh[0].

In response, we upgraded dnsmasq to the version that ships with Ubuntu Lunar, which was no better: dnsmasq still crashed on record updates when configuration for addresses and ports was added, changed, or removed.

We later downgraded to the version of dnsmasq shipped in Ubuntu Focal, and dnsmasq stopped crashing and appeared stable enough to utilize for CI purposes.

** Kernel log from Ubuntu Jammy Package **

[229798.876726] dnsmasq[81586]: segfault at 7c28 ip 00007f6e8313147e sp 00007fffb3d6f830 error 4 in libc.so.6[7f6e830b4000+195000]
[229798.876745] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[229805.444912] dnsmasq[401428]: segfault at dce8 ip 00007fe63bf6a47e sp 00007ffdb105b440 error 4 in libc.so.6[7fe63beed000+195000]
[229805.444933] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[230414.213448] dnsmasq[401538]: segfault at 78b8 ip 00007f12160e447e sp 00007ffed6ef2190 error 4 in libc.so.6[7f1216067000+195000]
[230414.213467] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[230465.098989] dnsmasq[402665]: segfault at c378 ip 00007f81458f047e sp 00007fff0db334a0 error 4 in libc.so.6[7f8145873000+195000]
[230465.099005] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[231787.247374] dnsmasq[402863]: segfault at 7318 ip 00007f3940b9147e sp 00007ffc8df4f010 error 4 in libc.so.6[7f3940b14000+195000]
[231787.247392] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[231844.886399] dnsmasq[405182]: segfault at dc58 ip 00007f32a29e147e sp 00007ffddedd7480 error 4 in libc.so.6[7f32a2964000+195000]
[231844.886420] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[234692.482154] dnsmasq[405289]: segfault at 67d8 ip 00007fab0c5c447e sp 00007fffd6fd8fa0 error 4 in libc.so.6[7fab0c547000+195000]
[234692.482173] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a

** Kernel log entries from Ubuntu Lunar package **

[234724.842339] dnsmasq[409843]: segfault at fffffffffffffffd ip 00007f35a147647e sp 00007ffd536038c0 error 5 in libc.so.6[7f35a13f9000+195000]
[234724.842368] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[234784.918116] dnsmasq[410019]: segfault at fffffffffffffffd ip 00007f634233947e sp 00007fff33877f20 error 5 in libc.so.6[7f63422bc000+195000]
[234784.918133] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[235022.163339] dnsmasq[410151]: segfault at fffffffffffffffd ip 00007f21dd37f47e sp 00007fff9bf416d0 error 5 in libc.so.6[7f21dd302000+195000]
[235022.163362] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[235024.831325] dnsmasq[410445]: segfault at fffffffffffffffd ip 00007f7edf02147e sp 00007ffc4fb19cd0 error 5 in libc.so.6[7f7edefa4000+195000]
[235024.831354] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[236052.793683] dnsmasq[410630]: segfault at fffffffffffffffd ip 00007f3046ca147e sp 00007ffe5583df50 error 5 in libc.so.6[7f3046c24000+195000]
[236052.793704] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
[236105.451351] dnsmasq[412107]: segfault at fffffffffffffffd ip 00007f4425bcd47e sp 00007fffd5337560 error 5 in libc.so.6[7f4425b50000+195000]
[236105.451368] Code: 98 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 92 39 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 18 39 17 00 64 48 83 3a
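The kernel lines above can be compared mechanically. The following is a minimal Python sketch, not part of the original report, that extracts the fields of a `segfault at ...` line, computes the faulting instruction's offset into the mapped object (ip minus the mapping base), and decodes the low bits of the x86 page-fault error code (bit 0: protection violation rather than not-present page, bit 1: write, bit 2: user mode). The function and field names are illustrative assumptions.

```python
import re

# Matches kernel lines of the form quoted above, e.g.
# "dnsmasq[81586]: segfault at 7c28 ip ... sp ... error 4 in libc.so.6[base+size]".
# All numeric fields except the pid are printed by the kernel in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        # x86 page-fault error code bits.
        "protection_fault": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }
```

Applied to the entries quoted above, every Jammy and Lunar line resolves to the same offset into libc.so.6 (0x7d47e), and all are user-mode reads; only the faulting address and the not-present versus protection distinction (error 4 versus error 5) differ.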

** The command line the process is launched with **

dnsmasq --no-hosts --pid-file=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/pid --dhcp-hostsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host --addn-hosts=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/addn_hosts --dhcp-optsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/opts --dhcp-leasefile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/leases --dhcp-match=set:ipxe,175 --dhcp-userclass=set:ipxe6,iPXE --local-service --bind-dynamic --dhcp-range=set:subnet-3c1445e7-6f7d-4e62-997f-627bc53da72c,10.1.0.0,static,255.255.255.192,86400s --dhcp-option-force=option:mtu,1380 --dhcp-lease-max=64 --conf-file=/dev/null --domain=openstacklocal

** Neutron Logging **

Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.dhcp.agent [-] neutron.agent.dhcp.agent.DhcpAgentWithStateReport method _port_delete called with arguments ({'port_id': 'bdeaa43c-687c-4e60-a24e-3725d6353828', 'network_id': 'c1ca059e-350d-4d78-9330-600f7315c380', 'fixed_ips': [{'subnet_id': '3c1445e7-6f7d-4e62-997f-627bc53da72c', 'ip_address': '10.1.0.14'}, {'subnet_id': '54bc71f6-bff5-417d-9e4b-1f5f58ed6318', 'ip_address': 'fdd9:92b1:9e2c:0:5054:ff:fe44:5c9f'}], 'priority': 6},) {} {{(pid=60941) wrapper /usr/local/lib/python3.10/dist-packages/oslo_log/helpers.py:65}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.dhcp.agent [-] Calling driver for network: c1ca059e-350d-4d78-9330-600f7315c380/seg=None action: reload_allocations {{(pid=60941) _call_driver /opt/stack/neutron/neutron/agent/dhcp/agent.py:246}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:384}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] CMD "ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f" returned: 0 in 0.011s {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:422}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo.privsep.daemon [-] privsep: reply[8a4f2794-3b63-4f8d-9604-53dd6a4a868c]: (4, ('', '')) {{(pid=78114) _call_back /usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] Running cmd (subprocess): ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f 01:52:54:00:44:5c:9f {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:384}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.processutils [-] CMD "ip netns exec qdhcp-c1ca059e-350d-4d78-9330-600f7315c380 dhcp_release tapbb6348d9-39 10.1.0.14 52:54:00:44:5c:9f 01:52:54:00:44:5c:9f" returned: 0 in 0.011s {{(pid=78114) execute /usr/local/lib/python3.10/dist-packages/oslo_concurrency/processutils.py:422}}
Jul 10 15:26:01 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo.privsep.daemon [-] privsep: reply[33a91aed-bc58-48dd-b673-d4a4d5da54f6]: (4, ('', '')) {{(pid=78114) _call_back /usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.dhcp [-] Building host file: /opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host {{(pid=60941) _output_hosts_file /opt/stack/neutron/neutron/agent/linux/dhcp.py:956}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.utils [-] Running command: ['env', 'LC_ALL=C', 'PATH=/sbin:/usr/sbin', 'dnsmasq', '--test', '--dhcp-host=tag:foo'] {{(pid=60941) create_process /opt/stack/neutron/neutron/agent/linux/utils.py:84}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.dhcp [-] Done building host file /opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host {{(pid=60941) _output_hosts_file /opt/stack/neutron/neutron/agent/linux/dhcp.py:997}}
Jul 10 15:26:02 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo.privsep.daemon [-] privsep: reply[f3dd1224-fe8c-4fb0-8113-699e779df64e]: (4, ('', '', 0)) {{(pid=62248) _call_back /usr/local/lib/python3.10/dist-packages/oslo_privsep/daemon.py:501}}
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.lockutils [-] Acquiring lock "_check_child_processes" by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" {{(pid=60941) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:404}}
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.001s {{(pid=60941) inner /usr/local/lib/python3.10/dist-packages/oslo_concurrency/lockutils.py:409}}
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: ERROR neutron.agent.linux.external_process [-] dnsmasq for dhcp with uuid c1ca059e-350d-4d78-9330-600f7315c380 not found. The process should not have died
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: WARNING neutron.agent.linux.external_process [-] Respawning dnsmasq for uuid c1ca059e-350d-4d78-9330-600f7315c380
Jul 10 15:27:00 np0034614991 neutron-dhcp-agent[60941]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qdhcp-c1ca059e-350d-4d78-9330-600f7315c380', 'env', 'PROCESS_TAG=dnsmasq-c1ca059e-350d-4d78-9330-600f7315c380', 'dnsmasq', '--no-hosts', '', '--pid-file=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/pid', '--dhcp-hostsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/host', '--addn-hosts=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/addn_hosts', '--dhcp-optsfile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/opts', '--dhcp-leasefile=/opt/stack/data/neutron/dhcp/c1ca059e-350d-4d78-9330-600f7315c380/leases', '--dhcp-match=set:ipxe,175', '--dhcp-userclass=set:ipxe6,iPXE', '--local-service', '--bind-dynamic', '--dhcp-range=set:subnet-3c1445e7-6f7d-4e62-997f-627bc53da72c,10.1.0.0,static,255.255.255.192,86400s', '--dhcp-option-force=option:mtu,1380', '--dhcp-lease-max=64', '--conf-file=/dev/null', '--domain=openstacklocal'] {{(pid=60941) execute_rootwrap_daemon /opt/stack/neutron/neutron/agent/linux/utils.py:108}}

We don't believe this is a neutron bug, at least not outright, but we suspect neutron is hitting the same issue, at least in any sort of exhaustive test job. Most of Ironic's jobs that run a single test would pass with this dnsmasq; it was only where we continually ran new test scenarios that we saw this issue crop up and cause failures.

In the meantime, the Ironic project will likely downgrade dnsmasq to unblock its CI.

[0]: https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2022q3/016562.html

no longer affects: dnsmasq
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dnsmasq (Ubuntu):
status: New → Confirmed
yatin (yatinkarel) wrote :

I tried setting up dnsmasq-2.87 with https://review.opendev.org/c/openstack/ironic/+/888121, using a source install and avoiding the newer package from Lunar. Some tests still fail occasionally, but I no longer see any dnsmasq segfault with it. Maybe someone from the Ironic team could check whether the failures are related to dnsmasq or to some other known issue.

Julia Kreger (juliaashleykreger) wrote :

Greetings Yatin!

So, the failure appears to be rooted in iPXE failing to get the complete set of data from the server. My guess is that it has something to do with spanning tree, as iPXE for Ubuntu has also changed its behavior; we merged a patch to disable that after your last recheck. I've re-rechecked your test patch to hopefully provide us an additional data point.

Miguel Lavalle (minsel) wrote :

I'll bring it up in the next weekly Neutron meeting for visibility purposes

Changed in neutron:
importance: Undecided → Low
yatin (yatinkarel) wrote :

<<< So, the failure appears to be rooted in iPXE failing to get the complete set of data from the server. My guess is that it has something to do with spanning tree, as iPXE for Ubuntu has also changed its behavior; we merged a patch to disable that after your last recheck. I've re-rechecked your test patch to hopefully provide us an additional data point.

Thanks Julia. Even with the spanning tree fixes, I have still seen some failures in the test patch. It could be some other issue, though.

Regarding segfaults, I validated this even with 2.89 + source install[1] and didn't see any segfault with it. Maybe the segfault that you noticed with Lunar's dnsmasq-2.89 on Jammy is due to using packages built for Lunar on Jammy, rather than anything specific to dnsmasq itself. I triggered the jobs again to see if segfaults are seen.

Based on this, I think it would be good to get Ubuntu Jammy and Kinetic updated to 2.87, or just to backport the required fix https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=d290630d31f4517ab26392d00753d1397f9a4114.

I could see a similar segfault in the neutron-linuxbridge and openvswitch jobs[4][5], reported once in syslog, but didn't see any failure due to it. But as you said, the issue is seen with some specific tests in Ironic.

For neutron, I will send a patch adding a sanity check to warn users running version 2.86 about this issue.

[1] https://review.opendev.org/c/openstack/ironic/+/888984
[2] https://9d1e095f1746de4d26ae-cb25c10c29ca7bf26ff09ad92a16fa62.ssl.cf1.rackcdn.com/888984/1/check/ironic-standalone/7debc89/controller/logs/syslog.txt
[3] https://c38d0c9156ee6cc9fd3b-d97b0a3b599d6de6d0673faefd2f08b5.ssl.cf1.rackcdn.com/888984/1/check/ironic-standalone-redfish/1cafda9/controller/logs/syslog.txt
[4] https://5d62e00bab1ce95c0ca0-ea10db30e23f6b883afe49ff4b1074ff.ssl.cf2.rackcdn.com/periodic/opendev.org/openstack/neutron/master/neutron-tempest-plugin-linuxbridge/1279e73/controller/logs/syslog.txt
[5] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ecb/859871/12/gate/neutron-tempest-plugin-openvswitch/ecbfc37/controller/logs/syslog.txt
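For illustration, a version check of the kind mentioned in the comment above could look roughly like the Python sketch below. This is not the neutron patch itself. It assumes `dnsmasq --version` prints a first line of the form "Dnsmasq version 2.86 ...", and the names `KNOWN_BAD`, `parse_dnsmasq_version`, `installed_dnsmasq_version`, and `warning_for` are hypothetical.

```python
import re
import subprocess

# Assumption: the version banner contains "version <major>.<minor>".
VERSION_RE = re.compile(r"version (\d+)\.(\d+)")

# Versions flagged in this bug report; hypothetical constant for the sketch.
KNOWN_BAD = {(2, 86)}


def parse_dnsmasq_version(text):
    """Return (major, minor) parsed from the version banner, or None."""
    m = VERSION_RE.search(text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def installed_dnsmasq_version():
    """Ask the installed dnsmasq binary for its version."""
    out = subprocess.run(
        ["dnsmasq", "--version"], capture_output=True, text=True, check=True
    ).stdout
    return parse_dnsmasq_version(out)


def warning_for(version):
    """Return a warning string for an affected version, else None."""
    if version in KNOWN_BAD:
        return (
            "dnsmasq %d.%d is known to crash on DHCP host updates; "
            "see Launchpad bug #2026757" % version
        )
    return None
```

A caller would pass the result of `installed_dnsmasq_version()` to `warning_for()` and log the message if one is returned.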

Brian Haley (brian-haley) wrote :

So this also fails with version 2.89 that is in Lunar?

Julia Kreger (juliaashleykreger) wrote :

@yatin, It appears that, with your newest patch to our Ironic CI jobs to use pure upstream source (thanks, by the way!), the CI job failed in a specific scenario where we're attempting to validate that we can boot an ISO via iPXE. That being said, the logs indicate we made it far past DHCP before it failed, and it actually failed somewhere in the process of downloading the file. Why, I don't know. I can see the chunked transfers happening in the log[0] for your change[1]. You can see where iPXE fails, thinking the connection timed out, in the console log[2].

Anyway, tl;dr: this looks unrelated to this bug. Unfortunately, that is also the kind of failure we would likely need to reproduce to investigate further.

[0]: https://9d1e095f1746de4d26ae-cb25c10c29ca7bf26ff09ad92a16fa62.ssl.cf1.rackcdn.com/888984/1/check/ironic-standalone/7debc89/controller/logs/apache/ipxe_access_log.txt
[1]: https://review.opendev.org/c/openstack/ironic/+/888984
[2]: https://9d1e095f1746de4d26ae-cb25c10c29ca7bf26ff09ad92a16fa62.ssl.cf1.rackcdn.com/888984/1/check/ironic-standalone/7debc89/controller/logs/ironic-bm-logs/node-1_console_log.txt

Julia Kreger (juliaashleykreger) wrote :

@yatin, Also, on the prior change, the most recent run failed somewhere rooted with libvirt or UEFI firmware booting. The same exact scenario test worked in your version revision. That specific test is not doing network booting, but it got the DHCP addresses as I would expect. What seemingly failed is the virtual media configuration through the emulator. Actually, looking deeper, we didn't even try. We might have just picked the wrong node, which means the test could have had a bug. https://bugs.launchpad.net/ironic/+bug/2028279 has been opened for this.

It actually looks like it is a bug with the test itself, but again, entirely unrelated to dnsmasq.

yatin (yatinkarel) wrote :

<< For neutron i will send a patch to add a sanity check to warn users running 2.86 version about this issue.
Pushed https://review.opendev.org/c/openstack/neutron/+/889015?usp=search

@Julia, thanks for checking those failures and reporting 2028279. Will update and remove DNM from the test patch in ironic to use upstream source 2.87.

yatin (yatinkarel) wrote :

<< Will update and remove DNM from the test patch in ironic to use upstream source 2.87.
Proposed https://review.opendev.org/c/openstack/ironic/+/888121

Sergio Durigan Junior (sergiodj) wrote :

Hello and thanks for taking the time to report this bug.

I read the discussion above and would like to clarify a few things:

1) Does the segfault happen with the dnsmasq package from Lunar/Mantic? I see tasks for both systems added to this bug (and the Mantic one is set as Confirmed), but it's not clear from the messages above whether the failure really happens there.

2) Assuming that the segfault does *not* happen in Lunar/Mantic, I can prepare a PPA with the backported patch from upstream and ask you to test it.

3) If the failure *does* happen in Lunar/Mantic, we will need to investigate it further.

FWIW, Kinetic has reached its end of standard support so I will set its task as Won't Fix.

Thank you.

Changed in dnsmasq (Ubuntu Kinetic):
status: New → Won't Fix
yatin (yatinkarel) wrote :

@sergiodj hi
<< 1) Does the segfault happen with the dnsmasq package from Lunar/Mantic? I see tasks for both systems added to this bug (and the Mantic one is set as Confirmed), but it's not clear from the messages above whether the failure really happens there.

I am not aware of any segfaults with the dnsmasq packages in Lunar/Mantic. The issue in Ubuntu Jammy is clear with the version included, so a fix needs to be included for Jammy.
There were some segfaults seen when using Lunar packages on Jammy, but I think that could be due to other issues, so unless and until we see issues with Lunar packages on a Lunar node we can rule this out.
The bug seems to have been confirmed by the bot, and Mantic seems to be the default tracker, so that got updated.

<< 2) Assuming that the segfault does *not* happen in Lunar/Mantic, I can prepare a PPA with the backported patch from upstream and ask you to test it.
Sounds good to me to have a backport of the fix in Ubuntu Jammy.

<< 3) If the failure *does* happen in Lunar/Mantic, we will need to investigate it further.
Unless and until someone confirms the issue with the version included in Lunar/Mantic, we can avoid any updates in Lunar/Mantic.

yatin (yatinkarel)
Changed in dnsmasq (Ubuntu Jammy):
status: New → Confirmed
Sergio Durigan Junior (sergiodj) wrote :

Hello yatin,

Thanks for the reply, and apologies for the delay. I've been swamped with other work here.

Anyway, based on your feedback I went ahead and prepared an upload with the proposed patch (https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=d290630d31f4517ab26392d00753d1397f9a4114). You can find the PPA in the following link:

https://launchpad.net/~sergiodj/+archive/ubuntu/dnsmasq

Could you please give it a try and let me know if it works for you? I still haven't had the time to try and reproduce the issue locally. BTW, if you have an easy reproducer I'd appreciate it.

Thanks.

Changed in dnsmasq (Ubuntu Lunar):
status: New → Invalid
Changed in dnsmasq (Ubuntu Mantic):
status: Confirmed → Invalid
Sergio Durigan Junior (sergiodj) wrote :

Based on yatin's feedback, I am setting the status of dnsmasq's Lunar and Mantic tasks as Invalid. This bug only applies to Jammy.

Sergio Durigan Junior (sergiodj) wrote :

Hi,

I'm marking the status of this bug to Incomplete to reflect the fact that we're waiting for information from the reporter.

@yatin, please let me know when you are able to give my PPA a try.

Thanks.

Changed in dnsmasq (Ubuntu Jammy):
status: Confirmed → Incomplete
yatin (yatinkarel) wrote :

Thanks @Sergio, I missed those PPA links earlier; pushed a test patch[1] to validate it.

[1] https://review.opendev.org/c/openstack/ironic/+/897277

yatin (yatinkarel) wrote :

@Sergio, so I still see[1][2] those segfaults with the new packages:
ii dnsmasq-base 2.86-1.1ubuntu0.4~ppa1 amd64 Small caching DNS proxy and DHCP/TFTP server
ii dnsmasq-utils 2.86-1.1ubuntu0.4~ppa1 amd64 Utilities for manipulating DHCP leases

Can you check if that patch is really applied on those packages?

[1] https://dcd105b404f93fc08fa3-82141499a48343cfb1270dc186f5ec2f.ssl.cf5.rackcdn.com/897277/1/check/ironic-standalone/1d6d1a0/controller/logs/syslog.txt
[2]
Oct 04 06:53:17 np0035409761 kernel: dnsmasq[67022]: segfault at 80c8 ip 00007f4ef40ce3fe sp 00007fff2346dab0 error 4 in libc.so.6[7f4ef4051000+195000]
Oct 04 06:53:17 np0035409761 kernel: Code: 99 13 00 e8 04 b9 ff ff 0f 1f 40 00 f3 0f 1e fa 48 85 ff 0f 84 bb 00 00 00 55 48 8d 77 f0 53 48 83 ec 18 48 8b 1d 12 3a 17 00 <48> 8b 47 f8 64 8b 2b a8 02 75 57 48 8b 15 98 39 17 00 64 48 83 3a
Oct 04 06:53:53 np0035409761 kernel: dnsmasq[77967]: segfault at d118 ip 00007f31c3b403fe sp 00007ffe7ed6cc40 error 4 in libc.so.6[7f31c3ac3000+195000]

Paride Legovini (paride) wrote :

Hello, I verified that Sergio's PPA contains the candidate upstream patch (upstream commit d290630d31f4517ab26392d00753d1397f9a4114). If the crash is still happening that probably wasn't the issue after all.

I see two possible ways forward here. One is classic git based debugging:

1. compile 2.86 from upstream git and verify that the crash happens
2. compile 2.89 from upstream and verify that the crash doesn't happen
3. use `git bisect` to find the commit that fixed the bug.
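Because step 1 is expected to crash and step 2 is not, the search in step 3 looks for the point at which the crash stops. With `git bisect`, that means swapping the usual good/bad labels, for example via `git bisect start --term-old=broken --term-new=fixed`. The underlying idea is a binary search over an ordered commit list, sketched below in Python. This is an illustration only, not how git implements it, and `first_non_crashing` and the `crashes` callback are hypothetical names. In practice `crashes` would build the commit and run a reproducer.

```python
def first_non_crashing(commits, crashes):
    """Binary-search an ordered commit list for the first commit that no longer crashes.

    `commits` is ordered oldest to newest. `crashes(commit)` returns True if
    that build crashes. The oldest commit must crash and the newest must not,
    mirroring steps 1 and 2 above.
    """
    lo, hi = 0, len(commits) - 1
    if not crashes(commits[lo]):
        raise ValueError("oldest commit does not crash; nothing to bisect")
    if crashes(commits[hi]):
        raise ValueError("newest commit still crashes; nothing to bisect")
    # Invariant: commits[lo] crashes, commits[hi] does not.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]
```

This assumes the crash disappears at a single commit and never returns; if the behaviour flips more than once, the search only finds one of the transitions.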

If that's not practical [but we may have to bite the bullet here!], we could do some guesswork after figuring out whether the Ubuntu-packaged dnsmasq 2.87 is buggy or not. That version is no longer available in the Ubuntu archive, but it was in the archive at some point (in Lunar), and compiled debs are still available here:

https://launchpad.net/ubuntu/+source/dnsmasq/2.87-1.1/+build/24632779

So by testing those we'll be able to tell whether the bug has been fixed between 2.86-1.1ubuntu0.3 and 2.87-1.1 or not. You'll need to manually install those packages via `dpkg -i` (no need to mention that this is normally not recommended!).

I'd test some of this myself, but without a reproducer I won't be able to tell much.

Paride Legovini (paride) wrote :

The bug description says "dnsmasq on Ubuntu Jammy/Lunar crashes [...]", but IIUC it's actually only the Jammy one that crashes (the lunar-on-jammy one maybe has other issues, but doesn't exhibit the crash). Am I right? In this case please update the bug description accordingly. Thanks!

Edit: I did this myself as the Lunar bug was set to Invalid already, per comment 12.

summary: - dnsmasq on Ubuntu Jammy/Lunar crashes on neutron-dhcp-agent updates
+ dnsmasq on Ubuntu Jammy crashes on neutron-dhcp-agent updates
Jay Faulkner (jason-oldos) wrote :

Noting that there's very little action Ironic can take on this bug, but marking it triaged to get it out of our dashboard. If there's anything Ironic contributors or CI can do to move this forward, please let us know.

Changed in ironic:
status: New → Triaged
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (stable/2023.2)

Related fix proposed to branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/ironic/+/910444

OpenStack Infra (hudson-openstack) wrote : Related fix merged to ironic (master)

Reviewed: https://review.opendev.org/c/openstack/ironic/+/888121
Committed: https://opendev.org/openstack/ironic/commit/27f53debb6800d9552510d44957b4d7d6292eaf5
Submitter: "Zuul (22348)"
Branch: master

commit 27f53debb6800d9552510d44957b4d7d6292eaf5
Author: yatinkarel <email address hidden>
Date: Tue Jul 11 14:44:06 2023 +0530

    ci: Source install dnsmasq-2.87

    dnsmasq-2.86 shipped in Ubuntu jammy has a
    known issue[1] which is fixed in dnsmasq-2.87
    but it's not yet released with Ubuntu jammy.

    Until fixed version is available in Ubuntu
    jammy let's use source install instead of
    using a older version from Ubuntu focal.

    [1] https://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2022q3/016562.html

    Update from Julia:

    Pushing forward the source fix again as ubuntu removed the
    prior path we were using as a focal package and replaced
    it with a package which is demonstrating the same basic issue.

    Related-Bug: #2026757
    Change-Id: I7ffcd167fc1e3a8c1192d766743bb5620d85ef35

James Page (james-page) wrote :

dnsmasq in Jammy was updated to a 2.90-based version to resolve some security issues.

As a result this bug may have been fixed by the rollforward to a new version.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to ironic (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/ironic/+/910518

Julia Kreger (juliaashleykreger) wrote :

Indeed, and unfortunately it seems like I've been able to identify the root cause, which is still present in 2.90. What changed is that we're no longer seeing a segfault, which was the tell-tale sign we were looking for; instead we just see it quietly exit with no trace. This led me to build a setup where I could reproduce the issue, which we've seen trigger on Ironic's "standalone" jobs as they exercise a number of different scenarios involving port/address updates, and I just sat with strace attached to the dnsmasq process.

alarm(77377) = 77377
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=236696, si_uid=0} ---
getpid() = 235993
writev(18, [{iov_base="\1\0\0\0\0\0\0\0\0\0\0\0", iov_len=12}], 1) = 12
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = 1 ([{fd=17, revents=POLLIN}])
read(17, "\1\0\0\0\0\0\0\0\0\0\0\0", 12) = 12
newfstatat(AT_FDCWD, "/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts", {st_mode=S_IFREG|0644, st_size=1268, ...}, 0) = 0
openat(AT_FDCWD, "/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts", O_RDONLY) = 20
newfstatat(20, "", {st_mode=S_IFREG|0644, st_size=1268, ...}, AT_EMPTY_PATH) = 0
read(20, "10.1.0.1\thost-10-1-0-1.openstack"..., 4096) = 1268
read(20, "", 4096) = 0
close(20) = 0
getpid() = 235993
newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}, 0) = 0
getpid() = 235993
write(19, "<30>Feb 29 22:35:30 dnsmasq[2359"..., 133) = 133
writev(2, [{iov_base="free(): invalid pointer", iov_len=23}, {iov_base="\n", iov_len=1}], 2) = 24
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fec31710000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 235993
getpid() = 235993
tgkill(235993, 235993, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=235993, si_uid=65534} ---
+++ killed by SIGABRT +++
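Read top to bottom, the trace shows a SIGHUP arriving from another process, dnsmasq re-reading its addn_hosts file, glibc writing "free(): invalid pointer" to stderr, and the process aborting. A small Python sketch such as the following, which is not part of the original comment, can pull that sequence out of a longer strace capture. The function name and the summary keys are illustrative assumptions, and it only understands the line shapes seen in this trace.

```python
import re

# Line shapes taken from the strace output above.
SIGNAL_RE = re.compile(r"^--- (SIG[A-Z0-9]+) ")
KILLED_RE = re.compile(r"^\+\+\+ killed by (SIG[A-Z0-9]+)")
OPENAT_RE = re.compile(r'^openat\([^,]+, "([^"]+)"')
STDERR_RE = re.compile(r'^writev\(2, \[\{iov_base="([^"]*)"')


def summarize_strace(trace):
    """Collect delivered signals, the last file opened, stderr text, and the fatal signal."""
    summary = {"signals": [], "killed_by": None, "last_opened": None, "stderr": []}
    for raw in trace.splitlines():
        line = raw.strip()
        m = SIGNAL_RE.match(line)
        if m:
            summary["signals"].append(m.group(1))
            continue
        m = KILLED_RE.match(line)
        if m:
            summary["killed_by"] = m.group(1)
            continue
        m = OPENAT_RE.match(line)
        if m:
            summary["last_opened"] = m.group(1)
            continue
        m = STDERR_RE.match(line)
        if m:
            # First iovec of a writev() to file descriptor 2 (stderr).
            summary["stderr"].append(m.group(1))
    return summary
```

For the capture above, the summary lists SIGHUP followed by SIGABRT, with the addn_hosts file as the last path opened before the abort.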

The above was captured from a dnsmasq install with https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=10d8b5f001a34ff46b3a72575f3af64b065f8637, whereas when running the commit before it, dnsmasq does not crash. The result with 2.90 is basically identical (https://paste.openstack.org...


OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (stable/2023.1)

Change abandoned by "Julia Kreger <email address hidden>" on branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/ironic/+/910518

Julia Kreger (juliaashleykreger) wrote :
OpenStack Infra (hudson-openstack) wrote : Change abandoned on ironic (stable/2023.2)

Change abandoned by "Julia Kreger <email address hidden>" on branch: stable/2023.2
Review: https://review.opendev.org/c/openstack/ironic/+/910444
Reason: superseded by 2.85 version.

Julia Kreger (juliaashleykreger) wrote :

Took some effort, but I've managed to capture a core dump:

root@np0036907443:/opt/stack/dnsmasq# coredumpctl info 328293
           PID: 328293 (dnsmasq)
           UID: 65534 (nobody)
           GID: 30 (dip)
        Signal: 6 (ABRT)
     Timestamp: Fri 2024-03-01 16:19:37 UTC (2min 34s ago)
  Command Line: dnsmasq --no-hosts "" --pid-file=/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/pid --dhcp-hostsfile=/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/host --a>
    Executable: /usr/sbin/dnsmasq
 Control Group: /user.slice/user-0.slice/session-54.scope
          Unit: session-54.scope
         Slice: user-0.slice
       Session: 54
     Owner UID: 0 (root)
       Boot ID: 9aed02ce9d8a44b9845ff26acd24ad62
    Machine ID: 312834901b204815afc5cf70e422129b
      Hostname: dnsmasq
       Storage: /var/lib/systemd/coredump/core.dnsmasq.65534.9aed02ce9d8a44b9845ff26acd24ad62.328293.1709309977000000.zst (present)
     Disk Size: 29.7K
       Message: Process 328293 (dnsmasq) of user 65534 dumped core.

                Found module linux-vdso.so.1 with build-id: 975d8292a19f8c241322ae7eb151b63f4f01d8e2
                Found module ld-linux-x86-64.so.2 with build-id: 15921ea631d9f36502d20459c43e5c85b7d6ab76
                Found module libc.so.6 with build-id: c289da5071a3399de893d2af81d6a30c62646e1e
                Found module dnsmasq with build-id: d8802051e6d28c6d5d2b5ac326a392c5d5a05f5b
                Stack trace of thread 328293:
                #0 0x00007fda047499fc pthread_kill (libc.so.6 + 0x969fc)
                #1 0x00007fda046f5476 raise (libc.so.6 + 0x42476)
                #2 0x00007fda046db7f3 abort (libc.so.6 + 0x287f3)
                #3 0x00007fda0473c676 n/a (libc.so.6 + 0x89676)
                #4 0x00007fda04753cfc n/a (libc.so.6 + 0xa0cfc)
                #5 0x00007fda04755a44 n/a (libc.so.6 + 0xa2a44)
                #6 0x00007fda04758453 free (libc.so.6 + 0xa5453)
                #7 0x0000563328e86068 dhcp_config_free (dnsmasq + 0x14068)
                #8 0x0000563328e8f5e1 reread_dhcp (dnsmasq + 0x1d5e1)
                #9 0x0000563328e9864f clear_cache_and_reload (dnsmasq + 0x2664f)
                #10 0x0000563328e7b8f6 main (dnsmasq + 0x98f6)
                #11 0x00007fda046dcd90 n/a (libc.so.6 + 0x29d90)
                #12 0x00007fda046dce40 __libc_start_main (libc.so.6 + 0x29e40)
                #13 0x0000563328e7c1b5 _start (dnsmasq + 0xa1b5)

Julia Kreger (juliaashleykreger) wrote :

Listing... Done
dnsmasq-base-lua/unknown,unknown 2.90-0ubuntu0.22.04.1 amd64
dnsmasq-base/unknown,unknown,now 2.90-0ubuntu0.22.04.1 amd64 [installed,automatic]
dnsmasq-utils/unknown,unknown,now 2.90-0ubuntu0.22.04.1 amd64 [installed]
dnsmasq/unknown,unknown,now 2.90-0ubuntu0.22.04.1 all [installed]
root@np0036907443:/opt/stack/dnsmasq#

Julia Kreger (juliaashleykreger) wrote :

So, trying to at least get symbols for the dnsmasq binary has been largely unsuccessful, so I built v2.90 from dnsmasq git, and here is what I got:

                Found module linux-vdso.so.1 with build-id: 975d8292a19f8c241322ae7eb151b63f4f01d8e2
                Found module ld-linux-x86-64.so.2 with build-id: 15921ea631d9f36502d20459c43e5c85b7d6ab76
                Found module libc.so.6 with build-id: c289da5071a3399de893d2af81d6a30c62646e1e
                Found module dnsmasq with build-id: 9ebcae185737a13e7a224834f99a5781c2ba5e14
                Stack trace of thread 338919:
                #0 0x00007fb0ad8399fc pthread_kill (libc.so.6 + 0x969fc)
                #1 0x00007fb0ad7e5476 raise (libc.so.6 + 0x42476)
                #2 0x00007fb0ad7cb7f3 abort (libc.so.6 + 0x287f3)
                #3 0x00007fb0ad82c676 n/a (libc.so.6 + 0x89676)
                #4 0x00007fb0ad843cfc n/a (libc.so.6 + 0xa0cfc)
                #5 0x00007fb0ad845a44 n/a (libc.so.6 + 0xa2a44)
                #6 0x00007fb0ad848453 free (libc.so.6 + 0xa5453)
                #7 0x00005585456f6a30 dhcp_config_free (dnsmasq + 0x16a30)
                #8 0x0000558545700f61 reread_dhcp (dnsmasq + 0x20f61)
                #9 0x000055854570a7ff clear_cache_and_reload (dnsmasq + 0x2a7ff)
                #10 0x00005585456eaad7 main (dnsmasq + 0xaad7)
                #11 0x00007fb0ad7ccd90 n/a (libc.so.6 + 0x29d90)
                #12 0x00007fb0ad7cce40 __libc_start_main (libc.so.6 + 0x29e40)
                #13 0x00005585456eb375 _start (dnsmasq + 0xb375)

Brian Haley (brian-haley) wrote :

From that trace, it looks like it is in this code in dhcp_config_free() when it makes the free() call:

#ifdef HAVE_DHCP6
      if (config->flags & CONFIG_ADDR6)
        {
          struct addrlist *addr, *tmp;

          for (addr = config->addr6; addr; addr = tmp)
            {
              tmp = addr->next;
              free(addr);
            }
        }
#endif

That *seems* Ok at first look, right? I do like the while() loop above it better :)

I do see a potential issue when IPv6 addresses are added to this list, but I think it would just cause a memory leak; search for CONFIG_ADDR6 in that file. Guess I'll have to send that to the list, I can't unsee it now.

You might just need to step through that code to see what 'addr' actually is.
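
For reference, the save-next-then-free idiom in the loop above can be exercised in isolation. The following is a minimal sketch, not dnsmasq code: struct node, node_push() and node_list_free() are hypothetical stand-ins for struct addrlist and its handling. It only shows that the loop itself is sound when every node came from malloc() and the list is NULL-terminated, which would point at how the list was built rather than at the loop.

```c
#include <stdlib.h>

/* Hypothetical stand-in for dnsmasq's struct addrlist. */
struct node {
  int value;
  struct node *next;
};

/* Push a heap-allocated node on the front of the list.
   Returns the new head, or NULL if malloc() fails (the old list is
   left untouched in that case and is still owned by the caller). */
static struct node *node_push(struct node *head, int value)
{
  struct node *n = malloc(sizeof(*n));

  if (!n)
    return NULL;
  n->value = value;
  n->next = head;
  return n;
}

/* Free every node in the list and return how many were freed.
   The next pointer is saved before free(), so freed memory is
   never read. */
static int node_list_free(struct node *head)
{
  struct node *tmp;
  int freed = 0;

  for (; head; head = tmp)
    {
      tmp = head->next;
      free(head);
      freed++;
    }
  return freed;
}
```

Building a three-element list with node_push() and passing it to node_list_free() frees three nodes; passing NULL frees none. The loop only misbehaves if a node in the chain was not allocated with malloc() or if a next pointer is garbage.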

Julia Kreger (juliaashleykreger) wrote :

This is from v2.90 from the ubuntu packaging:

                Stack trace of thread 349999:
                #0 0x00007f21c90499fc pthread_kill (libc.so.6 + 0x969fc)
                #1 0x00007f21c8ff5476 raise (libc.so.6 + 0x42476)
                #2 0x00007f21c8fdb7f3 abort (libc.so.6 + 0x287f3)
                #3 0x00007f21c903c676 n/a (libc.so.6 + 0x89676)
                #4 0x00007f21c9053cfc n/a (libc.so.6 + 0xa0cfc)
                #5 0x00007f21c9055a54 n/a (libc.so.6 + 0xa2a54)
                #6 0x00007f21c9058453 free (libc.so.6 + 0xa5453)
                #7 0x000055ea8f653810 dhcp_netid_free (dnsmasq + 0x1b810)
                #8 0x000055ea8f6538df dhcp_netid_list_free (dnsmasq + 0x1b8df)
                #9 0x000055ea8f653956 dhcp_config_free (dnsmasq + 0x1b956)
                #10 0x000055ea8f661868 clear_dynamic_conf (dnsmasq + 0x29868)
                #11 0x000055ea8f661947 reread_dhcp (dnsmasq + 0x29947)
                #12 0x000055ea8f6737b7 clear_cache_and_reload (dnsmasq + 0x3b7b7)
                #13 0x000055ea8f672db5 async_event (dnsmasq + 0x3adb5)
                #14 0x000055ea8f6725aa main (dnsmasq + 0x3a5aa)
                #15 0x00007f21c8fdcd90 n/a (libc.so.6 + 0x29d90)
                #16 0x00007f21c8fdce40 __libc_start_main (libc.so.6 + 0x29e40)
                #17 0x000055ea8f642bc5 _start (dnsmasq + 0xabc5)

Petr Menšík (pihhan) wrote :

Using the instructions at https://askubuntu.com/questions/41610/how-do-i-rebuild-a-package-to-include-debugging-information, I built a package with working debug symbols.

(gdb) bt
#0 0x00007f21c90499fc in pthread_kill () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f21c8ff5476 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f21c8fdb7f3 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007f21c903c676 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007f21c9053cfc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x00007f21c9055a54 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#6 0x00007f21c9058453 in free () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x000055ea8f653810 in dhcp_netid_free (nid=0x8000000ba) at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:1333
#8 0x000055ea8f6538df in dhcp_netid_list_free (netid=0x0) at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:1363
#9 0x000055ea8f653956 in dhcp_config_free (config=0x55ea90bc0050)
    at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:1381
#10 0x000055ea8f661868 in clear_dynamic_conf () at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:5777
#11 0x000055ea8f661947 in reread_dhcp () at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:5818
#12 0x000055ea8f6737b7 in clear_cache_and_reload (now=1709322392)
    at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/dnsmasq.c:1738
#13 0x000055ea8f672db5 in async_event (pipe=17, now=1709322392)
    at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/dnsmasq.c:1482
#14 0x000055ea8f6725aa in main (argc=17, argv=0x7ffe73b7b4c8)
    at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/dnsmasq.c:1224

(gdb) frame 9
#9 0x000055ea8f653956 in dhcp_config_free (config=0x55ea90bc0050)
    at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:1381
1381 in /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c
(gdb) p *config->netid->list
$17 = {net = 0x55efce12bef0 <error: Cannot access memory at address 0x55efce12bef0>, next = 0xa03b9e2eb1d772d3}
(gdb) p *config->netid->list->next
Cannot access memory at address 0xa03b9e2eb1d772d3

(gdb) frame 10
#10 0x000055ea8f661868 in clear_dynamic_conf () at /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c:5777
5777 in /home/ubuntu/dnsmasq/dnsmasq-2.90/debian/build/no-lua/option.c
(gdb) info locals
configs = 0x55ea90bc0050
cp = 0x55ea90bbd220
up = 0x55ea90baefd8
(gdb) p *cp
$19 = {flags = 2096, clid_len = 0, clid = 0x0, hostname = 0x55ea90bbd2c0 "host-10-1-0-7", domain = 0x55ea90bbd2ce "openstacklocal",
  netid = 0x55ea90bbd2f0, filter = 0x0, addr6 = 0x0, addr = {s_addr = 117440778}, decline_time = 0, lease_time = 0,
  hwaddr = 0x55ea90bbd290, next = 0x55ea90bbd0f0}
(gdb) p *configs
$20 = {flags = 2096, clid_len = 0, clid = 0x0, hostname = 0x55ea90bc00c0 "host-10-1-0-62",
  domain = 0x55ea90bc00cf "openstacklocal", netid = 0x55ea90bc00f0, filter = 0x0, addr6 = 0x0, addr = {s_addr = 1040187658},
  decline_time = 0, lease_time = 0, hwaddr = 0x55ea90bbd370, next = 0x55ea90bbd220}
(gdb) p *configs->netid
$21 = {list = 0x55ea90bc0110, next = 0x0}
(gdb) p *configs->netid->list...

Petr Menšík (pihhan) wrote :

Coredump obtained from dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb

Julia Kreger (juliaashleykreger) wrote :

Petr messaged me and suggested we try using rr to capture the execution and failure to aid in debugging. Unfortunately, the CPU performance events are unavailable on the machine I'm attempting reproduction on.

I did manage to spend a little time over the last two days adding some additional debug logging to a source build of 2.90, which includes the patch Brian posted to the dnsmasq mailing list regarding DHCPv6.

I was still able to reproduce this issue using one of Ironic's combined scenario test jobs, which exercises the DHCP configuration a number of times. I also turned off inotify updates and DHCPv6 in my local build, and was still able to reproduce the failure.

I also tried sending a HUP signal a substantial number of times, and tried massaging the configuration files which were being loaded for static entries, but I was unable to reproduce the crash that way. There *IS* a distinct possibility I just didn't do it "enough", but in the reproduced crashes dnsmasq has hardly been running for long before it ends up crashing.
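
The HUP attempts described above could be scripted roughly as follows. This is a sketch under assumptions: hup_repeatedly() is a hypothetical helper, the target PID would have to be read from the dnsmasq pid file, and the count and delay are arbitrary. It relies only on POSIX kill() and nanosleep().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <sys/types.h>
#include <time.h>

/* Send SIGHUP to `pid` up to `count` times, sleeping `delay_ms`
   milliseconds after each signal. Returns how many signals kill()
   accepted; a short count means the target went away (for example
   because it crashed) or could not be signalled. */
static int hup_repeatedly(pid_t pid, int count, long delay_ms)
{
  struct timespec ts;
  int sent = 0;

  ts.tv_sec = delay_ms / 1000;
  ts.tv_nsec = (delay_ms % 1000) * 1000000L;

  for (int i = 0; i < count; i++)
    {
      if (kill(pid, SIGHUP) != 0)
        break; /* e.g. ESRCH once the process no longer exists */
      sent++;
      nanosleep(&ts, NULL);
    }
  return sent;
}
```

A return value smaller than the requested count would be a cheap signal that the process died part-way through the run, at which point coredumpctl can be checked for a fresh dump.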

From what I've seen, it appears that the crash can happen after a DHCP offer response has been sent back to a v4 client. However, at least from looking through the code, setting netids appears to be limited to configuration loading and do_options in src/rfc2131.c. Unfortunately, I don't have the context to understand what is being done in do_options and why.

I have also been able to figure out a change that prevents the SIGABRT by only proceeding to the next iteration if netid->next is not null. That seems to prevent the crash, but it only masks the root cause, and there is no telling what impact it has in the long term.

Brian Haley (brian-haley) wrote :

Can you paste the change you're using that seems to help? Maybe getting some eyes on it might help point in a direction? Not that I have lots of extra cycles.

And I didn't expect the change I made to help, that failure probably never happens, and if you're just dealing with IPv4 it won't come into play.

Julia Kreger (juliaashleykreger) wrote :

So, I tossed the change because I wanted to try and produce the failure. I tried to re-create it, but didn't have the best of luck which makes me think I masked the issue a bit too well. Playing with valgrind has me questioning reality:

==1119241== ERROR SUMMARY: 9 errors from 6 contexts (suppressed: 0 from 0)
==1119241==
==1119241== 1 errors in context 1 of 6:
==1119241== Invalid read of size 8
==1119241== at 0x11EA27: dhcp_netid_free (option.c:1332)
==1119241== by 0x11EA27: dhcp_netid_list_free (option.c:1363)
==1119241== by 0x11EA27: dhcp_config_free (option.c:1381)
==1119241== by 0x128F60: clear_dynamic_conf (option.c:5777)
==1119241== by 0x128F60: reread_dhcp (option.c:5818)
==1119241== by 0x1327FE: clear_cache_and_reload (dnsmasq.c:1738)
==1119241== by 0x112AD6: async_event (dnsmasq.c:1482)
==1119241== by 0x112AD6: main (dnsmasq.c:1224)
==1119241== Address 0x1ffefffbc0 is on thread 1's stack
==1119241== 624 bytes below stack pointer
==1119241==
==1119241==
==1119241== 2 errors in context 2 of 6:
==1119241== Invalid read of size 8
==1119241== at 0x11EA23: dhcp_netid_free (option.c:1331)
==1119241== by 0x11EA23: dhcp_netid_list_free (option.c:1363)
==1119241== by 0x11EA23: dhcp_config_free (option.c:1381)
==1119241== by 0x128F60: clear_dynamic_conf (option.c:5777)
==1119241== by 0x128F60: reread_dhcp (option.c:5818)
==1119241== by 0x1327FE: clear_cache_and_reload (dnsmasq.c:1738)
==1119241== by 0x112AD6: async_event (dnsmasq.c:1482)
==1119241== by 0x112AD6: main (dnsmasq.c:1224)
==1119241== Address 0x1ffefffbc8 is on thread 1's stack
==1119241== 616 bytes below stack pointer
==1119241==
==1119241==
==1119241== 2 errors in context 3 of 6:
==1119241== Invalid free() / delete / delete[] / realloc()
==1119241== at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1119241== by 0x11EA37: dhcp_netid_free (option.c:1333)
==1119241== by 0x11EA37: dhcp_netid_list_free (option.c:1363)
==1119241== by 0x11EA37: dhcp_config_free (option.c:1381)
==1119241== by 0x128F60: clear_dynamic_conf (option.c:5777)
==1119241== by 0x128F60: reread_dhcp (option.c:5818)
==1119241== by 0x1327FE: clear_cache_and_reload (dnsmasq.c:1738)
==1119241== by 0x112AD6: async_event (dnsmasq.c:1482)
==1119241== by 0x112AD6: main (dnsmasq.c:1224)
==1119241== Address 0x1ffefffbc0 is on thread 1's stack
==1119241== 544 bytes below stack pointer

Which has me sort of wondering if the overall pattern of the cleanup might be something we should be looking at.
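
The valgrind report says the pointers being read and freed are on the thread's stack. One way a list that is supposed to be heap-only can produce that symptom is if a node's next pointer ends up chained onto a stack-allocated node. The sketch below illustrates only that general pattern; it is not a diagnosis of dnsmasq, and struct tag, tag_new(), tag_list_free_checked() and the on_heap flag are all hypothetical names invented for the example. The flag is readable here only because the stack node is still in scope while the list is walked; in a real crash the stack frame would already be gone.

```c
#include <stdlib.h>

/* Hypothetical tag node. on_heap is a debugging aid for this sketch. */
struct tag {
  int id;
  struct tag *next;
  int on_heap;
};

/* Allocate a tag on the heap and mark it as such. */
static struct tag *tag_new(int id)
{
  struct tag *t = malloc(sizeof(*t));

  if (!t)
    return NULL;
  t->id = id;
  t->next = NULL;
  t->on_heap = 1;
  return t;
}

/* Free heap nodes in order. Stops at the first node that was not
   created by tag_new() and returns 1, since passing such a node to
   free() would be an invalid free. Returns 0 if the whole list was
   heap-allocated. *freed receives the number of nodes released. */
static int tag_list_free_checked(struct tag *t, int *freed)
{
  *freed = 0;
  while (t)
    {
      struct tag *next;

      if (!t->on_heap)
        return 1;
      next = t->next;
      free(t);
      (*freed)++;
      t = next;
    }
  return 0;
}
```

If a heap tag's next is pointed at a local struct tag before the list is released, tag_list_free_checked() frees the heap node and then reports the foreign node instead of handing it to free(). An unchecked loop would pass the stack address to free(), which is the kind of call valgrind flags as an invalid free.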

Mitchell Dzurick (mitchdz) wrote :

I wonder if the latest update to Jammy has fixed the issue? Julia, is this issue still occurring for you after upgrading to the Jammy package version 2.90-0ubuntu0.22.04.1?

Julia Kreger (juliaashleykreger) wrote :

Yes, still happens with 2.90-0ubuntu0.22.04.1.

As far as I can tell, it is a problem in the upstream dnsmasq code, not the Ubuntu package build itself.

       Message: Process 1671294 (dnsmasq) of user 65534 dumped core.

                Found module linux-vdso.so.1 with build-id: 975d8292a19f8c241322ae7eb151b63f4f01d8e2
                Found module ld-linux-x86-64.so.2 with build-id: 15921ea631d9f36502d20459c43e5c85b7d6ab76
                Found module libc.so.6 with build-id: c289da5071a3399de893d2af81d6a30c62646e1e
                Found module dnsmasq with build-id: aa89f97a7ccd45a1c50674eb7d5da473d605d476
                Stack trace of thread 1671294:
                #0 0x00007fbd66ca29fc __pthread_kill_implementation (libc.so.6 + 0x969fc)
                #1 0x00007fbd66c4e476 __GI_raise (libc.so.6 + 0x42476)
                #2 0x00007fbd66c347f3 __GI_abort (libc.so.6 + 0x287f3)
                #3 0x00007fbd66c95676 __libc_message (libc.so.6 + 0x89676)
                #4 0x00007fbd66caccfc malloc_printerr (libc.so.6 + 0xa0cfc)
                #5 0x00007fbd66caea44 _int_free (libc.so.6 + 0xa2a44)
                #6 0x00007fbd66cb1453 __GI___libc_free (libc.so.6 + 0xa5453)
                #7 0x0000555899cd1a32 dhcp_netid_free (dnsmasq + 0x16a32)
                #8 0x0000555899cdc001 clear_dynamic_conf (dnsmasq + 0x21001)
                #9 0x0000555899ce58af clear_cache_and_reload (dnsmasq + 0x2a8af)
                #10 0x0000555899cc5ad7 async_event (dnsmasq + 0xaad7)
                #11 0x00007fbd66c35d90 __libc_start_call_main (libc.so.6 + 0x29d90)
                #12 0x00007fbd66c35e40 __libc_start_main_impl (libc.so.6 + 0x29e40)
                #13 0x0000555899cc6375 _start (dnsmasq + 0xb375)
