Indeed, and unfortunately it seems like I've been able to identify the root cause which is still present in 2.90. What changed, we're no longer seeing a segfault which was the tell-tale sign we were looking for, but instead we just see it quietly exit with no trace. This led me to getting a setup where I could reproduce the issue, which we've seen trigger on Ironic's "standalone" jobs as they exercise a number of different scenarios involving port/address updates quite a bit, and I just sat with strace attached to the dnsmasq process.
I think ironic is going to have to pin to dnsmasq 2.85 for the time being, until we're able to get this fixed and fully understood in upstream dnsmasq since this is impacting numerous CI branches and test jobs.
Indeed, and unfortunately it seems like I've been able to identify the root cause which is still present in 2.90. What changed, we're no longer seeing a segfault which was the tell-tale sign we were looking for, but instead we just see it quietly exit with no trace. This led me to getting a setup where I could reproduce the issue, which we've seen trigger on Ironic's "standalone" jobs as they exercise a number of different scenarios involving port/address updates quite a bit, and I just sat with strace attached to the dnsmasq process.
alarm(77377) = 77377 RESTARTBLOCK (Interrupted by signal) "\1\0\0\ 0\0\0\0\ 0\0\0\0\ 0", iov_len=12}], 1) = 12 {mask=[ ]}) = -1 EINTR (Interrupted system call) 0\0\0\0\ 0\0\0\0\ 0", 12) = 12 AT_FDCWD, "/opt/stack/ data/neutron/ dhcp/77cabae0- 26bf-4374- 997a-781947f2e5 b2/addn_ hosts", {st_mode= S_IFREG| 0644, st_size=1268, ...}, 0) = 0 data/neutron/ dhcp/77cabae0- 26bf-4374- 997a-781947f2e5 b2/addn_ hosts", O_RDONLY) = 20 S_IFREG| 0644, st_size=1268, ...}, AT_EMPTY_PATH) = 0 1\thost- 10-1-0- 1.openstack" ..., 4096) = 1268 AT_FDCWD, "/etc/localtime", {st_mode= S_IFREG| 0644, st_size=114, ...}, 0) = 0 PROT_WRITE, MAP_PRIVATE| MAP_ANONYMOUS, -1, 0) = 0x7fec31710000 SIG_UNBLOCK, [ABRT], NULL, 8) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = ? ERESTART_
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=236696, si_uid=0} ---
getpid() = 235993
writev(18, [{iov_base=
rt_sigreturn(
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = 1 ([{fd=17, revents=POLLIN}])
read(17, "\1\0\0\
newfstatat(
openat(AT_FDCWD, "/opt/stack/
newfstatat(20, "", {st_mode=
read(20, "10.1.0.
read(20, "", 4096) = 0
close(20) = 0
getpid() = 235993
newfstatat(
getpid() = 235993
write(19, "<30>Feb 29 22:35:30 dnsmasq[2359"..., 133) = 133
writev(2, [{iov_base="free(): invalid pointer", iov_len=23}, {iov_base="\n", iov_len=1}], 2) = 24
mmap(NULL, 4096, PROT_READ|
rt_sigprocmask(
gettid() = 235993
getpid() = 235993
tgkill(235993, 235993, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=235993, si_uid=65534} ---
+++ killed by SIGABRT +++
The above was captured from a dnsmasq install with https:/ /thekelleys. org.uk/ gitweb/ ?p=dnsmasq. git;a=commit; h=10d8b5f001a34 ff46b3a72575f3a f64b065f8637 where as running the commit before it, I don't crash out dnsmasq. The result with 2.90 is basically identical (https:/ /paste. openstack. org/show/ 823338/).
In terms of the actual logging which results, all we get is along the lines of https:/ /paste. opendev. org/show/ 823337/, at least with the install on jammy, where neutron is directly launching the dnsmasq processes. Looking at the contents of the commit where this starts (https:/ /thekelleys. org.uk/ gitweb/ ?p=dnsmasq. git;a=blobdiff; f=src/option. c;h=23cf058e8ad 9fdcffe621ce0b9 2d626578d31f82; hp=e8926a49f460 50eea9cba7cc90e 12d67821f3a2f; hb=10d8b5f001a3 4ff46b3a72575f3 af64b065f8637; hpb=ffd3ceb856e b3e1ff923cae95a 2ddb889beab7c1), it is clear that the logic has been changed into a different which explains the error we are able to capture with stracce.
I think ironic is going to have to pin to dnsmasq 2.85 for the time being, until we're able to get this fixed and fully understood in upstream dnsmasq since this is impacting numerous CI branches and test jobs.