Comment 25 for bug 2026757

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

Indeed, and unfortunately it seems like I've been able to identify the root cause which is still present in 2.90. What changed, we're no longer seeing a segfault which was the tell-tale sign we were looking for, but instead we just see it quietly exit with no trace. This led me to getting a setup where I could reproduce the issue, which we've seen trigger on Ironic's "standalone" jobs as they exercise a number of different scenarios involving port/address updates quite a bit, and I just sat with strace attached to the dnsmasq process.

alarm(77377) = 77377
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
--- SIGHUP {si_signo=SIGHUP, si_code=SI_USER, si_pid=236696, si_uid=0} ---
getpid() = 235993
writev(18, [{iov_base="\1\0\0\0\0\0\0\0\0\0\0\0", iov_len=12}], 1) = 12
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}, {fd=11, events=POLLIN}, {fd=12, events=POLLIN}, {fd=13, events=POLLIN}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}], 14, -1) = 1 ([{fd=17, revents=POLLIN}])
read(17, "\1\0\0\0\0\0\0\0\0\0\0\0", 12) = 12
newfstatat(AT_FDCWD, "/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts", {st_mode=S_IFREG|0644, st_size=1268, ...}, 0) = 0
openat(AT_FDCWD, "/opt/stack/data/neutron/dhcp/77cabae0-26bf-4374-997a-781947f2e5b2/addn_hosts", O_RDONLY) = 20
newfstatat(20, "", {st_mode=S_IFREG|0644, st_size=1268, ...}, AT_EMPTY_PATH) = 0
read(20, "10.1.0.1\thost-10-1-0-1.openstack"..., 4096) = 1268
read(20, "", 4096) = 0
close(20) = 0
getpid() = 235993
newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0644, st_size=114, ...}, 0) = 0
getpid() = 235993
write(19, "<30>Feb 29 22:35:30 dnsmasq[2359"..., 133) = 133
writev(2, [{iov_base="free(): invalid pointer", iov_len=23}, {iov_base="\n", iov_len=1}], 2) = 24
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fec31710000
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid() = 235993
getpid() = 235993
tgkill(235993, 235993, SIGABRT) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=235993, si_uid=65534} ---
+++ killed by SIGABRT +++

The above was captured from a dnsmasq install with https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=commit;h=10d8b5f001a34ff46b3a72575f3af64b065f8637 where as running the commit before it, I don't crash out dnsmasq. The result with 2.90 is basically identical (https://paste.openstack.org/show/823338/).

In terms of the actual logging which results, all we get is along the lines of https://paste.opendev.org/show/823337/, at least with the install on jammy, where neutron is directly launching the dnsmasq processes. Looking at the contents of the commit where this starts (https://thekelleys.org.uk/gitweb/?p=dnsmasq.git;a=blobdiff;f=src/option.c;h=23cf058e8ad9fdcffe621ce0b92d626578d31f82;hp=e8926a49f46050eea9cba7cc90e12d67821f3a2f;hb=10d8b5f001a34ff46b3a72575f3af64b065f8637;hpb=ffd3ceb856eb3e1ff923cae95a2ddb889beab7c1), it is clear that the logic has been changed into a different which explains the error we are able to capture with stracce.

I think ironic is going to have to pin to dnsmasq 2.85 for the time being, until we're able to get this fixed and fully understood in upstream dnsmasq since this is impacting numerous CI branches and test jobs.