loss of masters causing bind to become unresponsive

Bug #177489 reported by Matt LaPlante
Affects: bind (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: bind

Bind lived up to its name this morning. I have a bind server that serves triple duty, providing:

- Public zones
- Private zones (for the LAN)
- Recursive lookups and caching for LAN hosts

The first two are effectively the same thing except for some ACLs, but that's beside the point. Anyway, for the sake of this example, assume the public zones and the private zones are all slave zones. The public zones load over public IP addresses, while the private (LAN) zones load over an IPsec connection to the master's network.
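
For concreteness, the relevant part of named.conf probably looked something like the sketch below. The private zone name and master address are the ones visible in the sendmsg trace in the first comment; everything else (the public zone name, its master address, the file paths) is a hypothetical stand-in:

// Public slave zone: the master is reached over a public IP address.
zone "example.com" {
        type slave;
        masters { 203.0.113.1; };               // hypothetical public master
        file "/var/cache/bind/db.example.com";
};

// Private slave zone for the LAN: the master is only reachable
// through the IPsec tunnel.
zone "glv.lan" {
        type slave;
        masters { 192.168.5.5; };               // master address seen in the trace below
        file "/var/cache/bind/db.glv.lan";
};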

This morning the IPsec connection went away, as it occasionally does, and shortly thereafter so did bind. What really boggled me was that named would become completely unresponsive within a couple of minutes of startup, even after rebooting the server. It would refuse to do any lookups, would fail to resolve its own zones, and even failed to respond to rndc for restarts. Basically, I'd have to kill -9 it, then restart; it would work for a minute or two, then re-hang.

After poking around for some time, I broke out strace and had a look at a few of the threads running under named. One thread looked like it was waiting for input from the master over the down IPsec connection. Another appeared to be stuck in an infinite loop of the following:

clock_gettime(CLOCK_REALTIME, {1198072309, 972541645}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1671, {0, 403081355}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 377730}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 379388030}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1673, {0, 11452970}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 392700}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 394254492}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1675, {0, 77563508}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 473689}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 475455403}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1677, {0, 498233597}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072310, 976660}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072310, 977375142}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1679, {0, 499284858}) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1198072311, 478563}, NULL) = 0
futex(0xb7a72010, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1198072311, 480351624}) = 0
futex(0xb7a72044, FUTEX_WAIT, 1681, {0, 498211376}) = ? ERESTART_RESTARTBLOCK (To be restarted)
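
For reference, a per-thread trace like the one above can be captured by attaching strace to the running daemon. This is a generic invocation, not the exact command used here; -ff writes one output file per thread:

strace -ff -tt -p "$(pidof named)" -o /tmp/named.trace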

I took the hint and commented out the couple of private zones that required the master over IPsec. Since then, named has stayed up and running as normal. Apparently, somewhere in the bind code, if it doesn't hear back from a master it will literally wait forever and stop serving all data. This, imo, is not good.
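
In named.conf terms, the workaround amounts to disabling the IPsec-backed slave zone; a sketch, reusing the zone name and master address that appear in the sendmsg trace in the first comment (the rest is hypothetical):

// zone "glv.lan" {
//         type slave;
//         masters { 192.168.5.5; };
//         file "/var/cache/bind/db.glv.lan";
// };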

I also have a few additional observations:
- This is not the first time the IPsec connection has gone away, but it's the first time I've seen this behavior. It may also be the first time IPsec has been down since upgrading to Edgy, so the problem may be new in bind 9.4. It could also be a bizarre coincidence.
- The public zones, which resolve over public IP addresses, did not cause a failure even when their master was unreachable. This leads me to believe that there is something about the way IPsec handled bind's queries that created the condition, but I still think it's a condition bind should be able to deal with.

Matt LaPlante (cybrmatt)
description: updated
Matt LaPlante (cybrmatt) wrote:

I was mad at myself for losing the strace from the other thread, so I re-enabled a broken slave zone and recreated the problem (easily). Here's the tail of an strace from the thread that seems to be hanging on the IPsec IP, just as it stops responding. Note that the trace ends mid-call, inside a sendmsg to the master's address that never returns:

futex(0xb7a8f0f0, FUTEX_WAIT, 943, NULL) = 0
futex(0xb7a8f0b8, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7a8f0b8, FUTEX_WAKE, 1) = 0
gettimeofday({1198079742, 934859}, NULL) = 0
gettimeofday({1198079742, 935156}, NULL) = 0
gettimeofday({1198079742, 935487}, NULL) = 0
futex(0xb7a8f0f0, FUTEX_WAIT, 945, NULL) = 0
futex(0xb7a8f0b8, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7a8f0b8, FUTEX_WAKE, 1) = 0
gettimeofday({1198079742, 948815}, NULL) = 0
gettimeofday({1198079742, 949291}, NULL) = 0
gettimeofday({1198079742, 949505}, NULL) = 0
futex(0xb7a8f0f0, FUTEX_WAIT, 947, NULL) = 0
futex(0xb7a8f0b8, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0xb7a8f0b8, FUTEX_WAKE, 1) = 0
gettimeofday({1198079745, 920414}, NULL) = 0
gettimeofday({1198079745, 920873}, NULL) = 0
gettimeofday({1198079745, 921082}, NULL) = 0
gettimeofday({1198079745, 921309}, NULL) = 0
futex(0xb7a91044, 0x5 /* FUTEX_??? */, 1) = 1
futex(0xb7a91010, FUTEX_WAKE, 1) = 0
gettimeofday({1198079745, 922263}, NULL) = 0
gettimeofday({1198079745, 923564}, NULL) = 0
gettimeofday({1198079745, 923935}, NULL) = 0
sendmsg(32, {msg_name(16)={sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("192.168.5.5")}, msg_iov(1)=[{"+\305\0\0\0\1\0\0\0\0\0\1\3glv\3lan\0\0\6\0\1\0\0)\10\0"..., 36}], msg_controllen=0, msg_flags=0}, 0

And here's the tail of a final thread, which also seems to stop doing anything during the hang. (The output of the third thread, the one that seems to loop forever, is in the original post):

read(5, 0xb6a8b174, 8) = -1 EAGAIN (Resource temporarily unavailable)
select(36, [5 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35], [], NULL, NULL) = 1 (in [22])
futex(0xb7a8f0f0, 0x5 /* FUTEX_??? */, 1) = 1
futex(0xb7a8f0b8, FUTEX_WAKE, 1) = 0
select(36, [5 20 21 23 24 25 26 27 28 29 30 31 32 33 34 35], [], NULL, NULL) = 1 (in [5])
read(5, "\26\0\0\0\375\377\377\377", 8) = 8
read(5, 0xb6a8b174, 8) = -1 EAGAIN (Resource temporarily unavailable)
select(36, [5 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35], [], NULL, NULL) = 1 (in [32])
futex(0xb7a8f0f0, 0x5 /* FUTEX_??? */, 1) = 1
futex(0xb7a8f0b8, FUTEX_WAKE, 1) = 0
select(36, [5 20 21 22 23 24 25 26 27 28 29 30 31 33 34 35], [], NULL, NULL) = 1 (in [5])
read(5, " \0\0\0\375\377\377\377", 8) = 8
read(5, 0xb6a8b174, 8) = -1 EAGAIN (Resource temporarily unavailable)
select(36, [5 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35], [], NULL, NULL) = 1 (in [32])
futex(0xb7a8f0f0, 0x5 /* FUTEX_??? */, 1) = 1
futex(0xb7a8f0b8, FUTEX_WAKE, 1) = 0
select(36, [5 20 21 22 23 24 25 26 27 28 29 30 31 33 34 35], [], NULL, NULL) = 1 (in [5])
read(5, " \0\0\0\375\377\377\377", 8) = 8
read(5, 0xb6a8b174, 8) = -1 EAGAIN (Resource temporarily unavailable)
select(36, [5 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 3...


Daniel T Chen (crimsun) wrote:

Is this symptom still reproducible in 8.10?

Changed in bind:
status: New → Incomplete
Matt LaPlante (cybrmatt) wrote:

This was determined to be caused by a kernel setting that was new at the time. If net.core.xfrm_larval_drop is not set to 1, a send over an IPsec connection whose security association cannot be resolved blocks indefinitely instead of failing; bind does not expect that blocking behavior on the socket, and this situation results. Setting net.core.xfrm_larval_drop=1 works around it.
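
For anyone hitting this, applying the workaround is plain sysctl usage; the /etc/sysctl.conf persistence step is the usual convention, not something stated in this report:

# Check the current value
sysctl net.core.xfrm_larval_drop

# Apply the workaround immediately
sudo sysctl -w net.core.xfrm_larval_drop=1

# Persist it across reboots
echo "net.core.xfrm_larval_drop = 1" | sudo tee -a /etc/sysctl.conf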

Changed in bind:
status: Incomplete → Invalid