tomcat7 in ubuntu trusty is affected by the '50 days timeout bug' when run by the Oracle JDK 7

Bug #1449951 reported by Andrea Dell'Amico
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Tomcat7
Fix Released
High
tomcat7 (Ubuntu)
Fix Released
Undecided
Unassigned
Trusty
Won't Fix
Undecided
Unassigned

Bug Description

After 2^32 milliseconds from the startup, tomcat fails with the following error:

ERROR [core.StandardServer] - StandardServer.await: accept:
java.net.SocketTimeoutException: Accept timed out
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.catalina.core.StandardServer.await(StandardServer.java:451)
        at org.apache.catalina.startup.Catalina.await(Catalina.java:777)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:723)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:321)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:455)

It has been solved in tomcat 7.0.55, and the bug is: https://bz.apache.org/bugzilla/show_bug.cgi?id=56684

Revision history for this message
In , Gustavo-l (gustavo-l) wrote :

After upgrading to Java 7, I've noticed some applications fail after about 50 days f (for instance May 8 till June 27). This is suspiciously close to 2 ^ 32 -1 milliseconds:

Jun 27, 2014 9:24:47 AM org.apache.catalina.core.StandardServer await
SEVERE: StandardServer.await: accept:
java.net.SocketTimeoutException: Accept timed out
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
        at java.net.ServerSocket.implAccept(ServerSocket.java:530)
        at java.net.ServerSocket.accept(ServerSocket.java:498)
        at org.apache.catalina.core.StandardServer.await(StandardServer.java:431)
        at org.apache.catalina.startup.Catalina.await(Catalina.java:676)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:628)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:289)
        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:414)
Jun 27, 2014 9:24:47 AM org.apache.coyote.http11.Http11AprProtocol pause
INFO: Pausing Coyote HTTP/1.1 on http-51080
Jun 27, 2014 9:24:47 AM org.apache.coyote.ajp.AjpAprProtocol pause
INFO: Pausing Coyote AJP/1.3 on ajp-51009
Jun 27, 2014 9:24:48 AM org.apache.catalina.core.StandardService stop
INFO: Stopping service Catalina

After this, the application is no longer listening.

I suspect this may be due to a change in behavior in Java. See:

$ $JAVA_HOME/bin/java -version
java version "1.7.0_60"
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)

$ strace -f -o /tmp/hh groovy -e 'new ServerSocket(60001, 1, InetAddress.getByName("localhost")).accept()'
File contains:
2212 bind(47, {sa_family=AF_INET6, sin6_port=htons(60001), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
2212 listen(47, 1) = 0
2212 poll([{fd=47, events=POLLIN|POLLERR}], 1, 4294967295 <unfinished ...>

$ $JAVA_HOME/bin/java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)

2661 bind(46, {sa_family=AF_INET6, sin6_port=htons(60001), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
2661 listen(46, 1) = 0
2661 accept(46, <unfinished ...>

While java6 calls accept(), java7 first waits for an event on the socket with poll(). However, it uses a timeout of 4294967295 milliseconds.

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

Hmm. My initial impression is that that is a JRE bug. I've contacted someone at Oracle (see the dev list) who has been helpful in pointing us in the right direction to report bugs. Keep an eye on the dev list for progress on that.

However, I can't repeat the results you report. When I run strace on the same Java version as you are using, I see a call to poll with a timeout of -1 which I am assuming is an infinite timeout.

You'll need to provide enough information about your environment to enable us (and Oracle) to repeat this otherwise this is going to get resolved as WORKSFORME.

Revision history for this message
In , Gustavo-l (gustavo-l) wrote :

I tested this on Ubuntu Trusty amd64. I can also reproduce on my home machine, which runas arch (amd64).

You can try this Vagrantfile:

Vagrant.configure('2') do |config|
  config.vm.box = "ubuntu/trusty64"

  provision_script = <<-eos
      cat > ~vagrant/run-test << eod
#!/bin/bash

# only one version of Java installed; no need to set JAVA_HOME
# so Groovy picks the correct one
aptitude install -y oracle-java7-installer groovy
java -version
strace -f -o /tmp/hh groovy -e 'new ServerSocket(60001, 1, InetAddress.getByName("localhost")).accept()'
eod
      chmod +x ~vagrant/run-test

      apt-get install -y python-software-properties strace vim
      add-apt-repository -y ppa:webupd8team/java
      apt-get update
    eos

  config.vm.provision "shell", inline: provision_script
end

after ssh'ing, run sudo ./run-test, accept the license and then Ctrl+C. You can then instpect /tmp/hh

Revision history for this message
In , Gustavo-l (gustavo-l) wrote :

The strace output may actually be a bug in strace because poll() takes a signed int as timeout; it would seem strace is interpreting the value as an unsigned int. Indeed:

$ cat a.c && gcc a.c && strace ./a.out
#include <poll.h>
#include <stdio.h>

void main() {
 struct pollfd fd = { .fd = fileno(stdin), .events = POLLIN };
 int timeout = -1;

 poll(&fd, 1, timeout);
}
execve("./a.out", ["./a.out"], [/* 57 vars */]) = 0
brk(0) = 0xeb5000
(...)
poll([{fd=0, events=POLLIN}], 1, 4294967295^CProcess 6565 detached

That leaves me (short of a bug in glibc or the kernel) with no explanation for the exception I got, which takes quite a while to reproduce, though if poll doesn't use the monotonic clock, there may be an faster way. I'll check tomorrow.

Revision history for this message
In , Gustavo-l (gustavo-l) wrote :

A quick observation: the poll syscall used to take a long, and now takes an int. Maybe strace hasn't been updated yet:

http://lwn.net/Articles/483078/

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

I was testing on Ubuntu Precise. After updating to Trusty I see the same value as you of 4294967295. That, at least, explains why we were seeing different values.

I've looked through both the Tomcat code and the Java 7 code and I don't see anything wrong (although I am no C/C++ expert). Neither do I see anything that explain the behaviour you are seeing.

I think the next steps are to see if you can reproduce this issue with a test case that doesn't take ~50 days to run.

Revision history for this message
In , Knst-kolinko (knst-kolinko) wrote :

Javadoc for ServerSocket.accept() says that SocketTimeoutException can be thrown by the method "if a timeout was previously set with setSoTimeout". [1]

[1] http://docs.oracle.com/javase/7/docs/api/java/net/ServerSocket.html#accept%28%29

BTW,
Javadoc for ServerSocket.setSoTimeout() says that to set an infinite timeout one uses the value of "0". Is somebody confusing '0' and '-1'?

I think the javadoc for ServerSocket.accept() implies that the default timeout is infinite. So I think it is a JRE bug.

In general it makes some sense to protect against this. A timeout is not a "receiving a valid shutdown command" event, and encountering a timeout does not make the server socket an invalid one, so we should be able to continue.

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

(In reply to Konstantin Kolinko from comment #6)
> BTW,
> Javadoc for ServerSocket.setSoTimeout() says that to set an infinite timeout
> one uses the value of "0". Is somebody confusing '0' and '-1'?

Not that I can see having looked at the JRE code.

> I think the javadoc for ServerSocket.accept() implies that the default
> timeout is infinite. So I think it is a JRE bug.

That is a possibility, but having looked at the JRE code, I don't see where the problem is.

> In general it makes some sense to protect against this. A timeout is not a
> "receiving a valid shutdown command" event, and encountering a timeout does
> not make the server socket an invalid one, so we should be able to continue.

Agreed. There is no reason I can think of that accept should time out so we can certiainly add some protection against this issue here. We can probably add some useful debug info as well.

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

Note that the protection and debug info is going to go into Tomcat 8. It would help a lot if you could upgrade at least one system experiencing this error to Tomcat 8.

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

Workaround added to 8.0.x for 8.0.10 onwards. If the timout occurs a warning is logged that reports how long accept was blocking for before the timeout.

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

The workaround and simple debug code was pretty simple so I have backported it to 7.0.x (it will be in 7.0.55 onwards) and proposed if for 6.0.x.

Revision history for this message
In , Mark Thomas (asfmarkt) wrote :

Work around added to 6.0.x for 6.0.42 onwards.

Revision history for this message
In , Fachhoch (fachhoch) wrote :

is this fix available in tomcat 7.0.50? I am getting this error in jdk1.7 64 bit tomcat 7.0.50

Revision history for this message
In , Chuck-caldarale (chuck-caldarale) wrote :

(In reply to fach hoch from comment #12)
> is this fix available in tomcat 7.0.50?

No, it's in 7.0.55, as stated in comment #10.

Changed in tomcat7:
importance: Unknown → High
status: Unknown → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in tomcat7 (Ubuntu):
status: New → Confirmed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi I'm clearing bugs on our list that were dormant for too long.
The fix was actually applied to 7.0.55, so >=Xenial is good.

The fix itself would be:
https://github.com/apache/tomcat70/commit/4b1818de

Changed in tomcat7 (Ubuntu):
status: Confirmed → Fix Released
Changed in tomcat7 (Ubuntu Trusty):
status: New → Confirmed
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

After all the time I'm not sure if anybody is still really requiring that.
If one does I provided a test ppa [1] with a fix.

If you are affected by the bug and that fix helps please help to provide a SRU template [2] here in the bug - especially the steps to reproduce are interesting to verify the SRU.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3023
[2]: https://wiki.ubuntu.com/StableReleaseUpdates#SRU_Bug_Template

Revision history for this message
Andrea Dell'Amico (adellam) wrote :

It needs 50 days of wait to test the fix, and there's a workaround: disabling the shutdown port fix the behaviour, and as I always restart the tomcat instances from localhost so I'm OK with the workaround.
Additionally we are going to switch to 16.04 in the next months, so we don't need this particular fix anymore.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Wow, how long did I miss that answer ... */me feels bad*
Still, thank you for the feedback Andrea!

I agree, just like you almost everybody will have switched to later releases by now.
And since trusty is no more in standard support (only ESM) nowadays and we didn't have a party of "critical for me as well" updates here it is unlikely to be fixed.

Therefore while cleaning up old bugs I'm setting the Trusty task to "Won't Fix"

Changed in tomcat7 (Ubuntu Trusty):
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.