Activity log for bug #1338637

Date Who What changed Old value New value Message
2014-07-07 15:54:03 James Hunt bug added bug
2014-07-07 15:54:18 James Hunt upstart: assignee James Hunt (jamesodhunt)
2014-07-07 15:55:04 James Hunt upstart: importance Undecided High
2014-07-07 15:59:15 James Hunt description $ sudo ls -al /proc/1/fd|grep anon|wc -l $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106 $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106
2014-07-08 10:03:16 Launchpad Janitor branch linked lp:~jamesodhunt/upstart/bug-1338637
2014-07-08 11:39:04 Launchpad Janitor branch linked lp:upstart
2014-07-22 13:41:37 James Hunt upstart: status New Fix Released
2014-08-15 18:19:50 James Hunt bug added subscriber Adam Conrad
2014-09-08 12:36:33 Thilo Uttendorfer bug added subscriber Thilo Uttendorfer
2014-09-17 14:37:28 James Hunt bug task added upstart (Ubuntu)
2014-09-17 15:38:20 Colin Watson nominated for series Ubuntu Trusty
2014-09-17 15:38:20 Colin Watson bug task added upstart (Ubuntu Trusty)
2014-09-18 09:28:18 James Hunt summary continuous re-exec can result in a build-up of inotify fds continuous re-exec can result in a build-up of inotify fds [SRU]
2014-09-18 10:02:26 James Hunt description $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106 = PROBLEM = The version of Upstart currently in trusty (1.12.1-0ubuntu4.2) suffers from a couple of problems which in combination could make upgrading difficult on systems with high uptimes. The 2 issues are: == bug 901038 == The issue here is that telinit in trusty is not fully synchronous. This is not normally a problem. However if Upstart is re-exec'ed using 'telinit u', as is done by the maintainer scripts for the following packages... - libc6 - libdbus-1-3 - libjson-c2 - libnih1 - libnih-dbus1 - libselinux1 - libsepol1 - upstart ... and if that operation is slow (it is normally extremely fast), subsequent package upgrades (part of the same apt-get run) which *also* need to restart Upstart may fail since 'telinit u' is unable to connect to PID 1 as a result of Upstart still being in the process of re-exec'ing from the last call to 'telinit u'. == bug 1338637 == This bug only really affects server systems which have high uptimes. If a re-exec is triggered via 'telinit u' as part of a package upgrade (see above), Upstart will consume 2 additional inotify watches. Although this doesn't affect the correct behaviour of Upstart, it does have two repercussions: a) It wastes inotify watches. b) It slows down an Upstart re-exec ('telinit u'). Note that the slow-down for a vanilla Trusty server may not be detectable unless the value of /proc/sys/fs/inotify/max_user_instances has been raised above the default value of 128. = FIX DETAILS = == bug 901038 == The fix for this bug now makes 'telinit u' block until the re-exec operation has completed fully. 
Technically, it is not possible to make 'telinit u' synchronous because D-Bus connections cannot be serialised: since the 'telinit u' request is made via a D-Bus connection, when Upstart re-exec's, it has to sever all D-Bus connections, including the 'telinit u' D-Bus connection. As such, 'telinit u' now performs the following operations: - Requests synchronously that Upstart re-exec itself. - Polls Upstart "forever" by attempting to connect to PID 1 and if that operation fails, waiting for a period, then retrying. The code is well commented to explain this less-than-ideal but nonetheless essential poll operation: http://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/util/telinit.c#L171 == bug 1338637 == This was a simple bug fix. = FIX AVAILABILITY = TBD. = IMPACT = On a high uptime server, the more times that 'telinit u' is called (as the result of normal apt-get updates to any of the packages listed under bug 901038 above), the longer the operation will take to complete, and thus the likelihood of bug 901038 being seen will increase. = JUSTIFICATION = System updates need to "just work". However, as outlined above, the longer a system's uptime, the more likely it is to be affected by this issue, which increases the likelihood of a system update failure. This issue needs to be fixed as soon as possible to minimise issues for trusty users and administrators, particularly before any potential upgrade to Utopic. = TEST CASE = == To demonstrate bug 1338637 == $ sudo telinit u $ sudo ls -l /proc/1/fd|grep inotify|wc -l The correct output from the above 2 commands should be simply "2". However, on trusty systems, if the commands above are run after a fresh boot, the value will in all likelihood be "4". Also, every subsequent 'telinit u' will increase the number displayed by 2. 
== To demonstrate bug 901038 == for i in $(seq 10); do telinit u; done There should be no output from the command-line above, but on a trusty system the chances are that one or more lines will display like this: $ for i in $(seq 5); do sudo telinit u;done telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused == To prove the overall problem has been fixed == 1) Download the attached "force-reexec.sh" onto a trusty system. 2) Make executable: $ chmod 755 ./force-reexec.sh 3) Run script as root, specifying a number of iterations (100) and saving output to file "./typescript.bad": $ script -c 'sudo ./force-reexec.sh 100' typescript.bad 4) Upgrade to latest version of Upstart that fixes the 2 bugs. 5) Reboot. 6) Re-run the script $ script -c 'sudo ./force-reexec.sh 100' typescript.good 7) Check the results: - file "typescript.bad" will show an increasing number of watches and *MAY* show a slowly increasing restart time (depending on the value of /proc/sys/fs/inotify/max_user_instances (see above)). - file "typescript.good" should consistently show 2 watches and a restart time of 0 on average. [REGRESSION POTENTIAL] The only theoretically potential issue is what happens if the continuous poll performed by 'telinit u' never completes. However, this should never happen since: 1) Upstart checks to ensure that it can serialise its own state *before* it actually performs the re-exec. If it cannot for some reason (the only possibility here is critically low memory), it will automatically degrade to a stateless re-exec. A stateless re-exec can only fail if the on-disk "/sbin/init" binary or associated libraries are somehow corrupted. 
2) If, after checking that its own state can be serialised, the actual re-exec operation fails "mid-flight", again, Upstart will automatically revert to performing a simple stateless re-exec which can only fail if the on-disk "/sbin/init" binary or associated libraries are somehow corrupted. = OTHER INFORMATION = == Upgrade Procedure == Note that updating a system to the latest version of Upstart that fixes the two bugs outlined above will stop the number of inotify watches growing, but will NOT bring the value down to the expected "2". As such, to correct the problem fully, it is necessary to reboot the system after successfully upgrading the Upstart package. == D-Bus Daemon == Note that although Upstart makes heavy use of D-Bus, it does not require a D-Bus daemon to be running. Specifically, 'telinit u' communicates with PID 1 via a private abstract D-Bus socket, so is immune from issues with dbus-daemon(1). = Original Description = $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106
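The poll-and-retry behaviour described under FIX DETAILS can be sketched in shell. This is only an illustration and not the code in telinit.c: the function name wait_until, the attempt cap and the whole-second delay are assumptions made for the example (the description says the real poll runs "forever").

```shell
#!/bin/sh
# Hypothetical helper (not part of Upstart): retry a command until it succeeds.
# usage: wait_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt failed.
wait_until() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Discard the command's output; only its exit status matters here.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempt=$((attempt + 1))
        # Wait for a period before trying again.
        sleep "$delay"
    done
    return 1
}
```

A caller would pass any command that only succeeds once PID 1 is reachable again, for example "wait_until 30 1 initctl version". The cap on attempts is a deliberate difference from the unbounded poll described above, so that the sketch cannot hang.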
2014-09-18 10:03:32 James Hunt attachment added force-reexec.sh : script that re-exec's upstart and demonstrates both bugs. https://bugs.launchpad.net/upstart/+bug/1338637/+attachment/4207315/+files/force-reexec.sh
2014-09-18 10:04:34 James Hunt attachment added Output of force-reexec.sh on a trusty system demonstrating the problem. https://bugs.launchpad.net/upstart/+bug/1338637/+attachment/4207316/+files/typescript.bad
2014-09-18 10:05:25 James Hunt attachment added Output of force-reexec.sh on a trusty system with fixes for both bugs, showing the problem has been resolved. https://bugs.launchpad.net/upstart/+bug/1338637/+attachment/4207317/+files/typescript.good
2014-09-18 10:07:39 James Hunt description = PROBLEM = The version of Upstart currently in trusty (1.12.1-0ubuntu4.2) suffers from a couple of problems which in combination could make upgrading difficult on systems with high uptimes. The 2 issues are: == bug 901038 == The issue here is that telinit in trusty is not fully synchronous. This is not normally a problem. However if Upstart is re-exec'ed using 'telinit u', as is done by the maintainer scripts for the following packages... - libc6 - libdbus-1-3 - libjson-c2 - libnih1 - libnih-dbus1 - libselinux1 - libsepol1 - upstart ... and if that operation is slow (it is normally extremely fast), subsequent package upgrades (part of the same apt-get run) which *also* need to restart Upstart may fail since 'telinit u' is unable to connect to PID 1 as a result of Upstart still being in the process of re-exec'ing from the last call to 'telinit u'. == bug 1338637 == This bug only really affects server systems which have high uptimes. If a re-exec is triggered via 'telinit u' as part of a package upgrade (see above), Upstart will consume 2 additional inotify watches. Although this doesn't affect the correct behaviour of Upstart, it does have two repercussions: a) It wastes inotify watches. b) It slows down an Upstart re-exec ('telinit u'). Note that the slow-down for a vanilla Trusty server may not be detectable unless the value of /proc/sys/fs/inotify/max_user_instances has been raised above the default value of 128. = FIX DETAILS = == bug 901038 == The fix for this bug now makes 'telinit u' block until the re-exec operation has completed fully. Technically, it is not possible to make 'telinit u' synchronous because D-Bus connections cannot be serialised: since the 'telinit u' request is made via a D-Bus connection, when Upstart re-exec's, it has to sever all D-Bus connections, including the 'telinit u' D-Bus connection. 
As such, 'telinit u' now performs the following operations: - Requests synchronously that Upstart re-exec itself. - Polls Upstart "forever" by attempting to connect to PID 1 and if that operation fails, waiting for a period, then retrying. The code is well commented to explain this less-than-ideal but nonetheless essential poll operation: http://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/util/telinit.c#L171 == bug 1338637 == This was a simple bug fix. = FIX AVAILABILITY = TBD. = IMPACT = On a high uptime server, the more times that 'telinit u' is called (as the result of normal apt-get updates to any of the packages listed under bug 901038 above), the longer the operation will take to complete, and thus the likelihood of bug 901038 being seen will increase. = JUSTIFICATION = System updates need to "just work". However, as outlined above, the longer a system's uptime, the more likely it is to be affected by this issue, which increases the likelihood of a system update failure. This issue needs to be fixed as soon as possible to minimise issues for trusty users and administrators, particularly before any potential upgrade to Utopic. = TEST CASE = == To demonstrate bug 1338637 == $ sudo telinit u $ sudo ls -l /proc/1/fd|grep inotify|wc -l The correct output from the above 2 commands should be simply "2". However, on trusty systems, if the commands above are run after a fresh boot, the value will in all likelihood be "4". Also, every subsequent 'telinit u' will increase the number displayed by 2. 
== To demonstrate bug 901038 == for i in $(seq 10); do telinit u; done There should be no output from the command-line above, but on a trusty system the chances are that one or more lines will display like this: $ for i in $(seq 5); do sudo telinit u;done telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused == To prove the overall problem has been fixed == 1) Download the attached "force-reexec.sh" onto a trusty system. 2) Make executable: $ chmod 755 ./force-reexec.sh 3) Run script as root, specifying a number of iterations (100) and saving output to file "./typescript.bad": $ script -c 'sudo ./force-reexec.sh 100' typescript.bad 4) Upgrade to latest version of Upstart that fixes the 2 bugs. 5) Reboot. 6) Re-run the script $ script -c 'sudo ./force-reexec.sh 100' typescript.good 7) Check the results: - file "typescript.bad" will show an increasing number of watches and *MAY* show a slowly increasing restart time (depending on the value of /proc/sys/fs/inotify/max_user_instances (see above)). - file "typescript.good" should consistently show 2 watches and a restart time of 0 on average. [REGRESSION POTENTIAL] The only theoretically potential issue is what happens if the continuous poll performed by 'telinit u' never completes. However, this should never happen since: 1) Upstart checks to ensure that it can serialise its own state *before* it actually performs the re-exec. If it cannot for some reason (the only possibility here is critically low memory), it will automatically degrade to a stateless re-exec. A stateless re-exec can only fail if the on-disk "/sbin/init" binary or associated libraries are somehow corrupted. 
2) If, after checking that its own state can be serialised, the actual re-exec operation fails "mid-flight", again, Upstart will automatically revert to performing a simple stateless re-exec which can only fail if the on-disk "/sbin/init" binary or associated libraries are somehow corrupted. = OTHER INFORMATION = == Upgrade Procedure == Note that updating a system to the latest version of Upstart that fixes the two bugs outlined above will stop the number of inotify watches growing, but will NOT bring the value down to the expected "2". As such, to correct the problem fully, it is necessary to reboot the system after successfully upgrading the Upstart package. == D-Bus Daemon == Note that although Upstart makes heavy use of D-Bus, it does not require a D-Bus daemon to be running. Specifically, 'telinit u' communicates with PID 1 via a private abstract D-Bus socket, so is immune from issues with dbus-daemon(1). = Original Description = $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106 = PROBLEM = The version of Upstart currently in trusty (1.12.1-0ubuntu4.2) suffers from a couple of problems which in combination could make upgrading difficult on systems with high uptimes. The 2 issues are: == bug 901038 == The issue here is that telinit in trusty is not fully synchronous. This is not normally a problem. However if Upstart is re-exec'ed using 'telinit u', as is done by the maintainer scripts for the following packages...   - libc6   - libdbus-1-3   - libjson-c2   - libnih1   - libnih-dbus1   - libselinux1   - libsepol1   - upstart ... and if that operation is slow (it is normally extremely fast), subsequent package upgrades (part of the same apt-get run) which *also* need to restart Upstart may fail since 'telinit u' is unable to connect to PID 1 as a result of Upstart still being in the process of re-exec'ing from the last call to 'telinit u'. 
== bug 1338637 == This bug only really affects server systems which have high uptimes. If a re-exec is triggered via 'telinit u' as part of a package upgrade (see above), Upstart will consume 2 additional inotify watches. Although this doesn't affect the correct behaviour of Upstart, it does have two repercussions: a) It wastes inotify watches. b) It slows down an Upstart re-exec ('telinit u'). Note that the slow-down for a vanilla Trusty server may not be detectable unless the value of /proc/sys/fs/inotify/max_user_instances has been raised above the default value of 128. = FIX DETAILS = == bug 901038 == The fix for this bug now makes 'telinit u' block until the re-exec operation has completed fully. Technically, it is not possible to make 'telinit u' synchronous because D-Bus connections cannot be serialised: since the 'telinit u' request is made via a D-Bus connection, when Upstart re-exec's, it has to sever all D-Bus connections, including the 'telinit u' D-Bus connection. As such, 'telinit u' now performs the following operations:   - Requests synchronously that Upstart re-exec itself.   - Polls Upstart "forever" by attempting to connect to PID 1 and if     that operation fails, waiting for a period, then retrying. The code is well commented to explain this less-than-ideal but nonetheless essential poll operation:   http://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/util/telinit.c#L171 == bug 1338637 == This was a simple bug fix. = FIX AVAILABILITY = TBD. = IMPACT = On a high uptime server, the more times that 'telinit u' is called (as the result of normal apt-get updates to any of the packages listed under bug 901038 above), the longer the operation will take to complete, and thus the likelihood of bug 901038 being seen will increase. = JUSTIFICATION = System updates need to "just work". 
However, as outlined above, the longer a system's uptime, the more likely it is to be affected by this issue, which increases the likelihood of a system update failure. This issue needs to be fixed as soon as possible to minimise issues for trusty users and administrators, particularly before any potential upgrade to Utopic. = TEST CASE = == To demonstrate bug 1338637 == $ sudo telinit u $ sudo ls -l /proc/1/fd|grep inotify|wc -l The correct output from the above 2 commands should be simply "2". However, on trusty systems, if the commands above are run after a fresh boot, the value will in all likelihood be "4". Also, every subsequent 'telinit u' will increase the number displayed by 2. == To demonstrate bug 901038 == for i in $(seq 10); do telinit u; done There should be no output from the command-line above, but on a trusty system the chances are that one or more lines will display like this: $ for i in $(seq 5); do sudo telinit u;done telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused == To prove the overall problem has been fixed == 1) Download the attached "force-reexec.sh" onto a trusty system. 2) Make executable:    $ chmod 755 ./force-reexec.sh 3) Run script as root, specifying a number of iterations (100) and    saving output to file "./typescript.bad":    $ script -c 'sudo ./force-reexec.sh 100' typescript.bad 4) Upgrade to latest version of Upstart that fixes the 2 bugs. 5) Reboot. 6) Re-run the script    $ script -c 'sudo ./force-reexec.sh 100' typescript.good 7) Check the results:    - file "typescript.bad" will show an increasing number of watches      and *MAY* show a slowly increasing restart time (depending on      the value of /proc/sys/fs/inotify/max_user_instances (see above)).   
- file "typescript.good" should consistently show 2 watches and a     restart time of 0 on average. = TEST RESULTS = See the 2 attached typescript files showing a run of force-reexec.sh on an affected system, and the other showing the output of force-reexec.sh on a system where the problem has been resolved. [REGRESSION POTENTIAL] The only theoretically potential issue is what happens if the continuous poll performed by 'telinit u' never completes. However, this should never happen since: 1) Upstart checks to ensure that it can serialise its own state *before*    it actually performs the re-exec. If it cannot for some reason (the    only possibility here is critically low memory), it will    automatically degrade to a stateless re-exec. A stateless re-exec can    only fail if the on-disk "/sbin/init" binary or associated libraries    are somehow corrupted. 2) If, after checking that its own state can be serialised, the actual    re-exec operation fails "mid-flight", again, Upstart will    automatically revert to performing a simple stateless re-exec which    can only fail if the on-disk "/sbin/init" binary or associated    libraries are somehow corrupted. = OTHER INFORMATION = == Upgrade Procedure == Note that updating a system to the latest version of Upstart that fixes the two bugs outlined above will stop the number of inotify watches growing, but will NOT bring the value down to the expected "2". As such, to correct the problem fully, it is necessary to reboot the system after successfully upgrading the Upstart package. == D-Bus Daemon == Note that although Upstart makes heavy use of D-Bus, it does not require a D-Bus daemon to be running. Specifically, 'telinit u' communicates with PID 1 via a private abstract D-Bus socket, so is immune from issues with dbus-daemon(1). = Original Description = $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106
2014-09-18 10:24:01 James Hunt description = PROBLEM = The version of Upstart currently in trusty (1.12.1-0ubuntu4.2) suffers from a couple of problems which in combination could make upgrading difficult on systems with high uptimes. The 2 issues are: == bug 901038 == The issue here is that telinit in trusty is not fully synchronous. This is not normally a problem. However if Upstart is re-exec'ed using 'telinit u', as is done by the maintainer scripts for the following packages...   - libc6   - libdbus-1-3   - libjson-c2   - libnih1   - libnih-dbus1   - libselinux1   - libsepol1   - upstart ... and if that operation is slow (it is normally extremely fast), subsequent package upgrades (part of the same apt-get run) which *also* need to restart Upstart may fail since 'telinit u' is unable to connect to PID 1 as a result of Upstart still being in the process of re-exec'ing from the last call to 'telinit u'. == bug 1338637 == This bug only really affects server systems which have high uptimes. If a re-exec is triggered via 'telinit u' as part of a package upgrade (see above), Upstart will consume 2 additional inotify watches. Although this doesn't affect the correct behaviour of Upstart, it does have two repercussions: a) It wastes inotify watches. b) It slows down an Upstart re-exec ('telinit u'). Note that the slow-down for a vanilla Trusty server may not be detectable unless the value of /proc/sys/fs/inotify/max_user_instances has been raised above the default value of 128. = FIX DETAILS = == bug 901038 == The fix for this bug now makes 'telinit u' block until the re-exec operation has completed fully. Technically, it is not possible to make 'telinit u' synchronous because D-Bus connections cannot be serialised: since the 'telinit u' request is made via a D-Bus connection, when Upstart re-exec's, it has to sever all D-Bus connections, including the 'telinit u' D-Bus connection. 
As such, 'telinit u' now performs the following operations:   - Requests synchronously that Upstart re-exec itself.   - Polls Upstart "forever" by attempting to connect to PID 1 and if     that operation fails, waiting for a period, then retrying. The code is well commented to explain this less-than-ideal but nonetheless essential poll operation:   http://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/util/telinit.c#L171 == bug 1338637 == This was a simple bug fix. = FIX AVAILABILITY = TBD. = IMPACT = On a high uptime server, the more times that 'telinit u' is called (as the result of normal apt-get updates to any of the packages listed under bug 901038 above), the longer the operation will take to complete, and thus the likelihood of bug 901038 being seen will increase. = JUSTIFICATION = System updates need to "just work". However, as outlined above, the longer a system's uptime, the more likely it is to be affected by this issue, which increases the likelihood of a system update failure. This issue needs to be fixed as soon as possible to minimise issues for trusty users and administrators, particularly before any potential upgrade to Utopic. = TEST CASE = == To demonstrate bug 1338637 == $ sudo telinit u $ sudo ls -l /proc/1/fd|grep inotify|wc -l The correct output from the above 2 commands should be simply "2". However, on trusty systems, if the commands above are run after a fresh boot, the value will in all likelihood be "4". Also, every subsequent 'telinit u' will increase the number displayed by 2. 
== To demonstrate bug 901038 == for i in $(seq 10); do telinit u; done There should be no output from the command-line above, but on a trusty system the chances are that one or more lines will display like this: $ for i in $(seq 5); do sudo telinit u;done telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused == To prove the overall problem has been fixed == 1) Download the attached "force-reexec.sh" onto a trusty system. 2) Make executable:    $ chmod 755 ./force-reexec.sh 3) Run script as root, specifying a number of iterations (100) and    saving output to file "./typescript.bad":    $ script -c 'sudo ./force-reexec.sh 100' typescript.bad 4) Upgrade to latest version of Upstart that fixes the 2 bugs. 5) Reboot. 6) Re-run the script    $ script -c 'sudo ./force-reexec.sh 100' typescript.good 7) Check the results:    - file "typescript.bad" will show an increasing number of watches      and *MAY* show a slowly increasing restart time (depending on      the value of /proc/sys/fs/inotify/max_user_instances (see above)).   - file "typescript.good" should consistently show 2 watches and a     restart time of 0 on average. = TEST RESULTS = See the 2 attached typescript files showing a run of force-reexec.sh on an affected system, and the other showing the output of force-reexec.sh on a system where the problem has been resolved. [REGRESSION POTENTIAL] The only theoretically potential issue is what happens if the continuous poll performed by 'telinit u' never completes. However, this should never happen since: 1) Upstart checks to ensure that it can serialise its own state *before*    it actually performs the re-exec. If it cannot for some reason (the    only possibility here is critically low memory), it will    automatically degrade to a stateless re-exec. 
A stateless re-exec can    only fail if the on-disk "/sbin/init" binary or associated libraries    are somehow corrupted. 2) If, after checking that its own state can be serialised, the actual    re-exec operation fails "mid-flight", again, Upstart will    automatically revert to performing a simple stateless re-exec which    can only fail if the on-disk "/sbin/init" binary or associated    libraries are somehow corrupted. = OTHER INFORMATION = == Upgrade Procedure == Note that updating a system to the latest version of Upstart that fixes the two bugs outlined above will stop the number of inotify watches growing, but will NOT bring the value down to the expected "2". As such, to correct the problem fully, it is necessary to reboot the system after successfully upgrading the Upstart package. == D-Bus Daemon == Note that although Upstart makes heavy use of D-Bus, it does not require a D-Bus daemon to be running. Specifically, 'telinit u' communicates with PID 1 via a private abstract D-Bus socket, so is immune from issues with dbus-daemon(1). = Original Description = $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106 = PROBLEM = The version of Upstart currently in trusty (1.12.1-0ubuntu4.2) suffers from a couple of problems which in combination could make upgrading difficult on systems with high uptimes. The 2 issues are: == bug 901038 == The issue here is that telinit in trusty is not fully synchronous. This is not normally a problem. However if Upstart is re-exec'ed using 'telinit u', as is done by the maintainer scripts for the following packages...   - libc6   - libdbus-1-3   - libjson-c2   - libnih1   - libnih-dbus1   - libselinux1   - libsepol1   - upstart ... 
and if that operation is slow (it is normally extremely fast), subsequent package upgrades (part of the same apt-get run) which *also* need to restart Upstart may fail since 'telinit u' is unable to connect to PID 1 as a result of Upstart still being in the process of re-exec'ing from the last call to 'telinit u'. == bug 1338637 == This bug only really affects server systems which have high uptimes. If a re-exec is triggered via 'telinit u' as part of a package upgrade (see above), Upstart will consume 2 additional inotify watches. Although this doesn't affect the correct behaviour of Upstart, it does have two repercussions: a) It wastes inotify watches. b) It slows down an Upstart re-exec ('telinit u'). Note that the slow-down for a vanilla Trusty server may not be detectable unless the value of /proc/sys/fs/inotify/max_user_instances has been raised above the default value of 128. = FIX DETAILS = == bug 901038 == The fix for this bug now makes 'telinit u' block until the re-exec operation has completed fully. Technically, it is not possible to make 'telinit u' synchronous because D-Bus connections cannot be serialised: since the 'telinit u' request is made via a D-Bus connection, when Upstart re-exec's, it has to sever all D-Bus connections, including the 'telinit u' D-Bus connection. As such, 'telinit u' now performs the following operations:   - Requests synchronously that Upstart re-exec itself.   - Polls Upstart "forever" by attempting to connect to PID 1 and if     that operation fails, waiting for a period, then retrying. The code is well commented to explain this less-than-ideal but nonetheless essential poll operation:   http://bazaar.launchpad.net/~upstart-devel/upstart/trunk/view/head:/util/telinit.c#L171 == bug 1338637 == This was a simple bug fix. = FIX AVAILABILITY = TBD. 
= IMPACT = On a high uptime server, the more times that 'telinit u' is called (as the result of normal apt-get updates to any of the packages listed under bug 901038 above), the longer the operation will take to complete, and thus the likelihood of bug 901038 being seen will increase. = JUSTIFICATION = System updates need to "just work". However, as outlined above, the longer a system's uptime, the more likely it is to be affected by this issue, which increases the likelihood of a system update failure. This issue needs to be fixed as soon as possible to minimise issues for trusty users and administrators, particularly before any potential upgrade to Utopic. = TEST CASE = == To demonstrate bug 1338637 == $ sudo telinit u $ sudo ls -l /proc/1/fd|grep inotify|wc -l The correct output from the above 2 commands should be simply "2". However, on trusty systems, if the commands above are run after a fresh boot, the value will in all likelihood be "4". Also, every subsequent 'telinit u' will increase the number displayed by 2. == To demonstrate bug 901038 == for i in $(seq 10); do telinit u; done There should be no output from the command-line above, but on a trusty system the chances are that one or more lines will display like this: $ for i in $(seq 5); do sudo telinit u;done telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused telinit: Failed to connect to socket /com/ubuntu/upstart: Connection refused == To prove the overall problem has been fixed == 1) Download the attached "force-reexec.sh" onto a trusty system. 2) Make executable:    $ chmod 755 ./force-reexec.sh 3) Run script as root, specifying a number of iterations (100) and    saving output to file "./typescript.bad":    $ script -c 'sudo ./force-reexec.sh 100' typescript.bad 4) Upgrade to latest version of Upstart that fixes the 2 bugs. 5) Reboot. 
6) Re-run the script    $ script -c 'sudo ./force-reexec.sh 100' typescript.good 7) Check the results:    - file "typescript.bad" will show an increasing number of watches      and *MAY* show a slowly increasing restart time (depending on      the value of /proc/sys/fs/inotify/max_user_instances (see above)).   - file "typescript.good" should consistently show 2 watches and a     restart time of 0 on average. = TEST RESULTS = See the 2 attached typescript files showing a run of force-reexec.sh on an affected system, and the other showing the output of force-reexec.sh on a system where the problem has been resolved. [REGRESSION POTENTIAL] The only theoretically potential issue is what happens if the continuous poll performed by 'telinit u' never completes. However, this should never happen since: 1) Upstart checks to ensure that it can serialise its own state *before*    it actually performs the re-exec. If it cannot for some reason (the    only possibility here is critically low memory), it will    automatically degrade to a stateless re-exec. A stateless re-exec can    only fail if the on-disk "/sbin/init" binary or associated libraries    are somehow corrupted. 2) If, after checking that its own state can be serialised, the actual    re-exec operation fails "mid-flight", again, Upstart will    automatically revert to performing a simple stateless re-exec which    can only fail if the on-disk "/sbin/init" binary or associated    libraries are somehow corrupted. = OTHER INFORMATION = == Upgrade Procedure == Note that updating a system to the latest version of Upstart that fixes the two bugs outlined above will stop the number of inotify watches growing, but will NOT bring the value down to the expected "2". As such, to correct the problem fully, it is necessary to reboot the system after successfully upgrading the Upstart package. == D-Bus Daemon == Note that although Upstart makes heavy use of D-Bus, it does not require a D-Bus daemon to be running. 
Specifically, 'telinit u' communicates with PID 1 via a private abstract D-Bus socket, so is immune from issues with dbus-daemon(1). == Work Around == If your system is affected by this issue, you can perform the following actions to work around the issue in a non-invasive manner (by simply disabling all calls to 'telinit u' made by package maintainer scripts): $ sudo mkdir /root/bin $ sudo ln -s /bin/true /root/bin/telinit $ sudo su - # (PATH=/root/bin:$PATH which telinit) # (export PATH=/root/bin:$PATH; apt-get update && apt-get upgrade) # exit $ sudo /sbin/telinit u Note that PATH is exported inside the subshell so that the override applies to 'apt-get upgrade' as well as to 'apt-get update'. Note also the final required call to the real telinit to perform a real Upstart restart *after* apt-get has completed. = Original Description = $ sudo ls -al /proc/1/fd|grep anon|wc -l 2 $ i=0; while [ $i -lt 1024 ]; do sudo telinit u; i=$((i+1)); done $ sudo ls -al /proc/1/fd|grep anon|wc -l 106
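The check used in the TEST CASE section can be wrapped in a small function so that the count is easy to sample before and after each 'telinit u'. This is a minimal sketch and not the contents of the attached force-reexec.sh, which are not reproduced in this log; the function name count_inotify_fds is an assumption made for the example.

```shell
#!/bin/sh
# Hypothetical helper: count the inotify descriptors held by a process,
# using the same "ls -l /proc/PID/fd | grep inotify" approach as the test case.
# usage: count_inotify_fds PID
count_inotify_fds() {
    # grep -c prints the number of matching lines (0 if there are none,
    # or if the fd directory cannot be read).
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c inotify
}
```

Reading the descriptors of PID 1 needs root, so the function would be called as "count_inotify_fds 1" from a root shell. On a fixed system the description above expects the result to stay at 2 across repeated re-execs.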
2014-09-18 10:36:53 Launchpad Janitor branch linked lp:~jamesodhunt/ubuntu/trusty/upstart/SRU-bugs-901038+1338637
2015-06-18 02:56:26 Launchpad Janitor upstart (Ubuntu): status New Confirmed
2015-06-18 02:56:26 Launchpad Janitor upstart (Ubuntu Trusty): status New Confirmed
2015-06-18 02:56:43 Paul Collins bug added subscriber The Canonical Sysadmins
2015-07-18 00:44:38 Steve Langasek upstart (Ubuntu): status Confirmed Fix Released