Periodically Apache httpd restart fails

Bug #1572976 reported by Sagi (Sergey) Shnaidman
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Medium
Gabriele Cerami

Bug Description

In nonha jobs where SSL is enabled, the Apache server sometimes fails to restart on overcloud controller machine:

Apr 21 09:43:21 overcloud-controller-0 os-collect-config[3537]: Error: /Stage[main]/Apache::Service/Service[httpd]/ensure: change from stopped to running failed: Could not start Service[httpd]: Execution of '/usr/bin/systemctl start httpd' returned 1: Job for httpd.service failed because a timeout was exceeded. See "systemctl status httpd.service" and "journalctl -xe" for details.

Apr 21 09:38:23 overcloud-controller-0 systemd[1]: httpd.service start-pre operation timed out. Terminating.
Apr 21 09:38:23 overcloud-controller-0 systemd[1]: Unit httpd.service entered failed state.
Apr 21 09:38:23 overcloud-controller-0 systemd[1]: httpd.service failed.

Logs: http://logs.openstack.org/47/308747/1/check-tripleo/gate-tripleo-ci-f22-nonha/b02982c/

It's not always reproducible,
Probably relate: https://bugzilla.redhat.com/show_bug.cgi?id=907579

Tags: ci
Steven Hardy (shardy)
tags: added: ci
Changed in tripleo:
status: New → Triaged
importance: Undecided → Medium
milestone: none → newton-1
Steven Hardy (shardy)
Changed in tripleo:
milestone: newton-1 → newton-2
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
Steven Hardy (shardy) wrote :

Unassigned and no definitely analysis of root-cause, so moving to n-3

Changed in tripleo:
milestone: newton-2 → newton-3
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :
Revision history for this message
James Slagle (james-slagle) wrote :

actual httpd error from journal is:

Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:33:54 overcloud-controller-0 python[28384]: ERROR:scss.ast:Function not found: function-exists:1
Jul 22 11:34:20 overcloud-controller-0 systemd[1]: httpd.service start-pre operation timed out. Terminating.
Jul 22 11:34:20 overcloud-controller-0 python[28384]: Found 'compress' tags in:
Jul 22 11:34:20 overcloud-controller-0 python[28384]: /usr/share/openstack-dashboard/openstack_dashboard/templates/horizon/_conf.html
Jul 22 11:34:20 overcloud-controller-0 python[28384]: /usr/share/openstack-dashboard/openstack_dashboard/templates/_stylesheets.html
Jul 22 11:34:20 overcloud-controller-0 python[28384]: /usr/share/openstack-dashboard/openstack_dashboard/templates/horizon/_scripts.html
Jul 22 11:34:20 overcloud-controller-0 python[28384]: Compressing... done
Jul 22 11:34:20 overcloud-controller-0 python[28384]: Compressed 5 block(s) from 3 template(s) for 2 context(s).
Jul 22 11:34:20 overcloud-controller-0 systemd[1]: Failed to start The Apache HTTP Server.
Jul 22 11:34:20 overcloud-controller-0 systemd[1]: Unit httpd.service entered failed state.
Jul 22 11:34:20 overcloud-controller-0 systemd[1]: httpd.service failed.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

@James, it's a different case.
In the case of this bug it is only message about timeout in journal log.

httpd.service start-pre operation timed out. Terminating.

Revision history for this message
Matthias Runge (mrunge) wrote :

does increasing timeout in /lib/system/systemd/httpd.conf.d/openstack solve this issue?

Revision history for this message
Giulio Fidente (gfidente) wrote :

I seem to be hitting a similar issue but I *do not* see the timeout messages in the journal log; instead the deployment is stuck in Step4

in the process list I can see systemd attempting a restart of httpd

  /usr/bin/systemctl restart httpd

there is no httpd process running though; the error_log mentions the restart signal:

  [core:notice] [pid 18223] SELinux policy enabled; httpd running as context system_u:system_r:httpd_t:s0
  [suexec:notice] [pid 18223] AH01232: suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
  [auth_digest:notice] [pid 18223] AH01757: generating secret for digest authentication ...
  [mpm_prefork:notice] [pid 18223] AH00163: Apache/2.4.6 (CentOS) mod_wsgi/3.4 Python/2.7.5 configured -- resuming normal operations
  [core:notice] [pid 18223] AH00094: Command line: '/usr/sbin/httpd -D FOREGROUND'
  [mpm_prefork:notice] [pid 18223] AH00170: caught SIGWINCH, shutting down gracefully

nothing seems to happen after that.

Changed in tripleo:
assignee: nobody → Gabriele Cerami (gcerami)
Revision history for this message
James Slagle (james-slagle) wrote :

@sagi, how is it a different issue? I pulled that from the logs you linked in:

http://logs.openstack.org/83/325983/6/check-tripleo/gate-tripleo-ci-centos-7-nonha/fe0739e//

Revision history for this message
Gabriele Cerami (gcerami) wrote :

I'm still unable to reproduce the bug. I spent two days running SSL non-ha deployments without problems of this type. I'll set up an automatic job that continuously runs this setup and stops when this pattern is found, so it can be debugged.

Revision history for this message
Jodie Miu (jodiemiu) wrote :

I'm seeing the exact same errors that you are, down to the 'scss.ast: Function not found' lines. I am using RDO Mitaka as well, along with SSL and non-HA.

Aug 04 18:11:06 overcloud-controller-0 python[12804]: ERROR:scss.ast:Function not found: function-exists:1
Aug 04 18:11:09 overcloud-controller-0 systemd[1]: httpd.service start-pre operation timed out. Terminating.

For some reason, I am continuously seeing this problem. It's not something that occurs rarely for me, unfortunately.

I increased the TimeoutStartSec configured for the httpd.service, but it still failed (since it was the start pre-operation that was timing out).

Apparently, I did not increase the timeout in a way that took -- at least, systemctl show httpd.service does not show the change I made to TimeoutStartSec.

systemd runs these two commands before beginning to start apache:
http://paste.openstack.org/show/549307/

The second command takes 2m 47s on my system to complete, which is beyond the default 1m 30s timeout period set.

Revision history for this message
Jodie Miu (jodiemiu) wrote :

I forgot to note that I was seeing apachectl -t return a "Syntax OK". In addition, I saw that manually starting Apache without using systemctl worked. This led me down the path of looking at the service config files.

I realized that the reason I still had failures with the timeout increase was because of a typo (thanks to @gcerami for pointing this out). With the TimeoutStartSec variable increased to 5m, httpd start/restart succeeds.

Revision history for this message
Gabriele Cerami (gcerami) wrote :

This bug is not specific to tripleo, but should be a problem introduced by /lib/systemd/system/httpd.service.d/openstack-dashboard.conf which is contained in the openstack-dashboard package. I proposed a fix here http://review.rdoproject.org/r/1811 to modify the timeout

Changed in tripleo:
status: Triaged → Fix Committed
Changed in tripleo:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.