fm event list returns HTTP 504 error on system controller

Bug #2039101 reported by Agustin Carranza
Affects: StarlingX
Status: In Progress
Importance: Undecided
Assigned to: Agustin Carranza

Bug Description

Load Info / Patch Line-Up
STX 6.0 + P5

System Config
Distributed Cloud

Description of failure
fm event-list returns an HTTP 504 error, while fm alarm-list returns HTTP 200.

Timestamp when failure occurred

2023-01-19T20:47:34.000 127.0.0.1 haproxy[2204064]: info ******* [19/Jan/2023:20:46:04.539] fm-api-internal fm-api-internal-internal/s-fm-api-internal 0/0/0/-1/90002 504 194 - - sH-- 0/0/0/0/0 0/0 "GET /v1/event_log HTTP/1.1"

Issue intermittent (Frequency of occurrence) or 100% Reproducible?
Issue seen on user site

Impact of Failure
Standard

Time-line based on log analysis
---> fm alarm-list returns HTTP 200, but fm event-list returns HTTP 504.
##haproxy##

2023-01-19T20:45:21.000 127.0.0.1 haproxy[2204064]: info ********* [19/Jan/2023:20:45:20.699] fm-api-internal fm-api-internal-internal/s-fm-api-internal 0/0/0/611/611 200 188 - - ---- 3/1/0/1/0 0/0 "GET /v1/alarms HTTP/1.1

2023-01-19T20:47:34.000 127.0.0.1 haproxy[2204064]: info *********[19/Jan/2023:20:46:04.539] fm-api-internal fm-api-internal-internal/s-fm-api-internal 0/0/0/-1/90002 504 194 - - sH-- 0/0/0/0/0 0/0 "GET /v1/event_log HTTP/1.1"

2023-01-19T20:49:15.000 127.0.0.1 haproxy[2204064]: info **********[19/Jan/2023:20:47:45.066] fm-api-internal fm-api-internal-internal/s-fm-api-internal 0/0/0/-1/90001 504 194 - - sH-- 1/0/0/0/0 0/0 "GET /v1/event_log HTTP/1.1"
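The haproxy entries above can be read field by field. The following is a minimal sketch, assuming the lines follow the standard HAProxy HTTP log layout (five slash-separated timers in milliseconds, then status, bytes, and a four-character termination state). The function names and the small lookup tables are illustrative and cover only the codes seen in this report. Under that layout, a response timer of -1 together with a state starting with "sH" indicates that the server-side timeout expired before the backend sent response headers.

```python
import re

# Assumed layout: HAProxy HTTP log format. Timers are in milliseconds.
HAPROXY_HTTP = re.compile(
    r'(?P<frontend>\S+) (?P<backend>[^/\s]+)/(?P<server>\S+) '
    r'(?P<t_request>-?\d+)/(?P<t_queue>-?\d+)/(?P<t_connect>-?\d+)/'
    r'(?P<t_response>-?\d+)/(?P<t_total>\+?\d+) '
    r'(?P<status>\d{3}) (?P<bytes>\+?\d+) \S+ \S+ (?P<term>\S{4}) '
    r'\S+ \S+ "(?P<request>[^"]*)'
)

# Partial tables: only the termination codes relevant to this report.
FIRST_CHAR = {
    "s": "server-side timeout expired",
    "c": "client-side timeout expired",
    "-": "normal completion",
}
SECOND_CHAR = {
    "H": "while waiting for response headers from the server",
    "-": "normal completion",
}

INT_FIELDS = ("t_request", "t_queue", "t_connect", "t_response",
              "t_total", "status", "bytes")


def parse_haproxy_line(line):
    """Return a dict of the parsed fields, or None if the line does not match."""
    match = HAPROXY_HTTP.search(line)
    if match is None:
        return None
    record = match.groupdict()
    for key in INT_FIELDS:
        record[key] = int(record[key])
    return record


def describe_termination(term):
    """Describe the first two characters of the termination state."""
    first = FIRST_CHAR.get(term[0], "unknown")
    second = SECOND_CHAR.get(term[1], "unknown")
    return first + " / " + second


# Line copied from the haproxy log above (address already masked in the report).
SAMPLE = ('2023-01-19T20:47:34.000 127.0.0.1 haproxy[2204064]: info '
          '*********[19/Jan/2023:20:46:04.539] fm-api-internal '
          'fm-api-internal-internal/s-fm-api-internal 0/0/0/-1/90002 504 194 '
          '- - sH-- 0/0/0/0/0 0/0 "GET /v1/event_log HTTP/1.1"')

if __name__ == "__main__":
    rec = parse_haproxy_line(SAMPLE)
    print(rec["request"], rec["status"], rec["t_response"], rec["t_total"])
    print(describe_termination(rec["term"]))
```

Both 504 entries end after roughly 90 000 ms (90002 and 90001), which suggests, but does not prove, a 90-second server timeout on this backend.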

---> However, the openstack log shows the event_log API request returning HTTP 200 (not an error), but only after a very long time: about 178.2 s.

2023-01-19 20:49:02.740 105974 INFO eventlet.wsgi.server [req-70da8a71-b2b5-4dc8-ab7b-8cc931be11bb 0a8b63be2259496681b716d2843c4092 4dc0628b6b704366bf1daf5832aee437 default - -] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 572, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 518, in write
    wfile.writelines(towrite)
  File "/usr/lib64/python2.7/socket.py", line 334, in writelines
    self.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 401, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 395, in send
    return self._send_loop(self.fd.send, data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 382, in _send_loop
    return send_method(data, *args)
error: [Errno 104] Connection reset by peer

2023-01-19 20:49:02.741 105974 INFO eventlet.wsgi.server [req-70da8a71-b2b5-4dc8-ab7b-8cc931be11bb 0a8b63be2259496681b716d2843c4092 4dc0628b6b704366bf1daf5832aee437 default - -] 2607:f160:10:923e:ce:290:0:10,2607:f160:10:923e:ce:290:0:11 "GET /v1/event_log HTTP/1.1" status: 200  len: 0 time: 178.2016010

Key failure logs
##haproxy##, ##openstack.log##

Summary of triage
---> The issue was first seen after the 'network resiliency' testing mentioned by the user, but it appears to be unrelated.
---> The system is alarm free and returns HTTP 200 when fetching alarms.
---> However, fm event-list returns an HTTP 504 error.

[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list

[sysadmin@controller-0 ~(keystone_admin)]$ fm event-list --nowrap
HTTP Server Error (HTTP 504)
[sysadmin@controller-0 ~(keystone_admin)]$

---> The 'fm-api' process runs on both controllers, so the network resiliency test on controller-1 should not affect the fm process on controller-0.

##Controller-0##

backend fm-api-internal-internal
  server s-fm-api-internal ********
##Controller-1##

backend fm-api-internal-internal
  server s-fm-api-internal ***********

Workaround

1. Restart fm-api
sudo systemctl restart fm-api

2. Restart fm manager
sudo sm-restart service fm-mgr

3. Restart haproxy
sudo systemctl restart haproxy

Ask

Since both the alarm list and the event list are served by 'fm-api', can the cause be provided for 'fm event-list' returning a 504 error while the alarm list works fine?
HTTP 504 is the gateway timeout code, which indicates that the server took too long to respond and the request eventually timed out. But if that were the case, shouldn't both requests return a similar error?
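One possible reading of the logs is that the proxy timeout applies to each request separately, so a fast endpoint and a slow endpoint on the same service can see different outcomes. The sketch below is a simplified model, not haproxy's actual logic: the 90-second limit is inferred from the roughly 90 000 ms totals in the haproxy log, not read from the configuration, and the function name is hypothetical. The durations are the ones logged for /v1/alarms (611 ms) and /v1/event_log (178.2 s).

```python
# Assumption: server timeout inferred from the ~90 000 ms totals in the
# haproxy log; the real value lives in the haproxy configuration.
ASSUMED_TIMEOUT_SERVER_SECONDS = 90.0


def proxy_outcome(backend_seconds, timeout_seconds=ASSUMED_TIMEOUT_SERVER_SECONDS):
    """Simplified model: 504 if the backend answers after the timeout, else 200."""
    return 504 if backend_seconds > timeout_seconds else 200


# Durations taken from the logs in this report.
OBSERVED = {
    "GET /v1/alarms": 0.611,
    "GET /v1/event_log": 178.2016010,
}

if __name__ == "__main__":
    for request, seconds in OBSERVED.items():
        print(request, "->", proxy_outcome(seconds))
```

Under this model the question becomes why the event_log query takes about 178 s, rather than why the two requests differ.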

Changed in starlingx:
assignee: nobody → Agustin Carranza (acarranz)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/898005

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by "Agustin Carranza <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/898005
Reason: Different approach will be followed.
