Mistral

HTTP connection issues on simple load testing

Bug #1423054 reported by Lakshmi Kannan on 2015-02-18

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Mistral	Invalid	High	Moshe Elisha
Mitaka	Invalid	High	Moshe Elisha

Bug Description

We wrote a simple test case that calls the Mistral APIs /v2/executions and /v2/executions/id/tasks every second. We have 20 workflows running in parallel, so every second we hit the above APIs for all 20 workflows in series. Mistral throws all sorts of connection errors (accept issues, timeouts, bad HTTP responses).

See logs: https://gist.githubusercontent.com/lakshmi-kannan/f4295ca17c6c5ccf34a6/raw/8a5813e67290358e96e552af252e4e9137216998/gistfile1.txt

It looks like we need to do some minimal stress testing of the HTTP endpoints (bypassing access to the DBs for now) and make sure the HTTP endpoints can scale.
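
The polling pattern described above can be sketched roughly as follows. This is only an illustration, not the reporter's actual test: the base URL, the function names, and the exact order of calls per workflow are assumptions (port 8989 is taken from the netstat command later in this thread).

```python
import time
import urllib.request

# Assumption: Mistral API address; adjust for the actual deployment.
BASE_URL = "http://localhost:8989"


def polling_urls(execution_ids, base_url=BASE_URL):
    """Return the URLs hit in one polling round, in series order."""
    urls = []
    for execution_id in execution_ids:
        urls.append(f"{base_url}/v2/executions")
        urls.append(f"{base_url}/v2/executions/{execution_id}/tasks")
    return urls


def poll_once(execution_ids, fetch=None, timeout=5.0):
    """Hit every URL once and return a list of (url, error) for failed calls."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
    failures = []
    for url in polling_urls(execution_ids):
        try:
            fetch(url)
        except Exception as exc:  # timeouts, refused connections, bad responses
            failures.append((url, repr(exc)))
    return failures


def run(execution_ids, rounds, interval=1.0):
    """Poll all executions once per interval and print any failures."""
    for _ in range(rounds):
        for url, err in poll_once(execution_ids):
            print(url, err)
        time.sleep(interval)
```

Calling run() with 20 execution IDs and a 1 second interval approximates the load described in this report; the fetch parameter exists only so the loop can be exercised without a live server.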

Revision history for this message

Anastasia Kuznetsova (akuznetsova) wrote on 2015-02-18:

Lakshmi, we have started working on stress testing and already have a few scenarios and a special gate (gate-rally-dsvm-mistral-task). It seems we need to add more scenarios as soon as possible.

Dmitri Zimine (i-dz) wrote on 2015-02-18:

Looks like we are leaking SQL connections.
m4dcoder is posting results shortly.

Winson Chan (winson-c-chan) wrote on 2015-02-18:

This may be a separate bug. Basically, I got past the HTTP errors when I moved to an Apache or nginx setup. Afterward we started noticing SQL connection problems: MySQL complains with error 1040 (too many connections). On closer examination using the following commands, we noticed that the SQL connections are still registered in MySQL even though the WFs completed long ago.

Example commands to get the number of SQL connections and API connections on localhost:
mysql -hlocalhost -uMyUserName -pMyPassword -e "show processlist"
netstat -pant | grep ESTABLISHED | grep 8989 | wc -l
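
For tracking the API connection count over time, the netstat pipeline above could be mirrored in a small script. The sketch below is an assumption-laden illustration: the function name is hypothetical, the sample text is made up for demonstration and is not taken from the logs, and unlike a plain grep it only matches the port in the local and foreign address columns.

```python
def count_established(netstat_output, port=8989):
    """Count ESTABLISHED lines whose local or foreign address uses the port."""
    needle = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        if "ESTABLISHED" not in fields:
            continue
        # Linux `netstat -pant` columns:
        # proto, recv-q, send-q, local address, foreign address, state, pid/program
        if any(field.endswith(needle) for field in fields[3:5]):
            count += 1
    return count


# Made-up sample in the shape of `netstat -pant` output (illustration only).
SAMPLE = """\
tcp 0 0 127.0.0.1:8989 127.0.0.1:51000 ESTABLISHED 1234/python
tcp 0 0 127.0.0.1:51000 127.0.0.1:8989 ESTABLISHED 5678/python
tcp 0 0 127.0.0.1:3306 127.0.0.1:51002 ESTABLISHED 999/mysqld
tcp 0 0 0.0.0.0:8989 0.0.0.0:* LISTEN 1234/python
"""

print(count_established(SAMPLE))
```

Feeding it real output (for example captured with subprocess) once per second would show whether the count keeps growing after the workflows finish.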

Renat Akhmerov (rakhmerov) on 2015-09-14

no longer affects:

mistral/liberty

Renat Akhmerov (rakhmerov) on 2015-09-14

Changed in mistral:
importance:	Undecided → High
status:	New → Triaged

Renat Akhmerov (rakhmerov) wrote on 2015-11-11:

Winson, can you confirm that this is not really a bug? As I remember from our discussions earlier this year, it is solved by putting Apache/Nginx in front of the Mistral API, right?

tags:

added: liberty-backport-potential

Moshe Elisha (melisha) wrote on 2015-11-12:

prod_setup_mistral_logs.zip (10.2 KiB, application/zip)

Hi,

I did a very simple test against one Mistral VM (api+engine+executor) with Apache in front and I also encountered the issue.
This was done on a setup built like a production setup, not a devstack or anything like it.

Scenario:
1. I stopped all Mistral components.

2. I deleted Mistral logs.

3. I started all Mistral components.

4. I ran this command for 1-2 minutes:

watch -d -n 1 mistral execution-list

5. I encountered an error which can be seen in the attached logs "prod_setup_mistral_logs.zip".

Nikolay Makhotkin (nmakhotkin) wrote on 2015-11-13:

I've looked through your logs and see the following error:

Server-side error: "QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30".

This is a DB issue rather than an API issue. Did you configure the DB connection properly (in mistral.conf)? If not, try setting these properties in the config:

[database]
max_overflow = -1
max_pool_size = 1000
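
To illustrate why the error above appears, here is a toy model of a pool with a fixed size plus a bounded overflow. It is not SQLAlchemy's implementation; the class and method names are hypothetical. With size 5 and overflow 10, at most 15 connections can be checked out at once, and the next request waits until the timeout expires. A negative max_overflow is modeled as "no limit", which is how SQLAlchemy's QueuePool treats -1.

```python
import threading


class BoundedPool:
    """Toy model of a connection pool: fixed size plus bounded overflow."""

    def __init__(self, pool_size=5, max_overflow=10, timeout=30.0):
        self.timeout = timeout
        if max_overflow < 0:
            # Modeled as unlimited: checkouts never block.
            self.capacity = None
            self._slots = None
        else:
            self.capacity = pool_size + max_overflow
            self._slots = threading.BoundedSemaphore(self.capacity)

    def checkout(self):
        """Take a slot, or raise TimeoutError if none frees up in time."""
        if self._slots is None:
            return
        if not self._slots.acquire(timeout=self.timeout):
            raise TimeoutError(
                f"pool limit of {self.capacity} reached, "
                f"timed out after {self.timeout}s"
            )

    def checkin(self):
        """Return a slot to the pool."""
        if self._slots is not None:
            self._slots.release()
```

In this model, any code path that checks out a connection and never returns it permanently consumes one of the 15 slots, so a steady stream of API requests eventually hits the timeout.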

Moshe Elisha (melisha) wrote on 2015-11-16:

We are using the default values:

# Maximum number of SQL connections to keep open in a pool. (integer
# value)
#max_pool_size = <None>

# If set, use this value for max_overflow with SQLAlchemy. (integer
# value)
#max_overflow = <None>

These are the same as we have for the other services (Heat, Nova, etc.) and these work fine under load.

Moshe Elisha (melisha) wrote on 2015-11-17:

I have assigned this to myself. I will investigate further and update here.

Moshe Elisha (melisha) wrote on 2015-11-18:

parallel.wf.yaml (1011 bytes, text/plain)

Hi,

The configuration Nikolay suggested fixed the SQL connection issue. I monitored the SQL connection pool using:

SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE DB = 'mistral';

and did not see any SQL connection leak.

That said, I am experiencing HTTP request timeouts when I run several (3-5) executions of the attached workflow in parallel.
I will continue investigating.

Steven Hardy (shardy) wrote on 2016-01-26:

#10

Is this the same issue with defaults discussed here?

http://lists.openstack.org/pipermail/openstack-dev/2015-December/082717.html