HTTP connection issues on simple load testing

Bug #1423054 reported by Lakshmi Kannan
This bug affects 1 person
Affects   Status    Importance   Assigned to    Milestone
Mistral   Invalid   High         Moshe Elisha
Mitaka    Invalid   High         Moshe Elisha

Bug Description

We wrote a simple test case that calls the Mistral APIs /v2/executions and /v2/executions/id/tasks every second. We have 20 workflows running in parallel, so every second we hit both APIs for all 20 workflows in series. Mistral throws all sorts of connection errors (accept issues, timeouts, bad HTTP responses).

See logs: https://gist.githubusercontent.com/lakshmi-kannan/f4295ca17c6c5ccf34a6/raw/8a5813e67290358e96e552af252e4e9137216998/gistfile1.txt

It looks like we need to do some minimal stress testing of the HTTP endpoints (bypassing access to the DBs for now) and make sure the HTTP endpoints can scale.
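For anyone who wants to reproduce this, the polling loop described above can be sketched roughly as follows. This is a minimal sketch, not the original test case: the function names (`http_fetch`, `poll_executions`), the base URL, and the execution IDs are all assumptions made for illustration. The `fetch` and `sleep` parameters are injectable so the loop can be exercised without a running Mistral.

```python
import time
import urllib.request
from collections import Counter


def http_fetch(url, timeout=5.0):
    """Return the HTTP status code for url; raises on connection errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status


def poll_executions(base_url, execution_ids, rounds, interval=1.0,
                    fetch=http_fetch, sleep=time.sleep):
    """Hit both endpoints for every execution, in series, once per interval.

    Returns a Counter keyed by "ok" or by the exception class name.
    """
    outcomes = Counter()
    for _ in range(rounds):
        for ex_id in execution_ids:
            urls = (
                f"{base_url}/v2/executions",
                f"{base_url}/v2/executions/{ex_id}/tasks",
            )
            for url in urls:
                try:
                    fetch(url)
                    outcomes["ok"] += 1
                except Exception as exc:  # timeouts, resets, bad responses
                    outcomes[type(exc).__name__] += 1
        sleep(interval)
    return outcomes
```

Called as `poll_executions("http://localhost:8989", ids, rounds=60)` with a list of 20 hypothetical execution IDs, this would approximate one minute of the load described; the returned counter shows which error classes dominate.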

Anastasia Kuznetsova (akuznetsova) wrote :

Lakshmi, we have started working on stress testing and already have a few scenarios and a dedicated gate (gate-rally-dsvm-mistral-task). It seems we need to add more scenarios as soon as possible.

Dmitri Zimine (i-dz) wrote :

Looks like we are leaking SQL connections.
m4dcoder is posting results shortly.

Winson Chan (winson-c-chan) wrote :

This may be a separate bug, but I got past the HTTP errors when I moved to an Apache or nginx setup. After that we started noticing SQL connection problems: MySQL complains with error 1040 (too many connections). On closer examination using the commands below, we noticed that the SQL connections are still registered in MySQL even though the workflows completed long ago.

Example commands to get SQL connection numbers and API calls on a localhost:
mysql -hlocalhost -uMyUserName -pMyPassword -e "show processlist"
netstat -pant | grep ESTABLISHED | grep 8989 | wc -l
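If you want to track the netstat count over time from a script, a rough Python equivalent of the pipeline above might look like this. It is only a sketch: `count_established` is a hypothetical helper, it assumes the usual Linux `netstat -pant` column layout, and unlike the plain `grep 8989` it matches the port exactly rather than as a substring.

```python
def count_established(netstat_output, port=8989):
    """Count ESTABLISHED lines in `netstat -pant` output that involve port."""
    wanted = str(port)
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, foreign, state, pid/program
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        foreign_port = fields[4].rsplit(":", 1)[-1]
        if wanted in (local_port, foreign_port):
            count += 1
    return count
```

The text to parse can be captured with `subprocess.run(["netstat", "-pant"], capture_output=True, text=True).stdout` and passed in as `netstat_output`.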

no longer affects: mistral/liberty
Changed in mistral:
importance: Undecided → High
status: New → Triaged
Renat Akhmerov (rakhmerov) wrote :

Winson, can you confirm that this is not really a bug? As I remember from the discussions we had earlier this year, it is solved by putting Apache/Nginx in front of the Mistral API, right?

tags: added: liberty-backport-potential
Moshe Elisha (melisha) wrote :

Hi,

I did a very simple test against one Mistral VM (api+engine+executor) with Apache in front, and I also encountered the issue.
This was done on a setup built like a production setup, not a devstack or anything like it.

Scenario:
1. I stopped all Mistral components.

2. I deleted Mistral logs.

3. I started all Mistral components.

4. I ran this command for 1-2 minutes:

     watch -d -n 1 mistral execution-list

5. I encountered an error which can be seen in the attached logs "prod_setup_mistral_logs.zip".

Nikolay Makhotkin (nmakhotkin) wrote :

I've looked through your logs and see the following error:

Server-side error: "QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30".

This is a DB issue rather than an API issue. Did you configure the DB connection properly (in mistral.conf)? If not, try setting these properties in the config:

[database]
max_overflow = -1
max_pool_size = 1000
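To make the numbers in that error message concrete: with SQLAlchemy's QueuePool, a process can hold at most pool size plus overflow connections at once, and further requests wait for the timeout before failing. The sketch below just does that arithmetic. `max_db_connections` and its `processes` parameter are hypothetical names for illustration (each process keeps its own pool, so the bound multiplies), and the defaults are the values shown in the error above.

```python
def max_db_connections(pool_size=5, max_overflow=10, processes=1):
    """Upper bound on simultaneous DB connections; None means unbounded."""
    # A negative max_overflow disables the overflow limit in QueuePool,
    # and a pool_size of 0 means the pool itself has no size limit.
    if max_overflow < 0 or pool_size == 0:
        return None
    return processes * (pool_size + max_overflow)


# Values from the error message: size 5, overflow 10.
print(max_db_connections())             # 15
# The suggested max_overflow = -1 removes the cap entirely.
print(max_db_connections(1000, -1))     # None
```

With the cap removed, the effective limit becomes whatever the database server allows (MySQL's `max_connections`), so an unbounded pool trades the QueuePool timeout for a possible 1040 error on the server side.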

Moshe Elisha (melisha) wrote :

We are using the default values:

# Maximum number of SQL connections to keep open in a pool. (integer
# value)
#max_pool_size = <None>

# If set, use this value for max_overflow with SQLAlchemy. (integer
# value)
#max_overflow = <None>

These are the same as we have for the other services (Heat, Nova, etc.) and these work fine under load.

Moshe Elisha (melisha) wrote :

I have assigned this to myself. I will investigate further and update here.

Moshe Elisha (melisha) wrote :

Hi,

The configuration Nikolay suggested fixed the SQL connections issue. I monitored the SQL connection pool using:

SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE DB = 'mistral';

and did not see any SQL connection leak.

That said, I am seeing HTTP request timeouts when I run several (3-5) executions of the attached workflow in parallel.
I will continue investigating.

Steven Hardy (shardy) wrote :

Is this the same issue with defaults discussed here?

http://lists.openstack.org/pipermail/openstack-dev/2015-December/082717.html

Moshe Elisha (melisha) wrote :

Since we changed our configuration, this issue has not reproduced for us.
This is the configuration we use in mistral.conf:

max_pool_size = 16
max_overflow = 128

(The other properties mentioned in the mail are irrelevant).

Andras Kovi (akovi)
Changed in mistral:
status: Incomplete → Invalid