SQL errors on undercloud/overcloud cause HA deployments to fail

Bug #1585275 reported by Tim Rozet
Affects: tripleo
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

In OPNFV we have run into several different issues on the undercloud while deploying. First we were running out of memory and the openvswitch agent was crashing; we fixed that by increasing the RAM for the undercloud VM to 12GiB. Following that change we started to hit random errors where heat stack queries to SQL would fail, mysqld would crash, etc.:

ERROR: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '192.0.2.1' ([Errno 111] ECONNREFUSED)") [SQL: u'SELECT 1']

^https://build.opnfv.org/ci/job/apex-deploy-virtual-os-nosdn-nofeature-ha-master/249/console

ERROR: Remote error: DBConnectionError (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'SELECT stack.created_at AS stack_created_at, stack.deleted_at AS stack_deleted_at, stack.action AS stack_action, stack.status AS stack_status, stack.status_reason AS stack_status_reason, stack.id AS stack_id, stack.name AS stack_name, stack.raw_template_id AS stack_raw_template_id, stack.prev_raw_template_id AS stack_prev_raw_template_id, stack.username AS stack_username, stack.tenant AS stack_tenant, stack.user_creds_id AS stack_user_creds_id, stack.owner_id AS stack_owner_id, stack.parent_resource_name AS stack_parent_resource_name, stack.timeout AS stack_timeout, stack.disable_rollback AS stack_disable_rollback, stack.stack_user_project_id AS stack_stack_user_project_id, stack.backup AS stack_backup, stack.nested_depth AS stack_nested_depth, stack.convergence AS stack_convergence, stack.current_traversal AS stack_current_traversal, stack.current_deps AS stack_current_deps, stack.updated_at AS stack_updated_at, raw_template_1.created_at AS raw_template_1_created_at, raw_template_1.updated_at AS raw_template_1_updated_at, raw_template_1.id AS raw_template_1_id, raw_template_1.template AS raw_template_1_template, raw_template_1.files AS raw_template_1_files, raw_template_1.environment AS raw_template_1_environment \nFROM stack LEFT OUTER JOIN raw_template AS raw_template_1 ON raw_template_1.id = stack.raw_template_id \nWHERE stack.id = %s'] [parameters: ('5cd88f2b-4d8f-4c3b-9656-c400d7684d47',)]
[u'

^https://build.opnfv.org/ci/job/apex-deploy-virtual-os-odl_l2-nofeature-ha-master/259/console

I think there is still some kind of CPU or other resource contention. Looking at num_engine_workers in heat, this value has no explicit limit by default on the undercloud. Setting it to 2, along with the heat API workers, solved the issue via this commit in OPNFV:

https://gerrit.opnfv.org/gerrit/#/c/14523/

TripleO should set a limit on num_engine_workers when configuring the undercloud, to avoid forking so many heat-engine processes. In our OPNFV CI, deployment time with the value set to 2 is about the same as with the unrestricted default: about 35-40 minutes for an HA deployment.
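
For reference, capping the workers on an existing undercloud amounts to something like the following sketch (the crudini calls, the [heat_api] workers option name, and the RDO service names are assumptions on my part; the actual OPNFV fix is the gerrit change above):

    # Hypothetical illustration: cap heat engine and API workers on an
    # already-installed undercloud, then restart the services.
    sudo crudini --set /etc/heat/heat.conf DEFAULT num_engine_workers 2
    sudo crudini --set /etc/heat/heat.conf heat_api workers 2
    sudo systemctl restart openstack-heat-engine openstack-heat-api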

Revision history for this message
Steven Hardy (shardy) wrote :

Note that the default is not unlimited: it will use either the number of cores on the box or 4, whichever is greater, and using the number of cores is the default behavior for most OpenStack services.

The memory usage issues are under investigation, see:

https://bugs.launchpad.net/heat/+bug/1570983
https://bugs.launchpad.net/heat/+bug/1570974

The problem with forcing e.g. 2 as a default is that when you deploy very large TripleO stacks consisting of many compute nodes, a single worker (or even two) can't keep up, and RPC timeouts cause the deployment to fail (ref bug #1526045).

How many cores are you using in this environment? We've seen issues similar to what you report in the past when trying to run deployments on a single-core VM; at least two (preferably four) cores are recommended, as the undercloud services are fairly CPU and memory intensive. 12G of RAM should be fine provided you're deploying relatively small overclouds; add some swap and monitor its usage to be sure.
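
For example, a swap file could be added and watched along these lines (the size and path here are arbitrary assumptions):

    # Rough sketch: create a 4G swap file on the undercloud and keep an eye
    # on its usage while a deployment runs.
    sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
    sudo chmod 600 /swapfile
    sudo mkswap /swapfile
    sudo swapon /swapfile
    echo '/swapfile none swap defaults 0 0' | sudo tee -a /etc/fstab

    # Watch memory and swap usage during the deployment.
    watch -n 30 free -h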

Revision history for this message
Tim Rozet (trozet) wrote :

Thanks Steve. I see now in the engine code:
    workers = cfg.CONF.num_engine_workers
    if not workers:
        # an unset value falls back to max(4, core count), not unlimited
        workers = max(4, processutils.get_worker_count())

This should be documented in heat.conf, instead of just:
# Number of heat-engine processes to fork and run. (integer value)
#num_engine_workers = <None>

I'll file a patch to make that more clear.

For our setup, we have multiple identical servers running our CI. We do a virtual deployment with 3 controllers and 2 compute nodes. Each VM is 8GiB RAM with 4 VCPUs. Our Undercloud VM is 12GiB with 4 VCPUs, with no swap. The host server itself has 72 cores (with hyperthreading) on 2 sockets, with 132GiB RAM.

Have you seen those SQL failures before? We also sometimes see clustercheck failures on the overcloud.

Revision history for this message
Tim Rozet (trozet) wrote :

We are now seeing this on the undercloud on almost every run:

Error: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 "No such file or directory")
Error: Could not prefetch mysql_database provider 'mysql': Execution of '/usr/bin/mysql -NBe show databases' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 "No such file or directory")

summary: - Restrict heat num_engine_workers on Undercloud
+ SQL errors on undercloud/overcloud cause HA deployments to fail
Revision history for this message
Tim Rozet (trozet) wrote :

Log from comment 3:
https://build.opnfv.org/ci/job/apex-deploy-baremetal-os-nosdn-nofeature-ha-master/6/console

I think that is the undercloud failing, but it could be the overcloud; it's hard for me to tell from the output.

Revision history for this message
Tim Rozet (trozet) wrote :

It turns out the failures from comment #3 are multiple and all in the overcloud, and the issue is exacerbated by using Ceph on the control and compute nodes. There are two distinct failures:

1) SQL calls to create the databases (the per-service db schema upgrades in step 2) fail to connect to SQL. This seems to be a timing issue around when the mariadb cluster is really ready to handle requests. A hack fix is to add a sleep between when clustercheck passes and when the db syncs start.

I have noticed that clustercheck appears to pass even when only 1 node is in the cluster. My theory is that not all members of the cluster have joined, and they may join around the same time as the db schema upgrades happen, causing some type of deadlock.
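
A less fragile alternative to a fixed sleep would be a readiness gate that polls the galera state before any db sync runs; a minimal sketch (the expected node count, timeout, and client credentials are assumptions):

    # Wait until the local galera node reports Synced and the cluster has the
    # expected number of members before starting any db syncs.
    EXPECTED_NODES=${EXPECTED_NODES:-3}
    TIMEOUT=600
    start=$(date +%s)
    while true; do
        state=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_local_state_comment'" 2>/dev/null | awk '{print $2}')
        size=$(mysql -Nse "SHOW STATUS LIKE 'wsrep_cluster_size'" 2>/dev/null | awk '{print $2}')
        if [ "$state" = "Synced" ] && [ "${size:-0}" -ge "$EXPECTED_NODES" ]; then
            echo "galera is synced with $size members"
            break
        fi
        if [ $(( $(date +%s) - start )) -ge "$TIMEOUT" ]; then
            echo "timed out waiting for galera to become ready" >&2
            exit 1
        fi
        sleep 5
    done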

The interesting part of this failure is that it occurs in the puppet mysql provider, but for some reason the resource that called the provider does not fail. Not sure how that happens...

2) Commands to access the ceph mon and osd time out. The problem looks to be some type of resource contention with other services coming up around the same time. Moving the Ceph configuration to "step1" and making it happen first (before the TripleO loadbalancer, mongodb, etc.) fixes the problem.

We have fixed this in our OPNFV fork of THT with this patch:
https://github.com/trozet/opnfv-tht/pull/18/files

This resolved #2 completely. We still see some issues with #1, but now it seems to be a matter of the db schema upgrades themselves happening too quickly for each service. We are working on a patch to serialize the db schema upgrades and add a 10-second sleep between each one to see if that fixes the issue.
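
For illustration, the serialization amounts to something like the following (the service list and exact commands are examples, not the actual patch):

    # Run each service's schema upgrade one at a time, with a pause in
    # between, instead of letting them all fire concurrently.
    DB_SYNCS=(
        "su -s /bin/sh -c 'keystone-manage db_sync' keystone"
        "su -s /bin/sh -c 'glance-manage db_sync' glance"
        "su -s /bin/sh -c 'nova-manage db sync' nova"
        "su -s /bin/sh -c 'neutron-db-manage upgrade heads' neutron"
    )
    for cmd in "${DB_SYNCS[@]}"; do
        echo "running: $cmd"
        eval "$cmd" || { echo "db sync failed: $cmd" >&2; exit 1; }
        sleep 10   # breathing room between schema upgrades
    done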

Revision history for this message
Tim Rozet (trozet) wrote :

As previously mentioned, we came up with a patch that moves galera to step 1, where there is less resource contention. We also then serialize DB schema upgrades. Refer to this patch:
https://github.com/trozet/opnfv-tht/commit/9bc4a4fc9412ee67075ed2421523d876bab5979a

18 virtual deployments with HA and Ceph on all nodes were run to test out these changes. 17/18 deployments passed. I believe these fixes (or some similar fix) should go into stable/mitaka for TripleO.

Revision history for this message
Adam Young (ayoung) wrote :

Redeploys seem to trigger this. I have a 12GB undercloud, and 3 redeploys just triggered the OOM killer.

Aug 18 18:15:06 undercloud kernel: Out of memory: Kill process 13124 (mysqld) score 40 or sacrifice child
Aug 18 18:15:06 undercloud kernel: Killed process 13124 (mysqld) total-vm:4826828kB, anon-rss:494700kB, file-rss:0kB

Revision history for this message
Steven Hardy (shardy) wrote :

> I believe these fixes (or some similar fix) should go into stable/mitaka for TripleO.

We can certainly discuss that. Have you proposed the patch to upstream TripleO at all, or does it only exist in your fork?

I'd like to define:

1. What fixes we need on master to resolve this, and in particular how different this may look with the HA-lite architecture that has been under development

2. How can we serialize db-sync commands in the composable services architecture, where we define per-service profiles that don't have any knowledge of other services (e.g. the puppet orchestration you used probably won't be possible).

3. Can we justify a mitaka-only patch if the problem isn't fixed on master? My opinion is we can't; we need to figure out a fix for this on master too, or folks will just break again on upgrade.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → newton-3
Revision history for this message
Tim Rozet (trozet) wrote :

Hi Steve,
There are a couple more patches that we came up with to get to 100% pass rate in our CI:
https://github.com/trozet/opnfv-tht/pull/41/commits/1305620235fcb368525be87b038bff633f418ed9
https://github.com/trozet/opnfv-tht/pull/40/commits/7b8cae1b1ccec49fed5fe1c75075d33b7c19fde3
^ These include trying to recover pacemaker services scored with -infinity and moving mongodb to step 1

None of these have been proposed upstream. I think they fundamentally conflict with TripleO's "steps" definitions, and they involve adding sleep commands... which may be too hacky. Let me also state that the failures are not due to hardware or the type of deployment (virtual or baremetal).

1. Right, we basically need to validate whether these problems exist on master. In OPNFV our current dev cycle is based on stable/mitaka, and we will move to newton in the middle of September. At that time I can test and provide feedback on any stability issues that still exist.

2. There are two solutions I see here: 1) change the entire step model to allow for more steps, i.e. break pieces of the deployment into more steps to isolate them further; or 2) keep the same step model, but in composable roles add a way to declare resource chains that eventually get translated into puppet ordering chains. So if a user declared a "controller" role with Neutron and Nova pieces, they could do something like Neutron::Server -> Nova::Scheduler.

3. Probably not worth it at this point.

Side note: this is just my opinion, but since we are talking about the design/step model here, I think the entire method of passing the step into every puppet module is not scalable. Composable service puppet manifests now check for steps with hardcoded conditions. What happens if we find we need to insert a new step between 3 and 4 to fix one of these resource contention issues? All of the puppet manifests would have to be edited. It is also not very visible, from an overall deployment-flow perspective, what is happening when (we did have that when it was all a single manifest). This goes along with my second solution in #2. I would prefer if we could remove all of this step logic and just provide ordering in some abstract/dynamic way (as in the composable role), or perhaps parameterize the manifests more, rather than doing "if $step >= 3" in the puppet manifests.

Steven Hardy (shardy)
Changed in tripleo:
milestone: newton-3 → newton-rc1
Changed in tripleo:
milestone: newton-rc1 → newton-rc2
Revision history for this message
Emilien Macchi (emilienm) wrote :

Nobody is actively working on this at the moment, and I'm not sure it will make RC2. Reading the last comment, it looks like a spec would be required, as you are suggesting design changes. Deferring it to Ocata 1.

Changed in tripleo:
milestone: newton-rc2 → ocata-1
Revision history for this message
Tim Rozet (trozet) wrote :

Looks like we are hitting this in stable/newton:
https://build.opnfv.org/ci/job/apex-deploy-virtual-os-odl_l2-nofeature-ha-master/749/console

DB sync errors there. We need to do more testing to confirm. If this is still a problem, we will figure out a way to order DB syncs after the galera cluster check within the composable model.

Revision history for this message
Michele Baldessari (michele) wrote :

I will try to take a deeper look in the next few days, but it seems it is failing in step 3:
2016-10-20 08:00:15Z [overcloud.AllNodesDeploySteps.ControllerDeployment_Step2]: CREATE_COMPLETE state changed
....
Notice: /Stage[main]/Gnocchi::Db::Sync/Exec[gnocchi-db-sync]/returns: 2016-10-20 08:04:42.848 967 CRITICAL gnocchi [-] DBConnectionError: (pymysql.err.OperationalError) (2003, \"Can't connect to MySQL server on '192.0.2.7' ([Errno 113] No route to host)\")

In step 2 the galera db should actually be up and running. Is 192.0.2.7 the correct VIP for galera on this host? Any chance you can pull sosreports from this install and pass them along? (I am bandini on IRC.)

Revision history for this message
Tim Rozet (trozet) wrote :

Hey Michele,
It turns out this was a false positive. As you said the issue was actually due to this bug:
https://review.openstack.org/#/c/391660/

Haproxy was failing to start, and the db syncs use the VIP to connect to SQL, so that whole thing was failing. Once OPNFV has fully moved to stable/newton, we can monitor deployments over the coming weeks and see whether this bug can still be reproduced.

Revision history for this message
Michele Baldessari (michele) wrote :

Ah good to know, thanks for the update Tim.

Shall we close this one now?

Revision history for this message
Tim Rozet (trozet) wrote :

Let's wait a couple weeks. Then our CI will be running almost 24/7 and we can see if we can reproduce the error.

Steven Hardy (shardy)
Changed in tripleo:
milestone: ocata-1 → ocata-2
Changed in tripleo:
milestone: ocata-2 → ocata-3
Revision history for this message
Tim Rozet (trozet) wrote :

FYI we have daily stable/newton CI running now and I see instances of mysql failures in multiple runs. Note that these failures to access sql do not fail the puppet apply (perhaps we should figure out why?):

https://build.opnfv.org/ci/job/apex-deploy-baremetal-os-nosdn-nofeature-ha-master/79/consoleFull

 "exception: connect failed\n\u001b[1;31mWarning: Scope(Class[Mongodb::Server]): Replset specified, but no replset_members or replset_config provided.\u001b[0m\n\u001b[1;31mWarning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.\u001b[0m\n\u001b[1;31mError: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 \"No such file or directory\")\u001b[0m\n", "deploy_status_code": 0}

Dec 15 06:03:15 localhost os-collect-config: Warning: Scope(Class[Mongodb::Server]): Replset specified, but no replset_members or replset_config provided.
Dec 15 06:03:15 localhost os-collect-config: Warning: Scope(Haproxy::Config[haproxy]): haproxy: The $merge_options parameter will default to true in the next major release. Please review the documentation regarding the implications.
Dec 15 06:03:15 localhost os-collect-config: Error: Could not prefetch mysql_user provider 'mysql': Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 "No such file or directory")

Revision history for this message
Michele Baldessari (michele) wrote :

The specific non-fatal mysql user puppet error was fixed with Ifad3cb40fd958d7ea606b9cd2ba4c8ec22a8e94e (bug 1633113). In any case it is harmless, although admittedly very confusing.

Revision history for this message
Tim Rozet (trozet) wrote :

I have the fix for that bug and still hit this. I think there are multiple places where this error can occur while the SQL configuration is being done by each service. You can see in that log that the service DB configurations were being done in that step, but I cannot determine which manifest triggered the error. This is what we saw in mitaka, before we serialized all the db syncs.

Changed in tripleo:
milestone: ocata-3 → ocata-rc1
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Revision history for this message
Tim Rozet (trozet) wrote :

I have not encountered this bug in some time now. I think it is safe to close it and we can re-open if we hit it again.

Changed in tripleo:
status: Triaged → Incomplete
Changed in tripleo:
milestone: ocata-rc2 → pike-1
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Tim Rozet (trozet)
Changed in tripleo:
status: Incomplete → Invalid
Changed in tripleo:
milestone: pike-rc2 → none