mariadb.pid checking after mariadb container starts fails deployment

Bug #1564278 reported by Steven Dake
Affects         Status        Importance  Assigned to   Milestone
kolla           Fix Released  High        Steven Dake
kolla (Mitaka)  Fix Released  High        Steven Dake

Bug Description

I'm playing with kolla on 10 hosts. I had a working setup until recently, when I cleaned the hosts and redeployed. The deployment now fails at mariadb.

Here are some logs

Deploy log:
...
...
TASK: [mariadb | Waiting for MariaDB service to be ready] *********************
ok: [opst3]
failed: [opst1] => {"attempts": 6, "changed": false, "cmd": ["docker", "exec", "mariadb", "ls", "/var/lib/mysql/mariadb.pid"], "delta": "0:00:00.160993", "end": "2016-03-26 20:44:34.960145", "failed": true, "rc": 2, "start": "2016-03-26 20:44:34.799152", "stdout_lines": [], "warnings": []}
stderr: ls: cannot access /var/lib/mysql/mariadb.pid: No such file or directory
msg: Task failed as maximum retries was encountered

TASK: [mariadb | Creating haproxy mysql user] *********************************
failed: [opst3] => {"attempts": 10, "changed": false, "cmd": ["docker", "exec", "-t", "kolla_toolbox", "/usr/bin/ansible", "localhost", "-m", "mysql_user", "-a", "login_host='10.0.112.63' login_port='3306' login_user='root' login_password='AJJtJquvk6rBHg65eaPhsEBPlHUGs96vMq09GQUW' name='haproxy' password='' host='%' priv=*.*:USAGE"], "delta": "0:00:00.632850", "end": "2016-03-26 20:45:35.785470", "failed": true, "rc": 2, "start": "2016-03-26 20:45:35.152620", "stdout_lines": ["localhost | FAILED! => {", " \"changed\": false, ", " \"failed\": true, ", " \"msg\": \"(1047, 'WSREP has not yet prepared node for application use')\"", "}"], "warnings": []}
stdout: localhost | FAILED! => {
    "changed": false,
    "failed": true,
    "msg": "(1047, 'WSREP has not yet prepared node for application use')"
}
msg: Task failed as maximum retries was encountered
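(Editorial note, not part of the original report: MySQL error 1047, "WSREP has not yet prepared node for application use", means the Galera node is not yet a synced member of a Primary Component, so it refuses all queries. A minimal sketch of the readiness condition, using hypothetical sample values for the `wsrep_%` status variables a node reports:)

```python
# Sketch: decide whether a Galera node is ready to serve queries,
# based on the variables reported by SHOW STATUS LIKE 'wsrep_%'.
# The sample dictionaries below are hypothetical, not from this bug.

def galera_ready(status: dict) -> bool:
    """A node is usable only when it is Synced inside a Primary Component."""
    return (
        status.get("wsrep_ready") == "ON"
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
    )

# A node still joining -- the state that produces error 1047:
joining = {
    "wsrep_ready": "OFF",
    "wsrep_cluster_status": "non-Primary",
    "wsrep_local_state_comment": "Joining",
}

# A healthy, synced node:
synced = {
    "wsrep_ready": "ON",
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
}

print(galera_ready(joining))  # → False
print(galera_ready(synced))   # → True
```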

FATAL: all hosts have already failed -- aborting

===========
$ docker exec heka cat /var/log/kolla/mariadb/mariadb.log

WSREP_SST: [INFO] Streaming with xbstream (20160326 19:43:00.952)
WSREP_SST: [INFO] Using socat as streamer (20160326 19:43:00.954)
WSREP_SST: [INFO] Using /tmp/tmp.JhFMycZXyl as xtrabackup temporary directory (20160326 19:43:00.984)
WSREP_SST: [INFO] Using /tmp/tmp.KSTtd4nr8L as innobackupex temporary directory (20160326 19:43:00.987)
WSREP_SST: [INFO] Streaming GTID file before SST (20160326 19:43:00.992)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio TCP:10.0.112.63:4444; RC=( ${PIPESTATUS[@]} ) (20160326 19:43:00.994)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20160326 19:43:01.188)
WSREP_SST: [INFO] Streaming the backup to joiner at 10.0.112.63 4444 (20160326 19:43:11.192)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:10.0.112.63:4444; RC=( ${PIPESTATUS[@]} ) (20160326 19:43:11.195)
160326 19:43:12 [Note] WSREP: Provider paused at e8f55b0f-f38a-11e5-845f-f6d5940a87f2:5 (5)
160326 19:43:12 [Note] WSREP: resuming provider at 5
160326 19:43:12 [Note] WSREP: Provider resumed.
160326 19:43:12 [Note] WSREP: 0.0 (opst1): State transfer to 1.0 (opst3) complete.
160326 19:43:12 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 5)
160326 19:...

Revision history for this message
Steven Dake (sdake) wrote :

This was reported in https://bugs.launchpad.net/kolla/+bug/1546789, but that bug turned out to be a different issue. I have also seen this issue occur myself on multinode, but I'm not sure what the root cause is.

Changed in kolla:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → newton-1
Revision history for this message
Hui Kang (huikang27) wrote :

Hi Steven, in your inventory file, how many mariadb nodes are in the cluster? - Hui

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Hi,

The cluster had two mariadb nodes, and that caused the problem.

The problem was resolved by using three nodes.

I guess a check for this should be added to "kolla-ansible prechecks"?
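(Editorial sketch, not from the original thread: Galera keeps its Primary Component only while a strict majority of the cluster is reachable, so a two-node cluster loses quorum the moment either node drops, while three or more nodes tolerate one failure. A hypothetical precheck along the lines suggested above; the function name is an assumption, not kolla code:)

```python
# Sketch of a quorum precheck such as the one suggested for
# "kolla-ansible prechecks". Hypothetical helper, not actual kolla code.

def galera_survives_one_failure(cluster_size: int) -> bool:
    """After one node fails, the survivors must still hold a strict
    majority of the original cluster to retain the Primary Component."""
    return (cluster_size - 1) > cluster_size / 2

for size in (2, 3, 5):
    print(size, galera_survives_one_failure(size))
# Size 2 fails the check: one survivor out of two is not a strict
# majority, which matches the deployments that broke with two
# mariadb nodes and succeeded with three.
```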

Revision history for this message
Steven Dake (sdake) wrote :

Two nodes of what? Were you running two nodes of controller, compute, and storage on the same machine?

More detail is needed to resolve this correctly.

Revision history for this message
Dave McCowan (dave-mccowan) wrote :

I've gotten this same error with an all-in-one inventory, also with a "redeploy", meaning "cleanup-containers" followed by "kolla-ansible deploy". It seemed to start when I moved to using a remote docker registry (added latency?), but that could be a coincidence.

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Hi,

Sorry for the incomplete info. Here is the full story.
I was deploying on a 10-node cluster using Kolla (centos, source). The original setup had 3 control & network nodes (N1-N3) and 10 compute & storage nodes (N1-N10). Then N2 went down with a hardware problem.

I updated Kolla and redeployed using two control & network nodes and 9 compute & storage nodes. That is when I hit the problem described above.

I thought it was caused by the update, but later, when N2 was back up and I returned to the 3 controller/network setup, the deployment succeeded.

Before each deploy I properly clean all nodes.

I'm still having problems getting OVS to access the external (internet) network, but that is a separate issue.

Revision history for this message
Steven Dake (sdake) wrote :

Thanks for the feedback. If you have trouble with OVS and think it's a kolla bug, please file a bug with whatever information you can provide. FWIW, OVS works like a champ for me.

Regards
-steve

Changed in kolla:
importance: Critical → High
Revision history for this message
Steven Dake (sdake) wrote :

This is the play:

- name: Waiting for MariaDB service to be ready
  command: "docker exec mariadb ls /var/lib/mysql/mariadb.pid"
  register: bootstrap_result
  when: delegate_host != 'None' or
        ( groups['mariadb'] | length ) == 1 or
        ( delegate_host == 'None' and inventory_hostname != groups['mariadb'][0] )
  until: bootstrap_result|success
  changed_when: False
  retries: 6
  delay: 10

I don't know whether that condition is correct; whatever the case, it looks flaky.

Another option for waiting for the MariaDB service to be ready is to use wait_for.

For example:
local_action: wait_for port="{{mariadb_port}}" host="{{notsurewhattoputhere}}" search_regex=MariaDB timeout=60
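(A rough sketch of what such a wait_for task could look like, in the style of the play quoted above; `api_interface_address` is an assumed variable name, and the search_regex relies on the MariaDB server greeting containing the string "MariaDB" — this is not the exact task that was eventually committed:)

```yaml
- name: Waiting for MariaDB service to be ready
  local_action: >
    wait_for port={{ mariadb_port }}
    host={{ api_interface_address }}
    search_regex=MariaDB
    timeout=60
```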

Changed in kolla:
assignee: nobody → Steven Dake (sdake)
Steven Dake (sdake)
summary: - mariadb.pid causes mariadb container from starting
+ mariadb.pid checking after mariadb container starts fails deployment
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/301772

Changed in kolla:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/301772
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=ba4bcdbbe84a42e0346c0e36abf0d1cbc5007444
Submitter: Jenkins
Branch: master

commit ba4bcdbbe84a42e0346c0e36abf0d1cbc5007444
Author: Steven Dake <email address hidden>
Date: Tue Apr 5 11:32:20 2016 -0400

    Change waiting on mariadb.pid

    Previously the code looked at mariadb.pid, but this seemed
    flakey in function. It seemed racey and prone to failure on
    slower connections to a registry. The original task was
    extremely complex and it didnt really verify that the MariaDB
    was ready to serve connections. Use wait_for with a regex
    instead.

    Change-Id: I3aafac04f03639b08e0ef4d6a9c9e1a4499f000c
    Closes-Bug: #1564278

Changed in kolla:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/301974

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (stable/mitaka)

Reviewed: https://review.openstack.org/301974
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=967d845bdfabfe1655486d4a60b34af3a1efeacb
Submitter: Jenkins
Branch: stable/mitaka

commit 967d845bdfabfe1655486d4a60b34af3a1efeacb
Author: Steven Dake <email address hidden>
Date: Tue Apr 5 11:32:20 2016 -0400

    Change waiting on mariadb.pid

    Previously the code looked at mariadb.pid, but this seemed
    flakey in function. It seemed racey and prone to failure on
    slower connections to a registry. The original task was
    extremely complex and it didnt really verify that the MariaDB
    was ready to serve connections. Use wait_for with a regex
    instead.

    Change-Id: I3aafac04f03639b08e0ef4d6a9c9e1a4499f000c
    Closes-Bug: #1564278
    (cherry picked from commit ba4bcdbbe84a42e0346c0e36abf0d1cbc5007444)

Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Fix included in openstack/kolla 2.0.0.0rc4

This issue was fixed in the openstack/kolla 2.0.0.0rc4 release candidate.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 2.0.0

This issue was fixed in the openstack/kolla 2.0.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 1.1.0

This issue was fixed in the openstack/kolla 1.1.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 3.0.0.0b1

This issue was fixed in the openstack/kolla 3.0.0.0b1 development milestone.
