Multinode job failed to start etcd

Bug #1720240 reported by hongbin
This bug affects 1 person
Affects           Status   Importance  Assigned to  Milestone
Zun               Triaged  Wishlist    Unassigned
kuryr-libnetwork  New      Undecided   Unassigned

Bug Description

Description
===========
The Zun multinode job has been broken since September 28. One of the failed jobs:

  http://logs.openstack.org/19/508119/1/check/gate-tempest-dsvm-zun-multinode-docker-sql-ubuntu-xenial-nv/95f4665/

Errors
======

In /console.html
----------------
...
2017-09-28 12:01:00.234797 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L773: /tmp/ansible/bin/ansible subnodes -f 5 -i /home/jenkins/workspace/gate-tempest-dsvm-zun-multinode-docker-sql-ubuntu-xenial-nv/inventory -m shell -a 'cd '\''/opt/stack/new/devstack'\'' && sudo -H -u stack DSTOOLS_VERSION=0.4.0 stdbuf -oL -eL ./stack.sh 2>&1 executable=/bin/bash'
2017-09-28 12:08:30.221956 | ERROR: the main setup script run by this job failed - exit code: 2

In /subnode-2/devstacklog.txt.gz
--------------------------------
...
2017-09-28 12:08:28.707 | + lib/etcd3:start_etcd3:64 : sudo systemctl start <email address hidden>
2017-09-28 12:08:28.758 | Job for <email address hidden> failed because the control process exited with error code. See "systemctl status <email address hidden>" and "journalctl -xe" for details.

In /subnode-2/screen-etcd.txt.gz
--------------------------------
Sep 28 12:08:28.735930 ubuntu-xenial-2-node-rax-dfw-11186920-929847 systemd[1]: Starting Devstack <email address hidden>...
Sep 28 12:08:28.753698 ubuntu-xenial-2-node-rax-dfw-11186920-929847 systemd[1]: <email address hidden>: Main process exited, code=exited, status=1/FAILURE
...

hongbin (hongbin034)
Changed in zun:
status: New → Triaged
importance: Undecided → Critical
Dr. Jens Harbott (j-harbott) wrote :

I think you are correct in locating the patch that triggers this, but I wonder whether the original devstack setup makes sense. IIUC it starts two independent etcd processes, one on each node, and then directs the clients on each node to their local etcd, so the expected coordination between the nodes won't happen. Unless we want to create a real etcd cluster, I think we should simply stop starting etcd on subnodes in order to fix this bug.

Dr. Jens Harbott (j-harbott) wrote :

etcd doesn't run on the subnode by default; it is enabled by kuryr-libnetwork here:

http://logs.openstack.org/88/508088/3/check/gate-tempest-dsvm-zun-multinode-docker-sql-ubuntu-xenial-nv/bbd40be/logs/subnode-2/devstacklog.txt.gz#_2017-09-28_09_25_20_888

++ /opt/stack/new/kuryr-libnetwork/devstack/settings:source:29 : enable_service kuryr-libnetwork etcd3 docker-engine

So IMO you should fix kuryr-libnetwork to be multinode compatible and things will be fine.

hongbin (hongbin034) wrote :

Yes, it seems the subnode's etcd is not needed because kuryr-libnetwork is using the etcd in SERVICE_HOST.
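
As a rough illustration of what "multinode compatible" could mean here, the sketch below only includes etcd3 in the service list when the node is the controller itself. This is an assumption about one possible fix, not the change that was actually merged: services_for_node is a hypothetical helper, and the comparison relies on devstack's HOST_IP and SERVICE_HOST variables being set the usual way on each node.

```shell
# Hypothetical helper, not part of devstack or kuryr-libnetwork.
# $1: this node's IP (what devstack calls HOST_IP)
# $2: the controller's IP (what devstack calls SERVICE_HOST)
# Prints the services this node should enable.
services_for_node() {
    if [ "$1" = "$2" ]; then
        # Controller node: run the single shared etcd here.
        echo "kuryr-libnetwork etcd3 docker-engine"
    else
        # Subnode: skip etcd3 and rely on the etcd on SERVICE_HOST.
        echo "kuryr-libnetwork docker-engine"
    fi
}

# Possible use in a plugin settings file (assumed, untested in the gate):
#   enable_service $(services_for_node "$HOST_IP" "$SERVICE_HOST")
```

With a guard like this, the subnode would never try to start its own etcd, which sidesteps the start failure seen in subnode-2 without building a real etcd cluster.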

hongbin (hongbin034)
Changed in zun:
importance: Critical → Wishlist