kuryr-libnetwork

Multinode node job failed to start etcd

Bug #1720240 reported by hongbin on 2017-09-28

Affects	Status	Importance	Assigned to	Milestone
Zun	Triaged	Wishlist	Unassigned
kuryr-libnetwork	New	Undecided	Unassigned

Bug Description

Description
===========
The Zun multinode job has been broken since September 28. One of the failed jobs:

http://logs.openstack.org/19/508119/1/check/gate-tempest-dsvm-zun-multinode-docker-sql-ubuntu-xenial-nv/95f4665/

Errors
======

In /console.html
----------------
...
2017-09-28 12:01:00.234797 | + /opt/stack/new/devstack-gate/devstack-vm-gate.sh:main:L773: /tmp/ansible/bin/ansible subnodes -f 5 -i /home/jenkins/workspace/gate-tempest-dsvm-zun-multinode-docker-sql-ubuntu-xenial-nv/inventory -m shell -a 'cd '\''/opt/stack/new/devstack'\'' && sudo -H -u stack DSTOOLS_VERSION=0.4.0 stdbuf -oL -eL ./stack.sh 2>&1 executable=/bin/bash'
2017-09-28 12:08:30.221956 | ERROR: the main setup script run by this job failed - exit code: 2

In /subnode-2/devstacklog.txt.gz
--------------------------------
...
2017-09-28 12:08:28.707 | + lib/etcd3:start_etcd3:64 : sudo systemctl start <email address hidden>
2017-09-28 12:08:28.758 | Job for <email address hidden> failed because the control process exited with error code. See "systemctl status <email address hidden>" and "journalctl -xe" for details.

In /subnode-2/screen-etcd.txt.gz
--------------------------------
Sep 28 12:08:28.735930 ubuntu-xenial-2-node-rax-dfw-11186920-929847 systemd[1]: Starting Devstack <email address hidden>...
Sep 28 12:08:28.753698 ubuntu-xenial-2-node-rax-dfw-11186920-929847 systemd[1]: <email address hidden>: Main process exited, code=exited, status=1/FAILURE
...

hongbin (hongbin034) on 2017-09-28

Changed in zun:
status:	New → Triaged
importance:	Undecided → Critical

Dr. Jens Harbott (j-harbott) wrote on 2017-09-29:

I think you are correct in locating the patch that triggers this, but I wonder whether the original devstack setup makes sense. IIUC it starts two independent etcd processes, one on each node, and then directs the clients on each node to their local etcd. So the expected coordination between the nodes won't happen. Unless we want to create a real etcd cluster, I think it would be better to just stop starting etcd on subnodes in order to fix this bug.

Dr. Jens Harbott (j-harbott) wrote on 2017-09-29:

etcd doesn't run on the subnode by default; it is enabled by kuryr-libnetwork here:

http://logs.openstack.org/88/508088/3/check/gate-tempest-dsvm-zun-multinode-docker-sql-ubuntu-xenial-nv/bbd40be/logs/subnode-2/devstacklog.txt.gz#_2017-09-28_09_25_20_888

++ /opt/stack/new/kuryr-libnetwork/devstack/settings:source:29 : enable_service kuryr-libnetwork etcd3 docker-engine

So IMO you should fix kuryr-libnetwork to be multinode compatible and things will be fine.

hongbin (hongbin034) wrote on 2017-09-29:

Yes, it seems the subnode's etcd is not needed because kuryr-libnetwork is using the etcd in SERVICE_HOST.
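
A rough sketch of what "don't start etcd on subnodes" could look like in a devstack settings file. This is not the actual kuryr-libnetwork change: the helper name services_for_node is hypothetical, and treating a node as the controller when devstack's HOST_IP equals SERVICE_HOST is an assumption made only for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the services to enable on a node.
# Assumption: a node is the controller when its own address (HOST_IP)
# matches SERVICE_HOST; every other node is treated as a subnode.
services_for_node() {
    local host_ip="$1"
    local service_host="$2"
    # Services wanted on every node.
    local services="kuryr-libnetwork docker-engine"

    # Only the controller runs etcd; subnodes are expected to use
    # the etcd instance on SERVICE_HOST instead of a local one.
    if [[ "$host_ip" == "$service_host" ]]; then
        services="$services etcd3"
    fi

    echo "$services"
}
```

In a settings file this might be used as `enable_service $(services_for_node "$HOST_IP" "$SERVICE_HOST")`, so that etcd3 is only requested on the node that clients are already pointed at.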

hongbin (hongbin034) on 2018-02-11

Changed in zun:
importance:	Critical → Wishlist
