etcd sometimes fails to start.

Bug #1694499 reported by Clark Boylan
This bug affects 4 people
Affects            Status        Importance  Assigned to   Milestone
DragonFlow         Fix Released  Undecided   Unassigned
devstack           Invalid       Undecided   Unassigned
networking-calico  Fix Released  Undecided   Neil Jerram

Bug Description

We occasionally see that etcd fails to start. Here is a truncated view of what that looks like in the devstack log:

2017-05-24 08:33:14.468 | + functions-common:write_user_unit_file:1442 : local <email address hidden>
2017-05-24 08:33:14.469 | + functions-common:write_user_unit_file:1443 : mkdir -p /etc/systemd/system
2017-05-24 08:33:14.473 | + functions-common:write_user_unit_file:1445 : iniset -sudo /<email address hidden> Unit Description 'Devstack <email address hidden>'
2017-05-24 08:33:14.499 | + functions-common:write_user_unit_file:1446 : iniset -sudo /<email address hidden> Service User root
2017-05-24 08:33:14.524 | + functions-common:write_user_unit_file:1447 : iniset -sudo /<email address hidden> Service ExecStart '/opt/stack/new/bin/etcd --name ubuntu-xenial-2-node-osic-cloud1-s3500-8961986-591152 --data-dir /opt/stack/new/data/etcd --initial-cluster-state new --initial-cluster-token etcd-cluster-01 --initial-cluster ubuntu-xenial-2-node-osic-cloud1-s3500-8961986-591152=http://10.38.227.149:2380 --initial-advertise-peer-urls http://10.38.227.149:2380 --advertise-client-urls http://10.38.227.149:2379 --listen-peer-urls http://0.0.0.0:2380 --listen-client-urls http://10.38.227.149:2379'
2017-05-24 08:33:14.544 | + functions-common:write_user_unit_file:1448 : [[ -n '' ]]
2017-05-24 08:33:14.546 | + functions-common:write_user_unit_file:1451 : iniset -sudo /<email address hidden> Install WantedBy multi-user.target
2017-05-24 08:33:14.570 | + functions-common:write_user_unit_file:1454 : sudo systemctl daemon-reload
2017-05-24 08:33:14.613 | + lib/etcd3:start_etcd3:53 : iniset -sudo /<email address hidden> Unit After network.target
2017-05-24 08:33:14.634 | + lib/etcd3:start_etcd3:54 : iniset -sudo /<email address hidden> Service Type notify
2017-05-24 08:33:14.653 | + lib/etcd3:start_etcd3:55 : iniset -sudo /<email address hidden> Service Restart on-failure
2017-05-24 08:33:14.675 | + lib/etcd3:start_etcd3:56 : iniset -sudo /<email address hidden> Service LimitNOFILE 65536
2017-05-24 08:33:14.697 | + lib/etcd3:start_etcd3:58 : sudo systemctl daemon-reload
2017-05-24 08:33:14.739 | + lib/etcd3:start_etcd3:59 : sudo systemctl enable <email address hidden>
2017-05-24 08:33:14.744 | Created symlink from /<email address hidden> to /<email address hidden>.
2017-05-24 08:33:14.777 | + lib/etcd3:start_etcd3:60 : sudo systemctl start <email address hidden>
2017-05-24 08:33:14.819 | Job for <email address hidden> failed because the control process exited with error code. See "systemctl status <email address hidden>" and "journalctl -xe" for details.
2017-05-24 08:33:14.821 | + lib/etcd3:start_etcd3:1 : exit_trap
2017-05-24 08:33:14.822 | + ./stack.sh:exit_trap:492 : local r=1
2017-05-24 08:33:14.824 | ++ ./stack.sh:exit_trap:493 : jobs -p
2017-05-24 08:33:14.826 | + ./stack.sh:exit_trap:493 : jobs=
2017-05-24 08:33:14.827 | + ./stack.sh:exit_trap:496 : [[ -n '' ]]
2017-05-24 08:33:14.828 | + ./stack.sh:exit_trap:502 : kill_spinner
2017-05-24 08:33:14.830 | + ./stack.sh:kill_spinner:388 : '[' '!' -z '' ']'
2017-05-24 08:33:14.831 | + ./stack.sh:exit_trap:504 : [[ 1 -ne 0 ]]
2017-05-24 08:33:14.832 | + ./stack.sh:exit_trap:505 : echo 'Error on exit'
2017-05-24 08:33:14.833 | Error on exit

For some reason this appears to affect the calico and dragonflow jobs more than any others, but we also ran into it in a normal neutron multinode job.

Omer Anson (omer-anson) wrote :

Would it be possible to add the output of the following when this happens:
systemctl status <email address hidden>
journalctl -u <email address hidden>
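
For anyone reproducing this, the two commands above could be wrapped so their output is saved to files when the unit fails. This is only a sketch: the function name `collect_unit_diagnostics` and the output file names are made up for illustration, and the unit name is taken as an argument because it is obscured in this report.

```shell
#!/usr/bin/env bash
# Sketch: capture systemd diagnostics for a unit that failed to start.
# collect_unit_diagnostics is a hypothetical helper, not part of devstack.
collect_unit_diagnostics() {
    local unit="$1"
    local outdir="${2:-.}"
    mkdir -p "$outdir"
    # --no-pager stops the commands from waiting on an interactive pager.
    # "|| true" is needed because systemctl status exits non-zero for a
    # failed unit, which would otherwise abort a script running under set -e.
    systemctl status --no-pager "$unit" > "$outdir/status.txt" 2>&1 || true
    journalctl --no-pager -u "$unit" > "$outdir/journal.txt" 2>&1 || true
}
```

Calling `collect_unit_diagnostics <unit-name> /tmp/etcd-debug` right after the failing `systemctl start` would leave both outputs in that directory for attaching to the bug.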

Omar Sanhaji (sanhaji-omar) wrote :

I had the same problem, but in my case it wasn't a bug in etcd.
It came from a mismatch between my local.conf, /etc/hosts and /etc/hostname:
the HOST_IP (and all the other IPs) in my local.conf didn't match the entries in /etc/hosts, so etcd couldn't start the service.
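
A quick way to check for this kind of mismatch is to confirm that the hosts file maps the machine's hostname to the address used as HOST_IP. The sketch below is an illustration only: `hosts_entry_matches` is a hypothetical helper, and it assumes the hostname is expected to resolve to HOST_IP, which may not hold for every setup.

```shell
# Sketch: succeed only if the hosts file maps the given name to the given IP.
# hosts_entry_matches is a hypothetical helper, not part of devstack.
hosts_entry_matches() {
    local expected_ip="$1"
    local name="$2"
    local hosts_file="${3:-/etc/hosts}"
    awk -v ip="$expected_ip" -v name="$name" '
        { sub(/#.*/, "") }                      # drop trailing comments
        $1 == ip {
            for (i = 2; i <= NF; i++)           # scan the names on this line
                if ($i == name) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$hosts_file"
}
```

For example, `hosts_entry_matches "$HOST_IP" "$(hostname)" || echo "HOST_IP and /etc/hosts disagree"` would flag the situation described above, with HOST_IP set to the value from local.conf.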

Neil Jerram (neil-jerram) wrote :

Thanks Clark for reporting this, and Omer and Omar for commenting. I'm sorry that it has taken a while for me to have a proper look.

I think the problem here - for networking-calico at least - is that networking-calico's devstack plugin also installs and starts etcd (which it has done for a long time). It clearly won't work for both the networking-calico plugin and top-level devstack to install and start etcd services.

I think the simplest immediate fix will be for me to add 'disable_service etcd3' in the networking-calico plugin's settings. I could also look at simplifying the plugin to use the base etcd instead, but that is not as straightforward an option as it might sound, because one of networking-calico's goals is that its master code (including the devstack plugin) should work with past OpenStack releases as well as with current master.
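
As a rough illustration of that immediate fix, the plugin's settings file could carry something like the fragment below. This is a sketch, not the merged change: the file location (assumed here to be the plugin's devstack/settings) and the guard around the call are assumptions added for illustration.

```shell
# Sketch of a fragment for the plugin's settings file (location assumed).
# disable_service comes from DevStack's functions-common; the guard only
# keeps this fragment harmless if it is sourced outside DevStack.
if command -v disable_service >/dev/null 2>&1; then
    # Stop top-level DevStack from installing and starting its own etcd,
    # so the plugin's etcd is the only one on the node.
    disable_service etcd3
fi
```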

I hope that all makes sense. If you have further thoughts, I would be very happy to hear them.

Changed in networking-calico:
assignee: nobody → Neil Jerram (neil-jerram)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-calico (master)

Reviewed: https://review.openstack.org/476445
Committed: https://git.openstack.org/cgit/openstack/networking-calico/commit/?id=e794848060e7ab3edf320b1847151de4eb6af142
Submitter: Jenkins
Branch: master

commit e794848060e7ab3edf320b1847151de4eb6af142
Author: Neil Jerram <email address hidden>
Date: Thu Jun 22 10:51:33 2017 +0100

    Fix networking-calico CI (against master OpenStack)

    - Mock 'network' object now needs a 'non_local_subnets' attr.

    - Need to exclude Babel 2.4.0 (same as OpenStack global requirements now
      do; presumably there is something wrong with that version).

    - DevStack plugin: disable the base etcd3 service

    - Avoid calling db.get_security_group_rules with an empty set of
      security group IDs, as we then see this error:

      2017-06-22 09:53:14.328 16834 ERROR oslo_db.sqlalchemy.exc_filters
      [req-574e3215-ed4e-45e6-baae-7cf1f0c09a26 - -]
      DBAPIError exception wrapped from (pymysql.err.ProgrammingError)
      (1064, u"You have an error in your SQL syntax; check the manual that
      corresponds to your MySQL server version for the right syntax to use
      near '))' at line 3") [SQL: u'SELECT securitygrouprules.project_id AS
      securitygrouprules_project_id, securitygrouprules.id AS
      securitygrouprules_id, securitygrouprules.security_group_id AS
      securitygrouprules_security_group_id,
      securitygrouprules.remote_group_id AS
      securitygrouprules_remote_group_id, securitygrouprules.direction AS
      securitygrouprules_direction, securitygrouprules.ethertype AS
      securitygrouprules_ethertype, securitygrouprules.protocol AS
      securitygrouprules_protocol, securitygrouprules.port_range_min AS
      securitygrouprules_port_range_min, securitygrouprules.port_range_max
      AS securitygrouprules_port_range_max,
      securitygrouprules.remote_ip_prefix AS
      securitygrouprules_remote_ip_prefix,
      securitygrouprules.standard_attr_id AS
      securitygrouprules_standard_attr_id, standardattributes_1.id AS
      standardattributes_1_id, standardattributes_1.resource_type AS
      standardattributes_1_resource_type, standardattributes_1.description
      AS standardattributes_1_description,
      standardattributes_1.revision_number AS
      standardattributes_1_revision_number, standardattributes_1.created_at
      AS standardattributes_1_created_at, standardattributes_1.updated_at AS
      standardattributes_1_updated_at \nFROM securitygrouprules LEFT OUTER
      JOIN standardattributes AS standardattributes_1 ON
      standardattributes_1.id = securitygrouprules.standard_attr_id \nWHERE
      securitygrouprules.security_group_id IN (%(security_group_id_1)s)']
      [parameters: {u'security_group_id_1': set([])}]:
      ProgrammingError: (1064, u"You have an error in your SQL syntax; check
      the manual that corresponds to your MySQL server version for the right
      syntax to use near '))' at line 3")

    - Ensure that we pass a list, not a set, as the filter for
      get_security_group_rules.

    - Add CI test and gate hooks so we can see the Felix logs.

    - Don't run the test_hotplug_nic test, because it's hitting a...


Changed in networking-calico:
status: In Progress → Fix Released
Omer Anson (omer-anson) wrote :

Following Neil's comment, Dragonflow also removed its custom etcd installation and now relies fully on OpenStack's base service. Done in patch https://review.openstack.org/#/c/476401/ .

Closing the Dragonflow bug.

Changed in dragonflow:
status: New → Fix Released
Dr. Jens Harbott (j-harbott) wrote :

If there are still issues with devstack, please reopen or create a new bug.

Changed in devstack:
status: New → Invalid