Fuel 7.0 fails to deploy

Bug #1515858 reported by Ed Balduf
This bug affects 3 people

Affects: Fuel for OpenStack
Status: Invalid
Importance: Undecided
Assigned to: Fuel Library (Deprecated)
Milestone: 7.0-updates

Bug Description

Fuel 7.0 GA fails to deploy a multi-node (3-node) environment with KVM and Neutron with tunneling segmentation. I have repeated this in several environments (VMs), and now have yet another failure, this time on stable physical hardware, so it's time to report it. I have found that the issue can be avoided by deploying a single controller node first and then adding one controller per subsequent deployment until you have the three you want, but that is not the way it is supposed to work.

I guess I'll start with the error message in the UI:

Error
Deployment has failed. Method granular_deploy. Deployment failed on nodes 5.
Inspect Astute logs for the details

The Astute log is attached to this bug report, but it always comes down to a trace similar to the one below (I say similar because sometimes it is a different service's puppet deployment that fails: keystone, network controller, glance, etc., but the pattern is always the same), which shows a timeout of the puppet deployment. I have spent hours trying to figure out why the deployment has failed, and the closest I can come is that haproxy is having some issues and preventing Keystone from responding to requests for endpoint information, but maybe I don't have that correct either.

2015-11-13T01:22:23 debug: [485] Node 5 has failed to deploy. There is no more retries for puppet run.
2015-11-13T01:22:23 debug: [485] {"nodes"=>[{"status"=>"error", "error_type"=>"deploy", "uid"=>"5", "role"=>"primary-controller"}]}
2015-11-13T01:22:23 err: [485] Task '{"priority"=>5700, "type"=>"puppet", "uids"=>["5"], "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/openstack-network/openstack-network-controller.pp", "timeout"=>3600, "cwd"=>"/"}}' failed on node 5
2015-11-13T01:22:23 err: [485] No more tasks will be executed on the node 5
2015-11-13T01:22:23 debug: [485] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"5", "status"=>"error", "error_type"=>"deploy", "role"=>"primary-controller", "task"=>{"priority"=>5700, "type"=>"puppet", "uids"=>["5"], "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/openstack-network/openstack-network-controller.pp", "timeout"=>3600, "cwd"=>"/"}}}]}
2015-11-13T01:22:23 debug: [485] Data send by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"5", "status"=>"error", "error_type"=>"deploy", "role"=>"primary-controller", "task"=>{"priority"=>5700, "type"=>"puppet", "uids"=>["5"], "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/openstack-network/openstack-network-controller.pp", "timeout"=>3600, "cwd"=>"/"}}}]}
2015-11-13T01:22:23 info: [485] f7a6b2ae-5fca-4916-a6a4-a04271dd885e: Finished deployment of nodes => roles: {"5"=>"primary-controller"}
2015-11-13T01:22:23 err: [485] Error running RPC method granular_deploy: Deployment failed on nodes 5, trace:
["/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:68:in `block (3 levels) in deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:51:in `each_slice'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:51:in `block (2 levels) in deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:49:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:49:in `block in deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:45:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/deployment_engine.rb:45:in `deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/orchestrator.rb:179:in `deploy_cluster'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/orchestrator.rb:45:in `granular_deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/dispatcher.rb:91:in `granular_deploy'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/server.rb:146:in `dispatch_message'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/server.rb:107:in `block in dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/task_queue.rb:64:in `call'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/task_queue.rb:64:in `block in each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/task_queue.rb:56:in `each'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/server.rb:105:in `each_with_index'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/server.rb:105:in `dispatch'",
 "/usr/lib64/ruby/gems/2.1.0/gems/astute-7.0.0/lib/astute/server/server.rb:89:in `block in perform_main_job'"]

2015-11-13T01:22:23 info: [485] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"f7a6b2ae-5fca-4916-a6a4-a04271dd885e", "nodes"=>[{"uid"=>"5", "status"=>"error", "error_type"=>"deploy", "role"=>"primary-controller", "task"=>{"priority"=>5700, "type"=>"puppet", "uids"=>["5"], "parameters"=>{"puppet_modules"=>"/etc/puppet/modules", "puppet_manifest"=>"/etc/puppet/modules/osnailyfacter/modular/openstack-network/openstack-network-controller.pp", "timeout"=>3600, "cwd"=>"/"}}}]}}
2015-11-13T01:22:23 info: [485] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"f7a6b2ae-5fca-4916-a6a4-a04271dd885e", "status"=>"error", "error"=>"Method granular_deploy. Deployment failed on nodes 5.\nInspect Astute logs for the details"}}
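
For reference, a quick way to check the haproxy/Keystone suspicion above from the controller might look like the following sketch; the management VIP placeholder and the Keystone port 5000 are assumptions based on a default layout, not something confirmed in this environment:

# does Keystone answer through the haproxy frontend on the public API port?
curl -s -o /dev/null -w '%{http_code}\n' http://<management-vip>:5000/v2.0/

# is the haproxy clone (clone_p_haproxy) reported as running by Pacemaker?
crm resource status clone_p_haproxy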

Tags: area-library
Alex Schultz (alex-schultz) wrote :

Ed,

Please provide a fuel snapshot. Also, you may want to look in the puppet.log for that node for additional information as to why it failed.
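
(For reference, a minimal sketch of gathering both, assuming the Fuel 7.0 CLI snapshot command is available; the puppet log path is an assumption and may differ by release:)

# on the Fuel master: generate a diagnostic snapshot to attach to this bug
fuel snapshot

# on the failing node (node-5): inspect the puppet run log directly
less /var/log/puppet.log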

Changed in fuel:
status: New → Incomplete
tags: added: area-library
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Ed Balduf (ebalduf) wrote :

I have uploaded the puppet log from the failing node. I have looked at several of these (from trying this many times), and they all seem to start with MySQL issues, as shown here:

2015-11-12 22:51:40 +0000 Puppet (err): Execution of '/usr/bin/mysqladmin --defaults-extra-file=/root/.my.cnf flush-privileges' returned 1: /usr/bin/mysqladmin: connect to server at '172.27.34.2' failed
2015-11-12 22:51:40 +0000 /Stage[main]/Nova::Db::Mysql/Openstacklib::Db::Mysql[nova]/Openstacklib::Db::Mysql::Host_access[nova_localhost]/Database_grant[nova@localhost/nova]/privileges (err): change from to all failed: Execution of '/usr/bin/mysqladmin --defaults-extra-file=/root/.my.cnf flush-privileges' returned 1: /usr/bin/mysqladmin: connect to server at '172.27.34.2' failed
2015-11-12 23:00:13 +0000 Puppet (err): Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf mysql -Be describe db' returned 1: ERROR 2003 (HY000): Can't connect to MySQL server on '172.27.34.2' (110)
2015-11-12 23:00:13 +0000 /Stage[main]/Heat::Db::Mysql/Openstacklib::Db::Mysql[heat]/Openstacklib::Db::Mysql::Host_access[heat_localhost]/Database_grant[heat@localhost/heat]/privileges (err): change from to all failed: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf mysql -Be describe db' returned 1: ERROR 2003 (HY000): Can't connect to MySQL server on '172.27.34.2' (110)
2015-11-12 23:02:52 +0000 Puppet (err): Execution of '/usr/bin/mysqladmin --defaults-extra-file=/root/.my.cnf flush-privileges' returned 1: /usr/bin/mysqladmin: connect to server at '172.27.34.2' failed
2015-11-12 23:02:52 +0000 /Stage[main]/Heat::Db::Mysql/Openstacklib::Db::Mysql[heat]/Openstacklib::Db::Mysql::Host_access[heat_localhost]/Database_grant[heat@localhost/heat]/privileges (err): change from to all failed: Execution of '/usr/bin/mysqladmin --defaults-extra-file=/root/.my.cnf flush-privileges' returned 1: /usr/bin/mysqladmin: connect to server at '172.27.34.2' failed
2015-11-12 23:10:04 +0000 Puppet (err): Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf mysql -e create user 'neutron'@'localhost' identified by PASSWORD '*4354BA07EAF949F028416EA248FD0E3F61D115F5'' returned 1: ERROR 2003 (HY000): Can't connect to MySQL server on '172.27.34.2' (110)
2015-11-12 23:10:04 +0000 /Stage[main]/Neutron::Db::Mysql/Openstacklib::Db::Mysql[neutron]/Openstacklib::Db::Mysql::Host_access[neutron_localhost]/Database_user[neutron@localhost]/ensure (err): change from absent to present failed: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf mysql -e create user 'neutron'@'localhost' identified by PASSWORD '*4354BA07EAF949F028416EA248FD0E3F61D115F5'' returned 1: ERROR 2003 (HY000): Can't connect to MySQL server on '172.27.34.2' (110)
2015-11-12 23:12:13 +0000 Puppet (err): Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf mysql -e INSERT INTO db (host, user, db) VALUES ('%', 'neutron', 'neutron')' returned 1: ERROR 2003 (HY000): Can't connect to MySQL server on '172.27.34.2' (110)
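
To separate a MySQL-side failure from plain network unreachability, the failing client calls can be re-run by hand on the node; the credentials file and the 172.27.34.2 VIP are taken from the log lines above, while port 3306 is an assumed default:

# the same kind of call puppet was making, with the same credentials file
mysqladmin --defaults-extra-file=/root/.my.cnf status

# raw TCP check against the MySQL VIP that the errors point at
nc -zv -w 5 172.27.34.2 3306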

Ed Balduf (ebalduf) wrote :

I have tried to upload the Fuel snapshot to Launchpad several times and it keeps throwing an error, so I have placed the file on Dropbox and shared it so anyone can get it with this link: https://www.dropbox.com/s/xqwwtern6igit6w/fuel-snapshot-2015-11-13_15-25-21.tar.xz?dl=0

Alex Schultz (alex-schultz) wrote :

Ed, actually it looks more like a connectivity issue. I looked at the puppet.log and found errors around not being able to communicate with the Keystone/Neutron endpoints. The MySQL errors would also be consistent with the node not being able to talk to the VIPs for these services. If you check the status of the Pacemaker-managed services on the controllers (via "crm status"), is everything reported as started? Specifically, look for the items with vip__ in the name.
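
(A minimal sketch of that check, plus a reachability test against the VIP seen in the puppet.log; the haproxy stats socket path is an assumption and may need adjusting:)

# list the Pacemaker-managed VIP resources and their state
crm status | grep vip__

# can the node reach the management VIP the MySQL errors point at?
ping -c 3 172.27.34.2

# optional: ask haproxy which backends it considers up
# (the stats socket path is an assumption; check the local haproxy.cfg)
echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats stdio | cut -d, -f1,2,18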

Ed Balduf (ebalduf) wrote :

I suspected connectivity issues the first time I saw this (two months ago) and didn't report it. I have tried this in two different VMware environments and now in this physical environment, and they all behave the same. The interesting part is that the VMware environments would work fine if you deploy a single controller node first and then add the two additional ones; I even did some failover testing once all three were up. I believe this to be some kind of timing issue in the deployment procedure, but I also don't want to influence the witness. Let me know what else you want to look at or want me to do.

Here is the output from the Pacemaker status:

root@node-5:~# crm
crm(live)# status
Last updated: Fri Nov 13 17:24:33 2015
Last change: Fri Nov 13 00:49:41 2015
Stack: corosync
Current DC: node-5.pm.solidfire.net (5) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
15 Resources configured

Online: [ node-5.pm.solidfire.net ]

 Clone Set: clone_p_vrouter [p_vrouter]
     Started: [ node-5.pm.solidfire.net ]
 vip__management (ocf::fuel:ns_IPaddr2): Started node-5.pm.solidfire.net
 vip__vrouter_pub (ocf::fuel:ns_IPaddr2): Started node-5.pm.solidfire.net
 vip__vrouter (ocf::fuel:ns_IPaddr2): Started node-5.pm.solidfire.net
 vip__public (ocf::fuel:ns_IPaddr2): Started node-5.pm.solidfire.net
 Master/Slave Set: master_p_conntrackd [p_conntrackd]
     Masters: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-5.pm.solidfire.net ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_dns [p_dns]
     Started: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_neutron-dhcp-agent [p_neutron-dhcp-agent]
     Started: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-5.pm.solidfire.net ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-5.pm.solidfire.net ]

Alex Schultz (alex-schultz) wrote :

Are these nodes on ESX hosts? I think I've seen issues with ARP, so that may be why they cannot talk to each other; someone else might be able to comment better on that. The corosync status looks OK, so it would seem to be a network interaction issue between the hosts.
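
(If ARP were the culprit, a rough check from one of the controllers might look like this; br-mgmt as the management bridge is an assumption based on a default Fuel layout:)

# what does the neighbour cache say about the management/MySQL VIP?
ip neigh show to 172.27.34.2

# probe the VIP over the management bridge (interface name is an assumption)
arping -c 3 -I br-mgmt 172.27.34.2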

Ed Balduf (ebalduf) wrote :

Nope, these nodes are not on VMware. The two previous environments were, but these are physical Dell R620s connected via redundant Dell 4810 10 Gbps Ethernet switches and redundant 1 Gbps management network switches.

Ed Balduf (ebalduf) wrote :

I had to move on with this environment, so I deployed again, this time starting with a single controller and a compute node (which worked), and then added a second and third controller once that first single-controller deployment had succeeded. All worked. I believe there is some kind of timing problem with deploying multiple controllers at once.

Bogdan Dobrelya (bogdando) wrote :

Closing, as this issue has remained Incomplete for more than a month.

Changed in fuel:
milestone: none → 7.0-updates
status: Incomplete → Invalid