Deployment fails after environment reset

Bug #1536167 reported by Fabrizio Soppelsa
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fabrizio Soppelsa

Bug Description

Fuel 7.0, VLAN segmentation, with Ceph and RadosGW.

A) Reset env + Deploy = Error
B) Delete env + Re-create same env + Deploy = Success

Possibly very close to https://bugs.launchpad.net/fuel/+bug/1529870, but not exactly the same Astute error.
These are some log excerpts from case A):

node-1.domain.tld/neutron-openvswitch-agent.log:2016-01-09T20:15:12.451205+00:00 err: 2016-01-09 20:15:12.449 4584 ERROR neutron.agent.ovsdb.impl_vsctl [req-31db460f-e1bd-40a6-81a0-d7f81c0e66c3 ] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=type', 'list', 'Interface', 'int-br-prv'].
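For reference, the failing vsctl call can be re-run by hand on the affected node to see the underlying ovsdb error; a minimal check, assuming shell access to node-1:

# re-run the exact command from the agent log
ovs-vsctl --timeout=10 --oneline --format=json -- --columns=type list Interface int-br-prv
# and list which bridges/ports actually survived the reset
ovs-vsctl show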

node-11.domain.tld/bootstrap/agent.log:2016-01-09T17:31:30.477050+00:00 debug: 17:31:29.857064 #2239] DEBUG -- : Response: status: 409 body: {"message": "Node with mac 90:B1:1C:28:D0:AB already exists - doing nothing", "errors": []}

node-12.domain.tld/apache2_error.log:2016-01-09T20:07:10.082747+00:00 err: [Sat Jan 09 20:07:02.982889 2016] [fastcgi:error] [pid 7009:tid 140517287319296] (2)No such file or directory: [client 240.0.0.2:53166] FastCGI: failed to connect to server "/var/www/radosgw/s3gw.fcgi": connect() failed
node-6.domain.tld/apache2_error.log:2016-01-09T19:15:57.680770+00:00 err: [Sat Jan 09 19:15:54.518571 2016] [fastcgi:error] [pid 25243:tid 140675924281088] (2)No such file or directory: [client 10.102.255.51:48750] FastCGI: failed to connect to server "/var/www/radosgw/s3gw.fcgi": connect() failed
node-8.domain.tld/apache2_error.log:2016-01-09T19:51:15.950208+00:00 err: [Sat Jan 09 19:51:14.744628 2016] [fastcgi:error] [pid 29025:tid 140315004450560] (2)No such file or directory: [client 240.0.0.2:48587] FastCGI: failed to connect to server "/var/www/radosgw/s3gw.fcgi": connect() failed
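These FastCGI errors usually mean Apache itself is up but cannot reach the radosgw process behind the /var/www/radosgw/s3gw.fcgi stub. A minimal sanity check on a controller could look like this (process and socket names are the usual MOS 7.0 defaults and may differ here):

ps aux | grep '[r]adosgw'            # is the rgw process running at all?
ls -l /var/www/radosgw/s3gw.fcgi     # the stub that mod_fastcgi maps to the rgw socket
ls -l /var/run/ceph/                 # is the FastCGI unix socket present?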

node-13.domain.tld/ceph-osd.log:2016-01-09T20:22:27.017267+00:00 emerg: 2016-01-09 20:22:27.012366 7f2c473fc800 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.UA889K/keyring: can't open /var/lib/ceph/tmp/mnt.UA889K/keyring: (2) No such file or directory
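The keyring error above often points at the OSD's temporary mount missing its keyring during ceph-disk activate; one thing worth verifying is whether the bootstrap-osd key is in place at all. A sketch only, using stock Ceph paths (run the auth query on a controller that holds the admin key):

ls -l /var/lib/ceph/bootstrap-osd/ceph.keyring   # bootstrap key the OSD nodes should have received
ceph auth get client.bootstrap-osd               # does the cluster still have a matching key?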

node-13.domain.tld/puppet-apply.log:2016-01-09T20:22:51.135691+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdu3]/Exec[ceph-deploy osd prepare node-13:/dev/sdu3]/returns) [node-13][WARNING] Error: Partition(s) 1 on /dev/sdu3 have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.
node-7.domain.tld/puppet-apply.log:2016-01-09T20:14:38.523784+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdj3]/Exec[ceph-deploy osd prepare node-7:/dev/sdj3]/returns) [node-7][WARNING] Error: Partition(s) 1 on /dev/sdj3 have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.
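The warning above indicates the old partition table is still in use after the reset, which matches the reset-then-redeploy failure mode. A manual way to investigate on the node, with the device name taken from the log as an illustration:

partprobe /dev/sdu3          # what ceph-disk itself runs; fails if the old layout is still held open
grep sdu /proc/mounts        # is anything from the previous deployment still mounted?
lsof /dev/sdu3 2>/dev/null   # which process, if any, still has the device open

Failing that, the reboot the warning suggests is the reliable way out.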

node-14.domain.tld/ceph-osd.log:2016-01-09T20:22:25.922524+00:00 emerg: 2016-01-09 20:22:25.920223 7f771fdc4800 -1 auth: error reading file: /var/lib/ceph/tmp/mnt.cvHNTH/keyring: can't open /var/lib/ceph/tmp/mnt.cvHNTH/keyring: (2) No such file or directory

node-14.domain.tld/neutron-openvswitch-agent.log:2016-01-09T20:18:38.342165+00:00 err: 2016-01-09 20:18:38.338 10659 ERROR neutron.agent.ovsdb.impl_vsctl [req-990323c3-da16-41c0-92c8-75e3cc42e59a ] Unable to execute ['ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=type', 'list', 'Interface', 'int-br-prv'].

The snapshot is 1.2 GB, available upon request (ask me in Slack or write me a mail).

Maciej Relewicz (rlu)
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Library Team (fuel-library)
tags: added: area-library
Bogdan Dobrelya (bogdando) wrote :

Looks like a Ceph-deploy-specific bug.

tags: added: ceph life-cycle-management
tags: added: team-bugfix
Bogdan Dobrelya (bogdando) wrote :

It may be a generic disk partitioning issue as well.

Alexey Stupnikov (astupnikov) wrote :

Removed the 7.0-mu-3 milestone, since the patch to fix this issue has not been committed yet.

FYI, I couldn't reproduce it in the maintenance lab using the steps to reproduce from the first message.

Changed in fuel:
milestone: 7.0-mu-3 → 7.0-updates
Changed in fuel:
milestone: 7.0-updates → 7.0-mu-3
Vitaly Sedelnik (vsedelnik) wrote :

Fabrizio, we are unable to reproduce the issue with the steps provided. We need more details: number of nodes, plugins installed, etc.

Changed in fuel:
status: Confirmed → Incomplete
assignee: Fuel Library Team (fuel-library) → Fabrizio Soppelsa (fsoppelsa)
Fabrizio Soppelsa (fsoppelsa) wrote :

I didn't previously mention that this configuration also had bonded interfaces on the nodes (if that matters).
I passed the snapshot to one of the developers.

Rodion Tikunov (rtikunov) wrote :

It seems all deploys failed because of connectivity errors. Did you run a network verification before deployment?

I examined the snapshot, in which logs are present since Jan 8 23:10:38.
The reasons the cluster has not deployed:
1) Node 6 deployment error on /etc/puppet/modules/osnailyfacter/modular/netconfig/connectivity_tests.pp:
2016-01-09 03:08:00.603 INFO [7f41dcca4700] (receiver) RPC method deploy_resp received: {"task_uuid": "41b461a7-11f1-4a55-81ae-7f2ac4e82c1a", "nodes": [{"status": "error", "error_type": "deploy", "role": "primary-controller", "uid": "6", "task": {"priority": 700, "type": "puppet", "uids": ["6"], "parameters": {"puppet_modules": "/etc/puppet/modules", "puppet_manifest": "/etc/puppet/modules/osnailyfacter/modular/netconfig/connectivity_tests.pp", "timeout": 3600, "cwd": "/"}}}]}
2016-01-09 03:08:00.743 INFO [7f41dcca4700] (notification) Notification: topic: error message: Deployment has failed. Method granular_deploy. Deployment failed on nodes 6.
2016-01-09 03:08:00.797 DEBUG [7f41dcca4700] (task) Updating cluster (SC-Kilo-Dev (id=1, mode=ha_compact)) status: from error to error

2) Errors with apt-get update (a quick manual check is sketched at the end of this comment):
2016-01-09 04:08:58.230 INFO [7f41dcca4700] (receiver) RPC method deploy_resp received: {"status": "error", "task_uuid": "4e6fa843-4576-48a4-a961-32391d53db72", "error": "Method granular_deploy. Failed to execute hook 'shell' Failed to run command cd / && apt-get update\n\n---\npriority: 2500\ntype: shell\nuids:\n- '11'\n- '10'\n- '13'\n- '12'\n- '15'\n- '14'\n- '16'\n- '1'\n- '5'\n- '4'\n- '7'\n- '6'\n- '9'\n- '8'\nparameters:\n retries: 3\n cmd: apt-get update\n cwd: \"/\"\n timeout: 1800\n interval: 1\n.\nInspect Astute logs for the details"}
2016-01-09 04:08:58.243 INFO [7f41dcca4700] (notification) Notification: topic: error message: Deployment has failed. Method granular_deploy. Failed to execute hook 'shell' Failed to run command cd / && apt-get update
2016-01-09 04:08:58,251 DEBG 'receiverd' stdout output:
2016-01-09 04:08:58.250 DEBUG [7f41dcca4700] (task) Updating cluster (SC-Kilo-Dev (id=1, mode=ha_compact)) status: from deployment to error

3) Some changes in attributes and network configuration:
2016-01-09 04:35:23.997 DEBUG [7f41dcca4700] (cluster) New pending changes in environment 1: attributes
2016-01-09 04:35:24.002 DEBUG [7f41dcca4700] (cluster) New pending changes in environment 1: networks
Node 6 deployment error at the /etc/puppet/modules/osnailyfacter/modular/netconfig/connectivity_tests.pp step:
2016-01-09 05:24:49.182 INFO [7f41dcca4700] (receiver) RPC method deploy_resp received: {"task_uuid": "8ed9b482-0d55-43a9-8b15-9255d447d587", "nodes": [{"status": "error", "error_type": "deploy", "role": "primary-controller", "uid": "6", "task": {"priority": 700, "type": "puppet", "uids": ["6"], "parameters": {"puppet_modules": "/etc/puppet/modules", "puppet_manifest": "/etc/puppet/modules/osnailyfacter/modular/netconfig/connectivity_tests.pp", "timeout": 3600, "cwd": "/"}}}]}
2016-01-09 05:24:49.207 DEBUG [7f41dcca4700] (receiver) Updating node 6 - set error_type to deploy
2016-01-09 05:24:49.207 DEBUG [7f41dcca4700] (receiver) Updating node 6 - set status to error
2016-01-09 05:24:49...
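To rule out the repository connectivity problem from (2), a quick manual check from a failed node might be (addresses and paths only as seen in this snapshot, shown as a sketch):

cat /etc/apt/sources.list /etc/apt/sources.list.d/*.list   # which mirrors the node is actually configured with
apt-get update                                             # reproduce the failing deployment hook by hand
ping -c 3 10.20.0.2                                        # master/admin address from this snapshot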


Fabrizio Soppelsa (fsoppelsa) wrote :

Network verification was fine; everything could connect to the outside. Initially I thought of some problem with zapping the disks on the Ceph nodes.

Create new env & Deploy changes => deployment succeeds 100% of the time
Reset environment & (even with no modifications) Deploy changes => deployment fails with the error reported above

So I thought it was connected to the Reset action itself...
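For what it's worth, a manual zap of a Ceph OSD partition before re-deploying (to rule out leftover partition state from the previous deployment) would look roughly like this; the device name is only illustrative:

ceph-disk zap /dev/sdu3                        # wipes the partition table on the device
# or, by hand:
sgdisk --zap-all /dev/sdu3
dd if=/dev/zero of=/dev/sdu3 bs=1M count=10
partprobe /dev/sdu3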

Vitaly Sedelnik (vsedelnik) wrote :

Retargeted to 7.0-mu-4 as this issue requires additional investigation.

Changed in fuel:
milestone: 7.0-mu-3 → 7.0-mu-4
Rodion Tikunov (rtikunov) wrote :

Deploy attempts #1, 2 and 3 failed explicitly because of connectivity problems.

In deploy #4 I saw Ceph errors in the logs, but they do not seem to be critical, because after them Puppet reports that the node is ready.

node-9.domain.tld/var/log/puppet.log
2016-01-09T20:29:23.769526+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdd3]/Exec[ceph-deploy osd prepare node-13:/dev/sdd3]/returns) [node-13][WARNING] INFO:ceph-disk:Running command: /sbin/partprobe /dev/sdd3
2016-01-09T20:29:23.769992+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdd3]/Exec[ceph-deploy osd prepare node-13:/dev/sdd3]/returns) [node-13][WARNING] Error: Partition(s) 1 on /dev/sdd3 have been written, but we have been unable to inform the kernel of the change, probably because it/they are in use. As a result, the old partition(s) will remain in use. You should reboot now before making further changes.
2016-01-09T20:29:23.770664+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdd3]/Exec[ceph-deploy osd prepare node-13:/dev/sdd3]/returns) [node-13][INFO ] checking OSD status...
2016-01-09T20:29:23.770664+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdd3]/Exec[ceph-deploy osd prepare node-13:/dev/sdd3]/returns) [node-13][INFO ] Running command: ceph --cluster=ceph osd stat --format=json
2016-01-09T20:29:23.770664+00:00 notice: (/Stage[main]/Ceph::Osds/Ceph::Osds::Osd[/dev/sdd3]/Exec[ceph-deploy osd prepare node-13:/dev/sdd3]/returns) [ceph_deploy.osd][DEBUG ] Host node-13 is now ready for osd use.
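To confirm whether those OSDs actually joined the cluster despite the warnings, one could query the cluster from a monitor/controller node (standard Ceph commands, shown only as a sketch):

ceph -s          # overall health and osdmap summary
ceph osd stat    # the same query ceph-deploy runs above
ceph osd tree    # per-node view: are node-13's OSDs up and in?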

10.20.0.2/var/log/docker-logs/supervisor/supervisord.log-20160110:
2016-01-09 20:29:38.515 INFO [7f41dcca4700] (receiver) RPC method deploy_resp received: {"task_uuid": "b4f6d7d7-79e2-4c32-ae9a-eb0c93f610bd", "nodes": [{"status": "ready", "progress": 100, "task": {"priority": 1200, "type": "puppet", "uids": ["13"], "parameters": {"puppet_modules": "/etc/puppet/modules", "puppet_manifest": "/etc/puppet/modules/osnailyfacter/modular/zabbix/zabbix.pp", "timeout": 3600, "cwd": "/"}}, "role": "ceph-osd", "uid": "13"}]}
2016-01-09 20:29:38.541 DEBUG [7f41dcca4700] (receiver) Updating node 13 - set status to ready
2016-01-09 20:29:38.541 DEBUG [7f41dcca4700] (receiver) Updating node 13 - set progress to 100

The latest deploy failed after these errors:
10.20.0.2/var/log/docker-logs/astute/astute.log.1.gz
2016-01-09T20:32:41 debug: [795] Node 9(hook) status: error
2016-01-09T20:32:41 debug: [795] Node 9 has failed to deploy. There is no more retries for puppet run.
2016-01-09T20:32:41 debug: [795] {"nodes"=>[{"status"=>"error", "error_type"=>"deploy", "uid"=>"9", "role"=>"hook"}]}
2016-01-09T20:32:41 info: [795] b4f6d7d7-79e2-4c32-ae9a-eb0c93f610bd: Spent 75.474620829 seconds on puppet run for following nodes(uids): 11,10,13,14,1,4,9
2016-01-09T20:32:41 warning: [795] Puppet run failed. Check puppet logs for details
2016-01-09T20:32:41 debug: [795] Data received by DeploymentProxyReporter to report it up: {"nodes"=>[{"uid"=>"11", "status"=>"error", "error_type"=>"deploy", "role"=>"hook", "hook"=>"puppet", "error_msg"=>"Puppet run failed. Check puppet logs for details"}, {"uid"=>"10", "status"=>"error", "error_type"=>"deploy", "role"=>"hook",...


Dmitry Klenov (dklenov) wrote :

Looks like the info from Fabrizio has been received. Moving to Confirmed state.

Changed in fuel:
status: Incomplete → Confirmed
Rodion Tikunov (rtikunov) wrote :

We are unable to reproduce the issue.
As explained in comment https://bugs.launchpad.net/fuel/+bug/1536167/comments/9, there are network connectivity errors, and we need more details about the network configuration, bonding settings and possibly the hardware to create the same environment.
So the bug is closed as Invalid. Feel free to reopen it, providing new details.

Changed in fuel:
status: Confirmed → Invalid
Changed in fuel:
milestone: 7.0-mu-4 → 7.0-updates
ZHI BING WANG (zwang) wrote :

We have an issue with the same symptom in MOS 7, MU 3. It happens on all nodes systematically.
The first-time deployment works; provisioning fails after an environment reset.

We have the issue with or without Ceph.
