During setup controllers remain in Degraded status - postgres standalone

Bug #1850836 reported by Cristopher Lemus
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Don Penney

Bug Description

Brief Description
-----------------
During the setup stages of Duplex and Standard (2+2 and 2+2+2) configurations, both controllers remain in a Degraded state after unlocking controller-1.

Severity
--------
Critical

Steps to Reproduce
------------------
Follow the documented setup procedure for duplex, standard, or standard external storage configurations.

Expected Behavior
------------------
Controllers in Available status

Actual Behavior
----------------
Controllers remain in Degraded status

Reproducibility
---------------
100%

System Configuration
--------------------
Duplex, standard, and standard external storage configurations.

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20191031T013000Z"

Last Pass
---------
Last pass on previous day: 20191030T013000Z

Timestamp/Logs
--------------
Full log attached from a Standard Local storage configuration. Here are general details from this config: http://paste.openstack.org/show/785692/

Test Activity
-------------
Sanity

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :
Ghada Khalil (gkhalil)
summary: - During setup controllers remain on Degraded status
+ During setup controllers remain in Degraded status -postgres standalone
summary: - During setup controllers remain in Degraded status -postgres standalone
+ During setup controllers remain in Degraded status - postgres standalone
Revision history for this message
Don Penney (dpenney) wrote :

On controller-0:
Disk /dev/mapper/cgts--vg-pgsql--lv: 42.9 GB, 42949672960 bytes, 83886080 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

On controller-1:
Disk /dev/mapper/cgts--vg-pgsql--lv: 21.5 GB, 21474836480 bytes, 41943040 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

So the first controller was set up with a 40G postgres LV, while the second controller got only 20G.
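
The mismatch can also be confirmed directly through LVM rather than fdisk. As a suggested check (assuming the standard cgts-vg volume group shown in the device names above):

$ sudo lvs --units g -o lv_name,lv_size cgts-vg/pgsql-lv

On controller-0 this should report roughly 40g for pgsql-lv and on controller-1 roughly 20g, matching the fdisk output above.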

Revision history for this message
Don Penney (dpenney) wrote :

The user.log shows ansible setting the size to 20G on controller-0, then setting it again to 40G immediately afterward:

$ grep pgsql-lv controller-0_20191031.100019/var/log/user.log
2019-10-30T22:57:42.000 localhost ansible-command: info Invoked with warn=True executable=None _uses_shell=False _raw_params=lvextend -L20G /dev/cgts-vg/pgsql-lv removes=None argv=None creates=None chdir=None stdin=None
2019-10-30T22:57:43.000 localhost ansible-command: info Invoked with warn=True executable=None _uses_shell=False _raw_params=lvextend -L40G /dev/cgts-vg/pgsql-lv removes=None argv=None creates=None chdir=None stdin=None

I don't know where this second ansible directive is coming from.
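
One quick check would be to run the same grep against the controller-1 collect bundle (the path pattern below assumes it mirrors the controller-0 layout above):

$ grep pgsql-lv controller-1_*/var/log/user.log

If controller-1 shows only the single lvextend -L20G invocation, that would line up with the 20G LV observed there.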

Revision history for this message
Don Penney (dpenney) wrote :

Can you provide the content of your localhost.yaml file?

Revision history for this message
Don Penney (dpenney) wrote :

From: https://opendev.org/starlingx/ansible-playbooks/src/branch/master/playbookconfig/src/playbooks/roles/bootstrap/persist-config/tasks/one_time_config_tasks.yml#L102

The LV is initially sized to 10G; then, if the disk is large enough, it gets resized to 20G. With the logs here showing 20G and 40G, it looks like the initial configuration of controller-0 somehow used ansible-playbooks from a previous load.
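
For reference, the logic described there amounts to something like the following sketch (the variable names are placeholders, not copied from the playbook):

# pgsql-lv starts at 10G after installation; the bootstrap playbook then grows it
# to 20G when the root disk has room (SMALL_DISK_THRESHOLD_GB is a placeholder)
if [ "$ROOT_DISK_SIZE_GB" -gt "$SMALL_DISK_THRESHOLD_GB" ]; then
    lvextend -L20G /dev/cgts-vg/pgsql-lv
fi

Under that logic a 40G lvextend should never appear in the log, which is what prompted the suspicion that controller-0 was bootstrapped with a stale copy of ansible-playbooks.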

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Here's the content:

system_mode: duplex

dns_servers:
  - 192.168.100.60

docker_registries:
  defaults:
    url: 192.168.100.60

is_secure_registry: False

external_oam_subnet: 192.168.200.0/24
external_oam_gateway_address: 192.168.200.1
external_oam_floating_address: 192.168.200.207
external_oam_node_0_address: 192.168.200.82
external_oam_node_1_address: 192.168.200.83

management_subnet: 10.10.54.0/24
management_start_address: 10.10.54.10
management_end_address: 10.10.54.100
management_gateway_address: 10.10.54.1

ansible_become_pass: ANSIBLE_PASS
admin_password: ADMIN_PASS

Just the ANSIBLE_PASS/ADMIN_PASS are replaced by actual passwords.

Revision history for this message
Don Penney (dpenney) wrote :

Ok, you can ignore my previous comments. The problem is that the two updates to resize the filesystem were not properly coordinated:
https://review.opendev.org/692129
https://review.opendev.org/692125

One went in yesterday, the other this morning. So the next build should be ok.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Fix Released.
As per the above, this will be addressed in the 2019-11-01 load.

tags: added: stx.config
Changed in starlingx:
assignee: nobody → Don Penney (dpenney)
importance: Undecided → Critical
status: New → Fix Released