Vault becomes inaccessible if an etcd unit is removed/down

Bug #1782620 reported by Liam Young
Affects: vault-charm
Status: Fix Released
Importance: Critical
Assigned to: Liam Young (gnuoy)
Milestone: 19.04

Bug Description

If vault is deployed in an HA configuration and the etcd unit with the lowest IP is shut down, then vault becomes inaccessible:

$ export VAULT_ADDR="http://$(juju config vault vip):8200"
$ export VAULT_TOKEN=43625de2-b301-7fe4-0bbd-a1c62049e43c
$ vault status
Key Value
--- -----
Seal Type shamir
Sealed false
Total Shares 1
Threshold 1
Version 0.10.1
Cluster Name vault-cluster-fe6e94e6
Cluster ID 6f5b5c26-6a61-3710-6a75-02c2b0931040
HA Enabled true
HA Cluster https://10.53.82.244:8201
HA Mode active

$ juju status etcd | awk '/^etcd\// {print $5;}' | sort | head -n1
10.53.82.119

$ juju status | awk '/started.*10.53.82.119/ {print $4}'
juju-4d60d9-2

$ lxc stop juju-4d60d9-2

$ vault status
Error checking leader status: Error making API request.

URL: GET http://10.53.82.200:8200/v1/sys/leader
Code: 500. Errors:

* context deadline exceeded

Looking at vault's config on a vault unit, the etcd units are listed:

$ sudo grep -A5 ha_storage /var/snap/vault/common/vault.hcl | grep address
  address = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"

It looks suspiciously as though vault is supposed to work through the addresses in order but never gets past the first one.
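For context, the surrounding stanza in vault.hcl presumably looks something like the following. This is a sketch: only the `address` line is taken from the actual unit above; the remaining keys and certificate paths are illustrative.

```hcl
# Hypothetical ha_storage stanza as the charm might render it.
# Only the address list is confirmed from the unit; the tls_* paths
# shown here are assumptions for illustration.
ha_storage "etcd" {
  address       = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
  ha_enabled    = "true"
  tls_ca_file   = "/var/snap/vault/common/etcd-ca.pem"
  tls_cert_file = "/var/snap/vault/common/etcd-cert.pem"
  tls_key_file  = "/var/snap/vault/common/etcd.key"
}
```

The comma-separated `address` list is what the etcd client library is expected to fail over through when an endpoint becomes unreachable.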

Liam Young (gnuoy)
summary: - Vault becomes inaccessible if a vault unit and an etcd unit are removed
+ Vault becomes inaccessible if an etcd unit is removed
Changed in vault-charm:
importance: Undecided → Critical
Revision history for this message
James Page (james-page) wrote : Re: Vault becomes inaccessible if an etcd unit is removed

Confirmed - I was able to reproduce this issue with a three unit vault cluster - shutting down one of the etcd units results in the described error message.

Changed in vault-charm:
status: New → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

My plan is to verify whether the bug exists with the latest version of vault and, if it does, raise an upstream bug. I'll leave this bug open too, to document any workarounds etc.

Changed in vault-charm:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

Confirmed that the bug exists in the latest version of vault (0.10.3). I have raised upstream bug https://github.com/hashicorp/vault/issues/4961

Revision history for this message
Liam Young (gnuoy) wrote :

I believe I have tracked this down to a bug in the etcd Go client library that only shows itself when using TLS to talk to etcd; I've raised an upstream bug for that (https://github.com/coreos/etcd/issues/9949). Given that in our deployment vault only uses etcd to manage leadership election, and not to store secrets, I think it's OK to disable TLS between vault and etcd. Obviously, if my assumption is wrong and vault does store something secret in etcd, then this is not OK. I have created two charms that do this: cs:~gnuoy/vault-16 and cs:~gnuoy/etcd-0

If you want to test them, note that they need manual post-upgrade steps. In the case of vault, the vault service needs to be restarted by hand and then unsealed. The etcd charm seems reluctant to re-render its config, so I had to change the juju config port setting and then change it back.
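As a sketch of what the workaround amounts to in the rendered config (assuming the forked charms simply drop TLS on the etcd client port; the exact rendering may differ):

```hcl
# Hypothetical vault.hcl fragment after the TLS-disabled workaround:
# the etcd endpoints switch to plain http and the tls_* options are
# no longer present, sidestepping the TLS-only client bug.
ha_storage "etcd" {
  address    = "http://10.53.82.119:2379,http://10.53.82.150:2379,http://10.53.82.157:2379"
  ha_enabled = "true"
}
```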

Revision history for this message
Liam Young (gnuoy) wrote :

juju remove-relation etcd:db vault:etcd

<when hooks have completed continue>

juju upgrade-charm --switch cs:~gnuoy/etcd-0 etcd
juju upgrade-charm --switch cs:~gnuoy/vault-16 vault
juju config etcd port=2350

<when hooks have completed continue>

juju config etcd port=2379
juju add-relation etcd:db vault:etcd

<when hooks have completed continue>

juju run --application vault "sudo systemctl restart vault"

James Page (james-page)
Changed in vault-charm:
status: Confirmed → In Progress
summary: - Vault becomes inaccessible if an etcd unit is removed
+ Vault becomes inaccessible if an etcd unit is removed/down
Revision history for this message
John George (jog) wrote :

This bug is subscribed to Canonical Field Critical and tracked by the SLA process.
I see that alternative charms have been provided for testing, but can you please provide a target date for an officially released solution?

Revision history for this message
Liam Young (gnuoy) wrote :

I have pushed a new version of the fork to cs:~gnuoy/etcd-10 .

To upgrade from cs:~gnuoy/etcd-0, please try the following steps:

M="MODEL NAME"
juju remove-relation -m $M vault:etcd etcd:db
# wait for hooks to complete
juju run -m $M --application vault "sudo systemctl stop vault"
# wait for hooks to complete
juju upgrade-charm -m $M --switch cs:~gnuoy/etcd-10 etcd
# wait for hooks to complete
juju run-action -m $M etcd/0 render-and-restart
juju run-action -m $M etcd/1 render-and-restart
juju run-action -m $M etcd/2 render-and-restart
juju run -m $M --application etcd "./hooks/update-status"
# wait for hooks to complete
juju add-relation -m $M vault:etcd etcd:db
# You may not need the following step
juju run -m $M --application vault "sudo systemctl start vault"

If the cluster count in the workload status is wrong, it should be fixed when the next update-status hook runs. This can be forced with:

juju run -m $M --application etcd "./hooks/update-status"

Revision history for this message
Xav Paice (xavpaice) wrote :

Using cs:~gnuoy/vault-16, we get NRPE reporting "CRITICAL: failed to fetch version of running vault server: <urlopen error [SSL: UNKNOWN_PROTOC..."

It would be good to consider that with the alternate charm in use.

Ryan Beisner (1chb1n)
Changed in vault-charm:
status: In Progress → Fix Committed
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The official documentation for the etcd ha_backend for Vault mentions issues with the v2 API:

https://www.vaultproject.io/docs/configuration/storage/etcd.html
"High Availability – the Etcd storage backend supports high availability. The v2 API has known issues with HA support and should not be used in HA scenarios."

Based on the github issue tracker, there is an issue with how leadership locking is done:

https://github.com/hashicorp/vault/issues/1184#issuecomment-232433373 (day 2 issue description)
https://github.com/hashicorp/vault/issues/1184#issuecomment-224868339 (all vault units go to "standby" and error out with "etcd: couldn't renew key")

https://github.com/hashicorp/vault/issues/1908 (multiple vault units consider themselves to be leaders intermittently)

The v2 backend will not be fixed upstream by the looks of it:

https://github.com/hashicorp/vault/issues/2214#issuecomment-269718252
https://github.com/hashicorp/vault/issues/1908#issuecomment-248686276
https://github.com/hashicorp/vault/pull/1909 (disables HA for etcd by default)

And it does not look like there have been any significant changes since (other than moving the code to etcd2.go when v3 support was added):
https://github.com/hashicorp/vault/blame/master/physical/etcd/etcd2.go

I would prefer to use v3 and have no HA rather than have day 2 problems with vault when all units are up and etcd is fully operational.

It seems that either fixing v3 support upstream or exercising the recently community-contributed mysql ha_backend is the proper path forward.

https://github.com/hashicorp/vault/pull/4686
(vault) git tag --contains 801eddf5f868dec54fb8aedade12d38cd9b1c8df
v0.11.0-beta1

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I agree that we should expose the API version config and document/add pointers to the known issues with each. That will be more flexible than the current work-around as merged (hard-coded to the v2 API). Operators can then choose which set of work-arounds best fits their specific needs and tolerances while the underlying component issues are resolved more adequately.
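Exposing the API version would presumably boil down to templating the backend's `etcd_api` option (a documented vault etcd-backend setting) from a charm config value, roughly:

```hcl
# Hypothetical rendering with the API version driven by charm config;
# "v3" is shown here, but an operator could select "v2" with its
# known HA caveats. The address list is illustrative.
ha_storage "etcd" {
  address    = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
  etcd_api   = "v3"
  ha_enabled = "true"
}
```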

Revision history for this message
Chris Gregan (cgregan) wrote :

When can we expect this fix to be released?

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Any update on when this fix is scheduled for released?

Revision history for this message
Liam Young (gnuoy) wrote :

I have updated the vault snap and the vault charm to skip TLS host verification while we are blocked by the upstream bug. The vault charm only uses etcd to store a token indicating which unit is the current HA leader, and this token is in turn encrypted, so the lack of TLS verification should not be an issue. This means the forks are no longer needed. Below are instructions for moving off the forks:

If using cs:~gnuoy/vault* with cs:~gnuoy/etcd-0 or cs:~gnuoy/etcd-10:

# Prepare vault for upgrade
juju remove-relation vault:etcd etcd:db
juju run --application vault "systemctl stop vault"

# Upgrade snap
juju run --application vault "snap refresh vault --channel edge"
# or confirm snap has been upgraded
juju run --application vault "/snap/bin/vault --version"

# Upgrade charm
juju upgrade-charm --switch cs:~openstack-charmers-next/vault vault
juju upgrade-charm --switch cs:etcd etcd
juju config etcd port=2350

# Trigger config render from etcd
juju config etcd port=2379

# Restore relation
juju add-relation vault:etcd etcd:db

# Unseal
*unseal*

If using cs:~gnuoy/vault* with cs:etcd:

juju run --application vault "systemctl stop vault"

# Upgrade snap
juju run --application vault "snap refresh vault --channel edge"

# or confirm snap has been upgraded
juju run --application vault "/snap/bin/vault --version"

# Upgrade charm
juju upgrade-charm --switch cs:~openstack-charmers-next/vault vault

David Ames (thedac)
Changed in vault-charm:
milestone: none → 19.04
James Page (james-page)
Changed in vault-charm:
status: Fix Committed → Fix Released