Vault becomes inaccessible if an etcd unit is removed/down

Bug #1782620 reported by Liam Young
Affects: vault-charm
Status: Fix Released
Importance: Critical
Assigned to: Liam Young (gnuoy)
Milestone: 19.04

Bug Description

If vault is deployed in an HA configuration and the etcd unit with the lowest IP is shut down, then vault becomes inaccessible:

$ export VAULT_ADDR="http://$(juju config vault vip):8200"
$ export VAULT_TOKEN=43625de2-b301-7fe4-0bbd-a1c62049e43c
$ vault status
Key Value
--- -----
Seal Type shamir
Sealed false
Total Shares 1
Threshold 1
Version 0.10.1
Cluster Name vault-cluster-fe6e94e6
Cluster ID 6f5b5c26-6a61-3710-6a75-02c2b0931040
HA Enabled true
HA Cluster https://10.53.82.244:8201
HA Mode active

$ juju status etcd | awk '/^etcd\// {print $5;}' | sort | head -n1
10.53.82.119

$ juju status | awk '/started.*10.53.82.119/ {print $4}'
juju-4d60d9-2

$ lxc stop juju-4d60d9-2

$ vault status
Error checking leader status: Error making API request.

URL: GET http://10.53.82.200:8200/v1/sys/leader
Code: 500. Errors:

* context deadline exceeded

Looking at vault's config on a vault unit, the etcd units are listed:

$ sudo grep -A5 ha_storage /var/snap/vault/common/vault.hcl | grep address
  address = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"

It looks suspiciously as though vault is supposed to work through the addresses in order but never gets past the first one.
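For context, the surrounding stanza in vault.hcl presumably looks something like the following. This is a sketch: only the `address` line is taken from the actual unit above; the remaining keys and certificate paths are illustrative.

```hcl
# Hypothetical ha_storage stanza as the charm might render it.
# Only the address list is confirmed from the unit; the tls_* paths
# shown here are assumptions for illustration.
ha_storage "etcd" {
  address       = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
  ha_enabled    = "true"
  tls_ca_file   = "/var/snap/vault/common/etcd-ca.pem"
  tls_cert_file = "/var/snap/vault/common/etcd-cert.pem"
  tls_key_file  = "/var/snap/vault/common/etcd.key"
}
```

The comma-separated `address` list is what the etcd client library is expected to fail over through when an endpoint becomes unreachable.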

Liam Young (gnuoy)
summary: - Vault becomes inaccessible if a vault unit and an etcd unit are removed
+ Vault becomes inaccessible if an etcd unit is removed
Changed in vault-charm:
importance: Undecided → Critical
Revision history for this message
James Page (james-page) wrote : Re: Vault becomes inaccessible if an etcd unit is removed

Confirmed - I was able to reproduce this issue with a three unit vault cluster - shutting down one of the etcd units results in the described error message.

Changed in vault-charm:
status: New → Confirmed
Revision history for this message
Liam Young (gnuoy) wrote :

My plan is to verify whether the bug exists with the latest version of vault and, if it does, raise an upstream bug. I'll leave this bug open too, to document any workarounds etc.

Changed in vault-charm:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

Confirmed that the bug exists in the latest version of vault (0.10.3). I have raised upstream bug https://github.com/hashicorp/vault/issues/4961

Revision history for this message
Liam Young (gnuoy) wrote :

I believe I have tracked this down to a bug in the etcd Go client library that only shows itself when using TLS to talk to etcd; I've raised an upstream bug for that (https://github.com/coreos/etcd/issues/9949). Given that in our deployment vault only uses etcd to manage leadership election, and not to store secrets, I think it's OK to disable TLS between vault and etcd. Obviously, if my assumption is wrong and vault does store something secret in etcd, then this is not OK. I have created two charms that do this: cs:~gnuoy/vault-16 and cs:~gnuoy/etcd-0

If you want to test them, note that they need manual post-upgrade steps. In the case of vault, the vault service needs to be restarted by hand and then unsealed. The etcd charm seems reluctant to re-render its config, so I had to change the juju config port setting and then change it back.
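As a sketch of what the workaround amounts to in the rendered config (assuming the forked charms simply drop TLS on the etcd client port; the exact rendering may differ):

```hcl
# Hypothetical vault.hcl fragment after the TLS-disabled workaround:
# the etcd endpoints switch to plain http and the tls_* options are
# no longer present, sidestepping the TLS-only client bug.
ha_storage "etcd" {
  address    = "http://10.53.82.119:2379,http://10.53.82.150:2379,http://10.53.82.157:2379"
  ha_enabled = "true"
}
```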

Revision history for this message
Liam Young (gnuoy) wrote :

juju remove-relation etcd:db vault:etcd

<when hooks have completed continue>

juju upgrade-charm --switch cs:~gnuoy/etcd-0 etcd
juju upgrade-charm --switch cs:~gnuoy/vault-16 vault
juju config etcd port=2350

<when hooks have completed continue>

juju config etcd port=2379
juju add-relation etcd:db vault:etcd

<when hooks have completed continue>

juju run --application vault "sudo systemctl restart vault"

James Page (james-page)
Changed in vault-charm:
status: Confirmed → In Progress
summary: - Vault becomes inaccessible if an etcd unit is removed
+ Vault becomes inaccessible if an etcd unit is removed/down
Revision history for this message
John George (jog) wrote :

This bug is subscribed to Canonical Field Critical and tracked by the SLA process.
I see that alternative charms have been provided for testing, but can you please provide a target date for an officially released solution?

Revision history for this message
Liam Young (gnuoy) wrote :

I have pushed a new version of the fork to cs:~gnuoy/etcd-10 .

To upgrade from cs:~gnuoy/etcd-0, please try the following steps:

M="MODEL NAME"
juju remove-relation -m $M vault:etcd etcd:db
# wait for hooks to complete
juju run -m $M --application vault "sudo systemctl stop vault"
# wait for hooks to complete
juju upgrade-charm -m $M --switch cs:~gnuoy/etcd-10 etcd
# wait for hooks to complete
juju run-action -m $M etcd/0 render-and-restart
juju run-action -m $M etcd/1 render-and-restart
juju run-action -m $M etcd/2 render-and-restart
juju run -m $M --application etcd "./hooks/update-status"
# wait for hooks to complete
juju add-relation -m $M vault:etcd etcd:db
# You may not need the following step
juju run -m $M --application vault "sudo systemctl start vault"

If the cluster count in the workload status is wrong, it should be fixed when the next update-status hook runs. This can be forced with:

juju run -m $M --application etcd "./hooks/update-status"

Revision history for this message
Xav Paice (xavpaice) wrote :

Using cs:~gnuoy/vault-16, we get NRPE reporting "CRITICAL: failed to fetch version of running vault server: <urlopen error [SSL: UNKNOWN_PROTOC..."

It would be good to consider that with the alternate charm in use.

Ryan Beisner (1chb1n)
Changed in vault-charm:
status: In Progress → Fix Committed
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The official documentation for the etcd ha_backend for Vault mentions issues with the v2 API:

https://www.vaultproject.io/docs/configuration/storage/etcd.html
"High Availability – the Etcd storage backend supports high availability. The v2 API has known issues with HA support and should not be used in HA scenarios."

Based on the github issue tracker, there is an issue with how leadership locking is done:

https://github.com/hashicorp/vault/issues/1184#issuecomment-232433373 (day 2 issue description)
https://github.com/hashicorp/vault/issues/1184#issuecomment-224868339 (all vault units go to "standby" and error out with "etcd: couldn't renew key")

https://github.com/hashicorp/vault/issues/1908 (multiple vault units consider themselves to be leaders intermittently)

The v2 backend will not be fixed upstream by the looks of it:

https://github.com/hashicorp/vault/issues/2214#issuecomment-269718252
https://github.com/hashicorp/vault/issues/1908#issuecomment-248686276
https://github.com/hashicorp/vault/pull/1909 (disables HA for etcd by default)

And it does not look like there have been any significant changes since (other than moving the code to etcd2.go when v3 support was added):
https://github.com/hashicorp/vault/blame/master/physical/etcd/etcd2.go

I would prefer to use v3 and have no HA rather than have day 2 problems with vault when all units are up and etcd is fully operational.

It seems that either fixing v3 support upstream or exercising the recently community-contributed mysql ha_backend is the proper path forward.

https://github.com/hashicorp/vault/pull/4686
(vault) git tag --contains 801eddf5f868dec54fb8aedade12d38cd9b1c8df
v0.11.0-beta1

Revision history for this message
Ryan Beisner (1chb1n) wrote :

I agree that we should expose the API version config and document/add pointers to the known issues with each. That will be more flexible than the current work-around as merged (hard-coded to the v2 API). Operators can then choose which set of work-arounds best fits their specific needs and tolerances while the underlying component issues are resolved more adequately.
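Exposing the API version would presumably boil down to templating the backend's `etcd_api` option (a documented vault etcd-backend setting) from a charm config value, roughly:

```hcl
# Hypothetical rendering with the API version driven by charm config;
# "v3" is shown here, but an operator could select "v2" with its
# known HA caveats. The address list is illustrative.
ha_storage "etcd" {
  address    = "https://10.53.82.119:2379,https://10.53.82.150:2379,https://10.53.82.157:2379"
  etcd_api   = "v3"
  ha_enabled = "true"
}
```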

Revision history for this message
Chris Gregan (cgregan) wrote :

When can we expect this fix to be released?

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Any update on when this fix is scheduled for released?

Revision history for this message
Liam Young (gnuoy) wrote :

I have updated the vault snap and the vault charm to skip TLS host verification while we are blocked by the upstream bug. The vault charm only uses etcd to store a token indicating which unit is the current HA leader, and this token is in turn encrypted, so the lack of TLS verification should not be an issue. This means the forks are no longer needed. Below are instructions for moving off the forks:

If using cs:~gnuoy/vault* with cs:~gnuoy/etcd-0 or cs:~gnuoy/etcd-10:

# Prepare vault for upgrade
juju remove-relation vault:etcd etcd:db
juju run --application vault "systemctl stop vault"

# Upgrade snap
juju run --application vault "snap refresh vault --channel edge"
# or confirm snap has been upgraded
juju run --application vault "/snap/bin/vault --version"

# Upgrade charm
juju upgrade-charm --switch cs:~openstack-charmers-next/vault vault
juju upgrade-charm --switch cs:etcd etcd
juju config etcd port=2350

# Trigger config render from etcd
juju config etcd port=2379

# Restore relation
juju add-relation vault:etcd etcd:db

# Unseal
*unseal*

If using cs:~gnuoy/vault* with cs:etcd:

juju run --application vault "systemctl stop vault"

# Upgrade snap
juju run --application vault "snap refresh vault --channel edge"

# or confirm snap has been upgraded
juju run --application vault "/snap/bin/vault --version"

# Upgrade charm
juju upgrade-charm --switch cs:~openstack-charmers-next/vault vault

David Ames (thedac)
Changed in vault-charm:
milestone: none → 19.04
James Page (james-page)
Changed in vault-charm:
status: Fix Committed → Fix Released