Bionic/Focal VMs get the same machine-id

Bug #1886994 reported by Chris Johnston
This bug affects 31 people
Affects                      Status        Importance  Assigned to   Milestone
Canonical Livepatch Charm    Fix Released  Low         Adam Dyess
Canonical Livepatch Client   Won't Fix     Undecided   Unassigned
MAAS                         Won't Fix     Undecided   Unassigned

Bug Description

Redeployed Bionic (B) and Focal (F) VMs don't get new machine-ids. This is causing an issue where a machine that has been registered in livepatch and then redeployed can't be re-registered in livepatch, because the machine-id already exists.

To reproduce:
1) compose 3 VMs on a MAAS KVM host
2) Deploy one each of Xenial (X), Bionic, and Focal on the new VMs
3) cat /etc/machine-id on each new machine and take note of the result
4) Release the machine
5) Redeploy the machine with the same release
6) cat /etc/machine-id on each machine after it has been redeployed. Note that the machine-id for Xenial should have changed, while the machine-ids for Bionic and Focal are the same. (A quick way to capture the ids for comparison is sketched below.)
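
For illustration, a loop along these lines captures each machine's id; run it after step 3 and again after step 6 (the host names are made up):

    for h in xenial-vm bionic-vm focal-vm; do
        printf '%s: ' "$h"; ssh "ubuntu@$h" cat /etc/machine-id
    done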

Other related bugs for more reference:
https://bugs.launchpad.net/cloud-images/+bug/1731279
https://bugs.launchpad.net/canonical-livepatch-client/+bug/1680611

tags: added: sts
Revision history for this message
Lee Trager (ltrager) wrote :

The images contain a blank /etc/machine-id. According to [1], the machine-id should be automatically generated by systemd on first boot.

[1] https://www.freedesktop.org/software/systemd/man/machine-id.html

no longer affects: maas-images
Revision history for this message
Robert C Jennings (rcj) wrote :

This is a bit surprising; it shouldn't be an issue for Canonical-produced Ubuntu MAAS images. livecd-rootfs removes[1] the dbus machine-id for all builds on all projects by calling a live-build chroot hook[2].

But this is a redeploy, and I need to understand the workflow. I'm assuming the reused machine-id is not the same across all machines (i.e., not something inherent in the base image), but please check this, as the answer would be significant.

For cloud-image tests we ensure, for this reason, that machine-id is generated at boot and not present in the image. lp:maas-images[3] takes the base cloud squashfs and makes some modifications; a check of their latest bionic squashfs[4] shows that the /var/lib/dbus directory is empty and /etc/machine-id is also empty.
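
For reference, a check along these lines confirms it (the squashfs file name is illustrative):

    mkdir /tmp/sq
    sudo mount -t squashfs -o loop,ro bionic.squashfs /tmp/sq
    wc -c /tmp/sq/etc/machine-id   # 0 bytes
    ls -A /tmp/sq/var/lib/dbus/    # no output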

I thought cloud-init cleared the machine-id if it detected a boot on a new instance, but I can't find it in the code at the moment. (And this would only matter in the case of a redeploy of an image captured with the machine-id baked in.)

I don't expect this is a dbus-uuidgen issue; it would be exceptional for it to consistently generate the same value without an upstream bug report and fix. I can see that each time I run it in an lxd container it gives me something new (which is what I'd expect), but I thought I'd mention it.
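
A quick way to reproduce that check (the container name is illustrative):

    for i in 1 2 3; do lxc exec mycontainer -- dbus-uuidgen; done
    # prints a different uuid on each invocation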

I would talk to the MAAS and cloud-init folks to go further. I expect that the redeploy (per the description) is from the pristine image, which should not have the machine-id file. For sanity's sake, please double check that these are the official images, which are built specifically to have machine-id removed, and not something custom built that contains /var/lib/dbus/machine-id or /etc/machine-id.

[1] https://git.launchpad.net/livecd-rootfs/tree/live-build/auto/config#n958
[2] https://git.launchpad.net/ubuntu/+source/live-build/tree/share/hooks/004-remove-dbus-machine-id.chroot
[3] https://code.launchpad.net/~maas-images-maintainers/maas-images/maas-ephemerals
[4] http://images.maas.io/ephemeral-v3/daily/bionic/amd64/20200629/

Revision history for this message
Dan Streetman (ddstreet) wrote :

systemd creates the /etc/machine-id file from a list of possible sources (see 'man machine-id'):

1. systemd.machine_id= kernel cmdline parameter
2. --machine-id= parameter to systemd binary (i.e. /sbin/init)
3. content of /etc/machine-id file
4. content of /var/lib/dbus/machine-id file
5. container_uuid= kernel cmdline parameter
6. KVM DMI product_uuid (i.e., UUID defined in libvirt xml, passed to qemu -uuid param)
7. devicetree VM uuid (for ppc guests)
8. randomly generated UUID

It appears that on first boot of cloud images, since they have an empty /etc/machine-id and no /var/lib/dbus/machine-id at all, this list falls through until the KVM uuid is used.

The downside of that, of course, is that the machine-id will be tied directly to the KVM uuid, so if the VM is simply wiped clean and reinstalled, it will get the exact same machine-id.
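
That tie is easy to verify on a deployed guest; these are the standard sysfs and systemd paths, and on an affected VM the two values should match modulo case and dashes:

    sudo cat /sys/class/dmi/id/product_uuid   # the uuid qemu was started with
    cat /etc/machine-id                       # 32 hex chars, no dashes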

It would probably be best for cloud images to forcibly generate a new random machine-id uuid on first boot.
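
A minimal sketch of such a first-boot hook, assuming dbus-uuidgen is an acceptable generator (it always produces a random id, whereas systemd-machine-id-setup walks the same source list quoted above):

    rm -f /etc/machine-id /var/lib/dbus/machine-id
    dbus-uuidgen --ensure=/etc/machine-id
    ln -s /etc/machine-id /var/lib/dbus/machine-id   # keep dbus in sync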

It's also possible that upstream systemd shouldn't be attempting to use uuid sources that are not random, such as the uuid provided to qemu, or at least it might be good to add a parameter to systemd-machine-id-setup (or some config file settings) to allow controlling what sources are looked at to find an initial machine-id.

Revision history for this message
Robert C Jennings (rcj) wrote :

Okay, maybe the key here is that it doesn't get a random uuid because it used the DMI info from the KVM VM, which doesn't change (since the pre-created KVM guests are 'registered' with MAAS, the actual instances are never re-created, just re-installed). In that case I think MAAS would need to provide something higher on that list on each deployment so that the generated machine-id is unique for VMs added as MAAS machines.

Revision history for this message
Lee Trager (ltrager) wrote :

lp:maas-images doesn't modify the SquashFS at all. The SquashFS is mounted with an overlay so the latest kernels are pulled from the archive. The published SquashFS hash matches what CPC publishes.

The installation is performed by Curtin: MAAS generates curtin.cfg, and Curtin formats the disk and writes the bits to the filesystem. Neither MAAS nor Curtin does anything with /etc/machine-id. As per the documentation, systemd seems to be doing the right thing.

You could modify the preseed[1] to run systemd-machine-id-setup to generate a new uuid, or create a kernel tag[2] to set a static one (sketches of both follow the links below).

[1] https://maas.io/docs/custom-node-setup-preseed#heading--curtin
[2] https://maas.io/docs/kernel-boot-options#heading--per-node-kernel-boot-options
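
A sketch of both options, assuming the formats in the linked docs still apply (the command name zz_machine_id and the uuid value are made up):

    # 1) curtin preseed: regenerate the id inside the installed target
    late_commands:
      zz_machine_id: ["curtin", "in-target", "--", "sh", "-c",
                      "rm -f /etc/machine-id && systemd-machine-id-setup"]

    # 2) per-node kernel boot option; this is source #1 in the list
    #    above, so it takes precedence over the DMI uuid
    systemd.machine_id=4f1c2a9b11de4b02a6dce3f18a2b9c55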

Revision history for this message
Lee Trager (ltrager) wrote :

As per the conversation with xnox, systemd is doing the correct thing. systemd is using the uuid from KVM because libvirt and OpenStack consider the machine to be the same. Part of the reason for this is that VMs don't have the same level of protection on their firmware as real hardware does; it's thus easier for a user to acquire a VM, install some malicious firmware, release the VM for another user to acquire, and then exploit the firmware to compromise that VM.

MAAS shouldn't be treating VMs as machines; it should be creating them on demand and destroying them when done. This would require significant work in MAAS.

Changed in maas:
status: New → Won't Fix
no longer affects: systemd
Revision history for this message
Vern Hart (vern) wrote :

Since this is a Won't Fix for MAAS, I see this as a bug in how the canonical-livepatch charm behaves.

A machine that has livepatch enabled can remove the machine token from the livepatch servers by issuing a disable command:

    $ canonical-livepatch status
    last check: 34 minutes ago
    kernel: 4.15.0-112.113-generic
    server check-in: succeeded
    patch state: ✓ no livepatches needed for this kernel yet
    $ canonical-livepatch disable
    Successfully disabled device. Removed machine-token: xxx

The trouble is when the machine is wiped without disabling livepatch first. Then neither enabling nor disabling works:

    $ canonical-livepatch status
    Machine is not enabled. Please run 'sudo canonical-livepatch enable' with the token obtained from https://ubuntu.com/livepatch.

    $ sudo canonical-livepatch enable xxx
    2020/07/30 12:29:41 error executing enable: cannot enable machine: bad temporary server status 500 (URL: https://livepatch.canonical.com/api/machine-tokens) server response: machine token already exists
    $ sudo canonical-livepatch disable
    $

The problem stems from the fact that /var/snap/canonical-livepatch/common/machine-token is generated (by the livepatch server) in response to a supplied machine-id, and without that machine-token the livepatch server won't let us disable.

It seems to me that the canonical-livepatch charm should disable livepatch when terminating. On a proper model-destroy, this would mean we won't have this issue on redeploy.

Additionally, I think it would be safe enough to allow disabling livepatch if we supply the livepatch key that the original token was generated with -- even if we no longer have the machine-token locally.

Revision history for this message
Casey Marshall (cmars) wrote :

Added the livepatch client charm project.

Changed in canonical-livepatch-client:
status: New → Won't Fix
Revision history for this message
Nobuto Murata (nobuto) wrote :

> Added the livepatch client charm project.

What would be the suggested action for the charm to take?

We had a comment on the duplicated private bug 1680611 as:

> Also machine-id is not meant to be used as an external identifier, and shouldn't be uploaded anywhere.
>
> Also we must not recommend users to destroy and/or recreate it, as doing so has many other consequences on their system, w.r.t. breaking dbus, dhcp leases, journald logs locally, remote journald logs, and identification locally of containers.

Can livepatch-client/server generate its own unique identifier or handle the machine-id-exist scenario more gracefully instead of just giving 500?

Revision history for this message
Haw Loeung (hloeung) wrote :

What would the livepatch client charm do? What about for those not using or deploying the livepatch client charm? Shouldn't this be handled by cloud-init?

Revision history for this message
Haw Loeung (hloeung) wrote :

I mean, someone uses MAAS to deploy KVM guests or physical servers, then uses a separate tool such as Puppet, Chef, or Ansible to configure the host. If they call the livepatch client to register, then release and redeploy, they'll still run into this.

I think instead something should happen on "deploy" to wipe the machine-id if it exists, whether that's done during enlistment or by MAAS telling cloud-init to do so.

tags: added: cdo-qa foundations-engine
David Coronel (davecore)
tags: added: field-medium
Revision history for this message
David Coronel (davecore) wrote :

subscribed ~field-medium

tags: removed: field-medium
Revision history for this message
David A. Desrosiers (setuid) wrote :

Just adding some additional context from the field: I've run into this quite a bit and have resorted to using virt-sysprep to wipe and regenerate the /etc/machine-id:

    sudo virt-sysprep --network \
        --operations=-ssh-hostkeys,tmp-files,logfiles,bash-history,package-manager-cache,customize \
        --delete /etc/machine-id \
        --run-command 'systemd-machine-id-setup' \
        -a *img

But that won't work on a bare-metal host provisioned by MAAS and managed externally to MAAS (e.g. with Ansible, Chef, or Puppet).

A large customer of ours ran into something similar late last year, where a specific vendor delivered a firmware revision that applied the exact same UUID to all bare-metal chassis.

As a result, 'dmidecode -s system-uuid' would not provide a valid uuid, and libvirt would get confused, especially when attempting to live-migrate instances between hosts, because it thought it was migrating back to itself.

Their workaround consisted of the following construct in their deployment automation:

host_uuid = ''
ruby_block 'generate host uuid' do
  block do
    # Mix in shell_out so external commands can be run from this block
    Chef::Resource::RubyBlock.send(:include, Chef::Mixin::ShellOut)
    # Name-based (MD5) uuid derived from the node's FQDN: stable per
    # host, but distinct across hosts
    cmd = shell_out("uuidgen --md5 --name #{node['fqdn']} --namespace @dns")
    host_uuid = cmd.stdout.chomp
  end
end

This generates a deterministic, name-based UUID (derived from each host's FQDN) that lands in /etc/libvirt/libvirtd.conf, so every host ends up with a distinct, stable value.

So it's not just MAAS that can trip on this problem with virtual instances; sometimes it happens on bare metal as well.

Revision history for this message
Casey Marshall (cmars) wrote :

The livepatch charm should `canonical-livepatch disable` on relation depart and stop hooks.
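
A minimal sketch of what a shell-based stop hook could do (the hook layout here is an assumption, not the charm's actual code):

    #!/bin/sh
    # hooks/stop -- deregister so the machine token is freed for a
    # future redeploy; ignore failure if livepatch was never enabled
    canonical-livepatch disable || true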

Casey Marshall (cmars)
Changed in canonical-livepatch-client:
status: Won't Fix → Confirmed
Revision history for this message
Haw Loeung (hloeung) wrote :

> The livepatch charm should `canonical-livepatch disable` on relation depart and stop hooks.

But that doesn't help those not using the livepatch charm, which looks like the case in the original report and in a bunch of bugs marked as duplicates of this one.

Should the livepatch client just detect the duplicate and generate a new identifier?

Adam Dyess (addyess)
Changed in charm-canonical-livepatch:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Adam Dyess (addyess)
Revision history for this message
Adam Dyess (addyess) wrote :

I wanted to mention here that while this did point out a trivial bug to fix in charm-canonical-livepatch, the fix there only offers a workaround for deployments where juju and that charm are used -- it does not address the more overarching problem associated with this bug. Please continue looking at other avenues for a final solution.

Changed in charm-canonical-livepatch:
importance: Medium → Low
Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

This is still a problem for charm deployments. Is there a simple way to work around this? I've tried regenerating the machine-id on the affected units, but that does not remove the old ids from the livepatch database, so I am stuck with a few units unable to register.

Revision history for this message
Domas Monkus (tasdomas) wrote : Re: [Bug 1886994] Re: Bionic/Focal VMs get the same machine-id

Hello,
duplicate machine ids are a known issue. However, regenerating machine ids should allow you to enable the livepatch client on the affected machines.

Could you provide more information as to how this is failing?

Domas Monkus


Revision history for this message
Gábor Mészáros (gabor.meszaros) wrote :

Camille,

There are two ways I'm aware of to work around the problem.
Either you recommission the VMs in question or run something like this:

for unit in $(juju status -m openstack | grep canonical-livepatch | \
    tail -n +2 | grep -v active | tr -d '*' | awk '{print $1}'); do \
    juju ssh -m openstack $unit sudo bash -c \
    '"'"rm /etc/machine-id && dbus-uuidgen --ensure=/etc/machine-id"'"'; done

Feel free to suggest a cleaner, improved snippet with less UUOG.
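
One possibly cleaner variant (untested) is to let juju fan the command out itself instead of parsing `juju status`; note this hits every unit of the application, not only the non-active ones:

    juju run -m openstack --application canonical-livepatch \
        'rm /etc/machine-id && dbus-uuidgen --ensure=/etc/machine-id'

Since `juju run` executes as root on the unit, no sudo is needed.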

James Troup (elmo)
Changed in charm-canonical-livepatch:
status: In Progress → Fix Released
Domas Monkus (tasdomas)
Changed in canonical-livepatch-client:
status: Confirmed → Fix Released
status: Fix Released → Won't Fix