diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/changelog xen-4.11.3+24-g14b62ab3e5/debian/changelog --- xen-4.11.3+24-g14b62ab3e5/debian/changelog 2020-03-09 15:17:56.000000000 +0000 +++ xen-4.11.3+24-g14b62ab3e5/debian/changelog 2022-07-13 14:07:29.000000000 +0100 @@ -1,3 +1,100 @@ +xen (4.11.3+24-g14b62ab3e5-1ubuntu2.3) focal-security; urgency=medium + + * SECURITY UPDATE: CVE-2020-0543, CVE-2020-11739, CVE-2020-11740, + CVE-2020-11741, CVE-2020-11742, CVE-2020-11743, CVE-2020-15563, + CVE-2020-15565, CVE-2020-15566, CVE-2020-25595, CVE-2020-25596, + CVE-2020-25597, CVE-2020-25599, CVE-2020-25600, CVE-2020-25601, + CVE-2020-25602, CVE-2020-25603, CVE-2020-25604, CVE-2020-27670, + CVE-2020-27671, CVE-2020-27672, CVE-2020-27674, CVE-2020-28368, + CVE-2020-29040, CVE-2020-29479, CVE-2020-29480, CVE-2020-29481, + CVE-2020-29482, CVE-2020-29483, CVE-2020-29484, CVE-2020-29485, + CVE-2020-29486, CVE-2020-29566, CVE-2020-29570, CVE-2020-29571, + CVE-2021-0089, CVE-2021-26313, CVE-2021-26933, CVE-2021-27379, + CVE-2021-28689, CVE-2021-28690, CVE-2021-28692, CVE-2021-28694, + CVE-2021-28695, CVE-2021-28696, CVE-2021-28697, CVE-2021-28698, + CVE-2021-28699, CVE-2021-28701, CVE-2021-28704, CVE-2021-28705, + CVE-2021-28706, CVE-2021-28707, CVE-2021-28708, CVE-2021-28709, + CVE-2022-23034, CVE-2022-23035, CVE-2022-26356, CVE-2022-26357, + CVE-2022-26358, CVE-2022-26359, CVE-2022-26360, CVE-2022-26361, + CVE-2022-26362, CVE-2022-26363 and CVE-2022-26364 (LP: #1970507). + - Also fixes CVE-2018-3639 on Arm systems. + - debian/patches/*.patch: New patches from upstream security advisories. + Some were backported to this version. + - debian/patches/evtchn-fifo-use-stable-fields-when-recording-last-queue-information.patch, + debian/patches/xen-evtchn-rework-per-event-channel-lock.patch: + New patches from stable-4.11 branch in upstream Git needed to apply + xen-events-access-last_priority-and-last_vcpu_id-together.patch. + - debian/patches/xen-events-access-last_priority-and-last_vcpu_id-together.patch: + New patch from stable-4.11 branch in upstream Git needed to apply + fix_event_channel_race.patch. + - debian/patches/x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch: + New backported patch from master branch in upstream Git needed to apply + 0002-SUPPORT.md-Un-shimmed-32-bit-PV-guests-are-no-longer.patch. + - debian/patches/xen-split-parameter-related-definitions-in-own-header-file.patch: + New backported patch from master branch in upstream Git needed to apply + x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch. + - debian/patches/fix_event_channel_race.patch: New patch from stable-4.11 + branch in upstream Git needed to apply xsa358-4.14.patch. + - debian/patches/AMD-IOMMU-fix-off-by-one-in-amd_iommu_get_paging_mode-callers.patch: + New patch from stable-4.11 branch in upstream Git needed to apply + xsa378-4.11-6.patch. + - debian/patches/xen-arm-Simplify-alternative-patching-of-non-writable-region.patch: + New patch from stable-4.12 branch in upstream Git needed to apply + xen-arm-alternatives-Add-dynamic-patching-feature.patch. + - debian/patches/xen-arm64-entry-Use-named-label-in-guest_sync.patch, + debian/patches/xen-arm-alternatives-Add-dynamic-patching-feature.patch: + New backported patches from stable-4.12 branch in upstream Git needed to + apply xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch. 
+ - debian/patches/xen-arm64-Add-generic-assembly-macros.patch: New backported + patch from stable-4.12 branch in upstream Git needed to apply + xsa398-4.12-4-xen-arm-Add-Spectre-BHB-handling.patch. + - debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-probing.patch, + debian/patches/xen-arm-Add-command-line-option-to-control-SSBD-mitigation.patch, + debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-support-for-guests.patch, + debian/patches/xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch: + New backported patches from stable-4.12 branch in upstream Git that fix + CVE-2018-3639 for Arm systems, and some of these are needed to apply + xsa398-4.12-5-xen-arm-Allow-to-discover-and-use-SMCCC_ARCH_WORKARO.patch. + - debian/patches/VT-d-dont-pass-bridge-devices-to-domain_context_mapping_one.patch: + New backported patch from stable-4.12 branch in upstream Git needed to + apply xsa400-4.12-04.patch. + - debian/patches/amd-iommu-get-rid-of-pointless-IOMMU_PAGING_MODE_LEVEL_X-definitions.patch: + New backported patch from stable-4.12 branch in upstream Git needed to + apply xsa400-4.12-10.patch. + - debian/patches/x86-feature-Generalise-synth-and-introduce-a-bug-word.patch, + debian/patches/x86-AMD-Fix-handling-of-x87-exception-pointers-on-Fam17h-hardware.patch: + New backported patches from stable-4.13 branch in upstream Git needed to + apply xsa402-4.13-4.patch. + - debian/patches/x86-cpu-intel-Clear-cache-self-snoop-capability-in-CPUs-with-known-errata.patch: + New backported patch from stable-4.13 branch in upstream Git needed to + apply xsa402-4.13-5.patch. + * debian/not-installed: Do not install systemd-specific files. + * debian/source/lintian-overrides: Override debhelper-but-no-misc-depends + warnings for transitional packages. + * debian/control: Add ${misc:Depends} to dependencies of xen-doc. + * debian/control: Remove build dependency on autotools-dev. 
+ + -- Luís Infante da Câmara Wed, 13 Jul 2022 14:07:29 +0100 + +xen (4.11.3+24-g14b62ab3e5-1ubuntu2.2) focal; urgency=medium + + * Fix FTBFS on armhf/arm64 due to missing : + - d/p/lp1956166-0006-fix-ftbfs-arm-lzo-unaligned.h.patch + + -- Mauricio Faria de Oliveira Thu, 07 Jul 2022 13:53:37 -0300 + +xen (4.11.3+24-g14b62ab3e5-1ubuntu2.1) focal; urgency=medium + + * Add support for zstd compressed kernels for Dom0/DomU on x86 (LP: #1956166) + - d/p/lp1956166-0001-introduce-unaligned.h.patch + - d/p/lp1956166-0002-lib-introduce-xxhash.patch + - d/p/lp1956166-0003-x86-Dom0-support-zstd-compressed-kernels.patch + - d/p/lp1956166-0004-libxenguest-add-get_unaligned_le32.patch + - d/p/lp1956166-0005-libxenguest-support-zstd-compressed-kernels.patch + - d/control: add libzstd-dev as build-dep + + -- Mauricio Faria de Oliveira Mon, 04 Jul 2022 16:02:20 -0300 + xen (4.11.3+24-g14b62ab3e5-1ubuntu2) focal; urgency=medium * Update: Building hypervisor with cf-protection enabled diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/control xen-4.11.3+24-g14b62ab3e5/debian/control --- xen-4.11.3+24-g14b62ab3e5/debian/control 2020-02-28 13:14:00.000000000 +0000 +++ xen-4.11.3+24-g14b62ab3e5/debian/control 2022-07-13 14:06:12.000000000 +0100 @@ -4,11 +4,10 @@ XSBC-Original-Maintainer: Debian Xen Team Uploaders: Guido Trotter , Bastian Blank , Ian Jackson Section: admin -Standards-Version: 3.9.4 +Standards-Version: 4.5.0 Build-Depends: debhelper (>= 10), dh-exec, - autotools-dev, dpkg-dev (>= 1.16.0~), rdfind, lsb-release, @@ -34,6 +33,7 @@ ocaml-native-compilers | ocaml-nox, ocaml-findlib, lmodern, + libzstd-dev, XS-Python-Version: current Homepage: https://xenproject.org/ Vcs-Browser: https://salsa.debian.org/xen-team/debian-xen diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/libxen-4.11.bug-control xen-4.11.3+24-g14b62ab3e5/debian/libxen-4.11.bug-control --- xen-4.11.3+24-g14b62ab3e5/debian/libxen-4.11.bug-control 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/libxen-4.11.bug-control 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,2 @@ +# autogenerated, do not edit +Submit-As: src:xen diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.install xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.install --- xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.install 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.install 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,9 @@ +# autogenerated, do not edit +usr/lib/*/libxenctrl.so.* +usr/lib/*/libxenguest.so.* +usr/lib/*/libxenlight.so.* +usr/lib/*/libxenstat.so.* +usr/lib/*/libxenvchan.so.* +usr/lib/*/libxlutil.so.* +usr/lib/xen-4.11/lib/*/libfsimage* +usr/lib/xen-4.11/lib/*/fs diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.lintian-overrides xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.lintian-overrides --- xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.lintian-overrides 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/libxenmisc4.11.lintian-overrides 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,8 @@ +# autogenerated, do not edit +no-symbols-control-file usr/lib/*/lib*.so.4.11.0 +# ^ the ABI changes every Xen release and every Debian release anyway +# and we do not upload to Debian packages based on Xen upstream +# versions which are at least an rc with a stable ABI. + +package-name-doesnt-match-sonames +# ^ yes, this is a portmanteau package. They all change at once. 
diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/not-installed xen-4.11.3+24-g14b62ab3e5/debian/not-installed --- xen-4.11.3+24-g14b62ab3e5/debian/not-installed 2020-02-28 13:14:00.000000000 +0000 +++ xen-4.11.3+24-g14b62ab3e5/debian/not-installed 2022-06-17 08:58:55.000000000 +0100 @@ -34,3 +34,7 @@ # If someone wants this, suggestions from ocaml experts on what # to ship where would be welcome. usr/local/lib/ocaml + +# systemd-specific files are not installed in this version +usr/lib/modules-load.d +usr/lib/systemd diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-SUPPORT.md-Document-speculative-attacks-status-of-no.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-SUPPORT.md-Document-speculative-attacks-status-of-no.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-SUPPORT.md-Document-speculative-attacks-status-of-no.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-SUPPORT.md-Document-speculative-attacks-status-of-no.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,55 @@ +From 4e37e21f6e71752fb69c27ab9f1417a5d19ebedb Mon Sep 17 00:00:00 2001 +From: Ian Jackson +Date: Tue, 9 Mar 2021 15:00:47 +0000 +Subject: [PATCH 1/2] SUPPORT.md: Document speculative attacks status of + non-shim 32-bit PV + +This documents, but does not fix, XSA-370. + +Reported-by: Jann Horn +Signed-off-by: Ian Jackson +Signed-off-by: George Dunlap +Acked-by: Jan Beulich +--- + +NB that the security team does not consider the security support +status of un-shimmed 32-bit PV guests in this patch to be particularly +useful. However, we do not consider ourselves to have the authority to decide +to completely de-support 32-bit PV guests without community consultation. + +The support status in this patch should therefore be considered +transitional. A permanent support status is proposed in a subsequent +patch in this series. + +v2: +- Fix double 'be' +- Don't mention user -> kernel attacks, which have nothing to do with Xen +--- + SUPPORT.md | 11 ++++++++++- + 1 file changed, 10 insertions(+), 1 deletion(-) + +diff --git a/SUPPORT.md b/SUPPORT.md +index 7db4568f1a..6dcd93e22f 100644 +--- a/SUPPORT.md ++++ b/SUPPORT.md +@@ -84,7 +84,16 @@ Traditional Xen PV guest + + No hardware requirements + +- Status: Supported ++ Status, x86_64: Supported ++ Status, x86_32, shim: Supported ++ Status, x86_32, without shim: Supported, with caveats ++ ++Due to architectural limitations, ++32-bit PV guests must be assumed to be able to read arbitrary host memory ++using speculative execution attacks. ++Advisories will continue to be issued ++for new vulnerabilities related to un-shimmed 32-bit PV guests ++enabling denial-of-service attacks or privilege escalation attacks. 
+ + ### x86/HVM + +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-ocaml-xenstored-ignore-transaction-id-for-un-w.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-ocaml-xenstored-ignore-transaction-id-for-un-w.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-ocaml-xenstored-ignore-transaction-id-for-un-w.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-ocaml-xenstored-ignore-transaction-id-for-un-w.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,43 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: ignore transaction id for [un]watch +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH +commands as it is documented in docs/misc/xenstore.txt, it is tested +for validity today. + +Really ignore the transaction id for XS_WATCH and XS_UNWATCH. + +This is part of XSA-115. + +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml +index 74c69f869c..0a0e43d1f0 100644 +--- a/tools/ocaml/xenstored/process.ml ++++ b/tools/ocaml/xenstored/process.ml +@@ -492,12 +492,19 @@ let retain_op_in_history ty = + | Xenbus.Xb.Op.Reset_watches + | Xenbus.Xb.Op.Invalid -> false + ++let maybe_ignore_transaction = function ++ | Xenbus.Xb.Op.Watch | Xenbus.Xb.Op.Unwatch -> fun tid -> ++ if tid <> Transaction.none then ++ debug "Ignoring transaction ID %d for watch/unwatch" tid; ++ Transaction.none ++ | _ -> fun x -> x ++ + (** + * Nothrow guarantee. + *) + let process_packet ~store ~cons ~doms ~con ~req = + let ty = req.Packet.ty in +- let tid = req.Packet.tid in ++ let tid = maybe_ignore_transaction ty req.Packet.tid in + let rid = req.Packet.rid in + try + let fct = function_of_type ty in diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-xenstore-allow-removing-child-of-a-node-exceed.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-xenstore-allow-removing-child-of-a-node-exceed.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-xenstore-allow-removing-child-of-a-node-exceed.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-tools-xenstore-allow-removing-child-of-a-node-exceed.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,157 @@ +From e92f3dfeaae21a335e666c9247954424e34e5c56 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:37 +0200 +Subject: [PATCH 01/10] tools/xenstore: allow removing child of a node + exceeding quota + +An unprivileged user of Xenstore is not allowed to write nodes with a +size exceeding a global quota, while privileged users like dom0 are +allowed to write such nodes. The size of a node is the needed space +to store all node specific data, this includes the names of all +children of the node. + +When deleting a node its parent has to be modified by removing the +name of the to be deleted child from it. + +This results in the strange situation that an unprivileged owner of a +node might not succeed in deleting that node in case its parent is +exceeding the quota of that unprivileged user (it might have been +written by dom0), as the user is not allowed to write the updated +parent node. + +Fix that by not checking the quota when writing a node for the +purpose of removing a child's name only. 
+ +The same applies to transaction handling: a node being read during a +transaction is written to the transaction specific area and it should +not be tested for exceeding the quota, as it might not be owned by +the reader and presumably the original write would have failed if the +node is owned by the reader. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 20 +++++++++++--------- + tools/xenstore/xenstored_core.h | 3 ++- + tools/xenstore/xenstored_transaction.c | 2 +- + 3 files changed, 14 insertions(+), 11 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index 97ceabf9642d..b43e1018babd 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -417,7 +417,8 @@ static struct node *read_node(struct connection *conn, const void *ctx, + return node; + } + +-int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node) ++int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node, ++ bool no_quota_check) + { + TDB_DATA data; + void *p; +@@ -427,7 +428,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node) + + node->num_perms*sizeof(node->perms[0]) + + node->datalen + node->childlen; + +- if (domain_is_unprivileged(conn) && ++ if (!no_quota_check && domain_is_unprivileged(conn) && + data.dsize >= quota_max_entry_size) { + errno = ENOSPC; + return errno; +@@ -455,14 +456,15 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node) + return 0; + } + +-static int write_node(struct connection *conn, struct node *node) ++static int write_node(struct connection *conn, struct node *node, ++ bool no_quota_check) + { + TDB_DATA key; + + if (access_node(conn, node, NODE_ACCESS_WRITE, &key)) + return errno; + +- return write_node_raw(conn, &key, node); ++ return write_node_raw(conn, &key, node, no_quota_check); + } + + static enum xs_perm_type perm_for_conn(struct connection *conn, +@@ -999,7 +1001,7 @@ static struct node *create_node(struct connection *conn, const void *ctx, + /* We write out the nodes down, setting destructor in case + * something goes wrong. 
*/ + for (i = node; i; i = i->parent) { +- if (write_node(conn, i)) { ++ if (write_node(conn, i, false)) { + domain_entry_dec(conn, i); + return NULL; + } +@@ -1039,7 +1041,7 @@ static int do_write(struct connection *conn, struct buffered_data *in) + } else { + node->data = in->buffer + offset; + node->datalen = datalen; +- if (write_node(conn, node)) ++ if (write_node(conn, node, false)) + return errno; + } + +@@ -1115,7 +1117,7 @@ static int remove_child_entry(struct connection *conn, struct node *node, + size_t childlen = strlen(node->children + offset); + memdel(node->children, offset, childlen + 1, node->childlen); + node->childlen -= childlen + 1; +- return write_node(conn, node); ++ return write_node(conn, node, true); + } + + +@@ -1254,7 +1256,7 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + node->num_perms = num; + domain_entry_inc(conn, node); + +- if (write_node(conn, node)) ++ if (write_node(conn, node, false)) + return errno; + + fire_watches(conn, in, name, false); +@@ -1514,7 +1516,7 @@ static void manual_node(const char *name, const char *child) + if (child) + node->childlen = strlen(child) + 1; + +- if (write_node(NULL, node)) ++ if (write_node(NULL, node, false)) + barf_perror("Could not create initial node %s", name); + talloc_free(node); + } +diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h +index 56a279cfbb47..3cb1c235a101 100644 +--- a/tools/xenstore/xenstored_core.h ++++ b/tools/xenstore/xenstored_core.h +@@ -149,7 +149,8 @@ void send_ack(struct connection *conn, enum xsd_sockmsg_type type); + char *canonicalize(struct connection *conn, const void *ctx, const char *node); + + /* Write a node to the tdb data base. */ +-int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node); ++int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node, ++ bool no_quota_check); + + /* Get this node, checking we have permissions. */ + struct node *get_node(struct connection *conn, +diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c +index 2824f7b359b8..e87897573469 100644 +--- a/tools/xenstore/xenstored_transaction.c ++++ b/tools/xenstore/xenstored_transaction.c +@@ -276,7 +276,7 @@ int access_node(struct connection *conn, struct node *node, + i->check_gen = true; + if (node->generation != NO_GENERATION) { + set_tdb_key(trans_name, &local_key); +- ret = write_node_raw(conn, &local_key, node); ++ ret = write_node_raw(conn, &local_key, node, true); + if (ret) + goto err; + i->ta_node = true; +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,94 @@ +From edbe70427e17743351f1b739ea1536acd757ae6c Mon Sep 17 00:00:00 2001 +From: Wei Liu +Date: Sat, 11 Jan 2020 21:57:41 +0000 +Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single + exit path + +We will soon need to perform clean-ups before returning. + +No functional change. + +This is part of XSA-345. 
+ +Reported-by: Hongyan Xia +Signed-off-by: Wei Liu +Signed-off-by: Hongyan Xia +Signed-off-by: George Dunlap +Acked-by: Jan Beulich +--- + xen/arch/x86/mm.c | 17 +++++++++++------ + 1 file changed, 11 insertions(+), 6 deletions(-) + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index 626768a950..79a3fac3cc 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -5194,6 +5194,7 @@ int map_pages_to_xen( + l2_pgentry_t *pl2e, ol2e; + l1_pgentry_t *pl1e, ol1e; + unsigned int i; ++ int rc = -ENOMEM; + + #define flush_flags(oldf) do { \ + unsigned int o_ = (oldf); \ +@@ -5214,7 +5215,8 @@ int map_pages_to_xen( + l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt); + + if ( !pl3e ) +- return -ENOMEM; ++ goto out; ++ + ol3e = *pl3e; + + if ( cpu_has_page1gb && +@@ -5302,7 +5304,7 @@ int map_pages_to_xen( + + pl2e = alloc_xen_pagetable(); + if ( pl2e == NULL ) +- return -ENOMEM; ++ goto out; + + for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ ) + l2e_write(pl2e + i, +@@ -5331,7 +5333,7 @@ int map_pages_to_xen( + + pl2e = virt_to_xen_l2e(virt); + if ( !pl2e ) +- return -ENOMEM; ++ goto out; + + if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) & + ((1u << PAGETABLE_ORDER) - 1)) == 0) && +@@ -5374,7 +5376,7 @@ int map_pages_to_xen( + { + pl1e = virt_to_xen_l1e(virt); + if ( pl1e == NULL ) +- return -ENOMEM; ++ goto out; + } + else if ( l2e_get_flags(*pl2e) & _PAGE_PSE ) + { +@@ -5401,7 +5403,7 @@ int map_pages_to_xen( + + pl1e = alloc_xen_pagetable(); + if ( pl1e == NULL ) +- return -ENOMEM; ++ goto out; + + for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ ) + l1e_write(&pl1e[i], +@@ -5545,7 +5547,10 @@ int map_pages_to_xen( + + #undef flush_flags + +- return 0; ++ rc = 0; ++ ++ out: ++ return rc; + } + + int populate_pt_range(unsigned long virt, unsigned long nr_mfns) +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-SUPPORT.md-Un-shimmed-32-bit-PV-guests-are-no-longer.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-SUPPORT.md-Un-shimmed-32-bit-PV-guests-are-no-longer.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-SUPPORT.md-Un-shimmed-32-bit-PV-guests-are-no-longer.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-SUPPORT.md-Un-shimmed-32-bit-PV-guests-are-no-longer.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,72 @@ +From: George Dunlap +Subject: SUPPORT.md: Un-shimmed 32-bit PV guests are no longer supported + +The support status of 32-bit guests doesn't seem particularly useful. + +With it changed to fully unsupported outside of PV-shim, adjust the PV32 +Kconfig default accordingly. + +Reported-by: Jann Horn +Signed-off-by: George Dunlap +Signed-off-by: Jan Beulich +--- + +NB this patch should be considered a proposal to the community. It +will not become effective until three weeks after the XSA-370 embargo +lifts, and only if there are no objections raised before that point. + +TBD: Should we also default opt_pv32 to false when not running in shim + mode? + +The (forward) dependency on PV_SHIM isn't very useful especially when +configuring from scratch - we may want to re-order items down the road, +such that the prompt for PV_SHIM occurs ahead of that for PV32. Yet then +this conflicts with PV_SHIM also depending on GUEST. + +v3: +- Add Kconfig adjustment. 
+ +v2: +- Port over changes in patch 1 + +--- a/SUPPORT.md ++++ b/SUPPORT.md +@@ -86,14 +86,7 @@ No hardware requirements + + Status, x86_64: Supported + Status, x86_32, shim: Supported +- Status, x86_32, without shim: Supported, with caveats +- +-Due to architectural limitations, +-32-bit PV guests must be assumed to be able to read arbitrary host memory +-using speculative execution attacks. +-Advisories will continue to be issued +-for new vulnerabilities related to un-shimmed 32-bit PV guests +-enabling denial-of-service attacks or privilege escalation attacks. ++ Status, x86_32, without shim: Supported, not security supported + + ### x86/HVM + +--- a/xen/arch/x86/Kconfig ++++ b/xen/arch/x86/Kconfig +@@ -56,7 +56,7 @@ config PV + config PV32 + bool "Support for 32bit PV guests" + depends on PV +- default y ++ default PV_SHIM + ---help--- + The 32bit PV ABI uses Ring1, an area of the x86 architecture which + was deprecated and mostly removed in the AMD64 spec. As a result, +@@ -67,7 +67,10 @@ config PV32 + reduction, or performance reasons. Backwards compatibility can be + provided via the PV Shim mechanism. + +- If unsure, say Y. ++ Note that outside of PV Shim, 32-bit PV guests are not security ++ supported anymore. ++ ++ If unsure, use the default setting. + + config PV_LINEAR_PT + bool "Support for PV linear pagetables" diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-ocaml-xenstored-check-privilege-for-XS_IS_DOMA.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-ocaml-xenstored-check-privilege-for-XS_IS_DOMA.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-ocaml-xenstored-check-privilege-for-XS_IS_DOMA.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-ocaml-xenstored-check-privilege-for-XS_IS_DOMA.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,30 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: check privilege for XS_IS_DOMAIN_INTRODUCED +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for privileged +domains only (the only user in the tree is the xenpaging daemon). + +This is part of XSA-115. 
+ +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml +index 0a0e43d1f0..f374abe998 100644 +--- a/tools/ocaml/xenstored/process.ml ++++ b/tools/ocaml/xenstored/process.ml +@@ -166,7 +166,9 @@ let do_setperms con t domains cons data = + let do_error con t domains cons data = + raise Define.Unknown_operation + +-let do_isintroduced con t domains cons data = ++let do_isintroduced con _t domains _cons data = ++ if not (Connection.is_dom0 con) ++ then raise Define.Permission_denied; + let domid = + match (split None '\000' data) with + | domid :: _ -> int_of_string domid diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-xenstore-ignore-transaction-id-for-un-watch.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-xenstore-ignore-transaction-id-for-un-watch.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-xenstore-ignore-transaction-id-for-un-watch.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-tools-xenstore-ignore-transaction-id-for-un-watch.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,86 @@ +From e8076f73de65c4816f69d6ebf75839c706145fcd Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:38 +0200 +Subject: [PATCH 02/10] tools/xenstore: ignore transaction id for [un]watch + +Instead of ignoring the transaction id for XS_WATCH and XS_UNWATCH +commands as it is documented in docs/misc/xenstore.txt, it is tested +for validity today. + +Really ignore the transaction id for XS_WATCH and XS_UNWATCH. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 26 ++++++++++++++++---------- + 1 file changed, 16 insertions(+), 10 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index b43e1018babd..bb2f9fd4e76e 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -1268,13 +1268,17 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + static struct { + const char *str; + int (*func)(struct connection *conn, struct buffered_data *in); ++ unsigned int flags; ++#define XS_FLAG_NOTID (1U << 0) /* Ignore transaction id. 
*/ + } const wire_funcs[XS_TYPE_COUNT] = { + [XS_CONTROL] = { "CONTROL", do_control }, + [XS_DIRECTORY] = { "DIRECTORY", send_directory }, + [XS_READ] = { "READ", do_read }, + [XS_GET_PERMS] = { "GET_PERMS", do_get_perms }, +- [XS_WATCH] = { "WATCH", do_watch }, +- [XS_UNWATCH] = { "UNWATCH", do_unwatch }, ++ [XS_WATCH] = ++ { "WATCH", do_watch, XS_FLAG_NOTID }, ++ [XS_UNWATCH] = ++ { "UNWATCH", do_unwatch, XS_FLAG_NOTID }, + [XS_TRANSACTION_START] = { "TRANSACTION_START", do_transaction_start }, + [XS_TRANSACTION_END] = { "TRANSACTION_END", do_transaction_end }, + [XS_INTRODUCE] = { "INTRODUCE", do_introduce }, +@@ -1296,7 +1300,7 @@ static struct { + + static const char *sockmsg_string(enum xsd_sockmsg_type type) + { +- if ((unsigned)type < XS_TYPE_COUNT && wire_funcs[type].str) ++ if ((unsigned int)type < ARRAY_SIZE(wire_funcs) && wire_funcs[type].str) + return wire_funcs[type].str; + + return "**UNKNOWN**"; +@@ -1311,7 +1315,14 @@ static void process_message(struct connection *conn, struct buffered_data *in) + enum xsd_sockmsg_type type = in->hdr.msg.type; + int ret; + +- trans = transaction_lookup(conn, in->hdr.msg.tx_id); ++ if ((unsigned int)type >= XS_TYPE_COUNT || !wire_funcs[type].func) { ++ eprintf("Client unknown operation %i", type); ++ send_error(conn, ENOSYS); ++ return; ++ } ++ ++ trans = (wire_funcs[type].flags & XS_FLAG_NOTID) ++ ? NULL : transaction_lookup(conn, in->hdr.msg.tx_id); + if (IS_ERR(trans)) { + send_error(conn, -PTR_ERR(trans)); + return; +@@ -1320,12 +1331,7 @@ static void process_message(struct connection *conn, struct buffered_data *in) + assert(conn->transaction == NULL); + conn->transaction = trans; + +- if ((unsigned)type < XS_TYPE_COUNT && wire_funcs[type].func) +- ret = wire_funcs[type].func(conn, in); +- else { +- eprintf("Client unknown operation %i", type); +- ret = ENOSYS; +- } ++ ret = wire_funcs[type].func(conn, in); + if (ret) + send_error(conn, ret); + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,68 @@ +From 7101786be91dce650b6e79f1374c580c731bb348 Mon Sep 17 00:00:00 2001 +From: Wei Liu +Date: Sat, 11 Jan 2020 21:57:42 +0000 +Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit + path + +We will soon need to perform clean-ups before returning. + +No functional change. + +This is part of XSA-345. + +Reported-by: Hongyan Xia +Signed-off-by: Wei Liu +Signed-off-by: Hongyan Xia +Signed-off-by: George Dunlap +Acked-by: Jan Beulich +--- + xen/arch/x86/mm.c | 12 +++++++++--- + 1 file changed, 9 insertions(+), 3 deletions(-) + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index 79a3fac3cc..8ed3ecacbe 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -5577,6 +5577,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + l1_pgentry_t *pl1e; + unsigned int i; + unsigned long v = s; ++ int rc = -ENOMEM; + + /* Set of valid PTE bits which may be altered. 
*/ + #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT) +@@ -5618,7 +5619,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + /* PAGE1GB: shatter the superpage and fall through. */ + pl2e = alloc_xen_pagetable(); + if ( !pl2e ) +- return -ENOMEM; ++ goto out; ++ + for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ ) + l2e_write(pl2e + i, + l2e_from_pfn(l3e_get_pfn(*pl3e) + +@@ -5673,7 +5675,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + /* PSE: shatter the superpage and try again. */ + pl1e = alloc_xen_pagetable(); + if ( !pl1e ) +- return -ENOMEM; ++ goto out; ++ + for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ ) + l1e_write(&pl1e[i], + l1e_from_pfn(l2e_get_pfn(*pl2e) + i, +@@ -5802,7 +5805,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + flush_area(NULL, FLUSH_TLB_GLOBAL); + + #undef FLAGS_MASK +- return 0; ++ rc = 0; ++ ++ out: ++ return rc; + } + + #undef flush_area +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-ocaml-xenstored-unify-watch-firing.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-ocaml-xenstored-unify-watch-firing.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-ocaml-xenstored-unify-watch-firing.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-ocaml-xenstored-unify-watch-firing.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,29 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: unify watch firing +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +This will make it easier insert additional checks in a follow-up patch. +All watches are now fired from a single function. + +This is part of XSA-115. + +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml +index be9c62f27f..d7432c6597 100644 +--- a/tools/ocaml/xenstored/connection.ml ++++ b/tools/ocaml/xenstored/connection.ml +@@ -210,8 +210,7 @@ let fire_watch watch path = + end else + path + in +- let data = Utils.join_by_null [ new_path; watch.token; "" ] in +- send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data ++ fire_single_watch { watch with path = new_path } + + (* Search for a valid unused transaction id. *) + let rec valid_transaction_id con proposed_id = diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-xenstore-fix-node-accounting-after-failed-node.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-xenstore-fix-node-accounting-after-failed-node.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-xenstore-fix-node-accounting-after-failed-node.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-tools-xenstore-fix-node-accounting-after-failed-node.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,104 @@ +From b8c6dbb67ebb449126023446a7d209eedf966537 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:39 +0200 +Subject: [PATCH 03/10] tools/xenstore: fix node accounting after failed node + creation + +When a node creation fails the number of nodes of the domain should be +the same as before the failed node creation. In case of failure when +trying to create a node requiring to create one or more intermediate +nodes as well (e.g. 
when /a/b/c/d is to be created, but /a/b isn't +existing yet) it might happen that the number of nodes of the creating +domain is not reset to the value it had before. + +So move the quota accounting out of construct_node() and into the node +write loop in create_node() in order to be able to undo the accounting +in case of an error in the intermediate node destructor. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Paul Durrant +Acked-by: Julien Grall +--- + tools/xenstore/xenstored_core.c | 37 ++++++++++++++++++++++----------- + 1 file changed, 25 insertions(+), 12 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index bb2f9fd4e76e..db9b9ca7957d 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -925,11 +925,6 @@ static struct node *construct_node(struct connection *conn, const void *ctx, + if (!parent) + return NULL; + +- if (domain_entry(conn) >= quota_nb_entry_per_domain) { +- errno = ENOSPC; +- return NULL; +- } +- + /* Add child to parent. */ + base = basename(name); + baselen = strlen(base) + 1; +@@ -962,7 +957,6 @@ static struct node *construct_node(struct connection *conn, const void *ctx, + node->children = node->data = NULL; + node->childlen = node->datalen = 0; + node->parent = parent; +- domain_entry_inc(conn, node); + return node; + + nomem: +@@ -982,6 +976,9 @@ static int destroy_node(void *_node) + key.dsize = strlen(node->name); + + tdb_delete(tdb_ctx, key); ++ ++ domain_entry_dec(talloc_parent(node), node); ++ + return 0; + } + +@@ -998,18 +995,34 @@ static struct node *create_node(struct connection *conn, const void *ctx, + node->data = data; + node->datalen = datalen; + +- /* We write out the nodes down, setting destructor in case +- * something goes wrong. */ ++ /* ++ * We write out the nodes bottom up. ++ * All new created nodes will have i->parent set, while the final ++ * node will be already existing and won't have i->parent set. ++ * New nodes are subject to quota handling. ++ * Initially set a destructor for all new nodes removing them from ++ * TDB again and undoing quota accounting for the case of an error ++ * during the write loop. ++ */ + for (i = node; i; i = i->parent) { +- if (write_node(conn, i, false)) { +- domain_entry_dec(conn, i); ++ /* i->parent is set for each new node, so check quota. */ ++ if (i->parent && ++ domain_entry(conn) >= quota_nb_entry_per_domain) { ++ errno = ENOSPC; + return NULL; + } +- talloc_set_destructor(i, destroy_node); ++ if (write_node(conn, i, false)) ++ return NULL; ++ ++ /* Account for new node, set destructor for error case. 
*/ ++ if (i->parent) { ++ domain_entry_inc(conn, i); ++ talloc_set_destructor(i, destroy_node); ++ } + } + + /* OK, now remove destructors so they stay around */ +- for (i = node; i; i = i->parent) ++ for (i = node; i->parent; i = i->parent) + talloc_set_destructor(i, NULL); + return node; + } +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,249 @@ +From e7bbc4a0b5af76a82f0dcf4afcbf1509b020eb73 Mon Sep 17 00:00:00 2001 +From: Hongyan Xia +Date: Sat, 11 Jan 2020 21:57:43 +0000 +Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates + +map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB +superpages if possible, to maximize TLB efficiency. This means both +replacing superpage entries with smaller entries, and replacing +smaller entries with superpages. + +Unfortunately, while some potential races are handled correctly, +others are not. These include: + +1. When one processor modifies a sub-superpage mapping while another +processor replaces the entire range with a superpage. + +Take the following example: + +Suppose L3[N] points to L2. And suppose we have two processors, A and +B. + +* A walks the pagetables, get a pointer to L2. +* B replaces L3[N] with a 1GiB mapping. +* B Frees L2 +* A writes L2[M] # + +This is race exacerbated by the fact that virt_to_xen_l[21]e doesn't +handle higher-level superpages properly: If you call virt_xen_to_l2e +on a virtual address within an L3 superpage, you'll either hit a BUG() +(most likely), or get a pointer into the middle of a data page; same +with virt_xen_to_l1 on a virtual address within either an L3 or L2 +superpage. + +So take the following example: + +* A reads pl3e and discovers it to point to an L2. +* B replaces L3[N] with a 1GiB mapping +* A calls virt_to_xen_l2e() and hits the BUG_ON() # + +2. When two processors simultaneously try to replace a sub-superpage +mapping with a superpage mapping. + +Take the following example: + +Suppose L3[N] points to L2. And suppose we have two processors, A and B, +both trying to replace L3[N] with a superpage. + +* A walks the pagetables, get a pointer to pl3e, and takes a copy ol3e pointing to L2. +* B walks the pagetables, gets a pointre to pl3e, and takes a copy ol3e pointing to L2. +* A writes the new value into L3[N] +* B writes the new value into L3[N] +* A recursively frees all the L1's under L2, then frees L2 +* B recursively double-frees all the L1's under L2, then double-frees L2 # + +Fix this by grabbing a lock for the entirety of the mapping update +operation. + +Rather than grabbing map_pgdir_lock for the entire operation, however, +repurpose the PGT_locked bit from L3's page->type_info as a lock. +This means that rather than locking the entire address space, we +"only" lock a single 512GiB chunk of hypervisor address space at a +time. 
+ +There was a proposal for a lock-and-reverify approach, where we walk +the pagetables to the point where we decide what to do; then grab the +map_pgdir_lock, re-verify the information we collected without the +lock, and finally make the change (starting over again if anything had +changed). Without being able to guarantee that the L2 table wasn't +freed, however, that means every read would need to be considered +potentially unsafe. Thinking carefully about that is probably +something that wants to be done on public, not under time pressure. + +This is part of XSA-345. + +Reported-by: Hongyan Xia +Signed-off-by: Hongyan Xia +Signed-off-by: George Dunlap +Reviewed-by: Jan Beulich +--- + xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++-- + 1 file changed, 89 insertions(+), 3 deletions(-) + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index 8ed3ecacbe..4ff24de73d 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -2153,6 +2153,50 @@ void page_unlock(struct page_info *page) + current_locked_page_set(NULL); + } + ++/* ++ * L3 table locks: ++ * ++ * Used for serialization in map_pages_to_xen() and modify_xen_mappings(). ++ * ++ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to ++ * reuse the PGT_locked flag. This lock is taken only when we move down to L3 ++ * tables and below, since L4 (and above, for 5-level paging) is still globally ++ * protected by map_pgdir_lock. ++ * ++ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock(). ++ * This has two implications: ++ * - We cannot reuse reuse current_locked_page_* for debugging ++ * - To avoid the chance of deadlock, even for different pages, we ++ * must never grab page_lock() after grabbing l3t_lock(). This ++ * includes any page_lock()-based locks, such as ++ * mem_sharing_page_lock(). ++ * ++ * Also note that we grab the map_pgdir_lock while holding the ++ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in ++ * reverse order. ++ */ ++static void l3t_lock(struct page_info *page) ++{ ++ unsigned long x, nx; ++ ++ do { ++ while ( (x = page->u.inuse.type_info) & PGT_locked ) ++ cpu_relax(); ++ nx = x | PGT_locked; ++ } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x ); ++} ++ ++static void l3t_unlock(struct page_info *page) ++{ ++ unsigned long x, nx, y = page->u.inuse.type_info; ++ ++ do { ++ x = y; ++ BUG_ON(!(x & PGT_locked)); ++ nx = x & ~PGT_locked; ++ } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x ); ++} ++ + /* + * PTE flags that a guest may change without re-validating the PTE. + * All other bits affect translation, caching, or Xen's safety. 
+@@ -5184,6 +5228,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v) + flush_area_local((const void *)v, f) : \ + flush_area_all((const void *)v, f)) + ++#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR ++ ++#define L3T_LOCK(page) \ ++ do { \ ++ if ( locking ) \ ++ l3t_lock(page); \ ++ } while ( false ) ++ ++#define L3T_UNLOCK(page) \ ++ do { \ ++ if ( locking && (page) != ZERO_BLOCK_PTR ) \ ++ { \ ++ l3t_unlock(page); \ ++ (page) = ZERO_BLOCK_PTR; \ ++ } \ ++ } while ( false ) ++ + int map_pages_to_xen( + unsigned long virt, + mfn_t mfn, +@@ -5195,6 +5256,7 @@ int map_pages_to_xen( + l1_pgentry_t *pl1e, ol1e; + unsigned int i; + int rc = -ENOMEM; ++ struct page_info *current_l3page; + + #define flush_flags(oldf) do { \ + unsigned int o_ = (oldf); \ +@@ -5210,13 +5272,20 @@ int map_pages_to_xen( + } \ + } while (0) + ++ L3T_INIT(current_l3page); ++ + while ( nr_mfns != 0 ) + { +- l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt); ++ l3_pgentry_t *pl3e, ol3e; + ++ L3T_UNLOCK(current_l3page); ++ ++ pl3e = virt_to_xen_l3e(virt); + if ( !pl3e ) + goto out; + ++ current_l3page = virt_to_page(pl3e); ++ L3T_LOCK(current_l3page); + ol3e = *pl3e; + + if ( cpu_has_page1gb && +@@ -5550,6 +5619,7 @@ int map_pages_to_xen( + rc = 0; + + out: ++ L3T_UNLOCK(current_l3page); + return rc; + } + +@@ -5578,6 +5648,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + unsigned int i; + unsigned long v = s; + int rc = -ENOMEM; ++ struct page_info *current_l3page; + + /* Set of valid PTE bits which may be altered. */ + #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT) +@@ -5586,11 +5657,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + ASSERT(IS_ALIGNED(s, PAGE_SIZE)); + ASSERT(IS_ALIGNED(e, PAGE_SIZE)); + ++ L3T_INIT(current_l3page); ++ + while ( v < e ) + { +- l3_pgentry_t *pl3e = virt_to_xen_l3e(v); ++ l3_pgentry_t *pl3e; ++ ++ L3T_UNLOCK(current_l3page); + +- if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ) ++ pl3e = virt_to_xen_l3e(v); ++ if ( !pl3e ) ++ goto out; ++ ++ current_l3page = virt_to_page(pl3e); ++ L3T_LOCK(current_l3page); ++ ++ if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) ) + { + /* Confirm the caller isn't trying to create new mappings. */ + ASSERT(!(nf & _PAGE_PRESENT)); +@@ -5808,9 +5890,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf) + rc = 0; + + out: ++ L3T_UNLOCK(current_l3page); + return rc; + } + ++#undef L3T_LOCK ++#undef L3T_UNLOCK ++ + #undef flush_area + + int destroy_xen_mappings(unsigned long s, unsigned long e) +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-ocaml-xenstored-introduce-permissions-for-spec.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-ocaml-xenstored-introduce-permissions-for-spec.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-ocaml-xenstored-introduce-permissions-for-spec.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-ocaml-xenstored-introduce-permissions-for-spec.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,117 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: introduce permissions for special watches +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +The special watches "@introduceDomain" and "@releaseDomain" should be +allowed for privileged callers only, as they allow to gain information +about presence of other guests on the host. 
So send watch events for +those watches via privileged connections only. + +Start to address this by treating the special watches as regular nodes +in the tree, which gives them normal semantics for permissions. A later +change will restrict the handling, so that they can't be listed, etc. + +This is part of XSA-115. + +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml +index f374abe998..c3c8ea2f4b 100644 +--- a/tools/ocaml/xenstored/process.ml ++++ b/tools/ocaml/xenstored/process.ml +@@ -414,7 +414,7 @@ let do_introduce con t domains cons data = + else try + let ndom = Domains.create domains domid mfn port in + Connections.add_domain cons ndom; +- Connections.fire_spec_watches cons "@introduceDomain"; ++ Connections.fire_spec_watches cons Store.Path.introduce_domain; + ndom + with _ -> raise Invalid_Cmd_Args + in +@@ -433,7 +433,7 @@ let do_release con t domains cons data = + Domains.del domains domid; + Connections.del_domain cons domid; + if fire_spec_watches +- then Connections.fire_spec_watches cons "@releaseDomain" ++ then Connections.fire_spec_watches cons Store.Path.release_domain + else raise Invalid_Cmd_Args + + let do_resume con t domains cons data = +diff --git a/tools/ocaml/xenstored/store.ml b/tools/ocaml/xenstored/store.ml +index 6375a1c889..98d368d52f 100644 +--- a/tools/ocaml/xenstored/store.ml ++++ b/tools/ocaml/xenstored/store.ml +@@ -214,6 +214,11 @@ let rec lookup node path fct = + + let apply rnode path fct = + lookup rnode path fct ++ ++let introduce_domain = "@introduceDomain" ++let release_domain = "@releaseDomain" ++let specials = List.map of_string [ introduce_domain; release_domain ] ++ + end + + (* The Store.t type *) +diff --git a/tools/ocaml/xenstored/utils.ml b/tools/ocaml/xenstored/utils.ml +index b252db799b..e8c9fe4e94 100644 +--- a/tools/ocaml/xenstored/utils.ml ++++ b/tools/ocaml/xenstored/utils.ml +@@ -88,19 +88,17 @@ let read_file_single_integer filename = + Unix.close fd; + int_of_string (Bytes.sub_string buf 0 sz) + +-let path_complete path connection_path = +- if String.get path 0 <> '/' then +- connection_path ^ path +- else +- path +- ++(* @path may be guest data and needs its length validating. 
@connection_path ++ * is generated locally in xenstored and always of the form "/local/domain/$N/" *) + let path_validate path connection_path = +- if String.length path = 0 || String.length path > 1024 then +- raise Define.Invalid_path +- else +- let cpath = path_complete path connection_path in +- if String.get cpath 0 <> '/' then +- raise Define.Invalid_path +- else +- cpath ++ let len = String.length path in ++ ++ if len = 0 || len > 1024 then raise Define.Invalid_path; ++ ++ let abs_path = ++ match String.get path 0 with ++ | '/' | '@' -> path ++ | _ -> connection_path ^ path ++ in + ++ abs_path +diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml +index 49fc18bf19..32c3b1c0f1 100644 +--- a/tools/ocaml/xenstored/xenstored.ml ++++ b/tools/ocaml/xenstored/xenstored.ml +@@ -287,6 +287,8 @@ let _ = + let quit = ref false in + + Logging.init_xenstored_log(); ++ List.iter (fun path -> ++ Store.write store Perms.Connection.full_rights path "") Store.Path.specials; + + let filename = Paths.xen_run_stored ^ "/db" in + if cf.restart && Sys.file_exists filename then ( +@@ -339,7 +341,7 @@ let _ = + let (notify, deaddom) = Domains.cleanup domains in + List.iter (Connections.del_domain cons) deaddom; + if deaddom <> [] || notify then +- Connections.fire_spec_watches cons "@releaseDomain" ++ Connections.fire_spec_watches cons Store.Path.release_domain + ) + else + let c = Connections.find_domain_by_port cons port in diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-xenstore-simplify-and-rename-check_event_node.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-xenstore-simplify-and-rename-check_event_node.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-xenstore-simplify-and-rename-check_event_node.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0004-tools-xenstore-simplify-and-rename-check_event_node.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,55 @@ +From 318aa75bd0c05423e717ad0b64adb204282025db Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:40 +0200 +Subject: [PATCH 04/10] tools/xenstore: simplify and rename check_event_node() + +There is no path which allows to call check_event_node() without a +event name. So don't let the result depend on the name being NULL and +add an assert() covering that case. + +Rename the function to check_special_event() to better match the +semantics. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_watch.c | 12 +++++------- + 1 file changed, 5 insertions(+), 7 deletions(-) + +diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c +index 7dedca60dfd6..f2f1bed47cc6 100644 +--- a/tools/xenstore/xenstored_watch.c ++++ b/tools/xenstore/xenstored_watch.c +@@ -47,13 +47,11 @@ struct watch + char *node; + }; + +-static bool check_event_node(const char *node) ++static bool check_special_event(const char *name) + { +- if (!node || !strstarts(node, "@")) { +- errno = EINVAL; +- return false; +- } +- return true; ++ assert(name); ++ ++ return strstarts(name, "@"); + } + + /* Is child a subnode of parent, or equal? */ +@@ -87,7 +85,7 @@ static void add_event(struct connection *conn, + unsigned int len; + char *data; + +- if (!check_event_node(name)) { ++ if (!check_special_event(name)) { + /* Can this conn load node, or see that it doesn't exist? 
*/ + struct node *node = get_node(conn, ctx, name, XS_PERM_READ); + /* +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-ocaml-xenstored-avoid-watch-events-for-nodes-w.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-ocaml-xenstored-avoid-watch-events-for-nodes-w.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-ocaml-xenstored-avoid-watch-events-for-nodes-w.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-ocaml-xenstored-avoid-watch-events-for-nodes-w.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,389 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: avoid watch events for nodes without access +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Today watch events are sent regardless of the access rights of the +node the event is sent for. This enables any guest to e.g. setup a +watch for "/" in order to have a detailed record of all Xenstore +modifications. + +Modify that by sending only watch events for nodes that the watcher +has a chance to see otherwise (either via direct reads or by querying +the children of a node). This includes cases where the visibility of +a node for a watcher is changing (permissions being removed). + +Permissions for nodes are looked up either in the old (pre +transaction/command) or current trees (post transaction). If +permissions are changed multiple times in a transaction only the final +version is checked, because considering a transaction atomic the +individual permission changes would not be noticable to an outside +observer. + +Two trees are only needed for set_perms: here we can either notice the +node disappearing (if we loose permission), appearing +(if we gain permission), or changing (if we preserve permission). + +RM needs to only look at the old tree: in the new tree the node would be +gone, or could have different permissions if it was recreated (the +recreation would get its own watch fired). + +Inside a tree we lookup the watch path's parent, and then the watch path +child itself. This gets us 4 sets of permissions in worst case, and if +either of these allows a watch, then we permit it to fire. The +permission lookups are done without logging the failures, otherwise we'd +get confusing errors about permission denied for some paths, but a watch +still firing. The actual result is logged in xenstored-access log: + + 'w event ...' as usual if watch was fired + 'w notfired...' if the watch was not fired, together with path and + permission set to help in troubleshooting + +Adding a watch bypasses permission checks and always fires the watch +once immediately. This is consistent with the specification, and no +information is gained (the watch is fired both if the path exists or +doesn't, and both if you have or don't have access, i.e. it reflects the +path a domain gave it back to that domain). + +There are some semantic changes here: + + * Write+rm in a single transaction of the same path is unobservable + now via watches: both before and after a transaction the path + doesn't exist, thus both tree lookups come up with the empty + permission set, and noone, not even Dom0 can see this. This is + consistent with transaction atomicity though. + * Similar to above if we temporarily grant and then revoke permission + on a path any watches fired inbetween are ignored as well + * There is a new log event (w notfired) which shows the permission set + of the path, and the path. 
+ * Watches on paths that a domain doesn't have access to are now not + seen, which is the purpose of the security fix. + +This is part of XSA-115. + +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml +index d7432c6597..1389d971c2 100644 +--- a/tools/ocaml/xenstored/connection.ml ++++ b/tools/ocaml/xenstored/connection.ml +@@ -196,11 +196,36 @@ let list_watches con = + con.watches [] in + List.concat ll + +-let fire_single_watch watch = ++let dbg fmt = Logging.debug "connection" fmt ++let info fmt = Logging.info "connection" fmt ++ ++let lookup_watch_perm path = function ++| None -> [] ++| Some root -> ++ try Store.Path.apply root path @@ fun parent name -> ++ Store.Node.get_perms parent :: ++ try [Store.Node.get_perms (Store.Node.find parent name)] ++ with Not_found -> [] ++ with Define.Invalid_path | Not_found -> [] ++ ++let lookup_watch_perms oldroot root path = ++ lookup_watch_perm path oldroot @ lookup_watch_perm path (Some root) ++ ++let fire_single_watch_unchecked watch = + let data = Utils.join_by_null [watch.path; watch.token; ""] in + send_reply watch.con Transaction.none 0 Xenbus.Xb.Op.Watchevent data + +-let fire_watch watch path = ++let fire_single_watch (oldroot, root) watch = ++ let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in ++ let perms = lookup_watch_perms oldroot root abspath in ++ if List.exists (Perms.has watch.con.perm READ) perms then ++ fire_single_watch_unchecked watch ++ else ++ let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in ++ let con = get_domstr watch.con in ++ Logging.watch_not_fired ~con perms (Store.Path.to_string abspath) ++ ++let fire_watch roots watch path = + let new_path = + if watch.is_relative && path.[0] = '/' + then begin +@@ -210,7 +235,7 @@ let fire_watch watch path = + end else + path + in +- fire_single_watch { watch with path = new_path } ++ fire_single_watch roots { watch with path = new_path } + + (* Search for a valid unused transaction id. 
*) + let rec valid_transaction_id con proposed_id = +diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml +index ae7692819d..020b875dcd 100644 +--- a/tools/ocaml/xenstored/connections.ml ++++ b/tools/ocaml/xenstored/connections.ml +@@ -135,25 +135,26 @@ let del_watch cons con path token = + watch + + (* path is absolute *) +-let fire_watches cons path recurse = ++let fire_watches ?oldroot root cons path recurse = + let key = key_of_path path in + let path = Store.Path.to_string path in ++ let roots = oldroot, root in + let fire_watch _ = function + | None -> () +- | Some watches -> List.iter (fun w -> Connection.fire_watch w path) watches ++ | Some watches -> List.iter (fun w -> Connection.fire_watch roots w path) watches + in + let fire_rec x = function + | None -> () + | Some watches -> +- List.iter (fun w -> Connection.fire_single_watch w) watches ++ List.iter (Connection.fire_single_watch roots) watches + in + Trie.iter_path fire_watch cons.watches key; + if recurse then + Trie.iter fire_rec (Trie.sub cons.watches key) + +-let fire_spec_watches cons specpath = ++let fire_spec_watches root cons specpath = + iter cons (fun con -> +- List.iter (fun w -> Connection.fire_single_watch w) (Connection.get_watches con specpath)) ++ List.iter (Connection.fire_single_watch (None, root)) (Connection.get_watches con specpath)) + + let set_target cons domain target_domain = + let con = find_domain cons domain in +diff --git a/tools/ocaml/xenstored/logging.ml b/tools/ocaml/xenstored/logging.ml +index ea6033195d..99c7bc5e13 100644 +--- a/tools/ocaml/xenstored/logging.ml ++++ b/tools/ocaml/xenstored/logging.ml +@@ -161,6 +161,8 @@ let xenstored_log_nb_lines = ref 13215 + let xenstored_log_nb_chars = ref (-1) + let xenstored_logger = ref (None: logger option) + ++let debug_enabled () = !xenstored_log_level = Debug ++ + let set_xenstored_log_destination s = + xenstored_log_destination := log_destination_of_string s + +@@ -204,6 +206,7 @@ type access_type = + | Commit + | Newconn + | Endconn ++ | Watch_not_fired + | XbOp of Xenbus.Xb.Op.operation + + let string_of_tid ~con tid = +@@ -217,6 +220,7 @@ let string_of_access_type = function + | Commit -> "commit " + | Newconn -> "newconn " + | Endconn -> "endconn " ++ | Watch_not_fired -> "w notfired" + + | XbOp op -> match op with + | Xenbus.Xb.Op.Debug -> "debug " +@@ -331,3 +335,7 @@ let xb_answer ~tid ~con ~ty data = + | _ -> false, Debug + in + if print then access_logging ~tid ~con ~data (XbOp ty) ~level ++ ++let watch_not_fired ~con perms path = ++ let data = Printf.sprintf "EPERM perms=[%s] path=%s" perms path in ++ access_logging ~tid:0 ~con ~data Watch_not_fired ~level:Info +diff --git a/tools/ocaml/xenstored/perms.ml b/tools/ocaml/xenstored/perms.ml +index 3ea193ea14..23b80aba3d 100644 +--- a/tools/ocaml/xenstored/perms.ml ++++ b/tools/ocaml/xenstored/perms.ml +@@ -79,9 +79,9 @@ let of_string s = + let string_of_perm perm = + Printf.sprintf "%c%u" (char_of_permty (snd perm)) (fst perm) + +-let to_string permvec = ++let to_string ?(sep="\000") permvec = + let l = ((permvec.owner, permvec.other) :: permvec.acl) in +- String.concat "\000" (List.map string_of_perm l) ++ String.concat sep (List.map string_of_perm l) + + end + +@@ -132,8 +132,8 @@ let check_owner (connection:Connection.t) (node:Node.t) = + then Connection.is_owner connection (Node.get_owner node) + else true + +-(* check if the current connection has the requested perm on the current node *) +-let check (connection:Connection.t) request (node:Node.t) = 
++(* check if the current connection lacks the requested perm on the current node *) ++let lacks (connection:Connection.t) request (node:Node.t) = + let check_acl domainid = + let perm = + if List.mem_assoc domainid (Node.get_acl node) +@@ -154,11 +154,19 @@ let check (connection:Connection.t) request (node:Node.t) = + info "Permission denied: Domain %d has write only access" domainid; + false + in +- if !activate ++ !activate + && not (Connection.is_dom0 connection) + && not (check_owner connection node) + && not (List.exists check_acl (Connection.get_owners connection)) ++ ++(* check if the current connection has the requested perm on the current node. ++* Raises an exception if it doesn't. *) ++let check connection request node = ++ if lacks connection request node + then raise Define.Permission_denied + ++(* check if the current connection has the requested perm on the current node *) ++let has connection request node = not (lacks connection request node) ++ + let equiv perm1 perm2 = + (Node.to_string perm1) = (Node.to_string perm2) +diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml +index c3c8ea2f4b..3cd0097db9 100644 +--- a/tools/ocaml/xenstored/process.ml ++++ b/tools/ocaml/xenstored/process.ml +@@ -56,15 +56,17 @@ let split_one_path data con = + | path :: "" :: [] -> Store.Path.create path (Connection.get_path con) + | _ -> raise Invalid_Cmd_Args + +-let process_watch ops cons = ++let process_watch t cons = ++ let oldroot = t.Transaction.oldroot in ++ let newroot = Store.get_root t.store in ++ let ops = Transaction.get_paths t |> List.rev in + let do_op_watch op cons = +- let recurse = match (fst op) with +- | Xenbus.Xb.Op.Write -> false +- | Xenbus.Xb.Op.Mkdir -> false +- | Xenbus.Xb.Op.Rm -> true +- | Xenbus.Xb.Op.Setperms -> false ++ let recurse, oldroot, root = match (fst op) with ++ | Xenbus.Xb.Op.Write|Xenbus.Xb.Op.Mkdir -> false, None, newroot ++ | Xenbus.Xb.Op.Rm -> true, None, oldroot ++ | Xenbus.Xb.Op.Setperms -> false, Some oldroot, newroot + | _ -> raise (Failure "huh ?") in +- Connections.fire_watches cons (snd op) recurse in ++ Connections.fire_watches ?oldroot root cons (snd op) recurse in + List.iter (fun op -> do_op_watch op cons) ops + + let create_implicit_path t perm path = +@@ -205,7 +207,7 @@ let reply_ack fct con t doms cons data = + fct con t doms cons data; + Packet.Ack (fun () -> + if Transaction.get_id t = Transaction.none then +- process_watch (Transaction.get_paths t) cons ++ process_watch t cons + ) + + let reply_data fct con t doms cons data = +@@ -353,14 +355,17 @@ let transaction_replay c t doms cons = + Connection.end_transaction c tid None + ) + +-let do_watch con t domains cons data = ++let do_watch con t _domains cons data = + let (node, token) = + match (split None '\000' data) with + | [node; token; ""] -> node, token + | _ -> raise Invalid_Cmd_Args + in + let watch = Connections.add_watch cons con node token in +- Packet.Ack (fun () -> Connection.fire_single_watch watch) ++ Packet.Ack (fun () -> ++ (* xenstore.txt says this watch is fired immediately, ++ implying even if path doesn't exist or is unreadable *) ++ Connection.fire_single_watch_unchecked watch) + + let do_unwatch con t domains cons data = + let (node, token) = +@@ -391,7 +396,7 @@ let do_transaction_end con t domains cons data = + if not success then + raise Transaction_again; + if commit then begin +- process_watch (List.rev (Transaction.get_paths t)) cons; ++ process_watch t cons; + match t.Transaction.ty with + | Transaction.No -> + () (* no need 
to record anything *) +@@ -414,7 +419,7 @@ let do_introduce con t domains cons data = + else try + let ndom = Domains.create domains domid mfn port in + Connections.add_domain cons ndom; +- Connections.fire_spec_watches cons Store.Path.introduce_domain; ++ Connections.fire_spec_watches (Transaction.get_root t) cons Store.Path.introduce_domain; + ndom + with _ -> raise Invalid_Cmd_Args + in +@@ -433,7 +438,7 @@ let do_release con t domains cons data = + Domains.del domains domid; + Connections.del_domain cons domid; + if fire_spec_watches +- then Connections.fire_spec_watches cons Store.Path.release_domain ++ then Connections.fire_spec_watches (Transaction.get_root t) cons Store.Path.release_domain + else raise Invalid_Cmd_Args + + let do_resume con t domains cons data = +@@ -501,6 +506,8 @@ let maybe_ignore_transaction = function + Transaction.none + | _ -> fun x -> x + ++ ++let () = Printexc.record_backtrace true + (** + * Nothrow guarantee. + *) +@@ -542,7 +549,8 @@ let process_packet ~store ~cons ~doms ~con ~req = + (* Put the response on the wire *) + send_response ty con t rid response + with exn -> +- error "process packet: %s" (Printexc.to_string exn); ++ let bt = Printexc.get_backtrace () in ++ error "process packet: %s. %s" (Printexc.to_string exn) bt; + Connection.send_error con tid rid "EIO" + + let do_input store cons doms con = +diff --git a/tools/ocaml/xenstored/transaction.ml b/tools/ocaml/xenstored/transaction.ml +index 23e7ccff1b..9e9e28db9b 100644 +--- a/tools/ocaml/xenstored/transaction.ml ++++ b/tools/ocaml/xenstored/transaction.ml +@@ -82,6 +82,7 @@ type t = { + start_count: int64; + store: Store.t; (* This is the store that we change in write operations. *) + quota: Quota.t; ++ oldroot: Store.Node.t; + mutable paths: (Xenbus.Xb.Op.operation * Store.Path.t) list; + mutable operations: (Packet.request * Packet.response) list; + mutable read_lowpath: Store.Path.t option; +@@ -123,6 +124,7 @@ let make ?(internal=false) id store = + start_count = !counter; + store = if id = none then store else Store.copy store; + quota = Quota.copy store.Store.quota; ++ oldroot = Store.get_root store; + paths = []; + operations = []; + read_lowpath = None; +@@ -137,6 +139,8 @@ let make ?(internal=false) id store = + let get_store t = t.store + let get_paths t = t.paths + ++let get_root t = Store.get_root t.store ++ + let is_read_only t = t.paths = [] + let add_wop t ty path = t.paths <- (ty, path) :: t.paths + let add_operation ~perm t request response = +diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml +index 32c3b1c0f1..e9f471846f 100644 +--- a/tools/ocaml/xenstored/xenstored.ml ++++ b/tools/ocaml/xenstored/xenstored.ml +@@ -341,7 +341,9 @@ let _ = + let (notify, deaddom) = Domains.cleanup domains in + List.iter (Connections.del_domain cons) deaddom; + if deaddom <> [] || notify then +- Connections.fire_spec_watches cons Store.Path.release_domain ++ Connections.fire_spec_watches ++ (Store.get_root store) ++ cons Store.Path.release_domain + ) + else + let c = Connections.find_domain_by_port cons port in diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-xenstore-check-privilege-for-XS_IS_DOMAIN_INTR.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-xenstore-check-privilege-for-XS_IS_DOMAIN_INTR.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-xenstore-check-privilege-for-XS_IS_DOMAIN_INTR.patch 1970-01-01 01:00:00.000000000 +0100 +++ 
xen-4.11.3+24-g14b62ab3e5/debian/patches/0005-tools-xenstore-check-privilege-for-XS_IS_DOMAIN_INTR.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,115 @@ +From c625fae44aedc246776b52eb1173cf847a3d4d80 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:41 +0200 +Subject: [PATCH 05/10] tools/xenstore: check privilege for + XS_IS_DOMAIN_INTRODUCED + +The Xenstore command XS_IS_DOMAIN_INTRODUCED should be possible for +privileged domains only (the only user in the tree is the xenpaging +daemon). + +Instead of having the privilege test for each command introduce a +per-command flag for that purpose. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 24 ++++++++++++++++++------ + tools/xenstore/xenstored_domain.c | 7 ++----- + 2 files changed, 20 insertions(+), 11 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index db9b9ca7957d..6afd58431111 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -1283,8 +1283,10 @@ static struct { + int (*func)(struct connection *conn, struct buffered_data *in); + unsigned int flags; + #define XS_FLAG_NOTID (1U << 0) /* Ignore transaction id. */ ++#define XS_FLAG_PRIV (1U << 1) /* Privileged domain only. */ + } const wire_funcs[XS_TYPE_COUNT] = { +- [XS_CONTROL] = { "CONTROL", do_control }, ++ [XS_CONTROL] = ++ { "CONTROL", do_control, XS_FLAG_PRIV }, + [XS_DIRECTORY] = { "DIRECTORY", send_directory }, + [XS_READ] = { "READ", do_read }, + [XS_GET_PERMS] = { "GET_PERMS", do_get_perms }, +@@ -1294,8 +1296,10 @@ static struct { + { "UNWATCH", do_unwatch, XS_FLAG_NOTID }, + [XS_TRANSACTION_START] = { "TRANSACTION_START", do_transaction_start }, + [XS_TRANSACTION_END] = { "TRANSACTION_END", do_transaction_end }, +- [XS_INTRODUCE] = { "INTRODUCE", do_introduce }, +- [XS_RELEASE] = { "RELEASE", do_release }, ++ [XS_INTRODUCE] = ++ { "INTRODUCE", do_introduce, XS_FLAG_PRIV }, ++ [XS_RELEASE] = ++ { "RELEASE", do_release, XS_FLAG_PRIV }, + [XS_GET_DOMAIN_PATH] = { "GET_DOMAIN_PATH", do_get_domain_path }, + [XS_WRITE] = { "WRITE", do_write }, + [XS_MKDIR] = { "MKDIR", do_mkdir }, +@@ -1304,9 +1308,11 @@ static struct { + [XS_WATCH_EVENT] = { "WATCH_EVENT", NULL }, + [XS_ERROR] = { "ERROR", NULL }, + [XS_IS_DOMAIN_INTRODUCED] = +- { "IS_DOMAIN_INTRODUCED", do_is_domain_introduced }, +- [XS_RESUME] = { "RESUME", do_resume }, +- [XS_SET_TARGET] = { "SET_TARGET", do_set_target }, ++ { "IS_DOMAIN_INTRODUCED", do_is_domain_introduced, XS_FLAG_PRIV }, ++ [XS_RESUME] = ++ { "RESUME", do_resume, XS_FLAG_PRIV }, ++ [XS_SET_TARGET] = ++ { "SET_TARGET", do_set_target, XS_FLAG_PRIV }, + [XS_RESET_WATCHES] = { "RESET_WATCHES", do_reset_watches }, + [XS_DIRECTORY_PART] = { "DIRECTORY_PART", send_directory_part }, + }; +@@ -1334,6 +1340,12 @@ static void process_message(struct connection *conn, struct buffered_data *in) + return; + } + ++ if ((wire_funcs[type].flags & XS_FLAG_PRIV) && ++ domain_is_unprivileged(conn)) { ++ send_error(conn, EACCES); ++ return; ++ } ++ + trans = (wire_funcs[type].flags & XS_FLAG_NOTID) + ? 
NULL : transaction_lookup(conn, in->hdr.msg.tx_id); + if (IS_ERR(trans)) { +diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c +index 1eae703ef680..0e2926e2a3d0 100644 +--- a/tools/xenstore/xenstored_domain.c ++++ b/tools/xenstore/xenstored_domain.c +@@ -377,7 +377,7 @@ int do_introduce(struct connection *conn, struct buffered_data *in) + if (get_strings(in, vec, ARRAY_SIZE(vec)) < ARRAY_SIZE(vec)) + return EINVAL; + +- if (domain_is_unprivileged(conn) || !conn->can_write) ++ if (!conn->can_write) + return EACCES; + + domid = atoi(vec[0]); +@@ -445,7 +445,7 @@ int do_set_target(struct connection *conn, struct buffered_data *in) + if (get_strings(in, vec, ARRAY_SIZE(vec)) < ARRAY_SIZE(vec)) + return EINVAL; + +- if (domain_is_unprivileged(conn) || !conn->can_write) ++ if (!conn->can_write) + return EACCES; + + domid = atoi(vec[0]); +@@ -480,9 +480,6 @@ static struct domain *onearg_domain(struct connection *conn, + if (!domid) + return ERR_PTR(-EINVAL); + +- if (domain_is_unprivileged(conn)) +- return ERR_PTR(-EACCES); +- + return find_connected_domain(domid); + } + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-ocaml-xenstored-add-xenstored.conf-flag-to-tur.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-ocaml-xenstored-add-xenstored.conf-flag-to-tur.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-ocaml-xenstored-add-xenstored.conf-flag-to-tur.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-ocaml-xenstored-add-xenstored.conf-flag-to-tur.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,84 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: add xenstored.conf flag to turn off watch + permission checks +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +There are flags to turn off quotas and the permission system, so add one +that turns off the newly introduced watch permission checks as well. + +This is part of XSA-115. + +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/connection.ml b/tools/ocaml/xenstored/connection.ml +index 1389d971c2..698f721345 100644 +--- a/tools/ocaml/xenstored/connection.ml ++++ b/tools/ocaml/xenstored/connection.ml +@@ -218,7 +218,7 @@ let fire_single_watch_unchecked watch = + let fire_single_watch (oldroot, root) watch = + let abspath = get_watch_path watch.con watch.path |> Store.Path.of_string in + let perms = lookup_watch_perms oldroot root abspath in +- if List.exists (Perms.has watch.con.perm READ) perms then ++ if Perms.can_fire_watch watch.con.perm perms then + fire_single_watch_unchecked watch + else + let perms = perms |> List.map (Perms.Node.to_string ~sep:" ") |> String.concat ", " in +diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in +index 6579b84448..d5d4f00de8 100644 +--- a/tools/ocaml/xenstored/oxenstored.conf.in ++++ b/tools/ocaml/xenstored/oxenstored.conf.in +@@ -44,6 +44,16 @@ conflict-rate-limit-is-aggregate = true + # Activate node permission system + perms-activate = true + ++# Activate the watch permission system ++# When this is enabled unprivileged guests can only get watch events ++# for xenstore entries that they would've been able to read. ++# ++# When this is disabled unprivileged guests may get watch events ++# for xenstore entries that they cannot read. 
The watch event contains ++# only the entry name, not the value. ++# This restores behaviour prior to XSA-115. ++perms-watch-activate = true ++ + # Activate quota + quota-activate = true + quota-maxentity = 1000 +diff --git a/tools/ocaml/xenstored/perms.ml b/tools/ocaml/xenstored/perms.ml +index 23b80aba3d..ee7fee6bda 100644 +--- a/tools/ocaml/xenstored/perms.ml ++++ b/tools/ocaml/xenstored/perms.ml +@@ -20,6 +20,7 @@ let info fmt = Logging.info "perms" fmt + open Stdext + + let activate = ref true ++let watch_activate = ref true + + type permty = READ | WRITE | RDWR | NONE + +@@ -168,5 +169,9 @@ let check connection request node = + (* check if the current connection has the requested perm on the current node *) + let has connection request node = not (lacks connection request node) + ++let can_fire_watch connection perms = ++ not !watch_activate ++ || List.exists (has connection READ) perms ++ + let equiv perm1 perm2 = + (Node.to_string perm1) = (Node.to_string perm2) +diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml +index e9f471846f..30fc874327 100644 +--- a/tools/ocaml/xenstored/xenstored.ml ++++ b/tools/ocaml/xenstored/xenstored.ml +@@ -95,6 +95,7 @@ let parse_config filename = + ("conflict-max-history-seconds", Config.Set_float Define.conflict_max_history_seconds); + ("conflict-rate-limit-is-aggregate", Config.Set_bool Define.conflict_rate_limit_is_aggregate); + ("perms-activate", Config.Set_bool Perms.activate); ++ ("perms-watch-activate", Config.Set_bool Perms.watch_activate); + ("quota-activate", Config.Set_bool Quota.activate); + ("quota-maxwatch", Config.Set_int Define.maxwatch); + ("quota-transaction", Config.Set_int Define.maxtransaction); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-xenstore-rework-node-removal.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-xenstore-rework-node-removal.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-xenstore-rework-node-removal.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0006-tools-xenstore-rework-node-removal.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,217 @@ +From 461c880600175c06e23a63e62d9f1ccab755d708 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:42 +0200 +Subject: [PATCH 06/10] tools/xenstore: rework node removal + +Today a Xenstore node is being removed by deleting it from the parent +first and then deleting itself and all its children. This results in +stale entries remaining in the data base in case e.g. a memory +allocation is failing during processing. This would result in the +rather strange behavior to be able to read a node (as its still in the +data base) while not being visible in the tree view of Xenstore. + +Fix that by deleting the nodes from the leaf side instead of starting +at the root. + +As fire_watches() is now called from _rm() the ctx parameter needs a +const attribute. + +This is part of XSA-115. 
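
The leaf-first strategy described above is easiest to picture on a plain in-memory tree. The following standalone C sketch is illustrative only; tnode, mknode and delete_subtree are invented names, not xenstored code. Each child is destroyed before its parent and only then dropped from the parent's child list, so stopping part-way never leaves an entry that still exists but is unreachable from the root, which mirrors the consistency property the patch aims for.

    /* Illustrative sketch only, not xenstored code: post-order removal
     * of a small in-memory tree, children before parent. */
    #include <stdio.h>
    #include <stdlib.h>

    struct tnode {
        const char *name;
        struct tnode *child[8];
        unsigned int nchild;
    };

    static struct tnode *mknode(const char *name)
    {
        struct tnode *n = calloc(1, sizeof(*n));

        if (n)
            n->name = name;
        return n;
    }

    /* Delete the subtree rooted at n, leaves first. */
    static void delete_subtree(struct tnode *n)
    {
        while (n->nchild) {
            delete_subtree(n->child[n->nchild - 1]);
            n->nchild--;    /* forget the child only once it is gone */
        }
        printf("removed %s\n", n->name);
        free(n);
    }

    int main(void)
    {
        struct tnode *root = mknode("/local/domain/1");
        struct tnode *c;

        if (!root)
            return 1;
        if ((c = mknode("/local/domain/1/data")))
            root->child[root->nchild++] = c;
        if ((c = mknode("/local/domain/1/control")))
            root->child[root->nchild++] = c;

        delete_subtree(root);
        return 0;
    }

The real daemon additionally has to re-read each child from the database and write the shrunken child list back to disk, but the ordering argument is the same: the parent keeps pointing at a child until that child has really been deleted.
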
+ +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 99 ++++++++++++++++---------------- + tools/xenstore/xenstored_watch.c | 4 +- + tools/xenstore/xenstored_watch.h | 2 +- + 3 files changed, 54 insertions(+), 51 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index 6afd58431111..1cb729a2cd5f 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -1087,74 +1087,76 @@ static int do_mkdir(struct connection *conn, struct buffered_data *in) + return 0; + } + +-static void delete_node(struct connection *conn, struct node *node) +-{ +- unsigned int i; +- char *name; +- +- /* Delete self, then delete children. If we crash, then the worst +- that can happen is the children will continue to take up space, but +- will otherwise be unreachable. */ +- delete_node_single(conn, node); +- +- /* Delete children, too. */ +- for (i = 0; i < node->childlen; i += strlen(node->children+i) + 1) { +- struct node *child; +- +- name = talloc_asprintf(node, "%s/%s", node->name, +- node->children + i); +- child = name ? read_node(conn, node, name) : NULL; +- if (child) { +- delete_node(conn, child); +- } +- else { +- trace("delete_node: Error deleting child '%s/%s'!\n", +- node->name, node->children + i); +- /* Skip it, we've already deleted the parent. */ +- } +- talloc_free(name); +- } +-} +- +- + /* Delete memory using memmove. */ + static void memdel(void *mem, unsigned off, unsigned len, unsigned total) + { + memmove(mem + off, mem + off + len, total - off - len); + } + +- +-static int remove_child_entry(struct connection *conn, struct node *node, +- size_t offset) ++static void remove_child_entry(struct connection *conn, struct node *node, ++ size_t offset) + { + size_t childlen = strlen(node->children + offset); ++ + memdel(node->children, offset, childlen + 1, node->childlen); + node->childlen -= childlen + 1; +- return write_node(conn, node, true); ++ if (write_node(conn, node, true)) ++ corrupt(conn, "Can't update parent node '%s'", node->name); + } + +- +-static int delete_child(struct connection *conn, +- struct node *node, const char *childname) ++static void delete_child(struct connection *conn, ++ struct node *node, const char *childname) + { + unsigned int i; + + for (i = 0; i < node->childlen; i += strlen(node->children+i) + 1) { + if (streq(node->children+i, childname)) { +- return remove_child_entry(conn, node, i); ++ remove_child_entry(conn, node, i); ++ return; + } + } + corrupt(conn, "Can't find child '%s' in %s", childname, node->name); +- return ENOENT; + } + ++static int delete_node(struct connection *conn, struct node *parent, ++ struct node *node) ++{ ++ char *name; ++ ++ /* Delete children. */ ++ while (node->childlen) { ++ struct node *child; ++ ++ name = talloc_asprintf(node, "%s/%s", node->name, ++ node->children); ++ child = name ? read_node(conn, node, name) : NULL; ++ if (child) { ++ if (delete_node(conn, node, child)) ++ return errno; ++ } else { ++ trace("delete_node: Error deleting child '%s/%s'!\n", ++ node->name, node->children); ++ /* Quit deleting. 
*/ ++ errno = ENOMEM; ++ return errno; ++ } ++ talloc_free(name); ++ } ++ ++ delete_node_single(conn, node); ++ delete_child(conn, parent, basename(node->name)); ++ talloc_free(node); ++ ++ return 0; ++} + + static int _rm(struct connection *conn, const void *ctx, struct node *node, + const char *name) + { +- /* Delete from parent first, then if we crash, the worst that can +- happen is the child will continue to take up space, but will +- otherwise be unreachable. */ ++ /* ++ * Deleting node by node, so the result is always consistent even in ++ * case of a failure. ++ */ + struct node *parent; + char *parentname = get_parent(ctx, name); + +@@ -1165,11 +1167,13 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node, + if (!parent) + return (errno == ENOMEM) ? ENOMEM : EINVAL; + +- if (delete_child(conn, parent, basename(name))) +- return EINVAL; +- +- delete_node(conn, node); +- return 0; ++ /* ++ * Fire the watches now, when we can still see the node permissions. ++ * This fine as we are single threaded and the next possible read will ++ * be handled only after the node has been really removed. ++ */ ++ fire_watches(conn, ctx, name, true); ++ return delete_node(conn, parent, node); + } + + +@@ -1207,7 +1211,6 @@ static int do_rm(struct connection *conn, struct buffered_data *in) + if (ret) + return ret; + +- fire_watches(conn, in, name, true); + send_ack(conn, XS_RM); + + return 0; +diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c +index f2f1bed47cc6..f0bbfe7a6dc6 100644 +--- a/tools/xenstore/xenstored_watch.c ++++ b/tools/xenstore/xenstored_watch.c +@@ -77,7 +77,7 @@ static bool is_child(const char *child, const char *parent) + * Temporary memory allocations are done with ctx. + */ + static void add_event(struct connection *conn, +- void *ctx, ++ const void *ctx, + struct watch *watch, + const char *name) + { +@@ -121,7 +121,7 @@ static void add_event(struct connection *conn, + * Check whether any watch events are to be sent. + * Temporary memory allocations are done with ctx. + */ +-void fire_watches(struct connection *conn, void *ctx, const char *name, ++void fire_watches(struct connection *conn, const void *ctx, const char *name, + bool recurse) + { + struct connection *i; +diff --git a/tools/xenstore/xenstored_watch.h b/tools/xenstore/xenstored_watch.h +index c72ea6a68542..54d4ea7e0d41 100644 +--- a/tools/xenstore/xenstored_watch.h ++++ b/tools/xenstore/xenstored_watch.h +@@ -25,7 +25,7 @@ int do_watch(struct connection *conn, struct buffered_data *in); + int do_unwatch(struct connection *conn, struct buffered_data *in); + + /* Fire all watches: recurse means all the children are affected (ie. rm). 
*/ +-void fire_watches(struct connection *conn, void *tmp, const char *name, ++void fire_watches(struct connection *conn, const void *tmp, const char *name, + bool recurse); + + void conn_delete_all_watches(struct connection *conn); +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0007-tools-xenstore-fire-watches-only-when-removing-a-spe.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0007-tools-xenstore-fire-watches-only-when-removing-a-spe.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0007-tools-xenstore-fire-watches-only-when-removing-a-spe.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0007-tools-xenstore-fire-watches-only-when-removing-a-spe.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,118 @@ +From 6ca2e14b43aecc79effc1a0cd528a4aceef44d42 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:43 +0200 +Subject: [PATCH 07/10] tools/xenstore: fire watches only when removing a + specific node + +Instead of firing all watches for removing a subtree in one go, do so +only when the related node is being removed. + +The watches for the top-most node being removed include all watches +including that node, while watches for nodes below that are only fired +if they are matching exactly. This avoids firing any watch more than +once when removing a subtree. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 11 ++++++----- + tools/xenstore/xenstored_watch.c | 13 ++++++++----- + tools/xenstore/xenstored_watch.h | 4 ++-- + 3 files changed, 16 insertions(+), 12 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index 1cb729a2cd5f..d7c025616ead 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -1118,8 +1118,8 @@ static void delete_child(struct connection *conn, + corrupt(conn, "Can't find child '%s' in %s", childname, node->name); + } + +-static int delete_node(struct connection *conn, struct node *parent, +- struct node *node) ++static int delete_node(struct connection *conn, const void *ctx, ++ struct node *parent, struct node *node) + { + char *name; + +@@ -1131,7 +1131,7 @@ static int delete_node(struct connection *conn, struct node *parent, + node->children); + child = name ? read_node(conn, node, name) : NULL; + if (child) { +- if (delete_node(conn, node, child)) ++ if (delete_node(conn, ctx, node, child)) + return errno; + } else { + trace("delete_node: Error deleting child '%s/%s'!\n", +@@ -1143,6 +1143,7 @@ static int delete_node(struct connection *conn, struct node *parent, + talloc_free(name); + } + ++ fire_watches(conn, ctx, node->name, true); + delete_node_single(conn, node); + delete_child(conn, parent, basename(node->name)); + talloc_free(node); +@@ -1172,8 +1173,8 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node, + * This fine as we are single threaded and the next possible read will + * be handled only after the node has been really removed. 
+ */ +- fire_watches(conn, ctx, name, true); +- return delete_node(conn, parent, node); ++ fire_watches(conn, ctx, name, false); ++ return delete_node(conn, ctx, parent, node); + } + + +diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c +index f0bbfe7a6dc6..3836675459fa 100644 +--- a/tools/xenstore/xenstored_watch.c ++++ b/tools/xenstore/xenstored_watch.c +@@ -122,7 +122,7 @@ static void add_event(struct connection *conn, + * Temporary memory allocations are done with ctx. + */ + void fire_watches(struct connection *conn, const void *ctx, const char *name, +- bool recurse) ++ bool exact) + { + struct connection *i; + struct watch *watch; +@@ -134,10 +134,13 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name, + /* Create an event for each watch. */ + list_for_each_entry(i, &connections, list) { + list_for_each_entry(watch, &i->watches, list) { +- if (is_child(name, watch->node)) +- add_event(i, ctx, watch, name); +- else if (recurse && is_child(watch->node, name)) +- add_event(i, ctx, watch, watch->node); ++ if (exact) { ++ if (streq(name, watch->node)) ++ add_event(i, ctx, watch, name); ++ } else { ++ if (is_child(name, watch->node)) ++ add_event(i, ctx, watch, name); ++ } + } + } + } +diff --git a/tools/xenstore/xenstored_watch.h b/tools/xenstore/xenstored_watch.h +index 54d4ea7e0d41..1b3c80d3dda1 100644 +--- a/tools/xenstore/xenstored_watch.h ++++ b/tools/xenstore/xenstored_watch.h +@@ -24,9 +24,9 @@ + int do_watch(struct connection *conn, struct buffered_data *in); + int do_unwatch(struct connection *conn, struct buffered_data *in); + +-/* Fire all watches: recurse means all the children are affected (ie. rm). */ ++/* Fire all watches: !exact means all the children are affected (ie. rm). */ + void fire_watches(struct connection *conn, const void *tmp, const char *name, +- bool recurse); ++ bool exact); + + void conn_delete_all_watches(struct connection *conn); + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0008-tools-xenstore-introduce-node_perms-structure.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0008-tools-xenstore-introduce-node_perms-structure.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0008-tools-xenstore-introduce-node_perms-structure.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0008-tools-xenstore-introduce-node_perms-structure.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,289 @@ +From 2d4f410899bf59e112c107f371c3d164f8a592f8 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:44 +0200 +Subject: [PATCH 08/10] tools/xenstore: introduce node_perms structure + +There are several places in xenstored using a permission array and the +size of that array. Introduce a new struct node_perms containing both. + +This is part of XSA-115. 
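
The structure introduced here simply keeps the permission array and its element count together, so routines such as perm_for_conn() take one argument instead of a separate pointer/length pair that could get out of step. A minimal standalone sketch of the lookup pattern follows; perm_set, perm_entry and perms_for_domain are invented names rather than the actual xenstored types. As in the patched code, entry 0 describes the owner and the default, and later entries are per-domain overrides.

    /* Illustrative sketch only, not xenstored code: a permission array
     * bundled with its length, and a lookup in the style of
     * perm_for_conn(). */
    #include <stdio.h>

    enum perm_type { PERM_NONE = 0, PERM_READ = 1, PERM_WRITE = 2 };

    struct perm_entry {
        unsigned int domid;
        unsigned int perms;           /* bitmask of enum perm_type */
    };

    struct perm_set {                 /* plays the role of node_perms */
        unsigned int num;
        const struct perm_entry *p;
    };

    static unsigned int perms_for_domain(const struct perm_set *set,
                                         unsigned int domid)
    {
        unsigned int i;

        if (set->p[0].domid == domid)     /* owner gets everything */
            return PERM_READ | PERM_WRITE;

        for (i = 1; i < set->num; i++)    /* per-domain overrides */
            if (set->p[i].domid == domid)
                return set->p[i].perms;

        return set->p[0].perms;           /* default for everyone else */
    }

    int main(void)
    {
        static const struct perm_entry acl[] = {
            { 0, PERM_NONE },             /* owned by dom0; others: none */
            { 5, PERM_READ },             /* domid 5 may read */
        };
        const struct perm_set set = { 2, acl };

        printf("dom5 perms %u, dom7 perms %u\n",
               perms_for_domain(&set, 5), perms_for_domain(&set, 7));
        return 0;
    }

Carrying the pair around as one value is also what lets the later patches in this series attach the same kind of permission set to the special "@releaseDomain" and "@introduceDomain" paths and reuse the same lookup for them.
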
+ +Signed-off-by: Juergen Gross +Acked-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 79 +++++++++++++++---------------- + tools/xenstore/xenstored_core.h | 8 +++- + tools/xenstore/xenstored_domain.c | 12 ++--- + 3 files changed, 50 insertions(+), 49 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index d7c025616ead..fe9943113b9f 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -401,14 +401,14 @@ static struct node *read_node(struct connection *conn, const void *ctx, + /* Datalen, childlen, number of permissions */ + hdr = (void *)data.dptr; + node->generation = hdr->generation; +- node->num_perms = hdr->num_perms; ++ node->perms.num = hdr->num_perms; + node->datalen = hdr->datalen; + node->childlen = hdr->childlen; + + /* Permissions are struct xs_permissions. */ +- node->perms = hdr->perms; ++ node->perms.p = hdr->perms; + /* Data is binary blob (usually ascii, no nul). */ +- node->data = node->perms + node->num_perms; ++ node->data = node->perms.p + node->perms.num; + /* Children is strings, nul separated. */ + node->children = node->data + node->datalen; + +@@ -425,7 +425,7 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node, + struct xs_tdb_record_hdr *hdr; + + data.dsize = sizeof(*hdr) +- + node->num_perms*sizeof(node->perms[0]) ++ + node->perms.num * sizeof(node->perms.p[0]) + + node->datalen + node->childlen; + + if (!no_quota_check && domain_is_unprivileged(conn) && +@@ -437,12 +437,13 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node, + data.dptr = talloc_size(node, data.dsize); + hdr = (void *)data.dptr; + hdr->generation = node->generation; +- hdr->num_perms = node->num_perms; ++ hdr->num_perms = node->perms.num; + hdr->datalen = node->datalen; + hdr->childlen = node->childlen; + +- memcpy(hdr->perms, node->perms, node->num_perms*sizeof(node->perms[0])); +- p = hdr->perms + node->num_perms; ++ memcpy(hdr->perms, node->perms.p, ++ node->perms.num * sizeof(*node->perms.p)); ++ p = hdr->perms + node->perms.num; + memcpy(p, node->data, node->datalen); + p += node->datalen; + memcpy(p, node->children, node->childlen); +@@ -468,8 +469,7 @@ static int write_node(struct connection *conn, struct node *node, + } + + static enum xs_perm_type perm_for_conn(struct connection *conn, +- struct xs_permissions *perms, +- unsigned int num) ++ const struct node_perms *perms) + { + unsigned int i; + enum xs_perm_type mask = XS_PERM_READ|XS_PERM_WRITE|XS_PERM_OWNER; +@@ -478,16 +478,16 @@ static enum xs_perm_type perm_for_conn(struct connection *conn, + mask &= ~XS_PERM_WRITE; + + /* Owners and tools get it all... 
*/ +- if (!domain_is_unprivileged(conn) || perms[0].id == conn->id +- || (conn->target && perms[0].id == conn->target->id)) ++ if (!domain_is_unprivileged(conn) || perms->p[0].id == conn->id ++ || (conn->target && perms->p[0].id == conn->target->id)) + return (XS_PERM_READ|XS_PERM_WRITE|XS_PERM_OWNER) & mask; + +- for (i = 1; i < num; i++) +- if (perms[i].id == conn->id +- || (conn->target && perms[i].id == conn->target->id)) +- return perms[i].perms & mask; ++ for (i = 1; i < perms->num; i++) ++ if (perms->p[i].id == conn->id ++ || (conn->target && perms->p[i].id == conn->target->id)) ++ return perms->p[i].perms & mask; + +- return perms[0].perms & mask; ++ return perms->p[0].perms & mask; + } + + /* +@@ -534,7 +534,7 @@ static int ask_parents(struct connection *conn, const void *ctx, + return 0; + } + +- *perm = perm_for_conn(conn, node->perms, node->num_perms); ++ *perm = perm_for_conn(conn, &node->perms); + return 0; + } + +@@ -580,8 +580,7 @@ struct node *get_node(struct connection *conn, + node = read_node(conn, ctx, name); + /* If we don't have permission, we don't have node. */ + if (node) { +- if ((perm_for_conn(conn, node->perms, node->num_perms) & perm) +- != perm) { ++ if ((perm_for_conn(conn, &node->perms) & perm) != perm) { + errno = EACCES; + node = NULL; + } +@@ -757,16 +756,15 @@ const char *onearg(struct buffered_data *in) + return in->buffer; + } + +-static char *perms_to_strings(const void *ctx, +- struct xs_permissions *perms, unsigned int num, ++static char *perms_to_strings(const void *ctx, const struct node_perms *perms, + unsigned int *len) + { + unsigned int i; + char *strings = NULL; + char buffer[MAX_STRLEN(unsigned int) + 1]; + +- for (*len = 0, i = 0; i < num; i++) { +- if (!xs_perm_to_string(&perms[i], buffer, sizeof(buffer))) ++ for (*len = 0, i = 0; i < perms->num; i++) { ++ if (!xs_perm_to_string(&perms->p[i], buffer, sizeof(buffer))) + return NULL; + + strings = talloc_realloc(ctx, strings, char, +@@ -945,13 +943,13 @@ static struct node *construct_node(struct connection *conn, const void *ctx, + goto nomem; + + /* Inherit permissions, except unprivileged domains own what they create */ +- node->num_perms = parent->num_perms; +- node->perms = talloc_memdup(node, parent->perms, +- node->num_perms * sizeof(node->perms[0])); +- if (!node->perms) ++ node->perms.num = parent->perms.num; ++ node->perms.p = talloc_memdup(node, parent->perms.p, ++ node->perms.num * sizeof(*node->perms.p)); ++ if (!node->perms.p) + goto nomem; + if (domain_is_unprivileged(conn)) +- node->perms[0].id = conn->id; ++ node->perms.p[0].id = conn->id; + + /* No children, no data */ + node->children = node->data = NULL; +@@ -1228,7 +1226,7 @@ static int do_get_perms(struct connection *conn, struct buffered_data *in) + if (!node) + return errno; + +- strings = perms_to_strings(node, node->perms, node->num_perms, &len); ++ strings = perms_to_strings(node, &node->perms, &len); + if (!strings) + return errno; + +@@ -1239,13 +1237,12 @@ static int do_get_perms(struct connection *conn, struct buffered_data *in) + + static int do_set_perms(struct connection *conn, struct buffered_data *in) + { +- unsigned int num; +- struct xs_permissions *perms; ++ struct node_perms perms; + char *name, *permstr; + struct node *node; + +- num = xs_count_strings(in->buffer, in->used); +- if (num < 2) ++ perms.num = xs_count_strings(in->buffer, in->used); ++ if (perms.num < 2) + return EINVAL; + + /* First arg is node name. 
*/ +@@ -1256,21 +1253,21 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + return errno; + + permstr = in->buffer + strlen(in->buffer) + 1; +- num--; ++ perms.num--; + +- perms = talloc_array(node, struct xs_permissions, num); +- if (!perms) ++ perms.p = talloc_array(node, struct xs_permissions, perms.num); ++ if (!perms.p) + return ENOMEM; +- if (!xs_strings_to_perms(perms, num, permstr)) ++ if (!xs_strings_to_perms(perms.p, perms.num, permstr)) + return errno; + + /* Unprivileged domains may not change the owner. */ +- if (domain_is_unprivileged(conn) && perms[0].id != node->perms[0].id) ++ if (domain_is_unprivileged(conn) && ++ perms.p[0].id != node->perms.p[0].id) + return EPERM; + + domain_entry_dec(conn, node); + node->perms = perms; +- node->num_perms = num; + domain_entry_inc(conn, node); + + if (write_node(conn, node, false)) +@@ -1545,8 +1542,8 @@ static void manual_node(const char *name, const char *child) + barf_perror("Could not allocate initial node %s", name); + + node->name = name; +- node->perms = &perms; +- node->num_perms = 1; ++ node->perms.p = &perms; ++ node->perms.num = 1; + node->children = (char *)child; + if (child) + node->childlen = strlen(child) + 1; +diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h +index 3cb1c235a101..193d93142636 100644 +--- a/tools/xenstore/xenstored_core.h ++++ b/tools/xenstore/xenstored_core.h +@@ -109,6 +109,11 @@ struct connection + }; + extern struct list_head connections; + ++struct node_perms { ++ unsigned int num; ++ struct xs_permissions *p; ++}; ++ + struct node { + const char *name; + +@@ -120,8 +125,7 @@ struct node { + #define NO_GENERATION ~((uint64_t)0) + + /* Permissions. */ +- unsigned int num_perms; +- struct xs_permissions *perms; ++ struct node_perms perms; + + /* Contents. 
*/ + unsigned int datalen; +diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c +index 0e2926e2a3d0..dc51cdfa9aa7 100644 +--- a/tools/xenstore/xenstored_domain.c ++++ b/tools/xenstore/xenstored_domain.c +@@ -657,12 +657,12 @@ void domain_entry_inc(struct connection *conn, struct node *node) + if (!conn) + return; + +- if (node->perms && node->perms[0].id != conn->id) { ++ if (node->perms.p && node->perms.p[0].id != conn->id) { + if (conn->transaction) { + transaction_entry_inc(conn->transaction, +- node->perms[0].id); ++ node->perms.p[0].id); + } else { +- d = find_domain_by_domid(node->perms[0].id); ++ d = find_domain_by_domid(node->perms.p[0].id); + if (d) + d->nbentry++; + } +@@ -683,12 +683,12 @@ void domain_entry_dec(struct connection *conn, struct node *node) + if (!conn) + return; + +- if (node->perms && node->perms[0].id != conn->id) { ++ if (node->perms.p && node->perms.p[0].id != conn->id) { + if (conn->transaction) { + transaction_entry_dec(conn->transaction, +- node->perms[0].id); ++ node->perms.p[0].id); + } else { +- d = find_domain_by_domid(node->perms[0].id); ++ d = find_domain_by_domid(node->perms.p[0].id); + if (d && d->nbentry) + d->nbentry--; + } +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0009-tools-xenstore-allow-special-watches-for-privileged-.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0009-tools-xenstore-allow-special-watches-for-privileged-.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0009-tools-xenstore-allow-special-watches-for-privileged-.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0009-tools-xenstore-allow-special-watches-for-privileged-.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,237 @@ +From cddf74031b3c8a108e8fd7db0bf56e9c2809d3e2 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:45 +0200 +Subject: [PATCH 09/10] tools/xenstore: allow special watches for privileged + callers only + +The special watches "@introduceDomain" and "@releaseDomain" should be +allowed for privileged callers only, as they allow to gain information +about presence of other guests on the host. So send watch events for +those watches via privileged connections only. + +In order to allow for disaggregated setups where e.g. driver domains +need to make use of those special watches add support for calling +"set permissions" for those special nodes, too. + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + docs/misc/xenstore.txt | 5 +++ + tools/xenstore/xenstored_core.c | 27 ++++++++------ + tools/xenstore/xenstored_core.h | 2 ++ + tools/xenstore/xenstored_domain.c | 60 +++++++++++++++++++++++++++++++ + tools/xenstore/xenstored_domain.h | 5 +++ + tools/xenstore/xenstored_watch.c | 4 +++ + 6 files changed, 93 insertions(+), 10 deletions(-) + +diff --git a/docs/misc/xenstore.txt b/docs/misc/xenstore.txt +index 6f8569d5760f..32969eb3fecd 100644 +--- a/docs/misc/xenstore.txt ++++ b/docs/misc/xenstore.txt +@@ -170,6 +170,9 @@ SET_PERMS ||+? + n no access + See http://wiki.xen.org/wiki/XenBus section + `Permissions' for details of the permissions system. ++ It is possible to set permissions for the special watch paths ++ "@introduceDomain" and "@releaseDomain" to enable receiving those ++ watches in unprivileged domains. + + ---------- Watches ---------- + +@@ -194,6 +197,8 @@ WATCH ||? 
+ @releaseDomain occurs on any domain crash or + shutdown, and also on RELEASE + and domain destruction ++ events are sent to privileged callers or explicitly ++ via SET_PERMS enabled domains only. + + When a watch is first set up it is triggered once straight + away, with equal to . Watches may be triggered +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index fe9943113b9f..720bec269dd3 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -468,8 +468,8 @@ static int write_node(struct connection *conn, struct node *node, + return write_node_raw(conn, &key, node, no_quota_check); + } + +-static enum xs_perm_type perm_for_conn(struct connection *conn, +- const struct node_perms *perms) ++enum xs_perm_type perm_for_conn(struct connection *conn, ++ const struct node_perms *perms) + { + unsigned int i; + enum xs_perm_type mask = XS_PERM_READ|XS_PERM_WRITE|XS_PERM_OWNER; +@@ -1245,22 +1245,29 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + if (perms.num < 2) + return EINVAL; + +- /* First arg is node name. */ +- /* We must own node to do this (tools can do this too). */ +- node = get_node_canonicalized(conn, in, in->buffer, &name, +- XS_PERM_WRITE | XS_PERM_OWNER); +- if (!node) +- return errno; +- + permstr = in->buffer + strlen(in->buffer) + 1; + perms.num--; + +- perms.p = talloc_array(node, struct xs_permissions, perms.num); ++ perms.p = talloc_array(in, struct xs_permissions, perms.num); + if (!perms.p) + return ENOMEM; + if (!xs_strings_to_perms(perms.p, perms.num, permstr)) + return errno; + ++ /* First arg is node name. */ ++ if (strstarts(in->buffer, "@")) { ++ if (set_perms_special(conn, in->buffer, &perms)) ++ return errno; ++ send_ack(conn, XS_SET_PERMS); ++ return 0; ++ } ++ ++ /* We must own node to do this (tools can do this too). */ ++ node = get_node_canonicalized(conn, in, in->buffer, &name, ++ XS_PERM_WRITE | XS_PERM_OWNER); ++ if (!node) ++ return errno; ++ + /* Unprivileged domains may not change the owner. */ + if (domain_is_unprivileged(conn) && + perms.p[0].id != node->perms.p[0].id) +diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h +index 193d93142636..f3da6bbc943d 100644 +--- a/tools/xenstore/xenstored_core.h ++++ b/tools/xenstore/xenstored_core.h +@@ -165,6 +165,8 @@ struct node *get_node(struct connection *conn, + struct connection *new_connection(connwritefn_t *write, connreadfn_t *read); + void check_store(void); + void corrupt(struct connection *conn, const char *fmt, ...); ++enum xs_perm_type perm_for_conn(struct connection *conn, ++ const struct node_perms *perms); + + /* Is this a valid node name? 
*/ + bool is_valid_nodename(const char *node); +diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c +index dc51cdfa9aa7..7afabe0ae084 100644 +--- a/tools/xenstore/xenstored_domain.c ++++ b/tools/xenstore/xenstored_domain.c +@@ -41,6 +41,9 @@ static evtchn_port_t virq_port; + + xenevtchn_handle *xce_handle = NULL; + ++static struct node_perms dom_release_perms; ++static struct node_perms dom_introduce_perms; ++ + struct domain + { + struct list_head list; +@@ -589,6 +592,59 @@ void restore_existing_connections(void) + { + } + ++static int set_dom_perms_default(struct node_perms *perms) ++{ ++ perms->num = 1; ++ perms->p = talloc_array(NULL, struct xs_permissions, perms->num); ++ if (!perms->p) ++ return -1; ++ perms->p->id = 0; ++ perms->p->perms = XS_PERM_NONE; ++ ++ return 0; ++} ++ ++static struct node_perms *get_perms_special(const char *name) ++{ ++ if (!strcmp(name, "@releaseDomain")) ++ return &dom_release_perms; ++ if (!strcmp(name, "@introduceDomain")) ++ return &dom_introduce_perms; ++ return NULL; ++} ++ ++int set_perms_special(struct connection *conn, const char *name, ++ struct node_perms *perms) ++{ ++ struct node_perms *p; ++ ++ p = get_perms_special(name); ++ if (!p) ++ return EINVAL; ++ ++ if ((perm_for_conn(conn, p) & (XS_PERM_WRITE | XS_PERM_OWNER)) != ++ (XS_PERM_WRITE | XS_PERM_OWNER)) ++ return EACCES; ++ ++ p->num = perms->num; ++ talloc_free(p->p); ++ p->p = perms->p; ++ talloc_steal(NULL, perms->p); ++ ++ return 0; ++} ++ ++bool check_perms_special(const char *name, struct connection *conn) ++{ ++ struct node_perms *p; ++ ++ p = get_perms_special(name); ++ if (!p) ++ return false; ++ ++ return perm_for_conn(conn, p) & XS_PERM_READ; ++} ++ + static int dom0_init(void) + { + evtchn_port_t port; +@@ -610,6 +666,10 @@ static int dom0_init(void) + + xenevtchn_notify(xce_handle, dom0->port); + ++ if (set_dom_perms_default(&dom_release_perms) || ++ set_dom_perms_default(&dom_introduce_perms)) ++ return -1; ++ + return 0; + } + +diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h +index 56ae01597475..259183962a9c 100644 +--- a/tools/xenstore/xenstored_domain.h ++++ b/tools/xenstore/xenstored_domain.h +@@ -65,6 +65,11 @@ void domain_watch_inc(struct connection *conn); + void domain_watch_dec(struct connection *conn); + int domain_watch(struct connection *conn); + ++/* Special node permission handling. */ ++int set_perms_special(struct connection *conn, const char *name, ++ struct node_perms *perms); ++bool check_perms_special(const char *name, struct connection *conn); ++ + /* Write rate limiting */ + + #define WRL_FACTOR 1000 /* for fixed-point arithmetic */ +diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c +index 3836675459fa..f4e289362eb6 100644 +--- a/tools/xenstore/xenstored_watch.c ++++ b/tools/xenstore/xenstored_watch.c +@@ -133,6 +133,10 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name, + + /* Create an event for each watch. 
*/ + list_for_each_entry(i, &connections, list) { ++ /* introduce/release domain watches */ ++ if (check_special_event(name) && !check_perms_special(name, i)) ++ continue; ++ + list_for_each_entry(watch, &i->watches, list) { + if (exact) { + if (streq(name, watch->node)) +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/0010-tools-xenstore-avoid-watch-events-for-nodes-without-.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/0010-tools-xenstore-avoid-watch-events-for-nodes-without-.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/0010-tools-xenstore-avoid-watch-events-for-nodes-without-.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/0010-tools-xenstore-avoid-watch-events-for-nodes-without-.patch 2022-05-26 17:34:05.000000000 +0100 @@ -0,0 +1,375 @@ +From e57b7687b43b033fe45e755e285efbe67bc71921 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Thu, 11 Jun 2020 16:12:46 +0200 +Subject: [PATCH 10/10] tools/xenstore: avoid watch events for nodes without + access + +Today watch events are sent regardless of the access rights of the +node the event is sent for. This enables any guest to e.g. setup a +watch for "/" in order to have a detailed record of all Xenstore +modifications. + +Modify that by sending only watch events for nodes that the watcher +has a chance to see otherwise (either via direct reads or by querying +the children of a node). This includes cases where the visibility of +a node for a watcher is changing (permissions being removed). + +This is part of XSA-115. + +Signed-off-by: Juergen Gross +[julieng: Handle rebase conflict] +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +--- + tools/xenstore/xenstored_core.c | 28 +++++----- + tools/xenstore/xenstored_core.h | 15 ++++-- + tools/xenstore/xenstored_domain.c | 6 +-- + tools/xenstore/xenstored_transaction.c | 21 +++++++- + tools/xenstore/xenstored_watch.c | 75 +++++++++++++++++++------- + tools/xenstore/xenstored_watch.h | 2 +- + 6 files changed, 104 insertions(+), 43 deletions(-) + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index 720bec269dd3..1c2845454560 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -358,8 +358,8 @@ static void initialize_fds(int sock, int *p_sock_pollfd_idx, + * If it fails, returns NULL and sets errno. + * Temporary memory allocations will be done with ctx. + */ +-static struct node *read_node(struct connection *conn, const void *ctx, +- const char *name) ++struct node *read_node(struct connection *conn, const void *ctx, ++ const char *name) + { + TDB_DATA key, data; + struct xs_tdb_record_hdr *hdr; +@@ -494,7 +494,7 @@ enum xs_perm_type perm_for_conn(struct connection *conn, + * Get name of node parent. + * Temporary memory allocations are done with ctx. + */ +-static char *get_parent(const void *ctx, const char *node) ++char *get_parent(const void *ctx, const char *node) + { + char *parent; + char *slash = strrchr(node + 1, '/'); +@@ -566,10 +566,10 @@ static int errno_from_parents(struct connection *conn, const void *ctx, + * If it fails, returns NULL and sets errno. + * Temporary memory allocations are done with ctx. 
+ */ +-struct node *get_node(struct connection *conn, +- const void *ctx, +- const char *name, +- enum xs_perm_type perm) ++static struct node *get_node(struct connection *conn, ++ const void *ctx, ++ const char *name, ++ enum xs_perm_type perm) + { + struct node *node; + +@@ -1056,7 +1056,7 @@ static int do_write(struct connection *conn, struct buffered_data *in) + return errno; + } + +- fire_watches(conn, in, name, false); ++ fire_watches(conn, in, name, node, false, NULL); + send_ack(conn, XS_WRITE); + + return 0; +@@ -1078,7 +1078,7 @@ static int do_mkdir(struct connection *conn, struct buffered_data *in) + node = create_node(conn, in, name, NULL, 0); + if (!node) + return errno; +- fire_watches(conn, in, name, false); ++ fire_watches(conn, in, name, node, false, NULL); + } + send_ack(conn, XS_MKDIR); + +@@ -1141,7 +1141,7 @@ static int delete_node(struct connection *conn, const void *ctx, + talloc_free(name); + } + +- fire_watches(conn, ctx, node->name, true); ++ fire_watches(conn, ctx, node->name, node, true, NULL); + delete_node_single(conn, node); + delete_child(conn, parent, basename(node->name)); + talloc_free(node); +@@ -1165,13 +1165,14 @@ static int _rm(struct connection *conn, const void *ctx, struct node *node, + parent = read_node(conn, ctx, parentname); + if (!parent) + return (errno == ENOMEM) ? ENOMEM : EINVAL; ++ node->parent = parent; + + /* + * Fire the watches now, when we can still see the node permissions. + * This fine as we are single threaded and the next possible read will + * be handled only after the node has been really removed. + */ +- fire_watches(conn, ctx, name, false); ++ fire_watches(conn, ctx, name, node, false, NULL); + return delete_node(conn, ctx, parent, node); + } + +@@ -1237,7 +1238,7 @@ static int do_get_perms(struct connection *conn, struct buffered_data *in) + + static int do_set_perms(struct connection *conn, struct buffered_data *in) + { +- struct node_perms perms; ++ struct node_perms perms, old_perms; + char *name, *permstr; + struct node *node; + +@@ -1273,6 +1274,7 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + perms.p[0].id != node->perms.p[0].id) + return EPERM; + ++ old_perms = node->perms; + domain_entry_dec(conn, node); + node->perms = perms; + domain_entry_inc(conn, node); +@@ -1280,7 +1282,7 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + if (write_node(conn, node, false)) + return errno; + +- fire_watches(conn, in, name, false); ++ fire_watches(conn, in, name, node, false, &old_perms); + send_ack(conn, XS_SET_PERMS); + + return 0; +diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h +index f3da6bbc943d..e050b27cbdde 100644 +--- a/tools/xenstore/xenstored_core.h ++++ b/tools/xenstore/xenstored_core.h +@@ -152,15 +152,17 @@ void send_ack(struct connection *conn, enum xsd_sockmsg_type type); + /* Canonicalize this path if possible. */ + char *canonicalize(struct connection *conn, const void *ctx, const char *node); + ++/* Get access permissions. */ ++enum xs_perm_type perm_for_conn(struct connection *conn, ++ const struct node_perms *perms); ++ + /* Write a node to the tdb data base. */ + int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node, + bool no_quota_check); + +-/* Get this node, checking we have permissions. */ +-struct node *get_node(struct connection *conn, +- const void *ctx, +- const char *name, +- enum xs_perm_type perm); ++/* Get a node from the tdb data base. 
*/ ++struct node *read_node(struct connection *conn, const void *ctx, ++ const char *name); + + struct connection *new_connection(connwritefn_t *write, connreadfn_t *read); + void check_store(void); +@@ -171,6 +173,9 @@ enum xs_perm_type perm_for_conn(struct connection *conn, + /* Is this a valid node name? */ + bool is_valid_nodename(const char *node); + ++/* Get name of parent node. */ ++char *get_parent(const void *ctx, const char *node); ++ + /* Tracing infrastructure. */ + void trace_create(const void *data, const char *type); + void trace_destroy(const void *data, const char *type); +diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c +index 7afabe0ae084..711a11b18ad6 100644 +--- a/tools/xenstore/xenstored_domain.c ++++ b/tools/xenstore/xenstored_domain.c +@@ -206,7 +206,7 @@ static int destroy_domain(void *_domain) + unmap_interface(domain->interface); + } + +- fire_watches(NULL, domain, "@releaseDomain", false); ++ fire_watches(NULL, domain, "@releaseDomain", NULL, false, NULL); + + wrl_domain_destroy(domain); + +@@ -244,7 +244,7 @@ static void domain_cleanup(void) + } + + if (notify) +- fire_watches(NULL, NULL, "@releaseDomain", false); ++ fire_watches(NULL, NULL, "@releaseDomain", NULL, false, NULL); + } + + /* We scan all domains rather than use the information given here. */ +@@ -410,7 +410,7 @@ int do_introduce(struct connection *conn, struct buffered_data *in) + /* Now domain belongs to its connection. */ + talloc_steal(domain->conn, domain); + +- fire_watches(NULL, in, "@introduceDomain", false); ++ fire_watches(NULL, in, "@introduceDomain", NULL, false, NULL); + } else if ((domain->mfn == mfn) && (domain->conn != conn)) { + /* Use XS_INTRODUCE for recreating the xenbus event-channel. */ + if (domain->port) +diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c +index e87897573469..a7d8c5d475ec 100644 +--- a/tools/xenstore/xenstored_transaction.c ++++ b/tools/xenstore/xenstored_transaction.c +@@ -114,6 +114,9 @@ struct accessed_node + /* Generation count (or NO_GENERATION) for conflict checking. */ + uint64_t generation; + ++ /* Original node permissions. */ ++ struct node_perms perms; ++ + /* Generation count checking required? */ + bool check_gen; + +@@ -260,6 +263,15 @@ int access_node(struct connection *conn, struct node *node, + i->node = talloc_strdup(i, node->name); + if (!i->node) + goto nomem; ++ if (node->generation != NO_GENERATION && node->perms.num) { ++ i->perms.p = talloc_array(i, struct xs_permissions, ++ node->perms.num); ++ if (!i->perms.p) ++ goto nomem; ++ i->perms.num = node->perms.num; ++ memcpy(i->perms.p, node->perms.p, ++ i->perms.num * sizeof(*i->perms.p)); ++ } + + introduce = true; + i->ta_node = false; +@@ -368,9 +380,14 @@ static int finalize_transaction(struct connection *conn, + talloc_free(data.dptr); + if (ret) + goto err; +- } else if (tdb_delete(tdb_ctx, key)) ++ fire_watches(conn, trans, i->node, NULL, false, ++ i->perms.p ? &i->perms : NULL); ++ } else { ++ fire_watches(conn, trans, i->node, NULL, false, ++ i->perms.p ? 
&i->perms : NULL); ++ if (tdb_delete(tdb_ctx, key)) + goto err; +- fire_watches(conn, trans, i->node, false); ++ } + } + + if (i->ta_node && tdb_delete(tdb_ctx, ta_key)) +diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c +index f4e289362eb6..71c108ea99f1 100644 +--- a/tools/xenstore/xenstored_watch.c ++++ b/tools/xenstore/xenstored_watch.c +@@ -85,22 +85,6 @@ static void add_event(struct connection *conn, + unsigned int len; + char *data; + +- if (!check_special_event(name)) { +- /* Can this conn load node, or see that it doesn't exist? */ +- struct node *node = get_node(conn, ctx, name, XS_PERM_READ); +- /* +- * XXX We allow EACCES here because otherwise a non-dom0 +- * backend driver cannot watch for disappearance of a frontend +- * xenstore directory. When the directory disappears, we +- * revert to permissions of the parent directory for that path, +- * which will typically disallow access for the backend. +- * But this breaks device-channel teardown! +- * Really we should fix this better... +- */ +- if (!node && errno != ENOENT && errno != EACCES) +- return; +- } +- + if (watch->relative_path) { + name += strlen(watch->relative_path); + if (*name == '/') /* Could be "" */ +@@ -117,12 +101,60 @@ static void add_event(struct connection *conn, + talloc_free(data); + } + ++/* ++ * Check permissions of a specific watch to fire: ++ * Either the node itself or its parent have to be readable by the connection ++ * the watch has been setup for. In case a watch event is created due to ++ * changed permissions we need to take the old permissions into account, too. ++ */ ++static bool watch_permitted(struct connection *conn, const void *ctx, ++ const char *name, struct node *node, ++ struct node_perms *perms) ++{ ++ enum xs_perm_type perm; ++ struct node *parent; ++ char *parent_name; ++ ++ if (perms) { ++ perm = perm_for_conn(conn, perms); ++ if (perm & XS_PERM_READ) ++ return true; ++ } ++ ++ if (!node) { ++ node = read_node(conn, ctx, name); ++ if (!node) ++ return false; ++ } ++ ++ perm = perm_for_conn(conn, &node->perms); ++ if (perm & XS_PERM_READ) ++ return true; ++ ++ parent = node->parent; ++ if (!parent) { ++ parent_name = get_parent(ctx, node->name); ++ if (!parent_name) ++ return false; ++ parent = read_node(conn, ctx, parent_name); ++ if (!parent) ++ return false; ++ } ++ ++ perm = perm_for_conn(conn, &parent->perms); ++ ++ return perm & XS_PERM_READ; ++} ++ + /* + * Check whether any watch events are to be sent. + * Temporary memory allocations are done with ctx. ++ * We need to take the (potential) old permissions of the node into account ++ * as a watcher losing permissions to access a node should receive the ++ * watch event, too. + */ + void fire_watches(struct connection *conn, const void *ctx, const char *name, +- bool exact) ++ struct node *node, bool exact, struct node_perms *perms) + { + struct connection *i; + struct watch *watch; +@@ -134,8 +166,13 @@ void fire_watches(struct connection *conn, const void *ctx, const char *name, + /* Create an event for each watch. 
*/ + list_for_each_entry(i, &connections, list) { + /* introduce/release domain watches */ +- if (check_special_event(name) && !check_perms_special(name, i)) +- continue; ++ if (check_special_event(name)) { ++ if (!check_perms_special(name, i)) ++ continue; ++ } else { ++ if (!watch_permitted(i, ctx, name, node, perms)) ++ continue; ++ } + + list_for_each_entry(watch, &i->watches, list) { + if (exact) { +diff --git a/tools/xenstore/xenstored_watch.h b/tools/xenstore/xenstored_watch.h +index 1b3c80d3dda1..03094374f379 100644 +--- a/tools/xenstore/xenstored_watch.h ++++ b/tools/xenstore/xenstored_watch.h +@@ -26,7 +26,7 @@ int do_unwatch(struct connection *conn, struct buffered_data *in); + + /* Fire all watches: !exact means all the children are affected (ie. rm). */ + void fire_watches(struct connection *conn, const void *tmp, const char *name, +- bool exact); ++ struct node *node, bool exact, struct node_perms *perms); + + void conn_delete_all_watches(struct connection *conn); + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/AMD-IOMMU-fix-off-by-one-in-amd_iommu_get_paging_mode-callers.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/AMD-IOMMU-fix-off-by-one-in-amd_iommu_get_paging_mode-callers.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/AMD-IOMMU-fix-off-by-one-in-amd_iommu_get_paging_mode-callers.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/AMD-IOMMU-fix-off-by-one-in-amd_iommu_get_paging_mode-callers.patch 2022-06-01 21:10:15.000000000 +0100 @@ -0,0 +1,124 @@ +From 696d142276e277264a9c6fcdd4f00edc8a6ce292 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Thu, 9 Apr 2020 10:11:50 +0200 +Subject: [PATCH] AMD/IOMMU: fix off-by-one in amd_iommu_get_paging_mode() + callers + +amd_iommu_get_paging_mode() expects a count, not a "maximum possible" +value. Prior to b4f042236ae0 dropping the reference, the use of our mis- +named "max_page" in amd_iommu_domain_init() may have lead to such a +misunderstanding. In an attempt to avoid such confusion in the future, +rename the function's parameter and - while at it - convert it to an +inline function. + +Also replace a literal 4 by an expression tying it to a wider use +constant, just like amd_iommu_quarantine_init() does. 
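A minimal standalone sketch of the level computation (an illustration only, assuming the AMD IOMMU's 512-entry page tables; it is not part of the patch below) shows why a frame count rather than a maximum frame index has to be passed:

    #include <stdio.h>

    #define PTES_PER_TABLE 512ul   /* assumption: 9 address bits per level */

    /* Mirrors the shape of the loop in amd_iommu_get_paging_mode():
     * number of levels needed to map 'frames' page frames, i.e. frame
     * indexes 0 .. frames - 1. */
    static unsigned int paging_mode(unsigned long frames)
    {
        unsigned int level = 1;

        while ( frames > PTES_PER_TABLE )
        {
            /* Round up to a whole number of tables, then go one level up. */
            frames = (frames + PTES_PER_TABLE - 1) / PTES_PER_TABLE;
            ++level;
        }

        return level;
    }

    int main(void)
    {
        /* A domain with 513 frames must be able to map frame index 512. */
        printf("count 513     -> level %u\n", paging_mode(513)); /* prints 2 */
        /* Passing the maximum index (512) instead under-sizes the tables. */
        printf("max index 512 -> level %u\n", paging_mode(512)); /* prints 1 */
        return 0;
    }

As the boundary case in the sketch suggests, an argument that is one too small can yield a page-table hierarchy one level too shallow, which is the off-by-one being corrected here.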
+ +Fixes: ea38867831da ("x86 / iommu: set up a scratch page in the quarantine domain") +Fixes: b4f042236ae0 ("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") +Signed-off-by: Jan Beulich +Acked-by: Andrew Cooper +master commit: b75b3c62fe4afe381c6f74a07f614c0b39fe2f5d +master date: 2020-03-16 11:24:29 +0100 +--- + xen/drivers/passthrough/amd/iommu_map.c | 6 ++--- + xen/drivers/passthrough/amd/pci_amd_iommu.c | 23 ++++--------------- + xen/include/asm-x86/hvm/svm/amd-iommu-proto.h | 17 +++++++++++++- + 3 files changed, 23 insertions(+), 23 deletions(-) + +diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c +index 21fbea0467..aa382dbabd 100644 +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -745,9 +745,9 @@ void amd_iommu_share_p2m(struct domain *d) + int __init amd_iommu_quarantine_init(struct domain *d) + { + struct domain_iommu *hd = dom_iommu(d); +- unsigned long max_gfn = +- PFN_DOWN((1ul << DEFAULT_DOMAIN_ADDRESS_WIDTH) - 1); +- unsigned int level = amd_iommu_get_paging_mode(max_gfn); ++ unsigned long end_gfn = ++ 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT); ++ unsigned int level = amd_iommu_get_paging_mode(end_gfn); + uint64_t *table; + + if ( hd->arch.root_table ) +diff --git a/xen/drivers/passthrough/amd/pci_amd_iommu.c b/xen/drivers/passthrough/amd/pci_amd_iommu.c +index 0b641ff75c..983ece5981 100644 +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -218,22 +218,6 @@ static int __must_check allocate_domain_resources(struct domain_iommu *hd) + return rc; + } + +-int amd_iommu_get_paging_mode(unsigned long entries) +-{ +- int level = 1; +- +- BUG_ON( !entries ); +- +- while ( entries > PTE_PER_TABLE_SIZE ) +- { +- entries = PTE_PER_TABLE_ALIGN(entries) >> PTE_PER_TABLE_SHIFT; +- if ( ++level > 6 ) +- return -ENOMEM; +- } +- +- return level; +-} +- + static int amd_iommu_domain_init(struct domain *d) + { + struct domain_iommu *hd = dom_iommu(d); +@@ -246,9 +230,10 @@ static int amd_iommu_domain_init(struct domain *d) + * physical address space we give it, but this isn't known yet so use 4 + * unilaterally. + */ +- hd->arch.paging_mode = is_hvm_domain(d) +- ? IOMMU_PAGING_MODE_LEVEL_4 +- : amd_iommu_get_paging_mode(get_upper_mfn_bound()); ++ hd->arch.paging_mode = amd_iommu_get_paging_mode( ++ is_hvm_domain(d) ++ ? 
1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT) ++ : get_upper_mfn_bound() + 1); + + return 0; + } +diff --git a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +index c42688fe51..22d6614169 100644 +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +@@ -51,7 +51,6 @@ void get_iommu_features(struct amd_iommu *iommu); + int amd_iommu_init(void); + int amd_iommu_update_ivrs_mapping_acpi(void); + +-int amd_iommu_get_paging_mode(unsigned long entries); + int amd_iommu_quarantine_init(struct domain *d); + + /* mapping functions */ +@@ -168,6 +167,22 @@ static inline unsigned long region_to_pages(unsigned long addr, unsigned long si + return (PAGE_ALIGN(addr + size) - (addr & PAGE_MASK)) >> PAGE_SHIFT; + } + ++static inline int amd_iommu_get_paging_mode(unsigned long max_frames) ++{ ++ int level = 1; ++ ++ BUG_ON(!max_frames); ++ ++ while ( max_frames > PTE_PER_TABLE_SIZE ) ++ { ++ max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT; ++ if ( ++level > 6 ) ++ return -ENOMEM; ++ } ++ ++ return level; ++} ++ + static inline struct page_info* alloc_amd_iommu_pgtable(void) + { + struct page_info *pg; +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/amd-iommu-get-rid-of-pointless-IOMMU_PAGING_MODE_LEVEL_X-definitions.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/amd-iommu-get-rid-of-pointless-IOMMU_PAGING_MODE_LEVEL_X-definitions.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/amd-iommu-get-rid-of-pointless-IOMMU_PAGING_MODE_LEVEL_X-definitions.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/amd-iommu-get-rid-of-pointless-IOMMU_PAGING_MODE_LEVEL_X-definitions.patch 2022-06-15 19:47:32.000000000 +0100 @@ -0,0 +1,169 @@ +From 1ecb1ee4d8475475c3ccf72f6654644b242ce856 Mon Sep 17 00:00:00 2001 +From: Paul Durrant +Date: Mon, 29 Oct 2018 13:47:24 +0100 +Subject: [PATCH] amd-iommu: get rid of pointless IOMMU_PAGING_MODE_LEVEL_X + definitions + +The levels are absolute numbers such that IOMMU_PAGING_MODE_LEVEL_X +evaluates to X (for the valid range of 0 - 7) so simply use numbers in +the code. + +No functional change. 
+ +NOTE: This patch also adds emacs boilerplate to amd-iommu-defs.h + +Signed-off-by: Paul Durrant +Acked-by: Brian Woods +--- + xen/drivers/passthrough/amd/iommu_map.c | 26 +++++++++----------- + xen/drivers/passthrough/amd/pci_amd_iommu.c | 4 +-- + xen/include/asm-x86/hvm/svm/amd-iommu-defs.h | 21 ++++++++-------- + 3 files changed, 23 insertions(+), 28 deletions(-) + +diff --git a/xen/drivers/passthrough/amd/iommu_map.c b/xen/drivers/passthrough/amd/iommu_map.c +index d03a6d72b9..6a2c877d34 100644 +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -40,7 +40,7 @@ static void clear_iommu_pte_present(unsigned long l1_mfn, unsigned long gfn) + u64 *table, *pte; + + table = map_domain_page(_mfn(l1_mfn)); +- pte = table + pfn_to_pde_idx(gfn, IOMMU_PAGING_MODE_LEVEL_1); ++ pte = table + pfn_to_pde_idx(gfn, 1); + write_atomic(pte, 0); + unmap_domain_page(table); + } +@@ -103,7 +103,7 @@ static bool_t set_iommu_pde_present(u32 *pde, unsigned long next_mfn, + /* FC bit should be enabled in PTE, this helps to solve potential + * issues with ATS devices + */ +- if ( next_level == IOMMU_PAGING_MODE_LEVEL_0 ) ++ if ( next_level == 0 ) + set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, entry, + IOMMU_PTE_FC_MASK, IOMMU_PTE_FC_SHIFT, &entry); + full = (uint64_t)entry << 32; +@@ -137,8 +137,7 @@ static bool_t set_iommu_pte_present(unsigned long pt_mfn, unsigned long gfn, + + pde = (u32*)(table + pfn_to_pde_idx(gfn, pde_level)); + +- need_flush = set_iommu_pde_present(pde, next_mfn, +- IOMMU_PAGING_MODE_LEVEL_0, iw, ir); ++ need_flush = set_iommu_pde_present(pde, next_mfn, 0, iw, ir); + unmap_domain_page(table); + return need_flush; + } +@@ -458,8 +457,7 @@ static int iommu_merge_pages(struct domain *d, unsigned long pt_mfn, + } + + /* setup super page mapping, next level = 0 */ +- set_iommu_pde_present((u32*)pde, first_mfn, +- IOMMU_PAGING_MODE_LEVEL_0, ++ set_iommu_pde_present((u32*)pde, first_mfn, 0, + !!(flags & IOMMUF_writable), + !!(flags & IOMMUF_readable)); + +@@ -486,25 +484,24 @@ static int iommu_pde_from_gfn(struct domain *d, unsigned long gfn, + table = hd->arch.root_table; + level = hd->arch.paging_mode; + +- BUG_ON( table == NULL || level < IOMMU_PAGING_MODE_LEVEL_1 || +- level > IOMMU_PAGING_MODE_LEVEL_6 ); ++ BUG_ON( table == NULL || level < 1 || level > 6 ); + + /* + * A frame number past what the current page tables can represent can't + * possibly have a mapping. 
+ */ + if ( pfn >> (PTE_PER_TABLE_SHIFT * level) ) + return 0; + + next_table_mfn = mfn_x(page_to_mfn(table)); + +- if ( level == IOMMU_PAGING_MODE_LEVEL_1 ) ++ if ( level == 1 ) + { + pt_mfn[level] = next_table_mfn; + return 0; + } + +- while ( level > IOMMU_PAGING_MODE_LEVEL_1 ) ++ while ( level > 1 ) + { + unsigned int next_level = level - 1; + pt_mfn[level] = next_table_mfn; +@@ -622,8 +619,7 @@ int amd_iommu_map_page(struct domain *d, unsigned long gfn, unsigned long mfn, + } + + /* Install 4k mapping first */ +- need_flush = set_iommu_pte_present(pt_mfn[1], gfn, mfn, +- IOMMU_PAGING_MODE_LEVEL_1, ++ need_flush = set_iommu_pte_present(pt_mfn[1], gfn, mfn, 1, + !!(flags & IOMMUF_writable), + !!(flags & IOMMUF_readable)); + +@@ -646,8 +642,8 @@ int amd_iommu_map_page(struct domain *d, unsigned long gfn, unsigned long mfn, + goto out; + } + +- for ( merge_level = IOMMU_PAGING_MODE_LEVEL_2; +- merge_level <= hd->arch.paging_mode; merge_level++ ) ++ for ( merge_level = 2; merge_level <= hd->arch.paging_mode; ++ merge_level++ ) + { + if ( pt_mfn[merge_level] == 0 ) + break; +@@ -777,7 +773,7 @@ void amd_iommu_share_p2m(struct domain *d) + hd->arch.root_table = p2m_table; + + /* When sharing p2m with iommu, paging mode = 4 */ +- hd->arch.paging_mode = IOMMU_PAGING_MODE_LEVEL_4; ++ hd->arch.paging_mode = 4; + AMD_IOMMU_DEBUG("Share p2m table with iommu: p2m table = %#lx\n", + mfn_x(pgd_mfn)); + } +diff --git a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h +index 1f19cd3d27..a217245249 100644 +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h +@@ -35,8 +35,7 @@ + PAGE_SIZE * (PTE_PER_TABLE_ALIGN(entries) >> PTE_PER_TABLE_SHIFT) + + #define amd_offset_level_address(offset, level) \ +- ((u64)(offset) << (12 + (PTE_PER_TABLE_SHIFT * \ +- (level - IOMMU_PAGING_MODE_LEVEL_1)))) ++ ((uint64_t)(offset) << (12 + (PTE_PER_TABLE_SHIFT * ((level) - 1)))) + + #define PCI_MIN_CAP_OFFSET 0x40 + #define PCI_MAX_CAP_BLOCKS 48 +@@ -446,14 +445,6 @@ + + /* Paging modes */ + #define IOMMU_PAGING_MODE_DISABLED 0x0 +-#define IOMMU_PAGING_MODE_LEVEL_0 0x0 +-#define IOMMU_PAGING_MODE_LEVEL_1 0x1 +-#define IOMMU_PAGING_MODE_LEVEL_2 0x2 +-#define IOMMU_PAGING_MODE_LEVEL_3 0x3 +-#define IOMMU_PAGING_MODE_LEVEL_4 0x4 +-#define IOMMU_PAGING_MODE_LEVEL_5 0x5 +-#define IOMMU_PAGING_MODE_LEVEL_6 0x6 +-#define IOMMU_PAGING_MODE_LEVEL_7 0x7 + + /* Flags */ + #define IOMMU_CONTROL_DISABLED 0 +@@ -494,3 +485,13 @@ + #define IOMMU_REG_BASE_ADDR_HIGH_SHIFT 0 + + #endif /* _ASM_X86_64_AMD_IOMMU_DEFS_H */ ++ ++/* ++ * Local variables: ++ * mode: C ++ * c-file-style: "BSD" ++ * c-basic-offset: 4 ++ * tab-width: 4 ++ * indent-tabs-mode: nil ++ * End: ++ */ +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/evtchn-fifo-use-stable-fields-when-recording-last-queue-information.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/evtchn-fifo-use-stable-fields-when-recording-last-queue-information.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/evtchn-fifo-use-stable-fields-when-recording-last-queue-information.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/evtchn-fifo-use-stable-fields-when-recording-last-queue-information.patch 2022-06-01 11:07:17.000000000 +0100 @@ -0,0 +1,41 @@ +From 2a730d5b6ad1ea95c3d67fa12ab0091d32b29505 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Tue, 1 Dec 2020 17:03:12 +0100 +Subject: [PATCH] evtchn/fifo: use stable fields when recording "last queue" + 
information + +Both evtchn->priority and evtchn->notify_vcpu_id could change behind the +back of evtchn_fifo_set_pending(), as for it - in the case of +interdomain channels - only the remote side's per-channel lock is held. +Neither the queue's priority nor the vCPU's vcpu_id fields have similar +properties, so they seem better suited for the purpose. In particular +they reflect the respective evtchn fields' values at the time they were +used to determine queue and vCPU. + +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall +Reviewed-by: Paul Durrant +master commit: 6f6f07b64cbe90e54f8e62b4d6f2404cf5306536 +master date: 2020-10-02 08:37:35 +0200 +--- + xen/common/event_fifo.c | 4 ++-- + 1 file changed, 2 insertions(+), 2 deletions(-) + +diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c +index 45c024739d..98742ba9cb 100644 +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -224,8 +224,8 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + /* Moved to a different queue? */ + if ( old_q != q ) + { +- evtchn->last_vcpu_id = evtchn->notify_vcpu_id; +- evtchn->last_priority = evtchn->priority; ++ evtchn->last_vcpu_id = v->vcpu_id; ++ evtchn->last_priority = q->priority; + + spin_unlock_irqrestore(&old_q->lock, flags); + spin_lock_irqsave(&q->lock, flags); +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/fix_event_channel_race.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/fix_event_channel_race.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/fix_event_channel_race.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/fix_event_channel_race.patch 2022-05-30 12:44:59.000000000 +0100 @@ -0,0 +1,196 @@ +diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c +index b1951a29ad..0a90a8404d 100644 +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -65,38 +65,6 @@ static void evtchn_fifo_init(struct domain *d, struct evtchn *evtchn) + d->domain_id, evtchn->port); + } + +-static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d, +- struct evtchn *evtchn, +- unsigned long *flags) +-{ +- struct vcpu *v; +- struct evtchn_fifo_queue *q, *old_q; +- unsigned int try; +- union evtchn_fifo_lastq lastq; +- +- for ( try = 0; try < 3; try++ ) +- { +- lastq.raw = read_atomic(&evtchn->fifo_lastq); +- v = d->vcpu[lastq.last_vcpu_id]; +- old_q = &v->evtchn_fifo->queue[lastq.last_priority]; +- +- spin_lock_irqsave(&old_q->lock, *flags); +- +- v = d->vcpu[lastq.last_vcpu_id]; +- q = &v->evtchn_fifo->queue[lastq.last_priority]; +- +- if ( old_q == q ) +- return old_q; +- +- spin_unlock_irqrestore(&old_q->lock, *flags); +- } +- +- gprintk(XENLOG_WARNING, +- "dom%d port %d lost event (too many queue changes)\n", +- d->domain_id, evtchn->port); +- return NULL; +-} +- + static int try_set_link(event_word_t *word, event_word_t *w, uint32_t link) + { + event_word_t new, old; +@@ -168,6 +136,9 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + event_word_t *word; + unsigned long flags; + bool_t was_pending; ++ struct evtchn_fifo_queue *q, *old_q; ++ unsigned int try; ++ bool linked = true; + + port = evtchn->port; + word = evtchn_fifo_word_from_port(d, port); +@@ -182,17 +153,67 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + return; + } + ++ /* ++ * Lock all queues related to the event channel (in case of a queue change ++ * this might be two). 
++ * It is mandatory to do that before setting and testing the PENDING bit ++ * and to hold the current queue lock until the event has been put into the ++ * list of pending events in order to avoid waking up a guest without the ++ * event being visibly pending in the guest. ++ */ ++ for ( try = 0; try < 3; try++ ) ++ { ++ union evtchn_fifo_lastq lastq; ++ const struct vcpu *old_v; ++ ++ lastq.raw = read_atomic(&evtchn->fifo_lastq); ++ old_v = d->vcpu[lastq.last_vcpu_id]; ++ ++ q = &v->evtchn_fifo->queue[evtchn->priority]; ++ old_q = &old_v->evtchn_fifo->queue[lastq.last_priority]; ++ ++ if ( q == old_q ) ++ spin_lock_irqsave(&q->lock, flags); ++ else if ( q < old_q ) ++ { ++ spin_lock_irqsave(&q->lock, flags); ++ spin_lock(&old_q->lock); ++ } ++ else ++ { ++ spin_lock_irqsave(&old_q->lock, flags); ++ spin_lock(&q->lock); ++ } ++ ++ lastq.raw = read_atomic(&evtchn->fifo_lastq); ++ old_v = d->vcpu[lastq.last_vcpu_id]; ++ if ( q == &v->evtchn_fifo->queue[evtchn->priority] && ++ old_q == &old_v->evtchn_fifo->queue[lastq.last_priority] ) ++ break; ++ ++ if ( q != old_q ) ++ spin_unlock(&old_q->lock); ++ spin_unlock_irqrestore(&q->lock, flags); ++ } ++ + was_pending = guest_test_and_set_bit(d, EVTCHN_FIFO_PENDING, word); + ++ /* If we didn't get the lock bail out. */ ++ if ( try == 3 ) ++ { ++ gprintk(XENLOG_WARNING, ++ "%pd port %u lost event (too many queue changes)\n", ++ d, evtchn->port); ++ goto done; ++ } ++ + /* + * Link the event if it unmasked and not already linked. + */ + if ( !guest_test_bit(d, EVTCHN_FIFO_MASKED, word) && + !guest_test_bit(d, EVTCHN_FIFO_LINKED, word) ) + { +- struct evtchn_fifo_queue *q, *old_q; + event_word_t *tail_word; +- bool_t linked = 0; + + /* + * Control block not mapped. The guest must not unmask an +@@ -203,25 +224,11 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + { + printk(XENLOG_G_WARNING + "%pv has no FIFO event channel control block\n", v); +- goto done; ++ goto unlock; + } + +- /* +- * No locking around getting the queue. This may race with +- * changing the priority but we are allowed to signal the +- * event once on the old priority. +- */ +- q = &v->evtchn_fifo->queue[evtchn->priority]; +- +- old_q = lock_old_queue(d, evtchn, &flags); +- if ( !old_q ) +- goto done; +- + if ( guest_test_and_set_bit(d, EVTCHN_FIFO_LINKED, word) ) +- { +- spin_unlock_irqrestore(&old_q->lock, flags); +- goto done; +- } ++ goto unlock; + + /* + * If this event was a tail, the old queue is now empty and +@@ -240,8 +247,8 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + lastq.last_priority = q->priority; + write_atomic(&evtchn->fifo_lastq, lastq.raw); + +- spin_unlock_irqrestore(&old_q->lock, flags); +- spin_lock_irqsave(&q->lock, flags); ++ spin_unlock(&old_q->lock); ++ old_q = q; + } + + /* +@@ -254,6 +261,7 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + * If the queue is empty (i.e., we haven't linked to the new + * event), head must be updated. 
+ */ ++ linked = false; + if ( q->tail ) + { + tail_word = evtchn_fifo_word_from_port(d, q->tail); +@@ -262,15 +270,19 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + if ( !linked ) + write_atomic(q->head, port); + q->tail = port; ++ } + +- spin_unlock_irqrestore(&q->lock, flags); ++ unlock: ++ if ( q != old_q ) ++ spin_unlock(&old_q->lock); ++ spin_unlock_irqrestore(&q->lock, flags); + +- if ( !linked +- && !guest_test_and_set_bit(d, q->priority, +- &v->evtchn_fifo->control_block->ready) ) +- vcpu_mark_events_pending(v); +- } + done: ++ if ( !linked && ++ !guest_test_and_set_bit(d, q->priority, ++ &v->evtchn_fifo->control_block->ready) ) ++ vcpu_mark_events_pending(v); ++ + if ( !was_pending ) + evtchn_check_pollers(d, port); + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0001-introduce-unaligned.h.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0001-introduce-unaligned.h.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0001-introduce-unaligned.h.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0001-introduce-unaligned.h.patch 2022-07-13 14:06:12.000000000 +0100 @@ -0,0 +1,284 @@ +From 3453f57b52a84a522b864a5d01773e0911a2184e Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Mon, 18 Jan 2021 12:09:13 +0100 +Subject: [PATCH 1/5] introduce unaligned.h + +Rather than open-coding commonly used constructs in yet more places when +pulling in zstd decompression support (and its xxhash prereq), pull out +the custom bits into a commonly used header (for the hypervisor build; +the tool stack and stubdom builds of libxenguest will still remain in +need of similarly taking care of). For now this is limited to x86, where +custom logic isn't needed (considering this is going to be used in init +code only, even using alternatives patching to use MOVBE doesn't seem +worthwhile). + +For Arm64 with CONFIG_ACPI=y (due to efi-dom0.c's re-use of xz/crc32.c) +drop the not really necessary inclusion of xz's private.h. + +No change in generated code. + +Signed-off-by: Jan Beulich +Acked-by: Andrew Cooper + +Bug-Ubuntu: https://bugs.launchpad.net/bugs/1956166 +Origin: backport, http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=7c9f81687ad611515474b1c17afc2f79f19faef5 +[backport: xen/common/lzo.c: refresh 2 context lines.] 
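On x86 the helpers introduced below reduce to plain dereferences plus byte-order conversion, since the commit notes that no custom logic is needed there. An architecture with strict alignment requirements would have to supply its own asm/unaligned.h; a plausible byte-wise variant (an illustrative sketch, not code taken from this patch) could look like:

    #include <stdint.h>

    /* Assemble a little-endian 32-bit value one byte at a time, so the
     * access never requires the pointer to be 4-byte aligned. */
    static inline uint32_t get_unaligned_le32_bytewise(const void *p)
    {
        const uint8_t *b = p;

        return (uint32_t)b[0] |
               ((uint32_t)b[1] << 8) |
               ((uint32_t)b[2] << 16) |
               ((uint32_t)b[3] << 24);
    }

    static inline void put_unaligned_le32_bytewise(uint32_t val, void *p)
    {
        uint8_t *b = p;

        b[0] = val;
        b[1] = val >> 8;
        b[2] = val >> 16;
        b[3] = val >> 24;
    }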
+--- + xen/common/lz4/defs.h | 9 ++-- + xen/common/lzo.c | 7 ++- + xen/common/unlzo.c | 19 ++------ + xen/common/xz/crc32.c | 2 - + xen/common/xz/private.h | 23 +++------- + xen/include/asm-x86/unaligned.h | 6 +++ + xen/include/xen/unaligned.h | 79 +++++++++++++++++++++++++++++++++ + 7 files changed, 104 insertions(+), 41 deletions(-) + create mode 100644 xen/include/asm-x86/unaligned.h + create mode 100644 xen/include/xen/unaligned.h + +diff --git a/xen/common/lz4/defs.h b/xen/common/lz4/defs.h +index d886a4e122b8..4fbea2ac3dd4 100644 +--- a/xen/common/lz4/defs.h ++++ b/xen/common/lz4/defs.h +@@ -10,18 +10,21 @@ + + #ifdef __XEN__ + #include +-#endif ++#include ++#else + +-static inline u16 INIT get_unaligned_le16(const void *p) ++static inline u16 get_unaligned_le16(const void *p) + { + return le16_to_cpup(p); + } + +-static inline u32 INIT get_unaligned_le32(const void *p) ++static inline u32 get_unaligned_le32(const void *p) + { + return le32_to_cpup(p); + } + ++#endif ++ + /* + * Detects 64 bits mode + */ +diff --git a/xen/common/lzo.c b/xen/common/lzo.c +index 74831cb26836..f1cd1b58d27f 100644 +--- a/xen/common/lzo.c ++++ b/xen/common/lzo.c +@@ -97,13 +97,12 @@ + #ifdef __XEN__ + #include + #include ++#include ++#else ++#define get_unaligned_le16(_p) (*(u16 *)(_p)) + #endif + + #include +-#define get_unaligned(_p) (*(_p)) +-#define put_unaligned(_val,_p) (*(_p)=_val) +-#define get_unaligned_le16(_p) (*(u16 *)(_p)) +-#define get_unaligned_le32(_p) (*(u32 *)(_p)) + + static noinline size_t + lzo1x_1_do_compress(const unsigned char *in, size_t in_len, +diff --git a/xen/common/unlzo.c b/xen/common/unlzo.c +index 5ae6cf911e86..11f64fcf3b26 100644 +--- a/xen/common/unlzo.c ++++ b/xen/common/unlzo.c +@@ -34,30 +34,19 @@ + + #ifdef __XEN__ + #include +-#endif ++#include ++#else + +-#if 1 /* ndef CONFIG_??? */ +-static inline u16 INIT get_unaligned_be16(void *p) ++static inline u16 get_unaligned_be16(const void *p) + { + return be16_to_cpup(p); + } + +-static inline u32 INIT get_unaligned_be32(void *p) ++static inline u32 get_unaligned_be32(const void *p) + { + return be32_to_cpup(p); + } +-#else +-#include +- +-static inline u16 INIT get_unaligned_be16(void *p) +-{ +- return be16_to_cpu(__get_unaligned(p, 2)); +-} + +-static inline u32 INIT get_unaligned_be32(void *p) +-{ +- return be32_to_cpu(__get_unaligned(p, 4)); +-} + #endif + + static const unsigned char lzop_magic[] = { +diff --git a/xen/common/xz/crc32.c b/xen/common/xz/crc32.c +index af08ae2cf6e2..0708b6163812 100644 +--- a/xen/common/xz/crc32.c ++++ b/xen/common/xz/crc32.c +@@ -15,8 +15,6 @@ + * but they are bigger and use more memory for the lookup table. + */ + +-#include "private.h" +- + XZ_EXTERN uint32_t INITDATA xz_crc32_table[256]; + + XZ_EXTERN void INIT xz_crc32_init(void) +diff --git a/xen/common/xz/private.h b/xen/common/xz/private.h +index 7ea24892297f..511343fcc234 100644 +--- a/xen/common/xz/private.h ++++ b/xen/common/xz/private.h +@@ -13,34 +13,23 @@ + #ifdef __XEN__ + #include + #include +-#endif +- +-#define get_le32(p) le32_to_cpup((const uint32_t *)(p)) ++#include ++#else + +-#if 1 /* ndef CONFIG_??? 
*/ +-static inline u32 INIT get_unaligned_le32(void *p) ++static inline u32 get_unaligned_le32(const void *p) + { + return le32_to_cpup(p); + } + +-static inline void INIT put_unaligned_le32(u32 val, void *p) ++static inline void put_unaligned_le32(u32 val, void *p) + { + *(__force __le32*)p = cpu_to_le32(val); + } +-#else +-#include +- +-static inline u32 INIT get_unaligned_le32(void *p) +-{ +- return le32_to_cpu(__get_unaligned(p, 4)); +-} + +-static inline void INIT put_unaligned_le32(u32 val, void *p) +-{ +- __put_unaligned(cpu_to_le32(val), p, 4); +-} + #endif + ++#define get_le32(p) le32_to_cpup((const uint32_t *)(p)) ++ + #define false 0 + #define true 1 + +diff --git a/xen/include/asm-x86/unaligned.h b/xen/include/asm-x86/unaligned.h +new file mode 100644 +index 000000000000..6070801d4afd +--- /dev/null ++++ b/xen/include/asm-x86/unaligned.h +@@ -0,0 +1,6 @@ ++#ifndef __ASM_UNALIGNED_H__ ++#define __ASM_UNALIGNED_H__ ++ ++#include ++ ++#endif /* __ASM_UNALIGNED_H__ */ +diff --git a/xen/include/xen/unaligned.h b/xen/include/xen/unaligned.h +new file mode 100644 +index 000000000000..eef7ec73b658 +--- /dev/null ++++ b/xen/include/xen/unaligned.h +@@ -0,0 +1,79 @@ ++/* ++ * This header can be used by architectures where unaligned accesses work ++ * without faulting, and at least reasonably efficiently. Other architectures ++ * will need to have a custom asm/unaligned.h. ++ */ ++#ifndef __ASM_UNALIGNED_H__ ++#error "xen/unaligned.h should not be included directly - include asm/unaligned.h instead" ++#endif ++ ++#ifndef __XEN_UNALIGNED_H__ ++#define __XEN_UNALIGNED_H__ ++ ++#include ++#include ++ ++#define get_unaligned(p) (*(p)) ++#define put_unaligned(val, p) (*(p) = (val)) ++ ++static inline uint16_t get_unaligned_be16(const void *p) ++{ ++ return be16_to_cpup(p); ++} ++ ++static inline void put_unaligned_be16(uint16_t val, void *p) ++{ ++ *(__force __be16*)p = cpu_to_be16(val); ++} ++ ++static inline uint32_t get_unaligned_be32(const void *p) ++{ ++ return be32_to_cpup(p); ++} ++ ++static inline void put_unaligned_be32(uint32_t val, void *p) ++{ ++ *(__force __be32*)p = cpu_to_be32(val); ++} ++ ++static inline uint64_t get_unaligned_be64(const void *p) ++{ ++ return be64_to_cpup(p); ++} ++ ++static inline void put_unaligned_be64(uint64_t val, void *p) ++{ ++ *(__force __be64*)p = cpu_to_be64(val); ++} ++ ++static inline uint16_t get_unaligned_le16(const void *p) ++{ ++ return le16_to_cpup(p); ++} ++ ++static inline void put_unaligned_le16(uint16_t val, void *p) ++{ ++ *(__force __le16*)p = cpu_to_le16(val); ++} ++ ++static inline uint32_t get_unaligned_le32(const void *p) ++{ ++ return le32_to_cpup(p); ++} ++ ++static inline void put_unaligned_le32(uint32_t val, void *p) ++{ ++ *(__force __le32*)p = cpu_to_le32(val); ++} ++ ++static inline uint64_t get_unaligned_le64(const void *p) ++{ ++ return le64_to_cpup(p); ++} ++ ++static inline void put_unaligned_le64(uint64_t val, void *p) ++{ ++ *(__force __le64*)p = cpu_to_le64(val); ++} ++ ++#endif /* __XEN_UNALIGNED_H__ */ +-- +2.34.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0002-lib-introduce-xxhash.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0002-lib-introduce-xxhash.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0002-lib-introduce-xxhash.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0002-lib-introduce-xxhash.patch 2022-07-13 14:06:12.000000000 +0100 @@ -0,0 +1,888 @@ +From 7253046d49a835c7fc13de1bd3529ff66dd2e1df Mon Sep 17 00:00:00 2001 
+From: Jan Beulich +Date: Mon, 18 Jan 2021 12:10:34 +0100 +Subject: [PATCH 2/5] lib: introduce xxhash + +Taken from Linux at commit d89775fc929c ("lib/: replace HTTP links with +HTTPS ones"), but split into separate 32-bit and 64-bit sources, since +the immediate consumer (zstd) will need only the latter. + +Note that the building of this code is restricted to x86 for now because +of the need to sort asm/unaligned.h for Arm. + +Signed-off-by: Jan Beulich +Acked-by: Andrew Cooper + +Bug-Ubuntu: https://bugs.launchpad.net/bugs/1956166 +Origin: backport, http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=35d2960ae65f28106fdc5c2130f5f08fadca0e4c +[backport: additional changes: Makefile and Rules.mk, + based on much larger/unneeded commits, respectively: + commit f301f9a9e84f ("lib: collect library files in an archive") + commit fea2fab96356 ("libx86: introduce a libx86 shared library") + - xen/lib/Makefile: add objects xen/lib/xxhash{32,64}.o + - xen/Rules.mk: add dir xen/lib/] +--- + xen/Rules.mk | 1 + + xen/include/xen/xxhash.h | 259 ++++++++++++++++++++++++++++++++++ + xen/lib/Makefile | 2 + + xen/lib/xxhash32.c | 259 ++++++++++++++++++++++++++++++++++ + xen/lib/xxhash64.c | 294 +++++++++++++++++++++++++++++++++++++++ + 5 files changed, 815 insertions(+) + create mode 100644 xen/include/xen/xxhash.h + create mode 100644 xen/lib/Makefile + create mode 100644 xen/lib/xxhash32.c + create mode 100644 xen/lib/xxhash64.c + +diff --git a/xen/Rules.mk b/xen/Rules.mk +index 5337e206ee17..47c954425d69 100644 +--- a/xen/Rules.mk ++++ b/xen/Rules.mk +@@ -36,6 +36,7 @@ TARGET := $(BASEDIR)/xen + # Note that link order matters! + ALL_OBJS-y += $(BASEDIR)/common/built_in.o + ALL_OBJS-y += $(BASEDIR)/drivers/built_in.o ++ALL_OBJS-$(CONFIG_X86) += $(BASEDIR)/lib/built_in.o + ALL_OBJS-y += $(BASEDIR)/xsm/built_in.o + ALL_OBJS-y += $(BASEDIR)/arch/$(TARGET_ARCH)/built_in.o + ALL_OBJS-$(CONFIG_CRYPTO) += $(BASEDIR)/crypto/built_in.o +diff --git a/xen/include/xen/xxhash.h b/xen/include/xen/xxhash.h +new file mode 100644 +index 000000000000..6f2237cbcf8e +--- /dev/null ++++ b/xen/include/xen/xxhash.h +@@ -0,0 +1,259 @@ ++/* ++ * xxHash - Extremely Fast Hash algorithm ++ * Copyright (C) 2012-2016, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at: ++ * - xxHash homepage: https://cyan4973.github.io/xxHash/ ++ * - xxHash source repository: https://github.com/Cyan4973/xxHash ++ */ ++ ++/* ++ * Notice extracted from xxHash homepage: ++ * ++ * xxHash is an extremely fast Hash algorithm, running at RAM speed limits. ++ * It also successfully passes all tests from the SMHasher suite. ++ * ++ * Comparison (single thread, Windows Seven 32 bits, using SMHasher on a Core 2 ++ * Duo @3GHz) ++ * ++ * Name Speed Q.Score Author ++ * xxHash 5.4 GB/s 10 ++ * CrapWow 3.2 GB/s 2 Andrew ++ * MumurHash 3a 2.7 GB/s 10 Austin Appleby ++ * SpookyHash 2.0 GB/s 10 Bob Jenkins ++ * SBox 1.4 GB/s 9 Bret Mulvey ++ * Lookup3 1.2 GB/s 9 Bob Jenkins ++ * SuperFastHash 1.2 GB/s 1 Paul Hsieh ++ * CityHash64 1.05 GB/s 10 Pike & Alakuijala ++ * FNV 0.55 GB/s 5 Fowler, Noll, Vo ++ * CRC32 0.43 GB/s 9 ++ * MD5-32 0.33 GB/s 10 Ronald L. Rivest ++ * SHA1-32 0.28 GB/s 10 ++ * ++ * Q.Score is a measure of quality of the hash function. ++ * It depends on successfully passing SMHasher test set. ++ * 10 is a perfect score. ++ * ++ * A 64-bits version, named xxh64 offers much better speed, ++ * but for 64-bits applications only. ++ * Name Speed on 64 bits Speed on 32 bits ++ * xxh64 13.8 GB/s 1.9 GB/s ++ * xxh32 6.8 GB/s 6.0 GB/s ++ */ ++ ++#ifndef __XENXXHASH_H__ ++#define __XENXXHASH_H__ ++ ++#include ++ ++/*-**************************** ++ * Simple Hash Functions ++ *****************************/ ++ ++/** ++ * xxh32() - calculate the 32-bit hash of the input with a given seed. ++ * ++ * @input: The data to hash. ++ * @length: The length of the data to hash. ++ * @seed: The seed can be used to alter the result predictably. ++ * ++ * Speed on Core 2 Duo @ 3 GHz (single thread, SMHasher benchmark) : 5.4 GB/s ++ * ++ * Return: The 32-bit hash of the data. ++ */ ++uint32_t xxh32(const void *input, size_t length, uint32_t seed); ++ ++/** ++ * xxh64() - calculate the 64-bit hash of the input with a given seed. ++ * ++ * @input: The data to hash. ++ * @length: The length of the data to hash. ++ * @seed: The seed can be used to alter the result predictably. ++ * ++ * This function runs 2x faster on 64-bit systems, but slower on 32-bit systems. ++ * ++ * Return: The 64-bit hash of the data. ++ */ ++uint64_t xxh64(const void *input, size_t length, uint64_t seed); ++ ++/** ++ * xxhash() - calculate wordsize hash of the input with a given seed ++ * @input: The data to hash. ++ * @length: The length of the data to hash. ++ * @seed: The seed can be used to alter the result predictably. 
++ * ++ * If the hash does not need to be comparable between machines with ++ * different word sizes, this function will call whichever of xxh32() ++ * or xxh64() is faster. ++ * ++ * Return: wordsize hash of the data. ++ */ ++ ++static inline unsigned long xxhash(const void *input, size_t length, ++ uint64_t seed) ++{ ++#if BITS_PER_LONG == 64 ++ return xxh64(input, length, seed); ++#else ++ return xxh32(input, length, seed); ++#endif ++} ++ ++/*-**************************** ++ * Streaming Hash Functions ++ *****************************/ ++ ++/* ++ * These definitions are only meant to allow allocation of XXH state ++ * statically, on stack, or in a struct for example. ++ * Do not use members directly. ++ */ ++ ++/** ++ * struct xxh32_state - private xxh32 state, do not use members directly ++ */ ++struct xxh32_state { ++ uint32_t total_len_32; ++ uint32_t large_len; ++ uint32_t v1; ++ uint32_t v2; ++ uint32_t v3; ++ uint32_t v4; ++ uint32_t mem32[4]; ++ uint32_t memsize; ++}; ++ ++/** ++ * struct xxh32_state - private xxh64 state, do not use members directly ++ */ ++struct xxh64_state { ++ uint64_t total_len; ++ uint64_t v1; ++ uint64_t v2; ++ uint64_t v3; ++ uint64_t v4; ++ uint64_t mem64[4]; ++ uint32_t memsize; ++}; ++ ++/** ++ * xxh32_reset() - reset the xxh32 state to start a new hashing operation ++ * ++ * @state: The xxh32 state to reset. ++ * @seed: Initialize the hash state with this seed. ++ * ++ * Call this function on any xxh32_state to prepare for a new hashing operation. ++ */ ++void xxh32_reset(struct xxh32_state *state, uint32_t seed); ++ ++/** ++ * xxh32_update() - hash the data given and update the xxh32 state ++ * ++ * @state: The xxh32 state to update. ++ * @input: The data to hash. ++ * @length: The length of the data to hash. ++ * ++ * After calling xxh32_reset() call xxh32_update() as many times as necessary. ++ * ++ * Return: Zero on success, otherwise an error code. ++ */ ++int xxh32_update(struct xxh32_state *state, const void *input, size_t length); ++ ++/** ++ * xxh32_digest() - produce the current xxh32 hash ++ * ++ * @state: Produce the current xxh32 hash of this state. ++ * ++ * A hash value can be produced at any time. It is still possible to continue ++ * inserting input into the hash state after a call to xxh32_digest(), and ++ * generate new hashes later on, by calling xxh32_digest() again. ++ * ++ * Return: The xxh32 hash stored in the state. ++ */ ++uint32_t xxh32_digest(const struct xxh32_state *state); ++ ++/** ++ * xxh64_reset() - reset the xxh64 state to start a new hashing operation ++ * ++ * @state: The xxh64 state to reset. ++ * @seed: Initialize the hash state with this seed. ++ */ ++void xxh64_reset(struct xxh64_state *state, uint64_t seed); ++ ++/** ++ * xxh64_update() - hash the data given and update the xxh64 state ++ * @state: The xxh64 state to update. ++ * @input: The data to hash. ++ * @length: The length of the data to hash. ++ * ++ * After calling xxh64_reset() call xxh64_update() as many times as necessary. ++ * ++ * Return: Zero on success, otherwise an error code. ++ */ ++int xxh64_update(struct xxh64_state *state, const void *input, size_t length); ++ ++/** ++ * xxh64_digest() - produce the current xxh64 hash ++ * ++ * @state: Produce the current xxh64 hash of this state. ++ * ++ * A hash value can be produced at any time. It is still possible to continue ++ * inserting input into the hash state after a call to xxh64_digest(), and ++ * generate new hashes later on, by calling xxh64_digest() again. 
++ * ++ * Return: The xxh64 hash stored in the state. ++ */ ++uint64_t xxh64_digest(const struct xxh64_state *state); ++ ++/*-************************** ++ * Utils ++ ***************************/ ++ ++/** ++ * xxh32_copy_state() - copy the source state into the destination state ++ * ++ * @src: The source xxh32 state. ++ * @dst: The destination xxh32 state. ++ */ ++void xxh32_copy_state(struct xxh32_state *dst, const struct xxh32_state *src); ++ ++/** ++ * xxh64_copy_state() - copy the source state into the destination state ++ * ++ * @src: The source xxh64 state. ++ * @dst: The destination xxh64 state. ++ */ ++void xxh64_copy_state(struct xxh64_state *dst, const struct xxh64_state *src); ++ ++#endif /* __XENXXHASH_H__ */ +diff --git a/xen/lib/Makefile b/xen/lib/Makefile +new file mode 100644 +index 000000000000..922e09439a80 +--- /dev/null ++++ b/xen/lib/Makefile +@@ -0,0 +1,2 @@ ++obj-$(CONFIG_X86) += xxhash32.o ++obj-$(CONFIG_X86) += xxhash64.o +diff --git a/xen/lib/xxhash32.c b/xen/lib/xxhash32.c +new file mode 100644 +index 000000000000..e8d403e5ced6 +--- /dev/null ++++ b/xen/lib/xxhash32.c +@@ -0,0 +1,259 @@ ++/* ++ * xxHash - Extremely Fast Hash algorithm ++ * Copyright (C) 2012-2016, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). 
++ * ++ * You can contact the author at: ++ * - xxHash homepage: https://cyan4973.github.io/xxHash/ ++ * - xxHash source repository: https://github.com/Cyan4973/xxHash ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++ ++/*-************************************* ++ * Macros ++ **************************************/ ++#define xxh_rotl32(x, r) ((x << r) | (x >> (32 - r))) ++ ++#ifdef __LITTLE_ENDIAN ++# define XXH_CPU_LITTLE_ENDIAN 1 ++#else ++# define XXH_CPU_LITTLE_ENDIAN 0 ++#endif ++ ++/*-************************************* ++ * Constants ++ **************************************/ ++static const uint32_t PRIME32_1 = 2654435761U; ++static const uint32_t PRIME32_2 = 2246822519U; ++static const uint32_t PRIME32_3 = 3266489917U; ++static const uint32_t PRIME32_4 = 668265263U; ++static const uint32_t PRIME32_5 = 374761393U; ++ ++/*-************************** ++ * Utils ++ ***************************/ ++void xxh32_copy_state(struct xxh32_state *dst, const struct xxh32_state *src) ++{ ++ memcpy(dst, src, sizeof(*dst)); ++} ++ ++/*-*************************** ++ * Simple Hash Functions ++ ****************************/ ++static uint32_t xxh32_round(uint32_t seed, const uint32_t input) ++{ ++ seed += input * PRIME32_2; ++ seed = xxh_rotl32(seed, 13); ++ seed *= PRIME32_1; ++ return seed; ++} ++ ++uint32_t xxh32(const void *input, const size_t len, const uint32_t seed) ++{ ++ const uint8_t *p = (const uint8_t *)input; ++ const uint8_t *b_end = p + len; ++ uint32_t h32; ++ ++ if (len >= 16) { ++ const uint8_t *const limit = b_end - 16; ++ uint32_t v1 = seed + PRIME32_1 + PRIME32_2; ++ uint32_t v2 = seed + PRIME32_2; ++ uint32_t v3 = seed + 0; ++ uint32_t v4 = seed - PRIME32_1; ++ ++ do { ++ v1 = xxh32_round(v1, get_unaligned_le32(p)); ++ p += 4; ++ v2 = xxh32_round(v2, get_unaligned_le32(p)); ++ p += 4; ++ v3 = xxh32_round(v3, get_unaligned_le32(p)); ++ p += 4; ++ v4 = xxh32_round(v4, get_unaligned_le32(p)); ++ p += 4; ++ } while (p <= limit); ++ ++ h32 = xxh_rotl32(v1, 1) + xxh_rotl32(v2, 7) + ++ xxh_rotl32(v3, 12) + xxh_rotl32(v4, 18); ++ } else { ++ h32 = seed + PRIME32_5; ++ } ++ ++ h32 += (uint32_t)len; ++ ++ while (p + 4 <= b_end) { ++ h32 += get_unaligned_le32(p) * PRIME32_3; ++ h32 = xxh_rotl32(h32, 17) * PRIME32_4; ++ p += 4; ++ } ++ ++ while (p < b_end) { ++ h32 += (*p) * PRIME32_5; ++ h32 = xxh_rotl32(h32, 11) * PRIME32_1; ++ p++; ++ } ++ ++ h32 ^= h32 >> 15; ++ h32 *= PRIME32_2; ++ h32 ^= h32 >> 13; ++ h32 *= PRIME32_3; ++ h32 ^= h32 >> 16; ++ ++ return h32; ++} ++ ++/*-************************************************** ++ * Advanced Hash Functions ++ ***************************************************/ ++void xxh32_reset(struct xxh32_state *statePtr, const uint32_t seed) ++{ ++ /* use a local state for memcpy() to avoid strict-aliasing warnings */ ++ struct xxh32_state state; ++ ++ memset(&state, 0, sizeof(state)); ++ state.v1 = seed + PRIME32_1 + PRIME32_2; ++ state.v2 = seed + PRIME32_2; ++ state.v3 = seed + 0; ++ state.v4 = seed - PRIME32_1; ++ memcpy(statePtr, &state, sizeof(state)); ++} ++ ++int xxh32_update(struct xxh32_state *state, const void *input, const size_t len) ++{ ++ const uint8_t *p = (const uint8_t *)input; ++ const uint8_t *const b_end = p + len; ++ ++ if (input == NULL) ++ return -EINVAL; ++ ++ state->total_len_32 += (uint32_t)len; ++ state->large_len |= (len >= 16) | (state->total_len_32 >= 16); ++ ++ if (state->memsize + len < 16) { /* fill in tmp buffer */ ++ memcpy((uint8_t *)(state->mem32) + state->memsize, input, len); ++ state->memsize += 
(uint32_t)len; ++ return 0; ++ } ++ ++ if (state->memsize) { /* some data left from previous update */ ++ const uint32_t *p32 = state->mem32; ++ ++ memcpy((uint8_t *)(state->mem32) + state->memsize, input, ++ 16 - state->memsize); ++ ++ state->v1 = xxh32_round(state->v1, get_unaligned_le32(p32)); ++ p32++; ++ state->v2 = xxh32_round(state->v2, get_unaligned_le32(p32)); ++ p32++; ++ state->v3 = xxh32_round(state->v3, get_unaligned_le32(p32)); ++ p32++; ++ state->v4 = xxh32_round(state->v4, get_unaligned_le32(p32)); ++ p32++; ++ ++ p += 16-state->memsize; ++ state->memsize = 0; ++ } ++ ++ if (p <= b_end - 16) { ++ const uint8_t *const limit = b_end - 16; ++ uint32_t v1 = state->v1; ++ uint32_t v2 = state->v2; ++ uint32_t v3 = state->v3; ++ uint32_t v4 = state->v4; ++ ++ do { ++ v1 = xxh32_round(v1, get_unaligned_le32(p)); ++ p += 4; ++ v2 = xxh32_round(v2, get_unaligned_le32(p)); ++ p += 4; ++ v3 = xxh32_round(v3, get_unaligned_le32(p)); ++ p += 4; ++ v4 = xxh32_round(v4, get_unaligned_le32(p)); ++ p += 4; ++ } while (p <= limit); ++ ++ state->v1 = v1; ++ state->v2 = v2; ++ state->v3 = v3; ++ state->v4 = v4; ++ } ++ ++ if (p < b_end) { ++ memcpy(state->mem32, p, (size_t)(b_end-p)); ++ state->memsize = (uint32_t)(b_end-p); ++ } ++ ++ return 0; ++} ++ ++uint32_t xxh32_digest(const struct xxh32_state *state) ++{ ++ const uint8_t *p = (const uint8_t *)state->mem32; ++ const uint8_t *const b_end = (const uint8_t *)(state->mem32) + ++ state->memsize; ++ uint32_t h32; ++ ++ if (state->large_len) { ++ h32 = xxh_rotl32(state->v1, 1) + xxh_rotl32(state->v2, 7) + ++ xxh_rotl32(state->v3, 12) + xxh_rotl32(state->v4, 18); ++ } else { ++ h32 = state->v3 /* == seed */ + PRIME32_5; ++ } ++ ++ h32 += state->total_len_32; ++ ++ while (p + 4 <= b_end) { ++ h32 += get_unaligned_le32(p) * PRIME32_3; ++ h32 = xxh_rotl32(h32, 17) * PRIME32_4; ++ p += 4; ++ } ++ ++ while (p < b_end) { ++ h32 += (*p) * PRIME32_5; ++ h32 = xxh_rotl32(h32, 11) * PRIME32_1; ++ p++; ++ } ++ ++ h32 ^= h32 >> 15; ++ h32 *= PRIME32_2; ++ h32 ^= h32 >> 13; ++ h32 *= PRIME32_3; ++ h32 ^= h32 >> 16; ++ ++ return h32; ++} ++ +diff --git a/xen/lib/xxhash64.c b/xen/lib/xxhash64.c +new file mode 100644 +index 000000000000..ba6bcf152d6f +--- /dev/null ++++ b/xen/lib/xxhash64.c +@@ -0,0 +1,294 @@ ++/* ++ * xxHash - Extremely Fast Hash algorithm ++ * Copyright (C) 2012-2016, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at: ++ * - xxHash homepage: https://cyan4973.github.io/xxHash/ ++ * - xxHash source repository: https://github.com/Cyan4973/xxHash ++ */ ++ ++#include ++#include ++#include ++#include ++#include ++ ++/*-************************************* ++ * Macros ++ **************************************/ ++#define xxh_rotl64(x, r) ((x << r) | (x >> (64 - r))) ++ ++#ifdef __LITTLE_ENDIAN ++# define XXH_CPU_LITTLE_ENDIAN 1 ++#else ++# define XXH_CPU_LITTLE_ENDIAN 0 ++#endif ++ ++/*-************************************* ++ * Constants ++ **************************************/ ++static const uint64_t PRIME64_1 = 11400714785074694791ULL; ++static const uint64_t PRIME64_2 = 14029467366897019727ULL; ++static const uint64_t PRIME64_3 = 1609587929392839161ULL; ++static const uint64_t PRIME64_4 = 9650029242287828579ULL; ++static const uint64_t PRIME64_5 = 2870177450012600261ULL; ++ ++/*-************************** ++ * Utils ++ ***************************/ ++void xxh64_copy_state(struct xxh64_state *dst, const struct xxh64_state *src) ++{ ++ memcpy(dst, src, sizeof(*dst)); ++} ++ ++/*-*************************** ++ * Simple Hash Functions ++ ****************************/ ++static uint64_t xxh64_round(uint64_t acc, const uint64_t input) ++{ ++ acc += input * PRIME64_2; ++ acc = xxh_rotl64(acc, 31); ++ acc *= PRIME64_1; ++ return acc; ++} ++ ++static uint64_t xxh64_merge_round(uint64_t acc, uint64_t val) ++{ ++ val = xxh64_round(0, val); ++ acc ^= val; ++ acc = acc * PRIME64_1 + PRIME64_4; ++ return acc; ++} ++ ++uint64_t xxh64(const void *input, const size_t len, const uint64_t seed) ++{ ++ const uint8_t *p = (const uint8_t *)input; ++ const uint8_t *const b_end = p + len; ++ uint64_t h64; ++ ++ if (len >= 32) { ++ const uint8_t *const limit = b_end - 32; ++ uint64_t v1 = seed + PRIME64_1 + PRIME64_2; ++ uint64_t v2 = seed + PRIME64_2; ++ uint64_t v3 = seed + 0; ++ uint64_t v4 = seed - PRIME64_1; ++ ++ do { ++ v1 = xxh64_round(v1, get_unaligned_le64(p)); ++ p += 8; ++ v2 = xxh64_round(v2, get_unaligned_le64(p)); ++ p += 8; ++ v3 = xxh64_round(v3, get_unaligned_le64(p)); ++ p += 8; ++ v4 = xxh64_round(v4, get_unaligned_le64(p)); ++ p += 8; ++ } while (p <= limit); ++ ++ h64 = xxh_rotl64(v1, 1) + xxh_rotl64(v2, 7) + ++ xxh_rotl64(v3, 12) + xxh_rotl64(v4, 18); ++ h64 = xxh64_merge_round(h64, v1); ++ h64 = xxh64_merge_round(h64, v2); ++ h64 = xxh64_merge_round(h64, v3); ++ h64 = xxh64_merge_round(h64, v4); ++ ++ } else { ++ h64 = seed + PRIME64_5; ++ } ++ ++ h64 += (uint64_t)len; ++ ++ while (p + 8 <= b_end) { ++ const uint64_t k1 = xxh64_round(0, get_unaligned_le64(p)); ++ ++ h64 ^= k1; ++ h64 = 
xxh_rotl64(h64, 27) * PRIME64_1 + PRIME64_4; ++ p += 8; ++ } ++ ++ if (p + 4 <= b_end) { ++ h64 ^= (uint64_t)(get_unaligned_le32(p)) * PRIME64_1; ++ h64 = xxh_rotl64(h64, 23) * PRIME64_2 + PRIME64_3; ++ p += 4; ++ } ++ ++ while (p < b_end) { ++ h64 ^= (*p) * PRIME64_5; ++ h64 = xxh_rotl64(h64, 11) * PRIME64_1; ++ p++; ++ } ++ ++ h64 ^= h64 >> 33; ++ h64 *= PRIME64_2; ++ h64 ^= h64 >> 29; ++ h64 *= PRIME64_3; ++ h64 ^= h64 >> 32; ++ ++ return h64; ++} ++ ++/*-************************************************** ++ * Advanced Hash Functions ++ ***************************************************/ ++void xxh64_reset(struct xxh64_state *statePtr, const uint64_t seed) ++{ ++ /* use a local state for memcpy() to avoid strict-aliasing warnings */ ++ struct xxh64_state state; ++ ++ memset(&state, 0, sizeof(state)); ++ state.v1 = seed + PRIME64_1 + PRIME64_2; ++ state.v2 = seed + PRIME64_2; ++ state.v3 = seed + 0; ++ state.v4 = seed - PRIME64_1; ++ memcpy(statePtr, &state, sizeof(state)); ++} ++ ++int xxh64_update(struct xxh64_state *state, const void *input, const size_t len) ++{ ++ const uint8_t *p = (const uint8_t *)input; ++ const uint8_t *const b_end = p + len; ++ ++ if (input == NULL) ++ return -EINVAL; ++ ++ state->total_len += len; ++ ++ if (state->memsize + len < 32) { /* fill in tmp buffer */ ++ memcpy(((uint8_t *)state->mem64) + state->memsize, input, len); ++ state->memsize += (uint32_t)len; ++ return 0; ++ } ++ ++ if (state->memsize) { /* tmp buffer is full */ ++ uint64_t *p64 = state->mem64; ++ ++ memcpy(((uint8_t *)p64) + state->memsize, input, ++ 32 - state->memsize); ++ ++ state->v1 = xxh64_round(state->v1, get_unaligned_le64(p64)); ++ p64++; ++ state->v2 = xxh64_round(state->v2, get_unaligned_le64(p64)); ++ p64++; ++ state->v3 = xxh64_round(state->v3, get_unaligned_le64(p64)); ++ p64++; ++ state->v4 = xxh64_round(state->v4, get_unaligned_le64(p64)); ++ ++ p += 32 - state->memsize; ++ state->memsize = 0; ++ } ++ ++ if (p + 32 <= b_end) { ++ const uint8_t *const limit = b_end - 32; ++ uint64_t v1 = state->v1; ++ uint64_t v2 = state->v2; ++ uint64_t v3 = state->v3; ++ uint64_t v4 = state->v4; ++ ++ do { ++ v1 = xxh64_round(v1, get_unaligned_le64(p)); ++ p += 8; ++ v2 = xxh64_round(v2, get_unaligned_le64(p)); ++ p += 8; ++ v3 = xxh64_round(v3, get_unaligned_le64(p)); ++ p += 8; ++ v4 = xxh64_round(v4, get_unaligned_le64(p)); ++ p += 8; ++ } while (p <= limit); ++ ++ state->v1 = v1; ++ state->v2 = v2; ++ state->v3 = v3; ++ state->v4 = v4; ++ } ++ ++ if (p < b_end) { ++ memcpy(state->mem64, p, (size_t)(b_end-p)); ++ state->memsize = (uint32_t)(b_end - p); ++ } ++ ++ return 0; ++} ++ ++uint64_t xxh64_digest(const struct xxh64_state *state) ++{ ++ const uint8_t *p = (const uint8_t *)state->mem64; ++ const uint8_t *const b_end = (const uint8_t *)state->mem64 + ++ state->memsize; ++ uint64_t h64; ++ ++ if (state->total_len >= 32) { ++ const uint64_t v1 = state->v1; ++ const uint64_t v2 = state->v2; ++ const uint64_t v3 = state->v3; ++ const uint64_t v4 = state->v4; ++ ++ h64 = xxh_rotl64(v1, 1) + xxh_rotl64(v2, 7) + ++ xxh_rotl64(v3, 12) + xxh_rotl64(v4, 18); ++ h64 = xxh64_merge_round(h64, v1); ++ h64 = xxh64_merge_round(h64, v2); ++ h64 = xxh64_merge_round(h64, v3); ++ h64 = xxh64_merge_round(h64, v4); ++ } else { ++ h64 = state->v3 + PRIME64_5; ++ } ++ ++ h64 += (uint64_t)state->total_len; ++ ++ while (p + 8 <= b_end) { ++ const uint64_t k1 = xxh64_round(0, get_unaligned_le64(p)); ++ ++ h64 ^= k1; ++ h64 = xxh_rotl64(h64, 27) * PRIME64_1 + PRIME64_4; ++ p += 8; ++ } ++ ++ if (p + 4 <= 
b_end) { ++ h64 ^= (uint64_t)(get_unaligned_le32(p)) * PRIME64_1; ++ h64 = xxh_rotl64(h64, 23) * PRIME64_2 + PRIME64_3; ++ p += 4; ++ } ++ ++ while (p < b_end) { ++ h64 ^= (*p) * PRIME64_5; ++ h64 = xxh_rotl64(h64, 11) * PRIME64_1; ++ p++; ++ } ++ ++ h64 ^= h64 >> 33; ++ h64 *= PRIME64_2; ++ h64 ^= h64 >> 29; ++ h64 *= PRIME64_3; ++ h64 ^= h64 >> 32; ++ ++ return h64; ++} +-- +2.34.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0003-x86-Dom0-support-zstd-compressed-kernels.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0003-x86-Dom0-support-zstd-compressed-kernels.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0003-x86-Dom0-support-zstd-compressed-kernels.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0003-x86-Dom0-support-zstd-compressed-kernels.patch 2022-07-13 14:06:12.000000000 +0100 @@ -0,0 +1,6404 @@ +From 95becb20279ede2cc0b87e0311f43911997a53e7 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Mon, 18 Jan 2021 12:12:23 +0100 +Subject: [PATCH 3/5] x86/Dom0: support zstd compressed kernels + +Taken from Linux at commit 1c4dd334df3a ("lib: decompress_unzstd: Limit +output size") for unzstd.c (renamed from decompress_unzstd.c) and +36f9ff9e03de ("lib: Fix fall-through warnings for Clang") for zstd/, +with bits from linux/zstd.h merged into suitable other headers. + +To limit the editing necessary, introduce ptrdiff_t. + +Signed-off-by: Jan Beulich +Acked-by: Andrew Cooper + +Bug-Ubuntu: https://bugs.launchpad.net/bugs/1956166 +Origin: backport, http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=d6627cf1b63ce57a6a7e2c1800dbc50eed742c32 +[backport: xen/common/Makefile: remove 'lzo' from list, + and refresh 1 context line.] +--- + xen/common/Makefile | 2 +- + xen/common/decompress.c | 3 + + xen/common/unzstd.c | 308 ++++ + xen/common/zstd/bitstream.h | 380 +++++ + xen/common/zstd/decompress.c | 2496 ++++++++++++++++++++++++++++++ + xen/common/zstd/entropy_common.c | 243 +++ + xen/common/zstd/error_private.h | 110 ++ + xen/common/zstd/fse.h | 575 +++++++ + xen/common/zstd/fse_decompress.c | 324 ++++ + xen/common/zstd/huf.h | 212 +++ + xen/common/zstd/huf_decompress.c | 960 ++++++++++++ + xen/common/zstd/mem.h | 151 ++ + xen/common/zstd/zstd_common.c | 74 + + xen/common/zstd/zstd_internal.h | 372 +++++ + xen/include/asm-arm/types.h | 6 + + xen/include/asm-x86/types.h | 6 + + xen/include/xen/decompress.h | 2 +- + 17 files changed, 6222 insertions(+), 2 deletions(-) + create mode 100644 xen/common/unzstd.c + create mode 100644 xen/common/zstd/bitstream.h + create mode 100644 xen/common/zstd/decompress.c + create mode 100644 xen/common/zstd/entropy_common.c + create mode 100644 xen/common/zstd/error_private.h + create mode 100644 xen/common/zstd/fse.h + create mode 100644 xen/common/zstd/fse_decompress.c + create mode 100644 xen/common/zstd/huf.h + create mode 100644 xen/common/zstd/huf_decompress.c + create mode 100644 xen/common/zstd/mem.h + create mode 100644 xen/common/zstd/zstd_common.c + create mode 100644 xen/common/zstd/zstd_internal.h + +diff --git a/xen/common/Makefile b/xen/common/Makefile +index 24d4752ccc55..c4dceff97842 100644 +--- a/xen/common/Makefile ++++ b/xen/common/Makefile +@@ -66,7 +66,7 @@ obj-bin-y += warning.init.o + obj-$(CONFIG_XENOPROF) += xenoprof.o + obj-y += xmalloc_tlsf.o + +-obj-bin-$(CONFIG_X86) += $(foreach n,decompress bunzip2 unxz unlzma unlzo unlz4 earlycpio,$(n).init.o) ++obj-bin-$(CONFIG_X86) += $(foreach n,decompress bunzip2 unxz unlzma unlzo unlz4 unzstd 
earlycpio,$(n).init.o) + + + obj-$(CONFIG_COMPAT) += $(addprefix compat/,domain.o kernel.o memory.o multicall.o xlat.o) +diff --git a/xen/common/decompress.c b/xen/common/decompress.c +index 9d6e0c4ab075..79e60f4802d5 100644 +--- a/xen/common/decompress.c ++++ b/xen/common/decompress.c +@@ -31,5 +31,8 @@ int __init decompress(void *inbuf, unsigned int len, void *outbuf) + if ( len >= 2 && !memcmp(inbuf, "\x02\x21", 2) ) + return unlz4(inbuf, len, NULL, NULL, outbuf, NULL, error); + ++ if ( len >= 4 && !memcmp(inbuf, "\x28\xb5\x2f\xfd", 4) ) ++ return unzstd(inbuf, len, NULL, NULL, outbuf, NULL, error); ++ + return 1; + } +diff --git a/xen/common/unzstd.c b/xen/common/unzstd.c +new file mode 100644 +index 000000000000..a10761642764 +--- /dev/null ++++ b/xen/common/unzstd.c +@@ -0,0 +1,308 @@ ++// SPDX-License-Identifier: GPL-2.0 ++ ++/* ++ * Important notes about in-place decompression ++ * ++ * At least on x86, the kernel is decompressed in place: the compressed data ++ * is placed to the end of the output buffer, and the decompressor overwrites ++ * most of the compressed data. There must be enough safety margin to ++ * guarantee that the write position is always behind the read position. ++ * ++ * The safety margin for ZSTD with a 128 KB block size is calculated below. ++ * Note that the margin with ZSTD is bigger than with GZIP or XZ! ++ * ++ * The worst case for in-place decompression is that the beginning of ++ * the file is compressed extremely well, and the rest of the file is ++ * uncompressible. Thus, we must look for worst-case expansion when the ++ * compressor is encoding uncompressible data. ++ * ++ * The structure of the .zst file in case of a compresed kernel is as follows. ++ * Maximum sizes (as bytes) of the fields are in parenthesis. ++ * ++ * Frame Header: (18) ++ * Blocks: (N) ++ * Checksum: (4) ++ * ++ * The frame header and checksum overhead is at most 22 bytes. ++ * ++ * ZSTD stores the data in blocks. Each block has a header whose size is ++ * a 3 bytes. After the block header, there is up to 128 KB of payload. ++ * The maximum uncompressed size of the payload is 128 KB. The minimum ++ * uncompressed size of the payload is never less than the payload size ++ * (excluding the block header). ++ * ++ * The assumption, that the uncompressed size of the payload is never ++ * smaller than the payload itself, is valid only when talking about ++ * the payload as a whole. It is possible that the payload has parts where ++ * the decompressor consumes more input than it produces output. Calculating ++ * the worst case for this would be tricky. Instead of trying to do that, ++ * let's simply make sure that the decompressor never overwrites any bytes ++ * of the payload which it is currently reading. ++ * ++ * Now we have enough information to calculate the safety margin. We need ++ * - 22 bytes for the .zst file format headers; ++ * - 3 bytes per every 128 KiB of uncompressed size (one block header per ++ * block); and ++ * - 128 KiB (biggest possible zstd block size) to make sure that the ++ * decompressor never overwrites anything from the block it is currently ++ * reading. 
++ * ++ * We get the following formula: ++ * ++ * safety_margin = 22 + uncompressed_size * 3 / 131072 + 131072 ++ * <= 22 + (uncompressed_size >> 15) + 131072 ++ */ ++ ++#include "decompress.h" ++ ++#include "zstd/entropy_common.c" ++#include "zstd/fse_decompress.c" ++#include "zstd/huf_decompress.c" ++#include "zstd/zstd_common.c" ++#include "zstd/decompress.c" ++ ++/* 128MB is the maximum window size supported by zstd. */ ++#define ZSTD_WINDOWSIZE_MAX (1 << ZSTD_WINDOWLOG_MAX) ++/* ++ * Size of the input and output buffers in multi-call mode. ++ * Pick a larger size because it isn't used during kernel decompression, ++ * since that is single pass, and we have to allocate a large buffer for ++ * zstd's window anyway. The larger size speeds up initramfs decompression. ++ */ ++#define ZSTD_IOBUF_SIZE (1 << 17) ++ ++static int INIT handle_zstd_error(size_t ret, void (*error)(const char *x)) ++{ ++ const int err = ZSTD_getErrorCode(ret); ++ ++ if (!ZSTD_isError(ret)) ++ return 0; ++ ++ switch (err) { ++ case ZSTD_error_memory_allocation: ++ error("ZSTD decompressor ran out of memory"); ++ break; ++ case ZSTD_error_prefix_unknown: ++ error("Input is not in the ZSTD format (wrong magic bytes)"); ++ break; ++ case ZSTD_error_dstSize_tooSmall: ++ case ZSTD_error_corruption_detected: ++ case ZSTD_error_checksum_wrong: ++ error("ZSTD-compressed data is corrupt"); ++ break; ++ default: ++ error("ZSTD-compressed data is probably corrupt"); ++ break; ++ } ++ return -1; ++} ++ ++/* ++ * Handle the case where we have the entire input and output in one segment. ++ * We can allocate less memory (no circular buffer for the sliding window), ++ * and avoid some memcpy() calls. ++ */ ++static int INIT decompress_single(const u8 *in_buf, long in_len, u8 *out_buf, ++ long out_len, unsigned int *in_pos, ++ void (*error)(const char *x)) ++{ ++ const size_t wksp_size = ZSTD_DCtxWorkspaceBound(); ++ void *wksp = large_malloc(wksp_size); ++ ZSTD_DCtx *dctx = ZSTD_initDCtx(wksp, wksp_size); ++ int err; ++ size_t ret; ++ ++ if (dctx == NULL) { ++ error("Out of memory while allocating ZSTD_DCtx"); ++ err = -1; ++ goto out; ++ } ++ /* ++ * Find out how large the frame actually is, there may be junk at ++ * the end of the frame that ZSTD_decompressDCtx() can't handle. ++ */ ++ ret = ZSTD_findFrameCompressedSize(in_buf, in_len); ++ err = handle_zstd_error(ret, error); ++ if (err) ++ goto out; ++ in_len = (long)ret; ++ ++ ret = ZSTD_decompressDCtx(dctx, out_buf, out_len, in_buf, in_len); ++ err = handle_zstd_error(ret, error); ++ if (err) ++ goto out; ++ ++ if (in_pos != NULL) ++ *in_pos = in_len; ++ ++ err = 0; ++out: ++ if (wksp != NULL) ++ large_free(wksp); ++ return err; ++} ++ ++STATIC int INIT unzstd(unsigned char *in_buf, unsigned int in_len, ++ int (*fill)(void*, unsigned int), ++ int (*flush)(void*, unsigned int), ++ unsigned char *out_buf, ++ unsigned int *in_pos, ++ void (*error)(const char *x)) ++{ ++ ZSTD_inBuffer in; ++ ZSTD_outBuffer out; ++ ZSTD_frameParams params; ++ void *in_allocated = NULL; ++ void *out_allocated = NULL; ++ void *wksp = NULL; ++ size_t wksp_size; ++ ZSTD_DStream *dstream; ++ int err; ++ size_t ret; ++ /* ++ * ZSTD decompression code won't be happy if the buffer size is so big ++ * that its end address overflows. When the size is not provided, make ++ * it as big as possible without having the end address overflow. 
++ */ ++ unsigned long out_len = ULONG_MAX - (unsigned long)out_buf; ++ ++ if (fill == NULL && flush == NULL) ++ /* ++ * We can decompress faster and with less memory when we have a ++ * single chunk. ++ */ ++ return decompress_single(in_buf, in_len, out_buf, out_len, ++ in_pos, error); ++ ++ /* ++ * If in_buf is not provided, we must be using fill(), so allocate ++ * a large enough buffer. If it is provided, it must be at least ++ * ZSTD_IOBUF_SIZE large. ++ */ ++ if (in_buf == NULL) { ++ in_allocated = large_malloc(ZSTD_IOBUF_SIZE); ++ if (in_allocated == NULL) { ++ error("Out of memory while allocating input buffer"); ++ err = -1; ++ goto out; ++ } ++ in_buf = in_allocated; ++ in_len = 0; ++ } ++ /* Read the first chunk, since we need to decode the frame header. */ ++ if (fill != NULL) ++ in_len = fill(in_buf, ZSTD_IOBUF_SIZE); ++ if ((int)in_len < 0) { ++ error("ZSTD-compressed data is truncated"); ++ err = -1; ++ goto out; ++ } ++ /* Set the first non-empty input buffer. */ ++ in.src = in_buf; ++ in.pos = 0; ++ in.size = in_len; ++ /* Allocate the output buffer if we are using flush(). */ ++ if (flush != NULL) { ++ out_allocated = large_malloc(ZSTD_IOBUF_SIZE); ++ if (out_allocated == NULL) { ++ error("Out of memory while allocating output buffer"); ++ err = -1; ++ goto out; ++ } ++ out_buf = out_allocated; ++ out_len = ZSTD_IOBUF_SIZE; ++ } ++ /* Set the output buffer. */ ++ out.dst = out_buf; ++ out.pos = 0; ++ out.size = out_len; ++ ++ /* ++ * We need to know the window size to allocate the ZSTD_DStream. ++ * Since we are streaming, we need to allocate a buffer for the sliding ++ * window. The window size varies from 1 KB to ZSTD_WINDOWSIZE_MAX ++ * (8 MB), so it is important to use the actual value so as not to ++ * waste memory when it is smaller. ++ */ ++ ret = ZSTD_getFrameParams(¶ms, in.src, in.size); ++ err = handle_zstd_error(ret, error); ++ if (err) ++ goto out; ++ if (ret != 0) { ++ error("ZSTD-compressed data has an incomplete frame header"); ++ err = -1; ++ goto out; ++ } ++ if (params.windowSize > ZSTD_WINDOWSIZE_MAX) { ++ error("ZSTD-compressed data has too large a window size"); ++ err = -1; ++ goto out; ++ } ++ ++ /* ++ * Allocate the ZSTD_DStream now that we know how much memory is ++ * required. ++ */ ++ wksp_size = ZSTD_DStreamWorkspaceBound(params.windowSize); ++ wksp = large_malloc(wksp_size); ++ dstream = ZSTD_initDStream(params.windowSize, wksp, wksp_size); ++ if (dstream == NULL) { ++ error("Out of memory while allocating ZSTD_DStream"); ++ err = -1; ++ goto out; ++ } ++ ++ /* ++ * Decompression loop: ++ * Read more data if necessary (error if no more data can be read). ++ * Call the decompression function, which returns 0 when finished. ++ * Flush any data produced if using flush(). ++ */ ++ if (in_pos != NULL) ++ *in_pos = 0; ++ do { ++ /* ++ * If we need to reload data, either we have fill() and can ++ * try to get more data, or we don't and the input is truncated. ++ */ ++ if (in.pos == in.size) { ++ if (in_pos != NULL) ++ *in_pos += in.pos; ++ in_len = fill ? fill(in_buf, ZSTD_IOBUF_SIZE) : -1; ++ if ((int)in_len < 0) { ++ error("ZSTD-compressed data is truncated"); ++ err = -1; ++ goto out; ++ } ++ in.pos = 0; ++ in.size = in_len; ++ } ++ /* Returns zero when the frame is complete. */ ++ ret = ZSTD_decompressStream(dstream, &out, &in); ++ err = handle_zstd_error(ret, error); ++ if (err) ++ goto out; ++ /* Flush all of the data produced if using flush(). 
*/ ++ if (flush != NULL && out.pos > 0) { ++ if (out.pos != flush(out.dst, out.pos)) { ++ error("Failed to flush()"); ++ err = -1; ++ goto out; ++ } ++ out.pos = 0; ++ } ++ } while (ret != 0); ++ ++ if (in_pos != NULL) ++ *in_pos += in.pos; ++ ++ err = 0; ++out: ++ if (in_allocated != NULL) ++ large_free(in_allocated); ++ if (out_allocated != NULL) ++ large_free(out_allocated); ++ if (wksp != NULL) ++ large_free(wksp); ++ return err; ++} +diff --git a/xen/common/zstd/bitstream.h b/xen/common/zstd/bitstream.h +new file mode 100644 +index 000000000000..2b06d4551f03 +--- /dev/null ++++ b/xen/common/zstd/bitstream.h +@@ -0,0 +1,380 @@ ++/* ++ * bitstream ++ * Part of FSE library ++ * header file (to include) ++ * Copyright (C) 2013-2016, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at : ++ * - Source repository : https://github.com/Cyan4973/FiniteStateEntropy ++ */ ++#ifndef BITSTREAM_H_MODULE ++#define BITSTREAM_H_MODULE ++ ++/* ++* This API consists of small unitary functions, which must be inlined for best performance. ++* Since link-time-optimization is not available for all compilers, ++* these functions are defined into a .h to be included. ++*/ ++ ++/*-**************************************** ++* Dependencies ++******************************************/ ++#include "error_private.h" /* error codes and messages */ ++#include "mem.h" /* unaligned access routines */ ++ ++/*========================================= ++* Target specific ++=========================================*/ ++#define STREAM_ACCUMULATOR_MIN_32 25 ++#define STREAM_ACCUMULATOR_MIN_64 57 ++#define STREAM_ACCUMULATOR_MIN ((U32)(ZSTD_32bits() ? 
STREAM_ACCUMULATOR_MIN_32 : STREAM_ACCUMULATOR_MIN_64)) ++ ++/*-****************************************** ++* bitStream encoding API (write forward) ++********************************************/ ++/* bitStream can mix input from multiple sources. ++* A critical property of these streams is that they encode and decode in **reverse** direction. ++* So the first bit sequence you add will be the last to be read, like a LIFO stack. ++*/ ++typedef struct { ++ size_t bitContainer; ++ int bitPos; ++ char *startPtr; ++ char *ptr; ++ char *endPtr; ++} BIT_CStream_t; ++ ++ZSTD_STATIC size_t BIT_initCStream(BIT_CStream_t *bitC, void *dstBuffer, size_t dstCapacity); ++ZSTD_STATIC void BIT_addBits(BIT_CStream_t *bitC, size_t value, unsigned nbBits); ++ZSTD_STATIC void BIT_flushBits(BIT_CStream_t *bitC); ++ZSTD_STATIC size_t BIT_closeCStream(BIT_CStream_t *bitC); ++ ++/* Start with initCStream, providing the size of buffer to write into. ++* bitStream will never write outside of this buffer. ++* `dstCapacity` must be >= sizeof(bitD->bitContainer), otherwise @return will be an error code. ++* ++* bits are first added to a local register. ++* Local register is size_t, hence 64-bits on 64-bits systems, or 32-bits on 32-bits systems. ++* Writing data into memory is an explicit operation, performed by the flushBits function. ++* Hence keep track how many bits are potentially stored into local register to avoid register overflow. ++* After a flushBits, a maximum of 7 bits might still be stored into local register. ++* ++* Avoid storing elements of more than 24 bits if you want compatibility with 32-bits bitstream readers. ++* ++* Last operation is to close the bitStream. ++* The function returns the final size of CStream in bytes. ++* If data couldn't fit into `dstBuffer`, it will return a 0 ( == not storable) ++*/ ++ ++/*-******************************************** ++* bitStream decoding API (read backward) ++**********************************************/ ++typedef struct { ++ size_t bitContainer; ++ unsigned bitsConsumed; ++ const char *ptr; ++ const char *start; ++} BIT_DStream_t; ++ ++typedef enum { ++ BIT_DStream_unfinished = 0, ++ BIT_DStream_endOfBuffer = 1, ++ BIT_DStream_completed = 2, ++ BIT_DStream_overflow = 3 ++} BIT_DStream_status; /* result of BIT_reloadDStream() */ ++/* 1,2,4,8 would be better for bitmap combinations, but slows down performance a bit ... :( */ ++ ++ZSTD_STATIC size_t BIT_initDStream(BIT_DStream_t *bitD, const void *srcBuffer, size_t srcSize); ++ZSTD_STATIC size_t BIT_readBits(BIT_DStream_t *bitD, unsigned nbBits); ++ZSTD_STATIC BIT_DStream_status BIT_reloadDStream(BIT_DStream_t *bitD); ++ZSTD_STATIC unsigned BIT_endOfDStream(const BIT_DStream_t *bitD); ++ ++/* Start by invoking BIT_initDStream(). ++* A chunk of the bitStream is then stored into a local register. ++* Local register size is 64-bits on 64-bits systems, 32-bits on 32-bits systems (size_t). ++* You can then retrieve bitFields stored into the local register, **in reverse order**. ++* Local register is explicitly reloaded from memory by the BIT_reloadDStream() method. ++* A reload guarantee a minimum of ((8*sizeof(bitD->bitContainer))-7) bits when its result is BIT_DStream_unfinished. ++* Otherwise, it can be less than that, so proceed accordingly. ++* Checking if DStream has reached its end can be performed with BIT_endOfDStream(). 
++*/ ++ ++/*-**************************************** ++* unsafe API ++******************************************/ ++ZSTD_STATIC void BIT_addBitsFast(BIT_CStream_t *bitC, size_t value, unsigned nbBits); ++/* faster, but works only if value is "clean", meaning all high bits above nbBits are 0 */ ++ ++ZSTD_STATIC void BIT_flushBitsFast(BIT_CStream_t *bitC); ++/* unsafe version; does not check buffer overflow */ ++ ++ZSTD_STATIC size_t BIT_readBitsFast(BIT_DStream_t *bitD, unsigned nbBits); ++/* faster, but works only if nbBits >= 1 */ ++ ++/*-************************************************************** ++* Internal functions ++****************************************************************/ ++ZSTD_STATIC unsigned BIT_highbit32(register U32 val) { return 31 - __builtin_clz(val); } ++ ++/*===== Local Constants =====*/ ++static const unsigned BIT_mask[] = {0, 1, 3, 7, 0xF, 0x1F, 0x3F, 0x7F, 0xFF, ++ 0x1FF, 0x3FF, 0x7FF, 0xFFF, 0x1FFF, 0x3FFF, 0x7FFF, 0xFFFF, 0x1FFFF, ++ 0x3FFFF, 0x7FFFF, 0xFFFFF, 0x1FFFFF, 0x3FFFFF, 0x7FFFFF, 0xFFFFFF, 0x1FFFFFF, 0x3FFFFFF}; /* up to 26 bits */ ++ ++/*-************************************************************** ++* bitStream encoding ++****************************************************************/ ++/*! BIT_initCStream() : ++ * `dstCapacity` must be > sizeof(void*) ++ * @return : 0 if success, ++ otherwise an error code (can be tested using ERR_isError() ) */ ++ZSTD_STATIC size_t BIT_initCStream(BIT_CStream_t *bitC, void *startPtr, size_t dstCapacity) ++{ ++ bitC->bitContainer = 0; ++ bitC->bitPos = 0; ++ bitC->startPtr = (char *)startPtr; ++ bitC->ptr = bitC->startPtr; ++ bitC->endPtr = bitC->startPtr + dstCapacity - sizeof(bitC->ptr); ++ if (dstCapacity <= sizeof(bitC->ptr)) ++ return ERROR(dstSize_tooSmall); ++ return 0; ++} ++ ++/*! BIT_addBits() : ++ can add up to 26 bits into `bitC`. ++ Does not check for register overflow ! */ ++ZSTD_STATIC void BIT_addBits(BIT_CStream_t *bitC, size_t value, unsigned nbBits) ++{ ++ bitC->bitContainer |= (value & BIT_mask[nbBits]) << bitC->bitPos; ++ bitC->bitPos += nbBits; ++} ++ ++/*! BIT_addBitsFast() : ++ * works only if `value` is _clean_, meaning all high bits above nbBits are 0 */ ++ZSTD_STATIC void BIT_addBitsFast(BIT_CStream_t *bitC, size_t value, unsigned nbBits) ++{ ++ bitC->bitContainer |= value << bitC->bitPos; ++ bitC->bitPos += nbBits; ++} ++ ++/*! BIT_flushBitsFast() : ++ * unsafe version; does not check buffer overflow */ ++ZSTD_STATIC void BIT_flushBitsFast(BIT_CStream_t *bitC) ++{ ++ size_t const nbBytes = bitC->bitPos >> 3; ++ ZSTD_writeLEST(bitC->ptr, bitC->bitContainer); ++ bitC->ptr += nbBytes; ++ bitC->bitPos &= 7; ++ bitC->bitContainer >>= nbBytes * 8; /* if bitPos >= sizeof(bitContainer)*8 --> undefined behavior */ ++} ++ ++/*! BIT_flushBits() : ++ * safe version; check for buffer overflow, and prevents it. ++ * note : does not signal buffer overflow. This will be revealed later on using BIT_closeCStream() */ ++ZSTD_STATIC void BIT_flushBits(BIT_CStream_t *bitC) ++{ ++ size_t const nbBytes = bitC->bitPos >> 3; ++ ZSTD_writeLEST(bitC->ptr, bitC->bitContainer); ++ bitC->ptr += nbBytes; ++ if (bitC->ptr > bitC->endPtr) ++ bitC->ptr = bitC->endPtr; ++ bitC->bitPos &= 7; ++ bitC->bitContainer >>= nbBytes * 8; /* if bitPos >= sizeof(bitContainer)*8 --> undefined behavior */ ++} ++ ++/*! 
BIT_closeCStream() : ++ * @return : size of CStream, in bytes, ++ or 0 if it could not fit into dstBuffer */ ++ZSTD_STATIC size_t BIT_closeCStream(BIT_CStream_t *bitC) ++{ ++ BIT_addBitsFast(bitC, 1, 1); /* endMark */ ++ BIT_flushBits(bitC); ++ ++ if (bitC->ptr >= bitC->endPtr) ++ return 0; /* doesn't fit within authorized budget : cancel */ ++ ++ return (bitC->ptr - bitC->startPtr) + (bitC->bitPos > 0); ++} ++ ++/*-******************************************************** ++* bitStream decoding ++**********************************************************/ ++/*! BIT_initDStream() : ++* Initialize a BIT_DStream_t. ++* `bitD` : a pointer to an already allocated BIT_DStream_t structure. ++* `srcSize` must be the *exact* size of the bitStream, in bytes. ++* @return : size of stream (== srcSize) or an errorCode if a problem is detected ++*/ ++ZSTD_STATIC size_t BIT_initDStream(BIT_DStream_t *bitD, const void *srcBuffer, size_t srcSize) ++{ ++ if (srcSize < 1) { ++ memset(bitD, 0, sizeof(*bitD)); ++ return ERROR(srcSize_wrong); ++ } ++ ++ if (srcSize >= sizeof(bitD->bitContainer)) { /* normal case */ ++ bitD->start = (const char *)srcBuffer; ++ bitD->ptr = (const char *)srcBuffer + srcSize - sizeof(bitD->bitContainer); ++ bitD->bitContainer = ZSTD_readLEST(bitD->ptr); ++ { ++ BYTE const lastByte = ((const BYTE *)srcBuffer)[srcSize - 1]; ++ bitD->bitsConsumed = lastByte ? 8 - BIT_highbit32(lastByte) : 0; /* ensures bitsConsumed is always set */ ++ if (lastByte == 0) ++ return ERROR(GENERIC); /* endMark not present */ ++ } ++ } else { ++ bitD->start = (const char *)srcBuffer; ++ bitD->ptr = bitD->start; ++ bitD->bitContainer = *(const BYTE *)(bitD->start); ++ switch (srcSize) { ++ case 7: bitD->bitContainer += (size_t)(((const BYTE *)(srcBuffer))[6]) << (sizeof(bitD->bitContainer) * 8 - 16); ++ /* fallthrough */ ++ case 6: bitD->bitContainer += (size_t)(((const BYTE *)(srcBuffer))[5]) << (sizeof(bitD->bitContainer) * 8 - 24); ++ /* fallthrough */ ++ case 5: bitD->bitContainer += (size_t)(((const BYTE *)(srcBuffer))[4]) << (sizeof(bitD->bitContainer) * 8 - 32); ++ /* fallthrough */ ++ case 4: bitD->bitContainer += (size_t)(((const BYTE *)(srcBuffer))[3]) << 24; ++ /* fallthrough */ ++ case 3: bitD->bitContainer += (size_t)(((const BYTE *)(srcBuffer))[2]) << 16; ++ /* fallthrough */ ++ case 2: bitD->bitContainer += (size_t)(((const BYTE *)(srcBuffer))[1]) << 8; ++ /* fallthrough */ ++ default:; ++ } ++ { ++ BYTE const lastByte = ((const BYTE *)srcBuffer)[srcSize - 1]; ++ bitD->bitsConsumed = lastByte ? 8 - BIT_highbit32(lastByte) : 0; ++ if (lastByte == 0) ++ return ERROR(GENERIC); /* endMark not present */ ++ } ++ bitD->bitsConsumed += (U32)(sizeof(bitD->bitContainer) - srcSize) * 8; ++ } ++ ++ return srcSize; ++} ++ ++ZSTD_STATIC size_t BIT_getUpperBits(size_t bitContainer, U32 const start) { return bitContainer >> start; } ++ ++ZSTD_STATIC size_t BIT_getMiddleBits(size_t bitContainer, U32 const start, U32 const nbBits) { return (bitContainer >> start) & BIT_mask[nbBits]; } ++ ++ZSTD_STATIC size_t BIT_getLowerBits(size_t bitContainer, U32 const nbBits) { return bitContainer & BIT_mask[nbBits]; } ++ ++/*! BIT_lookBits() : ++ * Provides next n bits from local register. ++ * local register is not modified. ++ * On 32-bits, maxNbBits==24. ++ * On 64-bits, maxNbBits==56. 
++ * @return : value extracted ++ */ ++ZSTD_STATIC size_t BIT_lookBits(const BIT_DStream_t *bitD, U32 nbBits) ++{ ++ U32 const bitMask = sizeof(bitD->bitContainer) * 8 - 1; ++ return ((bitD->bitContainer << (bitD->bitsConsumed & bitMask)) >> 1) >> ((bitMask - nbBits) & bitMask); ++} ++ ++/*! BIT_lookBitsFast() : ++* unsafe version; only works only if nbBits >= 1 */ ++ZSTD_STATIC size_t BIT_lookBitsFast(const BIT_DStream_t *bitD, U32 nbBits) ++{ ++ U32 const bitMask = sizeof(bitD->bitContainer) * 8 - 1; ++ return (bitD->bitContainer << (bitD->bitsConsumed & bitMask)) >> (((bitMask + 1) - nbBits) & bitMask); ++} ++ ++ZSTD_STATIC void BIT_skipBits(BIT_DStream_t *bitD, U32 nbBits) { bitD->bitsConsumed += nbBits; } ++ ++/*! BIT_readBits() : ++ * Read (consume) next n bits from local register and update. ++ * Pay attention to not read more than nbBits contained into local register. ++ * @return : extracted value. ++ */ ++ZSTD_STATIC size_t BIT_readBits(BIT_DStream_t *bitD, U32 nbBits) ++{ ++ size_t const value = BIT_lookBits(bitD, nbBits); ++ BIT_skipBits(bitD, nbBits); ++ return value; ++} ++ ++/*! BIT_readBitsFast() : ++* unsafe version; only works only if nbBits >= 1 */ ++ZSTD_STATIC size_t BIT_readBitsFast(BIT_DStream_t *bitD, U32 nbBits) ++{ ++ size_t const value = BIT_lookBitsFast(bitD, nbBits); ++ BIT_skipBits(bitD, nbBits); ++ return value; ++} ++ ++/*! BIT_reloadDStream() : ++* Refill `bitD` from buffer previously set in BIT_initDStream() . ++* This function is safe, it guarantees it will not read beyond src buffer. ++* @return : status of `BIT_DStream_t` internal register. ++ if status == BIT_DStream_unfinished, internal register is filled with >= (sizeof(bitD->bitContainer)*8 - 7) bits */ ++ZSTD_STATIC BIT_DStream_status BIT_reloadDStream(BIT_DStream_t *bitD) ++{ ++ if (bitD->bitsConsumed > (sizeof(bitD->bitContainer) * 8)) /* should not happen => corruption detected */ ++ return BIT_DStream_overflow; ++ ++ if (bitD->ptr >= bitD->start + sizeof(bitD->bitContainer)) { ++ bitD->ptr -= bitD->bitsConsumed >> 3; ++ bitD->bitsConsumed &= 7; ++ bitD->bitContainer = ZSTD_readLEST(bitD->ptr); ++ return BIT_DStream_unfinished; ++ } ++ if (bitD->ptr == bitD->start) { ++ if (bitD->bitsConsumed < sizeof(bitD->bitContainer) * 8) ++ return BIT_DStream_endOfBuffer; ++ return BIT_DStream_completed; ++ } ++ { ++ U32 nbBytes = bitD->bitsConsumed >> 3; ++ BIT_DStream_status result = BIT_DStream_unfinished; ++ if (bitD->ptr - nbBytes < bitD->start) { ++ nbBytes = (U32)(bitD->ptr - bitD->start); /* ptr > start */ ++ result = BIT_DStream_endOfBuffer; ++ } ++ bitD->ptr -= nbBytes; ++ bitD->bitsConsumed -= nbBytes * 8; ++ bitD->bitContainer = ZSTD_readLEST(bitD->ptr); /* reminder : srcSize > sizeof(bitD) */ ++ return result; ++ } ++} ++ ++/*! BIT_endOfDStream() : ++* @return Tells if DStream has exactly reached its end (all bits consumed). ++*/ ++ZSTD_STATIC unsigned BIT_endOfDStream(const BIT_DStream_t *DStream) ++{ ++ return ((DStream->ptr == DStream->start) && (DStream->bitsConsumed == sizeof(DStream->bitContainer) * 8)); ++} ++ ++#endif /* BITSTREAM_H_MODULE */ +diff --git a/xen/common/zstd/decompress.c b/xen/common/zstd/decompress.c +new file mode 100644 +index 000000000000..3d3ef136e5c2 +--- /dev/null ++++ b/xen/common/zstd/decompress.c +@@ -0,0 +1,2496 @@ ++/** ++ * Copyright (c) 2016-present, Yann Collet, Facebook, Inc. ++ * All rights reserved. ++ * ++ * This source code is licensed under the BSD-style license found in the ++ * LICENSE file in the root directory of https://github.com/facebook/zstd. 
++ * An additional grant of patent rights can be found in the PATENTS file in the ++ * same directory. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ */ ++ ++/* *************************************************************** ++* Tuning parameters ++*****************************************************************/ ++/*! ++* MAXWINDOWSIZE_DEFAULT : ++* maximum window size accepted by DStream, by default. ++* Frames requiring more memory will be rejected. ++*/ ++#ifndef ZSTD_MAXWINDOWSIZE_DEFAULT ++#define ZSTD_MAXWINDOWSIZE_DEFAULT ((1 << ZSTD_WINDOWLOG_MAX) + 1) /* defined within zstd.h */ ++#endif ++ ++/*-******************************************************* ++* Dependencies ++*********************************************************/ ++#include "fse.h" ++#include "huf.h" ++#include "mem.h" /* low level memory routines */ ++#include "zstd_internal.h" ++#include /* memcpy, memmove, memset */ ++ ++#define ZSTD_PREFETCH(ptr) __builtin_prefetch(ptr, 0, 0) ++ ++/*-************************************* ++* Macros ++***************************************/ ++#define ZSTD_isError ERR_isError /* for inlining */ ++#define FSE_isError ERR_isError ++#define HUF_isError ERR_isError ++ ++/*_******************************************************* ++* Memory operations ++**********************************************************/ ++static void INIT ZSTD_copy4(void *dst, const void *src) { memcpy(dst, src, 4); } ++ ++/*-************************************************************* ++* Context management ++***************************************************************/ ++typedef enum { ++ ZSTDds_getFrameHeaderSize, ++ ZSTDds_decodeFrameHeader, ++ ZSTDds_decodeBlockHeader, ++ ZSTDds_decompressBlock, ++ ZSTDds_decompressLastBlock, ++ ZSTDds_checkChecksum, ++ ZSTDds_decodeSkippableHeader, ++ ZSTDds_skipFrame ++} ZSTD_dStage; ++ ++typedef struct { ++ FSE_DTable LLTable[FSE_DTABLE_SIZE_U32(LLFSELog)]; ++ FSE_DTable OFTable[FSE_DTABLE_SIZE_U32(OffFSELog)]; ++ FSE_DTable MLTable[FSE_DTABLE_SIZE_U32(MLFSELog)]; ++ HUF_DTable hufTable[HUF_DTABLE_SIZE(HufLog)]; /* can accommodate HUF_decompress4X */ ++ U64 workspace[HUF_DECOMPRESS_WORKSPACE_SIZE_U32 / 2]; ++ U32 rep[ZSTD_REP_NUM]; ++} ZSTD_entropyTables_t; ++ ++struct ZSTD_DCtx_s { ++ const FSE_DTable *LLTptr; ++ const FSE_DTable *MLTptr; ++ const FSE_DTable *OFTptr; ++ const HUF_DTable *HUFptr; ++ ZSTD_entropyTables_t entropy; ++ const void *previousDstEnd; /* detect continuity */ ++ const void *base; /* start of curr segment */ ++ const void *vBase; /* virtual start of previous segment if it was just before curr one */ ++ const void *dictEnd; /* end of previous segment */ ++ size_t expected; ++ ZSTD_frameParams fParams; ++ blockType_e bType; /* used in ZSTD_decompressContinue(), to transfer blockType between header decoding and block decoding stages */ ++ ZSTD_dStage stage; ++ U32 litEntropy; ++ U32 fseEntropy; ++ struct xxh64_state xxhState; ++ size_t headerSize; ++ U32 dictID; ++ const BYTE *litPtr; ++ ZSTD_customMem customMem; ++ size_t litSize; ++ size_t rleSize; ++ BYTE litBuffer[ZSTD_BLOCKSIZE_ABSOLUTEMAX + WILDCOPY_OVERLENGTH]; ++ BYTE headerBuffer[ZSTD_FRAMEHEADERSIZE_MAX]; ++}; /* typedef'd to ZSTD_DCtx within "zstd.h" */ ++ ++size_t INIT 
ZSTD_DCtxWorkspaceBound(void) { return ZSTD_ALIGN(sizeof(ZSTD_stack)) + ZSTD_ALIGN(sizeof(ZSTD_DCtx)); } ++ ++size_t INIT ZSTD_decompressBegin(ZSTD_DCtx *dctx) ++{ ++ dctx->expected = ZSTD_frameHeaderSize_prefix; ++ dctx->stage = ZSTDds_getFrameHeaderSize; ++ dctx->previousDstEnd = NULL; ++ dctx->base = NULL; ++ dctx->vBase = NULL; ++ dctx->dictEnd = NULL; ++ dctx->entropy.hufTable[0] = (HUF_DTable)((HufLog)*0x1000001); /* cover both little and big endian */ ++ dctx->litEntropy = dctx->fseEntropy = 0; ++ dctx->dictID = 0; ++ ZSTD_STATIC_ASSERT(sizeof(dctx->entropy.rep) == sizeof(repStartValue)); ++ memcpy(dctx->entropy.rep, repStartValue, sizeof(repStartValue)); /* initial repcodes */ ++ dctx->LLTptr = dctx->entropy.LLTable; ++ dctx->MLTptr = dctx->entropy.MLTable; ++ dctx->OFTptr = dctx->entropy.OFTable; ++ dctx->HUFptr = dctx->entropy.hufTable; ++ return 0; ++} ++ ++ZSTD_DCtx *INIT ZSTD_createDCtx_advanced(ZSTD_customMem customMem) ++{ ++ ZSTD_DCtx *dctx; ++ ++ if (!customMem.customAlloc || !customMem.customFree) ++ return NULL; ++ ++ dctx = (ZSTD_DCtx *)ZSTD_malloc(sizeof(ZSTD_DCtx), customMem); ++ if (!dctx) ++ return NULL; ++ memcpy(&dctx->customMem, &customMem, sizeof(customMem)); ++ ZSTD_decompressBegin(dctx); ++ return dctx; ++} ++ ++ZSTD_DCtx *INIT ZSTD_initDCtx(void *workspace, size_t workspaceSize) ++{ ++ ZSTD_customMem const stackMem = ZSTD_initStack(workspace, workspaceSize); ++ return ZSTD_createDCtx_advanced(stackMem); ++} ++ ++size_t INIT ZSTD_freeDCtx(ZSTD_DCtx *dctx) ++{ ++ if (dctx == NULL) ++ return 0; /* support free on NULL */ ++ ZSTD_free(dctx, dctx->customMem); ++ return 0; /* reserved as a potential error code in the future */ ++} ++ ++void INIT ZSTD_copyDCtx(ZSTD_DCtx *dstDCtx, const ZSTD_DCtx *srcDCtx) ++{ ++ size_t const workSpaceSize = (ZSTD_BLOCKSIZE_ABSOLUTEMAX + WILDCOPY_OVERLENGTH) + ZSTD_frameHeaderSize_max; ++ memcpy(dstDCtx, srcDCtx, sizeof(ZSTD_DCtx) - workSpaceSize); /* no need to copy workspace */ ++} ++ ++STATIC size_t ZSTD_findFrameCompressedSize(const void *src, size_t srcSize); ++STATIC size_t ZSTD_decompressBegin_usingDict(ZSTD_DCtx *dctx, const void *dict, ++ size_t dictSize); ++ ++static void ZSTD_refDDict(ZSTD_DCtx *dstDCtx, const ZSTD_DDict *ddict); ++ ++/*-************************************************************* ++* Decompression section ++***************************************************************/ ++ ++/*! ZSTD_isFrame() : ++ * Tells if the content of `buffer` starts with a valid Frame Identifier. ++ * Note : Frame Identifier is 4 bytes. If `size < 4`, @return will always be 0. ++ * Note 2 : Legacy Frame Identifiers are considered valid only if Legacy Support is enabled. ++ * Note 3 : Skippable Frame Identifiers are considered valid. */ ++unsigned INIT ZSTD_isFrame(const void *buffer, size_t size) ++{ ++ if (size < 4) ++ return 0; ++ { ++ U32 const magic = ZSTD_readLE32(buffer); ++ if (magic == ZSTD_MAGICNUMBER) ++ return 1; ++ if ((magic & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) ++ return 1; ++ } ++ return 0; ++} ++ ++/** ZSTD_frameHeaderSize() : ++* srcSize must be >= ZSTD_frameHeaderSize_prefix. 
++* @return : size of the Frame Header */ ++static size_t INIT ZSTD_frameHeaderSize(const void *src, size_t srcSize) ++{ ++ if (srcSize < ZSTD_frameHeaderSize_prefix) ++ return ERROR(srcSize_wrong); ++ { ++ BYTE const fhd = ((const BYTE *)src)[4]; ++ U32 const dictID = fhd & 3; ++ U32 const singleSegment = (fhd >> 5) & 1; ++ U32 const fcsId = fhd >> 6; ++ return ZSTD_frameHeaderSize_prefix + !singleSegment + ZSTD_did_fieldSize[dictID] + ZSTD_fcs_fieldSize[fcsId] + (singleSegment && !fcsId); ++ } ++} ++ ++/** ZSTD_getFrameParams() : ++* decode Frame Header, or require larger `srcSize`. ++* @return : 0, `fparamsPtr` is correctly filled, ++* >0, `srcSize` is too small, result is expected `srcSize`, ++* or an error code, which can be tested using ZSTD_isError() */ ++size_t INIT ZSTD_getFrameParams(ZSTD_frameParams *fparamsPtr, const void *src, size_t srcSize) ++{ ++ const BYTE *ip = (const BYTE *)src; ++ ++ if (srcSize < ZSTD_frameHeaderSize_prefix) ++ return ZSTD_frameHeaderSize_prefix; ++ if (ZSTD_readLE32(src) != ZSTD_MAGICNUMBER) { ++ if ((ZSTD_readLE32(src) & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) { ++ if (srcSize < ZSTD_skippableHeaderSize) ++ return ZSTD_skippableHeaderSize; /* magic number + skippable frame length */ ++ memset(fparamsPtr, 0, sizeof(*fparamsPtr)); ++ fparamsPtr->frameContentSize = ZSTD_readLE32((const char *)src + 4); ++ fparamsPtr->windowSize = 0; /* windowSize==0 means a frame is skippable */ ++ return 0; ++ } ++ return ERROR(prefix_unknown); ++ } ++ ++ /* ensure there is enough `srcSize` to fully read/decode frame header */ ++ { ++ size_t const fhsize = ZSTD_frameHeaderSize(src, srcSize); ++ if (srcSize < fhsize) ++ return fhsize; ++ } ++ ++ { ++ BYTE const fhdByte = ip[4]; ++ size_t pos = 5; ++ U32 const dictIDSizeCode = fhdByte & 3; ++ U32 const checksumFlag = (fhdByte >> 2) & 1; ++ U32 const singleSegment = (fhdByte >> 5) & 1; ++ U32 const fcsID = fhdByte >> 6; ++ U32 const windowSizeMax = 1U << ZSTD_WINDOWLOG_MAX; ++ U32 windowSize = 0; ++ U32 dictID = 0; ++ U64 frameContentSize = 0; ++ if ((fhdByte & 0x08) != 0) ++ return ERROR(frameParameter_unsupported); /* reserved bits, which must be zero */ ++ if (!singleSegment) { ++ BYTE const wlByte = ip[pos++]; ++ U32 const windowLog = (wlByte >> 3) + ZSTD_WINDOWLOG_ABSOLUTEMIN; ++ if (windowLog > ZSTD_WINDOWLOG_MAX) ++ return ERROR(frameParameter_windowTooLarge); /* avoids issue with 1 << windowLog */ ++ windowSize = (1U << windowLog); ++ windowSize += (windowSize >> 3) * (wlByte & 7); ++ } ++ ++ switch (dictIDSizeCode) { ++ default: /* impossible */ ++ case 0: break; ++ case 1: ++ dictID = ip[pos]; ++ pos++; ++ break; ++ case 2: ++ dictID = ZSTD_readLE16(ip + pos); ++ pos += 2; ++ break; ++ case 3: ++ dictID = ZSTD_readLE32(ip + pos); ++ pos += 4; ++ break; ++ } ++ switch (fcsID) { ++ default: /* impossible */ ++ case 0: ++ if (singleSegment) ++ frameContentSize = ip[pos]; ++ break; ++ case 1: frameContentSize = ZSTD_readLE16(ip + pos) + 256; break; ++ case 2: frameContentSize = ZSTD_readLE32(ip + pos); break; ++ case 3: frameContentSize = ZSTD_readLE64(ip + pos); break; ++ } ++ if (!windowSize) ++ windowSize = (U32)frameContentSize; ++ if (windowSize > windowSizeMax) ++ return ERROR(frameParameter_windowTooLarge); ++ fparamsPtr->frameContentSize = frameContentSize; ++ fparamsPtr->windowSize = windowSize; ++ fparamsPtr->dictID = dictID; ++ fparamsPtr->checksumFlag = checksumFlag; ++ } ++ return 0; ++} ++ ++/** ZSTD_getFrameContentSize() : ++* compatible with legacy mode ++* @return : decompressed size of the 
single frame pointed to be `src` if known, otherwise ++* - ZSTD_CONTENTSIZE_UNKNOWN if the size cannot be determined ++* - ZSTD_CONTENTSIZE_ERROR if an error occurred (e.g. invalid magic number, srcSize too small) */ ++unsigned long long INIT ZSTD_getFrameContentSize(const void *src, size_t srcSize) ++{ ++ { ++ ZSTD_frameParams fParams; ++ if (ZSTD_getFrameParams(&fParams, src, srcSize) != 0) ++ return ZSTD_CONTENTSIZE_ERROR; ++ if (fParams.windowSize == 0) { ++ /* Either skippable or empty frame, size == 0 either way */ ++ return 0; ++ } else if (fParams.frameContentSize != 0) { ++ return fParams.frameContentSize; ++ } else { ++ return ZSTD_CONTENTSIZE_UNKNOWN; ++ } ++ } ++} ++ ++/** ZSTD_findDecompressedSize() : ++ * compatible with legacy mode ++ * `srcSize` must be the exact length of some number of ZSTD compressed and/or ++ * skippable frames ++ * @return : decompressed size of the frames contained */ ++unsigned long long INIT ZSTD_findDecompressedSize(const void *src, size_t srcSize) ++{ ++ { ++ unsigned long long totalDstSize = 0; ++ while (srcSize >= ZSTD_frameHeaderSize_prefix) { ++ const U32 magicNumber = ZSTD_readLE32(src); ++ ++ if ((magicNumber & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) { ++ size_t skippableSize; ++ if (srcSize < ZSTD_skippableHeaderSize) ++ return ERROR(srcSize_wrong); ++ skippableSize = ZSTD_readLE32((const BYTE *)src + 4) + ZSTD_skippableHeaderSize; ++ if (srcSize < skippableSize) { ++ return ZSTD_CONTENTSIZE_ERROR; ++ } ++ ++ src = (const BYTE *)src + skippableSize; ++ srcSize -= skippableSize; ++ continue; ++ } ++ ++ { ++ unsigned long long const ret = ZSTD_getFrameContentSize(src, srcSize); ++ if (ret >= ZSTD_CONTENTSIZE_ERROR) ++ return ret; ++ ++ /* check for overflow */ ++ if (totalDstSize + ret < totalDstSize) ++ return ZSTD_CONTENTSIZE_ERROR; ++ totalDstSize += ret; ++ } ++ { ++ size_t const frameSrcSize = ZSTD_findFrameCompressedSize(src, srcSize); ++ if (ZSTD_isError(frameSrcSize)) { ++ return ZSTD_CONTENTSIZE_ERROR; ++ } ++ ++ src = (const BYTE *)src + frameSrcSize; ++ srcSize -= frameSrcSize; ++ } ++ } ++ ++ if (srcSize) { ++ return ZSTD_CONTENTSIZE_ERROR; ++ } ++ ++ return totalDstSize; ++ } ++} ++ ++/** ZSTD_decodeFrameHeader() : ++* `headerSize` must be the size provided by ZSTD_frameHeaderSize(). ++* @return : 0 if success, or an error code, which can be tested using ZSTD_isError() */ ++static size_t INIT ZSTD_decodeFrameHeader(ZSTD_DCtx *dctx, const void *src, size_t headerSize) ++{ ++ size_t const result = ZSTD_getFrameParams(&(dctx->fParams), src, headerSize); ++ if (ZSTD_isError(result)) ++ return result; /* invalid header */ ++ if (result > 0) ++ return ERROR(srcSize_wrong); /* headerSize too small */ ++ if (dctx->fParams.dictID && (dctx->dictID != dctx->fParams.dictID)) ++ return ERROR(dictionary_wrong); ++ if (dctx->fParams.checksumFlag) ++ xxh64_reset(&dctx->xxhState, 0); ++ return 0; ++} ++ ++typedef struct { ++ blockType_e blockType; ++ U32 lastBlock; ++ U32 origSize; ++} blockProperties_t; ++ ++/*! 
ZSTD_getcBlockSize() : ++* Provides the size of compressed block from block header `src` */ ++size_t INIT ZSTD_getcBlockSize(const void *src, size_t srcSize, blockProperties_t *bpPtr) ++{ ++ if (srcSize < ZSTD_blockHeaderSize) ++ return ERROR(srcSize_wrong); ++ { ++ U32 const cBlockHeader = ZSTD_readLE24(src); ++ U32 const cSize = cBlockHeader >> 3; ++ bpPtr->lastBlock = cBlockHeader & 1; ++ bpPtr->blockType = (blockType_e)((cBlockHeader >> 1) & 3); ++ bpPtr->origSize = cSize; /* only useful for RLE */ ++ if (bpPtr->blockType == bt_rle) ++ return 1; ++ if (bpPtr->blockType == bt_reserved) ++ return ERROR(corruption_detected); ++ return cSize; ++ } ++} ++ ++static size_t INIT ZSTD_copyRawBlock(void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++{ ++ if (srcSize > dstCapacity) ++ return ERROR(dstSize_tooSmall); ++ memcpy(dst, src, srcSize); ++ return srcSize; ++} ++ ++static size_t INIT ZSTD_setRleBlock(void *dst, size_t dstCapacity, const void *src, size_t srcSize, size_t regenSize) ++{ ++ if (srcSize != 1) ++ return ERROR(srcSize_wrong); ++ if (regenSize > dstCapacity) ++ return ERROR(dstSize_tooSmall); ++ memset(dst, *(const BYTE *)src, regenSize); ++ return regenSize; ++} ++ ++/*! ZSTD_decodeLiteralsBlock() : ++ @return : nb of bytes read from src (< srcSize ) */ ++size_t INIT ZSTD_decodeLiteralsBlock(ZSTD_DCtx *dctx, const void *src, size_t srcSize) /* note : srcSize < BLOCKSIZE */ ++{ ++ if (srcSize < MIN_CBLOCK_SIZE) ++ return ERROR(corruption_detected); ++ ++ { ++ const BYTE *const istart = (const BYTE *)src; ++ symbolEncodingType_e const litEncType = (symbolEncodingType_e)(istart[0] & 3); ++ ++ switch (litEncType) { ++ case set_repeat: ++ if (dctx->litEntropy == 0) ++ return ERROR(dictionary_corrupted); ++ /* fallthrough */ ++ case set_compressed: ++ if (srcSize < 5) ++ return ERROR(corruption_detected); /* srcSize >= MIN_CBLOCK_SIZE == 3; here we need up to 5 for case 3 */ ++ { ++ size_t lhSize, litSize, litCSize; ++ U32 singleStream = 0; ++ U32 const lhlCode = (istart[0] >> 2) & 3; ++ U32 const lhc = ZSTD_readLE32(istart); ++ switch (lhlCode) { ++ case 0: ++ case 1: ++ default: /* note : default is impossible, since lhlCode into [0..3] */ ++ /* 2 - 2 - 10 - 10 */ ++ singleStream = !lhlCode; ++ lhSize = 3; ++ litSize = (lhc >> 4) & 0x3FF; ++ litCSize = (lhc >> 14) & 0x3FF; ++ break; ++ case 2: ++ /* 2 - 2 - 14 - 14 */ ++ lhSize = 4; ++ litSize = (lhc >> 4) & 0x3FFF; ++ litCSize = lhc >> 18; ++ break; ++ case 3: ++ /* 2 - 2 - 18 - 18 */ ++ lhSize = 5; ++ litSize = (lhc >> 4) & 0x3FFFF; ++ litCSize = (lhc >> 22) + (istart[4] << 10); ++ break; ++ } ++ if (litSize > ZSTD_BLOCKSIZE_ABSOLUTEMAX) ++ return ERROR(corruption_detected); ++ if (litCSize + lhSize > srcSize) ++ return ERROR(corruption_detected); ++ ++ if (HUF_isError( ++ (litEncType == set_repeat) ++ ? (singleStream ? HUF_decompress1X_usingDTable(dctx->litBuffer, litSize, istart + lhSize, litCSize, dctx->HUFptr) ++ : HUF_decompress4X_usingDTable(dctx->litBuffer, litSize, istart + lhSize, litCSize, dctx->HUFptr)) ++ : (singleStream ++ ? 
HUF_decompress1X2_DCtx_wksp(dctx->entropy.hufTable, dctx->litBuffer, litSize, istart + lhSize, litCSize, ++ dctx->entropy.workspace, sizeof(dctx->entropy.workspace)) ++ : HUF_decompress4X_hufOnly_wksp(dctx->entropy.hufTable, dctx->litBuffer, litSize, istart + lhSize, litCSize, ++ dctx->entropy.workspace, sizeof(dctx->entropy.workspace))))) ++ return ERROR(corruption_detected); ++ ++ dctx->litPtr = dctx->litBuffer; ++ dctx->litSize = litSize; ++ dctx->litEntropy = 1; ++ if (litEncType == set_compressed) ++ dctx->HUFptr = dctx->entropy.hufTable; ++ memset(dctx->litBuffer + dctx->litSize, 0, WILDCOPY_OVERLENGTH); ++ return litCSize + lhSize; ++ } ++ ++ case set_basic: { ++ size_t litSize, lhSize; ++ U32 const lhlCode = ((istart[0]) >> 2) & 3; ++ switch (lhlCode) { ++ case 0: ++ case 2: ++ default: /* note : default is impossible, since lhlCode into [0..3] */ ++ lhSize = 1; ++ litSize = istart[0] >> 3; ++ break; ++ case 1: ++ lhSize = 2; ++ litSize = ZSTD_readLE16(istart) >> 4; ++ break; ++ case 3: ++ lhSize = 3; ++ litSize = ZSTD_readLE24(istart) >> 4; ++ break; ++ } ++ ++ if (lhSize + litSize + WILDCOPY_OVERLENGTH > srcSize) { /* risk reading beyond src buffer with wildcopy */ ++ if (litSize + lhSize > srcSize) ++ return ERROR(corruption_detected); ++ memcpy(dctx->litBuffer, istart + lhSize, litSize); ++ dctx->litPtr = dctx->litBuffer; ++ dctx->litSize = litSize; ++ memset(dctx->litBuffer + dctx->litSize, 0, WILDCOPY_OVERLENGTH); ++ return lhSize + litSize; ++ } ++ /* direct reference into compressed stream */ ++ dctx->litPtr = istart + lhSize; ++ dctx->litSize = litSize; ++ return lhSize + litSize; ++ } ++ ++ case set_rle: { ++ U32 const lhlCode = ((istart[0]) >> 2) & 3; ++ size_t litSize, lhSize; ++ switch (lhlCode) { ++ case 0: ++ case 2: ++ default: /* note : default is impossible, since lhlCode into [0..3] */ ++ lhSize = 1; ++ litSize = istart[0] >> 3; ++ break; ++ case 1: ++ lhSize = 2; ++ litSize = ZSTD_readLE16(istart) >> 4; ++ break; ++ case 3: ++ lhSize = 3; ++ litSize = ZSTD_readLE24(istart) >> 4; ++ if (srcSize < 4) ++ return ERROR(corruption_detected); /* srcSize >= MIN_CBLOCK_SIZE == 3; here we need lhSize+1 = 4 */ ++ break; ++ } ++ if (litSize > ZSTD_BLOCKSIZE_ABSOLUTEMAX) ++ return ERROR(corruption_detected); ++ memset(dctx->litBuffer, istart[lhSize], litSize + WILDCOPY_OVERLENGTH); ++ dctx->litPtr = dctx->litBuffer; ++ dctx->litSize = litSize; ++ return lhSize + 1; ++ } ++ default: ++ return ERROR(corruption_detected); /* impossible */ ++ } ++ } ++} ++ ++typedef union { ++ FSE_decode_t realData; ++ U32 alignedBy4; ++} FSE_decode_t4; ++ ++static const FSE_decode_t4 LL_defaultDTable[(1 << LL_DEFAULTNORMLOG) + 1] = { ++ {{LL_DEFAULTNORMLOG, 1, 1}}, /* header : tableLog, fastMode, fastMode */ ++ {{0, 0, 4}}, /* 0 : base, symbol, bits */ ++ {{16, 0, 4}}, ++ {{32, 1, 5}}, ++ {{0, 3, 5}}, ++ {{0, 4, 5}}, ++ {{0, 6, 5}}, ++ {{0, 7, 5}}, ++ {{0, 9, 5}}, ++ {{0, 10, 5}}, ++ {{0, 12, 5}}, ++ {{0, 14, 6}}, ++ {{0, 16, 5}}, ++ {{0, 18, 5}}, ++ {{0, 19, 5}}, ++ {{0, 21, 5}}, ++ {{0, 22, 5}}, ++ {{0, 24, 5}}, ++ {{32, 25, 5}}, ++ {{0, 26, 5}}, ++ {{0, 27, 6}}, ++ {{0, 29, 6}}, ++ {{0, 31, 6}}, ++ {{32, 0, 4}}, ++ {{0, 1, 4}}, ++ {{0, 2, 5}}, ++ {{32, 4, 5}}, ++ {{0, 5, 5}}, ++ {{32, 7, 5}}, ++ {{0, 8, 5}}, ++ {{32, 10, 5}}, ++ {{0, 11, 5}}, ++ {{0, 13, 6}}, ++ {{32, 16, 5}}, ++ {{0, 17, 5}}, ++ {{32, 19, 5}}, ++ {{0, 20, 5}}, ++ {{32, 22, 5}}, ++ {{0, 23, 5}}, ++ {{0, 25, 4}}, ++ {{16, 25, 4}}, ++ {{32, 26, 5}}, ++ {{0, 28, 6}}, ++ {{0, 30, 6}}, ++ {{48, 0, 4}}, ++ {{16, 1, 4}}, ++ {{32, 
2, 5}}, ++ {{32, 3, 5}}, ++ {{32, 5, 5}}, ++ {{32, 6, 5}}, ++ {{32, 8, 5}}, ++ {{32, 9, 5}}, ++ {{32, 11, 5}}, ++ {{32, 12, 5}}, ++ {{0, 15, 6}}, ++ {{32, 17, 5}}, ++ {{32, 18, 5}}, ++ {{32, 20, 5}}, ++ {{32, 21, 5}}, ++ {{32, 23, 5}}, ++ {{32, 24, 5}}, ++ {{0, 35, 6}}, ++ {{0, 34, 6}}, ++ {{0, 33, 6}}, ++ {{0, 32, 6}}, ++}; /* LL_defaultDTable */ ++ ++static const FSE_decode_t4 ML_defaultDTable[(1 << ML_DEFAULTNORMLOG) + 1] = { ++ {{ML_DEFAULTNORMLOG, 1, 1}}, /* header : tableLog, fastMode, fastMode */ ++ {{0, 0, 6}}, /* 0 : base, symbol, bits */ ++ {{0, 1, 4}}, ++ {{32, 2, 5}}, ++ {{0, 3, 5}}, ++ {{0, 5, 5}}, ++ {{0, 6, 5}}, ++ {{0, 8, 5}}, ++ {{0, 10, 6}}, ++ {{0, 13, 6}}, ++ {{0, 16, 6}}, ++ {{0, 19, 6}}, ++ {{0, 22, 6}}, ++ {{0, 25, 6}}, ++ {{0, 28, 6}}, ++ {{0, 31, 6}}, ++ {{0, 33, 6}}, ++ {{0, 35, 6}}, ++ {{0, 37, 6}}, ++ {{0, 39, 6}}, ++ {{0, 41, 6}}, ++ {{0, 43, 6}}, ++ {{0, 45, 6}}, ++ {{16, 1, 4}}, ++ {{0, 2, 4}}, ++ {{32, 3, 5}}, ++ {{0, 4, 5}}, ++ {{32, 6, 5}}, ++ {{0, 7, 5}}, ++ {{0, 9, 6}}, ++ {{0, 12, 6}}, ++ {{0, 15, 6}}, ++ {{0, 18, 6}}, ++ {{0, 21, 6}}, ++ {{0, 24, 6}}, ++ {{0, 27, 6}}, ++ {{0, 30, 6}}, ++ {{0, 32, 6}}, ++ {{0, 34, 6}}, ++ {{0, 36, 6}}, ++ {{0, 38, 6}}, ++ {{0, 40, 6}}, ++ {{0, 42, 6}}, ++ {{0, 44, 6}}, ++ {{32, 1, 4}}, ++ {{48, 1, 4}}, ++ {{16, 2, 4}}, ++ {{32, 4, 5}}, ++ {{32, 5, 5}}, ++ {{32, 7, 5}}, ++ {{32, 8, 5}}, ++ {{0, 11, 6}}, ++ {{0, 14, 6}}, ++ {{0, 17, 6}}, ++ {{0, 20, 6}}, ++ {{0, 23, 6}}, ++ {{0, 26, 6}}, ++ {{0, 29, 6}}, ++ {{0, 52, 6}}, ++ {{0, 51, 6}}, ++ {{0, 50, 6}}, ++ {{0, 49, 6}}, ++ {{0, 48, 6}}, ++ {{0, 47, 6}}, ++ {{0, 46, 6}}, ++}; /* ML_defaultDTable */ ++ ++static const FSE_decode_t4 OF_defaultDTable[(1 << OF_DEFAULTNORMLOG) + 1] = { ++ {{OF_DEFAULTNORMLOG, 1, 1}}, /* header : tableLog, fastMode, fastMode */ ++ {{0, 0, 5}}, /* 0 : base, symbol, bits */ ++ {{0, 6, 4}}, ++ {{0, 9, 5}}, ++ {{0, 15, 5}}, ++ {{0, 21, 5}}, ++ {{0, 3, 5}}, ++ {{0, 7, 4}}, ++ {{0, 12, 5}}, ++ {{0, 18, 5}}, ++ {{0, 23, 5}}, ++ {{0, 5, 5}}, ++ {{0, 8, 4}}, ++ {{0, 14, 5}}, ++ {{0, 20, 5}}, ++ {{0, 2, 5}}, ++ {{16, 7, 4}}, ++ {{0, 11, 5}}, ++ {{0, 17, 5}}, ++ {{0, 22, 5}}, ++ {{0, 4, 5}}, ++ {{16, 8, 4}}, ++ {{0, 13, 5}}, ++ {{0, 19, 5}}, ++ {{0, 1, 5}}, ++ {{16, 6, 4}}, ++ {{0, 10, 5}}, ++ {{0, 16, 5}}, ++ {{0, 28, 5}}, ++ {{0, 27, 5}}, ++ {{0, 26, 5}}, ++ {{0, 25, 5}}, ++ {{0, 24, 5}}, ++}; /* OF_defaultDTable */ ++ ++/*! 
ZSTD_buildSeqTable() : ++ @return : nb bytes read from src, ++ or an error code if it fails, testable with ZSTD_isError() ++*/ ++static size_t INIT ZSTD_buildSeqTable(FSE_DTable *DTableSpace, const FSE_DTable **DTablePtr, ++ symbolEncodingType_e type, U32 max, U32 maxLog, const void *src, ++ size_t srcSize, const FSE_decode_t4 *defaultTable, ++ U32 flagRepeatTable, void *workspace, size_t workspaceSize) ++{ ++ const void *const tmpPtr = defaultTable; /* bypass strict aliasing */ ++ switch (type) { ++ case set_rle: ++ if (!srcSize) ++ return ERROR(srcSize_wrong); ++ if ((*(const BYTE *)src) > max) ++ return ERROR(corruption_detected); ++ FSE_buildDTable_rle(DTableSpace, *(const BYTE *)src); ++ *DTablePtr = DTableSpace; ++ return 1; ++ case set_basic: *DTablePtr = (const FSE_DTable *)tmpPtr; return 0; ++ case set_repeat: ++ if (!flagRepeatTable) ++ return ERROR(corruption_detected); ++ return 0; ++ default: /* impossible */ ++ case set_compressed: { ++ U32 tableLog; ++ S16 *norm = (S16 *)workspace; ++ size_t const spaceUsed32 = ALIGN(sizeof(S16) * (MaxSeq + 1), sizeof(U32)) >> 2; ++ ++ if ((spaceUsed32 << 2) > workspaceSize) ++ return ERROR(GENERIC); ++ workspace = (U32 *)workspace + spaceUsed32; ++ workspaceSize -= (spaceUsed32 << 2); ++ { ++ size_t const headerSize = FSE_readNCount(norm, &max, &tableLog, src, srcSize); ++ if (FSE_isError(headerSize)) ++ return ERROR(corruption_detected); ++ if (tableLog > maxLog) ++ return ERROR(corruption_detected); ++ FSE_buildDTable_wksp(DTableSpace, norm, max, tableLog, workspace, workspaceSize); ++ *DTablePtr = DTableSpace; ++ return headerSize; ++ } ++ } ++ } ++} ++ ++size_t INIT ZSTD_decodeSeqHeaders(ZSTD_DCtx *dctx, int *nbSeqPtr, const void *src, size_t srcSize) ++{ ++ const BYTE *const istart = (const BYTE *const)src; ++ const BYTE *const iend = istart + srcSize; ++ const BYTE *ip = istart; ++ ++ /* check */ ++ if (srcSize < MIN_SEQUENCES_SIZE) ++ return ERROR(srcSize_wrong); ++ ++ /* SeqHead */ ++ { ++ int nbSeq = *ip++; ++ if (!nbSeq) { ++ *nbSeqPtr = 0; ++ return 1; ++ } ++ if (nbSeq > 0x7F) { ++ if (nbSeq == 0xFF) { ++ if (ip + 2 > iend) ++ return ERROR(srcSize_wrong); ++ nbSeq = ZSTD_readLE16(ip) + LONGNBSEQ, ip += 2; ++ } else { ++ if (ip >= iend) ++ return ERROR(srcSize_wrong); ++ nbSeq = ((nbSeq - 0x80) << 8) + *ip++; ++ } ++ } ++ *nbSeqPtr = nbSeq; ++ } ++ ++ /* FSE table descriptors */ ++ if (ip + 4 > iend) ++ return ERROR(srcSize_wrong); /* minimum possible size */ ++ { ++ symbolEncodingType_e const LLtype = (symbolEncodingType_e)(*ip >> 6); ++ symbolEncodingType_e const OFtype = (symbolEncodingType_e)((*ip >> 4) & 3); ++ symbolEncodingType_e const MLtype = (symbolEncodingType_e)((*ip >> 2) & 3); ++ ip++; ++ ++ /* Build DTables */ ++ { ++ size_t const llhSize = ZSTD_buildSeqTable(dctx->entropy.LLTable, &dctx->LLTptr, LLtype, MaxLL, LLFSELog, ip, iend - ip, ++ LL_defaultDTable, dctx->fseEntropy, dctx->entropy.workspace, sizeof(dctx->entropy.workspace)); ++ if (ZSTD_isError(llhSize)) ++ return ERROR(corruption_detected); ++ ip += llhSize; ++ } ++ { ++ size_t const ofhSize = ZSTD_buildSeqTable(dctx->entropy.OFTable, &dctx->OFTptr, OFtype, MaxOff, OffFSELog, ip, iend - ip, ++ OF_defaultDTable, dctx->fseEntropy, dctx->entropy.workspace, sizeof(dctx->entropy.workspace)); ++ if (ZSTD_isError(ofhSize)) ++ return ERROR(corruption_detected); ++ ip += ofhSize; ++ } ++ { ++ size_t const mlhSize = ZSTD_buildSeqTable(dctx->entropy.MLTable, &dctx->MLTptr, MLtype, MaxML, MLFSELog, ip, iend - ip, ++ ML_defaultDTable, dctx->fseEntropy, 
dctx->entropy.workspace, sizeof(dctx->entropy.workspace)); ++ if (ZSTD_isError(mlhSize)) ++ return ERROR(corruption_detected); ++ ip += mlhSize; ++ } ++ } ++ ++ return ip - istart; ++} ++ ++typedef struct { ++ size_t litLength; ++ size_t matchLength; ++ size_t offset; ++ const BYTE *match; ++} seq_t; ++ ++typedef struct { ++ BIT_DStream_t DStream; ++ FSE_DState_t stateLL; ++ FSE_DState_t stateOffb; ++ FSE_DState_t stateML; ++ size_t prevOffset[ZSTD_REP_NUM]; ++ const BYTE *base; ++ size_t pos; ++ uPtrDiff gotoDict; ++} seqState_t; ++ ++FORCE_NOINLINE ++size_t ZSTD_execSequenceLast7(BYTE *op, BYTE *const oend, seq_t sequence, const BYTE **litPtr, const BYTE *const litLimit, const BYTE *const base, ++ const BYTE *const vBase, const BYTE *const dictEnd) ++{ ++ BYTE *const oLitEnd = op + sequence.litLength; ++ size_t const sequenceLength = sequence.litLength + sequence.matchLength; ++ BYTE *const oMatchEnd = op + sequenceLength; /* risk : address space overflow (32-bits) */ ++ BYTE *const oend_w = oend - WILDCOPY_OVERLENGTH; ++ const BYTE *const iLitEnd = *litPtr + sequence.litLength; ++ const BYTE *match = oLitEnd - sequence.offset; ++ ++ /* check */ ++ if (oMatchEnd > oend) ++ return ERROR(dstSize_tooSmall); /* last match must start at a minimum distance of WILDCOPY_OVERLENGTH from oend */ ++ if (iLitEnd > litLimit) ++ return ERROR(corruption_detected); /* over-read beyond lit buffer */ ++ if (oLitEnd <= oend_w) ++ return ERROR(GENERIC); /* Precondition */ ++ ++ /* copy literals */ ++ if (op < oend_w) { ++ ZSTD_wildcopy(op, *litPtr, oend_w - op); ++ *litPtr += oend_w - op; ++ op = oend_w; ++ } ++ while (op < oLitEnd) ++ *op++ = *(*litPtr)++; ++ ++ /* copy Match */ ++ if (sequence.offset > (size_t)(oLitEnd - base)) { ++ /* offset beyond prefix */ ++ if (sequence.offset > (size_t)(oLitEnd - vBase)) ++ return ERROR(corruption_detected); ++ match = dictEnd - (base - match); ++ if (match + sequence.matchLength <= dictEnd) { ++ memmove(oLitEnd, match, sequence.matchLength); ++ return sequenceLength; ++ } ++ /* span extDict & currPrefixSegment */ ++ { ++ size_t const length1 = dictEnd - match; ++ memmove(oLitEnd, match, length1); ++ op = oLitEnd + length1; ++ sequence.matchLength -= length1; ++ match = base; ++ } ++ } ++ while (op < oMatchEnd) ++ *op++ = *match++; ++ return sequenceLength; ++} ++ ++static seq_t INIT ZSTD_decodeSequence(seqState_t *seqState) ++{ ++ seq_t seq; ++ ++ U32 const llCode = FSE_peekSymbol(&seqState->stateLL); ++ U32 const mlCode = FSE_peekSymbol(&seqState->stateML); ++ U32 const ofCode = FSE_peekSymbol(&seqState->stateOffb); /* <= maxOff, by table construction */ ++ ++ U32 const llBits = LL_bits[llCode]; ++ U32 const mlBits = ML_bits[mlCode]; ++ U32 const ofBits = ofCode; ++ U32 const totalBits = llBits + mlBits + ofBits; ++ ++ static const U32 LL_base[MaxLL + 1] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, ++ 20, 22, 24, 28, 32, 40, 48, 64, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000}; ++ ++ static const U32 ML_base[MaxML + 1] = {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ++ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 39, 41, ++ 43, 47, 51, 59, 67, 83, 99, 0x83, 0x103, 0x203, 0x403, 0x803, 0x1003, 0x2003, 0x4003, 0x8003, 0x10003}; ++ ++ static const U32 OF_base[MaxOff + 1] = {0, 1, 1, 5, 0xD, 0x1D, 0x3D, 0x7D, 0xFD, 0x1FD, ++ 0x3FD, 0x7FD, 0xFFD, 0x1FFD, 0x3FFD, 0x7FFD, 0xFFFD, 0x1FFFD, 0x3FFFD, 0x7FFFD, ++ 0xFFFFD, 0x1FFFFD, 0x3FFFFD, 0x7FFFFD, 0xFFFFFD, 0x1FFFFFD, 0x3FFFFFD, 
0x7FFFFFD, 0xFFFFFFD}; ++ ++ /* sequence */ ++ { ++ size_t offset; ++ if (!ofCode) ++ offset = 0; ++ else { ++ offset = OF_base[ofCode] + BIT_readBitsFast(&seqState->DStream, ofBits); /* <= (ZSTD_WINDOWLOG_MAX-1) bits */ ++ if (ZSTD_32bits()) ++ BIT_reloadDStream(&seqState->DStream); ++ } ++ ++ if (ofCode <= 1) { ++ offset += (llCode == 0); ++ if (offset) { ++ size_t temp = (offset == 3) ? seqState->prevOffset[0] - 1 : seqState->prevOffset[offset]; ++ temp += !temp; /* 0 is not valid; input is corrupted; force offset to 1 */ ++ if (offset != 1) ++ seqState->prevOffset[2] = seqState->prevOffset[1]; ++ seqState->prevOffset[1] = seqState->prevOffset[0]; ++ seqState->prevOffset[0] = offset = temp; ++ } else { ++ offset = seqState->prevOffset[0]; ++ } ++ } else { ++ seqState->prevOffset[2] = seqState->prevOffset[1]; ++ seqState->prevOffset[1] = seqState->prevOffset[0]; ++ seqState->prevOffset[0] = offset; ++ } ++ seq.offset = offset; ++ } ++ ++ seq.matchLength = ML_base[mlCode] + ((mlCode > 31) ? BIT_readBitsFast(&seqState->DStream, mlBits) : 0); /* <= 16 bits */ ++ if (ZSTD_32bits() && (mlBits + llBits > 24)) ++ BIT_reloadDStream(&seqState->DStream); ++ ++ seq.litLength = LL_base[llCode] + ((llCode > 15) ? BIT_readBitsFast(&seqState->DStream, llBits) : 0); /* <= 16 bits */ ++ if (ZSTD_32bits() || (totalBits > 64 - 7 - (LLFSELog + MLFSELog + OffFSELog))) ++ BIT_reloadDStream(&seqState->DStream); ++ ++ /* ANS state update */ ++ FSE_updateState(&seqState->stateLL, &seqState->DStream); /* <= 9 bits */ ++ FSE_updateState(&seqState->stateML, &seqState->DStream); /* <= 9 bits */ ++ if (ZSTD_32bits()) ++ BIT_reloadDStream(&seqState->DStream); /* <= 18 bits */ ++ FSE_updateState(&seqState->stateOffb, &seqState->DStream); /* <= 8 bits */ ++ ++ seq.match = NULL; ++ ++ return seq; ++} ++ ++FORCE_INLINE ++size_t ZSTD_execSequence(BYTE *op, BYTE *const oend, seq_t sequence, const BYTE **litPtr, const BYTE *const litLimit, const BYTE *const base, ++ const BYTE *const vBase, const BYTE *const dictEnd) ++{ ++ BYTE *const oLitEnd = op + sequence.litLength; ++ size_t const sequenceLength = sequence.litLength + sequence.matchLength; ++ BYTE *const oMatchEnd = op + sequenceLength; /* risk : address space overflow (32-bits) */ ++ BYTE *const oend_w = oend - WILDCOPY_OVERLENGTH; ++ const BYTE *const iLitEnd = *litPtr + sequence.litLength; ++ const BYTE *match = oLitEnd - sequence.offset; ++ ++ /* check */ ++ if (oMatchEnd > oend) ++ return ERROR(dstSize_tooSmall); /* last match must start at a minimum distance of WILDCOPY_OVERLENGTH from oend */ ++ if (iLitEnd > litLimit) ++ return ERROR(corruption_detected); /* over-read beyond lit buffer */ ++ if (oLitEnd > oend_w) ++ return ZSTD_execSequenceLast7(op, oend, sequence, litPtr, litLimit, base, vBase, dictEnd); ++ ++ /* copy Literals */ ++ ZSTD_copy8(op, *litPtr); ++ if (sequence.litLength > 8) ++ ZSTD_wildcopy(op + 8, (*litPtr) + 8, ++ sequence.litLength - 8); /* note : since oLitEnd <= oend-WILDCOPY_OVERLENGTH, no risk of overwrite beyond oend */ ++ op = oLitEnd; ++ *litPtr = iLitEnd; /* update for next sequence */ ++ ++ /* copy Match */ ++ if (sequence.offset > (size_t)(oLitEnd - base)) { ++ /* offset beyond prefix */ ++ if (sequence.offset > (size_t)(oLitEnd - vBase)) ++ return ERROR(corruption_detected); ++ match = dictEnd + (match - base); ++ if (match + sequence.matchLength <= dictEnd) { ++ memmove(oLitEnd, match, sequence.matchLength); ++ return sequenceLength; ++ } ++ /* span extDict & currPrefixSegment */ ++ { ++ size_t const length1 = dictEnd - match; ++ 
memmove(oLitEnd, match, length1); ++ op = oLitEnd + length1; ++ sequence.matchLength -= length1; ++ match = base; ++ if (op > oend_w || sequence.matchLength < MINMATCH) { ++ U32 i; ++ for (i = 0; i < sequence.matchLength; ++i) ++ op[i] = match[i]; ++ return sequenceLength; ++ } ++ } ++ } ++ /* Requirement: op <= oend_w && sequence.matchLength >= MINMATCH */ ++ ++ /* match within prefix */ ++ if (sequence.offset < 8) { ++ /* close range match, overlap */ ++ static const U32 dec32table[] = {0, 1, 2, 1, 4, 4, 4, 4}; /* added */ ++ static const int dec64table[] = {8, 8, 8, 7, 8, 9, 10, 11}; /* subtracted */ ++ int const sub2 = dec64table[sequence.offset]; ++ op[0] = match[0]; ++ op[1] = match[1]; ++ op[2] = match[2]; ++ op[3] = match[3]; ++ match += dec32table[sequence.offset]; ++ ZSTD_copy4(op + 4, match); ++ match -= sub2; ++ } else { ++ ZSTD_copy8(op, match); ++ } ++ op += 8; ++ match += 8; ++ ++ if (oMatchEnd > oend - (16 - MINMATCH)) { ++ if (op < oend_w) { ++ ZSTD_wildcopy(op, match, oend_w - op); ++ match += oend_w - op; ++ op = oend_w; ++ } ++ while (op < oMatchEnd) ++ *op++ = *match++; ++ } else { ++ ZSTD_wildcopy(op, match, (ptrdiff_t)sequence.matchLength - 8); /* works even if matchLength < 8 */ ++ } ++ return sequenceLength; ++} ++ ++static size_t INIT ZSTD_decompressSequences(ZSTD_DCtx *dctx, void *dst, size_t maxDstSize, const void *seqStart, size_t seqSize) ++{ ++ const BYTE *ip = (const BYTE *)seqStart; ++ const BYTE *const iend = ip + seqSize; ++ BYTE *const ostart = (BYTE * const)dst; ++ BYTE *const oend = ostart + maxDstSize; ++ BYTE *op = ostart; ++ const BYTE *litPtr = dctx->litPtr; ++ const BYTE *const litEnd = litPtr + dctx->litSize; ++ const BYTE *const base = (const BYTE *)(dctx->base); ++ const BYTE *const vBase = (const BYTE *)(dctx->vBase); ++ const BYTE *const dictEnd = (const BYTE *)(dctx->dictEnd); ++ int nbSeq; ++ ++ /* Build Decoding Tables */ ++ { ++ size_t const seqHSize = ZSTD_decodeSeqHeaders(dctx, &nbSeq, ip, seqSize); ++ if (ZSTD_isError(seqHSize)) ++ return seqHSize; ++ ip += seqHSize; ++ } ++ ++ /* Regen sequences */ ++ if (nbSeq) { ++ seqState_t seqState; ++ dctx->fseEntropy = 1; ++ { ++ U32 i; ++ for (i = 0; i < ZSTD_REP_NUM; i++) ++ seqState.prevOffset[i] = dctx->entropy.rep[i]; ++ } ++ CHECK_E(BIT_initDStream(&seqState.DStream, ip, iend - ip), corruption_detected); ++ FSE_initDState(&seqState.stateLL, &seqState.DStream, dctx->LLTptr); ++ FSE_initDState(&seqState.stateOffb, &seqState.DStream, dctx->OFTptr); ++ FSE_initDState(&seqState.stateML, &seqState.DStream, dctx->MLTptr); ++ ++ for (; (BIT_reloadDStream(&(seqState.DStream)) <= BIT_DStream_completed) && nbSeq;) { ++ nbSeq--; ++ { ++ seq_t const sequence = ZSTD_decodeSequence(&seqState); ++ size_t const oneSeqSize = ZSTD_execSequence(op, oend, sequence, &litPtr, litEnd, base, vBase, dictEnd); ++ if (ZSTD_isError(oneSeqSize)) ++ return oneSeqSize; ++ op += oneSeqSize; ++ } ++ } ++ ++ /* check if reached exact end */ ++ if (nbSeq) ++ return ERROR(corruption_detected); ++ /* save reps for next block */ ++ { ++ U32 i; ++ for (i = 0; i < ZSTD_REP_NUM; i++) ++ dctx->entropy.rep[i] = (U32)(seqState.prevOffset[i]); ++ } ++ } ++ ++ /* last literal segment */ ++ { ++ size_t const lastLLSize = litEnd - litPtr; ++ if (lastLLSize > (size_t)(oend - op)) ++ return ERROR(dstSize_tooSmall); ++ memcpy(op, litPtr, lastLLSize); ++ op += lastLLSize; ++ } ++ ++ return op - ostart; ++} ++ ++FORCE_INLINE seq_t ZSTD_decodeSequenceLong_generic(seqState_t *seqState, int const longOffsets) ++{ ++ seq_t seq; ++ ++ U32 const 
llCode = FSE_peekSymbol(&seqState->stateLL); ++ U32 const mlCode = FSE_peekSymbol(&seqState->stateML); ++ U32 const ofCode = FSE_peekSymbol(&seqState->stateOffb); /* <= maxOff, by table construction */ ++ ++ U32 const llBits = LL_bits[llCode]; ++ U32 const mlBits = ML_bits[mlCode]; ++ U32 const ofBits = ofCode; ++ U32 const totalBits = llBits + mlBits + ofBits; ++ ++ static const U32 LL_base[MaxLL + 1] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, ++ 20, 22, 24, 28, 32, 40, 48, 64, 0x80, 0x100, 0x200, 0x400, 0x800, 0x1000, 0x2000, 0x4000, 0x8000, 0x10000}; ++ ++ static const U32 ML_base[MaxML + 1] = {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ++ 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 37, 39, 41, ++ 43, 47, 51, 59, 67, 83, 99, 0x83, 0x103, 0x203, 0x403, 0x803, 0x1003, 0x2003, 0x4003, 0x8003, 0x10003}; ++ ++ static const U32 OF_base[MaxOff + 1] = {0, 1, 1, 5, 0xD, 0x1D, 0x3D, 0x7D, 0xFD, 0x1FD, ++ 0x3FD, 0x7FD, 0xFFD, 0x1FFD, 0x3FFD, 0x7FFD, 0xFFFD, 0x1FFFD, 0x3FFFD, 0x7FFFD, ++ 0xFFFFD, 0x1FFFFD, 0x3FFFFD, 0x7FFFFD, 0xFFFFFD, 0x1FFFFFD, 0x3FFFFFD, 0x7FFFFFD, 0xFFFFFFD}; ++ ++ /* sequence */ ++ { ++ size_t offset; ++ if (!ofCode) ++ offset = 0; ++ else { ++ if (longOffsets) { ++ int const extraBits = ofBits - MIN(ofBits, STREAM_ACCUMULATOR_MIN); ++ offset = OF_base[ofCode] + (BIT_readBitsFast(&seqState->DStream, ofBits - extraBits) << extraBits); ++ if (ZSTD_32bits() || extraBits) ++ BIT_reloadDStream(&seqState->DStream); ++ if (extraBits) ++ offset += BIT_readBitsFast(&seqState->DStream, extraBits); ++ } else { ++ offset = OF_base[ofCode] + BIT_readBitsFast(&seqState->DStream, ofBits); /* <= (ZSTD_WINDOWLOG_MAX-1) bits */ ++ if (ZSTD_32bits()) ++ BIT_reloadDStream(&seqState->DStream); ++ } ++ } ++ ++ if (ofCode <= 1) { ++ offset += (llCode == 0); ++ if (offset) { ++ size_t temp = (offset == 3) ? seqState->prevOffset[0] - 1 : seqState->prevOffset[offset]; ++ temp += !temp; /* 0 is not valid; input is corrupted; force offset to 1 */ ++ if (offset != 1) ++ seqState->prevOffset[2] = seqState->prevOffset[1]; ++ seqState->prevOffset[1] = seqState->prevOffset[0]; ++ seqState->prevOffset[0] = offset = temp; ++ } else { ++ offset = seqState->prevOffset[0]; ++ } ++ } else { ++ seqState->prevOffset[2] = seqState->prevOffset[1]; ++ seqState->prevOffset[1] = seqState->prevOffset[0]; ++ seqState->prevOffset[0] = offset; ++ } ++ seq.offset = offset; ++ } ++ ++ seq.matchLength = ML_base[mlCode] + ((mlCode > 31) ? BIT_readBitsFast(&seqState->DStream, mlBits) : 0); /* <= 16 bits */ ++ if (ZSTD_32bits() && (mlBits + llBits > 24)) ++ BIT_reloadDStream(&seqState->DStream); ++ ++ seq.litLength = LL_base[llCode] + ((llCode > 15) ? 
BIT_readBitsFast(&seqState->DStream, llBits) : 0); /* <= 16 bits */ ++ if (ZSTD_32bits() || (totalBits > 64 - 7 - (LLFSELog + MLFSELog + OffFSELog))) ++ BIT_reloadDStream(&seqState->DStream); ++ ++ { ++ size_t const pos = seqState->pos + seq.litLength; ++ seq.match = seqState->base + pos - seq.offset; /* single memory segment */ ++ if (seq.offset > pos) ++ seq.match += seqState->gotoDict; /* separate memory segment */ ++ seqState->pos = pos + seq.matchLength; ++ } ++ ++ /* ANS state update */ ++ FSE_updateState(&seqState->stateLL, &seqState->DStream); /* <= 9 bits */ ++ FSE_updateState(&seqState->stateML, &seqState->DStream); /* <= 9 bits */ ++ if (ZSTD_32bits()) ++ BIT_reloadDStream(&seqState->DStream); /* <= 18 bits */ ++ FSE_updateState(&seqState->stateOffb, &seqState->DStream); /* <= 8 bits */ ++ ++ return seq; ++} ++ ++static seq_t INIT ZSTD_decodeSequenceLong(seqState_t *seqState, unsigned const windowSize) ++{ ++ if (ZSTD_highbit32(windowSize) > STREAM_ACCUMULATOR_MIN) { ++ return ZSTD_decodeSequenceLong_generic(seqState, 1); ++ } else { ++ return ZSTD_decodeSequenceLong_generic(seqState, 0); ++ } ++} ++ ++FORCE_INLINE ++size_t INIT ZSTD_execSequenceLong(BYTE *op, BYTE *const oend, seq_t sequence, const BYTE **litPtr, ++ const BYTE *const litLimit, const BYTE *const base, ++ const BYTE *const vBase, const BYTE *const dictEnd) ++{ ++ BYTE *const oLitEnd = op + sequence.litLength; ++ size_t const sequenceLength = sequence.litLength + sequence.matchLength; ++ BYTE *const oMatchEnd = op + sequenceLength; /* risk : address space overflow (32-bits) */ ++ BYTE *const oend_w = oend - WILDCOPY_OVERLENGTH; ++ const BYTE *const iLitEnd = *litPtr + sequence.litLength; ++ const BYTE *match = sequence.match; ++ ++ /* check */ ++ if (oMatchEnd > oend) ++ return ERROR(dstSize_tooSmall); /* last match must start at a minimum distance of WILDCOPY_OVERLENGTH from oend */ ++ if (iLitEnd > litLimit) ++ return ERROR(corruption_detected); /* over-read beyond lit buffer */ ++ if (oLitEnd > oend_w) ++ return ZSTD_execSequenceLast7(op, oend, sequence, litPtr, litLimit, base, vBase, dictEnd); ++ ++ /* copy Literals */ ++ ZSTD_copy8(op, *litPtr); ++ if (sequence.litLength > 8) ++ ZSTD_wildcopy(op + 8, (*litPtr) + 8, ++ sequence.litLength - 8); /* note : since oLitEnd <= oend-WILDCOPY_OVERLENGTH, no risk of overwrite beyond oend */ ++ op = oLitEnd; ++ *litPtr = iLitEnd; /* update for next sequence */ ++ ++ /* copy Match */ ++ if (sequence.offset > (size_t)(oLitEnd - base)) { ++ /* offset beyond prefix */ ++ if (sequence.offset > (size_t)(oLitEnd - vBase)) ++ return ERROR(corruption_detected); ++ if (match + sequence.matchLength <= dictEnd) { ++ memmove(oLitEnd, match, sequence.matchLength); ++ return sequenceLength; ++ } ++ /* span extDict & currPrefixSegment */ ++ { ++ size_t const length1 = dictEnd - match; ++ memmove(oLitEnd, match, length1); ++ op = oLitEnd + length1; ++ sequence.matchLength -= length1; ++ match = base; ++ if (op > oend_w || sequence.matchLength < MINMATCH) { ++ U32 i; ++ for (i = 0; i < sequence.matchLength; ++i) ++ op[i] = match[i]; ++ return sequenceLength; ++ } ++ } ++ } ++ /* Requirement: op <= oend_w && sequence.matchLength >= MINMATCH */ ++ ++ /* match within prefix */ ++ if (sequence.offset < 8) { ++ /* close range match, overlap */ ++ static const U32 dec32table[] = {0, 1, 2, 1, 4, 4, 4, 4}; /* added */ ++ static const int dec64table[] = {8, 8, 8, 7, 8, 9, 10, 11}; /* subtracted */ ++ int const sub2 = dec64table[sequence.offset]; ++ op[0] = match[0]; ++ op[1] = match[1]; ++ op[2] = 
match[2]; ++ op[3] = match[3]; ++ match += dec32table[sequence.offset]; ++ ZSTD_copy4(op + 4, match); ++ match -= sub2; ++ } else { ++ ZSTD_copy8(op, match); ++ } ++ op += 8; ++ match += 8; ++ ++ if (oMatchEnd > oend - (16 - MINMATCH)) { ++ if (op < oend_w) { ++ ZSTD_wildcopy(op, match, oend_w - op); ++ match += oend_w - op; ++ op = oend_w; ++ } ++ while (op < oMatchEnd) ++ *op++ = *match++; ++ } else { ++ ZSTD_wildcopy(op, match, (ptrdiff_t)sequence.matchLength - 8); /* works even if matchLength < 8 */ ++ } ++ return sequenceLength; ++} ++ ++static size_t INIT ZSTD_decompressSequencesLong(ZSTD_DCtx *dctx, void *dst, size_t maxDstSize, const void *seqStart, size_t seqSize) ++{ ++ const BYTE *ip = (const BYTE *)seqStart; ++ const BYTE *const iend = ip + seqSize; ++ BYTE *const ostart = (BYTE * const)dst; ++ BYTE *const oend = ostart + maxDstSize; ++ BYTE *op = ostart; ++ const BYTE *litPtr = dctx->litPtr; ++ const BYTE *const litEnd = litPtr + dctx->litSize; ++ const BYTE *const base = (const BYTE *)(dctx->base); ++ const BYTE *const vBase = (const BYTE *)(dctx->vBase); ++ const BYTE *const dictEnd = (const BYTE *)(dctx->dictEnd); ++ unsigned const windowSize = dctx->fParams.windowSize; ++ int nbSeq; ++ ++ /* Build Decoding Tables */ ++ { ++ size_t const seqHSize = ZSTD_decodeSeqHeaders(dctx, &nbSeq, ip, seqSize); ++ if (ZSTD_isError(seqHSize)) ++ return seqHSize; ++ ip += seqHSize; ++ } ++ ++ /* Regen sequences */ ++ if (nbSeq) { ++#define STORED_SEQS 4 ++#define STOSEQ_MASK (STORED_SEQS - 1) ++#define ADVANCED_SEQS 4 ++ seq_t *sequences = (seq_t *)dctx->entropy.workspace; ++ int const seqAdvance = MIN(nbSeq, ADVANCED_SEQS); ++ seqState_t seqState; ++ int seqNb; ++ ZSTD_STATIC_ASSERT(sizeof(dctx->entropy.workspace) >= sizeof(seq_t) * STORED_SEQS); ++ dctx->fseEntropy = 1; ++ { ++ U32 i; ++ for (i = 0; i < ZSTD_REP_NUM; i++) ++ seqState.prevOffset[i] = dctx->entropy.rep[i]; ++ } ++ seqState.base = base; ++ seqState.pos = (size_t)(op - base); ++ seqState.gotoDict = (uPtrDiff)dictEnd - (uPtrDiff)base; /* cast to avoid undefined behaviour */ ++ CHECK_E(BIT_initDStream(&seqState.DStream, ip, iend - ip), corruption_detected); ++ FSE_initDState(&seqState.stateLL, &seqState.DStream, dctx->LLTptr); ++ FSE_initDState(&seqState.stateOffb, &seqState.DStream, dctx->OFTptr); ++ FSE_initDState(&seqState.stateML, &seqState.DStream, dctx->MLTptr); ++ ++ /* prepare in advance */ ++ for (seqNb = 0; (BIT_reloadDStream(&seqState.DStream) <= BIT_DStream_completed) && seqNb < seqAdvance; seqNb++) { ++ sequences[seqNb] = ZSTD_decodeSequenceLong(&seqState, windowSize); ++ } ++ if (seqNb < seqAdvance) ++ return ERROR(corruption_detected); ++ ++ /* decode and decompress */ ++ for (; (BIT_reloadDStream(&(seqState.DStream)) <= BIT_DStream_completed) && seqNb < nbSeq; seqNb++) { ++ seq_t const sequence = ZSTD_decodeSequenceLong(&seqState, windowSize); ++ size_t const oneSeqSize = ++ ZSTD_execSequenceLong(op, oend, sequences[(seqNb - ADVANCED_SEQS) & STOSEQ_MASK], &litPtr, litEnd, base, vBase, dictEnd); ++ if (ZSTD_isError(oneSeqSize)) ++ return oneSeqSize; ++ ZSTD_PREFETCH(sequence.match); ++ sequences[seqNb & STOSEQ_MASK] = sequence; ++ op += oneSeqSize; ++ } ++ if (seqNb < nbSeq) ++ return ERROR(corruption_detected); ++ ++ /* finish queue */ ++ seqNb -= seqAdvance; ++ for (; seqNb < nbSeq; seqNb++) { ++ size_t const oneSeqSize = ZSTD_execSequenceLong(op, oend, sequences[seqNb & STOSEQ_MASK], &litPtr, litEnd, base, vBase, dictEnd); ++ if (ZSTD_isError(oneSeqSize)) ++ return oneSeqSize; ++ op += oneSeqSize; ++ } ++ ++ 
/* save reps for next block */ ++ { ++ U32 i; ++ for (i = 0; i < ZSTD_REP_NUM; i++) ++ dctx->entropy.rep[i] = (U32)(seqState.prevOffset[i]); ++ } ++ } ++ ++ /* last literal segment */ ++ { ++ size_t const lastLLSize = litEnd - litPtr; ++ if (lastLLSize > (size_t)(oend - op)) ++ return ERROR(dstSize_tooSmall); ++ memcpy(op, litPtr, lastLLSize); ++ op += lastLLSize; ++ } ++ ++ return op - ostart; ++} ++ ++static size_t INIT ZSTD_decompressBlock_internal(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++{ /* blockType == blockCompressed */ ++ const BYTE *ip = (const BYTE *)src; ++ ++ if (srcSize >= ZSTD_BLOCKSIZE_ABSOLUTEMAX) ++ return ERROR(srcSize_wrong); ++ ++ /* Decode literals section */ ++ { ++ size_t const litCSize = ZSTD_decodeLiteralsBlock(dctx, src, srcSize); ++ if (ZSTD_isError(litCSize)) ++ return litCSize; ++ ip += litCSize; ++ srcSize -= litCSize; ++ } ++ if (sizeof(size_t) > 4) /* do not enable prefetching on 32-bit x86, as it's detrimental to performance */ ++ /* likely because of register pressure */ ++ /* if that's the correct cause, then 32-bit ARM should be affected differently */ ++ /* it would be good to test this on real ARM hardware, to see if the prefetch version improves speed */ ++ if (dctx->fParams.windowSize > (1 << 23)) ++ return ZSTD_decompressSequencesLong(dctx, dst, dstCapacity, ip, srcSize); ++ return ZSTD_decompressSequences(dctx, dst, dstCapacity, ip, srcSize); ++} ++ ++static void INIT ZSTD_checkContinuity(ZSTD_DCtx *dctx, const void *dst) ++{ ++ if (dst != dctx->previousDstEnd) { /* not contiguous */ ++ dctx->dictEnd = dctx->previousDstEnd; ++ dctx->vBase = (const char *)dst - ((const char *)(dctx->previousDstEnd) - (const char *)(dctx->base)); ++ dctx->base = dst; ++ dctx->previousDstEnd = dst; ++ } ++} ++ ++size_t INIT ZSTD_decompressBlock(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++{ ++ size_t dSize; ++ ZSTD_checkContinuity(dctx, dst); ++ dSize = ZSTD_decompressBlock_internal(dctx, dst, dstCapacity, src, srcSize); ++ dctx->previousDstEnd = (char *)dst + dSize; ++ return dSize; ++} ++ ++/** ZSTD_insertBlock() : ++ insert `src` block into `dctx` history. Useful to track uncompressed blocks. 
*/ ++size_t INIT ZSTD_insertBlock(ZSTD_DCtx *dctx, const void *blockStart, size_t blockSize) ++{ ++ ZSTD_checkContinuity(dctx, blockStart); ++ dctx->previousDstEnd = (const char *)blockStart + blockSize; ++ return blockSize; ++} ++ ++size_t INIT ZSTD_generateNxBytes(void *dst, size_t dstCapacity, BYTE byte, size_t length) ++{ ++ if (length > dstCapacity) ++ return ERROR(dstSize_tooSmall); ++ memset(dst, byte, length); ++ return length; ++} ++ ++/** ZSTD_findFrameCompressedSize() : ++ * compatible with legacy mode ++ * `src` must point to the start of a ZSTD frame, ZSTD legacy frame, or skippable frame ++ * `srcSize` must be at least as large as the frame contained ++ * @return : the compressed size of the frame starting at `src` */ ++size_t INIT ZSTD_findFrameCompressedSize(const void *src, size_t srcSize) ++{ ++ if (srcSize >= ZSTD_skippableHeaderSize && (ZSTD_readLE32(src) & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) { ++ return ZSTD_skippableHeaderSize + ZSTD_readLE32((const BYTE *)src + 4); ++ } else { ++ const BYTE *ip = (const BYTE *)src; ++ const BYTE *const ipstart = ip; ++ size_t remainingSize = srcSize; ++ ZSTD_frameParams fParams; ++ ++ size_t const headerSize = ZSTD_frameHeaderSize(ip, remainingSize); ++ if (ZSTD_isError(headerSize)) ++ return headerSize; ++ ++ /* Frame Header */ ++ { ++ size_t const ret = ZSTD_getFrameParams(&fParams, ip, remainingSize); ++ if (ZSTD_isError(ret)) ++ return ret; ++ if (ret > 0) ++ return ERROR(srcSize_wrong); ++ } ++ ++ ip += headerSize; ++ remainingSize -= headerSize; ++ ++ /* Loop on each block */ ++ while (1) { ++ blockProperties_t blockProperties; ++ size_t const cBlockSize = ZSTD_getcBlockSize(ip, remainingSize, &blockProperties); ++ if (ZSTD_isError(cBlockSize)) ++ return cBlockSize; ++ ++ if (ZSTD_blockHeaderSize + cBlockSize > remainingSize) ++ return ERROR(srcSize_wrong); ++ ++ ip += ZSTD_blockHeaderSize + cBlockSize; ++ remainingSize -= ZSTD_blockHeaderSize + cBlockSize; ++ ++ if (blockProperties.lastBlock) ++ break; ++ } ++ ++ if (fParams.checksumFlag) { /* Frame content checksum */ ++ if (remainingSize < 4) ++ return ERROR(srcSize_wrong); ++ ip += 4; ++ remainingSize -= 4; ++ } ++ ++ return ip - ipstart; ++ } ++} ++ ++/*! 
ZSTD_decompressFrame() : ++* @dctx must be properly initialized */ ++static size_t INIT ZSTD_decompressFrame(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void **srcPtr, size_t *srcSizePtr) ++{ ++ const BYTE *ip = (const BYTE *)(*srcPtr); ++ BYTE *const ostart = (BYTE * const)dst; ++ BYTE *const oend = ostart + dstCapacity; ++ BYTE *op = ostart; ++ size_t remainingSize = *srcSizePtr; ++ ++ /* check */ ++ if (remainingSize < ZSTD_frameHeaderSize_min + ZSTD_blockHeaderSize) ++ return ERROR(srcSize_wrong); ++ ++ /* Frame Header */ ++ { ++ size_t const frameHeaderSize = ZSTD_frameHeaderSize(ip, ZSTD_frameHeaderSize_prefix); ++ if (ZSTD_isError(frameHeaderSize)) ++ return frameHeaderSize; ++ if (remainingSize < frameHeaderSize + ZSTD_blockHeaderSize) ++ return ERROR(srcSize_wrong); ++ CHECK_F(ZSTD_decodeFrameHeader(dctx, ip, frameHeaderSize)); ++ ip += frameHeaderSize; ++ remainingSize -= frameHeaderSize; ++ } ++ ++ /* Loop on each block */ ++ while (1) { ++ size_t decodedSize; ++ blockProperties_t blockProperties; ++ size_t const cBlockSize = ZSTD_getcBlockSize(ip, remainingSize, &blockProperties); ++ if (ZSTD_isError(cBlockSize)) ++ return cBlockSize; ++ ++ ip += ZSTD_blockHeaderSize; ++ remainingSize -= ZSTD_blockHeaderSize; ++ if (cBlockSize > remainingSize) ++ return ERROR(srcSize_wrong); ++ ++ switch (blockProperties.blockType) { ++ case bt_compressed: decodedSize = ZSTD_decompressBlock_internal(dctx, op, oend - op, ip, cBlockSize); break; ++ case bt_raw: decodedSize = ZSTD_copyRawBlock(op, oend - op, ip, cBlockSize); break; ++ case bt_rle: decodedSize = ZSTD_generateNxBytes(op, oend - op, *ip, blockProperties.origSize); break; ++ case bt_reserved: ++ default: return ERROR(corruption_detected); ++ } ++ ++ if (ZSTD_isError(decodedSize)) ++ return decodedSize; ++ if (dctx->fParams.checksumFlag) ++ xxh64_update(&dctx->xxhState, op, decodedSize); ++ op += decodedSize; ++ ip += cBlockSize; ++ remainingSize -= cBlockSize; ++ if (blockProperties.lastBlock) ++ break; ++ } ++ ++ if (dctx->fParams.checksumFlag) { /* Frame content checksum verification */ ++ U32 const checkCalc = (U32)xxh64_digest(&dctx->xxhState); ++ U32 checkRead; ++ if (remainingSize < 4) ++ return ERROR(checksum_wrong); ++ checkRead = ZSTD_readLE32(ip); ++ if (checkRead != checkCalc) ++ return ERROR(checksum_wrong); ++ ip += 4; ++ remainingSize -= 4; ++ } ++ ++ /* Allow caller to get size read */ ++ *srcPtr = ip; ++ *srcSizePtr = remainingSize; ++ return op - ostart; ++} ++ ++static const void *ZSTD_DDictDictContent(const ZSTD_DDict *ddict); ++static size_t ZSTD_DDictDictSize(const ZSTD_DDict *ddict); ++ ++static size_t INIT ZSTD_decompressMultiFrame(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const void *dict, size_t dictSize, ++ const ZSTD_DDict *ddict) ++{ ++ void *const dststart = dst; ++ ++ if (ddict) { ++ if (dict) { ++ /* programmer error, these two cases should be mutually exclusive */ ++ return ERROR(GENERIC); ++ } ++ ++ dict = ZSTD_DDictDictContent(ddict); ++ dictSize = ZSTD_DDictDictSize(ddict); ++ } ++ ++ while (srcSize >= ZSTD_frameHeaderSize_prefix) { ++ U32 magicNumber; ++ ++ magicNumber = ZSTD_readLE32(src); ++ if (magicNumber != ZSTD_MAGICNUMBER) { ++ if ((magicNumber & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) { ++ size_t skippableSize; ++ if (srcSize < ZSTD_skippableHeaderSize) ++ return ERROR(srcSize_wrong); ++ skippableSize = ZSTD_readLE32((const BYTE *)src + 4) + ZSTD_skippableHeaderSize; ++ if (srcSize < skippableSize) { ++ return ERROR(srcSize_wrong); ++ } ++ 
++ src = (const BYTE *)src + skippableSize; ++ srcSize -= skippableSize; ++ continue; ++ } else { ++ return ERROR(prefix_unknown); ++ } ++ } ++ ++ if (ddict) { ++ /* we were called from ZSTD_decompress_usingDDict */ ++ ZSTD_refDDict(dctx, ddict); ++ } else { ++ /* this will initialize correctly with no dict if dict == NULL, so ++ * use this in all cases but ddict */ ++ CHECK_F(ZSTD_decompressBegin_usingDict(dctx, dict, dictSize)); ++ } ++ ZSTD_checkContinuity(dctx, dst); ++ ++ { ++ const size_t res = ZSTD_decompressFrame(dctx, dst, dstCapacity, &src, &srcSize); ++ if (ZSTD_isError(res)) ++ return res; ++ /* don't need to bounds check this, ZSTD_decompressFrame will have ++ * already */ ++ dst = (BYTE *)dst + res; ++ dstCapacity -= res; ++ } ++ } ++ ++ if (srcSize) ++ return ERROR(srcSize_wrong); /* input not entirely consumed */ ++ ++ return (BYTE *)dst - (BYTE *)dststart; ++} ++ ++size_t INIT ZSTD_decompress_usingDict(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const void *dict, size_t dictSize) ++{ ++ return ZSTD_decompressMultiFrame(dctx, dst, dstCapacity, src, srcSize, dict, dictSize, NULL); ++} ++ ++size_t INIT ZSTD_decompressDCtx(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++{ ++ return ZSTD_decompress_usingDict(dctx, dst, dstCapacity, src, srcSize, NULL, 0); ++} ++ ++/*-************************************** ++* Advanced Streaming Decompression API ++* Bufferless and synchronous ++****************************************/ ++size_t INIT ZSTD_nextSrcSizeToDecompress(ZSTD_DCtx *dctx) { return dctx->expected; } ++ ++ZSTD_nextInputType_e INIT ZSTD_nextInputType(ZSTD_DCtx *dctx) ++{ ++ switch (dctx->stage) { ++ default: /* should not happen */ ++ case ZSTDds_getFrameHeaderSize: ++ case ZSTDds_decodeFrameHeader: return ZSTDnit_frameHeader; ++ case ZSTDds_decodeBlockHeader: return ZSTDnit_blockHeader; ++ case ZSTDds_decompressBlock: return ZSTDnit_block; ++ case ZSTDds_decompressLastBlock: return ZSTDnit_lastBlock; ++ case ZSTDds_checkChecksum: return ZSTDnit_checksum; ++ case ZSTDds_decodeSkippableHeader: ++ case ZSTDds_skipFrame: return ZSTDnit_skippableFrame; ++ } ++} ++ ++int INIT ZSTD_isSkipFrame(ZSTD_DCtx *dctx) { return dctx->stage == ZSTDds_skipFrame; } /* for zbuff */ ++ ++/** ZSTD_decompressContinue() : ++* @return : nb of bytes generated into `dst` (necessarily <= `dstCapacity) ++* or an error code, which can be tested using ZSTD_isError() */ ++size_t INIT ZSTD_decompressContinue(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++{ ++ /* Sanity check */ ++ if (srcSize != dctx->expected) ++ return ERROR(srcSize_wrong); ++ if (dstCapacity) ++ ZSTD_checkContinuity(dctx, dst); ++ ++ switch (dctx->stage) { ++ case ZSTDds_getFrameHeaderSize: ++ if (srcSize != ZSTD_frameHeaderSize_prefix) ++ return ERROR(srcSize_wrong); /* impossible */ ++ if ((ZSTD_readLE32(src) & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) { /* skippable frame */ ++ memcpy(dctx->headerBuffer, src, ZSTD_frameHeaderSize_prefix); ++ dctx->expected = ZSTD_skippableHeaderSize - ZSTD_frameHeaderSize_prefix; /* magic number + skippable frame length */ ++ dctx->stage = ZSTDds_decodeSkippableHeader; ++ return 0; ++ } ++ dctx->headerSize = ZSTD_frameHeaderSize(src, ZSTD_frameHeaderSize_prefix); ++ if (ZSTD_isError(dctx->headerSize)) ++ return dctx->headerSize; ++ memcpy(dctx->headerBuffer, src, ZSTD_frameHeaderSize_prefix); ++ if (dctx->headerSize > ZSTD_frameHeaderSize_prefix) { ++ dctx->expected = dctx->headerSize - 
ZSTD_frameHeaderSize_prefix; ++ dctx->stage = ZSTDds_decodeFrameHeader; ++ return 0; ++ } ++ dctx->expected = 0; /* not necessary to copy more */ ++ /* fallthrough */ ++ ++ case ZSTDds_decodeFrameHeader: ++ memcpy(dctx->headerBuffer + ZSTD_frameHeaderSize_prefix, src, dctx->expected); ++ CHECK_F(ZSTD_decodeFrameHeader(dctx, dctx->headerBuffer, dctx->headerSize)); ++ dctx->expected = ZSTD_blockHeaderSize; ++ dctx->stage = ZSTDds_decodeBlockHeader; ++ return 0; ++ ++ case ZSTDds_decodeBlockHeader: { ++ blockProperties_t bp; ++ size_t const cBlockSize = ZSTD_getcBlockSize(src, ZSTD_blockHeaderSize, &bp); ++ if (ZSTD_isError(cBlockSize)) ++ return cBlockSize; ++ dctx->expected = cBlockSize; ++ dctx->bType = bp.blockType; ++ dctx->rleSize = bp.origSize; ++ if (cBlockSize) { ++ dctx->stage = bp.lastBlock ? ZSTDds_decompressLastBlock : ZSTDds_decompressBlock; ++ return 0; ++ } ++ /* empty block */ ++ if (bp.lastBlock) { ++ if (dctx->fParams.checksumFlag) { ++ dctx->expected = 4; ++ dctx->stage = ZSTDds_checkChecksum; ++ } else { ++ dctx->expected = 0; /* end of frame */ ++ dctx->stage = ZSTDds_getFrameHeaderSize; ++ } ++ } else { ++ dctx->expected = 3; /* go directly to next header */ ++ dctx->stage = ZSTDds_decodeBlockHeader; ++ } ++ return 0; ++ } ++ case ZSTDds_decompressLastBlock: ++ case ZSTDds_decompressBlock: { ++ size_t rSize; ++ switch (dctx->bType) { ++ case bt_compressed: rSize = ZSTD_decompressBlock_internal(dctx, dst, dstCapacity, src, srcSize); break; ++ case bt_raw: rSize = ZSTD_copyRawBlock(dst, dstCapacity, src, srcSize); break; ++ case bt_rle: rSize = ZSTD_setRleBlock(dst, dstCapacity, src, srcSize, dctx->rleSize); break; ++ case bt_reserved: /* should never happen */ ++ default: return ERROR(corruption_detected); ++ } ++ if (ZSTD_isError(rSize)) ++ return rSize; ++ if (dctx->fParams.checksumFlag) ++ xxh64_update(&dctx->xxhState, dst, rSize); ++ ++ if (dctx->stage == ZSTDds_decompressLastBlock) { /* end of frame */ ++ if (dctx->fParams.checksumFlag) { /* another round for frame checksum */ ++ dctx->expected = 4; ++ dctx->stage = ZSTDds_checkChecksum; ++ } else { ++ dctx->expected = 0; /* ends here */ ++ dctx->stage = ZSTDds_getFrameHeaderSize; ++ } ++ } else { ++ dctx->stage = ZSTDds_decodeBlockHeader; ++ dctx->expected = ZSTD_blockHeaderSize; ++ dctx->previousDstEnd = (char *)dst + rSize; ++ } ++ return rSize; ++ } ++ case ZSTDds_checkChecksum: { ++ U32 const h32 = (U32)xxh64_digest(&dctx->xxhState); ++ U32 const check32 = ZSTD_readLE32(src); /* srcSize == 4, guaranteed by dctx->expected */ ++ if (check32 != h32) ++ return ERROR(checksum_wrong); ++ dctx->expected = 0; ++ dctx->stage = ZSTDds_getFrameHeaderSize; ++ return 0; ++ } ++ case ZSTDds_decodeSkippableHeader: { ++ memcpy(dctx->headerBuffer + ZSTD_frameHeaderSize_prefix, src, dctx->expected); ++ dctx->expected = ZSTD_readLE32(dctx->headerBuffer + 4); ++ dctx->stage = ZSTDds_skipFrame; ++ return 0; ++ } ++ case ZSTDds_skipFrame: { ++ dctx->expected = 0; ++ dctx->stage = ZSTDds_getFrameHeaderSize; ++ return 0; ++ } ++ default: ++ return ERROR(GENERIC); /* impossible */ ++ } ++} ++ ++static size_t INIT ZSTD_refDictContent(ZSTD_DCtx *dctx, const void *dict, size_t dictSize) ++{ ++ dctx->dictEnd = dctx->previousDstEnd; ++ dctx->vBase = (const char *)dict - ((const char *)(dctx->previousDstEnd) - (const char *)(dctx->base)); ++ dctx->base = dict; ++ dctx->previousDstEnd = (const char *)dict + dictSize; ++ return 0; ++} ++ ++/* ZSTD_loadEntropy() : ++ * dict : must point at beginning of a valid zstd dictionary ++ * @return : 
size of entropy tables read */ ++static size_t INIT ZSTD_loadEntropy(ZSTD_entropyTables_t *entropy, const void *const dict, size_t const dictSize) ++{ ++ const BYTE *dictPtr = (const BYTE *)dict; ++ const BYTE *const dictEnd = dictPtr + dictSize; ++ ++ if (dictSize <= 8) ++ return ERROR(dictionary_corrupted); ++ dictPtr += 8; /* skip header = magic + dictID */ ++ ++ { ++ size_t const hSize = HUF_readDTableX4_wksp(entropy->hufTable, dictPtr, dictEnd - dictPtr, entropy->workspace, sizeof(entropy->workspace)); ++ if (HUF_isError(hSize)) ++ return ERROR(dictionary_corrupted); ++ dictPtr += hSize; ++ } ++ ++ { ++ short offcodeNCount[MaxOff + 1]; ++ U32 offcodeMaxValue = MaxOff, offcodeLog; ++ size_t const offcodeHeaderSize = FSE_readNCount(offcodeNCount, &offcodeMaxValue, &offcodeLog, dictPtr, dictEnd - dictPtr); ++ if (FSE_isError(offcodeHeaderSize)) ++ return ERROR(dictionary_corrupted); ++ if (offcodeLog > OffFSELog) ++ return ERROR(dictionary_corrupted); ++ CHECK_E(FSE_buildDTable_wksp(entropy->OFTable, offcodeNCount, offcodeMaxValue, offcodeLog, entropy->workspace, sizeof(entropy->workspace)), dictionary_corrupted); ++ dictPtr += offcodeHeaderSize; ++ } ++ ++ { ++ short matchlengthNCount[MaxML + 1]; ++ unsigned matchlengthMaxValue = MaxML, matchlengthLog; ++ size_t const matchlengthHeaderSize = FSE_readNCount(matchlengthNCount, &matchlengthMaxValue, &matchlengthLog, dictPtr, dictEnd - dictPtr); ++ if (FSE_isError(matchlengthHeaderSize)) ++ return ERROR(dictionary_corrupted); ++ if (matchlengthLog > MLFSELog) ++ return ERROR(dictionary_corrupted); ++ CHECK_E(FSE_buildDTable_wksp(entropy->MLTable, matchlengthNCount, matchlengthMaxValue, matchlengthLog, entropy->workspace, sizeof(entropy->workspace)), dictionary_corrupted); ++ dictPtr += matchlengthHeaderSize; ++ } ++ ++ { ++ short litlengthNCount[MaxLL + 1]; ++ unsigned litlengthMaxValue = MaxLL, litlengthLog; ++ size_t const litlengthHeaderSize = FSE_readNCount(litlengthNCount, &litlengthMaxValue, &litlengthLog, dictPtr, dictEnd - dictPtr); ++ if (FSE_isError(litlengthHeaderSize)) ++ return ERROR(dictionary_corrupted); ++ if (litlengthLog > LLFSELog) ++ return ERROR(dictionary_corrupted); ++ CHECK_E(FSE_buildDTable_wksp(entropy->LLTable, litlengthNCount, litlengthMaxValue, litlengthLog, entropy->workspace, sizeof(entropy->workspace)), dictionary_corrupted); ++ dictPtr += litlengthHeaderSize; ++ } ++ ++ if (dictPtr + 12 > dictEnd) ++ return ERROR(dictionary_corrupted); ++ { ++ int i; ++ size_t const dictContentSize = (size_t)(dictEnd - (dictPtr + 12)); ++ for (i = 0; i < 3; i++) { ++ U32 const rep = ZSTD_readLE32(dictPtr); ++ dictPtr += 4; ++ if (rep == 0 || rep >= dictContentSize) ++ return ERROR(dictionary_corrupted); ++ entropy->rep[i] = rep; ++ } ++ } ++ ++ return dictPtr - (const BYTE *)dict; ++} ++ ++static size_t INIT ZSTD_decompress_insertDictionary(ZSTD_DCtx *dctx, const void *dict, size_t dictSize) ++{ ++ if (dictSize < 8) ++ return ZSTD_refDictContent(dctx, dict, dictSize); ++ { ++ U32 const magic = ZSTD_readLE32(dict); ++ if (magic != ZSTD_DICT_MAGIC) { ++ return ZSTD_refDictContent(dctx, dict, dictSize); /* pure content mode */ ++ } ++ } ++ dctx->dictID = ZSTD_readLE32((const char *)dict + 4); ++ ++ /* load entropy tables */ ++ { ++ size_t const eSize = ZSTD_loadEntropy(&dctx->entropy, dict, dictSize); ++ if (ZSTD_isError(eSize)) ++ return ERROR(dictionary_corrupted); ++ dict = (const char *)dict + eSize; ++ dictSize -= eSize; ++ } ++ dctx->litEntropy = dctx->fseEntropy = 1; ++ ++ /* reference dictionary content */ ++ return 
ZSTD_refDictContent(dctx, dict, dictSize); ++} ++ ++size_t INIT ZSTD_decompressBegin_usingDict(ZSTD_DCtx *dctx, const void *dict, size_t dictSize) ++{ ++ CHECK_F(ZSTD_decompressBegin(dctx)); ++ if (dict && dictSize) ++ CHECK_E(ZSTD_decompress_insertDictionary(dctx, dict, dictSize), dictionary_corrupted); ++ return 0; ++} ++ ++/* ====== ZSTD_DDict ====== */ ++ ++struct ZSTD_DDict_s { ++ void *dictBuffer; ++ const void *dictContent; ++ size_t dictSize; ++ ZSTD_entropyTables_t entropy; ++ U32 dictID; ++ U32 entropyPresent; ++ ZSTD_customMem cMem; ++}; /* typedef'd to ZSTD_DDict within "zstd.h" */ ++ ++size_t INIT ZSTD_DDictWorkspaceBound(void) { return ZSTD_ALIGN(sizeof(ZSTD_stack)) + ZSTD_ALIGN(sizeof(ZSTD_DDict)); } ++ ++static const void *INIT ZSTD_DDictDictContent(const ZSTD_DDict *ddict) { return ddict->dictContent; } ++ ++static size_t INIT ZSTD_DDictDictSize(const ZSTD_DDict *ddict) { return ddict->dictSize; } ++ ++static void INIT ZSTD_refDDict(ZSTD_DCtx *dstDCtx, const ZSTD_DDict *ddict) ++{ ++ ZSTD_decompressBegin(dstDCtx); /* init */ ++ if (ddict) { /* support refDDict on NULL */ ++ dstDCtx->dictID = ddict->dictID; ++ dstDCtx->base = ddict->dictContent; ++ dstDCtx->vBase = ddict->dictContent; ++ dstDCtx->dictEnd = (const BYTE *)ddict->dictContent + ddict->dictSize; ++ dstDCtx->previousDstEnd = dstDCtx->dictEnd; ++ if (ddict->entropyPresent) { ++ dstDCtx->litEntropy = 1; ++ dstDCtx->fseEntropy = 1; ++ dstDCtx->LLTptr = ddict->entropy.LLTable; ++ dstDCtx->MLTptr = ddict->entropy.MLTable; ++ dstDCtx->OFTptr = ddict->entropy.OFTable; ++ dstDCtx->HUFptr = ddict->entropy.hufTable; ++ dstDCtx->entropy.rep[0] = ddict->entropy.rep[0]; ++ dstDCtx->entropy.rep[1] = ddict->entropy.rep[1]; ++ dstDCtx->entropy.rep[2] = ddict->entropy.rep[2]; ++ } else { ++ dstDCtx->litEntropy = 0; ++ dstDCtx->fseEntropy = 0; ++ } ++ } ++} ++ ++static size_t INIT ZSTD_loadEntropy_inDDict(ZSTD_DDict *ddict) ++{ ++ ddict->dictID = 0; ++ ddict->entropyPresent = 0; ++ if (ddict->dictSize < 8) ++ return 0; ++ { ++ U32 const magic = ZSTD_readLE32(ddict->dictContent); ++ if (magic != ZSTD_DICT_MAGIC) ++ return 0; /* pure content mode */ ++ } ++ ddict->dictID = ZSTD_readLE32((const char *)ddict->dictContent + 4); ++ ++ /* load entropy tables */ ++ CHECK_E(ZSTD_loadEntropy(&ddict->entropy, ddict->dictContent, ddict->dictSize), dictionary_corrupted); ++ ddict->entropyPresent = 1; ++ return 0; ++} ++ ++static ZSTD_DDict *INIT ZSTD_createDDict_advanced(const void *dict, size_t dictSize, unsigned byReference, ZSTD_customMem customMem) ++{ ++ if (!customMem.customAlloc || !customMem.customFree) ++ return NULL; ++ ++ { ++ ZSTD_DDict *const ddict = (ZSTD_DDict *)ZSTD_malloc(sizeof(ZSTD_DDict), customMem); ++ if (!ddict) ++ return NULL; ++ ddict->cMem = customMem; ++ ++ if ((byReference) || (!dict) || (!dictSize)) { ++ ddict->dictBuffer = NULL; ++ ddict->dictContent = dict; ++ } else { ++ void *const internalBuffer = ZSTD_malloc(dictSize, customMem); ++ if (!internalBuffer) { ++ ZSTD_freeDDict(ddict); ++ return NULL; ++ } ++ memcpy(internalBuffer, dict, dictSize); ++ ddict->dictBuffer = internalBuffer; ++ ddict->dictContent = internalBuffer; ++ } ++ ddict->dictSize = dictSize; ++ ddict->entropy.hufTable[0] = (HUF_DTable)((HufLog)*0x1000001); /* cover both little and big endian */ ++ /* parse dictionary content */ ++ { ++ size_t const errorCode = ZSTD_loadEntropy_inDDict(ddict); ++ if (ZSTD_isError(errorCode)) { ++ ZSTD_freeDDict(ddict); ++ return NULL; ++ } ++ } ++ ++ return ddict; ++ } ++} ++ ++/*! 
ZSTD_initDDict() : ++* Create a digested dictionary, to start decompression without startup delay. ++* `dict` content is referenced rather than copied into the DDict (this build passes byReference=1). ++* Consequently, `dict` must remain valid for the lifetime of the `ZSTD_DDict` */ ++ZSTD_DDict *INIT ZSTD_initDDict(const void *dict, size_t dictSize, void *workspace, size_t workspaceSize) ++{ ++ ZSTD_customMem const stackMem = ZSTD_initStack(workspace, workspaceSize); ++ return ZSTD_createDDict_advanced(dict, dictSize, 1, stackMem); ++} ++ ++size_t INIT ZSTD_freeDDict(ZSTD_DDict *ddict) ++{ ++ if (ddict == NULL) ++ return 0; /* support free on NULL */ ++ { ++ ZSTD_customMem const cMem = ddict->cMem; ++ ZSTD_free(ddict->dictBuffer, cMem); ++ ZSTD_free(ddict, cMem); ++ return 0; ++ } ++} ++ ++/*! ZSTD_getDictID_fromDict() : ++ * Provides the dictID stored within the dictionary. ++ * If @return == 0, the dictionary is not conformant with the Zstandard specification. ++ * It can still be loaded, but as a content-only dictionary. */ ++unsigned INIT ZSTD_getDictID_fromDict(const void *dict, size_t dictSize) ++{ ++ if (dictSize < 8) ++ return 0; ++ if (ZSTD_readLE32(dict) != ZSTD_DICT_MAGIC) ++ return 0; ++ return ZSTD_readLE32((const char *)dict + 4); ++} ++ ++/*! ZSTD_getDictID_fromDDict() : ++ * Provides the dictID of the dictionary loaded into `ddict`. ++ * If @return == 0, the dictionary is not conformant to the Zstandard specification, or empty. ++ * Non-conformant dictionaries can still be loaded, but as content-only dictionaries. */ ++unsigned INIT ZSTD_getDictID_fromDDict(const ZSTD_DDict *ddict) ++{ ++ if (ddict == NULL) ++ return 0; ++ return ZSTD_getDictID_fromDict(ddict->dictContent, ddict->dictSize); ++} ++ ++/*! ZSTD_getDictID_fromFrame() : ++ * Provides the dictID required to decompress the frame stored within `src`. ++ * If @return == 0, the dictID could not be decoded. ++ * This could be for one of the following reasons : ++ * - The frame does not require a dictionary to be decoded (most common case). ++ * - The frame was built with dictID intentionally removed. Whatever dictionary is necessary is hidden information. ++ * Note : this use case also happens when using a non-conformant dictionary. ++ * - `srcSize` is too small, and as a result, the frame header could not be decoded (only possible if `srcSize < ZSTD_FRAMEHEADERSIZE_MAX`). ++ * - This is not a Zstandard frame. ++ * When identifying the exact failure cause, it's possible to use ZSTD_getFrameParams(), which will provide a more precise error code. */ ++unsigned INIT ZSTD_getDictID_fromFrame(const void *src, size_t srcSize) ++{ ++ ZSTD_frameParams zfp = {0, 0, 0, 0}; ++ size_t const hError = ZSTD_getFrameParams(&zfp, src, srcSize); ++ if (ZSTD_isError(hError)) ++ return 0; ++ return zfp.dictID; ++} ++ ++/*! ZSTD_decompress_usingDDict() : ++* Decompression using a pre-digested Dictionary ++* Use dictionary without significant overhead. 
*/ ++size_t INIT ZSTD_decompress_usingDDict(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const ZSTD_DDict *ddict) ++{ ++ /* pass content and size in case legacy frames are encountered */ ++ return ZSTD_decompressMultiFrame(dctx, dst, dstCapacity, src, srcSize, NULL, 0, ddict); ++} ++ ++/*===================================== ++* Streaming decompression ++*====================================*/ ++ ++typedef enum { zdss_init, zdss_loadHeader, zdss_read, zdss_load, zdss_flush } ZSTD_dStreamStage; ++ ++/* *** Resource management *** */ ++struct ZSTD_DStream_s { ++ ZSTD_DCtx *dctx; ++ ZSTD_DDict *ddictLocal; ++ const ZSTD_DDict *ddict; ++ ZSTD_frameParams fParams; ++ ZSTD_dStreamStage stage; ++ char *inBuff; ++ size_t inBuffSize; ++ size_t inPos; ++ size_t maxWindowSize; ++ char *outBuff; ++ size_t outBuffSize; ++ size_t outStart; ++ size_t outEnd; ++ size_t blockSize; ++ BYTE headerBuffer[ZSTD_FRAMEHEADERSIZE_MAX]; /* tmp buffer to store frame header */ ++ size_t lhSize; ++ ZSTD_customMem customMem; ++ void *legacyContext; ++ U32 previousLegacyVersion; ++ U32 legacyVersion; ++ U32 hostageByte; ++}; /* typedef'd to ZSTD_DStream within "zstd.h" */ ++ ++size_t INIT ZSTD_DStreamWorkspaceBound(size_t maxWindowSize) ++{ ++ size_t const blockSize = MIN(maxWindowSize, ZSTD_BLOCKSIZE_ABSOLUTEMAX); ++ size_t const inBuffSize = blockSize; ++ size_t const outBuffSize = maxWindowSize + blockSize + WILDCOPY_OVERLENGTH * 2; ++ return ZSTD_DCtxWorkspaceBound() + ZSTD_ALIGN(sizeof(ZSTD_DStream)) + ZSTD_ALIGN(inBuffSize) + ZSTD_ALIGN(outBuffSize); ++} ++ ++static ZSTD_DStream *INIT ZSTD_createDStream_advanced(ZSTD_customMem customMem) ++{ ++ ZSTD_DStream *zds; ++ ++ if (!customMem.customAlloc || !customMem.customFree) ++ return NULL; ++ ++ zds = (ZSTD_DStream *)ZSTD_malloc(sizeof(ZSTD_DStream), customMem); ++ if (zds == NULL) ++ return NULL; ++ memset(zds, 0, sizeof(ZSTD_DStream)); ++ memcpy(&zds->customMem, &customMem, sizeof(ZSTD_customMem)); ++ zds->dctx = ZSTD_createDCtx_advanced(customMem); ++ if (zds->dctx == NULL) { ++ ZSTD_freeDStream(zds); ++ return NULL; ++ } ++ zds->stage = zdss_init; ++ zds->maxWindowSize = ZSTD_MAXWINDOWSIZE_DEFAULT; ++ return zds; ++} ++ ++ZSTD_DStream *INIT ZSTD_initDStream(size_t maxWindowSize, void *workspace, size_t workspaceSize) ++{ ++ ZSTD_customMem const stackMem = ZSTD_initStack(workspace, workspaceSize); ++ ZSTD_DStream *zds = ZSTD_createDStream_advanced(stackMem); ++ if (!zds) { ++ return NULL; ++ } ++ ++ zds->maxWindowSize = maxWindowSize; ++ zds->stage = zdss_loadHeader; ++ zds->lhSize = zds->inPos = zds->outStart = zds->outEnd = 0; ++ ZSTD_freeDDict(zds->ddictLocal); ++ zds->ddictLocal = NULL; ++ zds->ddict = zds->ddictLocal; ++ zds->legacyVersion = 0; ++ zds->hostageByte = 0; ++ ++ { ++ size_t const blockSize = MIN(zds->maxWindowSize, ZSTD_BLOCKSIZE_ABSOLUTEMAX); ++ size_t const neededOutSize = zds->maxWindowSize + blockSize + WILDCOPY_OVERLENGTH * 2; ++ ++ zds->inBuff = (char *)ZSTD_malloc(blockSize, zds->customMem); ++ zds->inBuffSize = blockSize; ++ zds->outBuff = (char *)ZSTD_malloc(neededOutSize, zds->customMem); ++ zds->outBuffSize = neededOutSize; ++ if (zds->inBuff == NULL || zds->outBuff == NULL) { ++ ZSTD_freeDStream(zds); ++ return NULL; ++ } ++ } ++ return zds; ++} ++ ++ZSTD_DStream *INIT ZSTD_initDStream_usingDDict(size_t maxWindowSize, const ZSTD_DDict *ddict, void *workspace, size_t workspaceSize) ++{ ++ ZSTD_DStream *zds = ZSTD_initDStream(maxWindowSize, workspace, workspaceSize); ++ if (zds) { ++ zds->ddict = 
ddict; ++ } ++ return zds; ++} ++ ++size_t INIT ZSTD_freeDStream(ZSTD_DStream *zds) ++{ ++ if (zds == NULL) ++ return 0; /* support free on null */ ++ { ++ ZSTD_customMem const cMem = zds->customMem; ++ ZSTD_freeDCtx(zds->dctx); ++ zds->dctx = NULL; ++ ZSTD_freeDDict(zds->ddictLocal); ++ zds->ddictLocal = NULL; ++ ZSTD_free(zds->inBuff, cMem); ++ zds->inBuff = NULL; ++ ZSTD_free(zds->outBuff, cMem); ++ zds->outBuff = NULL; ++ ZSTD_free(zds, cMem); ++ return 0; ++ } ++} ++ ++/* *** Initialization *** */ ++ ++size_t INIT ZSTD_DStreamInSize(void) { return ZSTD_BLOCKSIZE_ABSOLUTEMAX + ZSTD_blockHeaderSize; } ++size_t INIT ZSTD_DStreamOutSize(void) { return ZSTD_BLOCKSIZE_ABSOLUTEMAX; } ++ ++size_t INIT ZSTD_resetDStream(ZSTD_DStream *zds) ++{ ++ zds->stage = zdss_loadHeader; ++ zds->lhSize = zds->inPos = zds->outStart = zds->outEnd = 0; ++ zds->legacyVersion = 0; ++ zds->hostageByte = 0; ++ return ZSTD_frameHeaderSize_prefix; ++} ++ ++/* ***** Decompression ***** */ ++ ++ZSTD_STATIC size_t INIT ZSTD_limitCopy(void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++{ ++ size_t const length = MIN(dstCapacity, srcSize); ++ memcpy(dst, src, length); ++ return length; ++} ++ ++size_t INIT ZSTD_decompressStream(ZSTD_DStream *zds, ZSTD_outBuffer *output, ZSTD_inBuffer *input) ++{ ++ const char *const istart = (const char *)(input->src) + input->pos; ++ const char *const iend = (const char *)(input->src) + input->size; ++ const char *ip = istart; ++ char *const ostart = (char *)(output->dst) + output->pos; ++ char *const oend = (char *)(output->dst) + output->size; ++ char *op = ostart; ++ U32 someMoreWork = 1; ++ ++ while (someMoreWork) { ++ switch (zds->stage) { ++ case zdss_init: ++ ZSTD_resetDStream(zds); /* transparent reset on starting decoding a new frame */ ++ /* fallthrough */ ++ ++ case zdss_loadHeader: { ++ size_t const hSize = ZSTD_getFrameParams(&zds->fParams, zds->headerBuffer, zds->lhSize); ++ if (ZSTD_isError(hSize)) ++ return hSize; ++ if (hSize != 0) { /* need more input */ ++ size_t const toLoad = hSize - zds->lhSize; /* if hSize!=0, hSize > zds->lhSize */ ++ if (toLoad > (size_t)(iend - ip)) { /* not enough input to load full header */ ++ memcpy(zds->headerBuffer + zds->lhSize, ip, iend - ip); ++ zds->lhSize += iend - ip; ++ input->pos = input->size; ++ return (MAX(ZSTD_frameHeaderSize_min, hSize) - zds->lhSize) + ++ ZSTD_blockHeaderSize; /* remaining header bytes + next block header */ ++ } ++ memcpy(zds->headerBuffer + zds->lhSize, ip, toLoad); ++ zds->lhSize = hSize; ++ ip += toLoad; ++ break; ++ } ++ ++ /* check for single-pass mode opportunity */ ++ if (zds->fParams.frameContentSize && zds->fParams.windowSize /* skippable frame if == 0 */ ++ && (U64)(size_t)(oend - op) >= zds->fParams.frameContentSize) { ++ size_t const cSize = ZSTD_findFrameCompressedSize(istart, iend - istart); ++ if (cSize <= (size_t)(iend - istart)) { ++ size_t const decompressedSize = ZSTD_decompress_usingDDict(zds->dctx, op, oend - op, istart, cSize, zds->ddict); ++ if (ZSTD_isError(decompressedSize)) ++ return decompressedSize; ++ ip = istart + cSize; ++ op += decompressedSize; ++ zds->dctx->expected = 0; ++ zds->stage = zdss_init; ++ someMoreWork = 0; ++ break; ++ } ++ } ++ ++ /* Consume header */ ++ ZSTD_refDDict(zds->dctx, zds->ddict); ++ { ++ size_t const h1Size = ZSTD_nextSrcSizeToDecompress(zds->dctx); /* == ZSTD_frameHeaderSize_prefix */ ++ CHECK_F(ZSTD_decompressContinue(zds->dctx, NULL, 0, zds->headerBuffer, h1Size)); ++ { ++ size_t const h2Size = 
ZSTD_nextSrcSizeToDecompress(zds->dctx); ++ CHECK_F(ZSTD_decompressContinue(zds->dctx, NULL, 0, zds->headerBuffer + h1Size, h2Size)); ++ } ++ } ++ ++ zds->fParams.windowSize = MAX(zds->fParams.windowSize, 1U << ZSTD_WINDOWLOG_ABSOLUTEMIN); ++ if (zds->fParams.windowSize > zds->maxWindowSize) ++ return ERROR(frameParameter_windowTooLarge); ++ ++ /* Buffers are preallocated, but double check */ ++ { ++ size_t const blockSize = MIN(zds->maxWindowSize, ZSTD_BLOCKSIZE_ABSOLUTEMAX); ++ size_t const neededOutSize = zds->maxWindowSize + blockSize + WILDCOPY_OVERLENGTH * 2; ++ if (zds->inBuffSize < blockSize) { ++ return ERROR(GENERIC); ++ } ++ if (zds->outBuffSize < neededOutSize) { ++ return ERROR(GENERIC); ++ } ++ zds->blockSize = blockSize; ++ } ++ zds->stage = zdss_read; ++ } ++ /* fallthrough */ ++ ++ case zdss_read: { ++ size_t const neededInSize = ZSTD_nextSrcSizeToDecompress(zds->dctx); ++ if (neededInSize == 0) { /* end of frame */ ++ zds->stage = zdss_init; ++ someMoreWork = 0; ++ break; ++ } ++ if ((size_t)(iend - ip) >= neededInSize) { /* decode directly from src */ ++ const int isSkipFrame = ZSTD_isSkipFrame(zds->dctx); ++ size_t const decodedSize = ZSTD_decompressContinue(zds->dctx, zds->outBuff + zds->outStart, ++ (isSkipFrame ? 0 : zds->outBuffSize - zds->outStart), ip, neededInSize); ++ if (ZSTD_isError(decodedSize)) ++ return decodedSize; ++ ip += neededInSize; ++ if (!decodedSize && !isSkipFrame) ++ break; /* this was just a header */ ++ zds->outEnd = zds->outStart + decodedSize; ++ zds->stage = zdss_flush; ++ break; ++ } ++ if (ip == iend) { ++ someMoreWork = 0; ++ break; ++ } /* no more input */ ++ zds->stage = zdss_load; ++ /* pass-through */ ++ } ++ /* fallthrough */ ++ ++ case zdss_load: { ++ size_t const neededInSize = ZSTD_nextSrcSizeToDecompress(zds->dctx); ++ size_t const toLoad = neededInSize - zds->inPos; /* should always be <= remaining space within inBuff */ ++ size_t loadedSize; ++ if (toLoad > zds->inBuffSize - zds->inPos) ++ return ERROR(corruption_detected); /* should never happen */ ++ loadedSize = ZSTD_limitCopy(zds->inBuff + zds->inPos, toLoad, ip, iend - ip); ++ ip += loadedSize; ++ zds->inPos += loadedSize; ++ if (loadedSize < toLoad) { ++ someMoreWork = 0; ++ break; ++ } /* not enough input, wait for more */ ++ ++ /* decode loaded input */ ++ { ++ const int isSkipFrame = ZSTD_isSkipFrame(zds->dctx); ++ size_t const decodedSize = ZSTD_decompressContinue(zds->dctx, zds->outBuff + zds->outStart, zds->outBuffSize - zds->outStart, ++ zds->inBuff, neededInSize); ++ if (ZSTD_isError(decodedSize)) ++ return decodedSize; ++ zds->inPos = 0; /* input is consumed */ ++ if (!decodedSize && !isSkipFrame) { ++ zds->stage = zdss_read; ++ break; ++ } /* this was just a header */ ++ zds->outEnd = zds->outStart + decodedSize; ++ zds->stage = zdss_flush; ++ /* pass-through */ ++ } ++ } ++ /* fallthrough */ ++ ++ case zdss_flush: { ++ size_t const toFlushSize = zds->outEnd - zds->outStart; ++ size_t const flushedSize = ZSTD_limitCopy(op, oend - op, zds->outBuff + zds->outStart, toFlushSize); ++ op += flushedSize; ++ zds->outStart += flushedSize; ++ if (flushedSize == toFlushSize) { /* flush completed */ ++ zds->stage = zdss_read; ++ if (zds->outStart + zds->blockSize > zds->outBuffSize) ++ zds->outStart = zds->outEnd = 0; ++ break; ++ } ++ /* cannot complete flush */ ++ someMoreWork = 0; ++ break; ++ } ++ default: ++ return ERROR(GENERIC); /* impossible */ ++ } ++ } ++ ++ /* result */ ++ input->pos += (size_t)(ip - istart); ++ output->pos += (size_t)(op - ostart); ++ { ++ 
size_t nextSrcSizeHint = ZSTD_nextSrcSizeToDecompress(zds->dctx); ++ if (!nextSrcSizeHint) { /* frame fully decoded */ ++ if (zds->outEnd == zds->outStart) { /* output fully flushed */ ++ if (zds->hostageByte) { ++ if (input->pos >= input->size) { ++ zds->stage = zdss_read; ++ return 1; ++ } /* can't release hostage (not present) */ ++ input->pos++; /* release hostage */ ++ } ++ return 0; ++ } ++ if (!zds->hostageByte) { /* output not fully flushed; keep last byte as hostage; will be released when all output is flushed */ ++ input->pos--; /* note : pos > 0, otherwise, impossible to finish reading last block */ ++ zds->hostageByte = 1; ++ } ++ return 1; ++ } ++ nextSrcSizeHint += ZSTD_blockHeaderSize * (ZSTD_nextInputType(zds->dctx) == ZSTDnit_block); /* preload header of next block */ ++ if (zds->inPos > nextSrcSizeHint) ++ return ERROR(GENERIC); /* should never happen */ ++ nextSrcSizeHint -= zds->inPos; /* already loaded*/ ++ return nextSrcSizeHint; ++ } ++} +diff --git a/xen/common/zstd/entropy_common.c b/xen/common/zstd/entropy_common.c +new file mode 100644 +index 000000000000..bcdb57982ba5 +--- /dev/null ++++ b/xen/common/zstd/entropy_common.c +@@ -0,0 +1,243 @@ ++/* ++ * Common functions of New Generation Entropy library ++ * Copyright (C) 2016, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). 
++ * ++ * You can contact the author at : ++ * - Source repository : https://github.com/Cyan4973/FiniteStateEntropy ++ */ ++ ++/* ************************************* ++* Dependencies ++***************************************/ ++#include "error_private.h" /* ERR_*, ERROR */ ++#include "fse.h" ++#include "huf.h" ++#include "mem.h" ++ ++/*=== Version ===*/ ++unsigned INIT FSE_versionNumber(void) { return FSE_VERSION_NUMBER; } ++ ++/*=== Error Management ===*/ ++unsigned INIT FSE_isError(size_t code) { return ERR_isError(code); } ++ ++unsigned INIT HUF_isError(size_t code) { return ERR_isError(code); } ++ ++/*-************************************************************** ++* FSE NCount encoding-decoding ++****************************************************************/ ++size_t INIT FSE_readNCount(short *normalizedCounter, unsigned *maxSVPtr, unsigned *tableLogPtr, const void *headerBuffer, size_t hbSize) ++{ ++ const BYTE *const istart = (const BYTE *)headerBuffer; ++ const BYTE *const iend = istart + hbSize; ++ const BYTE *ip = istart; ++ int nbBits; ++ int remaining; ++ int threshold; ++ U32 bitStream; ++ int bitCount; ++ unsigned charnum = 0; ++ int previous0 = 0; ++ ++ if (hbSize < 4) ++ return ERROR(srcSize_wrong); ++ bitStream = ZSTD_readLE32(ip); ++ nbBits = (bitStream & 0xF) + FSE_MIN_TABLELOG; /* extract tableLog */ ++ if (nbBits > FSE_TABLELOG_ABSOLUTE_MAX) ++ return ERROR(tableLog_tooLarge); ++ bitStream >>= 4; ++ bitCount = 4; ++ *tableLogPtr = nbBits; ++ remaining = (1 << nbBits) + 1; ++ threshold = 1 << nbBits; ++ nbBits++; ++ ++ while ((remaining > 1) & (charnum <= *maxSVPtr)) { ++ if (previous0) { ++ unsigned n0 = charnum; ++ while ((bitStream & 0xFFFF) == 0xFFFF) { ++ n0 += 24; ++ if (ip < iend - 5) { ++ ip += 2; ++ bitStream = ZSTD_readLE32(ip) >> bitCount; ++ } else { ++ bitStream >>= 16; ++ bitCount += 16; ++ } ++ } ++ while ((bitStream & 3) == 3) { ++ n0 += 3; ++ bitStream >>= 2; ++ bitCount += 2; ++ } ++ n0 += bitStream & 3; ++ bitCount += 2; ++ if (n0 > *maxSVPtr) ++ return ERROR(maxSymbolValue_tooSmall); ++ while (charnum < n0) ++ normalizedCounter[charnum++] = 0; ++ if ((ip <= iend - 7) || (ip + (bitCount >> 3) <= iend - 4)) { ++ ip += bitCount >> 3; ++ bitCount &= 7; ++ bitStream = ZSTD_readLE32(ip) >> bitCount; ++ } else { ++ bitStream >>= 2; ++ } ++ } ++ { ++ int const max = (2 * threshold - 1) - remaining; ++ int count; ++ ++ if ((bitStream & (threshold - 1)) < (U32)max) { ++ count = bitStream & (threshold - 1); ++ bitCount += nbBits - 1; ++ } else { ++ count = bitStream & (2 * threshold - 1); ++ if (count >= threshold) ++ count -= max; ++ bitCount += nbBits; ++ } ++ ++ count--; /* extra accuracy */ ++ remaining -= count < 0 ? -count : count; /* -1 means +1 */ ++ normalizedCounter[charnum++] = (short)count; ++ previous0 = !count; ++ while (remaining < threshold) { ++ nbBits--; ++ threshold >>= 1; ++ } ++ ++ if ((ip <= iend - 7) || (ip + (bitCount >> 3) <= iend - 4)) { ++ ip += bitCount >> 3; ++ bitCount &= 7; ++ } else { ++ bitCount -= (int)(8 * (iend - 4 - ip)); ++ ip = iend - 4; ++ } ++ bitStream = ZSTD_readLE32(ip) >> (bitCount & 31); ++ } ++ } /* while ((remaining>1) & (charnum<=*maxSVPtr)) */ ++ if (remaining != 1) ++ return ERROR(corruption_detected); ++ if (bitCount > 32) ++ return ERROR(corruption_detected); ++ *maxSVPtr = charnum - 1; ++ ++ ip += (bitCount + 7) >> 3; ++ return ip - istart; ++} ++ ++/*! HUF_readStats() : ++ Read compact Huffman tree, saved by HUF_writeCTable(). ++ `huffWeight` is destination buffer. 
++ `rankStats` is assumed to be a table of at least HUF_TABLELOG_MAX U32. ++ @return : size read from `src` , or an error Code . ++ Note : Needed by HUF_readCTable() and HUF_readDTableX?() . ++*/ ++size_t INIT HUF_readStats_wksp(BYTE *huffWeight, size_t hwSize, U32 *rankStats, U32 *nbSymbolsPtr, U32 *tableLogPtr, const void *src, size_t srcSize, void *workspace, size_t workspaceSize) ++{ ++ U32 weightTotal; ++ const BYTE *ip = (const BYTE *)src; ++ size_t iSize; ++ size_t oSize; ++ ++ if (!srcSize) ++ return ERROR(srcSize_wrong); ++ iSize = ip[0]; ++ /* memset(huffWeight, 0, hwSize); */ /* is not necessary, even though some analyzer complain ... */ ++ ++ if (iSize >= 128) { /* special header */ ++ oSize = iSize - 127; ++ iSize = ((oSize + 1) / 2); ++ if (iSize + 1 > srcSize) ++ return ERROR(srcSize_wrong); ++ if (oSize >= hwSize) ++ return ERROR(corruption_detected); ++ ip += 1; ++ { ++ U32 n; ++ for (n = 0; n < oSize; n += 2) { ++ huffWeight[n] = ip[n / 2] >> 4; ++ huffWeight[n + 1] = ip[n / 2] & 15; ++ } ++ } ++ } else { /* header compressed with FSE (normal case) */ ++ if (iSize + 1 > srcSize) ++ return ERROR(srcSize_wrong); ++ oSize = FSE_decompress_wksp(huffWeight, hwSize - 1, ip + 1, iSize, 6, workspace, workspaceSize); /* max (hwSize-1) values decoded, as last one is implied */ ++ if (FSE_isError(oSize)) ++ return oSize; ++ } ++ ++ /* collect weight stats */ ++ memset(rankStats, 0, (HUF_TABLELOG_MAX + 1) * sizeof(U32)); ++ weightTotal = 0; ++ { ++ U32 n; ++ for (n = 0; n < oSize; n++) { ++ if (huffWeight[n] >= HUF_TABLELOG_MAX) ++ return ERROR(corruption_detected); ++ rankStats[huffWeight[n]]++; ++ weightTotal += (1 << huffWeight[n]) >> 1; ++ } ++ } ++ if (weightTotal == 0) ++ return ERROR(corruption_detected); ++ ++ /* get last non-null symbol weight (implied, total must be 2^n) */ ++ { ++ U32 const tableLog = BIT_highbit32(weightTotal) + 1; ++ if (tableLog > HUF_TABLELOG_MAX) ++ return ERROR(corruption_detected); ++ *tableLogPtr = tableLog; ++ /* determine last weight */ ++ { ++ U32 const total = 1 << tableLog; ++ U32 const rest = total - weightTotal; ++ U32 const verif = 1 << BIT_highbit32(rest); ++ U32 const lastWeight = BIT_highbit32(rest) + 1; ++ if (verif != rest) ++ return ERROR(corruption_detected); /* last value must be a clean power of 2 */ ++ huffWeight[oSize] = (BYTE)lastWeight; ++ rankStats[lastWeight]++; ++ } ++ } ++ ++ /* check tree construction validity */ ++ if ((rankStats[1] < 2) || (rankStats[1] & 1)) ++ return ERROR(corruption_detected); /* by construction : at least 2 elts of rank 1, must be even */ ++ ++ /* results */ ++ *nbSymbolsPtr = (U32)(oSize + 1); ++ return iSize + 1; ++} +diff --git a/xen/common/zstd/error_private.h b/xen/common/zstd/error_private.h +new file mode 100644 +index 000000000000..d07bf3cb9b55 +--- /dev/null ++++ b/xen/common/zstd/error_private.h +@@ -0,0 +1,110 @@ ++/** ++ * Copyright (c) 2016-present, Yann Collet, Facebook, Inc. ++ * All rights reserved. ++ * ++ * This source code is licensed under the BSD-style license found in the ++ * LICENSE file in the root directory of https://github.com/facebook/zstd. ++ * An additional grant of patent rights can be found in the PATENTS file in the ++ * same directory. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). 
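HUF_readStats_wksp() above never stores the weight of the last symbol: the recorded weights must sum (each counted as 2^(w-1)) to just below a power of two, and the gap up to that power of two pins down the missing weight. A standalone arithmetic illustration, independent of the library:

    /* Recovering the implied last Huffman weight, as done above. */
    #include <stdio.h>

    static unsigned highbit32(unsigned v)   /* index of the highest set bit */
    {
        unsigned r = 0;
        while (v >>= 1)
            r++;
        return r;
    }

    int main(void)
    {
        unsigned w[4] = { 3, 3, 2, 2 };     /* weights read from the header */
        unsigned total = 0, i;

        for (i = 0; i < 4; i++)
            total += (1u << w[i]) >> 1;     /* 4 + 4 + 2 + 2 = 12 */

        {
            unsigned tableLog   = highbit32(total) + 1;      /* 4           */
            unsigned rest       = (1u << tableLog) - total;  /* 16 - 12 = 4 */
            unsigned lastWeight = highbit32(rest) + 1;       /* 3           */

            /* 'rest' must itself be a power of two -- exactly the
             * corruption check performed above. */
            printf("tableLog=%u lastWeight=%u\n", tableLog, lastWeight);
        }
        return 0;
    }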
++ */ ++ ++/* Note : this module is expected to remain private, do not expose it */ ++ ++#ifndef ERROR_H_MODULE ++#define ERROR_H_MODULE ++ ++/* **************************************** ++* Dependencies ++******************************************/ ++#include /* size_t */ ++ ++/** ++ * enum ZSTD_ErrorCode - zstd error codes ++ * ++ * Functions that return size_t can be checked for errors using ZSTD_isError() ++ * and the ZSTD_ErrorCode can be extracted using ZSTD_getErrorCode(). ++ */ ++typedef enum { ++ ZSTD_error_no_error, ++ ZSTD_error_GENERIC, ++ ZSTD_error_prefix_unknown, ++ ZSTD_error_version_unsupported, ++ ZSTD_error_parameter_unknown, ++ ZSTD_error_frameParameter_unsupported, ++ ZSTD_error_frameParameter_unsupportedBy32bits, ++ ZSTD_error_frameParameter_windowTooLarge, ++ ZSTD_error_compressionParameter_unsupported, ++ ZSTD_error_init_missing, ++ ZSTD_error_memory_allocation, ++ ZSTD_error_stage_wrong, ++ ZSTD_error_dstSize_tooSmall, ++ ZSTD_error_srcSize_wrong, ++ ZSTD_error_corruption_detected, ++ ZSTD_error_checksum_wrong, ++ ZSTD_error_tableLog_tooLarge, ++ ZSTD_error_maxSymbolValue_tooLarge, ++ ZSTD_error_maxSymbolValue_tooSmall, ++ ZSTD_error_dictionary_corrupted, ++ ZSTD_error_dictionary_wrong, ++ ZSTD_error_dictionaryCreation_failed, ++ ZSTD_error_maxCode ++} ZSTD_ErrorCode; ++ ++/* **************************************** ++* Compiler-specific ++******************************************/ ++#define ERR_STATIC static __attribute__((unused)) ++ ++/*-**************************************** ++* Customization (error_public.h) ++******************************************/ ++typedef ZSTD_ErrorCode ERR_enum; ++#define PREFIX(name) ZSTD_error_##name ++ ++/*-**************************************** ++* Error codes handling ++******************************************/ ++#define ERROR(name) ((size_t)-PREFIX(name)) ++ ++ERR_STATIC unsigned INIT ERR_isError(size_t code) { return (code > ERROR(maxCode)); } ++ ++ERR_STATIC ERR_enum INIT ERR_getErrorCode(size_t code) ++{ ++ if (!ERR_isError(code)) ++ return (ERR_enum)0; ++ return (ERR_enum)(0 - code); ++} ++ ++/** ++ * ZSTD_isError() - tells if a size_t function result is an error code ++ * @code: The function result to check for error. ++ * ++ * Return: Non-zero iff the code is an error. ++ */ ++static __attribute__((unused)) unsigned int INIT ZSTD_isError(size_t code) ++{ ++ return code > (size_t)-ZSTD_error_maxCode; ++} ++ ++/** ++ * ZSTD_getErrorCode() - translates an error function result to a ZSTD_ErrorCode ++ * @functionResult: The result of a function for which ZSTD_isError() is true. ++ * ++ * Return: The ZSTD_ErrorCode corresponding to the functionResult or 0 ++ * if the functionResult isn't an error. ++ */ ++static __attribute__((unused)) ZSTD_ErrorCode INIT ZSTD_getErrorCode( ++ size_t functionResult) ++{ ++ if (!ZSTD_isError(functionResult)) ++ return (ZSTD_ErrorCode)0; ++ return (ZSTD_ErrorCode)(0 - functionResult); ++} ++ ++#endif /* ERROR_H_MODULE */ +diff --git a/xen/common/zstd/fse.h b/xen/common/zstd/fse.h +new file mode 100644 +index 000000000000..b86717c34d0f +--- /dev/null ++++ b/xen/common/zstd/fse.h +@@ -0,0 +1,575 @@ ++/* ++ * FSE : Finite State Entropy codec ++ * Public Prototypes declaration ++ * Copyright (C) 2013-2016, Yann Collet. 
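error_private.h above encodes ZSTD_ErrorCode values as the topmost values of size_t, so a single size_t return can carry either a byte count or an error. A self-contained mirror of that encoding (the identifiers below are local to the snippet, not the header's):

    /* Standalone mirror of the ERROR()/ZSTD_isError() convention above. */
    #include <stddef.h>
    #include <stdio.h>

    typedef enum { E_no_error, E_GENERIC, E_srcSize_wrong, E_maxCode } err_t;

    #define ERR(name) ((size_t)-(E_##name))   /* wraps to SIZE_MAX - n + 1 */

    static unsigned is_error(size_t code)
    {
        return code > ERR(maxCode);           /* only the last few values */
    }

    static err_t get_error(size_t code)
    {
        return is_error(code) ? (err_t)(0 - code) : E_no_error;
    }

    int main(void)
    {
        size_t const r = ERR(srcSize_wrong);
        printf("is_error=%u code=%d\n", is_error(r), (int)get_error(r)); /* 1, 2 */
        return 0;
    }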
++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at : ++ * - Source repository : https://github.com/Cyan4973/FiniteStateEntropy ++ */ ++#ifndef FSE_H ++#define FSE_H ++ ++/*-***************************************** ++* Dependencies ++******************************************/ ++#include /* size_t, ptrdiff_t */ ++ ++/*-***************************************** ++* FSE_PUBLIC_API : control library symbols visibility ++******************************************/ ++#define FSE_PUBLIC_API ++ ++/*------ Version ------*/ ++#define FSE_VERSION_MAJOR 0 ++#define FSE_VERSION_MINOR 9 ++#define FSE_VERSION_RELEASE 0 ++ ++#define FSE_LIB_VERSION FSE_VERSION_MAJOR.FSE_VERSION_MINOR.FSE_VERSION_RELEASE ++#define FSE_QUOTE(str) #str ++#define FSE_EXPAND_AND_QUOTE(str) FSE_QUOTE(str) ++#define FSE_VERSION_STRING FSE_EXPAND_AND_QUOTE(FSE_LIB_VERSION) ++ ++#define FSE_VERSION_NUMBER (FSE_VERSION_MAJOR * 100 * 100 + FSE_VERSION_MINOR * 100 + FSE_VERSION_RELEASE) ++FSE_PUBLIC_API unsigned FSE_versionNumber(void); /**< library version number; to be used when checking dll version */ ++ ++/*-***************************************** ++* Tool functions ++******************************************/ ++FSE_PUBLIC_API size_t FSE_compressBound(size_t size); /* maximum compressed size */ ++ ++/* Error Management */ ++FSE_PUBLIC_API unsigned FSE_isError(size_t code); /* tells if a return value is an error code */ ++ ++/*-***************************************** ++* FSE detailed API ++******************************************/ ++/*! ++FSE_compress() does the following: ++1. count symbol occurrence from source[] into table count[] ++2. normalize counters so that sum(count[]) == Power_of_2 (2^tableLog) ++3. save normalized counters to memory buffer using writeNCount() ++4. 
build encoding table 'CTable' from normalized counters ++5. encode the data stream using encoding table 'CTable' ++ ++FSE_decompress() does the following: ++1. read normalized counters with readNCount() ++2. build decoding table 'DTable' from normalized counters ++3. decode the data stream using decoding table 'DTable' ++ ++The following API allows targeting specific sub-functions for advanced tasks. ++For example, it's possible to compress several blocks using the same 'CTable', ++or to save and provide normalized distribution using external method. ++*/ ++ ++/* *** COMPRESSION *** */ ++/*! FSE_optimalTableLog(): ++ dynamically downsize 'tableLog' when conditions are met. ++ It saves CPU time, by using smaller tables, while preserving or even improving compression ratio. ++ @return : recommended tableLog (necessarily <= 'maxTableLog') */ ++FSE_PUBLIC_API unsigned FSE_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue); ++ ++/*! FSE_normalizeCount(): ++ normalize counts so that sum(count[]) == Power_of_2 (2^tableLog) ++ 'normalizedCounter' is a table of short, of minimum size (maxSymbolValue+1). ++ @return : tableLog, ++ or an errorCode, which can be tested using FSE_isError() */ ++FSE_PUBLIC_API size_t FSE_normalizeCount(short *normalizedCounter, unsigned tableLog, const unsigned *count, size_t srcSize, unsigned maxSymbolValue); ++ ++/*! FSE_NCountWriteBound(): ++ Provides the maximum possible size of an FSE normalized table, given 'maxSymbolValue' and 'tableLog'. ++ Typically useful for allocation purpose. */ ++FSE_PUBLIC_API size_t FSE_NCountWriteBound(unsigned maxSymbolValue, unsigned tableLog); ++ ++/*! FSE_writeNCount(): ++ Compactly save 'normalizedCounter' into 'buffer'. ++ @return : size of the compressed table, ++ or an errorCode, which can be tested using FSE_isError(). */ ++FSE_PUBLIC_API size_t FSE_writeNCount(void *buffer, size_t bufferSize, const short *normalizedCounter, unsigned maxSymbolValue, unsigned tableLog); ++ ++/*! Constructor and Destructor of FSE_CTable. ++ Note that FSE_CTable size depends on 'tableLog' and 'maxSymbolValue' */ ++typedef unsigned FSE_CTable; /* don't allocate that. It's only meant to be more restrictive than void* */ ++ ++/*! FSE_compress_usingCTable(): ++ Compress `src` using `ct` into `dst` which must be already allocated. ++ @return : size of compressed data (<= `dstCapacity`), ++ or 0 if compressed data could not fit into `dst`, ++ or an errorCode, which can be tested using FSE_isError() */ ++FSE_PUBLIC_API size_t FSE_compress_usingCTable(void *dst, size_t dstCapacity, const void *src, size_t srcSize, const FSE_CTable *ct); ++ ++/*! ++Tutorial : ++---------- ++The first step is to count all symbols. FSE_count() does this job very fast. ++Result will be saved into 'count', a table of unsigned int, which must be already allocated, and have 'maxSymbolValuePtr[0]+1' cells. ++'src' is a table of bytes of size 'srcSize'. All values within 'src' MUST be <= maxSymbolValuePtr[0] ++maxSymbolValuePtr[0] will be updated, with its real value (necessarily <= original value) ++FSE_count() will return the number of occurrence of the most frequent symbol. ++This can be used to know if there is a single symbol within 'src', and to quickly evaluate its compressibility. ++If there is an error, the function will return an ErrorCode (which can be tested using FSE_isError()). ++ ++The next step is to normalize the frequencies. ++FSE_normalizeCount() will ensure that sum of frequencies is == 2 ^'tableLog'. 
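The compression-side flow listed above (count, normalize, save the normalized table, then encode with the CTable) maps directly onto the declarations in this header. A sketch of the first three steps follows; note that this import is only exercised for decompression in Xen, so whether the compression entry points are built at all is an assumption here, and the 256-entry stack buffers are purely illustrative.

    /* Illustrative use of the compression-side helpers declared above. */
    #include "fse.h"

    static size_t save_ncount(void *dst, size_t dstCapacity,
                              const unsigned char *src, size_t srcSize)
    {
        unsigned count[FSE_MAX_SYMBOL_VALUE + 1] = { 0 };
        short norm[FSE_MAX_SYMBOL_VALUE + 1];
        unsigned maxSymbol = FSE_MAX_SYMBOL_VALUE, tableLog;
        size_t i;

        for (i = 0; i < srcSize; i++)            /* 1. histogram the input */
            count[src[i]]++;
        while (maxSymbol && !count[maxSymbol])   /* shrink to the real alphabet */
            maxSymbol--;

        tableLog = FSE_optimalTableLog(0, srcSize, maxSymbol);  /* 0 = default */

        /* 2. normalize so the counts sum to 2^tableLog */
        if (FSE_isError(FSE_normalizeCount(norm, tableLog, count,
                                           srcSize, maxSymbol)))
            return (size_t)-1;                   /* simplified error return */

        /* 3. save the normalized counters compactly */
        return FSE_writeNCount(dst, dstCapacity, norm, maxSymbol, tableLog);
    }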
++It also guarantees a minimum of 1 to any Symbol with frequency >= 1. ++You can use 'tableLog'==0 to mean "use default tableLog value". ++If you are unsure of which tableLog value to use, you can ask FSE_optimalTableLog(), ++which will provide the optimal valid tableLog given sourceSize, maxSymbolValue, and a user-defined maximum (0 means "default"). ++ ++The result of FSE_normalizeCount() will be saved into a table, ++called 'normalizedCounter', which is a table of signed short. ++'normalizedCounter' must be already allocated, and have at least 'maxSymbolValue+1' cells. ++The return value is tableLog if everything proceeded as expected. ++It is 0 if there is a single symbol within distribution. ++If there is an error (ex: invalid tableLog value), the function will return an ErrorCode (which can be tested using FSE_isError()). ++ ++'normalizedCounter' can be saved in a compact manner to a memory area using FSE_writeNCount(). ++'buffer' must be already allocated. ++For guaranteed success, buffer size must be at least FSE_headerBound(). ++The result of the function is the number of bytes written into 'buffer'. ++If there is an error, the function will return an ErrorCode (which can be tested using FSE_isError(); ex : buffer size too small). ++ ++'normalizedCounter' can then be used to create the compression table 'CTable'. ++The space required by 'CTable' must be already allocated, using FSE_createCTable(). ++You can then use FSE_buildCTable() to fill 'CTable'. ++If there is an error, both functions will return an ErrorCode (which can be tested using FSE_isError()). ++ ++'CTable' can then be used to compress 'src', with FSE_compress_usingCTable(). ++Similar to FSE_count(), the convention is that 'src' is assumed to be a table of char of size 'srcSize' ++The function returns the size of compressed data (without header), necessarily <= `dstCapacity`. ++If it returns '0', compressed data could not fit into 'dst'. ++If there is an error, the function will return an ErrorCode (which can be tested using FSE_isError()). ++*/ ++ ++/* *** DECOMPRESSION *** */ ++ ++/*! FSE_readNCount(): ++ Read compactly saved 'normalizedCounter' from 'rBuffer'. ++ @return : size read from 'rBuffer', ++ or an errorCode, which can be tested using FSE_isError(). ++ maxSymbolValuePtr[0] and tableLogPtr[0] will also be updated with their respective values */ ++FSE_PUBLIC_API size_t FSE_readNCount(short *normalizedCounter, unsigned *maxSymbolValuePtr, unsigned *tableLogPtr, const void *rBuffer, size_t rBuffSize); ++ ++/*! Constructor and Destructor of FSE_DTable. ++ Note that its size depends on 'tableLog' */ ++typedef unsigned FSE_DTable; /* don't allocate that. It's just a way to be more restrictive than void* */ ++ ++/*! FSE_buildDTable(): ++ Builds 'dt', which must be already allocated, using FSE_createDTable(). ++ return : 0, or an errorCode, which can be tested using FSE_isError() */ ++FSE_PUBLIC_API size_t FSE_buildDTable_wksp(FSE_DTable *dt, const short *normalizedCounter, unsigned maxSymbolValue, unsigned tableLog, void *workspace, size_t workspaceSize); ++ ++/*! FSE_decompress_usingDTable(): ++ Decompress compressed source `cSrc` of size `cSrcSize` using `dt` ++ into `dst` which must be already allocated. ++ @return : size of regenerated data (necessarily <= `dstCapacity`), ++ or an errorCode, which can be tested using FSE_isError() */ ++FSE_PUBLIC_API size_t FSE_decompress_usingDTable(void *dst, size_t dstCapacity, const void *cSrc, size_t cSrcSize, const FSE_DTable *dt); ++ ++/*! 
++Tutorial : ++---------- ++(Note : these functions only decompress FSE-compressed blocks. ++ If block is uncompressed, use memcpy() instead ++ If block is a single repeated byte, use memset() instead ) ++ ++The first step is to obtain the normalized frequencies of symbols. ++This can be performed by FSE_readNCount() if it was saved using FSE_writeNCount(). ++'normalizedCounter' must be already allocated, and have at least 'maxSymbolValuePtr[0]+1' cells of signed short. ++In practice, that means it's necessary to know 'maxSymbolValue' beforehand, ++or size the table to handle worst case situations (typically 256). ++FSE_readNCount() will provide 'tableLog' and 'maxSymbolValue'. ++The result of FSE_readNCount() is the number of bytes read from 'rBuffer'. ++Note that 'rBufferSize' must be at least 4 bytes, even if useful information is less than that. ++If there is an error, the function will return an error code, which can be tested using FSE_isError(). ++ ++The next step is to build the decompression tables 'FSE_DTable' from 'normalizedCounter'. ++This is performed by the function FSE_buildDTable(). ++The space required by 'FSE_DTable' must be already allocated using FSE_createDTable(). ++If there is an error, the function will return an error code, which can be tested using FSE_isError(). ++ ++`FSE_DTable` can then be used to decompress `cSrc`, with FSE_decompress_usingDTable(). ++`cSrcSize` must be strictly correct, otherwise decompression will fail. ++FSE_decompress_usingDTable() result will tell how many bytes were regenerated (<=`dstCapacity`). ++If there is an error, the function will return an error code, which can be tested using FSE_isError(). (ex: dst buffer too small) ++*/ ++ ++/* *** Dependency *** */ ++#include "bitstream.h" ++ ++/* ***************************************** ++* Static allocation ++*******************************************/ ++/* FSE buffer bounds */ ++#define FSE_NCOUNTBOUND 512 ++#define FSE_BLOCKBOUND(size) (size + (size >> 7)) ++#define FSE_COMPRESSBOUND(size) (FSE_NCOUNTBOUND + FSE_BLOCKBOUND(size)) /* Macro version, useful for static allocation */ ++ ++/* It is possible to statically allocate FSE CTable/DTable as a table of FSE_CTable/FSE_DTable using below macros */ ++#define FSE_CTABLE_SIZE_U32(maxTableLog, maxSymbolValue) (1 + (1 << (maxTableLog - 1)) + ((maxSymbolValue + 1) * 2)) ++#define FSE_DTABLE_SIZE_U32(maxTableLog) (1 + (1 << maxTableLog)) ++ ++/* ***************************************** ++* FSE advanced API ++*******************************************/ ++/* FSE_count_wksp() : ++ * Same as FSE_count(), but using an externally provided scratch buffer. ++ * `workSpace` size must be table of >= `1024` unsigned ++ */ ++size_t FSE_count_wksp(unsigned *count, unsigned *maxSymbolValuePtr, const void *source, size_t sourceSize, unsigned *workSpace); ++ ++/* FSE_countFast_wksp() : ++ * Same as FSE_countFast(), but using an externally provided scratch buffer. ++ * `workSpace` must be a table of minimum `1024` unsigned ++ */ ++size_t FSE_countFast_wksp(unsigned *count, unsigned *maxSymbolValuePtr, const void *src, size_t srcSize, unsigned *workSpace); ++ ++/*! FSE_count_simple ++ * Same as FSE_countFast(), but does not use any additional memory (not even on stack). ++ * This function is unsafe, and will segfault if any value within `src` is `> *maxSymbolValuePtr` (presuming it's also the size of `count`). 
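The decompression tutorial above reduces to three calls: read the normalized counters, build the DTable, decode. A hedged composition of the declared entry points (the stack buffers and simplified error returns are illustrative; within this patch the equivalent wiring lives in FSE_decompress_wksp() in fse_decompress.c, using a caller-provided workspace):

    /* Sketch of the readNCount -> buildDTable -> decompress flow above. */
    #include "fse.h"

    static size_t fse_decode_block(void *dst, size_t dstCapacity,
                                   const void *cSrc, size_t cSrcSize)
    {
        short ncount[FSE_MAX_SYMBOL_VALUE + 1];
        unsigned maxSymbol = FSE_MAX_SYMBOL_VALUE, tableLog;
        FSE_DTable dt[FSE_DTABLE_SIZE_U32(FSE_MAX_TABLELOG)];  /* ~16 KiB */
        unsigned short wksp[FSE_MAX_SYMBOL_VALUE + 1];         /* build scratch */

        /* 1. read the compactly saved normalized counters */
        size_t const hSize = FSE_readNCount(ncount, &maxSymbol, &tableLog,
                                            cSrc, cSrcSize);
        if (FSE_isError(hSize) || tableLog > FSE_MAX_TABLELOG)
            return (size_t)-1;                                 /* simplified */

        /* 2. build the decoding table from the counters */
        if (FSE_isError(FSE_buildDTable_wksp(dt, ncount, maxSymbol, tableLog,
                                             wksp, sizeof(wksp))))
            return (size_t)-1;

        /* 3. decode the payload that follows the saved table */
        return FSE_decompress_usingDTable(dst, dstCapacity,
                                          (const char *)cSrc + hSize,
                                          cSrcSize - hSize, dt);
    }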
++*/ ++size_t FSE_count_simple(unsigned *count, unsigned *maxSymbolValuePtr, const void *src, size_t srcSize); ++ ++unsigned FSE_optimalTableLog_internal(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue, unsigned minus); ++/**< same as FSE_optimalTableLog(), which used `minus==2` */ ++ ++size_t FSE_buildCTable_raw(FSE_CTable *ct, unsigned nbBits); ++/**< build a fake FSE_CTable, designed for a flat distribution, where each symbol uses nbBits */ ++ ++size_t FSE_buildCTable_rle(FSE_CTable *ct, unsigned char symbolValue); ++/**< build a fake FSE_CTable, designed to compress always the same symbolValue */ ++ ++/* FSE_buildCTable_wksp() : ++ * Same as FSE_buildCTable(), but using an externally allocated scratch buffer (`workSpace`). ++ * `wkspSize` must be >= `(1<= BIT_DStream_completed ++ ++When it's done, verify decompression is fully completed, by checking both DStream and the relevant states. ++Checking if DStream has reached its end is performed by : ++ BIT_endOfDStream(&DStream); ++Check also the states. There might be some symbols left there, if some high probability ones (>50%) are possible. ++ FSE_endOfDState(&DState); ++*/ ++ ++/* ***************************************** ++* FSE unsafe API ++*******************************************/ ++static unsigned char FSE_decodeSymbolFast(FSE_DState_t *DStatePtr, BIT_DStream_t *bitD); ++/* faster, but works only if nbBits is always >= 1 (otherwise, result will be corrupted) */ ++ ++/* ***************************************** ++* Implementation of inlined functions ++*******************************************/ ++typedef struct { ++ int deltaFindState; ++ U32 deltaNbBits; ++} FSE_symbolCompressionTransform; /* total 8 bytes */ ++ ++ZSTD_STATIC void FSE_initCState(FSE_CState_t *statePtr, const FSE_CTable *ct) ++{ ++ const void *ptr = ct; ++ const U16 *u16ptr = (const U16 *)ptr; ++ const U32 tableLog = ZSTD_read16(ptr); ++ statePtr->value = (ptrdiff_t)1 << tableLog; ++ statePtr->stateTable = u16ptr + 2; ++ statePtr->symbolTT = ((const U32 *)ct + 1 + (tableLog ? (1 << (tableLog - 1)) : 1)); ++ statePtr->stateLog = tableLog; ++} ++ ++/*! 
FSE_initCState2() : ++* Same as FSE_initCState(), but the first symbol to include (which will be the last to be read) ++* uses the smallest state value possible, saving the cost of this symbol */ ++ZSTD_STATIC void FSE_initCState2(FSE_CState_t *statePtr, const FSE_CTable *ct, U32 symbol) ++{ ++ FSE_initCState(statePtr, ct); ++ { ++ const FSE_symbolCompressionTransform symbolTT = ((const FSE_symbolCompressionTransform *)(statePtr->symbolTT))[symbol]; ++ const U16 *stateTable = (const U16 *)(statePtr->stateTable); ++ U32 nbBitsOut = (U32)((symbolTT.deltaNbBits + (1 << 15)) >> 16); ++ statePtr->value = (nbBitsOut << 16) - symbolTT.deltaNbBits; ++ statePtr->value = stateTable[(statePtr->value >> nbBitsOut) + symbolTT.deltaFindState]; ++ } ++} ++ ++ZSTD_STATIC void FSE_encodeSymbol(BIT_CStream_t *bitC, FSE_CState_t *statePtr, U32 symbol) ++{ ++ const FSE_symbolCompressionTransform symbolTT = ((const FSE_symbolCompressionTransform *)(statePtr->symbolTT))[symbol]; ++ const U16 *const stateTable = (const U16 *)(statePtr->stateTable); ++ U32 nbBitsOut = (U32)((statePtr->value + symbolTT.deltaNbBits) >> 16); ++ BIT_addBits(bitC, statePtr->value, nbBitsOut); ++ statePtr->value = stateTable[(statePtr->value >> nbBitsOut) + symbolTT.deltaFindState]; ++} ++ ++ZSTD_STATIC void FSE_flushCState(BIT_CStream_t *bitC, const FSE_CState_t *statePtr) ++{ ++ BIT_addBits(bitC, statePtr->value, statePtr->stateLog); ++ BIT_flushBits(bitC); ++} ++ ++/* ====== Decompression ====== */ ++ ++typedef struct { ++ U16 tableLog; ++ U16 fastMode; ++} FSE_DTableHeader; /* sizeof U32 */ ++ ++typedef struct { ++ unsigned short newState; ++ unsigned char symbol; ++ unsigned char nbBits; ++} FSE_decode_t; /* size == U32 */ ++ ++ZSTD_STATIC void FSE_initDState(FSE_DState_t *DStatePtr, BIT_DStream_t *bitD, const FSE_DTable *dt) ++{ ++ const void *ptr = dt; ++ const FSE_DTableHeader *const DTableH = (const FSE_DTableHeader *)ptr; ++ DStatePtr->state = BIT_readBits(bitD, DTableH->tableLog); ++ BIT_reloadDStream(bitD); ++ DStatePtr->table = dt + 1; ++} ++ ++ZSTD_STATIC BYTE FSE_peekSymbol(const FSE_DState_t *DStatePtr) ++{ ++ FSE_decode_t const DInfo = ((const FSE_decode_t *)(DStatePtr->table))[DStatePtr->state]; ++ return DInfo.symbol; ++} ++ ++ZSTD_STATIC void FSE_updateState(FSE_DState_t *DStatePtr, BIT_DStream_t *bitD) ++{ ++ FSE_decode_t const DInfo = ((const FSE_decode_t *)(DStatePtr->table))[DStatePtr->state]; ++ U32 const nbBits = DInfo.nbBits; ++ size_t const lowBits = BIT_readBits(bitD, nbBits); ++ DStatePtr->state = DInfo.newState + lowBits; ++} ++ ++ZSTD_STATIC BYTE FSE_decodeSymbol(FSE_DState_t *DStatePtr, BIT_DStream_t *bitD) ++{ ++ FSE_decode_t const DInfo = ((const FSE_decode_t *)(DStatePtr->table))[DStatePtr->state]; ++ U32 const nbBits = DInfo.nbBits; ++ BYTE const symbol = DInfo.symbol; ++ size_t const lowBits = BIT_readBits(bitD, nbBits); ++ ++ DStatePtr->state = DInfo.newState + lowBits; ++ return symbol; ++} ++ ++/*! 
FSE_decodeSymbolFast() : ++ unsafe, only works if no symbol has a probability > 50% */ ++ZSTD_STATIC BYTE FSE_decodeSymbolFast(FSE_DState_t *DStatePtr, BIT_DStream_t *bitD) ++{ ++ FSE_decode_t const DInfo = ((const FSE_decode_t *)(DStatePtr->table))[DStatePtr->state]; ++ U32 const nbBits = DInfo.nbBits; ++ BYTE const symbol = DInfo.symbol; ++ size_t const lowBits = BIT_readBitsFast(bitD, nbBits); ++ ++ DStatePtr->state = DInfo.newState + lowBits; ++ return symbol; ++} ++ ++ZSTD_STATIC unsigned FSE_endOfDState(const FSE_DState_t *DStatePtr) { return DStatePtr->state == 0; } ++ ++/* ************************************************************** ++* Tuning parameters ++****************************************************************/ ++/*!MEMORY_USAGE : ++* Memory usage formula : N->2^N Bytes (examples : 10 -> 1KB; 12 -> 4KB ; 16 -> 64KB; 20 -> 1MB; etc.) ++* Increasing memory usage improves compression ratio ++* Reduced memory usage can improve speed, due to cache effect ++* Recommended max value is 14, for 16KB, which nicely fits into Intel x86 L1 cache */ ++#ifndef FSE_MAX_MEMORY_USAGE ++#define FSE_MAX_MEMORY_USAGE 14 ++#endif ++#ifndef FSE_DEFAULT_MEMORY_USAGE ++#define FSE_DEFAULT_MEMORY_USAGE 13 ++#endif ++ ++/*!FSE_MAX_SYMBOL_VALUE : ++* Maximum symbol value authorized. ++* Required for proper stack allocation */ ++#ifndef FSE_MAX_SYMBOL_VALUE ++#define FSE_MAX_SYMBOL_VALUE 255 ++#endif ++ ++/* ************************************************************** ++* template functions type & suffix ++****************************************************************/ ++#define FSE_FUNCTION_TYPE BYTE ++#define FSE_FUNCTION_EXTENSION ++#define FSE_DECODE_TYPE FSE_decode_t ++ ++/* *************************************************************** ++* Constants ++*****************************************************************/ ++#define FSE_MAX_TABLELOG (FSE_MAX_MEMORY_USAGE - 2) ++#define FSE_MAX_TABLESIZE (1U << FSE_MAX_TABLELOG) ++#define FSE_MAXTABLESIZE_MASK (FSE_MAX_TABLESIZE - 1) ++#define FSE_DEFAULT_TABLELOG (FSE_DEFAULT_MEMORY_USAGE - 2) ++#define FSE_MIN_TABLELOG 5 ++ ++#define FSE_TABLELOG_ABSOLUTE_MAX 15 ++#if FSE_MAX_TABLELOG > FSE_TABLELOG_ABSOLUTE_MAX ++#error "FSE_MAX_TABLELOG > FSE_TABLELOG_ABSOLUTE_MAX is not supported" ++#endif ++ ++#define FSE_TABLESTEP(tableSize) ((tableSize >> 1) + (tableSize >> 3) + 3) ++ ++#endif /* FSE_H */ +diff --git a/xen/common/zstd/fse_decompress.c b/xen/common/zstd/fse_decompress.c +new file mode 100644 +index 000000000000..cc51206df614 +--- /dev/null ++++ b/xen/common/zstd/fse_decompress.c +@@ -0,0 +1,324 @@ ++/* ++ * FSE : Finite State Entropy decoder ++ * Copyright (C) 2013-2015, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. 
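The tuning macros at the end of fse.h above pin the decoder's footprint: FSE_MAX_MEMORY_USAGE = 14 caps the working memory at 2^14 bytes, which gives a maximum tableLog of 12 and a DTable of 4097 32-bit cells. A standalone check of that arithmetic (the values are recomputed here, not read from the build):

    /* Footprint arithmetic for the FSE tuning macros above. */
    #include <stdio.h>

    #define MAX_MEMORY_USAGE 14                     /* as in fse.h            */
    #define MAX_TABLELOG (MAX_MEMORY_USAGE - 2)     /* 12                     */
    #define DTABLE_SIZE_U32(tl) (1 + (1 << (tl)))   /* header + 2^tl cells    */

    int main(void)
    {
        printf("max tableLog   : %d\n", MAX_TABLELOG);                   /* 12    */
        printf("DTable entries : %d\n", DTABLE_SIZE_U32(MAX_TABLELOG));  /* 4097  */
        printf("DTable bytes   : %zu\n",
               DTABLE_SIZE_U32(MAX_TABLELOG) * sizeof(unsigned));        /* 16388 */
        return 0;
    }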
++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at : ++ * - Source repository : https://github.com/Cyan4973/FiniteStateEntropy ++ */ ++ ++/* ************************************************************** ++* Compiler specifics ++****************************************************************/ ++#define FORCE_INLINE static always_inline ++ ++/* ************************************************************** ++* Includes ++****************************************************************/ ++#include "bitstream.h" ++#include "fse.h" ++#include "zstd_internal.h" ++#include ++#include /* memcpy, memset */ ++ ++/* ************************************************************** ++* Error Management ++****************************************************************/ ++#define FSE_isError ERR_isError ++#define FSE_STATIC_ASSERT(c) \ ++ { \ ++ enum { FSE_static_assert = 1 / (int)(!!(c)) }; \ ++ } /* use only *after* variable declarations */ ++ ++/* ************************************************************** ++* Templates ++****************************************************************/ ++/* ++ designed to be included ++ for type-specific functions (template emulation in C) ++ Objective is to write these functions only once, for improved maintenance ++*/ ++ ++/* safety checks */ ++#ifndef FSE_FUNCTION_EXTENSION ++#error "FSE_FUNCTION_EXTENSION must be defined" ++#endif ++#ifndef FSE_FUNCTION_TYPE ++#error "FSE_FUNCTION_TYPE must be defined" ++#endif ++ ++/* Function names */ ++#define FSE_CAT(X, Y) X##Y ++#define FSE_FUNCTION_NAME(X, Y) FSE_CAT(X, Y) ++#define FSE_TYPE_NAME(X, Y) FSE_CAT(X, Y) ++ ++/* Function templates */ ++ ++size_t INIT FSE_buildDTable_wksp(FSE_DTable *dt, const short *normalizedCounter, unsigned maxSymbolValue, unsigned tableLog, void *workspace, size_t workspaceSize) ++{ ++ void *const tdPtr = dt + 1; /* because *dt is unsigned, 32-bits aligned on 32-bits */ ++ FSE_DECODE_TYPE *const tableDecode = (FSE_DECODE_TYPE *)(tdPtr); ++ U16 *symbolNext = (U16 *)workspace; ++ ++ U32 const maxSV1 = maxSymbolValue + 1; ++ U32 const tableSize = 1 << tableLog; ++ U32 highThreshold = tableSize - 1; ++ ++ /* Sanity Checks */ ++ if (workspaceSize < sizeof(U16) * (FSE_MAX_SYMBOL_VALUE + 1)) ++ return ERROR(tableLog_tooLarge); ++ if (maxSymbolValue > FSE_MAX_SYMBOL_VALUE) ++ return ERROR(maxSymbolValue_tooLarge); ++ if (tableLog > FSE_MAX_TABLELOG) ++ return 
ERROR(tableLog_tooLarge); ++ ++ /* Init, lay down lowprob symbols */ ++ { ++ FSE_DTableHeader DTableH; ++ DTableH.tableLog = (U16)tableLog; ++ DTableH.fastMode = 1; ++ { ++ S16 const largeLimit = (S16)(1 << (tableLog - 1)); ++ U32 s; ++ for (s = 0; s < maxSV1; s++) { ++ if (normalizedCounter[s] == -1) { ++ tableDecode[highThreshold--].symbol = (FSE_FUNCTION_TYPE)s; ++ symbolNext[s] = 1; ++ } else { ++ if (normalizedCounter[s] >= largeLimit) ++ DTableH.fastMode = 0; ++ symbolNext[s] = normalizedCounter[s]; ++ } ++ } ++ } ++ memcpy(dt, &DTableH, sizeof(DTableH)); ++ } ++ ++ /* Spread symbols */ ++ { ++ U32 const tableMask = tableSize - 1; ++ U32 const step = FSE_TABLESTEP(tableSize); ++ U32 s, position = 0; ++ for (s = 0; s < maxSV1; s++) { ++ int i; ++ for (i = 0; i < normalizedCounter[s]; i++) { ++ tableDecode[position].symbol = (FSE_FUNCTION_TYPE)s; ++ position = (position + step) & tableMask; ++ while (position > highThreshold) ++ position = (position + step) & tableMask; /* lowprob area */ ++ } ++ } ++ if (position != 0) ++ return ERROR(GENERIC); /* position must reach all cells once, otherwise normalizedCounter is incorrect */ ++ } ++ ++ /* Build Decoding table */ ++ { ++ U32 u; ++ for (u = 0; u < tableSize; u++) { ++ FSE_FUNCTION_TYPE const symbol = (FSE_FUNCTION_TYPE)(tableDecode[u].symbol); ++ U16 nextState = symbolNext[symbol]++; ++ tableDecode[u].nbBits = (BYTE)(tableLog - BIT_highbit32((U32)nextState)); ++ tableDecode[u].newState = (U16)((nextState << tableDecode[u].nbBits) - tableSize); ++ } ++ } ++ ++ return 0; ++} ++ ++/*-******************************************************* ++* Decompression (Byte symbols) ++*********************************************************/ ++size_t INIT FSE_buildDTable_rle(FSE_DTable *dt, BYTE symbolValue) ++{ ++ void *ptr = dt; ++ FSE_DTableHeader *const DTableH = (FSE_DTableHeader *)ptr; ++ void *dPtr = dt + 1; ++ FSE_decode_t *const cell = (FSE_decode_t *)dPtr; ++ ++ DTableH->tableLog = 0; ++ DTableH->fastMode = 0; ++ ++ cell->newState = 0; ++ cell->symbol = symbolValue; ++ cell->nbBits = 0; ++ ++ return 0; ++} ++ ++size_t INIT FSE_buildDTable_raw(FSE_DTable *dt, unsigned nbBits) ++{ ++ void *ptr = dt; ++ FSE_DTableHeader *const DTableH = (FSE_DTableHeader *)ptr; ++ void *dPtr = dt + 1; ++ FSE_decode_t *const dinfo = (FSE_decode_t *)dPtr; ++ const unsigned tableSize = 1 << nbBits; ++ const unsigned tableMask = tableSize - 1; ++ const unsigned maxSV1 = tableMask + 1; ++ unsigned s; ++ ++ /* Sanity checks */ ++ if (nbBits < 1) ++ return ERROR(GENERIC); /* min size */ ++ ++ /* Build Decoding Table */ ++ DTableH->tableLog = (U16)nbBits; ++ DTableH->fastMode = 1; ++ for (s = 0; s < maxSV1; s++) { ++ dinfo[s].newState = 0; ++ dinfo[s].symbol = (BYTE)s; ++ dinfo[s].nbBits = (BYTE)nbBits; ++ } ++ ++ return 0; ++} ++ ++FORCE_INLINE size_t FSE_decompress_usingDTable_generic(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const FSE_DTable *dt, ++ const unsigned fast) ++{ ++ BYTE *const ostart = (BYTE *)dst; ++ BYTE *op = ostart; ++ BYTE *const omax = op + maxDstSize; ++ BYTE *const olimit = omax - 3; ++ ++ BIT_DStream_t bitD; ++ FSE_DState_t state1; ++ FSE_DState_t state2; ++ ++ /* Init */ ++ CHECK_F(BIT_initDStream(&bitD, cSrc, cSrcSize)); ++ ++ FSE_initDState(&state1, &bitD, dt); ++ FSE_initDState(&state2, &bitD, dt); ++ ++#define FSE_GETSYMBOL(statePtr) fast ? 
FSE_decodeSymbolFast(statePtr, &bitD) : FSE_decodeSymbol(statePtr, &bitD) ++ ++ /* 4 symbols per loop */ ++ for (; (BIT_reloadDStream(&bitD) == BIT_DStream_unfinished) & (op < olimit); op += 4) { ++ op[0] = FSE_GETSYMBOL(&state1); ++ ++ if (FSE_MAX_TABLELOG * 2 + 7 > sizeof(bitD.bitContainer) * 8) /* This test must be static */ ++ BIT_reloadDStream(&bitD); ++ ++ op[1] = FSE_GETSYMBOL(&state2); ++ ++ if (FSE_MAX_TABLELOG * 4 + 7 > sizeof(bitD.bitContainer) * 8) /* This test must be static */ ++ { ++ if (BIT_reloadDStream(&bitD) > BIT_DStream_unfinished) { ++ op += 2; ++ break; ++ } ++ } ++ ++ op[2] = FSE_GETSYMBOL(&state1); ++ ++ if (FSE_MAX_TABLELOG * 2 + 7 > sizeof(bitD.bitContainer) * 8) /* This test must be static */ ++ BIT_reloadDStream(&bitD); ++ ++ op[3] = FSE_GETSYMBOL(&state2); ++ } ++ ++ /* tail */ ++ /* note : BIT_reloadDStream(&bitD) >= FSE_DStream_partiallyFilled; Ends at exactly BIT_DStream_completed */ ++ while (1) { ++ if (op > (omax - 2)) ++ return ERROR(dstSize_tooSmall); ++ *op++ = FSE_GETSYMBOL(&state1); ++ if (BIT_reloadDStream(&bitD) == BIT_DStream_overflow) { ++ *op++ = FSE_GETSYMBOL(&state2); ++ break; ++ } ++ ++ if (op > (omax - 2)) ++ return ERROR(dstSize_tooSmall); ++ *op++ = FSE_GETSYMBOL(&state2); ++ if (BIT_reloadDStream(&bitD) == BIT_DStream_overflow) { ++ *op++ = FSE_GETSYMBOL(&state1); ++ break; ++ } ++ } ++ ++ return op - ostart; ++} ++ ++size_t INIT FSE_decompress_usingDTable(void *dst, size_t originalSize, const void *cSrc, size_t cSrcSize, const FSE_DTable *dt) ++{ ++ const void *ptr = dt; ++ const FSE_DTableHeader *DTableH = (const FSE_DTableHeader *)ptr; ++ const U32 fastMode = DTableH->fastMode; ++ ++ /* select fast mode (static) */ ++ if (fastMode) ++ return FSE_decompress_usingDTable_generic(dst, originalSize, cSrc, cSrcSize, dt, 1); ++ return FSE_decompress_usingDTable_generic(dst, originalSize, cSrc, cSrcSize, dt, 0); ++} ++ ++size_t INIT FSE_decompress_wksp(void *dst, size_t dstCapacity, const void *cSrc, size_t cSrcSize, unsigned maxLog, void *workspace, size_t workspaceSize) ++{ ++ const BYTE *const istart = (const BYTE *)cSrc; ++ const BYTE *ip = istart; ++ unsigned tableLog; ++ unsigned maxSymbolValue = FSE_MAX_SYMBOL_VALUE; ++ size_t NCountLength; ++ ++ FSE_DTable *dt; ++ short *counting; ++ size_t spaceUsed32 = 0; ++ ++ FSE_STATIC_ASSERT(sizeof(FSE_DTable) == sizeof(U32)); ++ ++ dt = (FSE_DTable *)((U32 *)workspace + spaceUsed32); ++ spaceUsed32 += FSE_DTABLE_SIZE_U32(maxLog); ++ counting = (short *)((U32 *)workspace + spaceUsed32); ++ spaceUsed32 += ALIGN(sizeof(short) * (FSE_MAX_SYMBOL_VALUE + 1), sizeof(U32)) >> 2; ++ ++ if ((spaceUsed32 << 2) > workspaceSize) ++ return ERROR(tableLog_tooLarge); ++ workspace = (U32 *)workspace + spaceUsed32; ++ workspaceSize -= (spaceUsed32 << 2); ++ ++ /* normal FSE decoding mode */ ++ NCountLength = FSE_readNCount(counting, &maxSymbolValue, &tableLog, istart, cSrcSize); ++ if (FSE_isError(NCountLength)) ++ return NCountLength; ++ // if (NCountLength >= cSrcSize) return ERROR(srcSize_wrong); /* too small input size; supposed to be already checked in NCountLength, only remaining ++ // case : NCountLength==cSrcSize */ ++ if (tableLog > maxLog) ++ return ERROR(tableLog_tooLarge); ++ ip += NCountLength; ++ cSrcSize -= NCountLength; ++ ++ CHECK_F(FSE_buildDTable_wksp(dt, counting, maxSymbolValue, tableLog, workspace, workspaceSize)); ++ ++ return FSE_decompress_usingDTable(dst, dstCapacity, ip, cSrcSize, dt); /* always return, even if it is an error code */ ++} +diff --git a/xen/common/zstd/huf.h 
b/xen/common/zstd/huf.h +new file mode 100644 +index 000000000000..a9d522c7bb7b +--- /dev/null ++++ b/xen/common/zstd/huf.h +@@ -0,0 +1,212 @@ ++/* ++ * Huffman coder, part of New Generation Entropy library ++ * header file ++ * Copyright (C) 2013-2016, Yann Collet. ++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at : ++ * - Source repository : https://github.com/Cyan4973/FiniteStateEntropy ++ */ ++#ifndef HUF_H_298734234 ++#define HUF_H_298734234 ++ ++/* *** Dependencies *** */ ++#include /* size_t */ ++ ++/* *** Tool functions *** */ ++#define HUF_BLOCKSIZE_MAX (128 * 1024) /**< maximum input size for a single block compressed with HUF_compress */ ++size_t HUF_compressBound(size_t size); /**< maximum compressed size (worst case) */ ++ ++/* Error Management */ ++unsigned HUF_isError(size_t code); /**< tells if a return value is an error code */ ++ ++/* *** Advanced function *** */ ++ ++/** HUF_compress4X_wksp() : ++* Same as HUF_compress2(), but uses externally allocated `workSpace`, which must be a table of >= 1024 unsigned */ ++size_t HUF_compress4X_wksp(void *dst, size_t dstSize, const void *src, size_t srcSize, unsigned maxSymbolValue, unsigned tableLog, void *workSpace, ++ size_t wkspSize); /**< `workSpace` must be a table of at least HUF_COMPRESS_WORKSPACE_SIZE_U32 unsigned */ ++ ++/* *** Dependencies *** */ ++#include "mem.h" /* U32 */ ++ ++/* *** Constants *** */ ++#define HUF_TABLELOG_MAX 12 /* max configured tableLog (for static allocation); can be modified up to HUF_ABSOLUTEMAX_TABLELOG */ ++#define HUF_TABLELOG_DEFAULT 11 /* tableLog by default, when not specified */ ++#define HUF_SYMBOLVALUE_MAX 255 ++ ++#define HUF_TABLELOG_ABSOLUTEMAX 15 /* absolute limit of HUF_MAX_TABLELOG. 
Beyond that value, code does not work */ ++#if (HUF_TABLELOG_MAX > HUF_TABLELOG_ABSOLUTEMAX) ++#error "HUF_TABLELOG_MAX is too large !" ++#endif ++ ++/* **************************************** ++* Static allocation ++******************************************/ ++/* HUF buffer bounds */ ++#define HUF_CTABLEBOUND 129 ++#define HUF_BLOCKBOUND(size) (size + (size >> 8) + 8) /* only true if incompressible pre-filtered with fast heuristic */ ++#define HUF_COMPRESSBOUND(size) (HUF_CTABLEBOUND + HUF_BLOCKBOUND(size)) /* Macro version, useful for static allocation */ ++ ++/* static allocation of HUF's Compression Table */ ++#define HUF_CREATE_STATIC_CTABLE(name, maxSymbolValue) \ ++ U32 name##hb[maxSymbolValue + 1]; \ ++ void *name##hv = &(name##hb); \ ++ HUF_CElt *name = (HUF_CElt *)(name##hv) /* no final ; */ ++ ++/* static allocation of HUF's DTable */ ++typedef U32 HUF_DTable; ++#define HUF_DTABLE_SIZE(maxTableLog) (1 + (1 << (maxTableLog))) ++#define HUF_CREATE_STATIC_DTABLEX2(DTable, maxTableLog) HUF_DTable DTable[HUF_DTABLE_SIZE((maxTableLog)-1)] = {((U32)((maxTableLog)-1) * 0x01000001)} ++#define HUF_CREATE_STATIC_DTABLEX4(DTable, maxTableLog) HUF_DTable DTable[HUF_DTABLE_SIZE(maxTableLog)] = {((U32)(maxTableLog)*0x01000001)} ++ ++/* The workspace must have alignment at least 4 and be at least this large */ ++#define HUF_COMPRESS_WORKSPACE_SIZE (6 << 10) ++#define HUF_COMPRESS_WORKSPACE_SIZE_U32 (HUF_COMPRESS_WORKSPACE_SIZE / sizeof(U32)) ++ ++/* The workspace must have alignment at least 4 and be at least this large */ ++#define HUF_DECOMPRESS_WORKSPACE_SIZE (3 << 10) ++#define HUF_DECOMPRESS_WORKSPACE_SIZE_U32 (HUF_DECOMPRESS_WORKSPACE_SIZE / sizeof(U32)) ++ ++/* **************************************** ++* Advanced decompression functions ++******************************************/ ++size_t HUF_decompress4X_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize); /**< decodes RLE and uncompressed */ ++size_t HUF_decompress4X_hufOnly_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, ++ size_t workspaceSize); /**< considers RLE and uncompressed as errors */ ++size_t HUF_decompress4X2_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, ++ size_t workspaceSize); /**< single-symbol decoder */ ++size_t HUF_decompress4X4_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, ++ size_t workspaceSize); /**< double-symbols decoder */ ++ ++/* **************************************** ++* HUF detailed API ++******************************************/ ++/*! ++HUF_compress() does the following: ++1. count symbol occurrence from source[] into table count[] using FSE_count() ++2. (optional) refine tableLog using HUF_optimalTableLog() ++3. build Huffman table from count using HUF_buildCTable() ++4. save Huffman table to memory buffer using HUF_writeCTable_wksp() ++5. encode the data stream using HUF_compress4X_usingCTable() ++ ++The following API allows targeting specific sub-functions for advanced tasks. ++For example, it's possible to compress several blocks using the same 'CTable', ++or to save and regenerate 'CTable' using external methods. 
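The static-allocation helpers and workspace-size constants above are enough to call the declared 4-stream decoder without any dynamic allocation. A hypothetical wrapper follows; the function name and its buffers are illustrative, while the macros and the HUF_decompress4X_DCtx_wksp() signature are the header's own.

    /* Hypothetical caller of the 4-stream Huffman decoder declared above. */
    #include "huf.h"

    static size_t huf_decode_block(void *dst, size_t dstSize,
                                   const void *cSrc, size_t cSrcSize)
    {
        /* DTable sized for the configured HUF_TABLELOG_MAX (12). */
        HUF_CREATE_STATIC_DTABLEX2(dtable, HUF_TABLELOG_MAX);
        U32 wksp[HUF_DECOMPRESS_WORKSPACE_SIZE_U32];   /* 3 KiB scratch */

        /* This variant also accepts RLE and uncompressed inputs, per the
         * comment on its declaration above. */
        return HUF_decompress4X_DCtx_wksp(dtable, dst, dstSize,
                                          cSrc, cSrcSize, wksp, sizeof(wksp));
    }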
++*/ ++/* FSE_count() : find it within "fse.h" */ ++unsigned HUF_optimalTableLog(unsigned maxTableLog, size_t srcSize, unsigned maxSymbolValue); ++typedef struct HUF_CElt_s HUF_CElt; /* incomplete type */ ++size_t HUF_writeCTable_wksp(void *dst, size_t maxDstSize, const HUF_CElt *CTable, unsigned maxSymbolValue, unsigned huffLog, void *workspace, size_t workspaceSize); ++size_t HUF_compress4X_usingCTable(void *dst, size_t dstSize, const void *src, size_t srcSize, const HUF_CElt *CTable); ++ ++typedef enum { ++ HUF_repeat_none, /**< Cannot use the previous table */ ++ HUF_repeat_check, /**< Can use the previous table but it must be checked. Note : The previous table must have been constructed by HUF_compress{1, ++ 4}X_repeat */ ++ HUF_repeat_valid /**< Can use the previous table and it is asumed to be valid */ ++} HUF_repeat; ++/** HUF_compress4X_repeat() : ++* Same as HUF_compress4X_wksp(), but considers using hufTable if *repeat != HUF_repeat_none. ++* If it uses hufTable it does not modify hufTable or repeat. ++* If it doesn't, it sets *repeat = HUF_repeat_none, and it sets hufTable to the table used. ++* If preferRepeat then the old table will always be used if valid. */ ++size_t HUF_compress4X_repeat(void *dst, size_t dstSize, const void *src, size_t srcSize, unsigned maxSymbolValue, unsigned tableLog, void *workSpace, ++ size_t wkspSize, HUF_CElt *hufTable, HUF_repeat *repeat, ++ int preferRepeat); /**< `workSpace` must be a table of at least HUF_COMPRESS_WORKSPACE_SIZE_U32 unsigned */ ++ ++/** HUF_buildCTable_wksp() : ++ * Same as HUF_buildCTable(), but using externally allocated scratch buffer. ++ * `workSpace` must be aligned on 4-bytes boundaries, and be at least as large as a table of 1024 unsigned. ++ */ ++size_t HUF_buildCTable_wksp(HUF_CElt *tree, const U32 *count, U32 maxSymbolValue, U32 maxNbBits, void *workSpace, size_t wkspSize); ++ ++/*! HUF_readStats() : ++ Read compact Huffman tree, saved by HUF_writeCTable(). ++ `huffWeight` is destination buffer. ++ @return : size read from `src` , or an error Code . ++ Note : Needed by HUF_readCTable() and HUF_readDTableXn() . */ ++size_t HUF_readStats_wksp(BYTE *huffWeight, size_t hwSize, U32 *rankStats, U32 *nbSymbolsPtr, U32 *tableLogPtr, const void *src, size_t srcSize, ++ void *workspace, size_t workspaceSize); ++ ++/** HUF_readCTable() : ++* Loading a CTable saved with HUF_writeCTable() */ ++size_t HUF_readCTable_wksp(HUF_CElt *CTable, unsigned maxSymbolValue, const void *src, size_t srcSize, void *workspace, size_t workspaceSize); ++ ++/* ++HUF_decompress() does the following: ++1. select the decompression algorithm (X2, X4) based on pre-computed heuristics ++2. build Huffman table from save, using HUF_readDTableXn() ++3. decode 1 or 4 segments in parallel using HUF_decompressSXn_usingDTable ++*/ ++ ++/** HUF_selectDecoder() : ++* Tells which decoder is likely to decode faster, ++* based on a set of pre-determined metrics. ++* @return : 0==HUF_decompress4X2, 1==HUF_decompress4X4 . 
++* Assumption : 0 < cSrcSize < dstSize <= 128 KB */ ++U32 HUF_selectDecoder(size_t dstSize, size_t cSrcSize); ++ ++size_t HUF_readDTableX2_wksp(HUF_DTable *DTable, const void *src, size_t srcSize, void *workspace, size_t workspaceSize); ++size_t HUF_readDTableX4_wksp(HUF_DTable *DTable, const void *src, size_t srcSize, void *workspace, size_t workspaceSize); ++ ++size_t HUF_decompress4X_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable); ++size_t HUF_decompress4X2_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable); ++size_t HUF_decompress4X4_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable); ++ ++/* single stream variants */ ++ ++size_t HUF_compress1X_wksp(void *dst, size_t dstSize, const void *src, size_t srcSize, unsigned maxSymbolValue, unsigned tableLog, void *workSpace, ++ size_t wkspSize); /**< `workSpace` must be a table of at least HUF_COMPRESS_WORKSPACE_SIZE_U32 unsigned */ ++size_t HUF_compress1X_usingCTable(void *dst, size_t dstSize, const void *src, size_t srcSize, const HUF_CElt *CTable); ++/** HUF_compress1X_repeat() : ++* Same as HUF_compress1X_wksp(), but considers using hufTable if *repeat != HUF_repeat_none. ++* If it uses hufTable it does not modify hufTable or repeat. ++* If it doesn't, it sets *repeat = HUF_repeat_none, and it sets hufTable to the table used. ++* If preferRepeat then the old table will always be used if valid. */ ++size_t HUF_compress1X_repeat(void *dst, size_t dstSize, const void *src, size_t srcSize, unsigned maxSymbolValue, unsigned tableLog, void *workSpace, ++ size_t wkspSize, HUF_CElt *hufTable, HUF_repeat *repeat, ++ int preferRepeat); /**< `workSpace` must be a table of at least HUF_COMPRESS_WORKSPACE_SIZE_U32 unsigned */ ++ ++size_t HUF_decompress1X_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize); ++size_t HUF_decompress1X2_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, ++ size_t workspaceSize); /**< single-symbol decoder */ ++size_t HUF_decompress1X4_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, ++ size_t workspaceSize); /**< double-symbols decoder */ ++ ++size_t HUF_decompress1X_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, ++ const HUF_DTable *DTable); /**< automatic selection of sing or double symbol decoder, based on DTable */ ++size_t HUF_decompress1X2_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable); ++size_t HUF_decompress1X4_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable); ++ ++#endif /* HUF_H_298734234 */ +diff --git a/xen/common/zstd/huf_decompress.c b/xen/common/zstd/huf_decompress.c +new file mode 100644 +index 000000000000..341619e64246 +--- /dev/null ++++ b/xen/common/zstd/huf_decompress.c +@@ -0,0 +1,960 @@ ++/* ++ * Huffman decoder, part of New Generation Entropy library ++ * Copyright (C) 2013-2016, Yann Collet. 
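HUF_selectDecoder() above is the piece that picks between the two table layouts at run time: 0 selects the single-symbol (X2) decoder, 1 the double-symbol (X4) decoder, based on size heuristics. The sketch below shows how a caller combines it with the _DCtx_wksp variants declared earlier; this is presumably what the 4X wrapper in huf_decompress.c does, but the wrapper here is hypothetical.

    /* Hypothetical dispatch over the decoder heuristic declared above. */
    #include "huf.h"

    static size_t huf_decode_auto(HUF_DTable *dctx, void *dst, size_t dstSize,
                                  const void *cSrc, size_t cSrcSize,
                                  void *wksp, size_t wkspSize)
    {
        U32 const algo = HUF_selectDecoder(dstSize, cSrcSize);  /* 0 = X2, 1 = X4 */

        return algo
            ? HUF_decompress4X4_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize,
                                          wksp, wkspSize)
            : HUF_decompress4X2_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize,
                                          wksp, wkspSize);
    }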
++ * ++ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php) ++ * ++ * Redistribution and use in source and binary forms, with or without ++ * modification, are permitted provided that the following conditions are ++ * met: ++ * ++ * * Redistributions of source code must retain the above copyright ++ * notice, this list of conditions and the following disclaimer. ++ * * Redistributions in binary form must reproduce the above ++ * copyright notice, this list of conditions and the following disclaimer ++ * in the documentation and/or other materials provided with the ++ * distribution. ++ * ++ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ++ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT ++ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ++ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT ++ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, ++ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT ++ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, ++ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY ++ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT ++ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE ++ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ * ++ * You can contact the author at : ++ * - Source repository : https://github.com/Cyan4973/FiniteStateEntropy ++ */ ++ ++/* ************************************************************** ++* Compiler specifics ++****************************************************************/ ++#define FORCE_INLINE static always_inline ++ ++/* ************************************************************** ++* Dependencies ++****************************************************************/ ++#include "bitstream.h" /* BIT_* */ ++#include "fse.h" /* header compression */ ++#include "huf.h" ++#include ++#include /* memcpy, memset */ ++ ++/* ************************************************************** ++* Error Management ++****************************************************************/ ++#define HUF_STATIC_ASSERT(c) \ ++ { \ ++ enum { HUF_static_assert = 1 / (int)(!!(c)) }; \ ++ } /* use only *after* variable declarations */ ++ ++/*-***************************/ ++/* generic DTableDesc */ ++/*-***************************/ ++ ++typedef struct { ++ BYTE maxTableLog; ++ BYTE tableType; ++ BYTE tableLog; ++ BYTE reserved; ++} DTableDesc; ++ ++static DTableDesc INIT HUF_getDTableDesc(const HUF_DTable *table) ++{ ++ DTableDesc dtd; ++ memcpy(&dtd, table, sizeof(dtd)); ++ return dtd; ++} ++ ++/*-***************************/ ++/* single-symbol decoding */ ++/*-***************************/ ++ ++typedef struct { ++ BYTE byte; ++ BYTE nbBits; ++} HUF_DEltX2; /* single-symbol decoding */ ++ ++size_t INIT HUF_readDTableX2_wksp(HUF_DTable *DTable, const void *src, size_t srcSize, void *workspace, size_t workspaceSize) ++{ ++ U32 tableLog = 0; ++ U32 nbSymbols = 0; ++ size_t iSize; ++ void *const dtPtr = DTable + 1; ++ HUF_DEltX2 *const dt = (HUF_DEltX2 *)dtPtr; ++ ++ U32 *rankVal; 
++ BYTE *huffWeight; ++ size_t spaceUsed32 = 0; ++ ++ rankVal = (U32 *)workspace + spaceUsed32; ++ spaceUsed32 += HUF_TABLELOG_ABSOLUTEMAX + 1; ++ huffWeight = (BYTE *)((U32 *)workspace + spaceUsed32); ++ spaceUsed32 += ALIGN(HUF_SYMBOLVALUE_MAX + 1, sizeof(U32)) >> 2; ++ ++ if ((spaceUsed32 << 2) > workspaceSize) ++ return ERROR(tableLog_tooLarge); ++ workspace = (U32 *)workspace + spaceUsed32; ++ workspaceSize -= (spaceUsed32 << 2); ++ ++ HUF_STATIC_ASSERT(sizeof(DTableDesc) == sizeof(HUF_DTable)); ++ /* memset(huffWeight, 0, sizeof(huffWeight)); */ /* is not necessary, even though some analyzer complain ... */ ++ ++ iSize = HUF_readStats_wksp(huffWeight, HUF_SYMBOLVALUE_MAX + 1, rankVal, &nbSymbols, &tableLog, src, srcSize, workspace, workspaceSize); ++ if (HUF_isError(iSize)) ++ return iSize; ++ ++ /* Table header */ ++ { ++ DTableDesc dtd = HUF_getDTableDesc(DTable); ++ if (tableLog > (U32)(dtd.maxTableLog + 1)) ++ return ERROR(tableLog_tooLarge); /* DTable too small, Huffman tree cannot fit in */ ++ dtd.tableType = 0; ++ dtd.tableLog = (BYTE)tableLog; ++ memcpy(DTable, &dtd, sizeof(dtd)); ++ } ++ ++ /* Calculate starting value for each rank */ ++ { ++ U32 n, nextRankStart = 0; ++ for (n = 1; n < tableLog + 1; n++) { ++ U32 const curr = nextRankStart; ++ nextRankStart += (rankVal[n] << (n - 1)); ++ rankVal[n] = curr; ++ } ++ } ++ ++ /* fill DTable */ ++ { ++ U32 n; ++ for (n = 0; n < nbSymbols; n++) { ++ U32 const w = huffWeight[n]; ++ U32 const length = (1 << w) >> 1; ++ U32 u; ++ HUF_DEltX2 D; ++ D.byte = (BYTE)n; ++ D.nbBits = (BYTE)(tableLog + 1 - w); ++ for (u = rankVal[w]; u < rankVal[w] + length; u++) ++ dt[u] = D; ++ rankVal[w] += length; ++ } ++ } ++ ++ return iSize; ++} ++ ++static BYTE INIT HUF_decodeSymbolX2(BIT_DStream_t *Dstream, const HUF_DEltX2 *dt, const U32 dtLog) ++{ ++ size_t const val = BIT_lookBitsFast(Dstream, dtLog); /* note : dtLog >= 1 */ ++ BYTE const c = dt[val].byte; ++ BIT_skipBits(Dstream, dt[val].nbBits); ++ return c; ++} ++ ++#define HUF_DECODE_SYMBOLX2_0(ptr, DStreamPtr) *ptr++ = HUF_decodeSymbolX2(DStreamPtr, dt, dtLog) ++ ++#define HUF_DECODE_SYMBOLX2_1(ptr, DStreamPtr) \ ++ if (ZSTD_64bits() || (HUF_TABLELOG_MAX <= 12)) \ ++ HUF_DECODE_SYMBOLX2_0(ptr, DStreamPtr) ++ ++#define HUF_DECODE_SYMBOLX2_2(ptr, DStreamPtr) \ ++ if (ZSTD_64bits()) \ ++ HUF_DECODE_SYMBOLX2_0(ptr, DStreamPtr) ++ ++FORCE_INLINE size_t HUF_decodeStreamX2(BYTE *p, BIT_DStream_t *const bitDPtr, BYTE *const pEnd, const HUF_DEltX2 *const dt, const U32 dtLog) ++{ ++ BYTE *const pStart = p; ++ ++ /* up to 4 symbols at a time */ ++ while ((BIT_reloadDStream(bitDPtr) == BIT_DStream_unfinished) && (p <= pEnd - 4)) { ++ HUF_DECODE_SYMBOLX2_2(p, bitDPtr); ++ HUF_DECODE_SYMBOLX2_1(p, bitDPtr); ++ HUF_DECODE_SYMBOLX2_2(p, bitDPtr); ++ HUF_DECODE_SYMBOLX2_0(p, bitDPtr); ++ } ++ ++ /* closer to the end */ ++ while ((BIT_reloadDStream(bitDPtr) == BIT_DStream_unfinished) && (p < pEnd)) ++ HUF_DECODE_SYMBOLX2_0(p, bitDPtr); ++ ++ /* no more data to retrieve from bitstream, hence no need to reload */ ++ while (p < pEnd) ++ HUF_DECODE_SYMBOLX2_0(p, bitDPtr); ++ ++ return pEnd - pStart; ++} ++ ++static size_t INIT HUF_decompress1X2_usingDTable_internal(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ BYTE *op = (BYTE *)dst; ++ BYTE *const oend = op + dstSize; ++ const void *dtPtr = DTable + 1; ++ const HUF_DEltX2 *const dt = (const HUF_DEltX2 *)dtPtr; ++ BIT_DStream_t bitD; ++ DTableDesc const dtd = HUF_getDTableDesc(DTable); ++ U32 const dtLog = 
dtd.tableLog; ++ ++ { ++ size_t const errorCode = BIT_initDStream(&bitD, cSrc, cSrcSize); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ ++ HUF_decodeStreamX2(op, &bitD, oend, dt, dtLog); ++ ++ /* check */ ++ if (!BIT_endOfDStream(&bitD)) ++ return ERROR(corruption_detected); ++ ++ return dstSize; ++} ++ ++size_t INIT HUF_decompress1X2_usingDTable(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ DTableDesc dtd = HUF_getDTableDesc(DTable); ++ if (dtd.tableType != 0) ++ return ERROR(GENERIC); ++ return HUF_decompress1X2_usingDTable_internal(dst, dstSize, cSrc, cSrcSize, DTable); ++} ++ ++size_t INIT HUF_decompress1X2_DCtx_wksp(HUF_DTable *DCtx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ const BYTE *ip = (const BYTE *)cSrc; ++ ++ size_t const hSize = HUF_readDTableX2_wksp(DCtx, cSrc, cSrcSize, workspace, workspaceSize); ++ if (HUF_isError(hSize)) ++ return hSize; ++ if (hSize >= cSrcSize) ++ return ERROR(srcSize_wrong); ++ ip += hSize; ++ cSrcSize -= hSize; ++ ++ return HUF_decompress1X2_usingDTable_internal(dst, dstSize, ip, cSrcSize, DCtx); ++} ++ ++static size_t INIT HUF_decompress4X2_usingDTable_internal(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ /* Check */ ++ if (cSrcSize < 10) ++ return ERROR(corruption_detected); /* strict minimum : jump table + 1 byte per stream */ ++ ++ { ++ const BYTE *const istart = (const BYTE *)cSrc; ++ BYTE *const ostart = (BYTE *)dst; ++ BYTE *const oend = ostart + dstSize; ++ const void *const dtPtr = DTable + 1; ++ const HUF_DEltX2 *const dt = (const HUF_DEltX2 *)dtPtr; ++ ++ /* Init */ ++ BIT_DStream_t bitD1; ++ BIT_DStream_t bitD2; ++ BIT_DStream_t bitD3; ++ BIT_DStream_t bitD4; ++ size_t const length1 = ZSTD_readLE16(istart); ++ size_t const length2 = ZSTD_readLE16(istart + 2); ++ size_t const length3 = ZSTD_readLE16(istart + 4); ++ size_t const length4 = cSrcSize - (length1 + length2 + length3 + 6); ++ const BYTE *const istart1 = istart + 6; /* jumpTable */ ++ const BYTE *const istart2 = istart1 + length1; ++ const BYTE *const istart3 = istart2 + length2; ++ const BYTE *const istart4 = istart3 + length3; ++ const size_t segmentSize = (dstSize + 3) / 4; ++ BYTE *const opStart2 = ostart + segmentSize; ++ BYTE *const opStart3 = opStart2 + segmentSize; ++ BYTE *const opStart4 = opStart3 + segmentSize; ++ BYTE *op1 = ostart; ++ BYTE *op2 = opStart2; ++ BYTE *op3 = opStart3; ++ BYTE *op4 = opStart4; ++ U32 endSignal; ++ DTableDesc const dtd = HUF_getDTableDesc(DTable); ++ U32 const dtLog = dtd.tableLog; ++ ++ if (length4 > cSrcSize) ++ return ERROR(corruption_detected); /* overflow */ ++ { ++ size_t const errorCode = BIT_initDStream(&bitD1, istart1, length1); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ { ++ size_t const errorCode = BIT_initDStream(&bitD2, istart2, length2); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ { ++ size_t const errorCode = BIT_initDStream(&bitD3, istart3, length3); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ { ++ size_t const errorCode = BIT_initDStream(&bitD4, istart4, length4); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ ++ /* 16-32 symbols per loop (4-8 symbols per stream) */ ++ endSignal = BIT_reloadDStream(&bitD1) | BIT_reloadDStream(&bitD2) | BIT_reloadDStream(&bitD3) | BIT_reloadDStream(&bitD4); ++ for (; (endSignal == BIT_DStream_unfinished) && (op4 < (oend - 7));) { ++ 
HUF_DECODE_SYMBOLX2_2(op1, &bitD1); ++ HUF_DECODE_SYMBOLX2_2(op2, &bitD2); ++ HUF_DECODE_SYMBOLX2_2(op3, &bitD3); ++ HUF_DECODE_SYMBOLX2_2(op4, &bitD4); ++ HUF_DECODE_SYMBOLX2_1(op1, &bitD1); ++ HUF_DECODE_SYMBOLX2_1(op2, &bitD2); ++ HUF_DECODE_SYMBOLX2_1(op3, &bitD3); ++ HUF_DECODE_SYMBOLX2_1(op4, &bitD4); ++ HUF_DECODE_SYMBOLX2_2(op1, &bitD1); ++ HUF_DECODE_SYMBOLX2_2(op2, &bitD2); ++ HUF_DECODE_SYMBOLX2_2(op3, &bitD3); ++ HUF_DECODE_SYMBOLX2_2(op4, &bitD4); ++ HUF_DECODE_SYMBOLX2_0(op1, &bitD1); ++ HUF_DECODE_SYMBOLX2_0(op2, &bitD2); ++ HUF_DECODE_SYMBOLX2_0(op3, &bitD3); ++ HUF_DECODE_SYMBOLX2_0(op4, &bitD4); ++ endSignal = BIT_reloadDStream(&bitD1) | BIT_reloadDStream(&bitD2) | BIT_reloadDStream(&bitD3) | BIT_reloadDStream(&bitD4); ++ } ++ ++ /* check corruption */ ++ if (op1 > opStart2) ++ return ERROR(corruption_detected); ++ if (op2 > opStart3) ++ return ERROR(corruption_detected); ++ if (op3 > opStart4) ++ return ERROR(corruption_detected); ++ /* note : op4 supposed already verified within main loop */ ++ ++ /* finish bitStreams one by one */ ++ HUF_decodeStreamX2(op1, &bitD1, opStart2, dt, dtLog); ++ HUF_decodeStreamX2(op2, &bitD2, opStart3, dt, dtLog); ++ HUF_decodeStreamX2(op3, &bitD3, opStart4, dt, dtLog); ++ HUF_decodeStreamX2(op4, &bitD4, oend, dt, dtLog); ++ ++ /* check */ ++ endSignal = BIT_endOfDStream(&bitD1) & BIT_endOfDStream(&bitD2) & BIT_endOfDStream(&bitD3) & BIT_endOfDStream(&bitD4); ++ if (!endSignal) ++ return ERROR(corruption_detected); ++ ++ /* decoded size */ ++ return dstSize; ++ } ++} ++ ++size_t INIT HUF_decompress4X2_usingDTable(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ DTableDesc dtd = HUF_getDTableDesc(DTable); ++ if (dtd.tableType != 0) ++ return ERROR(GENERIC); ++ return HUF_decompress4X2_usingDTable_internal(dst, dstSize, cSrc, cSrcSize, DTable); ++} ++ ++size_t INIT HUF_decompress4X2_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ const BYTE *ip = (const BYTE *)cSrc; ++ ++ size_t const hSize = HUF_readDTableX2_wksp(dctx, cSrc, cSrcSize, workspace, workspaceSize); ++ if (HUF_isError(hSize)) ++ return hSize; ++ if (hSize >= cSrcSize) ++ return ERROR(srcSize_wrong); ++ ip += hSize; ++ cSrcSize -= hSize; ++ ++ return HUF_decompress4X2_usingDTable_internal(dst, dstSize, ip, cSrcSize, dctx); ++} ++ ++/* *************************/ ++/* double-symbols decoding */ ++/* *************************/ ++typedef struct { ++ U16 sequence; ++ BYTE nbBits; ++ BYTE length; ++} HUF_DEltX4; /* double-symbols decoding */ ++ ++typedef struct { ++ BYTE symbol; ++ BYTE weight; ++} sortedSymbol_t; ++ ++/* HUF_fillDTableX4Level2() : ++ * `rankValOrigin` must be a table of at least (HUF_TABLELOG_MAX + 1) U32 */ ++static void INIT HUF_fillDTableX4Level2(HUF_DEltX4 *DTable, U32 sizeLog, const U32 consumed, const U32 *rankValOrigin, const int minWeight, ++ const sortedSymbol_t *sortedSymbols, const U32 sortedListSize, U32 nbBitsBaseline, U16 baseSeq) ++{ ++ HUF_DEltX4 DElt; ++ U32 rankVal[HUF_TABLELOG_MAX + 1]; ++ ++ /* get pre-calculated rankVal */ ++ memcpy(rankVal, rankValOrigin, sizeof(rankVal)); ++ ++ /* fill skipped values */ ++ if (minWeight > 1) { ++ U32 i, skipSize = rankVal[minWeight]; ++ ZSTD_writeLE16(&(DElt.sequence), baseSeq); ++ DElt.nbBits = (BYTE)(consumed); ++ DElt.length = 1; ++ for (i = 0; i < skipSize; i++) ++ DTable[i] = DElt; ++ } ++ ++ /* fill DTable */ ++ { ++ U32 s; ++ for (s = 0; s < sortedListSize; s++) { /* note : 
sortedSymbols already skipped */ ++ const U32 symbol = sortedSymbols[s].symbol; ++ const U32 weight = sortedSymbols[s].weight; ++ const U32 nbBits = nbBitsBaseline - weight; ++ const U32 length = 1 << (sizeLog - nbBits); ++ const U32 start = rankVal[weight]; ++ U32 i = start; ++ const U32 end = start + length; ++ ++ ZSTD_writeLE16(&(DElt.sequence), (U16)(baseSeq + (symbol << 8))); ++ DElt.nbBits = (BYTE)(nbBits + consumed); ++ DElt.length = 2; ++ do { ++ DTable[i++] = DElt; ++ } while (i < end); /* since length >= 1 */ ++ ++ rankVal[weight] += length; ++ } ++ } ++} ++ ++typedef U32 rankVal_t[HUF_TABLELOG_MAX][HUF_TABLELOG_MAX + 1]; ++typedef U32 rankValCol_t[HUF_TABLELOG_MAX + 1]; ++ ++static void INIT HUF_fillDTableX4(HUF_DEltX4 *DTable, const U32 targetLog, const sortedSymbol_t *sortedList, ++ const U32 sortedListSize, const U32 *rankStart, ++ rankVal_t rankValOrigin, const U32 maxWeight, const U32 nbBitsBaseline) ++{ ++ U32 rankVal[HUF_TABLELOG_MAX + 1]; ++ const int scaleLog = nbBitsBaseline - targetLog; /* note : targetLog >= srcLog, hence scaleLog <= 1 */ ++ const U32 minBits = nbBitsBaseline - maxWeight; ++ U32 s; ++ ++ memcpy(rankVal, rankValOrigin, sizeof(rankVal)); ++ ++ /* fill DTable */ ++ for (s = 0; s < sortedListSize; s++) { ++ const U16 symbol = sortedList[s].symbol; ++ const U32 weight = sortedList[s].weight; ++ const U32 nbBits = nbBitsBaseline - weight; ++ const U32 start = rankVal[weight]; ++ const U32 length = 1 << (targetLog - nbBits); ++ ++ if (targetLog - nbBits >= minBits) { /* enough room for a second symbol */ ++ U32 sortedRank; ++ int minWeight = nbBits + scaleLog; ++ if (minWeight < 1) ++ minWeight = 1; ++ sortedRank = rankStart[minWeight]; ++ HUF_fillDTableX4Level2(DTable + start, targetLog - nbBits, nbBits, rankValOrigin[nbBits], minWeight, sortedList + sortedRank, ++ sortedListSize - sortedRank, nbBitsBaseline, symbol); ++ } else { ++ HUF_DEltX4 DElt; ++ ZSTD_writeLE16(&(DElt.sequence), symbol); ++ DElt.nbBits = (BYTE)(nbBits); ++ DElt.length = 1; ++ { ++ U32 const end = start + length; ++ U32 u; ++ for (u = start; u < end; u++) ++ DTable[u] = DElt; ++ } ++ } ++ rankVal[weight] += length; ++ } ++} ++ ++size_t INIT HUF_readDTableX4_wksp(HUF_DTable *DTable, const void *src, size_t srcSize, void *workspace, size_t workspaceSize) ++{ ++ U32 tableLog, maxW, sizeOfSort, nbSymbols; ++ DTableDesc dtd = HUF_getDTableDesc(DTable); ++ U32 const maxTableLog = dtd.maxTableLog; ++ size_t iSize; ++ void *dtPtr = DTable + 1; /* force compiler to avoid strict-aliasing */ ++ HUF_DEltX4 *const dt = (HUF_DEltX4 *)dtPtr; ++ U32 *rankStart; ++ ++ rankValCol_t *rankVal; ++ U32 *rankStats; ++ U32 *rankStart0; ++ sortedSymbol_t *sortedSymbol; ++ BYTE *weightList; ++ size_t spaceUsed32 = 0; ++ ++ HUF_STATIC_ASSERT((sizeof(rankValCol_t) & 3) == 0); ++ ++ rankVal = (rankValCol_t *)((U32 *)workspace + spaceUsed32); ++ spaceUsed32 += (sizeof(rankValCol_t) * HUF_TABLELOG_MAX) >> 2; ++ rankStats = (U32 *)workspace + spaceUsed32; ++ spaceUsed32 += HUF_TABLELOG_MAX + 1; ++ rankStart0 = (U32 *)workspace + spaceUsed32; ++ spaceUsed32 += HUF_TABLELOG_MAX + 2; ++ sortedSymbol = (sortedSymbol_t *)((U32 *)workspace + spaceUsed32); ++ spaceUsed32 += ALIGN(sizeof(sortedSymbol_t) * (HUF_SYMBOLVALUE_MAX + 1), sizeof(U32)) >> 2; ++ weightList = (BYTE *)((U32 *)workspace + spaceUsed32); ++ spaceUsed32 += ALIGN(HUF_SYMBOLVALUE_MAX + 1, sizeof(U32)) >> 2; ++ ++ if ((spaceUsed32 << 2) > workspaceSize) ++ return ERROR(tableLog_tooLarge); ++ workspace = (U32 *)workspace + spaceUsed32; ++ workspaceSize -= 
(spaceUsed32 << 2); ++ ++ rankStart = rankStart0 + 1; ++ memset(rankStats, 0, sizeof(U32) * (2 * HUF_TABLELOG_MAX + 2 + 1)); ++ ++ HUF_STATIC_ASSERT(sizeof(HUF_DEltX4) == sizeof(HUF_DTable)); /* if compiler fails here, assertion is wrong */ ++ if (maxTableLog > HUF_TABLELOG_MAX) ++ return ERROR(tableLog_tooLarge); ++ /* memset(weightList, 0, sizeof(weightList)); */ /* is not necessary, even though some analyzer complain ... */ ++ ++ iSize = HUF_readStats_wksp(weightList, HUF_SYMBOLVALUE_MAX + 1, rankStats, &nbSymbols, &tableLog, src, srcSize, workspace, workspaceSize); ++ if (HUF_isError(iSize)) ++ return iSize; ++ ++ /* check result */ ++ if (tableLog > maxTableLog) ++ return ERROR(tableLog_tooLarge); /* DTable can't fit code depth */ ++ ++ /* find maxWeight */ ++ for (maxW = tableLog; rankStats[maxW] == 0; maxW--) { ++ } /* necessarily finds a solution before 0 */ ++ ++ /* Get start index of each weight */ ++ { ++ U32 w, nextRankStart = 0; ++ for (w = 1; w < maxW + 1; w++) { ++ U32 curr = nextRankStart; ++ nextRankStart += rankStats[w]; ++ rankStart[w] = curr; ++ } ++ rankStart[0] = nextRankStart; /* put all 0w symbols at the end of sorted list*/ ++ sizeOfSort = nextRankStart; ++ } ++ ++ /* sort symbols by weight */ ++ { ++ U32 s; ++ for (s = 0; s < nbSymbols; s++) { ++ U32 const w = weightList[s]; ++ U32 const r = rankStart[w]++; ++ sortedSymbol[r].symbol = (BYTE)s; ++ sortedSymbol[r].weight = (BYTE)w; ++ } ++ rankStart[0] = 0; /* forget 0w symbols; this is beginning of weight(1) */ ++ } ++ ++ /* Build rankVal */ ++ { ++ U32 *const rankVal0 = rankVal[0]; ++ { ++ int const rescale = (maxTableLog - tableLog) - 1; /* tableLog <= maxTableLog */ ++ U32 nextRankVal = 0; ++ U32 w; ++ for (w = 1; w < maxW + 1; w++) { ++ U32 curr = nextRankVal; ++ nextRankVal += rankStats[w] << (w + rescale); ++ rankVal0[w] = curr; ++ } ++ } ++ { ++ U32 const minBits = tableLog + 1 - maxW; ++ U32 consumed; ++ for (consumed = minBits; consumed < maxTableLog - minBits + 1; consumed++) { ++ U32 *const rankValPtr = rankVal[consumed]; ++ U32 w; ++ for (w = 1; w < maxW + 1; w++) { ++ rankValPtr[w] = rankVal0[w] >> consumed; ++ } ++ } ++ } ++ } ++ ++ HUF_fillDTableX4(dt, maxTableLog, sortedSymbol, sizeOfSort, rankStart0, rankVal, maxW, tableLog + 1); ++ ++ dtd.tableLog = (BYTE)maxTableLog; ++ dtd.tableType = 1; ++ memcpy(DTable, &dtd, sizeof(dtd)); ++ return iSize; ++} ++ ++static U32 INIT HUF_decodeSymbolX4(void *op, BIT_DStream_t *DStream, const HUF_DEltX4 *dt, const U32 dtLog) ++{ ++ size_t const val = BIT_lookBitsFast(DStream, dtLog); /* note : dtLog >= 1 */ ++ memcpy(op, dt + val, 2); ++ BIT_skipBits(DStream, dt[val].nbBits); ++ return dt[val].length; ++} ++ ++static U32 INIT HUF_decodeLastSymbolX4(void *op, BIT_DStream_t *DStream, const HUF_DEltX4 *dt, const U32 dtLog) ++{ ++ size_t const val = BIT_lookBitsFast(DStream, dtLog); /* note : dtLog >= 1 */ ++ memcpy(op, dt + val, 1); ++ if (dt[val].length == 1) ++ BIT_skipBits(DStream, dt[val].nbBits); ++ else { ++ if (DStream->bitsConsumed < (sizeof(DStream->bitContainer) * 8)) { ++ BIT_skipBits(DStream, dt[val].nbBits); ++ if (DStream->bitsConsumed > (sizeof(DStream->bitContainer) * 8)) ++ /* ugly hack; works only because it's the last symbol. 
Note : can't easily extract nbBits from just this symbol */ ++ DStream->bitsConsumed = (sizeof(DStream->bitContainer) * 8); ++ } ++ } ++ return 1; ++} ++ ++#define HUF_DECODE_SYMBOLX4_0(ptr, DStreamPtr) ptr += HUF_decodeSymbolX4(ptr, DStreamPtr, dt, dtLog) ++ ++#define HUF_DECODE_SYMBOLX4_1(ptr, DStreamPtr) \ ++ if (ZSTD_64bits() || (HUF_TABLELOG_MAX <= 12)) \ ++ ptr += HUF_decodeSymbolX4(ptr, DStreamPtr, dt, dtLog) ++ ++#define HUF_DECODE_SYMBOLX4_2(ptr, DStreamPtr) \ ++ if (ZSTD_64bits()) \ ++ ptr += HUF_decodeSymbolX4(ptr, DStreamPtr, dt, dtLog) ++ ++FORCE_INLINE size_t HUF_decodeStreamX4(BYTE *p, BIT_DStream_t *bitDPtr, BYTE *const pEnd, const HUF_DEltX4 *const dt, const U32 dtLog) ++{ ++ BYTE *const pStart = p; ++ ++ /* up to 8 symbols at a time */ ++ while ((BIT_reloadDStream(bitDPtr) == BIT_DStream_unfinished) & (p < pEnd - (sizeof(bitDPtr->bitContainer) - 1))) { ++ HUF_DECODE_SYMBOLX4_2(p, bitDPtr); ++ HUF_DECODE_SYMBOLX4_1(p, bitDPtr); ++ HUF_DECODE_SYMBOLX4_2(p, bitDPtr); ++ HUF_DECODE_SYMBOLX4_0(p, bitDPtr); ++ } ++ ++ /* closer to end : up to 2 symbols at a time */ ++ while ((BIT_reloadDStream(bitDPtr) == BIT_DStream_unfinished) & (p <= pEnd - 2)) ++ HUF_DECODE_SYMBOLX4_0(p, bitDPtr); ++ ++ while (p <= pEnd - 2) ++ HUF_DECODE_SYMBOLX4_0(p, bitDPtr); /* no need to reload : reached the end of DStream */ ++ ++ if (p < pEnd) ++ p += HUF_decodeLastSymbolX4(p, bitDPtr, dt, dtLog); ++ ++ return p - pStart; ++} ++ ++static size_t INIT HUF_decompress1X4_usingDTable_internal(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ BIT_DStream_t bitD; ++ ++ /* Init */ ++ { ++ size_t const errorCode = BIT_initDStream(&bitD, cSrc, cSrcSize); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ ++ /* decode */ ++ { ++ BYTE *const ostart = (BYTE *)dst; ++ BYTE *const oend = ostart + dstSize; ++ const void *const dtPtr = DTable + 1; /* force compiler to not use strict-aliasing */ ++ const HUF_DEltX4 *const dt = (const HUF_DEltX4 *)dtPtr; ++ DTableDesc const dtd = HUF_getDTableDesc(DTable); ++ HUF_decodeStreamX4(ostart, &bitD, oend, dt, dtd.tableLog); ++ } ++ ++ /* check */ ++ if (!BIT_endOfDStream(&bitD)) ++ return ERROR(corruption_detected); ++ ++ /* decoded size */ ++ return dstSize; ++} ++ ++size_t INIT HUF_decompress1X4_usingDTable(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ DTableDesc dtd = HUF_getDTableDesc(DTable); ++ if (dtd.tableType != 1) ++ return ERROR(GENERIC); ++ return HUF_decompress1X4_usingDTable_internal(dst, dstSize, cSrc, cSrcSize, DTable); ++} ++ ++size_t INIT HUF_decompress1X4_DCtx_wksp(HUF_DTable *DCtx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ const BYTE *ip = (const BYTE *)cSrc; ++ ++ size_t const hSize = HUF_readDTableX4_wksp(DCtx, cSrc, cSrcSize, workspace, workspaceSize); ++ if (HUF_isError(hSize)) ++ return hSize; ++ if (hSize >= cSrcSize) ++ return ERROR(srcSize_wrong); ++ ip += hSize; ++ cSrcSize -= hSize; ++ ++ return HUF_decompress1X4_usingDTable_internal(dst, dstSize, ip, cSrcSize, DCtx); ++} ++ ++static size_t INIT HUF_decompress4X4_usingDTable_internal(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ if (cSrcSize < 10) ++ return ERROR(corruption_detected); /* strict minimum : jump table + 1 byte per stream */ ++ ++ { ++ const BYTE *const istart = (const BYTE *)cSrc; ++ BYTE *const ostart = (BYTE *)dst; ++ BYTE *const oend = ostart + dstSize; ++ const 
void *const dtPtr = DTable + 1; ++ const HUF_DEltX4 *const dt = (const HUF_DEltX4 *)dtPtr; ++ ++ /* Init */ ++ BIT_DStream_t bitD1; ++ BIT_DStream_t bitD2; ++ BIT_DStream_t bitD3; ++ BIT_DStream_t bitD4; ++ size_t const length1 = ZSTD_readLE16(istart); ++ size_t const length2 = ZSTD_readLE16(istart + 2); ++ size_t const length3 = ZSTD_readLE16(istart + 4); ++ size_t const length4 = cSrcSize - (length1 + length2 + length3 + 6); ++ const BYTE *const istart1 = istart + 6; /* jumpTable */ ++ const BYTE *const istart2 = istart1 + length1; ++ const BYTE *const istart3 = istart2 + length2; ++ const BYTE *const istart4 = istart3 + length3; ++ size_t const segmentSize = (dstSize + 3) / 4; ++ BYTE *const opStart2 = ostart + segmentSize; ++ BYTE *const opStart3 = opStart2 + segmentSize; ++ BYTE *const opStart4 = opStart3 + segmentSize; ++ BYTE *op1 = ostart; ++ BYTE *op2 = opStart2; ++ BYTE *op3 = opStart3; ++ BYTE *op4 = opStart4; ++ U32 endSignal; ++ DTableDesc const dtd = HUF_getDTableDesc(DTable); ++ U32 const dtLog = dtd.tableLog; ++ ++ if (length4 > cSrcSize) ++ return ERROR(corruption_detected); /* overflow */ ++ { ++ size_t const errorCode = BIT_initDStream(&bitD1, istart1, length1); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ { ++ size_t const errorCode = BIT_initDStream(&bitD2, istart2, length2); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ { ++ size_t const errorCode = BIT_initDStream(&bitD3, istart3, length3); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ { ++ size_t const errorCode = BIT_initDStream(&bitD4, istart4, length4); ++ if (HUF_isError(errorCode)) ++ return errorCode; ++ } ++ ++ /* 16-32 symbols per loop (4-8 symbols per stream) */ ++ endSignal = BIT_reloadDStream(&bitD1) | BIT_reloadDStream(&bitD2) | BIT_reloadDStream(&bitD3) | BIT_reloadDStream(&bitD4); ++ for (; (endSignal == BIT_DStream_unfinished) & (op4 < (oend - (sizeof(bitD4.bitContainer) - 1)));) { ++ HUF_DECODE_SYMBOLX4_2(op1, &bitD1); ++ HUF_DECODE_SYMBOLX4_2(op2, &bitD2); ++ HUF_DECODE_SYMBOLX4_2(op3, &bitD3); ++ HUF_DECODE_SYMBOLX4_2(op4, &bitD4); ++ HUF_DECODE_SYMBOLX4_1(op1, &bitD1); ++ HUF_DECODE_SYMBOLX4_1(op2, &bitD2); ++ HUF_DECODE_SYMBOLX4_1(op3, &bitD3); ++ HUF_DECODE_SYMBOLX4_1(op4, &bitD4); ++ HUF_DECODE_SYMBOLX4_2(op1, &bitD1); ++ HUF_DECODE_SYMBOLX4_2(op2, &bitD2); ++ HUF_DECODE_SYMBOLX4_2(op3, &bitD3); ++ HUF_DECODE_SYMBOLX4_2(op4, &bitD4); ++ HUF_DECODE_SYMBOLX4_0(op1, &bitD1); ++ HUF_DECODE_SYMBOLX4_0(op2, &bitD2); ++ HUF_DECODE_SYMBOLX4_0(op3, &bitD3); ++ HUF_DECODE_SYMBOLX4_0(op4, &bitD4); ++ ++ endSignal = BIT_reloadDStream(&bitD1) | BIT_reloadDStream(&bitD2) | BIT_reloadDStream(&bitD3) | BIT_reloadDStream(&bitD4); ++ } ++ ++ /* check corruption */ ++ if (op1 > opStart2) ++ return ERROR(corruption_detected); ++ if (op2 > opStart3) ++ return ERROR(corruption_detected); ++ if (op3 > opStart4) ++ return ERROR(corruption_detected); ++ /* note : op4 already verified within main loop */ ++ ++ /* finish bitStreams one by one */ ++ HUF_decodeStreamX4(op1, &bitD1, opStart2, dt, dtLog); ++ HUF_decodeStreamX4(op2, &bitD2, opStart3, dt, dtLog); ++ HUF_decodeStreamX4(op3, &bitD3, opStart4, dt, dtLog); ++ HUF_decodeStreamX4(op4, &bitD4, oend, dt, dtLog); ++ ++ /* check */ ++ { ++ U32 const endCheck = BIT_endOfDStream(&bitD1) & BIT_endOfDStream(&bitD2) & BIT_endOfDStream(&bitD3) & BIT_endOfDStream(&bitD4); ++ if (!endCheck) ++ return ERROR(corruption_detected); ++ } ++ ++ /* decoded size */ ++ return dstSize; ++ } ++} ++ ++size_t INIT HUF_decompress4X4_usingDTable(void 
*dst, size_t dstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ DTableDesc dtd = HUF_getDTableDesc(DTable); ++ if (dtd.tableType != 1) ++ return ERROR(GENERIC); ++ return HUF_decompress4X4_usingDTable_internal(dst, dstSize, cSrc, cSrcSize, DTable); ++} ++ ++size_t INIT HUF_decompress4X4_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ const BYTE *ip = (const BYTE *)cSrc; ++ ++ size_t hSize = HUF_readDTableX4_wksp(dctx, cSrc, cSrcSize, workspace, workspaceSize); ++ if (HUF_isError(hSize)) ++ return hSize; ++ if (hSize >= cSrcSize) ++ return ERROR(srcSize_wrong); ++ ip += hSize; ++ cSrcSize -= hSize; ++ ++ return HUF_decompress4X4_usingDTable_internal(dst, dstSize, ip, cSrcSize, dctx); ++} ++ ++/* ********************************/ ++/* Generic decompression selector */ ++/* ********************************/ ++ ++size_t INIT HUF_decompress1X_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ DTableDesc const dtd = HUF_getDTableDesc(DTable); ++ return dtd.tableType ? HUF_decompress1X4_usingDTable_internal(dst, maxDstSize, cSrc, cSrcSize, DTable) ++ : HUF_decompress1X2_usingDTable_internal(dst, maxDstSize, cSrc, cSrcSize, DTable); ++} ++ ++size_t INIT HUF_decompress4X_usingDTable(void *dst, size_t maxDstSize, const void *cSrc, size_t cSrcSize, const HUF_DTable *DTable) ++{ ++ DTableDesc const dtd = HUF_getDTableDesc(DTable); ++ return dtd.tableType ? HUF_decompress4X4_usingDTable_internal(dst, maxDstSize, cSrc, cSrcSize, DTable) ++ : HUF_decompress4X2_usingDTable_internal(dst, maxDstSize, cSrc, cSrcSize, DTable); ++} ++ ++typedef struct { ++ U32 tableTime; ++ U32 decode256Time; ++} algo_time_t; ++static const algo_time_t algoTime[16 /* Quantization */][3 /* single, double, quad */] = { ++ /* single, double, quad */ ++ {{0, 0}, {1, 1}, {2, 2}}, /* Q==0 : impossible */ ++ {{0, 0}, {1, 1}, {2, 2}}, /* Q==1 : impossible */ ++ {{38, 130}, {1313, 74}, {2151, 38}}, /* Q == 2 : 12-18% */ ++ {{448, 128}, {1353, 74}, {2238, 41}}, /* Q == 3 : 18-25% */ ++ {{556, 128}, {1353, 74}, {2238, 47}}, /* Q == 4 : 25-32% */ ++ {{714, 128}, {1418, 74}, {2436, 53}}, /* Q == 5 : 32-38% */ ++ {{883, 128}, {1437, 74}, {2464, 61}}, /* Q == 6 : 38-44% */ ++ {{897, 128}, {1515, 75}, {2622, 68}}, /* Q == 7 : 44-50% */ ++ {{926, 128}, {1613, 75}, {2730, 75}}, /* Q == 8 : 50-56% */ ++ {{947, 128}, {1729, 77}, {3359, 77}}, /* Q == 9 : 56-62% */ ++ {{1107, 128}, {2083, 81}, {4006, 84}}, /* Q ==10 : 62-69% */ ++ {{1177, 128}, {2379, 87}, {4785, 88}}, /* Q ==11 : 69-75% */ ++ {{1242, 128}, {2415, 93}, {5155, 84}}, /* Q ==12 : 75-81% */ ++ {{1349, 128}, {2644, 106}, {5260, 106}}, /* Q ==13 : 81-87% */ ++ {{1455, 128}, {2422, 124}, {4174, 124}}, /* Q ==14 : 87-93% */ ++ {{722, 128}, {1891, 145}, {1936, 146}}, /* Q ==15 : 93-99% */ ++}; ++ ++/** HUF_selectDecoder() : ++* Tells which decoder is likely to decode faster, ++* based on a set of pre-determined metrics. ++* @return : 0==HUF_decompress4X2, 1==HUF_decompress4X4 . 
++* Assumption : 0 < cSrcSize < dstSize <= 128 KB */ ++U32 INIT HUF_selectDecoder(size_t dstSize, size_t cSrcSize) ++{ ++ /* decoder timing evaluation */ ++ U32 const Q = (U32)(cSrcSize * 16 / dstSize); /* Q < 16 since dstSize > cSrcSize */ ++ U32 const D256 = (U32)(dstSize >> 8); ++ U32 const DTime0 = algoTime[Q][0].tableTime + (algoTime[Q][0].decode256Time * D256); ++ U32 DTime1 = algoTime[Q][1].tableTime + (algoTime[Q][1].decode256Time * D256); ++ DTime1 += DTime1 >> 3; /* advantage to algorithm using less memory, for cache eviction */ ++ ++ return DTime1 < DTime0; ++} ++ ++typedef size_t (*decompressionAlgo)(void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize); ++ ++size_t INIT HUF_decompress4X_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ /* validation checks */ ++ if (dstSize == 0) ++ return ERROR(dstSize_tooSmall); ++ if (cSrcSize > dstSize) ++ return ERROR(corruption_detected); /* invalid */ ++ if (cSrcSize == dstSize) { ++ memcpy(dst, cSrc, dstSize); ++ return dstSize; ++ } /* not compressed */ ++ if (cSrcSize == 1) { ++ memset(dst, *(const BYTE *)cSrc, dstSize); ++ return dstSize; ++ } /* RLE */ ++ ++ { ++ U32 const algoNb = HUF_selectDecoder(dstSize, cSrcSize); ++ return algoNb ? HUF_decompress4X4_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize, workspace, workspaceSize) ++ : HUF_decompress4X2_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize, workspace, workspaceSize); ++ } ++} ++ ++size_t INIT HUF_decompress4X_hufOnly_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ /* validation checks */ ++ if (dstSize == 0) ++ return ERROR(dstSize_tooSmall); ++ if ((cSrcSize >= dstSize) || (cSrcSize <= 1)) ++ return ERROR(corruption_detected); /* invalid */ ++ ++ { ++ U32 const algoNb = HUF_selectDecoder(dstSize, cSrcSize); ++ return algoNb ? HUF_decompress4X4_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize, workspace, workspaceSize) ++ : HUF_decompress4X2_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize, workspace, workspaceSize); ++ } ++} ++ ++size_t INIT HUF_decompress1X_DCtx_wksp(HUF_DTable *dctx, void *dst, size_t dstSize, const void *cSrc, size_t cSrcSize, void *workspace, size_t workspaceSize) ++{ ++ /* validation checks */ ++ if (dstSize == 0) ++ return ERROR(dstSize_tooSmall); ++ if (cSrcSize > dstSize) ++ return ERROR(corruption_detected); /* invalid */ ++ if (cSrcSize == dstSize) { ++ memcpy(dst, cSrc, dstSize); ++ return dstSize; ++ } /* not compressed */ ++ if (cSrcSize == 1) { ++ memset(dst, *(const BYTE *)cSrc, dstSize); ++ return dstSize; ++ } /* RLE */ ++ ++ { ++ U32 const algoNb = HUF_selectDecoder(dstSize, cSrcSize); ++ return algoNb ? HUF_decompress1X4_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize, workspace, workspaceSize) ++ : HUF_decompress1X2_DCtx_wksp(dctx, dst, dstSize, cSrc, cSrcSize, workspace, workspaceSize); ++ } ++} +diff --git a/xen/common/zstd/mem.h b/xen/common/zstd/mem.h +new file mode 100644 +index 000000000000..288320069654 +--- /dev/null ++++ b/xen/common/zstd/mem.h +@@ -0,0 +1,151 @@ ++/** ++ * Copyright (c) 2016-present, Yann Collet, Facebook, Inc. ++ * All rights reserved. ++ * ++ * This source code is licensed under the BSD-style license found in the ++ * LICENSE file in the root directory of https://github.com/facebook/zstd. ++ * An additional grant of patent rights can be found in the PATENTS file in the ++ * same directory. 
++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ */ ++ ++#ifndef MEM_H_MODULE ++#define MEM_H_MODULE ++ ++/*-**************************************** ++* Dependencies ++******************************************/ ++#include /* memcpy */ ++#include /* size_t, ptrdiff_t */ ++#include ++ ++/*-**************************************** ++* Compiler specifics ++******************************************/ ++#define ZSTD_STATIC static inline ++ ++/*-************************************************************** ++* Basic Types ++*****************************************************************/ ++typedef uint8_t BYTE; ++typedef uint16_t U16; ++typedef int16_t S16; ++typedef uint32_t U32; ++typedef int32_t S32; ++typedef uint64_t U64; ++typedef int64_t S64; ++typedef ptrdiff_t iPtrDiff; ++typedef uintptr_t uPtrDiff; ++ ++/*-************************************************************** ++* Memory I/O ++*****************************************************************/ ++ZSTD_STATIC unsigned ZSTD_32bits(void) { return sizeof(size_t) == 4; } ++ZSTD_STATIC unsigned ZSTD_64bits(void) { return sizeof(size_t) == 8; } ++ ++#if defined(__LITTLE_ENDIAN) ++#define ZSTD_LITTLE_ENDIAN 1 ++#else ++#define ZSTD_LITTLE_ENDIAN 0 ++#endif ++ ++ZSTD_STATIC unsigned ZSTD_isLittleEndian(void) { return ZSTD_LITTLE_ENDIAN; } ++ ++ZSTD_STATIC U16 ZSTD_read16(const void *memPtr) { return get_unaligned((const U16 *)memPtr); } ++ ++ZSTD_STATIC U32 ZSTD_read32(const void *memPtr) { return get_unaligned((const U32 *)memPtr); } ++ ++ZSTD_STATIC U64 ZSTD_read64(const void *memPtr) { return get_unaligned((const U64 *)memPtr); } ++ ++ZSTD_STATIC size_t ZSTD_readST(const void *memPtr) { return get_unaligned((const size_t *)memPtr); } ++ ++ZSTD_STATIC void ZSTD_write16(void *memPtr, U16 value) { put_unaligned(value, (U16 *)memPtr); } ++ ++ZSTD_STATIC void ZSTD_write32(void *memPtr, U32 value) { put_unaligned(value, (U32 *)memPtr); } ++ ++ZSTD_STATIC void ZSTD_write64(void *memPtr, U64 value) { put_unaligned(value, (U64 *)memPtr); } ++ ++/*=== Little endian r/w ===*/ ++ ++ZSTD_STATIC U16 ZSTD_readLE16(const void *memPtr) { return get_unaligned_le16(memPtr); } ++ ++ZSTD_STATIC void ZSTD_writeLE16(void *memPtr, U16 val) { put_unaligned_le16(val, memPtr); } ++ ++ZSTD_STATIC U32 ZSTD_readLE24(const void *memPtr) { return ZSTD_readLE16(memPtr) + (((const BYTE *)memPtr)[2] << 16); } ++ ++ZSTD_STATIC void ZSTD_writeLE24(void *memPtr, U32 val) ++{ ++ ZSTD_writeLE16(memPtr, (U16)val); ++ ((BYTE *)memPtr)[2] = (BYTE)(val >> 16); ++} ++ ++ZSTD_STATIC U32 ZSTD_readLE32(const void *memPtr) { return get_unaligned_le32(memPtr); } ++ ++ZSTD_STATIC void ZSTD_writeLE32(void *memPtr, U32 val32) { put_unaligned_le32(val32, memPtr); } ++ ++ZSTD_STATIC U64 ZSTD_readLE64(const void *memPtr) { return get_unaligned_le64(memPtr); } ++ ++ZSTD_STATIC void ZSTD_writeLE64(void *memPtr, U64 val64) { put_unaligned_le64(val64, memPtr); } ++ ++ZSTD_STATIC size_t ZSTD_readLEST(const void *memPtr) ++{ ++ if (ZSTD_32bits()) ++ return (size_t)ZSTD_readLE32(memPtr); ++ else ++ return (size_t)ZSTD_readLE64(memPtr); ++} ++ ++ZSTD_STATIC void ZSTD_writeLEST(void *memPtr, size_t val) ++{ ++ if (ZSTD_32bits()) ++ ZSTD_writeLE32(memPtr, (U32)val); ++ else ++ ZSTD_writeLE64(memPtr, (U64)val); 
++} ++ ++/*=== Big endian r/w ===*/ ++ ++ZSTD_STATIC U32 ZSTD_readBE32(const void *memPtr) { return get_unaligned_be32(memPtr); } ++ ++ZSTD_STATIC void ZSTD_writeBE32(void *memPtr, U32 val32) { put_unaligned_be32(val32, memPtr); } ++ ++ZSTD_STATIC U64 ZSTD_readBE64(const void *memPtr) { return get_unaligned_be64(memPtr); } ++ ++ZSTD_STATIC void ZSTD_writeBE64(void *memPtr, U64 val64) { put_unaligned_be64(val64, memPtr); } ++ ++ZSTD_STATIC size_t ZSTD_readBEST(const void *memPtr) ++{ ++ if (ZSTD_32bits()) ++ return (size_t)ZSTD_readBE32(memPtr); ++ else ++ return (size_t)ZSTD_readBE64(memPtr); ++} ++ ++ZSTD_STATIC void ZSTD_writeBEST(void *memPtr, size_t val) ++{ ++ if (ZSTD_32bits()) ++ ZSTD_writeBE32(memPtr, (U32)val); ++ else ++ ZSTD_writeBE64(memPtr, (U64)val); ++} ++ ++/* function safe only for comparisons */ ++ZSTD_STATIC U32 ZSTD_readMINMATCH(const void *memPtr, U32 length) ++{ ++ switch (length) { ++ default: ++ case 4: return ZSTD_read32(memPtr); ++ case 3: ++ if (ZSTD_isLittleEndian()) ++ return ZSTD_read32(memPtr) << 8; ++ else ++ return ZSTD_read32(memPtr) >> 8; ++ } ++} ++ ++#endif /* MEM_H_MODULE */ +diff --git a/xen/common/zstd/zstd_common.c b/xen/common/zstd/zstd_common.c +new file mode 100644 +index 000000000000..a35c4a5f14a3 +--- /dev/null ++++ b/xen/common/zstd/zstd_common.c +@@ -0,0 +1,74 @@ ++/** ++ * Copyright (c) 2016-present, Yann Collet, Facebook, Inc. ++ * All rights reserved. ++ * ++ * This source code is licensed under the BSD-style license found in the ++ * LICENSE file in the root directory of https://github.com/facebook/zstd. ++ * An additional grant of patent rights can be found in the PATENTS file in the ++ * same directory. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ */ ++ ++/*-************************************* ++* Dependencies ++***************************************/ ++#include "error_private.h" ++#include "zstd_internal.h" /* declaration of ZSTD_isError, ZSTD_getErrorName, ZSTD_getErrorCode, ZSTD_getErrorString, ZSTD_versionNumber */ ++ ++/*=************************************************************** ++* Custom allocator ++****************************************************************/ ++ ++#define stack_push(stack, size) \ ++ ({ \ ++ void *const ptr = ZSTD_PTR_ALIGN((stack)->ptr); \ ++ (stack)->ptr = (char *)ptr + (size); \ ++ (stack)->ptr <= (stack)->end ? 
ptr : NULL; \ ++ }) ++ ++ZSTD_customMem INIT ZSTD_initStack(void *workspace, size_t workspaceSize) ++{ ++ ZSTD_customMem stackMem = {ZSTD_stackAlloc, ZSTD_stackFree, workspace}; ++ ZSTD_stack *stack = (ZSTD_stack *)workspace; ++ /* Verify preconditions */ ++ if (!workspace || workspaceSize < sizeof(ZSTD_stack) || workspace != ZSTD_PTR_ALIGN(workspace)) { ++ ZSTD_customMem error = {NULL, NULL, NULL}; ++ return error; ++ } ++ /* Initialize the stack */ ++ stack->ptr = workspace; ++ stack->end = (char *)workspace + workspaceSize; ++ stack_push(stack, sizeof(ZSTD_stack)); ++ return stackMem; ++} ++ ++void *INIT ZSTD_stackAllocAll(void *opaque, size_t *size) ++{ ++ ZSTD_stack *stack = (ZSTD_stack *)opaque; ++ *size = (BYTE const *)stack->end - (BYTE *)ZSTD_PTR_ALIGN(stack->ptr); ++ return stack_push(stack, *size); ++} ++ ++void *INIT ZSTD_stackAlloc(void *opaque, size_t size) ++{ ++ ZSTD_stack *stack = (ZSTD_stack *)opaque; ++ return stack_push(stack, size); ++} ++void INIT ZSTD_stackFree(void *opaque, void *address) ++{ ++ (void)opaque; ++ (void)address; ++} ++ ++void *INIT ZSTD_malloc(size_t size, ZSTD_customMem customMem) { return customMem.customAlloc(customMem.opaque, size); } ++ ++void INIT ZSTD_free(void *ptr, ZSTD_customMem customMem) ++{ ++ if (ptr != NULL) ++ customMem.customFree(customMem.opaque, ptr); ++} +diff --git a/xen/common/zstd/zstd_internal.h b/xen/common/zstd/zstd_internal.h +new file mode 100644 +index 000000000000..7f8e5529ebfa +--- /dev/null ++++ b/xen/common/zstd/zstd_internal.h +@@ -0,0 +1,372 @@ ++/** ++ * Copyright (c) 2016-present, Yann Collet, Facebook, Inc. ++ * All rights reserved. ++ * ++ * This source code is licensed under the BSD-style license found in the ++ * LICENSE file in the root directory of https://github.com/facebook/zstd. ++ * An additional grant of patent rights can be found in the PATENTS file in the ++ * same directory. ++ * ++ * This program is free software; you can redistribute it and/or modify it under ++ * the terms of the GNU General Public License version 2 as published by the ++ * Free Software Foundation. This program is dual-licensed; you may select ++ * either version 2 of the GNU General Public License ("GPL") or BSD license ++ * ("BSD"). ++ */ ++ ++#ifndef ZSTD_CCOMMON_H_MODULE ++#define ZSTD_CCOMMON_H_MODULE ++ ++/*-******************************************************* ++* Compiler specifics ++*********************************************************/ ++#define FORCE_INLINE static always_inline ++#define FORCE_NOINLINE static noinline INIT ++ ++/*-************************************* ++* Dependencies ++***************************************/ ++#include "error_private.h" ++#include "mem.h" ++#include ++#include ++ ++#define ALIGN(x, a) ((x + (a) - 1) & ~((a) - 1)) ++#define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a))) ++ ++typedef enum { ++ ZSTDnit_frameHeader, ++ ZSTDnit_blockHeader, ++ ZSTDnit_block, ++ ZSTDnit_lastBlock, ++ ZSTDnit_checksum, ++ ZSTDnit_skippableFrame ++} ZSTD_nextInputType_e; ++ ++/** ++ * struct ZSTD_frameParams - zstd frame parameters stored in the frame header ++ * @frameContentSize: The frame content size, or 0 if not present. ++ * @windowSize: The window size, or 0 if the frame is a skippable frame. ++ * @dictID: The dictionary id, or 0 if not present. ++ * @checksumFlag: Whether a checksum was used. 
++ */ ++typedef struct { ++ unsigned long long frameContentSize; ++ unsigned int windowSize; ++ unsigned int dictID; ++ unsigned int checksumFlag; ++} ZSTD_frameParams; ++ ++/** ++ * struct ZSTD_inBuffer - input buffer for streaming ++ * @src: Start of the input buffer. ++ * @size: Size of the input buffer. ++ * @pos: Position where reading stopped. Will be updated. ++ * Necessarily 0 <= pos <= size. ++ */ ++typedef struct ZSTD_inBuffer_s { ++ const void *src; ++ size_t size; ++ size_t pos; ++} ZSTD_inBuffer; ++ ++/** ++ * struct ZSTD_outBuffer - output buffer for streaming ++ * @dst: Start of the output buffer. ++ * @size: Size of the output buffer. ++ * @pos: Position where writing stopped. Will be updated. ++ * Necessarily 0 <= pos <= size. ++ */ ++typedef struct ZSTD_outBuffer_s { ++ void *dst; ++ size_t size; ++ size_t pos; ++} ZSTD_outBuffer; ++ ++typedef struct ZSTD_CCtx_s ZSTD_CCtx; ++typedef struct ZSTD_DCtx_s ZSTD_DCtx; ++ ++typedef struct ZSTD_CDict_s ZSTD_CDict; ++typedef struct ZSTD_DDict_s ZSTD_DDict; ++ ++typedef struct ZSTD_CStream_s ZSTD_CStream; ++typedef struct ZSTD_DStream_s ZSTD_DStream; ++ ++/*-************************************* ++* shared macros ++***************************************/ ++#define MIN(a, b) ((a) < (b) ? (a) : (b)) ++#define MAX(a, b) ((a) > (b) ? (a) : (b)) ++#define CHECK_F(f) \ ++ { \ ++ size_t const errcod = f; \ ++ if (ERR_isError(errcod)) \ ++ return errcod; \ ++ } /* check and Forward error code */ ++#define CHECK_E(f, e) \ ++ { \ ++ size_t const errcod = f; \ ++ if (ERR_isError(errcod)) \ ++ return ERROR(e); \ ++ } /* check and send Error code */ ++#define ZSTD_STATIC_ASSERT(c) \ ++ { \ ++ enum { ZSTD_static_assert = 1 / (int)(!!(c)) }; \ ++ } ++ ++/*-************************************* ++* Common constants ++***************************************/ ++#define ZSTD_MAGICNUMBER 0xFD2FB528 /* >= v0.8.0 */ ++#define ZSTD_MAGIC_SKIPPABLE_START 0x184D2A50U ++ ++#define ZSTD_OPT_NUM (1 << 12) ++#define ZSTD_DICT_MAGIC 0xEC30A437 /* v0.7+ */ ++ ++#define ZSTD_CONTENTSIZE_UNKNOWN (0ULL - 1) ++#define ZSTD_CONTENTSIZE_ERROR (0ULL - 2) ++ ++#define ZSTD_WINDOWLOG_MAX_32 27 ++#define ZSTD_WINDOWLOG_MAX_64 27 ++#define ZSTD_WINDOWLOG_MAX \ ++ ((unsigned int)(sizeof(size_t) == 4 \ ++ ? 
ZSTD_WINDOWLOG_MAX_32 \ ++ : ZSTD_WINDOWLOG_MAX_64)) ++#define ZSTD_WINDOWLOG_MIN 10 ++#define ZSTD_HASHLOG_MAX ZSTD_WINDOWLOG_MAX ++#define ZSTD_HASHLOG_MIN 6 ++#define ZSTD_CHAINLOG_MAX (ZSTD_WINDOWLOG_MAX+1) ++#define ZSTD_CHAINLOG_MIN ZSTD_HASHLOG_MIN ++#define ZSTD_HASHLOG3_MAX 17 ++#define ZSTD_SEARCHLOG_MAX (ZSTD_WINDOWLOG_MAX-1) ++#define ZSTD_SEARCHLOG_MIN 1 ++/* only for ZSTD_fast, other strategies are limited to 6 */ ++#define ZSTD_SEARCHLENGTH_MAX 7 ++/* only for ZSTD_btopt, other strategies are limited to 4 */ ++#define ZSTD_SEARCHLENGTH_MIN 3 ++#define ZSTD_TARGETLENGTH_MIN 4 ++#define ZSTD_TARGETLENGTH_MAX 999 ++ ++#define ZSTD_REP_NUM 3 /* number of repcodes */ ++#define ZSTD_REP_CHECK (ZSTD_REP_NUM) /* number of repcodes to check by the optimal parser */ ++#define ZSTD_REP_MOVE (ZSTD_REP_NUM - 1) ++#define ZSTD_REP_MOVE_OPT (ZSTD_REP_NUM) ++static const U32 repStartValue[ZSTD_REP_NUM] = {1, 4, 8}; ++ ++/* for static allocation */ ++#define ZSTD_FRAMEHEADERSIZE_MAX 18 ++#define ZSTD_FRAMEHEADERSIZE_MIN 6 ++static const size_t ZSTD_frameHeaderSize_prefix = 5; ++static const size_t ZSTD_frameHeaderSize_min = ZSTD_FRAMEHEADERSIZE_MIN; ++static const size_t ZSTD_frameHeaderSize_max = ZSTD_FRAMEHEADERSIZE_MAX; ++/* magic number + skippable frame length */ ++static const size_t ZSTD_skippableHeaderSize = 8; ++ ++#define ZSTD_BLOCKSIZE_ABSOLUTEMAX (128 * 1024) ++ ++#if 0 /* These don't seem to be usable - not sure what their purpose is. */ ++#define KB *(1 << 10) ++#define MB *(1 << 20) ++#define GB *(1U << 30) ++#endif ++ ++#define BIT7 128 ++#define BIT6 64 ++#define BIT5 32 ++#define BIT4 16 ++#define BIT1 2 ++#define BIT0 1 ++ ++#define ZSTD_WINDOWLOG_ABSOLUTEMIN 10 ++static const size_t ZSTD_fcs_fieldSize[4] = {0, 2, 4, 8}; ++static const size_t ZSTD_did_fieldSize[4] = {0, 1, 2, 4}; ++ ++#define ZSTD_BLOCKHEADERSIZE 3 /* C standard doesn't allow `static const` variable to be init using another `static const` variable */ ++static const size_t ZSTD_blockHeaderSize = ZSTD_BLOCKHEADERSIZE; ++typedef enum { bt_raw, bt_rle, bt_compressed, bt_reserved } blockType_e; ++ ++#define MIN_SEQUENCES_SIZE 1 /* nbSeq==0 */ ++#define MIN_CBLOCK_SIZE (1 /*litCSize*/ + 1 /* RLE or RAW */ + MIN_SEQUENCES_SIZE /* nbSeq==0 */) /* for a non-null block */ ++ ++#define HufLog 12 ++typedef enum { set_basic, set_rle, set_compressed, set_repeat } symbolEncodingType_e; ++ ++#define LONGNBSEQ 0x7F00 ++ ++#define MINMATCH 3 ++#define EQUAL_READ32 4 ++ ++#define Litbits 8 ++#define MaxLit ((1 << Litbits) - 1) ++#define MaxML 52 ++#define MaxLL 35 ++#define MaxOff 28 ++#define MaxSeq MAX(MaxLL, MaxML) /* Assumption : MaxOff < MaxLL,MaxML */ ++#define MLFSELog 9 ++#define LLFSELog 9 ++#define OffFSELog 8 ++ ++static const U32 LL_bits[MaxLL + 1] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; ++static const S16 LL_defaultNorm[MaxLL + 1] = {4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, -1, -1, -1, -1}; ++#define LL_DEFAULTNORMLOG 6 /* for static allocation */ ++static const U32 LL_defaultNormLog = LL_DEFAULTNORMLOG; ++ ++static const U32 ML_bits[MaxML + 1] = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ++ 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; ++static const S16 ML_defaultNorm[MaxML + 1] = {1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ++ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1}; ++#define ML_DEFAULTNORMLOG 6 /* for static allocation */ ++static const U32 ML_defaultNormLog = ML_DEFAULTNORMLOG; ++ ++static const S16 OF_defaultNorm[MaxOff + 1] = {1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1}; ++#define OF_DEFAULTNORMLOG 5 /* for static allocation */ ++static const U32 OF_defaultNormLog = OF_DEFAULTNORMLOG; ++ ++/*-******************************************* ++* Shared functions to include for inlining ++*********************************************/ ++ZSTD_STATIC void ZSTD_copy8(void *dst, const void *src) { ++ /* ++ * zstd relies heavily on gcc being able to analyze and inline this ++ * memcpy() call, since it is called in a tight loop. Preboot mode ++ * is compiled in freestanding mode, which stops gcc from analyzing ++ * memcpy(). Use __builtin_memcpy() to tell gcc to analyze this as a ++ * regular memcpy(). ++ */ ++ __builtin_memcpy(dst, src, 8); ++} ++/*! ZSTD_wildcopy() : ++* custom version of memcpy(), can copy up to 7 bytes too many (8 bytes if length==0) */ ++#define WILDCOPY_OVERLENGTH 8 ++ZSTD_STATIC void ZSTD_wildcopy(void *dst, const void *src, ptrdiff_t length) ++{ ++ const BYTE* ip = (const BYTE*)src; ++ BYTE* op = (BYTE*)dst; ++ BYTE* const oend = op + length; ++#if defined(GCC_VERSION) && GCC_VERSION >= 70000 && GCC_VERSION < 70200 ++ /* ++ * Work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81388. ++ * Avoid the bad case where the loop only runs once by handling the ++ * special case separately. This doesn't trigger the bug because it ++ * doesn't involve pointer/integer overflow. ++ */ ++ if (length <= 8) ++ return ZSTD_copy8(dst, src); ++#endif ++ do { ++ ZSTD_copy8(op, ip); ++ op += 8; ++ ip += 8; ++ } while (op < oend); ++} ++ ++/*-******************************************* ++* Private interfaces ++*********************************************/ ++typedef struct ZSTD_stats_s ZSTD_stats_t; ++ ++typedef struct { ++ U32 off; ++ U32 len; ++} ZSTD_match_t; ++ ++typedef struct { ++ U32 price; ++ U32 off; ++ U32 mlen; ++ U32 litlen; ++ U32 rep[ZSTD_REP_NUM]; ++} ZSTD_optimal_t; ++ ++typedef struct seqDef_s { ++ U32 offset; ++ U16 litLength; ++ U16 matchLength; ++} seqDef; ++ ++typedef struct { ++ seqDef *sequencesStart; ++ seqDef *sequences; ++ BYTE *litStart; ++ BYTE *lit; ++ BYTE *llCode; ++ BYTE *mlCode; ++ BYTE *ofCode; ++ U32 longLengthID; /* 0 == no longLength; 1 == Lit.longLength; 2 == Match.longLength; */ ++ U32 longLengthPos; ++ /* opt */ ++ ZSTD_optimal_t *priceTable; ++ ZSTD_match_t *matchTable; ++ U32 *matchLengthFreq; ++ U32 *litLengthFreq; ++ U32 *litFreq; ++ U32 *offCodeFreq; ++ U32 matchLengthSum; ++ U32 matchSum; ++ U32 litLengthSum; ++ U32 litSum; ++ U32 offCodeSum; ++ U32 log2matchLengthSum; ++ U32 log2matchSum; ++ U32 log2litLengthSum; ++ U32 log2litSum; ++ U32 log2offCodeSum; ++ U32 factor; ++ U32 staticPrices; ++ U32 cachedPrice; ++ U32 cachedLitLength; ++ const BYTE *cachedLiterals; ++} seqStore_t; ++ ++const seqStore_t *ZSTD_getSeqStore(const ZSTD_CCtx *ctx); ++void ZSTD_seqToCodes(const seqStore_t *seqStorePtr); ++int ZSTD_isSkipFrame(ZSTD_DCtx *dctx); ++ ++/*= Custom memory allocation functions */ ++typedef void *(*ZSTD_allocFunction)(void *opaque, size_t size); ++typedef void (*ZSTD_freeFunction)(void *opaque, void *address); ++typedef struct { ++ ZSTD_allocFunction customAlloc; ++ ZSTD_freeFunction customFree; ++ void *opaque; ++} ZSTD_customMem; ++ ++void *ZSTD_malloc(size_t size, ZSTD_customMem customMem); ++void 
ZSTD_free(void *ptr, ZSTD_customMem customMem); ++ ++/*====== stack allocation ======*/ ++ ++typedef struct { ++ void *ptr; ++ const void *end; ++} ZSTD_stack; ++ ++#define ZSTD_ALIGN(x) ALIGN(x, sizeof(size_t)) ++#define ZSTD_PTR_ALIGN(p) PTR_ALIGN(p, sizeof(size_t)) ++ ++ZSTD_customMem ZSTD_initStack(void *workspace, size_t workspaceSize); ++ ++void *ZSTD_stackAllocAll(void *opaque, size_t *size); ++void *ZSTD_stackAlloc(void *opaque, size_t size); ++void ZSTD_stackFree(void *opaque, void *address); ++ ++/*====== common function ======*/ ++ ++ZSTD_STATIC U32 ZSTD_highbit32(U32 val) { return 31 - __builtin_clz(val); } ++ ++/* hidden functions */ ++ ++/* ZSTD_invalidateRepCodes() : ++ * ensures next compression will not use repcodes from previous block. ++ * Note : only works with regular variant; ++ * do not use with extDict variant ! */ ++void ZSTD_invalidateRepCodes(ZSTD_CCtx *cctx); ++ ++size_t ZSTD_freeCCtx(ZSTD_CCtx *cctx); ++size_t ZSTD_freeDCtx(ZSTD_DCtx *dctx); ++size_t ZSTD_freeCDict(ZSTD_CDict *cdict); ++size_t ZSTD_freeDDict(ZSTD_DDict *cdict); ++size_t ZSTD_freeCStream(ZSTD_CStream *zcs); ++size_t ZSTD_freeDStream(ZSTD_DStream *zds); ++ ++#endif /* ZSTD_CCOMMON_H_MODULE */ +diff --git a/xen/include/asm-arm/types.h b/xen/include/asm-arm/types.h +index 30f95078cb0a..47696916d740 100644 +--- a/xen/include/asm-arm/types.h ++++ b/xen/include/asm-arm/types.h +@@ -61,6 +61,12 @@ typedef unsigned long size_t; + #endif + typedef signed long ssize_t; + ++#if defined(__PTRDIFF_TYPE__) ++typedef __PTRDIFF_TYPE__ ptrdiff_t; ++#else ++typedef signed long ptrdiff_t; ++#endif ++ + #endif /* __ASSEMBLY__ */ + + #endif /* __ARM_TYPES_H__ */ +diff --git a/xen/include/asm-x86/types.h b/xen/include/asm-x86/types.h +index fdf4f7dcc0bb..781713204876 100644 +--- a/xen/include/asm-x86/types.h ++++ b/xen/include/asm-x86/types.h +@@ -39,6 +39,12 @@ typedef unsigned long size_t; + #endif + typedef signed long ssize_t; + ++#if defined(__PTRDIFF_TYPE__) ++typedef __PTRDIFF_TYPE__ ptrdiff_t; ++#else ++typedef signed long ptrdiff_t; ++#endif ++ + #endif /* __ASSEMBLY__ */ + + #endif /* __X86_TYPES_H__ */ +diff --git a/xen/include/xen/decompress.h b/xen/include/xen/decompress.h +index b2955faa4bfb..f5bc17f2b63e 100644 +--- a/xen/include/xen/decompress.h ++++ b/xen/include/xen/decompress.h +@@ -31,7 +31,7 @@ typedef int decompress_fn(unsigned char *inbuf, unsigned int len, + * dependent). + */ + +-decompress_fn bunzip2, unxz, unlzma, unlzo, unlz4; ++decompress_fn bunzip2, unxz, unlzma, unlzo, unlz4, unzstd; + + int decompress(void *inbuf, unsigned int len, void *outbuf); + +-- +2.34.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0004-libxenguest-add-get_unaligned_le32.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0004-libxenguest-add-get_unaligned_le32.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0004-libxenguest-add-get_unaligned_le32.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0004-libxenguest-add-get_unaligned_le32.patch 2022-07-13 14:06:12.000000000 +0100 @@ -0,0 +1,112 @@ +From 7a763977f5deab7444fd1375d459757ebea64a16 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Tue, 26 Jan 2021 14:14:39 +0100 +Subject: [PATCH 4/5] libxenguest: add get_unaligned_le32() + +Abstract xc_dom_check_gzip()'s reading of the uncompressed size into a +helper re-usable, in particular, by other decompressor code. 
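(Illustration only, not part of the patch text: the helper described above is a plain byte-wise little-endian load. A minimal C sketch of the idea, using a hypothetical name le32_from_bytes() rather than the patch's get_unaligned_le32(), together with the blob/ziplen variables from xc_dom_check_gzip():)

    #include <stdint.h>
    #include <stddef.h>

    /* Assemble a 32-bit value from four unaligned little-endian bytes. */
    static inline unsigned int le32_from_bytes(const uint8_t *buf)
    {
        return ((unsigned int)buf[3] << 24) | (buf[2] << 16) |
               (buf[1] << 8) | buf[0];
    }

    /* A gzip stream stores its uncompressed length in its last four bytes,
     * so a caller holding blob/ziplen recovers it with:
     *     size_t unziplen = le32_from_bytes(blob + ziplen - 4);
     */

(The same load is what the zstd support in the following patch reuses for the size field appended to a compressed kernel.)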
+ +Sadly in the mini-os case this conflicts with other functions of the +same name (and purpose), which can't be easily replaced individually. +Yet it was requested that no full set of helpers be introduced at this +point in the release cycle. Hence the awkward XG_NEED_UNALIGNED. + +Requested-by: Ian Jackson +Signed-off-by: Jan Beulich +Reviewed-by: Ian Jackson +Release-Acked-by: Ian Jackson + +Bug-Ubuntu: https://bugs.launchpad.net/bugs/1956166 +Origin: backport, http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=d8099d94dfaa3573bd86ebfc457cbc8f70a3ecda +[backport: + - rename 'tools/libs/guest/xg_*' to 'tools/libxc/xc_*'; + - xc_dom_core.c: refresh 2 context lines/includes; + - xc_dom_decompress_lz4.c: refresh 1 context line (xg/xc)] +--- + tools/libxc/xc_dom_core.c | 5 ++--- + tools/libxc/xc_dom_decompress_lz4.c | 1 + + tools/libxc/xg_private.h | 9 +++++++++ + xen/common/lz4/defs.h | 5 ----- + 4 files changed, 12 insertions(+), 8 deletions(-) + +diff --git a/tools/libxc/xc_dom_core.c b/tools/libxc/xc_dom_core.c +index 9bd04cb2d554..7250d95bc0af 100644 +--- a/tools/libxc/xc_dom_core.c ++++ b/tools/libxc/xc_dom_core.c +@@ -31,6 +31,7 @@ + #include + #include + ++#define XG_NEED_UNALIGNED + #include "xg_private.h" + #include "xc_dom.h" + #include "_paths.h" +@@ -325,7 +326,6 @@ int xc_dom_kernel_check_size(struct xc_dom_image *dom, size_t sz) + + size_t xc_dom_check_gzip(xc_interface *xch, void *blob, size_t ziplen) + { +- unsigned char *gzlen; + size_t unziplen; + + if ( ziplen < 6 ) +@@ -337,8 +337,7 @@ size_t xc_dom_check_gzip(xc_interface *xch, void *blob, size_t ziplen) + /* not gzipped */ + return 0; + +- gzlen = blob + ziplen - 4; +- unziplen = (size_t)gzlen[3] << 24 | gzlen[2] << 16 | gzlen[1] << 8 | gzlen[0]; ++ unziplen = get_unaligned_le32(blob + ziplen - 4); + if ( unziplen > XC_DOM_DECOMPRESS_MAX ) + { + xc_dom_printf +diff --git a/tools/libxc/xc_dom_decompress_lz4.c b/tools/libxc/xc_dom_decompress_lz4.c +index b6a33f27a87d..31689c7375ae 100644 +--- a/tools/libxc/xc_dom_decompress_lz4.c ++++ b/tools/libxc/xc_dom_decompress_lz4.c +@@ -3,6 +3,7 @@ + #include + #include + ++#define XG_NEED_UNALIGNED + #include "xg_private.h" + #include "xc_dom_decompress.h" + +diff --git a/tools/libxc/xg_private.h b/tools/libxc/xg_private.h +index f0a4b2c61699..aa35f82cb36b 100644 +--- a/tools/libxc/xg_private.h ++++ b/tools/libxc/xg_private.h +@@ -48,6 +48,15 @@ char *xc_inflate_buffer(xc_interface *xch, + unsigned long in_size, + unsigned long *out_size); + ++#if !defined(__MINIOS__) || defined(XG_NEED_UNALIGNED) ++ ++static inline unsigned int get_unaligned_le32(const uint8_t *buf) ++{ ++ return ((unsigned int)buf[3] << 24) | (buf[2] << 16) | (buf[1] << 8) | buf[0]; ++} ++ ++#endif /* !__MINIOS__ || XG_NEED_UNALIGNED */ ++ + unsigned long csum_page (void * page); + + #define _PAGE_PRESENT 0x001 +diff --git a/xen/common/lz4/defs.h b/xen/common/lz4/defs.h +index 4fbea2ac3dd4..10609f5a5317 100644 +--- a/xen/common/lz4/defs.h ++++ b/xen/common/lz4/defs.h +@@ -18,11 +18,6 @@ static inline u16 get_unaligned_le16(const void *p) + return le16_to_cpup(p); + } + +-static inline u32 get_unaligned_le32(const void *p) +-{ +- return le32_to_cpup(p); +-} +- + #endif + + /* +-- +2.34.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0005-libxenguest-support-zstd-compressed-kernels.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0005-libxenguest-support-zstd-compressed-kernels.patch --- 
xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0005-libxenguest-support-zstd-compressed-kernels.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0005-libxenguest-support-zstd-compressed-kernels.patch 2022-07-13 14:06:12.000000000 +0100 @@ -0,0 +1,717 @@ +From 8560749745e41507d9a0a50181fdd92a65ec50c8 Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Tue, 26 Jan 2021 14:16:34 +0100 +Subject: [PATCH 5/5] libxenguest: support zstd compressed kernels + +This follows the logic used for other decompression methods utilizing an +external library, albeit here we can't ignore the 32-bit size field +appended to the compressed image - its presence causes decompression to +fail. Leverage the field instead to allocate the output buffer in one +go, i.e. without incrementally realloc()ing. + +As far as configure.ac goes, I'm pretty sure there is a better (more +"standard") way of using PKG_CHECK_MODULES(). The construct also gets +put next to the other decompression library checks, albeit I think they +all ought to be x86-specific (e.g. placed in the existing case block a +few lines down). + +Note that, where possible, instead of #ifdef-ing xen/*.h inclusions, +they get removed. + +Signed-off-by: Jan Beulich +Acked-by: Wei Liu +Reviewed-by: Ian Jackson +Release-Acked-by: Ian Jackson + +Bug-Ubuntu: https://bugs.launchpad.net/bugs/1956166 +Origin: backport, http://xenbits.xen.org/gitweb/?p=xen.git;a=commit;h=8169f82049efb5b2044b33aa482ba3a136b7804d +[backport: + - rename 'tools/libs/guest/xg_*' to 'tools/libxc/xc_*'; + - tools/configure: drop file, removed in our package; + - tools/configure.ac, hunk 1: refresh 2 context lines; + - tools/libxc/Makefile: s/SRCS-y/GUEST_SRCS-y/; s/xg/xc/;] +--- + README | 2 + + tools/configure.ac | 2 + + .../guest/xg_dom_decompress_unsafe_zstd.c | 45 ++++++++++ + tools/libxc/Makefile | 1 + + tools/libxc/xc_dom_bzimageloader.c | 90 +++++++++++++++++++ + tools/libxc/xc_dom_decompress_unsafe.h | 2 + + xen/common/zstd/decompress.c | 67 +++++++++----- + xen/common/zstd/error_private.h | 5 -- + xen/common/zstd/fse.h | 5 -- + xen/common/zstd/fse_decompress.c | 2 - + xen/common/zstd/huf.h | 3 - + xen/common/zstd/huf_decompress.c | 2 - + xen/common/zstd/mem.h | 2 + + xen/common/zstd/zstd_internal.h | 4 + + xen/include/xen/unaligned.h | 2 + + xen/lib/xxhash64.c | 2 + + 16 files changed, 197 insertions(+), 39 deletions(-) + create mode 100644 tools/libs/guest/xg_dom_decompress_unsafe_zstd.c + +diff --git a/README b/README +index faf51dc7b657..a0953e5a73c5 100644 +--- a/README ++++ b/README +@@ -83,6 +83,8 @@ disabled at compile time: + * 16-bit x86 assembler, loader and compiler for qemu-traditional / rombios + (dev86 rpm or bin86 & bcc debs) + * Development install of liblzma for rombios ++ * Development install of libbz2, liblzma, liblzo2, and libzstd for DomU ++ kernel decompression. + + Second, you need to acquire a suitable kernel for use in domain 0. If + possible you should use a kernel provided by your OS distributor. 
If +diff --git a/tools/configure.ac b/tools/configure.ac +index 0826af8cbc40..ed46fa12c9d9 100644 +--- a/tools/configure.ac ++++ b/tools/configure.ac +@@ -366,6 +366,8 @@ AC_CHECK_LIB([lzma], [lzma_stream_decoder], [zlib="$zlib -DHAVE_LZMA -llzma"]) + AC_CHECK_HEADER([lzo/lzo1x.h], [ + AC_CHECK_LIB([lzo2], [lzo1x_decompress], [zlib="$zlib -DHAVE_LZO1X -llzo2"]) + ]) ++PKG_CHECK_MODULES([libzstd], [libzstd], ++ [zlib="$zlib -DHAVE_ZSTD $libzstd_CFLAGS $libzstd_LIBS"], [true]) + AC_SUBST(zlib) + AS_IF([test "x$enable_blktap2" = "xyes"], [ + AC_CHECK_LIB([aio], [io_setup], [], [AC_MSG_ERROR([Could not find libaio])]) +diff --git a/tools/libs/guest/xg_dom_decompress_unsafe_zstd.c b/tools/libs/guest/xg_dom_decompress_unsafe_zstd.c +new file mode 100644 +index 000000000000..52558d2ffc5b +--- /dev/null ++++ b/tools/libs/guest/xg_dom_decompress_unsafe_zstd.c +@@ -0,0 +1,45 @@ ++#include ++#include ++#include ++#include ++#include ++#include ++ ++#include "xg_private.h" ++#include "xg_dom_decompress_unsafe.h" ++ ++typedef uint8_t u8; ++ ++typedef uint16_t __u16; ++typedef uint32_t __u32; ++typedef uint64_t __u64; ++ ++typedef uint16_t __le16; ++typedef uint32_t __le32; ++typedef uint64_t __le64; ++ ++typedef uint16_t __be16; ++typedef uint32_t __be32; ++typedef uint64_t __be64; ++ ++#define __attribute_const__ ++#define __force ++#define always_inline ++#define noinline ++ ++#undef ERROR ++ ++#define __BYTEORDER_HAS_U64__ ++#define __TYPES_H__ /* xen/types.h guard */ ++#include "../../xen/include/xen/byteorder/little_endian.h" ++#define __ASM_UNALIGNED_H__ /* asm/unaligned.h guard */ ++#include "../../xen/include/xen/unaligned.h" ++#include "../../xen/include/xen/xxhash.h" ++#include "../../xen/lib/xxhash64.c" ++#include "../../xen/common/unzstd.c" ++ ++int xc_try_zstd_decode( ++ struct xc_dom_image *dom, void **blob, size_t *size) ++{ ++ return xc_dom_decompress_unsafe(unzstd, dom, blob, size); ++} +diff --git a/tools/libxc/Makefile b/tools/libxc/Makefile +index d26bf8dfac9c..46e1033c3892 100644 +--- a/tools/libxc/Makefile ++++ b/tools/libxc/Makefile +@@ -100,6 +100,7 @@ GUEST_SRCS-y += xc_dom_decompress_unsafe_bzip2.c + GUEST_SRCS-y += xc_dom_decompress_unsafe_lzma.c + GUEST_SRCS-y += xc_dom_decompress_unsafe_lzo1x.c + GUEST_SRCS-y += xc_dom_decompress_unsafe_xz.c ++GUEST_SRCS-y += xc_dom_decompress_unsafe_zstd.c + endif + + -include $(XEN_TARGET_ARCH)/Makefile +diff --git a/tools/libxc/xc_dom_bzimageloader.c b/tools/libxc/xc_dom_bzimageloader.c +index a7d70cc7c6df..ceb8a6411702 100644 +--- a/tools/libxc/xc_dom_bzimageloader.c ++++ b/tools/libxc/xc_dom_bzimageloader.c +@@ -589,6 +589,85 @@ static int xc_try_lzo1x_decode( + + #endif + ++#if defined(HAVE_ZSTD) ++ ++#include ++ ++static int xc_try_zstd_decode( ++ struct xc_dom_image *dom, void **blob, size_t *size) ++{ ++ size_t outsize, insize, actual; ++ unsigned char *outbuf; ++ ++ /* Magic, descriptor byte, and trailing size field. 
*/ ++ if ( *size <= 9 ) ++ { ++ DOMPRINTF("ZSTD: insufficient input data"); ++ return -1; ++ } ++ ++ insize = *size - 4; ++ outsize = get_unaligned_le32(*blob + insize); ++ ++ if ( xc_dom_kernel_check_size(dom, outsize) ) ++ { ++ DOMPRINTF("ZSTD: output too large"); ++ return -1; ++ } ++ ++ outbuf = malloc(outsize); ++ if ( !outbuf ) ++ { ++ DOMPRINTF("ZSTD: failed to alloc memory"); ++ return -1; ++ } ++ ++ actual = ZSTD_decompress(outbuf, outsize, *blob, insize); ++ ++ if ( ZSTD_isError(actual) ) ++ { ++ DOMPRINTF("ZSTD: error: %s", ZSTD_getErrorName(actual)); ++ free(outbuf); ++ return -1; ++ } ++ ++ if ( actual != outsize ) ++ { ++ DOMPRINTF("ZSTD: got 0x%zx bytes instead of 0x%zx", ++ actual, outsize); ++ free(outbuf); ++ return -1; ++ } ++ ++ if ( xc_dom_register_external(dom, outbuf, outsize) ) ++ { ++ DOMPRINTF("ZSTD: error registering stream output"); ++ free(outbuf); ++ return -1; ++ } ++ ++ DOMPRINTF("%s: ZSTD decompress OK, 0x%zx -> 0x%zx", ++ __FUNCTION__, insize, outsize); ++ ++ *blob = outbuf; ++ *size = outsize; ++ ++ return 0; ++} ++ ++#else /* !defined(HAVE_ZSTD) */ ++ ++static int xc_try_zstd_decode( ++ struct xc_dom_image *dom, void **blob, size_t *size) ++{ ++ xc_dom_panic(dom->xch, XC_INTERNAL_ERROR, ++ "%s: ZSTD decompress support unavailable\n", ++ __FUNCTION__); ++ return -1; ++} ++ ++#endif ++ + #else /* __MINIOS__ */ + + int xc_try_bzip2_decode(struct xc_dom_image *dom, void **blob, size_t *size); +@@ -736,6 +815,17 @@ static int xc_dom_probe_bzimage_kernel(struct xc_dom_image *dom) + return -EINVAL; + } + } ++ else if ( check_magic(dom, "\x28\xb5\x2f\xfd", 4) ) ++ { ++ ret = xc_try_zstd_decode(dom, &dom->kernel_blob, &dom->kernel_size); ++ if ( ret < 0 ) ++ { ++ xc_dom_panic(dom->xch, XC_INVALID_KERNEL, ++ "%s unable to ZSTD decompress kernel", ++ __FUNCTION__); ++ return -EINVAL; ++ } ++ } + else if ( check_magic(dom, "\135\000", 2) ) + { + ret = xc_try_lzma_decode(dom, &dom->kernel_blob, &dom->kernel_size); +diff --git a/tools/libxc/xc_dom_decompress_unsafe.h b/tools/libxc/xc_dom_decompress_unsafe.h +index 64f68864b165..22ab68da6e5b 100644 +--- a/tools/libxc/xc_dom_decompress_unsafe.h ++++ b/tools/libxc/xc_dom_decompress_unsafe.h +@@ -18,3 +18,5 @@ int xc_try_lzo1x_decode(struct xc_dom_image *dom, void **blob, size_t *size) + __attribute__((visibility("internal"))); + int xc_try_xz_decode(struct xc_dom_image *dom, void **blob, size_t *size) + __attribute__((visibility("internal"))); ++int xc_try_zstd_decode(struct xc_dom_image *dom, void **blob, size_t *size) ++ __attribute__((visibility("internal"))); +diff --git a/xen/common/zstd/decompress.c b/xen/common/zstd/decompress.c +index 3d3ef136e5c2..b0249108145c 100644 +--- a/xen/common/zstd/decompress.c ++++ b/xen/common/zstd/decompress.c +@@ -33,7 +33,6 @@ + #include "huf.h" + #include "mem.h" /* low level memory routines */ + #include "zstd_internal.h" +-#include /* memcpy, memmove, memset */ + + #define ZSTD_PREFETCH(ptr) __builtin_prefetch(ptr, 0, 0) + +@@ -99,9 +98,12 @@ struct ZSTD_DCtx_s { + BYTE headerBuffer[ZSTD_FRAMEHEADERSIZE_MAX]; + }; /* typedef'd to ZSTD_DCtx within "zstd.h" */ + +-size_t INIT ZSTD_DCtxWorkspaceBound(void) { return ZSTD_ALIGN(sizeof(ZSTD_stack)) + ZSTD_ALIGN(sizeof(ZSTD_DCtx)); } ++STATIC size_t INIT ZSTD_DCtxWorkspaceBound(void) ++{ ++ return ZSTD_ALIGN(sizeof(ZSTD_stack)) + ZSTD_ALIGN(sizeof(ZSTD_DCtx)); ++} + +-size_t INIT ZSTD_decompressBegin(ZSTD_DCtx *dctx) ++STATIC size_t INIT ZSTD_decompressBegin(ZSTD_DCtx *dctx) + { + dctx->expected = ZSTD_frameHeaderSize_prefix; + 
dctx->stage = ZSTDds_getFrameHeaderSize; +@@ -121,7 +123,7 @@ size_t INIT ZSTD_decompressBegin(ZSTD_DCtx *dctx) + return 0; + } + +-ZSTD_DCtx *INIT ZSTD_createDCtx_advanced(ZSTD_customMem customMem) ++STATIC ZSTD_DCtx *INIT ZSTD_createDCtx_advanced(ZSTD_customMem customMem) + { + ZSTD_DCtx *dctx; + +@@ -136,7 +138,7 @@ ZSTD_DCtx *INIT ZSTD_createDCtx_advanced(ZSTD_customMem customMem) + return dctx; + } + +-ZSTD_DCtx *INIT ZSTD_initDCtx(void *workspace, size_t workspaceSize) ++STATIC ZSTD_DCtx *INIT ZSTD_initDCtx(void *workspace, size_t workspaceSize) + { + ZSTD_customMem const stackMem = ZSTD_initStack(workspace, workspaceSize); + return ZSTD_createDCtx_advanced(stackMem); +@@ -150,11 +152,13 @@ size_t INIT ZSTD_freeDCtx(ZSTD_DCtx *dctx) + return 0; /* reserved as a potential error code in the future */ + } + ++#ifdef BUILD_DEAD_CODE + void INIT ZSTD_copyDCtx(ZSTD_DCtx *dstDCtx, const ZSTD_DCtx *srcDCtx) + { + size_t const workSpaceSize = (ZSTD_BLOCKSIZE_ABSOLUTEMAX + WILDCOPY_OVERLENGTH) + ZSTD_frameHeaderSize_max; + memcpy(dstDCtx, srcDCtx, sizeof(ZSTD_DCtx) - workSpaceSize); /* no need to copy workspace */ + } ++#endif + + STATIC size_t ZSTD_findFrameCompressedSize(const void *src, size_t srcSize); + STATIC size_t ZSTD_decompressBegin_usingDict(ZSTD_DCtx *dctx, const void *dict, +@@ -166,6 +170,7 @@ static void ZSTD_refDDict(ZSTD_DCtx *dstDCtx, const ZSTD_DDict *ddict); + * Decompression section + ***************************************************************/ + ++#ifdef BUILD_DEAD_CODE + /*! ZSTD_isFrame() : + * Tells if the content of `buffer` starts with a valid Frame Identifier. + * Note : Frame Identifier is 4 bytes. If `size < 4`, @return will always be 0. +@@ -184,6 +189,7 @@ unsigned INIT ZSTD_isFrame(const void *buffer, size_t size) + } + return 0; + } ++#endif + + /** ZSTD_frameHeaderSize() : + * srcSize must be >= ZSTD_frameHeaderSize_prefix. +@@ -206,7 +212,7 @@ static size_t INIT ZSTD_frameHeaderSize(const void *src, size_t srcSize) + * @return : 0, `fparamsPtr` is correctly filled, + * >0, `srcSize` is too small, result is expected `srcSize`, + * or an error code, which can be tested using ZSTD_isError() */ +-size_t INIT ZSTD_getFrameParams(ZSTD_frameParams *fparamsPtr, const void *src, size_t srcSize) ++STATIC size_t INIT ZSTD_getFrameParams(ZSTD_frameParams *fparamsPtr, const void *src, size_t srcSize) + { + const BYTE *ip = (const BYTE *)src; + +@@ -291,6 +297,7 @@ size_t INIT ZSTD_getFrameParams(ZSTD_frameParams *fparamsPtr, const void *src, s + return 0; + } + ++#ifdef BUILD_DEAD_CODE + /** ZSTD_getFrameContentSize() : + * compatible with legacy mode + * @return : decompressed size of the single frame pointed to be `src` if known, otherwise +@@ -367,6 +374,7 @@ unsigned long long INIT ZSTD_findDecompressedSize(const void *src, size_t srcSiz + return totalDstSize; + } + } ++#endif /* BUILD_DEAD_CODE */ + + /** ZSTD_decodeFrameHeader() : + * `headerSize` must be the size provided by ZSTD_frameHeaderSize(). +@@ -393,7 +401,7 @@ typedef struct { + + /*! ZSTD_getcBlockSize() : + * Provides the size of compressed block from block header `src` */ +-size_t INIT ZSTD_getcBlockSize(const void *src, size_t srcSize, blockProperties_t *bpPtr) ++STATIC size_t INIT ZSTD_getcBlockSize(const void *src, size_t srcSize, blockProperties_t *bpPtr) + { + if (srcSize < ZSTD_blockHeaderSize) + return ERROR(srcSize_wrong); +@@ -431,7 +439,7 @@ static size_t INIT ZSTD_setRleBlock(void *dst, size_t dstCapacity, const void *s + + /*! 
ZSTD_decodeLiteralsBlock() : + @return : nb of bytes read from src (< srcSize ) */ +-size_t INIT ZSTD_decodeLiteralsBlock(ZSTD_DCtx *dctx, const void *src, size_t srcSize) /* note : srcSize < BLOCKSIZE */ ++STATIC size_t INIT ZSTD_decodeLiteralsBlock(ZSTD_DCtx *dctx, const void *src, size_t srcSize) /* note : srcSize < BLOCKSIZE */ + { + if (srcSize < MIN_CBLOCK_SIZE) + return ERROR(corruption_detected); +@@ -795,7 +803,7 @@ static size_t INIT ZSTD_buildSeqTable(FSE_DTable *DTableSpace, const FSE_DTable + } + } + +-size_t INIT ZSTD_decodeSeqHeaders(ZSTD_DCtx *dctx, int *nbSeqPtr, const void *src, size_t srcSize) ++STATIC size_t INIT ZSTD_decodeSeqHeaders(ZSTD_DCtx *dctx, int *nbSeqPtr, const void *src, size_t srcSize) + { + const BYTE *const istart = (const BYTE *const)src; + const BYTE *const iend = istart + srcSize; +@@ -1481,6 +1489,7 @@ static void INIT ZSTD_checkContinuity(ZSTD_DCtx *dctx, const void *dst) + } + } + ++#ifdef BUILD_DEAD_CODE + size_t INIT ZSTD_decompressBlock(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) + { + size_t dSize; +@@ -1498,8 +1507,9 @@ size_t INIT ZSTD_insertBlock(ZSTD_DCtx *dctx, const void *blockStart, size_t blo + dctx->previousDstEnd = (const char *)blockStart + blockSize; + return blockSize; + } ++#endif /* BUILD_DEAD_CODE */ + +-size_t INIT ZSTD_generateNxBytes(void *dst, size_t dstCapacity, BYTE byte, size_t length) ++STATIC size_t INIT ZSTD_generateNxBytes(void *dst, size_t dstCapacity, BYTE byte, size_t length) + { + if (length > dstCapacity) + return ERROR(dstSize_tooSmall); +@@ -1512,7 +1522,7 @@ size_t INIT ZSTD_generateNxBytes(void *dst, size_t dstCapacity, BYTE byte, size_ + * `src` must point to the start of a ZSTD frame, ZSTD legacy frame, or skippable frame + * `srcSize` must be at least as large as the frame contained + * @return : the compressed size of the frame starting at `src` */ +-size_t INIT ZSTD_findFrameCompressedSize(const void *src, size_t srcSize) ++STATIC size_t INIT ZSTD_findFrameCompressedSize(const void *src, size_t srcSize) + { + if (srcSize >= ZSTD_skippableHeaderSize && (ZSTD_readLE32(src) & 0xFFFFFFF0U) == ZSTD_MAGIC_SKIPPABLE_START) { + return ZSTD_skippableHeaderSize + ZSTD_readLE32((const BYTE *)src + 4); +@@ -1709,12 +1719,12 @@ static size_t INIT ZSTD_decompressMultiFrame(ZSTD_DCtx *dctx, void *dst, size_t + return (BYTE *)dst - (BYTE *)dststart; + } + +-size_t INIT ZSTD_decompress_usingDict(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const void *dict, size_t dictSize) ++STATIC size_t INIT ZSTD_decompress_usingDict(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const void *dict, size_t dictSize) + { + return ZSTD_decompressMultiFrame(dctx, dst, dstCapacity, src, srcSize, dict, dictSize, NULL); + } + +-size_t INIT ZSTD_decompressDCtx(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++STATIC size_t INIT ZSTD_decompressDCtx(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) + { + return ZSTD_decompress_usingDict(dctx, dst, dstCapacity, src, srcSize, NULL, 0); + } +@@ -1723,9 +1733,12 @@ size_t INIT ZSTD_decompressDCtx(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, + * Advanced Streaming Decompression API + * Bufferless and synchronous + ****************************************/ +-size_t INIT ZSTD_nextSrcSizeToDecompress(ZSTD_DCtx *dctx) { return dctx->expected; } ++STATIC size_t INIT ZSTD_nextSrcSizeToDecompress(ZSTD_DCtx *dctx) ++{ ++ return 
dctx->expected; ++} + +-ZSTD_nextInputType_e INIT ZSTD_nextInputType(ZSTD_DCtx *dctx) ++STATIC ZSTD_nextInputType_e INIT ZSTD_nextInputType(ZSTD_DCtx *dctx) + { + switch (dctx->stage) { + default: /* should not happen */ +@@ -1745,7 +1758,7 @@ int INIT ZSTD_isSkipFrame(ZSTD_DCtx *dctx) { return dctx->stage == ZSTDds_skipFr + /** ZSTD_decompressContinue() : + * @return : nb of bytes generated into `dst` (necessarily <= `dstCapacity) + * or an error code, which can be tested using ZSTD_isError() */ +-size_t INIT ZSTD_decompressContinue(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) ++STATIC size_t INIT ZSTD_decompressContinue(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize) + { + /* Sanity check */ + if (srcSize != dctx->expected) +@@ -1971,7 +1984,7 @@ static size_t INIT ZSTD_decompress_insertDictionary(ZSTD_DCtx *dctx, const void + return ZSTD_refDictContent(dctx, dict, dictSize); + } + +-size_t INIT ZSTD_decompressBegin_usingDict(ZSTD_DCtx *dctx, const void *dict, size_t dictSize) ++STATIC size_t INIT ZSTD_decompressBegin_usingDict(ZSTD_DCtx *dctx, const void *dict, size_t dictSize) + { + CHECK_F(ZSTD_decompressBegin(dctx)); + if (dict && dictSize) +@@ -1991,7 +2004,9 @@ struct ZSTD_DDict_s { + ZSTD_customMem cMem; + }; /* typedef'd to ZSTD_DDict within "zstd.h" */ + ++#ifdef BUILD_DEAD_CODE + size_t INIT ZSTD_DDictWorkspaceBound(void) { return ZSTD_ALIGN(sizeof(ZSTD_stack)) + ZSTD_ALIGN(sizeof(ZSTD_DDict)); } ++#endif + + static const void *INIT ZSTD_DDictDictContent(const ZSTD_DDict *ddict) { return ddict->dictContent; } + +@@ -2023,6 +2038,7 @@ static void INIT ZSTD_refDDict(ZSTD_DCtx *dstDCtx, const ZSTD_DDict *ddict) + } + } + ++#ifdef BUILD_DEAD_CODE + static size_t INIT ZSTD_loadEntropy_inDDict(ZSTD_DDict *ddict) + { + ddict->dictID = 0; +@@ -2090,6 +2106,7 @@ ZSTD_DDict *INIT ZSTD_initDDict(const void *dict, size_t dictSize, void *workspa + ZSTD_customMem const stackMem = ZSTD_initStack(workspace, workspaceSize); + return ZSTD_createDDict_advanced(dict, dictSize, 1, stackMem); + } ++#endif /* BUILD_DEAD_CODE */ + + size_t INIT ZSTD_freeDDict(ZSTD_DDict *ddict) + { +@@ -2103,6 +2120,7 @@ size_t INIT ZSTD_freeDDict(ZSTD_DDict *ddict) + } + } + ++#ifdef BUILD_DEAD_CODE + /*! ZSTD_getDictID_fromDict() : + * Provides the dictID stored within dictionary. + * if @return == 0, the dictionary is not conformant with Zstandard specification. +@@ -2145,11 +2163,12 @@ unsigned INIT ZSTD_getDictID_fromFrame(const void *src, size_t srcSize) + return 0; + return zfp.dictID; + } ++#endif /* BUILD_DEAD_CODE */ + + /*! ZSTD_decompress_usingDDict() : + * Decompression using a pre-digested Dictionary + * Use dictionary without significant overhead. 
*/ +-size_t INIT ZSTD_decompress_usingDDict(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const ZSTD_DDict *ddict) ++STATIC size_t INIT ZSTD_decompress_usingDDict(ZSTD_DCtx *dctx, void *dst, size_t dstCapacity, const void *src, size_t srcSize, const ZSTD_DDict *ddict) + { + /* pass content and size in case legacy frames are encountered */ + return ZSTD_decompressMultiFrame(dctx, dst, dstCapacity, src, srcSize, NULL, 0, ddict); +@@ -2186,7 +2205,7 @@ struct ZSTD_DStream_s { + U32 hostageByte; + }; /* typedef'd to ZSTD_DStream within "zstd.h" */ + +-size_t INIT ZSTD_DStreamWorkspaceBound(size_t maxWindowSize) ++STATIC size_t INIT ZSTD_DStreamWorkspaceBound(size_t maxWindowSize) + { + size_t const blockSize = MIN(maxWindowSize, ZSTD_BLOCKSIZE_ABSOLUTEMAX); + size_t const inBuffSize = blockSize; +@@ -2216,7 +2235,7 @@ static ZSTD_DStream *INIT ZSTD_createDStream_advanced(ZSTD_customMem customMem) + return zds; + } + +-ZSTD_DStream *INIT ZSTD_initDStream(size_t maxWindowSize, void *workspace, size_t workspaceSize) ++STATIC ZSTD_DStream *INIT ZSTD_initDStream(size_t maxWindowSize, void *workspace, size_t workspaceSize) + { + ZSTD_customMem const stackMem = ZSTD_initStack(workspace, workspaceSize); + ZSTD_DStream *zds = ZSTD_createDStream_advanced(stackMem); +@@ -2249,6 +2268,7 @@ ZSTD_DStream *INIT ZSTD_initDStream(size_t maxWindowSize, void *workspace, size_ + return zds; + } + ++#ifdef BUILD_DEAD_CODE + ZSTD_DStream *INIT ZSTD_initDStream_usingDDict(size_t maxWindowSize, const ZSTD_DDict *ddict, void *workspace, size_t workspaceSize) + { + ZSTD_DStream *zds = ZSTD_initDStream(maxWindowSize, workspace, workspaceSize); +@@ -2257,6 +2277,7 @@ ZSTD_DStream *INIT ZSTD_initDStream_usingDDict(size_t maxWindowSize, const ZSTD_ + } + return zds; + } ++#endif + + size_t INIT ZSTD_freeDStream(ZSTD_DStream *zds) + { +@@ -2279,10 +2300,12 @@ size_t INIT ZSTD_freeDStream(ZSTD_DStream *zds) + + /* *** Initialization *** */ + ++#ifdef BUILD_DEAD_CODE + size_t INIT ZSTD_DStreamInSize(void) { return ZSTD_BLOCKSIZE_ABSOLUTEMAX + ZSTD_blockHeaderSize; } + size_t INIT ZSTD_DStreamOutSize(void) { return ZSTD_BLOCKSIZE_ABSOLUTEMAX; } ++#endif + +-size_t INIT ZSTD_resetDStream(ZSTD_DStream *zds) ++STATIC size_t INIT ZSTD_resetDStream(ZSTD_DStream *zds) + { + zds->stage = zdss_loadHeader; + zds->lhSize = zds->inPos = zds->outStart = zds->outEnd = 0; +@@ -2300,7 +2323,7 @@ ZSTD_STATIC size_t INIT ZSTD_limitCopy(void *dst, size_t dstCapacity, const void + return length; + } + +-size_t INIT ZSTD_decompressStream(ZSTD_DStream *zds, ZSTD_outBuffer *output, ZSTD_inBuffer *input) ++STATIC size_t INIT ZSTD_decompressStream(ZSTD_DStream *zds, ZSTD_outBuffer *output, ZSTD_inBuffer *input) + { + const char *const istart = (const char *)(input->src) + input->pos; + const char *const iend = (const char *)(input->src) + input->size; +diff --git a/xen/common/zstd/error_private.h b/xen/common/zstd/error_private.h +index d07bf3cb9b55..906d537e0844 100644 +--- a/xen/common/zstd/error_private.h ++++ b/xen/common/zstd/error_private.h +@@ -19,11 +19,6 @@ + #ifndef ERROR_H_MODULE + #define ERROR_H_MODULE + +-/* **************************************** +-* Dependencies +-******************************************/ +-#include /* size_t */ +- + /** + * enum ZSTD_ErrorCode - zstd error codes + * +diff --git a/xen/common/zstd/fse.h b/xen/common/zstd/fse.h +index b86717c34d0f..5761e09f17ff 100644 +--- a/xen/common/zstd/fse.h ++++ b/xen/common/zstd/fse.h +@@ -40,11 +40,6 @@ + #ifndef FSE_H + #define FSE_H + 
+-/*-***************************************** +-* Dependencies +-******************************************/ +-#include /* size_t, ptrdiff_t */ +- + /*-***************************************** + * FSE_PUBLIC_API : control library symbols visibility + ******************************************/ +diff --git a/xen/common/zstd/fse_decompress.c b/xen/common/zstd/fse_decompress.c +index cc51206df614..6c61e9002e62 100644 +--- a/xen/common/zstd/fse_decompress.c ++++ b/xen/common/zstd/fse_decompress.c +@@ -48,8 +48,6 @@ + #include "bitstream.h" + #include "fse.h" + #include "zstd_internal.h" +-#include +-#include /* memcpy, memset */ + + /* ************************************************************** + * Error Management +diff --git a/xen/common/zstd/huf.h b/xen/common/zstd/huf.h +index a9d522c7bb7b..a498e0de2871 100644 +--- a/xen/common/zstd/huf.h ++++ b/xen/common/zstd/huf.h +@@ -40,9 +40,6 @@ + #ifndef HUF_H_298734234 + #define HUF_H_298734234 + +-/* *** Dependencies *** */ +-#include /* size_t */ +- + /* *** Tool functions *** */ + #define HUF_BLOCKSIZE_MAX (128 * 1024) /**< maximum input size for a single block compressed with HUF_compress */ + size_t HUF_compressBound(size_t size); /**< maximum compressed size (worst case) */ +diff --git a/xen/common/zstd/huf_decompress.c b/xen/common/zstd/huf_decompress.c +index 341619e64246..f6aca709a6dd 100644 +--- a/xen/common/zstd/huf_decompress.c ++++ b/xen/common/zstd/huf_decompress.c +@@ -48,8 +48,6 @@ + #include "bitstream.h" /* BIT_* */ + #include "fse.h" /* header compression */ + #include "huf.h" +-#include +-#include /* memcpy, memset */ + + /* ************************************************************** + * Error Management +diff --git a/xen/common/zstd/mem.h b/xen/common/zstd/mem.h +index 288320069654..2acae6a8edc8 100644 +--- a/xen/common/zstd/mem.h ++++ b/xen/common/zstd/mem.h +@@ -20,9 +20,11 @@ + /*-**************************************** + * Dependencies + ******************************************/ ++#ifdef __XEN__ + #include /* memcpy */ + #include /* size_t, ptrdiff_t */ + #include ++#endif + + /*-**************************************** + * Compiler specifics +diff --git a/xen/common/zstd/zstd_internal.h b/xen/common/zstd/zstd_internal.h +index 7f8e5529ebfa..caa7aab40699 100644 +--- a/xen/common/zstd/zstd_internal.h ++++ b/xen/common/zstd/zstd_internal.h +@@ -28,8 +28,10 @@ + ***************************************/ + #include "error_private.h" + #include "mem.h" ++#ifdef __XEN__ + #include + #include ++#endif + + #define ALIGN(x, a) ((x + (a) - 1) & ~((a) - 1)) + #define PTR_ALIGN(p, a) ((typeof(p))ALIGN((unsigned long)(p), (a))) +@@ -95,8 +97,10 @@ typedef struct ZSTD_DStream_s ZSTD_DStream; + /*-************************************* + * shared macros + ***************************************/ ++#ifndef MIN + #define MIN(a, b) ((a) < (b) ? (a) : (b)) + #define MAX(a, b) ((a) > (b) ? 
(a) : (b)) ++#endif + #define CHECK_F(f) \ + { \ + size_t const errcod = f; \ +diff --git a/xen/include/xen/unaligned.h b/xen/include/xen/unaligned.h +index eef7ec73b658..0a2b16d05d92 100644 +--- a/xen/include/xen/unaligned.h ++++ b/xen/include/xen/unaligned.h +@@ -10,8 +10,10 @@ + #ifndef __XEN_UNALIGNED_H__ + #define __XEN_UNALIGNED_H__ + ++#ifdef __XEN__ + #include + #include ++#endif + + #define get_unaligned(p) (*(p)) + #define put_unaligned(val, p) (*(p) = (val)) +diff --git a/xen/lib/xxhash64.c b/xen/lib/xxhash64.c +index ba6bcf152d6f..481e76fbcf4c 100644 +--- a/xen/lib/xxhash64.c ++++ b/xen/lib/xxhash64.c +@@ -38,11 +38,13 @@ + * - xxHash source repository: https://github.com/Cyan4973/xxHash + */ + ++#ifdef __XEN__ + #include + #include + #include + #include + #include ++#endif + + /*-************************************* + * Macros +-- +2.34.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0006-fix-ftbfs-arm-lzo-unaligned.h.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0006-fix-ftbfs-arm-lzo-unaligned.h.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0006-fix-ftbfs-arm-lzo-unaligned.h.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/lp1956166-0006-fix-ftbfs-arm-lzo-unaligned.h.patch 2022-07-13 14:06:46.000000000 +0100 @@ -0,0 +1,50 @@ +Bug-Ubuntu: https://bugs.launchpad.net/bugs/1956166 +Description: Fix FTBFS on armhf/arm64 due to + Patch lp1956166-0001-introduce-unaligned.h.patch builds fine + on x86, but it fails to build from source on armhf and arm64: + . + lzo.c:100:10: fatal error: asm/unaligned.h: No such file or directory + . + The header is only available on x86, but arm + also builds lzo.c in Xen 4.11 ('obj-y' in xen/common/Makefile). + This isn't the case in Xen 4.15 (original release of the patch), + where lzo.c is obj-$(CONFIG_X86)-based. + . + So, make the lzo.c changes in the patch more conditional to x86, + keeping the (local) previous unaligned macro definitions on arm. + . + This keeps the spirit of the upstream patch (which is x86-only), + and is backwards compatible with 4.11 code. 
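Illustrative aside (a paraphrased sketch, not the literal hunk below): the include/macro layout this fix produces in xen/common/lzo.c keeps the asm/unaligned.h path, whose header exists only under asm-x86, guarded by CONFIG_X86, while non-x86 builds (armhf/arm64) retain the plain-dereference fallbacks Xen 4.11 already carried locally:

    #ifdef CONFIG_X86
    # ifdef __XEN__
    #  include <asm/unaligned.h>             /* present on x86 only */
    # else
    #  define get_unaligned_le16(_p)  (*(u16 *)(_p))
    # endif
    #endif

    #ifndef CONFIG_X86                       /* armhf/arm64: keep the old local macros */
    # define get_unaligned(_p)        (*(_p))
    # define put_unaligned(_val, _p)  (*(_p) = (_val))
    # define get_unaligned_le16(_p)   (*(u16 *)(_p))
    # define get_unaligned_le32(_p)   (*(u32 *)(_p))
    #endif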
+Author: Mauricio Faria de Oliveira +Forwarded: not-needed +Last-Update: 2022-07-07 +--- +This patch header follows DEP-3: http://dep.debian.net/deps/dep3/ +Index: xen/xen/common/lzo.c +=================================================================== +--- xen.orig/xen/common/lzo.c ++++ xen/xen/common/lzo.c +@@ -97,12 +97,23 @@ + #ifdef __XEN__ + #include + #include ++#endif ++ ++#ifdef CONFIG_X86 ++#ifdef __XEN__ + #include + #else + #define get_unaligned_le16(_p) (*(u16 *)(_p)) + #endif ++#endif + + #include ++#ifndef CONFIG_X86 ++#define get_unaligned(_p) (*(_p)) ++#define put_unaligned(_val,_p) (*(_p)=_val) ++#define get_unaligned_le16(_p) (*(u16 *)(_p)) ++#define get_unaligned_le32(_p) (*(u32 *)(_p)) ++#endif + + static noinline size_t + lzo1x_1_do_compress(const unsigned char *in, size_t in_len, diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/series xen-4.11.3+24-g14b62ab3e5/debian/patches/series --- xen-4.11.3+24-g14b62ab3e5/debian/patches/series 2020-03-09 14:46:02.000000000 +0000 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/series 2022-07-13 14:06:46.000000000 +0100 @@ -52,3 +52,162 @@ 1000-flags-fcs-protect-none.patch 1001-strip-note-gnu-property.patch +xen-split-parameter-related-definitions-in-own-header-file.patch +xsa312-4.11.patch +xsa313-1.patch +xsa313-2.patch +xsa314-4.13.patch +xsa316-xen.patch +xsa317.patch +xsa318.patch +xsa319.patch +xsa320-4.11-1.patch +xsa320-4.11-2.patch +xsa320-4.11-3.patch +xsa328-4.11-1.patch +xsa328-4.11-2.patch +xsa321-4.11-1.patch +xsa321-4.11-2.patch +xsa321-4.11-3.patch +xsa321-4.11-4.patch +xsa321-4.11-5.patch +xsa321-4.11-6.patch +xsa321-4.11-7.patch +0001-tools-xenstore-allow-removing-child-of-a-node-exceed.patch +0002-tools-xenstore-ignore-transaction-id-for-un-watch.patch +0003-tools-xenstore-fix-node-accounting-after-failed-node.patch +0004-tools-xenstore-simplify-and-rename-check_event_node.patch +0005-tools-xenstore-check-privilege-for-XS_IS_DOMAIN_INTR.patch +0006-tools-xenstore-rework-node-removal.patch +0007-tools-xenstore-fire-watches-only-when-removing-a-spe.patch +0008-tools-xenstore-introduce-node_perms-structure.patch +0009-tools-xenstore-allow-special-watches-for-privileged-.patch +0010-tools-xenstore-avoid-watch-events-for-nodes-without-.patch +0001-tools-ocaml-xenstored-ignore-transaction-id-for-un-w.patch +0002-tools-ocaml-xenstored-check-privilege-for-XS_IS_DOMA.patch +0003-tools-ocaml-xenstored-unify-watch-firing.patch +0004-tools-ocaml-xenstored-introduce-permissions-for-spec.patch +0005-tools-ocaml-xenstored-avoid-watch-events-for-nodes-w.patch +0006-tools-ocaml-xenstored-add-xenstored.conf-flag-to-tur.patch +xsa322-4.12-c.patch +xsa322-4.11-o.patch +xsa323.patch +xsa324.patch +xsa325-4.14.patch +xsa327.patch +xsa330.patch +xsa333.patch +xsa336-4.11.patch +xsa337-4.12-1.patch +xsa337-4.12-2.patch +xsa338.patch +xsa339.patch +xsa340.patch +xsa342-4.13.patch +xsa343-4.11-1.patch +xsa343-4.11-2.patch +xsa343-4.11-3.patch +xsa344-4.11-1.patch +xsa344-4.11-2.patch +0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch +0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch +0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch +xsa346-4.11-1.patch +xsa346-4.11-2.patch +xsa347-4.11-1.patch +xsa347-4.11-2.patch +xsa348-4.11.patch +xsa351-arm.patch +xsa351-x86-4.11-1.patch +xsa351-x86-4.11-2.patch +xsa352.patch +xsa353.patch +xsa355.patch +evtchn-fifo-use-stable-fields-when-recording-last-queue-information.patch +xen-evtchn-rework-per-event-channel-lock.patch 
+xen-events-access-last_priority-and-last_vcpu_id-together.patch +fix_event_channel_race.patch +xsa358-4.14.patch +xsa359.patch +xsa364.patch +xsa366-4.11.patch +xsa373-4.11-1.patch +0001-SUPPORT.md-Document-speculative-attacks-status-of-no.patch +x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch +0002-SUPPORT.md-Un-shimmed-32-bit-PV-guests-are-no-longer.patch +xsa373-4.11-2.patch +xsa373-4.11-3.patch +xsa373-4.11-4.patch +xsa373-4.11-5.patch +xsa375-4.12.patch +xsa377-4.11.patch +xsa378-4.11-0a.patch +xsa378-4.11-0b.patch +xsa378-4.11-0c.patch +xsa378-4.11-1.patch +xsa378-4.11-2.patch +xsa378-4.11-3.patch +xsa378-4.11-4.patch +xsa378-4.11-5.patch +AMD-IOMMU-fix-off-by-one-in-amd_iommu_get_paging_mode-callers.patch +xsa378-4.11-6.patch +xsa378-4.11-7.patch +xsa378-4.11-8.patch +xsa379-4.12.patch +xsa380-4.11-1.patch +xsa380-4.11-2.patch +xsa380-3.patch +xsa382.patch +xsa384-4.11.patch +amd-iommu-get-rid-of-pointless-IOMMU_PAGING_MODE_LEVEL_X-definitions.patch +xsa385-4.12.patch +xsa388-4.14-1.patch +xsa388-4.14-2.patch +xsa389-4.12.patch +xsa394-4.12.patch +xsa395-4.14.patch +xsa397-4.12.patch +xsa398-4.12-1-xen-arm-Introduce-new-Arm-processors.patch +xsa398-4.12-2-xen-arm-move-errata-CSV2-check-earlier.patch +xsa398-4.12-3-xen-arm-Add-ECBHB-and-CLEARBHB-ID-fields.patch +xen-arm64-entry-Use-named-label-in-guest_sync.patch +xen-arm-Add-ARCH_WORKAROUND_2-probing.patch +xen-arm-Add-command-line-option-to-control-SSBD-mitigation.patch +xen-arm-Add-ARCH_WORKAROUND_2-support-for-guests.patch +xen-arm64-Add-generic-assembly-macros.patch +xen-arm-Simplify-alternative-patching-of-non-writable-region.patch +xen-arm-alternatives-Add-dynamic-patching-feature.patch +xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch +xsa398-4.12-4-xen-arm-Add-Spectre-BHB-handling.patch +xsa398-4.12-5-xen-arm-Allow-to-discover-and-use-SMCCC_ARCH_WORKARO.patch +xsa398-4.12-6-x86-spec-ctrl-Cease-using-thunk-lfence-on-AMD.patch +xsa399-4.12.patch +xsa400-4.12-00.patch +xsa400-4.12-01.patch +xsa400-4.12-02.patch +xsa400-4.12-03.patch +VT-d-dont-pass-bridge-devices-to-domain_context_mapping_one.patch +xsa400-4.12-04.patch +xsa400-4.12-05.patch +xsa400-4.12-06.patch +xsa400-4.12-07.patch +xsa400-4.12-08.patch +xsa400-4.12-09.patch +xsa400-4.12-10.patch +xsa400-4.12-11.patch +xsa401-4.13-1.patch +xsa401-4.13-2.patch +xsa402-4.13-1.patch +xsa402-4.13-2.patch +xsa402-4.13-3.patch +x86-feature-Generalise-synth-and-introduce-a-bug-word.patch +x86-AMD-Fix-handling-of-x87-exception-pointers-on-Fam17h-hardware.patch +xsa402-4.13-4.patch +xsa402-4.13-5.patch +x86-cpu-intel-Clear-cache-self-snoop-capability-in-CPUs-with-known-errata.patch +lp1956166-0001-introduce-unaligned.h.patch +lp1956166-0002-lib-introduce-xxhash.patch +lp1956166-0003-x86-Dom0-support-zstd-compressed-kernels.patch +lp1956166-0004-libxenguest-add-get_unaligned_le32.patch +lp1956166-0005-libxenguest-support-zstd-compressed-kernels.patch +lp1956166-0006-fix-ftbfs-arm-lzo-unaligned.h.patch diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/VT-d-dont-pass-bridge-devices-to-domain_context_mapping_one.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/VT-d-dont-pass-bridge-devices-to-domain_context_mapping_one.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/VT-d-dont-pass-bridge-devices-to-domain_context_mapping_one.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/VT-d-dont-pass-bridge-devices-to-domain_context_mapping_one.patch 2022-06-06 12:26:23.000000000 +0100 @@ -0,0 +1,63 @@ +From 
b9063ce924bb37986762d33a48c174348c38b61a Mon Sep 17 00:00:00 2001 +From: Jan Beulich +Date: Thu, 5 Mar 2020 11:16:46 +0100 +Subject: [PATCH] VT-d: don't pass bridge devices to + domain_context_mapping_one() +MIME-Version: 1.0 +Content-Type: text/plain; charset=utf8 +Content-Transfer-Encoding: 8bit + +When passed a non-NULL pdev, the function does an owner check when it +finds an already existing context mapping. Bridges, however, don't get +passed through to guests, and hence their owner is always going to be +Dom0, leading to the assigment of all but one of the function of multi- +function PCI devices behind bridges to fail. + +Reported-by: Marek Marczykowski-Górecki +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné +Reviewed-by: Kevin Tian +master commit: a4d457fd59f4ebfb524aec82cb6a3030087914ca +master date: 2020-01-22 16:39:58 +0100 +--- + xen/drivers/passthrough/vtd/iommu.c | 14 ++++++++++++-- + 1 file changed, 12 insertions(+), 2 deletions(-) + +diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c +index 576e72eba1..77ba8e14a6 100644 +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -1536,18 +1536,28 @@ static int domain_context_mapping(struct domain *domain, u8 devfn, + if ( find_upstream_bridge(seg, &bus, &devfn, &secbus) < 1 ) + break; + ++ /* ++ * Mapping a bridge should, if anything, pass the struct pci_dev of ++ * that bridge. Since bridges don't normally get assigned to guests, ++ * their owner would be the wrong one. Pass NULL instead. ++ */ + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pci_get_pdev(seg, bus, devfn)); ++ NULL); + + /* + * Devices behind PCIe-to-PCI/PCIx bridge may generate different + * requester-id. It may originate from devfn=0 on the secondary bus + * behind the bridge. Map that id as well if we didn't already. ++ * ++ * Somewhat similar as for bridges, we don't want to pass a struct ++ * pci_dev here - there may not even exist one for this (secbus,0,0) ++ * tuple. If there is one, without properly working device groups it ++ * may again not have the correct owner. + */ + if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE && + (secbus != pdev->bus || pdev->devfn != 0) ) + ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0, +- pci_get_pdev(seg, secbus, 0)); ++ NULL); + + break; + +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-AMD-Fix-handling-of-x87-exception-pointers-on-Fam17h-hardware.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-AMD-Fix-handling-of-x87-exception-pointers-on-Fam17h-hardware.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-AMD-Fix-handling-of-x87-exception-pointers-on-Fam17h-hardware.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-AMD-Fix-handling-of-x87-exception-pointers-on-Fam17h-hardware.patch 2022-06-16 10:27:25.000000000 +0100 @@ -0,0 +1,190 @@ +From d2a95f1c3ef96f47840ab172278293e55c4fc430 Mon Sep 17 00:00:00 2001 +From: Andrew Cooper +Date: Thu, 27 Dec 2018 15:14:01 +0000 +Subject: [PATCH] x86/AMD: Fix handling of x87 exception pointers on Fam17h + hardware + +AMD Pre-Fam17h CPUs "optimise" {F,}X{SAVE,RSTOR} by not saving/restoring +FOP/FIP/FDP if an x87 exception isn't pending. This causes an information +leak, CVE-2006-1056, and worked around by several OSes, including Xen. AMD +Fam17h CPUs no longer have this leak, and advertise so in a CPUID bit. 
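Illustrative aside (not part of the upstream patch): the new feature lands in CPUID leaf 0x80000008, EBX bit 2, as reflected in the libxl table change below. A stand-alone sketch of probing such a bit from C; the cpuid() wrapper is local to the example and not a Xen or libxl interface:

    #include <stdbool.h>
    #include <stdint.h>

    static void cpuid(uint32_t leaf, uint32_t *eax, uint32_t *ebx,
                      uint32_t *ecx, uint32_t *edx)
    {
        asm volatile ( "cpuid"
                       : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                       : "0" (leaf), "2" (0) );
    }

    static bool cpu_has_rstr_fp_err_ptrs(void)
    {
        uint32_t eax, ebx, ecx, edx;

        cpuid(0x80000000u, &eax, &ebx, &ecx, &edx);
        if ( eax < 0x80000008u )
            return false;              /* extended leaf not implemented */

        cpuid(0x80000008u, &eax, &ebx, &ecx, &edx);
        return ebx & (1u << 2);        /* rstr-fp-err-ptrs */
    }

On hardware where this reports the bit as absent (pre-Fam17h AMD), the patch sets the new X86_BUG_FPU_PTRS synthetic bit and Xen keeps applying the fnclex/ffree/fildl workaround on the (F)XRSTOR path.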
+ +Introduce the RSTR_FP_ERR_PTRS feature, as specified by AMD, and expose to all +guests by default. While adjusting libxl's cpuid table, add CLZERO which +looks to have been omitted previously. + +Also introduce an X86_BUG bit to trigger the (F)XRSTOR workaround, and set it +on AMD hardware where RSTR_FP_ERR_PTRS is not advertised. Optimise the +conditions for the workaround paths. + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich +--- + tools/libxl/libxl_cpuid.c | 3 +++ + tools/misc/xen-cpuid.c | 1 + + xen/arch/x86/cpu/amd.c | 7 +++++++ + xen/arch/x86/i387.c | 16 +++++++--------- + xen/arch/x86/xstate.c | 7 +++---- + xen/include/asm-x86/cpufeature.h | 3 +++ + xen/include/asm-x86/cpufeatures.h | 2 ++ + xen/include/public/arch-x86/cpufeatureset.h | 1 + + 8 files changed, 27 insertions(+), 13 deletions(-) + +diff --git a/tools/libxl/libxl_cpuid.c b/tools/libxl/libxl_cpuid.c +index f1c6ce2076..953a3bbd8c 100644 +--- a/tools/libxl/libxl_cpuid.c ++++ b/tools/libxl/libxl_cpuid.c +@@ -246,7 +246,11 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *cpuid, const char* str) + + {"invtsc", 0x80000007, NA, CPUID_REG_EDX, 8, 1}, + ++ {"clzero", 0x80000008, NA, CPUID_REG_EBX, 0, 1}, ++ {"rstr-fp-err-ptrs", 0x80000008, NA, CPUID_REG_EBX, 2, 1}, ++ {"wbnoinvd", 0x80000008, NA, CPUID_REG_EBX, 9, 1}, + {"ibpb", 0x80000008, NA, CPUID_REG_EBX, 12, 1}, ++ + {"nc", 0x80000008, NA, CPUID_REG_ECX, 0, 8}, + {"apicidsize", 0x80000008, NA, CPUID_REG_ECX, 12, 4}, + +diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c +index be6a8d27a5..f51facffb6 100644 +--- a/tools/misc/xen-cpuid.c ++++ b/tools/misc/xen-cpuid.c +@@ -135,7 +135,10 @@ static const char *str_e7d[32] = + static const char *str_e8b[32] = + { + [ 0] = "clzero", ++ [ 2] = "rstr-fp-err-ptrs", + ++ /* [ 8] */ [ 9] = "wbnoinvd", ++ + [12] = "ibpb", + }; + +diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c +index a2f83c79a5..fec2830c6a 100644 +--- a/xen/arch/x86/cpu/amd.c ++++ b/xen/arch/x86/cpu/amd.c +@@ -569,6 +569,13 @@ static void init_amd(struct cpuinfo_x86 *c) + wrmsr_amd_safe(0xc001100d, l, h & ~1); + } + ++ /* ++ * Older AMD CPUs don't save/load FOP/FIP/FDP unless an FPU exception ++ * is pending. Xen works around this at (F)XRSTOR time. ++ */ ++ if (!cpu_has(c, X86_FEATURE_RSTR_FP_ERR_PTRS)) ++ setup_force_cpu_cap(X86_BUG_FPU_PTRS); ++ + /* + * Attempt to set lfence to be Dispatch Serialising. This MSR almost + * certainly isn't virtualised (and Xen at least will leak the real +diff --git a/xen/arch/x86/i387.c b/xen/arch/x86/i387.c +index 88178485cb..677f571792 100644 +--- a/xen/arch/x86/i387.c ++++ b/xen/arch/x86/i387.c +@@ -43,20 +43,18 @@ static inline void fpu_fxrstor(struct vcpu *v) + const typeof(v->arch.xsave_area->fpu_sse) *fpu_ctxt = v->arch.fpu_ctxt; + + /* +- * AMD CPUs don't save/restore FDP/FIP/FOP unless an exception ++ * Some CPUs don't save/restore FDP/FIP/FOP unless an exception + * is pending. Clear the x87 state here by setting it to fixed + * values. The hypervisor data segment can be sometimes 0 and + * sometimes new user value. Both should be ok. Use the FPU saved + * data block as a safe address because it should be in L1. 
+ */ +- if ( !(fpu_ctxt->fsw & ~fpu_ctxt->fcw & 0x003f) && +- boot_cpu_data.x86_vendor == X86_VENDOR_AMD ) +- { ++ if ( cpu_bug_fpu_ptrs && ++ !(fpu_ctxt->fsw & ~fpu_ctxt->fcw & 0x003f) ) + asm volatile ( "fnclex\n\t" + "ffree %%st(7)\n\t" /* clear stack tag */ + "fildl %0" /* load to clear state */ + : : "m" (*fpu_ctxt) ); +- } + + /* + * FXRSTOR can fault if passed a corrupted data block. We handle this +@@ -169,11 +167,11 @@ static inline void fpu_fxsave(struct vcpu *v) + : "=m" (*fpu_ctxt) : "R" (fpu_ctxt) ); + + /* +- * AMD CPUs don't save/restore FDP/FIP/FOP unless an exception +- * is pending. ++ * Some CPUs don't save/restore FDP/FIP/FOP unless an exception is ++ * pending. In this case, the restore side will arrange safe values, ++ * and there is no point trying to collect FCS/FDS in addition. + */ +- if ( !(fpu_ctxt->fsw & 0x0080) && +- boot_cpu_data.x86_vendor == X86_VENDOR_AMD ) ++ if ( cpu_bug_fpu_ptrs && !(fpu_ctxt->fsw & 0x0080) ) + return; + + /* +diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c +index 3293ef834f..10016a05d0 100644 +--- a/xen/arch/x86/xstate.c ++++ b/xen/arch/x86/xstate.c +@@ -369,15 +369,14 @@ void xrstor(struct vcpu *v, uint64_t mask) + unsigned int faults, prev_faults; + + /* +- * AMD CPUs don't save/restore FDP/FIP/FOP unless an exception ++ * Some CPUs don't save/restore FDP/FIP/FOP unless an exception + * is pending. Clear the x87 state here by setting it to fixed + * values. The hypervisor data segment can be sometimes 0 and + * sometimes new user value. Both should be ok. Use the FPU saved + * data block as a safe address because it should be in L1. + */ +- if ( (mask & ptr->xsave_hdr.xstate_bv & X86_XCR0_FP) && +- !(ptr->fpu_sse.fsw & ~ptr->fpu_sse.fcw & 0x003f) && +- boot_cpu_data.x86_vendor == X86_VENDOR_AMD ) ++ if ( cpu_bug_fpu_ptrs && ++ !(ptr->fpu_sse.fsw & ~ptr->fpu_sse.fcw & 0x003f) ) + asm volatile ( "fnclex\n\t" /* clear exceptions */ + "ffree %%st(7)\n\t" /* clear stack tag */ + "fildl %0" /* load to clear state */ +diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h +index 7e1ff17ad4..00d22caac7 100644 +--- a/xen/include/asm-x86/cpufeature.h ++++ b/xen/include/asm-x86/cpufeature.h +@@ -117,6 +117,9 @@ + #define cpu_has_no_xpti boot_cpu_has(X86_FEATURE_NO_XPTI) + #define cpu_has_xen_lbr boot_cpu_has(X86_FEATURE_XEN_LBR) + ++/* Bugs. */ ++#define cpu_bug_fpu_ptrs boot_cpu_has(X86_BUG_FPU_PTRS) ++ + enum _cache_type { + CACHE_TYPE_NULL = 0, + CACHE_TYPE_DATA = 1, +diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h +index ab3650f73b..91eccf5161 100644 +--- a/xen/include/asm-x86/cpufeatures.h ++++ b/xen/include/asm-x86/cpufeatures.h +@@ -43,5 +43,7 @@ XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ + #define X86_NR_BUG 1 + #define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x)) + ++#define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */ ++ + /* Total number of capability words, inc synth and bug words. 
*/ + #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */ +diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h +index f2ec470179..48d8d1f4e2 100644 +--- a/xen/include/public/arch-x86/cpufeatureset.h ++++ b/xen/include/public/arch-x86/cpufeatureset.h +@@ -237,6 +237,8 @@ XEN_CPUFEATURE(EFRO, 7*32+10) /* APERF/MPERF Read Only interface */ + + /* AMD-defined CPU features, CPUID level 0x80000008.ebx, word 8 */ + XEN_CPUFEATURE(CLZERO, 8*32+ 0) /*A CLZERO instruction */ ++XEN_CPUFEATURE(RSTR_FP_ERR_PTRS, 8*32+ 2) /*A (F)X{SAVE,RSTOR} always saves/restores FPU Error pointers */ ++XEN_CPUFEATURE(WBNOINVD, 8*32+ 9) /* WBNOINVD instruction */ + XEN_CPUFEATURE(IBPB, 8*32+12) /*A IBPB support only (no IBRS, used by AMD) */ + + /* Intel-defined CPU features, CPUID level 0x00000007:0.edx, word 9 */ +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-cpu-intel-Clear-cache-self-snoop-capability-in-CPUs-with-known-errata.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-cpu-intel-Clear-cache-self-snoop-capability-in-CPUs-with-known-errata.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-cpu-intel-Clear-cache-self-snoop-capability-in-CPUs-with-known-errata.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-cpu-intel-Clear-cache-self-snoop-capability-in-CPUs-with-known-errata.patch 2022-06-16 22:03:12.000000000 +0100 @@ -0,0 +1,100 @@ +From f2663ca2e5203bfa082b1d6d2721ad369e00426a Mon Sep 17 00:00:00 2001 +From: Ricardo Neri +Date: Fri, 19 Jul 2019 13:50:38 +0200 +Subject: [PATCH] x86/cpu/intel: Clear cache self-snoop capability in CPUs with + known errata + +From: Ricardo Neri + +Processors which have self-snooping capability can handle conflicting +memory type across CPUs by snooping its own cache. However, there exists +CPU models in which having conflicting memory types still leads to +unpredictable behavior, machine check errors, or hangs. + +Clear this feature on affected CPUs to prevent its use. + +Suggested-by: Alan Cox +Signed-off-by: Ricardo Neri +[Linux commit 1e03bff3600101bd9158d005e4313132e55bdec8] + +Strip Yonah - as per ark.intel.com it doesn't look to be 64-bit capable. +Call the new function on the boot CPU only. Don't clear the CPU feature +flag itself, as it is exposed to guests (who could otherwise observe it +disappear after migration). + +Requested-by: Andrew Cooper +Signed-off-by: Jan Beulich +--- + xen/arch/x86/cpu/intel.c | 35 ++++++++++++++++++++++++++++++- + xen/include/asm-x86/cpufeatures.h | 1 + + 2 files changed, 35 insertions(+), 1 deletion(-) + +diff --git a/xen/arch/x86/cpu/intel.c b/xen/arch/x86/cpu/intel.c +index 0dd8f98607..5356a6ae10 100644 +--- a/xen/arch/x86/cpu/intel.c ++++ b/xen/arch/x86/cpu/intel.c +@@ -15,6 +15,36 @@ + + #include "cpu.h" + ++/* ++ * Processors which have self-snooping capability can handle conflicting ++ * memory type across CPUs by snooping its own cache. However, there exists ++ * CPU models in which having conflicting memory types still leads to ++ * unpredictable behavior, machine check errors, or hangs. Clear this ++ * feature to prevent its use on machines with known erratas. 
++ */ ++static void __init check_memory_type_self_snoop_errata(void) ++{ ++ if (!boot_cpu_has(X86_FEATURE_SS)) ++ return; ++ ++ switch (boot_cpu_data.x86_model) { ++ case 0x0f: /* Merom */ ++ case 0x16: /* Merom L */ ++ case 0x17: /* Penryn */ ++ case 0x1d: /* Dunnington */ ++ case 0x1e: /* Nehalem */ ++ case 0x1f: /* Auburndale / Havendale */ ++ case 0x1a: /* Nehalem EP */ ++ case 0x2e: /* Nehalem EX */ ++ case 0x25: /* Westmere */ ++ case 0x2c: /* Westmere EP */ ++ case 0x2a: /* SandyBridge */ ++ return; ++ } ++ ++ setup_force_cpu_cap(X86_FEATURE_XEN_SELFSNOOP); ++} ++ + /* + * Set caps in expected_levelling_cap, probe a specific masking MSR, and set + * caps in levelling_caps if it is found, or clobber the MSR index if missing. +@@ -257,8 +287,11 @@ static void early_init_intel(struct cpuinfo_x86 *c) + (boot_cpu_data.x86_mask == 3 || boot_cpu_data.x86_mask == 4)) + paddr_bits = 36; + +- if (c == &boot_cpu_data) ++ if (c == &boot_cpu_data) { ++ check_memory_type_self_snoop_errata(); ++ + intel_init_levelling(); ++ } + + ctxt_switch_levelling(NULL); + } +diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h +index 996f89df9a..57f3e61fd5 100644 +--- a/xen/include/asm-x86/cpufeatures.h ++++ b/xen/include/asm-x86/cpufeatures.h +@@ -38,6 +38,7 @@ XEN_CPUFEATURE(SC_MSR_PV, (FSCAPINTS+0)*32+16) /* MSR_SPEC_CTRL used by Xe + XEN_CPUFEATURE(SC_VERW_PV, X86_SYNTH(23)) /* VERW used by Xen for PV */ + XEN_CPUFEATURE(SC_VERW_HVM, X86_SYNTH(24)) /* VERW used by Xen for HVM */ + XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ ++XEN_CPUFEATURE(XEN_SELFSNOOP, X86_SYNTH(26)) /* SELFSNOOP gets used by Xen itself */ + + /* Bug words follow the synthetic words. */ + #define X86_NR_BUG 1 +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-feature-Generalise-synth-and-introduce-a-bug-word.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-feature-Generalise-synth-and-introduce-a-bug-word.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-feature-Generalise-synth-and-introduce-a-bug-word.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-feature-Generalise-synth-and-introduce-a-bug-word.patch 2022-06-16 09:36:03.000000000 +0100 @@ -0,0 +1,96 @@ +From 6408ae3f80287e194cd66218f28edcec939b6fca Mon Sep 17 00:00:00 2001 +From: Andrew Cooper +Date: Thu, 27 Dec 2018 15:13:55 +0000 +Subject: [PATCH] x86/feature: Generalise synth and introduce a bug word + +Future changes are going to want to use cpu_bug_* in a mannor similar to +Linux. Introduce one bug word, and generalise the calculation of +NCAPINTS. + +Signed-off-by: Andrew Cooper +Acked-by: Jan Beulich +--- + xen/include/asm-x86/cpufeatures.h | 67 ++++++++++++++++++------------- + 1 file changed, 38 insertions(+), 29 deletions(-) + +diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h +index 57f3e61fd5..ab3650f73b 100644 +--- a/xen/include/asm-x86/cpufeatures.h ++++ b/xen/include/asm-x86/cpufeatures.h +@@ -4,35 +4,44 @@ + + #include + ++/* Number of capability words covered by the featureset words. */ + #define FSCAPINTS FEATURESET_NR_ENTRIES + +-#define NCAPINTS (FSCAPINTS + 1) /* N 32-bit words worth of info */ ++/* Synthetic words follow the featureset words. */ ++#define X86_NR_SYNTH 1 ++#define X86_SYNTH(x) (FSCAPINTS * 32 + (x)) + +-/* Other features, Xen-defined mapping. 
*/ +-/* This range is used for feature bits which conflict or are synthesized */ +-XEN_CPUFEATURE(CONSTANT_TSC, (FSCAPINTS+0)*32+ 0) /* TSC ticks at a constant rate */ +-XEN_CPUFEATURE(NONSTOP_TSC, (FSCAPINTS+0)*32+ 1) /* TSC does not stop in C states */ +-XEN_CPUFEATURE(ARAT, (FSCAPINTS+0)*32+ 2) /* Always running APIC timer */ +-XEN_CPUFEATURE(ARCH_PERFMON, (FSCAPINTS+0)*32+ 3) /* Intel Architectural PerfMon */ +-XEN_CPUFEATURE(TSC_RELIABLE, (FSCAPINTS+0)*32+ 4) /* TSC is known to be reliable */ +-XEN_CPUFEATURE(XTOPOLOGY, (FSCAPINTS+0)*32+ 5) /* cpu topology enum extensions */ +-XEN_CPUFEATURE(CPUID_FAULTING, (FSCAPINTS+0)*32+ 6) /* cpuid faulting */ +-XEN_CPUFEATURE(CLFLUSH_MONITOR, (FSCAPINTS+0)*32+ 7) /* clflush reqd with monitor */ +-XEN_CPUFEATURE(APERFMPERF, (FSCAPINTS+0)*32+ 8) /* APERFMPERF */ +-XEN_CPUFEATURE(MFENCE_RDTSC, (FSCAPINTS+0)*32+ 9) /* MFENCE synchronizes RDTSC */ +-XEN_CPUFEATURE(XEN_SMEP, (FSCAPINTS+0)*32+10) /* SMEP gets used by Xen itself */ +-XEN_CPUFEATURE(XEN_SMAP, (FSCAPINTS+0)*32+11) /* SMAP gets used by Xen itself */ +-XEN_CPUFEATURE(LFENCE_DISPATCH, (FSCAPINTS+0)*32+12) /* lfence set as Dispatch Serialising */ +-XEN_CPUFEATURE(IND_THUNK_LFENCE,(FSCAPINTS+0)*32+13) /* Use IND_THUNK_LFENCE */ +-XEN_CPUFEATURE(IND_THUNK_JMP, (FSCAPINTS+0)*32+14) /* Use IND_THUNK_JMP */ +-XEN_CPUFEATURE(XEN_IBPB, (FSCAPINTS+0)*32+15) /* IBRSB || IBPB */ +-XEN_CPUFEATURE(SC_MSR_PV, (FSCAPINTS+0)*32+16) /* MSR_SPEC_CTRL used by Xen for PV */ +-XEN_CPUFEATURE(SC_MSR_HVM, (FSCAPINTS+0)*32+17) /* MSR_SPEC_CTRL used by Xen for HVM */ +-XEN_CPUFEATURE(SC_RSB_PV, (FSCAPINTS+0)*32+18) /* RSB overwrite needed for PV */ +-XEN_CPUFEATURE(SC_RSB_HVM, (FSCAPINTS+0)*32+19) /* RSB overwrite needed for HVM */ +-XEN_CPUFEATURE(NO_XPTI, (FSCAPINTS+0)*32+20) /* XPTI mitigation not in use */ +-XEN_CPUFEATURE(SC_MSR_IDLE, (FSCAPINTS+0)*32+21) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */ +-XEN_CPUFEATURE(XEN_LBR, (FSCAPINTS+0)*32+22) /* Xen uses MSR_DEBUGCTL.LBR */ +-XEN_CPUFEATURE(SC_VERW_PV, (FSCAPINTS+0)*32+23) /* VERW used by Xen for PV */ +-XEN_CPUFEATURE(SC_VERW_HVM, (FSCAPINTS+0)*32+24) /* VERW used by Xen for HVM */ +-XEN_CPUFEATURE(SC_VERW_IDLE, (FSCAPINTS+0)*32+25) /* VERW used by Xen for idle */ ++/* Synthetic features */ ++XEN_CPUFEATURE(CONSTANT_TSC, X86_SYNTH( 0)) /* TSC ticks at a constant rate */ ++XEN_CPUFEATURE(NONSTOP_TSC, X86_SYNTH( 1)) /* TSC does not stop in C states */ ++XEN_CPUFEATURE(ARAT, X86_SYNTH( 2)) /* Always running APIC timer */ ++XEN_CPUFEATURE(ARCH_PERFMON, X86_SYNTH( 3)) /* Intel Architectural PerfMon */ ++XEN_CPUFEATURE(TSC_RELIABLE, X86_SYNTH( 4)) /* TSC is known to be reliable */ ++XEN_CPUFEATURE(XTOPOLOGY, X86_SYNTH( 5)) /* cpu topology enum extensions */ ++XEN_CPUFEATURE(CPUID_FAULTING, X86_SYNTH( 6)) /* cpuid faulting */ ++XEN_CPUFEATURE(CLFLUSH_MONITOR, X86_SYNTH( 7)) /* clflush reqd with monitor */ ++XEN_CPUFEATURE(APERFMPERF, X86_SYNTH( 8)) /* APERFMPERF */ ++XEN_CPUFEATURE(MFENCE_RDTSC, X86_SYNTH( 9)) /* MFENCE synchronizes RDTSC */ ++XEN_CPUFEATURE(XEN_SMEP, X86_SYNTH(10)) /* SMEP gets used by Xen itself */ ++XEN_CPUFEATURE(XEN_SMAP, X86_SYNTH(11)) /* SMAP gets used by Xen itself */ ++XEN_CPUFEATURE(LFENCE_DISPATCH, X86_SYNTH(12)) /* lfence set as Dispatch Serialising */ ++XEN_CPUFEATURE(IND_THUNK_LFENCE, X86_SYNTH(13)) /* Use IND_THUNK_LFENCE */ ++XEN_CPUFEATURE(IND_THUNK_JMP, X86_SYNTH(14)) /* Use IND_THUNK_JMP */ ++XEN_CPUFEATURE(XEN_IBPB, X86_SYNTH(15)) /* IBRSB || IBPB */ ++XEN_CPUFEATURE(SC_MSR_PV, X86_SYNTH(16)) /* MSR_SPEC_CTRL used 
by Xen for PV */ ++XEN_CPUFEATURE(SC_MSR_HVM, X86_SYNTH(17)) /* MSR_SPEC_CTRL used by Xen for HVM */ ++XEN_CPUFEATURE(SC_RSB_PV, X86_SYNTH(18)) /* RSB overwrite needed for PV */ ++XEN_CPUFEATURE(SC_RSB_HVM, X86_SYNTH(19)) /* RSB overwrite needed for HVM */ ++XEN_CPUFEATURE(NO_XPTI, X86_SYNTH(20)) /* XPTI mitigation not in use */ ++XEN_CPUFEATURE(SC_MSR_IDLE, X86_SYNTH(21)) /* (SC_MSR_PV || SC_MSR_HVM) && default_xen_spec_ctrl */ ++XEN_CPUFEATURE(XEN_LBR, X86_SYNTH(22)) /* Xen uses MSR_DEBUGCTL.LBR */ ++XEN_CPUFEATURE(SC_VERW_PV, X86_SYNTH(23)) /* VERW used by Xen for PV */ ++XEN_CPUFEATURE(SC_VERW_HVM, X86_SYNTH(24)) /* VERW used by Xen for HVM */ ++XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ ++ ++/* Bug words follow the synthetic words. */ ++#define X86_NR_BUG 1 ++#define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x)) ++ ++/* Total number of capability words, inc synth and bug words. */ ++#define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */ +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/x86-pv-Options-to-disable-and-or-compile-out-32bit-PV-support.patch 2022-06-16 16:13:08.000000000 +0100 @@ -0,0 +1,207 @@ +From 68d757df8dd23b88bebfb6a56c9f51df59de969f Mon Sep 17 00:00:00 2001 +From: Andrew Cooper +Date: Fri, 17 Apr 2020 12:39:40 +0100 +Subject: [PATCH] x86/pv: Options to disable and/or compile out 32bit PV + support +MIME-Version: 1.0 +Content-Type: text/plain; charset=utf8 +Content-Transfer-Encoding: 8bit + +This is the start of some performance and security-hardening improvements, +based on the fact that 32bit PV guests are few and far between these days. + +Ring1 is full of architectural corner cases, such as counting as supervisor +from a paging point of view. This accounts for a substantial performance hit +on processors from the last 8 years (adjusting SMEP/SMAP on every privilege +transition), and the gap is only going to get bigger with new hardware +features. + +Signed-off-by: Andrew Cooper +Reviewed-by: Wei Liu +Reviewed-by: Roger Pau Monné +Acked-by: Jan Beulich +--- + docs/misc/xen-command-line.markdown | 12 ++++++++++- + xen/arch/x86/Kconfig | 16 +++++++++++++++ + xen/arch/x86/pv/domain.c | 34 +++++++++++++++++++++++++++++++ + xen/arch/x86/setup.c | 9 ++++++-- + xen/include/asm-x86/pv/domain.h | 6 ++++++ + xen/include/xen/param.h | 9 ++++++++ + 6 files changed, 83 insertions(+), 3 deletions(-) + +diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown +index acd0b3d994..ee12b0f53f 100644 +--- a/docs/misc/xen-command-line.markdown ++++ b/docs/misc/xen-command-line.markdown +@@ -1592,7 +1592,17 @@ The following resources are available: + CDP, one COS will corespond two CBMs other than one with CAT, due to the + sum of CBMs is fixed, that means actual `cos_max` in use will automatically + reduce to half when CDP is enabled. +- ++ ++### pv ++ = List of [ 32= ] ++ ++ Applicability: x86 ++ ++Controls for aspects of PV guest support. ++ ++* The `32` boolean controls whether 32bit PV guests can be created. It ++ defaults to `true`, and is ignored when `CONFIG_PV32` is compiled out. 
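Illustrative aside on how the two knobs combine, assuming Xen's usual boolean command-line syntax (the GRUB fragment is an example, not taken from the patch): with CONFIG_PV32=y, creation of 32-bit PV guests can still be switched off for a given boot, e.g.

    multiboot2 /boot/xen.gz dom0_mem=4G,max:4G pv=no-32
    # equivalently: pv=32=false

With CONFIG_PV32 compiled out, the `32` setting is parsed but ignored, and no_config_param() logs an informational "CONFIG_PV32 disabled - ignoring ..." message instead.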
++ + ### pv-linear-pt (x86) + > `= ` + +diff --git a/xen/arch/x86/Kconfig b/xen/arch/x86/Kconfig +index a69be983d6..96432f1f69 100644 +--- a/xen/arch/x86/Kconfig ++++ b/xen/arch/x86/Kconfig +@@ -37,6 +37,22 @@ config PV + config PV + def_bool y + ++config PV32 ++ bool "Support for 32bit PV guests" ++ depends on PV ++ default y ++ ---help--- ++ The 32bit PV ABI uses Ring1, an area of the x86 architecture which ++ was deprecated and mostly removed in the AMD64 spec. As a result, ++ it occasionally conflicts with newer x86 hardware features, causing ++ overheads for Xen to maintain backwards compatibility. ++ ++ People may wish to disable 32bit PV guests for attack surface ++ reduction, or performance reasons. Backwards compatibility can be ++ provided via the PV Shim mechanism. ++ ++ If unsure, say Y. ++ + config PV_LINEAR_PT + bool "Support for PV linear pagetables" + depends on PV +diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c +index 43da5c179f..3579dc063e 100644 +--- a/xen/arch/x86/pv/domain.c ++++ b/xen/arch/x86/pv/domain.c +@@ -16,6 +16,38 @@ + #include + #include + ++#ifdef CONFIG_PV32 ++int8_t __read_mostly opt_pv32 = -1; ++#endif ++ ++static __init int parse_pv(const char *s) ++{ ++ const char *ss; ++ int val, rc = 0; ++ ++ do { ++ ss = strchr(s, ','); ++ if ( !ss ) ++ ss = strchr(s, '\0'); ++ ++ if ( (val = parse_boolean("32", s, ss)) >= 0 ) ++ { ++#ifdef CONFIG_PV32 ++ opt_pv32 = val; ++#else ++ no_config_param("PV32", "pv", s, ss); ++#endif ++ } ++ else ++ rc = -EINVAL; ++ ++ s = ss + 1; ++ } while ( *ss ); ++ ++ return rc; ++} ++custom_param("pv", parse_pv); ++ + static __read_mostly enum { + PCID_OFF, + PCID_ALL, +@@ -161,6 +193,8 @@ int switch_compat(struct domain *d) + + BUILD_BUG_ON(offsetof(struct shared_info, vcpu_info) != 0); + ++ if ( !opt_pv32 ) ++ return -EOPNOTSUPP; + if ( is_hvm_domain(d) || d->tot_pages != 0 ) + return -EACCES; + if ( is_pv_32bit_domain(d) ) +diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c +index eb56d78c2f..9e9576344c 100644 +--- a/xen/arch/x86/setup.c ++++ b/xen/arch/x86/setup.c +@@ -53,6 +53,7 @@ + #include + #include + #include ++#include + + /* opt_nosmp: If true, secondary processors are ignored. 
*/ + static bool __initdata opt_nosmp; +@@ -1807,8 +1808,12 @@ void arch_get_xen_caps(xen_capabilities_info_t *info) + + snprintf(s, sizeof(s), "xen-%d.%d-x86_64 ", major, minor); + safe_strcat(*info, s); +- snprintf(s, sizeof(s), "xen-%d.%d-x86_32p ", major, minor); +- safe_strcat(*info, s); ++ ++ if ( opt_pv32 ) ++ { ++ snprintf(s, sizeof(s), "xen-%d.%d-x86_32p ", major, minor); ++ safe_strcat(*info, s); ++ } + if ( hvm_enabled ) + { + snprintf(s, sizeof(s), "hvm-%d.%d-x86_32 ", major, minor); +diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h +index 7a69bfb303..df9716ff26 100644 +--- a/xen/include/asm-x86/pv/domain.h ++++ b/xen/include/asm-x86/pv/domain.h +@@ -21,6 +21,12 @@ + #ifndef __X86_PV_DOMAIN_H__ + #define __X86_PV_DOMAIN_H__ + ++#ifdef CONFIG_PV32 ++extern int8_t opt_pv32; ++#else ++# define opt_pv32 false ++#endif ++ + /* + * PCID values for the address spaces of 64-bit pv domains: + * +diff --git a/xen/include/xen/param.h b/xen/include/xen/param.h +index d4578cd27f..a1dc3ba8f0 100644 +--- a/xen/include/xen/param.h ++++ b/xen/include/xen/param.h +@@ -2,6 +2,8 @@ + #define _XEN_PARAM_H + + #include ++#include ++#include + + /* + * Used for kernel command line parameter setup +@@ -116,4 +118,13 @@ extern const struct kernel_param __param_start[], __param_end[]; + string_param(_name, _var); \ + string_runtime_only_param(_name, _var) + ++static inline void no_config_param(const char *cfg, const char *param, ++ const char *s, const char *e) ++{ ++ int len = e ? ({ ASSERT(e >= s); e - s; }) : strlen(s); ++ ++ printk(XENLOG_INFO "CONFIG_%s disabled - ignoring '%s=%*s' setting\n", ++ cfg, param, len, s); ++} ++ + #endif /* _XEN_PARAM_H */ +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Add-generic-assembly-macros.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Add-generic-assembly-macros.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Add-generic-assembly-macros.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Add-generic-assembly-macros.patch 2022-06-19 23:08:45.000000000 +0100 @@ -0,0 +1,66 @@ +From bb2e9fc7df592753e1fd73b4fec21c375cd3e2e1 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:39 +0100 +Subject: [PATCH] xen/arm64: Add generic assembly macros + +Add assembly macros to simplify assembly code: + - adr_cpu_info: Get the address to the current cpu_info structure + - ldr_this_cpu: Load a per-cpu value + +This is part of XSA-263. 
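A rough C analogue of what the new adr_cpu_info macro computes, given as an illustration only (the assembly added below is authoritative; this assumes the usual Xen/arm layout of a STACK_SIZE-aligned stack with struct cpu_info at its top):

    /* Round sp up to the top of the current stack, then step back over the
     * cpu_info structure that lives there. */
    static inline struct cpu_info *cpu_info_from_sp_sketch(unsigned long sp)
    {
        unsigned long stack_top = (sp + STACK_SIZE) & ~(STACK_SIZE - 1);

        return (struct cpu_info *)(stack_top - sizeof(struct cpu_info));
    }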
+ +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + xen/include/asm-arm/arm64/macros.h | 25 +++++++++++++++++++++++++ + xen/include/asm-arm/macros.h | 2 +- + 2 files changed, 26 insertions(+), 1 deletion(-) + create mode 100644 xen/include/asm-arm/arm64/macros.h + +diff --git a/xen/include/asm-arm/arm64/macros.h b/xen/include/asm-arm/arm64/macros.h +new file mode 100644 +index 0000000000..9c5e676b37 +--- /dev/null ++++ b/xen/include/asm-arm/arm64/macros.h +@@ -0,0 +1,25 @@ ++#ifndef __ASM_ARM_ARM64_MACROS_H ++#define __ASM_ARM_ARM64_MACROS_H ++ ++ /* ++ * @dst: Result of get_cpu_info() ++ */ ++ .macro adr_cpu_info, dst ++ add \dst, sp, #STACK_SIZE ++ and \dst, \dst, #~(STACK_SIZE - 1) ++ sub \dst, \dst, #CPUINFO_sizeof ++ .endm ++ ++ /* ++ * @dst: Result of READ_ONCE(per_cpu(sym, smp_processor_id())) ++ * @sym: The name of the per-cpu variable ++ * @tmp: scratch register ++ */ ++ .macro ldr_this_cpu, dst, sym, tmp ++ ldr \dst, =per_cpu__\sym ++ mrs \tmp, tpidr_el2 ++ ldr \dst, [\dst, \tmp] ++ .endm ++ ++#endif /* __ASM_ARM_ARM64_MACROS_H */ ++ +diff --git a/xen/include/asm-arm/macros.h b/xen/include/asm-arm/macros.h +index 5d837cb38b..1d4bb41d15 100644 +--- a/xen/include/asm-arm/macros.h ++++ b/xen/include/asm-arm/macros.h +@@ -8,7 +8,7 @@ + #if defined (CONFIG_ARM_32) + # include + #elif defined(CONFIG_ARM_64) +-/* No specific ARM64 macros for now */ ++# include + #else + # error "unknown ARM variant" + #endif +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-entry-Use-named-label-in-guest_sync.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-entry-Use-named-label-in-guest_sync.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-entry-Use-named-label-in-guest_sync.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-entry-Use-named-label-in-guest_sync.patch 2022-06-05 21:39:43.000000000 +0100 @@ -0,0 +1,54 @@ +From beb8ae4e767f8fe1d982127ba9049c5f3b2bd5b6 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:32 +0100 +Subject: [PATCH] xen/arm64: entry: Use named label in guest_sync + +This will improve readability for future changes. + +This is part of XSA-263. + +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + xen/arch/arm/arm64/entry.S | 8 ++++---- + 1 file changed, 4 insertions(+), 4 deletions(-) + +diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S +index ffa9a1c492..e2344e565f 100644 +--- a/xen/arch/arm/arm64/entry.S ++++ b/xen/arch/arm/arm64/entry.S +@@ -266,11 +266,11 @@ guest_sync: + mrs x1, esr_el2 + lsr x1, x1, #HSR_EC_SHIFT /* x1 = ESR_EL2.EC */ + cmp x1, #HSR_EC_HVC64 +- b.ne 1f /* Not a HVC skip fastpath. */ ++ b.ne guest_sync_slowpath /* Not a HVC skip fastpath. */ + + mrs x1, esr_el2 + and x1, x1, #0xffff /* Check the immediate [0:16] */ +- cbnz x1, 1f /* should be 0 for HVC #0 */ ++ cbnz x1, guest_sync_slowpath /* should be 0 for HVC #0 */ + + /* + * Fastest path possible for ARM_SMCCC_ARCH_WORKAROUND_1. +@@ -281,7 +281,7 @@ guest_sync: + * be encoded as an immediate for cmp. + */ + eor w0, w0, #ARM_SMCCC_ARCH_WORKAROUND_1_FID +- cbnz w0, 1f ++ cbnz w0, guest_sync_slowpath + + /* + * Clobber both x0 and x1 to prevent leakage. Note that thanks +@@ -291,7 +291,7 @@ guest_sync: + eret + sb + +-1: ++guest_sync_slowpath: + /* + * x0/x1 may have been scratch by the fast path above, so avoid + * to save them. 
+-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm64-Implement-a-fast-path-for-handling-SMCCC_ARCH_WORKAROUND_2.patch 2022-06-05 21:45:55.000000000 +0100 @@ -0,0 +1,153 @@ +From 6dec2c87c4d7b2f03806266c5ceff82b69792a17 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:40 +0100 +Subject: [PATCH] xen/arm64: Implement a fast path for handling + SMCCC_ARCH_WORKAROUND_2 + +The function ARM_SMCCC_ARCH_WORKAROUND_2 will be called by the guest for +enabling/disabling the ssbd mitigation. So we want the handling to +be as fast as possible. + +The new sequence will forward guest's ARCH_WORKAROUND_2 call to EL3 and +also track the state of the workaround per-vCPU. + +Note that since we need to execute branches, this always executes after +the spectre-v2 mitigation. + +This code is based on KVM counterpart "arm64: KVM: Handle guest's +ARCH_WORKAROUND_2 requests" written by Marc Zyngier. + +This is part of XSA-263. + +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + xen/arch/arm/arm64/asm-offsets.c | 2 ++ + xen/arch/arm/arm64/entry.S | 42 +++++++++++++++++++++++++++++++- + xen/arch/arm/cpuerrata.c | 18 ++++++++++++++ + 3 files changed, 61 insertions(+), 1 deletion(-) + +diff --git a/xen/arch/arm/arm64/asm-offsets.c b/xen/arch/arm/arm64/asm-offsets.c +index ce24e44473..f5c696d092 100644 +--- a/xen/arch/arm/arm64/asm-offsets.c ++++ b/xen/arch/arm/arm64/asm-offsets.c +@@ -22,6 +22,7 @@ + void __dummy__(void) + { + OFFSET(UREGS_X0, struct cpu_user_regs, x0); ++ OFFSET(UREGS_X1, struct cpu_user_regs, x1); + OFFSET(UREGS_LR, struct cpu_user_regs, lr); + + OFFSET(UREGS_SP, struct cpu_user_regs, sp); +@@ -45,6 +46,7 @@ void __dummy__(void) + BLANK(); + + DEFINE(CPUINFO_sizeof, sizeof(struct cpu_info)); ++ OFFSET(CPUINFO_flags, struct cpu_info, flags); + + OFFSET(VCPU_arch_saved_context, struct vcpu, arch.saved_context); + +diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S +index e2344e565f..97b05f53ea 100644 +--- a/xen/arch/arm/arm64/entry.S ++++ b/xen/arch/arm/arm64/entry.S +@@ -1,4 +1,6 @@ + #include ++#include ++#include + #include + #include + #include +@@ -281,7 +283,7 @@ guest_sync: + * be encoded as an immediate for cmp. + */ + eor w0, w0, #ARM_SMCCC_ARCH_WORKAROUND_1_FID +- cbnz w0, guest_sync_slowpath ++ cbnz w0, check_wa2 + + /* + * Clobber both x0 and x1 to prevent leakage. Note that thanks +@@ -291,6 +293,44 @@ guest_sync: + eret + sb + ++check_wa2: ++ /* ARM_SMCCC_ARCH_WORKAROUND_2 handling */ ++ eor w0, w0, #(ARM_SMCCC_ARCH_WORKAROUND_1_FID ^ ARM_SMCCC_ARCH_WORKAROUND_2_FID) ++ cbnz w0, guest_sync_slowpath ++#ifdef CONFIG_ARM_SSBD ++alternative_cb arm_enable_wa2_handling ++ b wa2_end ++alternative_cb_end ++ /* Sanitize the argument */ ++ mov x0, #-(UREGS_kernel_sizeof - UREGS_X1) /* x0 := offset of guest's x1 on the stack */ ++ ldr x1, [sp, x0] /* Load guest's x1 */ ++ cmp w1, wzr ++ cset x1, ne ++ ++ /* ++ * Update the guest flag. At this stage sp point after the field ++ * guest_cpu_user_regs in cpu_info. 
++ */ ++ adr_cpu_info x2 ++ ldr x0, [x2, #CPUINFO_flags] ++ bfi x0, x1, #CPUINFO_WORKAROUND_2_FLAG_SHIFT, #1 ++ str x0, [x2, #CPUINFO_flags] ++ ++ /* Check that we actually need to perform the call */ ++ ldr_this_cpu x0, ssbd_callback_required, x2 ++ cbz x0, wa2_end ++ ++ mov w0, #ARM_SMCCC_ARCH_WORKAROUND_2_FID ++ smc #0 ++ ++wa2_end: ++ /* Don't leak data from the SMC call */ ++ mov x1, xzr ++ mov x2, xzr ++ mov x3, xzr ++#endif /* !CONFIG_ARM_SSBD */ ++ mov x0, xzr ++ eret + guest_sync_slowpath: + /* + * x0/x1 may have been scratch by the fast path above, so avoid +diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c +index 1e642c416a..97a118293b 100644 +--- a/xen/arch/arm/cpuerrata.c ++++ b/xen/arch/arm/cpuerrata.c +@@ -9,6 +9,7 @@ + #include + #include + #include ++#include + #include + + /* Override macros from asm/page.h to make them work with mfn_t */ +@@ -274,6 +275,23 @@ static int __init parse_spec_ctrl(const char *s) + } + custom_param("spec-ctrl", parse_spec_ctrl); + ++/* Arm64 only for now as for Arm32 the workaround is currently handled in C. */ ++#ifdef CONFIG_ARM_64 ++void __init arm_enable_wa2_handling(const struct alt_instr *alt, ++ const uint32_t *origptr, ++ uint32_t *updptr, int nr_inst) ++{ ++ BUG_ON(nr_inst != 1); ++ ++ /* ++ * Only allow mitigation on guest ARCH_WORKAROUND_2 if the SSBD ++ * state allow it to be flipped. ++ */ ++ if ( get_ssbd_state() == ARM_SSBD_RUNTIME ) ++ *updptr = aarch64_insn_gen_nop(); ++} ++#endif ++ + /* + * Assembly code may use the variable directly, so we need to make sure + * it fits in a register. +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-probing.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-probing.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-probing.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-probing.patch 2022-06-05 21:24:39.000000000 +0100 @@ -0,0 +1,192 @@ +From 280997891e8ca583f1b7a43297e197c0e4be8f0c Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:34 +0100 +Subject: [PATCH 1/1] xen/arm: Add ARCH_WORKAROUND_2 probing + +As for Spectre variant-2, we rely on SMCCC 1.1 to provide the discovery +mechanism for detecting the SSBD mitigation. + +A new capability is also allocated for that purpose, and a config +option. + +This is part of XSA-263. + +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + xen/arch/arm/Kconfig | 10 ++++++ + xen/arch/arm/cpuerrata.c | 58 ++++++++++++++++++++++++++++++++ + xen/include/asm-arm/cpuerrata.h | 21 ++++++++++++ + xen/include/asm-arm/cpufeature.h | 3 +- + xen/include/asm-arm/smccc.h | 7 ++++ + 5 files changed, 98 insertions(+), 1 deletion(-) + +diff --git a/xen/arch/arm/Kconfig b/xen/arch/arm/Kconfig +index 4dc7ef5351..2cbe9dd43b 100644 +--- a/xen/arch/arm/Kconfig ++++ b/xen/arch/arm/Kconfig +@@ -73,6 +73,16 @@ config SBSA_VUART_CONSOLE + Allows a guest to use SBSA Generic UART as a console. The + SBSA Generic UART implements a subset of ARM PL011 UART. + ++config ARM_SSBD ++ bool "Speculative Store Bypass Disable" if EXPERT = "y" ++ depends on HAS_ALTERNATIVE ++ default y ++ help ++ This enables mitigation of bypassing of previous stores by speculative ++ loads. ++ ++ If unsure, say Y. 
++ + endmenu + + menu "ARM errata workaround via the alternative framework" +diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c +index b829d226ef..03f78fec96 100644 +--- a/xen/arch/arm/cpuerrata.c ++++ b/xen/arch/arm/cpuerrata.c +@@ -331,6 +331,58 @@ static int enable_ic_inv_hardening(void *data) + + #endif + ++#ifdef CONFIG_ARM_SSBD ++ ++/* ++ * Assembly code may use the variable directly, so we need to make sure ++ * it fits in a register. ++ */ ++DEFINE_PER_CPU_READ_MOSTLY(register_t, ssbd_callback_required); ++ ++static bool has_ssbd_mitigation(const struct arm_cpu_capabilities *entry) ++{ ++ struct arm_smccc_res res; ++ bool required; ++ ++ if ( smccc_ver < SMCCC_VERSION(1, 1) ) ++ return false; ++ ++ /* ++ * The probe function return value is either negative (unsupported ++ * or mitigated), positive (unaffected), or zero (requires ++ * mitigation). We only need to do anything in the last case. ++ */ ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FID, ++ ARM_SMCCC_ARCH_WORKAROUND_2_FID, &res); ++ ++ switch ( (int)res.a0 ) ++ { ++ case ARM_SMCCC_NOT_SUPPORTED: ++ return false; ++ ++ case ARM_SMCCC_NOT_REQUIRED: ++ return false; ++ ++ case ARM_SMCCC_SUCCESS: ++ required = true; ++ break; ++ ++ case 1: /* Mitigation not required on this CPU. */ ++ required = false; ++ break; ++ ++ default: ++ ASSERT_UNREACHABLE(); ++ return false; ++ } ++ ++ if ( required ) ++ this_cpu(ssbd_callback_required) = 1; ++ ++ return required; ++} ++#endif ++ + #define MIDR_RANGE(model, min, max) \ + .matches = is_affected_midr_range, \ + .midr_model = model, \ +@@ -489,6 +541,12 @@ static const struct arm_cpu_capabilities arm_errata[] = { + MIDR_ALL_VERSIONS(MIDR_CORTEX_A15), + .enable = enable_ic_inv_hardening, + }, ++#endif ++#ifdef CONFIG_ARM_SSBD ++ { ++ .capability = ARM_SSBD, ++ .matches = has_ssbd_mitigation, ++ }, + #endif + {}, + }; +diff --git a/xen/include/asm-arm/cpuerrata.h b/xen/include/asm-arm/cpuerrata.h +index 4e45b237c8..e628d3ff56 100644 +--- a/xen/include/asm-arm/cpuerrata.h ++++ b/xen/include/asm-arm/cpuerrata.h +@@ -27,9 +27,30 @@ static inline bool check_workaround_##erratum(void) \ + + CHECK_WORKAROUND_HELPER(766422, ARM32_WORKAROUND_766422, CONFIG_ARM_32) + CHECK_WORKAROUND_HELPER(834220, ARM64_WORKAROUND_834220, CONFIG_ARM_64) ++CHECK_WORKAROUND_HELPER(ssbd, ARM_SSBD, CONFIG_ARM_SSBD) + + #undef CHECK_WORKAROUND_HELPER + ++#ifdef CONFIG_ARM_SSBD ++ ++#include ++ ++DECLARE_PER_CPU(register_t, ssbd_callback_required); ++ ++static inline bool cpu_require_ssbd_mitigation(void) ++{ ++ return this_cpu(ssbd_callback_required); ++} ++ ++#else ++ ++static inline bool cpu_require_ssbd_mitigation(void) ++{ ++ return false; ++} ++ ++#endif ++ + #endif /* __ARM_CPUERRATA_H__ */ + /* + * Local variables: +diff --git a/xen/include/asm-arm/cpufeature.h b/xen/include/asm-arm/cpufeature.h +index c5d046218b..3de6b54301 100644 +--- a/xen/include/asm-arm/cpufeature.h ++++ b/xen/include/asm-arm/cpufeature.h +@@ -43,8 +43,9 @@ + #define SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT 5 + #define SKIP_CTXT_SWITCH_SERROR_SYNC 6 + #define ARM_HARDEN_BRANCH_PREDICTOR 7 ++#define ARM_SSBD 8 + +-#define ARM_NCAPS 8 ++#define ARM_NCAPS 9 + + #ifndef __ASSEMBLY__ + +diff --git a/xen/include/asm-arm/smccc.h b/xen/include/asm-arm/smccc.h +index 8342cc33fe..a6804cec99 100644 +--- a/xen/include/asm-arm/smccc.h ++++ b/xen/include/asm-arm/smccc.h +@@ -258,7 +258,14 @@ struct arm_smccc_res { + ARM_SMCCC_OWNER_ARCH, \ + 0x8000) + ++#define ARM_SMCCC_ARCH_WORKAROUND_2_FID \ ++ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ ++ 
ARM_SMCCC_CONV_32, \ ++ ARM_SMCCC_OWNER_ARCH, \ ++ 0x7FFF) ++ + /* SMCCC error codes */ ++#define ARM_SMCCC_NOT_REQUIRED (-2) + #define ARM_SMCCC_ERR_UNKNOWN_FUNCTION (-1) + #define ARM_SMCCC_NOT_SUPPORTED (-1) + #define ARM_SMCCC_SUCCESS (0) +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-support-for-guests.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-support-for-guests.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-support-for-guests.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-ARCH_WORKAROUND_2-support-for-guests.patch 2022-06-05 21:20:56.000000000 +0100 @@ -0,0 +1,187 @@ +From a7898e4c593f83cc5db419d99bdecc0b220bf4e3 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:36 +0100 +Subject: [PATCH] xen/arm: Add ARCH_WORKAROUND_2 support for guests + +In order to offer ARCH_WORKAROUND_2 support to guests, we need to track the +state of the workaround per-vCPU. The field 'pad' in cpu_info is now +repurposed to store flags easily accessible in assembly. + +As the hypervisor will always run with the workaround enabled, we may +need to enable (on guest exit) or disable (on guest entry) the +workaround. + +A follow-up patch will add fastpath for the workaround for arm64 guests. + +Note that check_workaround_ssbd() is used instead of ssbd_get_state() +because the former is implemented using an alternative. Thefore the code +will be shortcut on affected platform. + +This is part of XSA-263. + +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + xen/arch/arm/domain.c | 8 ++++++++ + xen/arch/arm/traps.c | 20 +++++++++++++++++++ + xen/arch/arm/vsmc.c | 37 +++++++++++++++++++++++++++++++++++ + xen/include/asm-arm/current.h | 6 +++++- + 4 files changed, 70 insertions(+), 1 deletion(-) + +diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c +index 5a2a9a6b83..4baecc2447 100644 +--- a/xen/arch/arm/domain.c ++++ b/xen/arch/arm/domain.c +@@ -21,6 +21,7 @@ + #include + + #include ++#include + #include + #include + #include +@@ -572,6 +573,13 @@ int vcpu_initialise(struct vcpu *v) + if ( (rc = vcpu_vtimer_init(v)) != 0 ) + goto fail; + ++ /* ++ * The workaround 2 (i.e SSBD mitigation) is enabled by default if ++ * supported. ++ */ ++ if ( get_ssbd_state() == ARM_SSBD_RUNTIME ) ++ v->arch.cpu_info->flags |= CPUINFO_WORKAROUND_2_FLAG; ++ + return rc; + + fail: +diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c +index d71adfa745..e47ec8aad5 100644 +--- a/xen/arch/arm/traps.c ++++ b/xen/arch/arm/traps.c +@@ -2021,10 +2021,23 @@ inject_abt: + inject_iabt_exception(regs, gva, hsr.len); + } + ++static inline bool needs_ssbd_flip(struct vcpu *v) ++{ ++ if ( !check_workaround_ssbd() ) ++ return false; ++ ++ return !(v->arch.cpu_info->flags & CPUINFO_WORKAROUND_2_FLAG) && ++ cpu_require_ssbd_mitigation(); ++} ++ + static void enter_hypervisor_head(struct cpu_user_regs *regs) + { + if ( guest_mode(regs) ) + { ++ /* If the guest has disabled the workaround, bring it back on. */ ++ if ( needs_ssbd_flip(current) ) ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL); ++ + /* + * If we pended a virtual abort, preserve it until it gets cleared. + * See ARM ARM DDI 0487A.j D1.14.3 (Virtual Interrupts) for details, +@@ -2270,6 +2283,13 @@ void leave_hypervisor_tail(void) + */ + SYNCHRONIZE_SERROR(SKIP_SYNCHRONIZE_SERROR_ENTRY_EXIT); + ++ /* ++ * The hypervisor runs with the workaround always present. 
++ * If the guest wants it disabled, so be it... ++ */ ++ if ( needs_ssbd_flip(current) ) ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 0, NULL); ++ + return; + } + local_irq_enable(); +diff --git a/xen/arch/arm/vsmc.c b/xen/arch/arm/vsmc.c +index 40a80d5760..c4ccae6030 100644 +--- a/xen/arch/arm/vsmc.c ++++ b/xen/arch/arm/vsmc.c +@@ -18,6 +18,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -104,6 +105,23 @@ static bool handle_arch(struct cpu_user_regs *regs) + if ( cpus_have_cap(ARM_HARDEN_BRANCH_PREDICTOR) ) + ret = 0; + break; ++ case ARM_SMCCC_ARCH_WORKAROUND_2_FID: ++ switch ( get_ssbd_state() ) ++ { ++ case ARM_SSBD_UNKNOWN: ++ case ARM_SSBD_FORCE_DISABLE: ++ break; ++ ++ case ARM_SSBD_RUNTIME: ++ ret = ARM_SMCCC_SUCCESS; ++ break; ++ ++ case ARM_SSBD_FORCE_ENABLE: ++ case ARM_SSBD_MITIGATED: ++ ret = ARM_SMCCC_NOT_REQUIRED; ++ break; ++ } ++ break; + } + + set_user_reg(regs, 0, ret); +@@ -114,6 +132,25 @@ static bool handle_arch(struct cpu_user_regs *regs) + case ARM_SMCCC_ARCH_WORKAROUND_1_FID: + /* No return value */ + return true; ++ ++ case ARM_SMCCC_ARCH_WORKAROUND_2_FID: ++ { ++ bool enable = (uint32_t)get_user_reg(regs, 1); ++ ++ /* ++ * ARM_WORKAROUND_2_FID should only be called when mitigation ++ * state can be changed at runtime. ++ */ ++ if ( unlikely(get_ssbd_state() != ARM_SSBD_RUNTIME) ) ++ return true; ++ ++ if ( enable ) ++ get_cpu_info()->flags |= CPUINFO_WORKAROUND_2_FLAG; ++ else ++ get_cpu_info()->flags &= ~CPUINFO_WORKAROUND_2_FLAG; ++ ++ return true; ++ } + } + + return false; +diff --git a/xen/include/asm-arm/current.h b/xen/include/asm-arm/current.h +index 7a0971fdea..f9819b34fc 100644 +--- a/xen/include/asm-arm/current.h ++++ b/xen/include/asm-arm/current.h +@@ -7,6 +7,10 @@ + #include + #include + ++/* Tell whether the guest vCPU enabled Workaround 2 (i.e variant 4) */ ++#define CPUINFO_WORKAROUND_2_FLAG_SHIFT 0 ++#define CPUINFO_WORKAROUND_2_FLAG (_AC(1, U) << CPUINFO_WORKAROUND_2_FLAG_SHIFT) ++ + #ifndef __ASSEMBLY__ + + struct vcpu; +@@ -21,7 +25,7 @@ DECLARE_PER_CPU(struct vcpu *, curr_vcpu); + struct cpu_info { + struct cpu_user_regs guest_cpu_user_regs; + unsigned long elr; +- unsigned int pad; ++ uint32_t flags; + }; + + static inline struct cpu_info *get_cpu_info(void) +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-command-line-option-to-control-SSBD-mitigation.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-command-line-option-to-control-SSBD-mitigation.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-command-line-option-to-control-SSBD-mitigation.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Add-command-line-option-to-control-SSBD-mitigation.patch 2022-06-05 21:56:15.000000000 +0100 @@ -0,0 +1,246 @@ +From 07182e7d490aa6318a9d33706d8b40cbdb62e51d Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:35 +0100 +Subject: [PATCH] xen/arm: Add command line option to control SSBD mitigation + +On a system where the firmware implements ARCH_WORKAROUND_2, it may be +useful to either permanently enable or disable the workaround for cases +where the user decides that they'd rather not get a trap overhead, and +keep the mitigation permanently on or off instead of switching it on +exception entry/exit. In any case, default to mitigation being enabled. 
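As a usage illustration, with values taken from the option documentation added later in this patch: an administrator who would rather avoid the entry/exit trap overhead can keep the mitigation permanently on by booting Xen with

    spec-ctrl=ssbd=force-enable

while ssbd=force-disable keeps it permanently off and the default, ssbd=runtime, lets each guest flip the mitigation for itself through ARCH_WORKAROUND_2.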
+ +The new command line option is implemented as list of one option to +follow x86 option and also allow to extend it more easily in the future. + +Note that for convenience, the full implemention of the workaround is +done in the .matches callback. + +Lastly, a accessor is provided to know the state of the mitigation. + +After this patch, there are 3 methods complementing each other to find the +state of the mitigation: + - The capability ARM_SSBD indicates the platform is affected by the + vulnerability. This will also return false if the user decide to force + disabled the mitigation (spec-ctrl="ssbd=force-disable"). The + capability is useful for putting shortcut in place using alternative. + - ssbd_state indicates the global state of the mitigation (e.g + unknown, force enable...). The global state is required to report + the state to a guest. + - The per-cpu ssbd_callback_required indicates whether a pCPU + requires to call the SMC. This allows to shortcut SMC call + and save an entry/exit to EL3. + +This is part of XSA-263. + +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + docs/misc/xen-command-line.markdown | 18 ++++++ + xen/arch/arm/cpuerrata.c | 88 ++++++++++++++++++++++++++--- + xen/include/asm-arm/cpuerrata.h | 21 +++++++ + 3 files changed, 120 insertions(+), 7 deletions(-) + +diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown +index 8712a833a2..962028b6ed 100644 +--- a/docs/misc/xen-command-line.markdown ++++ b/docs/misc/xen-command-line.markdown +@@ -1756,6 +1756,24 @@ enforces the maximum theoretically necessary timeout of 670ms. Any number + is being interpreted as a custom timeout in milliseconds. Zero or boolean + false disable the quirk workaround, which is also the default. + ++### spec-ctrl (Arm) ++> `= List of [ ssbd=force-disable|runtime|force-enable ]` ++ ++Controls for speculative execution sidechannel mitigations. ++ ++The option `ssbd=` is used to control the state of Speculative Store ++Bypass Disable (SSBD) mitigation. ++ ++* `ssbd=force-disable` will keep the mitigation permanently off. The guest ++will not be able to control the state of the mitigation. ++* `ssbd=runtime` will always turn on the mitigation when running in the ++hypervisor context. The guest will be to turn on/off the mitigation for ++itself by using the firmware interface ARCH\_WORKAROUND\_2. ++* `ssbd=force-enable` will keep the mitigation permanently on. The guest will ++not be able to control the state of the mitigation. ++ ++By default SSBD will be mitigated at runtime (i.e `ssbd=runtime`). 
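For completeness, a hedged sketch of the guest side of this interface. The wrapper name is invented and a real guest would go through its own SMCCC conduit (HVC under Xen); it is shown with the same helper used elsewhere in this series because the calling convention is identical, with the requested state passed in the first argument and read back by the vsmc.c handler from the previous patch:

    static void guest_set_ssbd_sketch(bool enable)
    {
        /* No result registers are needed; the call only updates the per-vCPU
         * workaround flag kept by the hypervisor. */
        arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, enable ? 1 : 0, NULL);
    }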
++ + ### spec-ctrl (x86) + > `= List of [ , xen=, {pv,hvm,msr-sc,rsb,md-clear}=, + > bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu, +diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c +index 03f78fec96..1e642c416a 100644 +--- a/xen/arch/arm/cpuerrata.c ++++ b/xen/arch/arm/cpuerrata.c +@@ -237,6 +237,41 @@ static int enable_ic_inv_hardening(void *data) + + #ifdef CONFIG_ARM_SSBD + ++enum ssbd_state ssbd_state = ARM_SSBD_RUNTIME; ++ ++static int __init parse_spec_ctrl(const char *s) ++{ ++ const char *ss; ++ int rc = 0; ++ ++ do { ++ ss = strchr(s, ','); ++ if ( !ss ) ++ ss = strchr(s, '\0'); ++ ++ if ( !strncmp(s, "ssbd=", 5) ) ++ { ++ s += 5; ++ ++ if ( !strncmp(s, "force-disable", ss - s) ) ++ ssbd_state = ARM_SSBD_FORCE_DISABLE; ++ else if ( !strncmp(s, "runtime", ss - s) ) ++ ssbd_state = ARM_SSBD_RUNTIME; ++ else if ( !strncmp(s, "force-enable", ss - s) ) ++ ssbd_state = ARM_SSBD_FORCE_ENABLE; ++ else ++ rc = -EINVAL; ++ } ++ else ++ rc = -EINVAL; ++ ++ s = ss + 1; ++ } while ( *ss ); ++ ++ return rc; ++} ++custom_param("spec-ctrl", parse_spec_ctrl); ++ + /* + * Assembly code may use the variable directly, so we need to make sure + * it fits in a register. +@@ -251,20 +286,17 @@ static bool has_ssbd_mitigation(const struct arm_cpu_capabilities *entry) + if ( smccc_ver < SMCCC_VERSION(1, 1) ) + return false; + +- /* +- * The probe function return value is either negative (unsupported +- * or mitigated), positive (unaffected), or zero (requires +- * mitigation). We only need to do anything in the last case. +- */ + arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FID, + ARM_SMCCC_ARCH_WORKAROUND_2_FID, &res); + + switch ( (int)res.a0 ) + { + case ARM_SMCCC_NOT_SUPPORTED: ++ ssbd_state = ARM_SSBD_UNKNOWN; + return false; + + case ARM_SMCCC_NOT_REQUIRED: ++ ssbd_state = ARM_SSBD_MITIGATED; + return false; + + case ARM_SMCCC_SUCCESS: +@@ -280,8 +312,49 @@ static bool has_ssbd_mitigation(const struct arm_cpu_capabilities *entry) + return false; + } + +- if ( required ) +- this_cpu(ssbd_callback_required) = 1; ++ switch ( ssbd_state ) ++ { ++ case ARM_SSBD_FORCE_DISABLE: ++ { ++ static bool once = true; ++ ++ if ( once ) ++ printk("%s disabled from command-line\n", entry->desc); ++ once = false; ++ ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 0, NULL); ++ required = false; ++ ++ break; ++ } ++ ++ case ARM_SSBD_RUNTIME: ++ if ( required ) ++ { ++ this_cpu(ssbd_callback_required) = 1; ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL); ++ } ++ ++ break; ++ ++ case ARM_SSBD_FORCE_ENABLE: ++ { ++ static bool once = true; ++ ++ if ( once ) ++ printk("%s forced from command-line\n", entry->desc); ++ once = false; ++ ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_WORKAROUND_2_FID, 1, NULL); ++ required = true; ++ ++ break; ++ } ++ ++ default: ++ ASSERT_UNREACHABLE(); ++ return false; ++ } + + return required; + } +@@ -390,6 +463,7 @@ static const struct arm_cpu_capabilities arm_errata[] = { + #endif + #ifdef CONFIG_ARM_SSBD + { ++ .desc = "Speculative Store Bypass Disabled", + .capability = ARM_SSBD, + .matches = has_ssbd_mitigation, + }, +diff --git a/xen/include/asm-arm/cpuerrata.h b/xen/include/asm-arm/cpuerrata.h +index e628d3ff56..55ddfda272 100644 +--- a/xen/include/asm-arm/cpuerrata.h ++++ b/xen/include/asm-arm/cpuerrata.h +@@ -31,10 +31,26 @@ CHECK_WORKAROUND_HELPER(ssbd, ARM_SSBD, CONFIG_ARM_SSBD) + + #undef CHECK_WORKAROUND_HELPER + ++enum ssbd_state ++{ ++ ARM_SSBD_UNKNOWN, ++ ARM_SSBD_FORCE_DISABLE, ++ ARM_SSBD_RUNTIME, ++ ARM_SSBD_FORCE_ENABLE, ++ 
ARM_SSBD_MITIGATED, ++}; ++ + #ifdef CONFIG_ARM_SSBD + + #include + ++extern enum ssbd_state ssbd_state; ++ ++static inline enum ssbd_state get_ssbd_state(void) ++{ ++ return ssbd_state; ++} ++ + DECLARE_PER_CPU(register_t, ssbd_callback_required); + + static inline bool cpu_require_ssbd_mitigation(void) +@@ -49,6 +65,11 @@ static inline bool cpu_require_ssbd_mitigation(void) + return false; + } + ++static inline enum ssbd_state get_ssbd_state(void) ++{ ++ return ARM_SSBD_UNKNOWN; ++} ++ + #endif + + #endif /* __ARM_CPUERRATA_H__ */ +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-alternatives-Add-dynamic-patching-feature.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-alternatives-Add-dynamic-patching-feature.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-alternatives-Add-dynamic-patching-feature.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-alternatives-Add-dynamic-patching-feature.patch 2022-06-19 23:04:12.000000000 +0100 @@ -0,0 +1,246 @@ +From 3e9db39ea06e726c66c40cb8f2f0e5fa62de9c7c Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:38 +0100 +Subject: [PATCH] xen/arm: alternatives: Add dynamic patching feature + +This is based on the Linux commit dea5e2a4c5bc "arm64: alternatives: Add +dynamic patching feature" written by Marc Zyngier: + + We've so far relied on a patching infrastructure that only gave us + a single alternative, without any way to provide a range of potential + replacement instructions. For a single feature, this is an all or + nothing thing. + + It would be interesting to have a more flexible grained way of patching the + kernel though, where we could dynamically tune the code that gets injected. + + In order to achive this, let's introduce a new form of dynamic patching, + assiciating a callback to a patching site. This callback gets source and + target locations of the patching request, as well as the number of + instructions to be patched. + + Dynamic patching is declared with the new ALTERNATIVE_CB and alternative_cb + directives: + asm volatile(ALTERNATIVE_CB("mov %0, #0\n", callback) + : "r" (v)); + or + + alternative_cb callback + mov x0, #0 + alternative_cb_end + + where callback is the C function computing the alternative. + + Reviewed-by: Christoffer Dall + Reviewed-by: Catalin Marinas + Signed-off-by: Marc Zyngier + +This is part of XSA-263. 
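To illustrate the shape of such a callback, a hedged sketch follows. The function and condition names are invented, but the signature matches the alternative_cb_t type introduced below and the pattern matches arm_enable_wa2_handling() seen earlier in this series:

    /* Called at patch time with the original site, the writable alias to
     * update, and the number of instructions covered by the alternative. */
    static void my_patch_cb_sketch(const struct alt_instr *alt,
                                   const uint32_t *origptr, uint32_t *updptr,
                                   int nr_inst)
    {
        BUG_ON(nr_inst != 1);

        /* Decide dynamically whether to keep the original branch or NOP it out. */
        if ( !my_runtime_condition() )
            *updptr = aarch64_insn_gen_nop();
    }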
+ +Signed-off-by: Julien Grall +Acked-by: Stefano Stabellini +--- + xen/arch/arm/alternative.c | 48 +++++++++++++++++++++++-------- + xen/include/asm-arm/alternative.h | 44 ++++++++++++++++++++++++---- + 2 files changed, 75 insertions(+), 17 deletions(-) + +diff --git a/xen/arch/arm/alternative.c b/xen/arch/arm/alternative.c +index 936cf04956..52ed7edf69 100644 +--- a/xen/arch/arm/alternative.c ++++ b/xen/arch/arm/alternative.c +@@ -30,6 +30,8 @@ + #include + #include + #include ++/* XXX: Move ARCH_PATCH_INSN_SIZE out of livepatch.h */ ++#include + #include + + /* Override macros from asm/page.h to make them work with mfn_t */ +@@ -94,6 +96,23 @@ static u32 get_alt_insn(const struct alt_instr *alt, + return insn; + } + ++static void patch_alternative(const struct alt_instr *alt, ++ const uint32_t *origptr, ++ uint32_t *updptr, int nr_inst) ++{ ++ const uint32_t *replptr; ++ unsigned int i; ++ ++ replptr = ALT_REPL_PTR(alt); ++ for ( i = 0; i < nr_inst; i++ ) ++ { ++ uint32_t insn; ++ ++ insn = get_alt_insn(alt, origptr + i, replptr + i); ++ updptr[i] = cpu_to_le32(insn); ++ } ++} ++ + /* + * The region patched should be read-write to allow __apply_alternatives + * to replacing the instructions when necessary. +@@ -105,33 +124,38 @@ static int __apply_alternatives(const struct alt_region *region, + paddr_t update_offset) + { + const struct alt_instr *alt; +- const u32 *replptr, *origptr; ++ const u32 *origptr; + u32 *updptr; ++ alternative_cb_t alt_cb; + + printk(XENLOG_INFO "alternatives: Patching with alt table %p -> %p\n", + region->begin, region->end); + + for ( alt = region->begin; alt < region->end; alt++ ) + { +- u32 insn; +- int i, nr_inst; ++ int nr_inst; + +- if ( !cpus_have_cap(alt->cpufeature) ) ++ /* Use ARM_CB_PATCH as an unconditional patch */ ++ if ( alt->cpufeature < ARM_CB_PATCH && ++ !cpus_have_cap(alt->cpufeature) ) + continue; + +- BUG_ON(alt->alt_len != alt->orig_len); ++ if ( alt->cpufeature == ARM_CB_PATCH ) ++ BUG_ON(alt->alt_len != 0); ++ else ++ BUG_ON(alt->alt_len != alt->orig_len); + + origptr = ALT_ORIG_PTR(alt); + updptr = (void *)origptr + update_offset; +- replptr = ALT_REPL_PTR(alt); + +- nr_inst = alt->alt_len / sizeof(insn); ++ nr_inst = alt->orig_len / ARCH_PATCH_INSN_SIZE; + +- for ( i = 0; i < nr_inst; i++ ) +- { +- insn = get_alt_insn(alt, origptr + i, replptr + i); +- *(updptr + i) = cpu_to_le32(insn); +- } ++ if ( alt->cpufeature < ARM_CB_PATCH ) ++ alt_cb = patch_alternative; ++ else ++ alt_cb = ALT_REPL_PTR(alt); ++ ++ alt_cb(alt, origptr, updptr, nr_inst); + + /* Ensure the new instructions reached the memory and nuke */ + clean_and_invalidate_dcache_va_range(origptr, +diff --git a/xen/include/asm-arm/alternative.h b/xen/include/asm-arm/alternative.h +index 4e33d1cdf7..9b4b02811b 100644 +--- a/xen/include/asm-arm/alternative.h ++++ b/xen/include/asm-arm/alternative.h +@@ -3,6 +3,8 @@ + + #include + ++#define ARM_CB_PATCH ARM_NCAPS ++ + #ifndef __ASSEMBLY__ + + #include +@@ -18,16 +20,24 @@ struct alt_instr { + }; + + /* Xen: helpers used by common code. 
*/ +-#define __ALT_PTR(a,f) ((u32 *)((void *)&(a)->f + (a)->f)) ++#define __ALT_PTR(a,f) ((void *)&(a)->f + (a)->f) + #define ALT_ORIG_PTR(a) __ALT_PTR(a, orig_offset) + #define ALT_REPL_PTR(a) __ALT_PTR(a, alt_offset) + ++typedef void (*alternative_cb_t)(const struct alt_instr *alt, ++ const uint32_t *origptr, uint32_t *updptr, ++ int nr_inst); ++ + void __init apply_alternatives_all(void); + int apply_alternatives(const struct alt_instr *start, const struct alt_instr *end); + +-#define ALTINSTR_ENTRY(feature) \ ++#define ALTINSTR_ENTRY(feature, cb) \ + " .word 661b - .\n" /* label */ \ ++ " .if " __stringify(cb) " == 0\n" \ + " .word 663f - .\n" /* new instruction */ \ ++ " .else\n" \ ++ " .word " __stringify(cb) "- .\n" /* callback */ \ ++ " .endif\n" \ + " .hword " __stringify(feature) "\n" /* feature bit */ \ + " .byte 662b-661b\n" /* source len */ \ + " .byte 664f-663f\n" /* replacement len */ +@@ -45,15 +55,18 @@ int apply_alternatives(const struct alt_instr *start, const struct alt_instr *en + * but most assemblers die if insn1 or insn2 have a .inst. This should + * be fixed in a binutils release posterior to 2.25.51.0.2 (anything + * containing commit 4e4d08cf7399b606 or c1baaddf8861). ++ * ++ * Alternatives with callbacks do not generate replacement instructions. + */ +-#define __ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg_enabled) \ ++#define __ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg_enabled, cb) \ + ".if "__stringify(cfg_enabled)" == 1\n" \ + "661:\n\t" \ + oldinstr "\n" \ + "662:\n" \ + ".pushsection .altinstructions,\"a\"\n" \ +- ALTINSTR_ENTRY(feature) \ ++ ALTINSTR_ENTRY(feature,cb) \ + ".popsection\n" \ ++ " .if " __stringify(cb) " == 0\n" \ + ".pushsection .altinstr_replacement, \"a\"\n" \ + "663:\n\t" \ + newinstr "\n" \ +@@ -61,11 +74,17 @@ int apply_alternatives(const struct alt_instr *start, const struct alt_instr *en + ".popsection\n\t" \ + ".org . - (664b-663b) + (662b-661b)\n\t" \ + ".org . - (662b-661b) + (664b-663b)\n" \ ++ ".else\n\t" \ ++ "663:\n\t" \ ++ "664:\n\t" \ ++ ".endif\n" \ + ".endif\n" + + #define _ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg, ...) \ +- __ALTERNATIVE_CFG(oldinstr, newinstr, feature, IS_ENABLED(cfg)) ++ __ALTERNATIVE_CFG(oldinstr, newinstr, feature, IS_ENABLED(cfg), 0) + ++#define ALTERNATIVE_CB(oldinstr, cb) \ ++ __ALTERNATIVE_CFG(oldinstr, "NOT_AN_INSTRUCTION", ARM_CB_PATCH, 1, cb) + #else + + #include +@@ -126,6 +145,14 @@ int apply_alternatives(const struct alt_instr *start, const struct alt_instr *en + 663: + .endm + ++.macro alternative_cb cb ++ .set .Lasm_alt_mode, 0 ++ .pushsection .altinstructions, "a" ++ altinstruction_entry 661f, \cb, ARM_CB_PATCH, 662f-661f, 0 ++ .popsection ++661: ++.endm ++ + /* + * Complete an alternative code sequence. + */ +@@ -135,6 +162,13 @@ int apply_alternatives(const struct alt_instr *start, const struct alt_instr *en + .org . - (662b-661b) + (664b-663b) + .endm + ++/* ++ * Callback-based alternative epilogue ++ */ ++.macro alternative_cb_end ++662: ++.endm ++ + #define _ALTERNATIVE_CFG(insn1, insn2, cap, cfg, ...) 
\ + alternative_insn insn1, insn2, cap, IS_ENABLED(cfg) + +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Simplify-alternative-patching-of-non-writable-region.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Simplify-alternative-patching-of-non-writable-region.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Simplify-alternative-patching-of-non-writable-region.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-arm-Simplify-alternative-patching-of-non-writable-region.patch 2022-06-19 23:17:54.000000000 +0100 @@ -0,0 +1,128 @@ +From 7c98c24e9ba76df3b9f531353d99e5c1bfa8b9a9 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 12 Jun 2018 12:36:37 +0100 +Subject: [PATCH] xen/arm: Simplify alternative patching of non-writable region + +During the MMU setup process, Xen will set SCTLR_EL2.WNX +(Write-Non-eXecutable) bit. Because of that, the alternative code need +to re-mapped the region in a difference place in order to modify the +text section. + +At the moment, the function patching the code is only aware of the +re-mapped region. This requires the caller to mess with Xen internal in +order to have function such as is_active_kernel_text() working. + +All the interactions with Xen internal can be removed by specifying the +offset between the region patch and the writable region for updating the +instruction + +This simplification will also make it easier to integrate dynamic patching +in a follow-up patch. Indeed, the callback address should be in +an original region and not re-mapped only which is writeable non-executable. + +This is part of XSA-263. + +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini +--- + xen/arch/arm/alternative.c | 42 ++++++++++++-------------------------- + 1 file changed, 13 insertions(+), 29 deletions(-) + +diff --git a/xen/arch/arm/alternative.c b/xen/arch/arm/alternative.c +index 9ffdc475d6..936cf04956 100644 +--- a/xen/arch/arm/alternative.c ++++ b/xen/arch/arm/alternative.c +@@ -97,12 +97,16 @@ static u32 get_alt_insn(const struct alt_instr *alt, + /* + * The region patched should be read-write to allow __apply_alternatives + * to replacing the instructions when necessary. ++ * ++ * @update_offset: Offset between the region patched and the writable ++ * region for the update. 0 if the patched region is writable. 
+ */ +-static int __apply_alternatives(const struct alt_region *region) ++static int __apply_alternatives(const struct alt_region *region, ++ paddr_t update_offset) + { + const struct alt_instr *alt; +- const u32 *replptr; +- u32 *origptr; ++ const u32 *replptr, *origptr; ++ u32 *updptr; + + printk(XENLOG_INFO "alternatives: Patching with alt table %p -> %p\n", + region->begin, region->end); +@@ -118,6 +122,7 @@ static int __apply_alternatives(const struct alt_region *region) + BUG_ON(alt->alt_len != alt->orig_len); + + origptr = ALT_ORIG_PTR(alt); ++ updptr = (void *)origptr + update_offset; + replptr = ALT_REPL_PTR(alt); + + nr_inst = alt->alt_len / sizeof(insn); +@@ -125,7 +130,7 @@ static int __apply_alternatives(const struct alt_region *region) + for ( i = 0; i < nr_inst; i++ ) + { + insn = get_alt_insn(alt, origptr + i, replptr + i); +- *(origptr + i) = cpu_to_le32(insn); ++ *(updptr + i) = cpu_to_le32(insn); + } + + /* Ensure the new instructions reached the memory and nuke */ +@@ -162,9 +167,6 @@ static int __apply_alternatives_multi_stop(void *unused) + paddr_t xen_size = _end - _start; + unsigned int xen_order = get_order_from_bytes(xen_size); + void *xenmap; +- struct virtual_region patch_region = { +- .list = LIST_HEAD_INIT(patch_region.list), +- }; + + BUG_ON(patched); + +@@ -177,31 +179,13 @@ static int __apply_alternatives_multi_stop(void *unused) + /* Re-mapping Xen is not expected to fail during boot. */ + BUG_ON(!xenmap); + +- /* +- * If we generate a new branch instruction, the target will be +- * calculated in this re-mapped Xen region. So we have to register +- * this re-mapped Xen region as a virtual region temporarily. +- */ +- patch_region.start = xenmap; +- patch_region.end = xenmap + xen_size; +- register_virtual_region(&patch_region); ++ region.begin = __alt_instructions; ++ region.end = __alt_instructions_end; + +- /* +- * Find the virtual address of the alternative region in the new +- * mapping. +- * alt_instr contains relative offset, so the function +- * __apply_alternatives will patch in the re-mapped version of +- * Xen. +- */ +- region.begin = (void *)__alt_instructions - (void *)_start + xenmap; +- region.end = (void *)__alt_instructions_end - (void *)_start + xenmap; +- +- ret = __apply_alternatives(®ion); ++ ret = __apply_alternatives(®ion, xenmap - (void *)_start); + /* The patching is not expected to fail during boot. 
*/ + BUG_ON(ret != 0); + +- unregister_virtual_region(&patch_region); +- + vunmap(xenmap); + + /* Barriers provided by the cache flushing */ +@@ -235,7 +219,7 @@ int apply_alternatives(const struct alt_instr *start, const struct alt_instr *en + .end = end, + }; + +- return __apply_alternatives(®ion); ++ return __apply_alternatives(®ion, 0); + } + + /* +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-events-access-last_priority-and-last_vcpu_id-together.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-events-access-last_priority-and-last_vcpu_id-together.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-events-access-last_priority-and-last_vcpu_id-together.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-events-access-last_priority-and-last_vcpu_id-together.patch 2022-06-01 10:53:11.000000000 +0100 @@ -0,0 +1,98 @@ +From 8ab4af91fab6618994e5857b65aec5896915d9c5 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Tue, 1 Dec 2020 17:06:15 +0100 +Subject: [PATCH 1/1] xen/events: access last_priority and last_vcpu_id + together + +The queue for a fifo event is depending on the vcpu_id and the +priority of the event. When sending an event it might happen the +event needs to change queues and the old queue needs to be kept for +keeping the links between queue elements intact. For this purpose +the event channel contains last_priority and last_vcpu_id values +elements for being able to identify the old queue. + +In order to avoid races always access last_priority and last_vcpu_id +with a single atomic operation avoiding any inconsistencies. + +Signed-off-by: Juergen Gross +Reviewed-by: Julien Grall +master commit: 1277cb9dc5e966f1faf665bcded02b7533e38078 +master date: 2020-11-24 11:23:42 +0100 +--- + xen/common/event_fifo.c | 25 +++++++++++++++++++------ + xen/include/xen/sched.h | 3 +-- + 2 files changed, 20 insertions(+), 8 deletions(-) + +diff --git a/xen/common/event_fifo.c b/xen/common/event_fifo.c +index 98742ba9cb..b1951a29ad 100644 +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -21,6 +21,14 @@ + + #include + ++union evtchn_fifo_lastq { ++ uint32_t raw; ++ struct { ++ uint8_t last_priority; ++ uint16_t last_vcpu_id; ++ }; ++}; ++ + static inline event_word_t *evtchn_fifo_word_from_port(const struct domain *d, + unsigned int port) + { +@@ -64,16 +72,18 @@ static struct evtchn_fifo_queue *lock_old_queue(const struct domain *d, + struct vcpu *v; + struct evtchn_fifo_queue *q, *old_q; + unsigned int try; ++ union evtchn_fifo_lastq lastq; + + for ( try = 0; try < 3; try++ ) + { +- v = d->vcpu[evtchn->last_vcpu_id]; +- old_q = &v->evtchn_fifo->queue[evtchn->last_priority]; ++ lastq.raw = read_atomic(&evtchn->fifo_lastq); ++ v = d->vcpu[lastq.last_vcpu_id]; ++ old_q = &v->evtchn_fifo->queue[lastq.last_priority]; + + spin_lock_irqsave(&old_q->lock, *flags); + +- v = d->vcpu[evtchn->last_vcpu_id]; +- q = &v->evtchn_fifo->queue[evtchn->last_priority]; ++ v = d->vcpu[lastq.last_vcpu_id]; ++ q = &v->evtchn_fifo->queue[lastq.last_priority]; + + if ( old_q == q ) + return old_q; +@@ -224,8 +234,11 @@ static void evtchn_fifo_set_pending(struct vcpu *v, struct evtchn *evtchn) + /* Moved to a different queue? 
*/ + if ( old_q != q ) + { +- evtchn->last_vcpu_id = v->vcpu_id; +- evtchn->last_priority = q->priority; ++ union evtchn_fifo_lastq lastq = { }; ++ ++ lastq.last_vcpu_id = v->vcpu_id; ++ lastq.last_priority = q->priority; ++ write_atomic(&evtchn->fifo_lastq, lastq.raw); + + spin_unlock_irqrestore(&old_q->lock, flags); + spin_lock_irqsave(&q->lock, flags); +diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h +index 3c1284c3da..81af1209dd 100644 +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -114,8 +114,7 @@ struct evtchn + #ifndef NDEBUG + u8 old_state; /* State when taking lock in write mode. */ + #endif +- u8 last_priority; +- u16 last_vcpu_id; ++ u32 fifo_lastq; /* Data for fifo events identifying last queue. */ + #ifdef CONFIG_XSM + union { + #ifdef XSM_NEED_GENERIC_EVTCHN_SSID +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-evtchn-rework-per-event-channel-lock.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-evtchn-rework-per-event-channel-lock.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-evtchn-rework-per-event-channel-lock.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-evtchn-rework-per-event-channel-lock.patch 2022-06-01 12:11:00.000000000 +0100 @@ -0,0 +1,592 @@ +From 4438fc14a6c60b265a47c780d6a61d4f3015a297 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Tue, 1 Dec 2020 17:04:43 +0100 +Subject: [PATCH] xen/evtchn: rework per event channel lock + +Currently the lock for a single event channel needs to be taken with +interrupts off, which causes deadlocks in some cases. + +Rework the per event channel lock to be non-blocking for the case of +sending an event and removing the need for disabling interrupts for +taking the lock. + +The lock is needed for avoiding races between event channel state +changes (creation, closing, binding) against normal operations (set +pending, [un]masking, priority changes). + +Use a rwlock, but with some restrictions: + +- Changing the state of an event channel (creation, closing, binding) + needs to use write_lock(), with ASSERT()ing that the lock is taken as + writer only when the state of the event channel is either before or + after the locked region appropriate (either free or unbound). + +- Sending an event needs to use read_trylock() mostly, in case of not + obtaining the lock the operation is omitted. This is needed as + sending an event can happen with interrupts off (at least in some + cases). + +- Dumping the event channel state for debug purposes is using + read_trylock(), too, in order to avoid blocking in case the lock is + taken as writer for a long time. + +- All other cases can use read_lock(). + +Fixes: e045199c7c9c54 ("evtchn: address races with evtchn_reset()") +Signed-off-by: Juergen Gross +Reviewed-by: Jan Beulich +Acked-by: Julien Grall + +xen/events: fix build + +Commit 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock") +introduced a build failure for NDEBUG builds. 
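The send-side rule described above reduces to the following pattern, repeated at each notification site in this patch (a hedged, self-contained sketch with an invented helper name):

    static void notify_channel_sketch(struct domain *d, struct evtchn *chn)
    {
        /* Never block and never disable interrupts: if the lock cannot be
         * taken as a reader, the send is simply omitted, since the channel
         * is then treated as free or unbound. */
        if ( !evtchn_read_trylock(chn) )
            return;

        evtchn_port_set_pending(d, chn->notify_vcpu_id, chn);
        evtchn_read_unlock(chn);
    }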
+ +Fixes: 5f2df45ead7c1195 ("xen/evtchn: rework per event channel lock") +Signed-off-by: Juergen Gross +Signed-off-by: Jan Beulich +master commit: 5f2df45ead7c1195142f68b7923047a1e9479d54 +master date: 2020-11-10 14:36:15 +0100 +master commit: 53bacb86f496fdb11560d9e3b361bca7de60d268 +master date: 2020-11-11 08:56:21 +0100 +--- + xen/arch/x86/irq.c | 6 +- + xen/arch/x86/pv/shim.c | 9 +-- + xen/common/event_channel.c | 141 ++++++++++++++++++++++--------------- + xen/include/xen/event.h | 27 +++++-- + xen/include/xen/sched.h | 5 +- + 5 files changed, 116 insertions(+), 72 deletions(-) + +diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c +index 9878486073..a293270cf2 100644 +--- a/xen/arch/x86/irq.c ++++ b/xen/arch/x86/irq.c +@@ -2331,14 +2331,12 @@ static void dump_irqs(unsigned char key) + pirq = domain_irq_to_pirq(d, irq); + info = pirq_info(d, pirq); + evtchn = evtchn_from_port(d, info->evtchn); +- local_irq_disable(); +- if ( spin_trylock(&evtchn->lock) ) ++ if ( evtchn_read_trylock(evtchn) ) + { + pending = evtchn_is_pending(d, evtchn); + masked = evtchn_is_masked(d, evtchn); +- spin_unlock(&evtchn->lock); ++ evtchn_read_unlock(evtchn); + } +- local_irq_enable(); + printk("%u:%3d(%c%c%c)", + d->domain_id, pirq, "-P?"[pending], + "-M?"[masked], info->masked ? 'M' : '-'); +diff --git a/xen/arch/x86/pv/shim.c b/xen/arch/x86/pv/shim.c +index 4b7d498c00..f8883bd102 100644 +--- a/xen/arch/x86/pv/shim.c ++++ b/xen/arch/x86/pv/shim.c +@@ -616,11 +616,12 @@ void pv_shim_inject_evtchn(unsigned int port) + if ( port_is_valid(guest, port) ) + { + struct evtchn *chn = evtchn_from_port(guest, port); +- unsigned long flags; + +- spin_lock_irqsave(&chn->lock, flags); +- evtchn_port_set_pending(guest, chn->notify_vcpu_id, chn); +- spin_unlock_irqrestore(&chn->lock, flags); ++ if ( evtchn_read_trylock(chn) ) ++ { ++ evtchn_port_set_pending(guest, chn->notify_vcpu_id, chn); ++ evtchn_read_unlock(chn); ++ } + } + } + +diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c +index 0066c8a87f..7f2ad9d826 100644 +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -50,6 +50,40 @@ + + #define consumer_is_xen(e) (!!(e)->xen_consumer) + ++/* ++ * Lock an event channel exclusively. This is allowed only when the channel is ++ * free or unbound either when taking or when releasing the lock, as any ++ * concurrent operation on the event channel using evtchn_read_trylock() will ++ * just assume the event channel is free or unbound at the moment when the ++ * evtchn_read_trylock() returns false. ++ */ ++static inline void evtchn_write_lock(struct evtchn *evtchn) ++{ ++ write_lock(&evtchn->lock); ++ ++#ifndef NDEBUG ++ evtchn->old_state = evtchn->state; ++#endif ++} ++ ++static inline unsigned int old_state(const struct evtchn *evtchn) ++{ ++#ifndef NDEBUG ++ return evtchn->old_state; ++#else ++ return ECS_RESERVED; /* Just to allow things to build. */ ++#endif ++} ++ ++static inline void evtchn_write_unlock(struct evtchn *evtchn) ++{ ++ /* Enforce lock discipline. */ ++ ASSERT(old_state(evtchn) == ECS_FREE || old_state(evtchn) == ECS_UNBOUND || ++ evtchn->state == ECS_FREE || evtchn->state == ECS_UNBOUND); ++ ++ write_unlock(&evtchn->lock); ++} ++ + /* + * The function alloc_unbound_xen_event_channel() allows an arbitrary + * notifier function to be specified. 
However, very few unique functions +@@ -131,7 +165,7 @@ static struct evtchn *alloc_evtchn_bucket(struct domain *d, unsigned int port) + return NULL; + } + chn[i].port = port + i; +- spin_lock_init(&chn[i].lock); ++ rwlock_init(&chn[i].lock); + } + return chn; + } +@@ -249,7 +283,6 @@ static long evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc) + int port; + domid_t dom = alloc->dom; + long rc; +- unsigned long flags; + + d = rcu_lock_domain_by_any_id(dom); + if ( d == NULL ) +@@ -265,14 +298,14 @@ static long evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc) + if ( rc ) + goto out; + +- spin_lock_irqsave(&chn->lock, flags); ++ evtchn_write_lock(chn); + + chn->state = ECS_UNBOUND; + if ( (chn->u.unbound.remote_domid = alloc->remote_dom) == DOMID_SELF ) + chn->u.unbound.remote_domid = current->domain->domain_id; + evtchn_port_init(d, chn); + +- spin_unlock_irqrestore(&chn->lock, flags); ++ evtchn_write_unlock(chn); + + alloc->port = port; + +@@ -285,32 +318,26 @@ static long evtchn_alloc_unbound(evtchn_alloc_unbound_t *alloc) + } + + +-static unsigned long double_evtchn_lock(struct evtchn *lchn, +- struct evtchn *rchn) ++static void double_evtchn_lock(struct evtchn *lchn, struct evtchn *rchn) + { +- unsigned long flags; +- + if ( lchn <= rchn ) + { +- spin_lock_irqsave(&lchn->lock, flags); ++ evtchn_write_lock(lchn); + if ( lchn != rchn ) +- spin_lock(&rchn->lock); ++ evtchn_write_lock(rchn); + } + else + { +- spin_lock_irqsave(&rchn->lock, flags); +- spin_lock(&lchn->lock); ++ evtchn_write_lock(rchn); ++ evtchn_write_lock(lchn); + } +- +- return flags; + } + +-static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn, +- unsigned long flags) ++static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn) + { + if ( lchn != rchn ) +- spin_unlock(&lchn->lock); +- spin_unlock_irqrestore(&rchn->lock, flags); ++ evtchn_write_unlock(lchn); ++ evtchn_write_unlock(rchn); + } + + static long evtchn_bind_interdomain(evtchn_bind_interdomain_t *bind) +@@ -320,7 +347,6 @@ static long evtchn_bind_interdomain(evtchn_bind_interdomain_t *bind) + int lport, rport = bind->remote_port; + domid_t rdom = bind->remote_dom; + long rc; +- unsigned long flags; + + if ( rdom == DOMID_SELF ) + rdom = current->domain->domain_id; +@@ -356,7 +382,7 @@ static long evtchn_bind_interdomain(evtchn_bind_interdomain_t *bind) + if ( rc ) + goto out; + +- flags = double_evtchn_lock(lchn, rchn); ++ double_evtchn_lock(lchn, rchn); + + lchn->u.interdomain.remote_dom = rd; + lchn->u.interdomain.remote_port = rport; +@@ -373,7 +399,7 @@ static long evtchn_bind_interdomain(evtchn_bind_interdomain_t *bind) + */ + evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn); + +- double_evtchn_unlock(lchn, rchn, flags); ++ double_evtchn_unlock(lchn, rchn); + + bind->local_port = lport; + +@@ -396,7 +422,6 @@ int evtchn_bind_virq(evtchn_bind_virq_t *bind, evtchn_port_t port) + struct domain *d = current->domain; + int virq = bind->virq, vcpu = bind->vcpu; + int rc = 0; +- unsigned long flags; + + if ( (virq < 0) || (virq >= ARRAY_SIZE(v->virq_to_evtchn)) ) + return -EINVAL; +@@ -429,14 +454,14 @@ int evtchn_bind_virq(evtchn_bind_virq_t *bind, evtchn_port_t port) + + chn = evtchn_from_port(d, port); + +- spin_lock_irqsave(&chn->lock, flags); ++ evtchn_write_lock(chn); + + chn->state = ECS_VIRQ; + chn->notify_vcpu_id = vcpu; + chn->u.virq = virq; + evtchn_port_init(d, chn); + +- spin_unlock_irqrestore(&chn->lock, flags); ++ evtchn_write_unlock(chn); + + v->virq_to_evtchn[virq] = bind->port = port; + +@@ -453,7 
+478,6 @@ static long evtchn_bind_ipi(evtchn_bind_ipi_t *bind) + struct domain *d = current->domain; + int port, vcpu = bind->vcpu; + long rc = 0; +- unsigned long flags; + + if ( (vcpu < 0) || (vcpu >= d->max_vcpus) || + (d->vcpu[vcpu] == NULL) ) +@@ -466,13 +490,13 @@ static long evtchn_bind_ipi(evtchn_bind_ipi_t *bind) + + chn = evtchn_from_port(d, port); + +- spin_lock_irqsave(&chn->lock, flags); ++ evtchn_write_lock(chn); + + chn->state = ECS_IPI; + chn->notify_vcpu_id = vcpu; + evtchn_port_init(d, chn); + +- spin_unlock_irqrestore(&chn->lock, flags); ++ evtchn_write_unlock(chn); + + bind->port = port; + +@@ -516,7 +540,6 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind) + struct pirq *info; + int port = 0, pirq = bind->pirq; + long rc; +- unsigned long flags; + + if ( (pirq < 0) || (pirq >= d->nr_pirqs) ) + return -EINVAL; +@@ -549,14 +572,14 @@ static long evtchn_bind_pirq(evtchn_bind_pirq_t *bind) + goto out; + } + +- spin_lock_irqsave(&chn->lock, flags); ++ evtchn_write_lock(chn); + + chn->state = ECS_PIRQ; + chn->u.pirq.irq = pirq; + link_pirq_port(port, chn, v); + evtchn_port_init(d, chn); + +- spin_unlock_irqrestore(&chn->lock, flags); ++ evtchn_write_unlock(chn); + + bind->port = port; + +@@ -577,7 +600,6 @@ int evtchn_close(struct domain *d1, int port1, bool guest) + struct evtchn *chn1, *chn2; + int port2; + long rc = 0; +- unsigned long flags; + + again: + spin_lock(&d1->event_lock); +@@ -677,14 +699,14 @@ int evtchn_close(struct domain *d1, int port1, bool guest) + BUG_ON(chn2->state != ECS_INTERDOMAIN); + BUG_ON(chn2->u.interdomain.remote_dom != d1); + +- flags = double_evtchn_lock(chn1, chn2); ++ double_evtchn_lock(chn1, chn2); + + evtchn_free(d1, chn1); + + chn2->state = ECS_UNBOUND; + chn2->u.unbound.remote_domid = d1->domain_id; + +- double_evtchn_unlock(chn1, chn2, flags); ++ double_evtchn_unlock(chn1, chn2); + + goto out; + +@@ -692,9 +714,9 @@ int evtchn_close(struct domain *d1, int port1, bool guest) + BUG(); + } + +- spin_lock_irqsave(&chn1->lock, flags); ++ evtchn_write_lock(chn1); + evtchn_free(d1, chn1); +- spin_unlock_irqrestore(&chn1->lock, flags); ++ evtchn_write_unlock(chn1); + + out: + if ( d2 != NULL ) +@@ -714,7 +736,6 @@ int evtchn_send(struct domain *ld, unsigned int lport) + struct evtchn *lchn, *rchn; + struct domain *rd; + int rport, ret = 0; +- unsigned long flags; + + if ( !port_is_valid(ld, lport) ) + return -EINVAL; +@@ -727,7 +748,7 @@ int evtchn_send(struct domain *ld, unsigned int lport) + + lchn = evtchn_from_port(ld, lport); + +- spin_lock_irqsave(&lchn->lock, flags); ++ evtchn_read_lock(lchn); + + /* Guest cannot send via a Xen-attached event channel. 
*/ + if ( unlikely(consumer_is_xen(lchn)) ) +@@ -762,7 +783,7 @@ int evtchn_send(struct domain *ld, unsigned int lport) + } + + out: +- spin_unlock_irqrestore(&lchn->lock, flags); ++ evtchn_read_unlock(lchn); + + return ret; + } +@@ -789,9 +810,11 @@ void send_guest_vcpu_virq(struct vcpu *v, uint32_t virq) + + d = v->domain; + chn = evtchn_from_port(d, port); +- spin_lock(&chn->lock); +- evtchn_port_set_pending(d, v->vcpu_id, chn); +- spin_unlock(&chn->lock); ++ if ( evtchn_read_trylock(chn) ) ++ { ++ evtchn_port_set_pending(d, v->vcpu_id, chn); ++ evtchn_read_unlock(chn); ++ } + + out: + spin_unlock_irqrestore(&v->virq_lock, flags); +@@ -820,9 +843,11 @@ static void send_guest_global_virq(struct domain *d, uint32_t virq) + goto out; + + chn = evtchn_from_port(d, port); +- spin_lock(&chn->lock); +- evtchn_port_set_pending(d, chn->notify_vcpu_id, chn); +- spin_unlock(&chn->lock); ++ if ( evtchn_read_trylock(chn) ) ++ { ++ evtchn_port_set_pending(d, chn->notify_vcpu_id, chn); ++ evtchn_read_unlock(chn); ++ } + + out: + spin_unlock_irqrestore(&v->virq_lock, flags); +@@ -832,7 +857,6 @@ void send_guest_pirq(struct domain *d, const struct pirq *pirq) + { + int port; + struct evtchn *chn; +- unsigned long flags; + + /* + * PV guests: It should not be possible to race with __evtchn_close(). The +@@ -847,9 +871,11 @@ void send_guest_pirq(struct domain *d, const struct pirq *pirq) + } + + chn = evtchn_from_port(d, port); +- spin_lock_irqsave(&chn->lock, flags); +- evtchn_port_set_pending(d, chn->notify_vcpu_id, chn); +- spin_unlock_irqrestore(&chn->lock, flags); ++ if ( evtchn_read_trylock(chn) ) ++ { ++ evtchn_port_set_pending(d, chn->notify_vcpu_id, chn); ++ evtchn_read_unlock(chn); ++ } + } + + static struct domain *global_virq_handlers[NR_VIRQS] __read_mostly; +@@ -1044,15 +1070,17 @@ int evtchn_unmask(unsigned int port) + { + struct domain *d = current->domain; + struct evtchn *evtchn; +- unsigned long flags; + + if ( unlikely(!port_is_valid(d, port)) ) + return -EINVAL; + + evtchn = evtchn_from_port(d, port); +- spin_lock_irqsave(&evtchn->lock, flags); ++ ++ evtchn_read_lock(evtchn); ++ + evtchn_port_unmask(d, evtchn); +- spin_unlock_irqrestore(&evtchn->lock, flags); ++ ++ evtchn_read_unlock(evtchn); + + return 0; + } +@@ -1298,7 +1326,6 @@ int alloc_unbound_xen_event_channel( + { + struct evtchn *chn; + int port, rc; +- unsigned long flags; + + spin_lock(&ld->event_lock); + +@@ -1311,14 +1338,14 @@ int alloc_unbound_xen_event_channel( + if ( rc ) + goto out; + +- spin_lock_irqsave(&chn->lock, flags); ++ evtchn_write_lock(chn); + + chn->state = ECS_UNBOUND; + chn->xen_consumer = get_xen_consumer(notification_fn); + chn->notify_vcpu_id = lvcpu; + chn->u.unbound.remote_domid = remote_domid; + +- spin_unlock_irqrestore(&chn->lock, flags); ++ evtchn_write_unlock(chn); + + write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1); + +@@ -1350,7 +1377,6 @@ void notify_via_xen_event_channel(struct domain *ld, int lport) + { + struct evtchn *lchn, *rchn; + struct domain *rd; +- unsigned long flags; + + if ( !port_is_valid(ld, lport) ) + { +@@ -1365,7 +1391,8 @@ void notify_via_xen_event_channel(struct domain *ld, int lport) + + lchn = evtchn_from_port(ld, lport); + +- spin_lock_irqsave(&lchn->lock, flags); ++ if ( !evtchn_read_trylock(lchn) ) ++ return; + + if ( likely(lchn->state == ECS_INTERDOMAIN) ) + { +@@ -1375,7 +1402,7 @@ void notify_via_xen_event_channel(struct domain *ld, int lport) + evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn); + } + +- spin_unlock_irqrestore(&lchn->lock, flags); ++ 
evtchn_read_unlock(lchn); + } + + void evtchn_check_pollers(struct domain *d, unsigned int port) +diff --git a/xen/include/xen/event.h b/xen/include/xen/event.h +index 87a4aade86..b0c39d402c 100644 +--- a/xen/include/xen/event.h ++++ b/xen/include/xen/event.h +@@ -103,6 +103,21 @@ static inline unsigned int max_evtchns(const struct domain *d) + : BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d); + } + ++static inline void evtchn_read_lock(struct evtchn *evtchn) ++{ ++ read_lock(&evtchn->lock); ++} ++ ++static inline bool evtchn_read_trylock(struct evtchn *evtchn) ++{ ++ return read_trylock(&evtchn->lock); ++} ++ ++static inline void evtchn_read_unlock(struct evtchn *evtchn) ++{ ++ read_unlock(&evtchn->lock); ++} ++ + static inline bool_t port_is_valid(struct domain *d, unsigned int p) + { + if ( p >= read_atomic(&d->valid_evtchns) ) +@@ -236,11 +251,10 @@ static inline bool evtchn_port_is_pending(struct domain *d, evtchn_port_t port) + { + struct evtchn *evtchn = evtchn_from_port(d, port); + bool rc; +- unsigned long flags; + +- spin_lock_irqsave(&evtchn->lock, flags); ++ evtchn_read_lock(evtchn); + rc = evtchn_is_pending(d, evtchn); +- spin_unlock_irqrestore(&evtchn->lock, flags); ++ evtchn_read_unlock(evtchn); + + return rc; + } +@@ -255,11 +269,12 @@ static inline bool evtchn_port_is_masked(struct domain *d, evtchn_port_t port) + { + struct evtchn *evtchn = evtchn_from_port(d, port); + bool rc; +- unsigned long flags; + +- spin_lock_irqsave(&evtchn->lock, flags); ++ evtchn_read_lock(evtchn); ++ + rc = evtchn_is_masked(d, evtchn); +- spin_unlock_irqrestore(&evtchn->lock, flags); ++ ++ evtchn_read_unlock(evtchn); + + return rc; + } +diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h +index 7e4ad5d51b..3c1284c3da 100644 +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -82,7 +82,7 @@ extern domid_t hardware_domid; + + struct evtchn + { +- spinlock_t lock; ++ rwlock_t lock; + #define ECS_FREE 0 /* Channel is available for use. */ + #define ECS_RESERVED 1 /* Channel is reserved. */ + #define ECS_UNBOUND 2 /* Channel is waiting to bind to a remote domain. */ +@@ -111,6 +111,9 @@ struct evtchn + u16 virq; /* state == ECS_VIRQ */ + } u; + u8 priority; ++#ifndef NDEBUG ++ u8 old_state; /* State when taking lock in write mode. */ ++#endif + u8 last_priority; + u16 last_vcpu_id; + #ifdef CONFIG_XSM +-- +2.30.2 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-split-parameter-related-definitions-in-own-header-file.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-split-parameter-related-definitions-in-own-header-file.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-split-parameter-related-definitions-in-own-header-file.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xen-split-parameter-related-definitions-in-own-header-file.patch 2022-06-16 22:48:10.000000000 +0100 @@ -0,0 +1,1409 @@ +From ffdeb6dea596c077aebbdf7d864cdd67d6a6b2f8 Mon Sep 17 00:00:00 2001 +From: Juergen Gross +Date: Mon, 3 Feb 2020 13:04:30 +0100 +Subject: [PATCH] xen: split parameter related definitions in own header file + +Move the parameter related definitions from init.h into a new header +file param.h. This will avoid include hell when new dependencies are +added to parameter definitions. 
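For readers following this backport: the hunks below simply add an explicit include of the new header wherever command-line parameters are defined or parsed. As a minimal sketch of what a definition site looks like after the split — the option name and variable here are invented placeholders, only the boolean_param()/__read_mostly pattern mirrors the macros this patch moves into param.h (the tmem_xen.c hunk further down shows a real in-tree instance):

    /*
     * Illustrative sketch only: "opt_example" and the "example" option name
     * are placeholders, not part of this patch. After the split, a file that
     * defines a boot command-line parameter includes <xen/param.h> itself
     * instead of inheriting the macros from <xen/init.h>.
     */
    #include <xen/types.h>   /* bool */
    #include <xen/init.h>
    #include <xen/param.h>   /* boolean_param(), custom_param(), ... now live here */

    static bool __read_mostly opt_example = true;
    boolean_param("example", opt_example);  /* registers a struct kernel_param for boot-time parsing */
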
+ +Signed-off-by: Juergen Gross +Acked-by: Julien Grall +Acked-by: Dario Faggioli +Acked-by: Paul Durrant +Reviewed-by: Kevin Tian +Acked-by: Jan Beulich +--- + xen/arch/arm/acpi/boot.c | 1 + + xen/arch/arm/cpuerrata.c | 1 + + xen/arch/arm/domain_build.c | 1 + + xen/arch/arm/gic-v3-lpi.c | 1 + + xen/arch/arm/setup.c | 1 + + xen/arch/arm/smpboot.c | 1 + + xen/arch/arm/traps.c | 1 + + xen/arch/x86/acpi/boot.c | 1 + + xen/arch/x86/acpi/cpu_idle.c | 1 + + xen/arch/x86/acpi/cpufreq/cpufreq.c | 1 + + xen/arch/x86/acpi/power.c | 1 + + xen/arch/x86/apic.c | 1 + + xen/arch/x86/cpu/amd.c | 1 + + xen/arch/x86/cpu/common.c | 1 + + xen/arch/x86/cpu/mcheck/mce.c | 1 + + xen/arch/x86/cpu/mcheck/mce_intel.c | 1 + + xen/arch/x86/cpu/mtrr/generic.c | 1 + + xen/arch/x86/cpu/mwait-idle.c | 1 + + xen/arch/x86/cpu/vpmu.c | 1 + + xen/arch/x86/cpuid.c | 1 + + xen/arch/x86/dom0_build.c | 1 + + xen/arch/x86/e820.c | 1 + + xen/arch/x86/genapic/probe.c | 1 + + xen/arch/x86/genapic/x2apic.c | 1 + + xen/arch/x86/hpet.c | 1 + + xen/arch/x86/hvm/asid.c | 1 + + xen/arch/x86/hvm/hvm.c | 1 + + xen/arch/x86/hvm/quirks.c | 1 + + xen/arch/x86/hvm/viridian.c | 1 + + xen/arch/x86/hvm/vmx/vmcs.c | 1 + + xen/arch/x86/hvm/vmx/vmx.c | 1 + + xen/arch/x86/io_apic.c | 1 + + xen/arch/x86/irq.c | 1 + + xen/arch/x86/microcode.c | 1 + + xen/arch/x86/mm.c | 1 + + xen/arch/x86/mm/p2m.c | 1 + + xen/arch/x86/msi.c | 1 + + xen/arch/x86/nmi.c | 1 + + xen/arch/x86/numa.c | 1 + + xen/arch/x86/oprofile/nmi_int.c | 1 + + xen/arch/x86/psr.c | 1 + + xen/arch/x86/pv/domain.c | 1 + + xen/arch/x86/pv/shim.c | 1 + + xen/arch/x86/setup.c | 1 + + xen/arch/x86/shutdown.c | 1 + + xen/arch/x86/spec_ctrl.c | 1 + + xen/arch/x86/tboot.c | 1 + + xen/arch/x86/time.c | 1 + + xen/arch/x86/traps.c | 1 + + xen/arch/x86/tsx.c | 1 + + xen/arch/x86/x86_64/mmconfig-shared.c | 1 + + xen/arch/x86/xstate.c | 1 + + xen/common/core_parking.c | 1 + + xen/common/domain.c | 1 + + xen/common/efi/boot.c | 1 + + xen/common/gdbstub.c | 1 + + xen/common/grant_table.c | 1 + + xen/common/kernel.c | 1 + + xen/common/kexec.c | 1 + + xen/common/memory.c | 1 + + xen/common/page_alloc.c | 1 + + xen/common/rcupdate.c | 1 + + xen/common/sched_credit.c | 1 + + xen/common/sched_credit2.c | 1 + + xen/common/schedule.c | 1 + + xen/common/shutdown.c | 1 + + xen/common/timer.c | 1 + + xen/common/tmem_xen.c | 1 + + xen/common/trace.c | 1 + + xen/drivers/acpi/apei/hest.c | 1 + + xen/drivers/acpi/tables.c | 1 + + xen/drivers/char/arm-uart.c | 1 + + xen/drivers/char/console.c | 1 + + xen/drivers/char/ehci-dbgp.c | 1 + + xen/drivers/char/ns16550.c | 1 + + xen/drivers/char/serial.c | 1 + + xen/drivers/cpufreq/cpufreq.c | 1 + + xen/drivers/passthrough/amd/iommu_acpi.c | 1 + + xen/drivers/passthrough/iommu.c | 1 + + xen/drivers/passthrough/pci.c | 1 + + xen/drivers/passthrough/vtd/dmar.c | 1 + + xen/drivers/passthrough/vtd/quirks.c | 1 + + xen/drivers/passthrough/vtd/x86/vtd.c | 1 + + xen/drivers/passthrough/x86/ats.c | 1 + + xen/drivers/video/vesa.c | 1 + + xen/drivers/video/vga.c | 1 + + xen/include/xen/init.h | 120 --------------------- + xen/include/xen/param.h | 126 +++++++++++++++++++++++ + xen/xsm/flask/flask_op.c | 1 + + xen/xsm/xsm_core.c | 1 + + 92 files changed, 216 insertions(+), 120 deletions(-) + create mode 100644 xen/include/xen/param.h + +diff --git a/xen/arch/arm/acpi/boot.c b/xen/arch/arm/acpi/boot.c +index bf9c78b02c..30e4bd1bc5 100644 +--- a/xen/arch/arm/acpi/boot.c ++++ b/xen/arch/arm/acpi/boot.c +@@ -30,6 +30,7 @@ + #include + #include + #include ++#include + #include + + #include 
+diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c +index da72b02442..0248893de0 100644 +--- a/xen/arch/arm/cpuerrata.c ++++ b/xen/arch/arm/cpuerrata.c +@@ -1,6 +1,7 @@ + #include + #include ++#include + #include + #include + #include + #include +diff --git a/xen/arch/arm/domain_build.c b/xen/arch/arm/domain_build.c +index dd9c3b73ba..d2d11eda26 100644 +--- a/xen/arch/arm/domain_build.c ++++ b/xen/arch/arm/domain_build.c +@@ -2,6 +2,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/arm/gic-v3-lpi.c b/xen/arch/arm/gic-v3-lpi.c +index 78b9521b21..869bc97fa1 100644 +--- a/xen/arch/arm/gic-v3-lpi.c ++++ b/xen/arch/arm/gic-v3-lpi.c +@@ -20,6 +20,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/arm/setup.c b/xen/arch/arm/setup.c +index 494f70546b..3c8ae11b73 100644 +--- a/xen/arch/arm/setup.c ++++ b/xen/arch/arm/setup.c +@@ -29,6 +29,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c +index 00b64c3322..cae2179126 100644 +--- a/xen/arch/arm/smpboot.c ++++ b/xen/arch/arm/smpboot.c +@@ -23,6 +23,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c +index a20474f87c..6f9bec22d3 100644 +--- a/xen/arch/arm/traps.c ++++ b/xen/arch/arm/traps.c +@@ -26,6 +26,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c +index afc6ed9d99..bcba52e232 100644 +--- a/xen/arch/x86/acpi/boot.c ++++ b/xen/arch/x86/acpi/boot.c +@@ -27,6 +27,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/acpi/cpu_idle.c b/xen/arch/x86/acpi/cpu_idle.c +index 2676f0d7da..5cd70d7a40 100644 +--- a/xen/arch/x86/acpi/cpu_idle.c ++++ b/xen/arch/x86/acpi/cpu_idle.c +@@ -37,6 +37,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/acpi/cpufreq/cpufreq.c b/xen/arch/x86/acpi/cpufreq/cpufreq.c +index f05275578d..281be131a3 100644 +--- a/xen/arch/x86/acpi/cpufreq/cpufreq.c ++++ b/xen/arch/x86/acpi/cpufreq/cpufreq.c +@@ -31,6 +31,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c +index feb0f6ce20..b5df00b22c 100644 +--- a/xen/arch/x86/acpi/power.c ++++ b/xen/arch/x86/acpi/power.c +@@ -14,6 +14,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/apic.c b/xen/arch/x86/apic.c +index 508b1586f2..a361781456 100644 +--- a/xen/arch/x86/apic.c ++++ b/xen/arch/x86/apic.c +@@ -20,6 +20,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c +index 8b5f0f2e4c..e351dd227f 100644 +--- a/xen/arch/x86/cpu/amd.c ++++ b/xen/arch/x86/cpu/amd.c +@@ -1,6 +1,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c +index e5ad17d8d9..1b33f1ed71 100644 +--- a/xen/arch/x86/cpu/common.c ++++ b/xen/arch/x86/cpu/common.c +@@ -1,6 +1,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c +index 198595ff97..d61e582af3 100644 +--- 
a/xen/arch/x86/cpu/mcheck/mce.c ++++ b/xen/arch/x86/cpu/mcheck/mce.c +@@ -6,6 +6,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/mcheck/mce_intel.c b/xen/arch/x86/cpu/mcheck/mce_intel.c +index 70738852b9..6f23ea5329 100644 +--- a/xen/arch/x86/cpu/mcheck/mce_intel.c ++++ b/xen/arch/x86/cpu/mcheck/mce_intel.c +@@ -4,6 +4,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/mtrr/generic.c b/xen/arch/x86/cpu/mtrr/generic.c +index cc0bf4c310..89634f918f 100644 +--- a/xen/arch/x86/cpu/mtrr/generic.c ++++ b/xen/arch/x86/cpu/mtrr/generic.c +@@ -3,6 +3,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/mwait-idle.c b/xen/arch/x86/cpu/mwait-idle.c +index f49b04c45b..b81937966e 100644 +--- a/xen/arch/x86/cpu/mwait-idle.c ++++ b/xen/arch/x86/cpu/mwait-idle.c +@@ -52,6 +52,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpu/vpmu.c b/xen/arch/x86/cpu/vpmu.c +index b62095eef2..3c778450ac 100644 +--- a/xen/arch/x86/cpu/vpmu.c ++++ b/xen/arch/x86/cpu/vpmu.c +@@ -22,6 +22,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c +index b1ed33d524..aee221dc44 100644 +--- a/xen/arch/x86/cpuid.c ++++ b/xen/arch/x86/cpuid.c +@@ -1,5 +1,6 @@ + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/dom0_build.c b/xen/arch/x86/dom0_build.c +index 56c2dee0fc..6bf5365582 100644 +--- a/xen/arch/x86/dom0_build.c ++++ b/xen/arch/x86/dom0_build.c +@@ -7,6 +7,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/e820.c b/xen/arch/x86/e820.c +index 3892c9cfb7..b9f589cac3 100644 +--- a/xen/arch/x86/e820.c ++++ b/xen/arch/x86/e820.c +@@ -1,6 +1,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/genapic/probe.c b/xen/arch/x86/genapic/probe.c +index 1fcc1734f5..d4d7a554a0 100644 +--- a/xen/arch/x86/genapic/probe.c ++++ b/xen/arch/x86/genapic/probe.c +@@ -8,6 +8,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/genapic/x2apic.c b/xen/arch/x86/genapic/x2apic.c +index 1cb16bc10d..f9b5e49761 100644 +--- a/xen/arch/x86/genapic/x2apic.c ++++ b/xen/arch/x86/genapic/x2apic.c +@@ -19,6 +19,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/hpet.c b/xen/arch/x86/hpet.c +index 57f68fa81b..ae99993d90 100644 +--- a/xen/arch/x86/hpet.c ++++ b/xen/arch/x86/hpet.c +@@ -11,6 +11,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/hvm/asid.c b/xen/arch/x86/hvm/asid.c +index 9d3c671a5f..8e00a28443 100644 +--- a/xen/arch/x86/hvm/asid.c ++++ b/xen/arch/x86/hvm/asid.c +@@ -18,6 +18,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c +index ea99417f08..2fee569a5f 100644 +--- a/xen/arch/x86/hvm/hvm.c ++++ b/xen/arch/x86/hvm/hvm.c +@@ -35,6 +35,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/hvm/quirks.c b/xen/arch/x86/hvm/quirks.c +index 881c6b99d2..54cc66c382 100644 +--- a/xen/arch/x86/hvm/quirks.c ++++ b/xen/arch/x86/hvm/quirks.c +@@ -19,6 +19,7 @@ + #include + 
#include + #include ++#include + #include + + s8 __read_mostly hvm_port80_allowed = -1; +diff -u a/xen/arch/x86/hvm/viridian.c b/xen/arch/x86/hvm/viridian.c +--- a/xen/arch/x86/hvm/viridian.c ++++ b/xen/arch/x86/hvm/viridian.c +@@ -9,6 +9,7 @@ + * for more information. + */ + ++#include + #include + #include + #include +diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c +index 634d1946d3..65445afeb0 100644 +--- a/xen/arch/x86/hvm/vmx/vmcs.c ++++ b/xen/arch/x86/hvm/vmx/vmcs.c +@@ -18,6 +18,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c +index b262d38a7c..35c8402ea0 100644 +--- a/xen/arch/x86/hvm/vmx/vmx.c ++++ b/xen/arch/x86/hvm/vmx/vmx.c +@@ -17,6 +17,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/io_apic.c b/xen/arch/x86/io_apic.c +index 4125ea0c0c..e98e08e9c8 100644 +--- a/xen/arch/x86/io_apic.c ++++ b/xen/arch/x86/io_apic.c +@@ -24,6 +24,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/irq.c b/xen/arch/x86/irq.c +index 310ac00a60..cc2eb8e925 100644 +--- a/xen/arch/x86/irq.c ++++ b/xen/arch/x86/irq.c +@@ -10,6 +10,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/microcode.c b/xen/arch/x86/microcode.c +index 71e881b243..c0fb690f79 100644 +--- a/xen/arch/x86/microcode.c ++++ b/xen/arch/x86/microcode.c +@@ -26,6 +26,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index f50c065af3..a05a713276 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -103,6 +103,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c +index 49cc138362..def13f657b 100644 +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -27,6 +27,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c +index df97ce0c72..c85cf9f85a 100644 +--- a/xen/arch/x86/msi.c ++++ b/xen/arch/x86/msi.c +@@ -14,6 +14,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/nmi.c b/xen/arch/x86/nmi.c +index e26121a737..a5c6bdd0ce 100644 +--- a/xen/arch/x86/nmi.c ++++ b/xen/arch/x86/nmi.c +@@ -16,6 +16,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c +index 7e1f563012..6ef15b34d5 100644 +--- a/xen/arch/x86/numa.c ++++ b/xen/arch/x86/numa.c +@@ -11,6 +11,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/oprofile/nmi_int.c b/xen/arch/x86/oprofile/nmi_int.c +index 3dfb8fef93..8f97f7522c 100644 +--- a/xen/arch/x86/oprofile/nmi_int.c ++++ b/xen/arch/x86/oprofile/nmi_int.c +@@ -15,6 +15,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/psr.c b/xen/arch/x86/psr.c +index 8bf1c23751..d7f8864651 100644 +--- a/xen/arch/x86/psr.c ++++ b/xen/arch/x86/psr.c +@@ -16,6 +16,7 @@ + #include + #include + #include ++#include + #include + #include + +diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c +index 4da0b2afff..c3473b9a47 100644 +--- a/xen/arch/x86/pv/domain.c ++++ b/xen/arch/x86/pv/domain.c +@@ -7,6 +7,7 @@ + #include + #include + 
#include ++#include + #include + + #include +diff --git a/xen/arch/x86/pv/shim.c b/xen/arch/x86/pv/shim.c +index 7a898fdbe5..76fb380100 100644 +--- a/xen/arch/x86/pv/shim.c ++++ b/xen/arch/x86/pv/shim.c +@@ -23,6 +23,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c +index 0223967b24..e50e1f86b3 100644 +--- a/xen/arch/x86/setup.c ++++ b/xen/arch/x86/setup.c +@@ -1,6 +1,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/shutdown.c b/xen/arch/x86/shutdown.c +index 005c0bf4fa..acef033143 100644 +--- a/xen/arch/x86/shutdown.c ++++ b/xen/arch/x86/shutdown.c +@@ -6,6 +6,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c +index aa632bdcee..20f562902b 100644 +--- a/xen/arch/x86/spec_ctrl.c ++++ b/xen/arch/x86/spec_ctrl.c +@@ -19,6 +19,7 @@ + #include + #include + #include ++#include + #include + + #include +diff --git a/xen/arch/x86/tboot.c b/xen/arch/x86/tboot.c +index 5020c4ad49..8c232270b4 100644 +--- a/xen/arch/x86/tboot.c ++++ b/xen/arch/x86/tboot.c +@@ -1,6 +1,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c +index f6b26f8883..cf3e51fb5e 100644 +--- a/xen/arch/x86/time.c ++++ b/xen/arch/x86/time.c +@@ -14,6 +14,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c +index 97499a0c79..56067f85d1 100644 +--- a/xen/arch/x86/traps.c ++++ b/xen/arch/x86/traps.c +@@ -30,6 +30,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/tsx.c b/xen/arch/x86/tsx.c +index 2d202a0d4e..39e483640a 100644 +--- a/xen/arch/x86/tsx.c ++++ b/xen/arch/x86/tsx.c +@@ -1,4 +1,5 @@ + #include ++#include + #include + + /* +diff --git a/xen/arch/x86/x86_64/mmconfig-shared.c b/xen/arch/x86/x86_64/mmconfig-shared.c +index cc08b52a35..0c55c7206e 100644 +--- a/xen/arch/x86/x86_64/mmconfig-shared.c ++++ b/xen/arch/x86/x86_64/mmconfig-shared.c +@@ -14,6 +14,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c +index 243495ed07..078419a171 100644 +--- a/xen/arch/x86/xstate.c ++++ b/xen/arch/x86/xstate.c +@@ -5,6 +5,7 @@ + * + */ + ++#include + #include + #include + #include +diff --git a/xen/common/core_parking.c b/xen/common/core_parking.c +index a6669e1766..411106c675 100644 +--- a/xen/common/core_parking.c ++++ b/xen/common/core_parking.c +@@ -19,6 +19,7 @@ + #include + #include + #include ++#include + #include + #include + +diff --git a/xen/common/domain.c b/xen/common/domain.c +index dfea575b49..0ae04d5bb9 100644 +--- a/xen/common/domain.c ++++ b/xen/common/domain.c +@@ -9,6 +9,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/efi/boot.c b/xen/common/efi/boot.c +index bf7bb95999..b9f461505c 100644 +--- a/xen/common/efi/boot.c ++++ b/xen/common/efi/boot.c +@@ -11,6 +11,7 @@ + #include + #include + #include ++#include + #include + #include + #if EFI_PAGE_SIZE != PAGE_SIZE +diff --git a/xen/common/gdbstub.c b/xen/common/gdbstub.c +index 6234834a20..848c1f4327 100644 +--- a/xen/common/gdbstub.c ++++ b/xen/common/gdbstub.c +@@ -40,6 +40,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git 
a/xen/common/grant_table.c b/xen/common/grant_table.c +index 5536d282b9..2ecf38dfbe 100644 +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -28,6 +28,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/kernel.c b/xen/common/kernel.c +index 760917dab5..22941cec94 100644 +--- a/xen/common/kernel.c ++++ b/xen/common/kernel.c +@@ -7,6 +7,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/kexec.c b/xen/common/kexec.c +index a262cc5a18..9af7de4df3 100644 +--- a/xen/common/kexec.c ++++ b/xen/common/kexec.c +@@ -12,6 +12,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/memory.c b/xen/common/memory.c +index c7d2bac452..ecc7e64334 100644 +--- a/xen/common/memory.c ++++ b/xen/common/memory.c +@@ -10,6 +10,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c +index 919a270587..97902d42c1 100644 +--- a/xen/common/page_alloc.c ++++ b/xen/common/page_alloc.c +@@ -126,6 +126,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/rcupdate.c b/xen/common/rcupdate.c +index cb712c8690..91d4ad0fd8 100644 +--- a/xen/common/rcupdate.c ++++ b/xen/common/rcupdate.c +@@ -34,6 +34,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff -u a/xen/common/sched_credit.c b/xen/common/sched_credit.c +--- a/xen/common/sched_credit.c ++++ b/xen/common/sched_credit.c +@@ -9,6 +9,7 @@ + */ + + #include ++#include + #include + #include + #include +diff -u a/xen/common/sched_credit2.c b/xen/common/sched_credit2.c +--- a/xen/common/sched_credit2.c ++++ b/xen/common/sched_credit2.c +@@ -11,6 +11,7 @@ + */ + + #include ++#include + #include + #include + #include +diff -u a/xen/common/schedule.c b/xen/common/schedule.c +--- a/xen/common/schedule.c ++++ b/xen/common/schedule.c +@@ -15,6 +15,7 @@ + + #ifndef COMPAT + #include ++#include + #include + #include + #include +diff --git a/xen/common/shutdown.c b/xen/common/shutdown.c +index 2ed4d62214..912593915b 100644 +--- a/xen/common/shutdown.c ++++ b/xen/common/shutdown.c +@@ -1,5 +1,6 @@ + #include + #include ++#include + #include + #include + #include +diff --git a/xen/common/timer.c b/xen/common/timer.c +index 645206a989..1bb265ceea 100644 +--- a/xen/common/timer.c ++++ b/xen/common/timer.c +@@ -10,6 +10,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff -u a/xen/common/tmem_xen.c b/xen/common/tmem_xen.c +--- a/xen/common/tmem_xen.c ++++ b/xen/common/tmem_xen.c +@@ -13,6 +13,7 @@ + #include + #include + #include ++#include + + bool __read_mostly opt_tmem; + boolean_param("tmem", opt_tmem); +diff --git a/xen/common/trace.c b/xen/common/trace.c +index ebfc735b31..a2a389a1c7 100644 +--- a/xen/common/trace.c ++++ b/xen/common/trace.c +@@ -19,6 +19,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/acpi/apei/hest.c b/xen/drivers/acpi/apei/hest.c +index 70734ab0e2..c5f3aaab7c 100644 +--- a/xen/drivers/acpi/apei/hest.c ++++ b/xen/drivers/acpi/apei/hest.c +@@ -30,6 +30,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/acpi/tables.c b/xen/drivers/acpi/tables.c +index b890b73901..8c2a279e18 100644 +--- a/xen/drivers/acpi/tables.c ++++ b/xen/drivers/acpi/tables.c +@@ -24,6 +24,7 @@ + + 
#include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/char/arm-uart.c b/xen/drivers/char/arm-uart.c +index 627746ba89..eeb9ceefc0 100644 +--- a/xen/drivers/char/arm-uart.c ++++ b/xen/drivers/char/arm-uart.c +@@ -21,6 +21,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/char/console.c b/xen/drivers/char/console.c +index 4bcbbfa7d6..913ae1b66a 100644 +--- a/xen/drivers/char/console.c ++++ b/xen/drivers/char/console.c +@@ -15,6 +15,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/char/ehci-dbgp.c b/xen/drivers/char/ehci-dbgp.c +index b6e155d17b..c893d246de 100644 +--- a/xen/drivers/char/ehci-dbgp.c ++++ b/xen/drivers/char/ehci-dbgp.c +@@ -8,6 +8,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c +index aa87c57fc9..bd048f307a 100644 +--- a/xen/drivers/char/ns16550.c ++++ b/xen/drivers/char/ns16550.c +@@ -11,6 +11,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/char/serial.c b/xen/drivers/char/serial.c +index 88cd876790..5ecba0af33 100644 +--- a/xen/drivers/char/serial.c ++++ b/xen/drivers/char/serial.c +@@ -9,6 +9,7 @@ + #include + #include + #include ++#include + #include + #include + +diff --git a/xen/drivers/cpufreq/cpufreq.c b/xen/drivers/cpufreq/cpufreq.c +index 2d716abf72..e630a47419 100644 +--- a/xen/drivers/cpufreq/cpufreq.c ++++ b/xen/drivers/cpufreq/cpufreq.c +@@ -31,6 +31,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/amd/iommu_acpi.c b/xen/drivers/passthrough/amd/iommu_acpi.c +index 9fbc343c58..6c5f8e46ec 100644 +--- a/xen/drivers/passthrough/amd/iommu_acpi.c ++++ b/xen/drivers/passthrough/amd/iommu_acpi.c +@@ -19,6 +19,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/iommu.c b/xen/drivers/passthrough/iommu.c +index 4e19cf56cc..9d421e06de 100644 +--- a/xen/drivers/passthrough/iommu.c ++++ b/xen/drivers/passthrough/iommu.c +@@ -17,6 +17,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c +index 65d1d457ff..5660f7e1c2 100644 +--- a/xen/drivers/passthrough/pci.c ++++ b/xen/drivers/passthrough/pci.c +@@ -21,6 +21,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/vtd/dmar.c b/xen/drivers/passthrough/vtd/dmar.c +index f36b99ae37..1784f91b34 100644 +--- a/xen/drivers/passthrough/vtd/dmar.c ++++ b/xen/drivers/passthrough/vtd/dmar.c +@@ -24,6 +24,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c +index 4dadd9523f..5594270678 100644 +--- a/xen/drivers/passthrough/vtd/quirks.c ++++ b/xen/drivers/passthrough/vtd/quirks.c +@@ -17,6 +17,7 @@ + */ + + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/vtd/x86/vtd.c b/xen/drivers/passthrough/vtd/x86/vtd.c +index ff456e1e70..f379afac03 100644 +--- a/xen/drivers/passthrough/vtd/x86/vtd.c ++++ b/xen/drivers/passthrough/vtd/x86/vtd.c +@@ -17,6 +17,7 @@ + * Copyright (C) Weidong Han + */ + ++#include + #include + #include + #include +diff --git a/xen/drivers/passthrough/x86/ats.c 
b/xen/drivers/passthrough/x86/ats.c +index 3eea7f89fc..8ae0eae4a2 100644 +--- a/xen/drivers/passthrough/x86/ats.c ++++ b/xen/drivers/passthrough/x86/ats.c +@@ -12,6 +12,7 @@ + * this program; If not, see . + */ + ++#include + #include + #include + #include +diff --git a/xen/drivers/video/vesa.c b/xen/drivers/video/vesa.c +index fd2cb1312d..2c1bbd9278 100644 +--- a/xen/drivers/video/vesa.c ++++ b/xen/drivers/video/vesa.c +@@ -6,6 +6,7 @@ + + #include + #include ++#include + #include + #include + #include +diff --git a/xen/drivers/video/vga.c b/xen/drivers/video/vga.c +index 666f2e2509..b7f04d0d97 100644 +--- a/xen/drivers/video/vga.c ++++ b/xen/drivers/video/vga.c +@@ -7,6 +7,7 @@ + #include + #include + #include ++#include + #include + #include + #include +diff --git a/xen/include/xen/init.h b/xen/include/xen/init.h +index d0f3a007d0..bfe789e93f 100644 +--- a/xen/include/xen/init.h ++++ b/xen/include/xen/init.h +@@ -71,120 +71,6 @@ typedef void (*exitcall_t)(void); + void do_presmp_initcalls(void); + void do_initcalls(void); + +-/* +- * Used for kernel command line parameter setup +- */ +-struct kernel_param { +- const char *name; +- enum { +- OPT_STR, +- OPT_UINT, +- OPT_BOOL, +- OPT_SIZE, +- OPT_CUSTOM +- } type; +- unsigned int len; +- union { +- void *var; +- int (*func)(const char *); +- } par; +-}; +- +-extern const struct kernel_param __setup_start[], __setup_end[]; +-extern const struct kernel_param __param_start[], __param_end[]; +- +-#define __dataparam __used_section(".data.param") +- +-#define __param(att) static const att \ +- __attribute__((__aligned__(sizeof(void *)))) struct kernel_param +- +-#define __setup_str static const __initconst \ +- __attribute__((__aligned__(1))) char +-#define __kparam __param(__initsetup) +- +-#define custom_param(_name, _var) \ +- __setup_str __setup_str_##_var[] = _name; \ +- __kparam __setup_##_var = \ +- { .name = __setup_str_##_var, \ +- .type = OPT_CUSTOM, \ +- .par.func = _var } +-#define boolean_param(_name, _var) \ +- __setup_str __setup_str_##_var[] = _name; \ +- __kparam __setup_##_var = \ +- { .name = __setup_str_##_var, \ +- .type = OPT_BOOL, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +-#define integer_param(_name, _var) \ +- __setup_str __setup_str_##_var[] = _name; \ +- __kparam __setup_##_var = \ +- { .name = __setup_str_##_var, \ +- .type = OPT_UINT, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +-#define size_param(_name, _var) \ +- __setup_str __setup_str_##_var[] = _name; \ +- __kparam __setup_##_var = \ +- { .name = __setup_str_##_var, \ +- .type = OPT_SIZE, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +-#define string_param(_name, _var) \ +- __setup_str __setup_str_##_var[] = _name; \ +- __kparam __setup_##_var = \ +- { .name = __setup_str_##_var, \ +- .type = OPT_STR, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +- +-#define __rtparam __param(__dataparam) +- +-#define custom_runtime_only_param(_name, _var) \ +- __rtparam __rtpar_##_var = \ +- { .name = _name, \ +- .type = OPT_CUSTOM, \ +- .par.func = _var } +-#define boolean_runtime_only_param(_name, _var) \ +- __rtparam __rtpar_##_var = \ +- { .name = _name, \ +- .type = OPT_BOOL, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +-#define integer_runtime_only_param(_name, _var) \ +- __rtparam __rtpar_##_var = \ +- { .name = _name, \ +- .type = OPT_UINT, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +-#define size_runtime_only_param(_name, _var) \ +- __rtparam __rtpar_##_var = \ +- { .name = _name, \ +- .type = OPT_SIZE, \ +- .len = sizeof(_var), 
\ +- .par.var = &_var } +-#define string_runtime_only_param(_name, _var) \ +- __rtparam __rtpar_##_var = \ +- { .name = _name, \ +- .type = OPT_STR, \ +- .len = sizeof(_var), \ +- .par.var = &_var } +- +-#define custom_runtime_param(_name, _var) \ +- custom_param(_name, _var); \ +- custom_runtime_only_param(_name, _var) +-#define boolean_runtime_param(_name, _var) \ +- boolean_param(_name, _var); \ +- boolean_runtime_only_param(_name, _var) +-#define integer_runtime_param(_name, _var) \ +- integer_param(_name, _var); \ +- integer_runtime_only_param(_name, _var) +-#define size_runtime_param(_name, _var) \ +- size_param(_name, _var); \ +- size_runtime_only_param(_name, _var) +-#define string_runtime_param(_name, _var) \ +- string_param(_name, _var); \ +- string_runtime_only_param(_name, _var) +- + #endif /* __ASSEMBLY__ */ + + #ifdef CONFIG_LATE_HWDOM +diff --git a/xen/include/xen/param.h b/xen/include/xen/param.h +new file mode 100644 +index 0000000000..75471eb4ad +--- /dev/null ++++ b/xen/include/xen/param.h +@@ -0,0 +1,119 @@ ++#ifndef _XEN_PARAM_H ++#define _XEN_PARAM_H ++ ++#include ++ ++/* ++ * Used for kernel command line parameter setup ++ */ ++struct kernel_param { ++ const char *name; ++ enum { ++ OPT_STR, ++ OPT_UINT, ++ OPT_BOOL, ++ OPT_SIZE, ++ OPT_CUSTOM ++ } type; ++ unsigned int len; ++ union { ++ void *var; ++ int (*func)(const char *); ++ } par; ++}; ++ ++extern const struct kernel_param __setup_start[], __setup_end[]; ++extern const struct kernel_param __param_start[], __param_end[]; ++ ++#define __dataparam __used_section(".data.param") ++ ++#define __param(att) static const att \ ++ __attribute__((__aligned__(sizeof(void *)))) struct kernel_param ++ ++#define __setup_str static const __initconst \ ++ __attribute__((__aligned__(1))) char ++#define __kparam __param(__initsetup) ++ ++#define custom_param(_name, _var) \ ++ __setup_str __setup_str_##_var[] = _name; \ ++ __kparam __setup_##_var = \ ++ { .name = __setup_str_##_var, \ ++ .type = OPT_CUSTOM, \ ++ .par.func = _var } ++#define boolean_param(_name, _var) \ ++ __setup_str __setup_str_##_var[] = _name; \ ++ __kparam __setup_##_var = \ ++ { .name = __setup_str_##_var, \ ++ .type = OPT_BOOL, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define integer_param(_name, _var) \ ++ __setup_str __setup_str_##_var[] = _name; \ ++ __kparam __setup_##_var = \ ++ { .name = __setup_str_##_var, \ ++ .type = OPT_UINT, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define size_param(_name, _var) \ ++ __setup_str __setup_str_##_var[] = _name; \ ++ __kparam __setup_##_var = \ ++ { .name = __setup_str_##_var, \ ++ .type = OPT_SIZE, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define string_param(_name, _var) \ ++ __setup_str __setup_str_##_var[] = _name; \ ++ __kparam __setup_##_var = \ ++ { .name = __setup_str_##_var, \ ++ .type = OPT_STR, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define __rtparam __param(__dataparam) ++ ++#define custom_runtime_only_param(_name, _var) \ ++ __rtparam __rtpar_##_var = \ ++ { .name = _name, \ ++ .type = OPT_CUSTOM, \ ++ .par.func = _var } ++#define boolean_runtime_only_param(_name, _var) \ ++ __rtparam __rtpar_##_var = \ ++ { .name = _name, \ ++ .type = OPT_BOOL, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define integer_runtime_only_param(_name, _var) \ ++ __rtparam __rtpar_##_var = \ ++ { .name = _name, \ ++ .type = OPT_UINT, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define size_runtime_only_param(_name, _var) \ ++ __rtparam __rtpar_##_var = \ ++ { 
.name = _name, \ ++ .type = OPT_SIZE, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++#define string_runtime_only_param(_name, _var) \ ++ __rtparam __rtpar_##_var = \ ++ { .name = _name, \ ++ .type = OPT_STR, \ ++ .len = sizeof(_var), \ ++ .par.var = &_var } ++ ++#define custom_runtime_param(_name, _var) \ ++ custom_param(_name, _var); \ ++ custom_runtime_only_param(_name, _var) ++#define boolean_runtime_param(_name, _var) \ ++ boolean_param(_name, _var); \ ++ boolean_runtime_only_param(_name, _var) ++#define integer_runtime_param(_name, _var) \ ++ integer_param(_name, _var); \ ++ integer_runtime_only_param(_name, _var) ++#define size_runtime_param(_name, _var) \ ++ size_param(_name, _var); \ ++ size_runtime_only_param(_name, _var) ++#define string_runtime_param(_name, _var) \ ++ string_param(_name, _var); \ ++ string_runtime_only_param(_name, _var) ++ ++#endif /* _XEN_PARAM_H */ +diff --git a/xen/xsm/flask/flask_op.c b/xen/xsm/flask/flask_op.c +index 1c4decc6cd..a5f2b104e2 100644 +--- a/xen/xsm/flask/flask_op.c ++++ b/xen/xsm/flask/flask_op.c +@@ -13,6 +13,7 @@ + #include + #include + #include ++#include + + #include + +diff --git a/xen/xsm/xsm_core.c b/xen/xsm/xsm_core.c +index a319df253d..5eab21e1b1 100644 +--- a/xen/xsm/xsm_core.c ++++ b/xen/xsm/xsm_core.c +@@ -13,6 +13,7 @@ + #include + #include + #include ++#include + + #include + #include +-- +2.25.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa312-4.11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa312-4.11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa312-4.11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa312-4.11.patch 2022-05-21 13:43:08.000000000 +0100 @@ -0,0 +1,98 @@ +From 35cb81a9967a061df7d0eb8c387395f1c1984454 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Thu, 19 Dec 2019 08:12:21 +0000 +Subject: [PATCH] xen/arm: Place a speculation barrier sequence following an + eret instruction + +Some CPUs can speculate past an ERET instruction and potentially perform +speculative accesses to memory before processing the exception return. +Since the register state is often controlled by lower privilege level +at the point of an ERET, this could potentially be used as part of a +side-channel attack. + +Newer CPUs may implement a new SB barrier instruction which acts +as an architected speculation barrier. For current CPUs, the sequence +DSB; ISB is known to prevent speculation. + +The latter sequence is heavier than SB but it would never be executed +(this is speculation after all!). + +Introduce a new macro 'sb' that could be used when a speculation barrier +is required. For now it is using dsb; isb but this could easily be +updated to cater SB in the future. + +This is XSA-312. 
+ +Signed-off-by: Julien Grall +--- + xen/arch/arm/arm32/entry.S | 2 ++ + xen/arch/arm/arm64/entry.S | 3 +++ + xen/include/asm-arm/macros.h | 9 +++++++++ + 3 files changed, 14 insertions(+) + +diff --git a/xen/arch/arm/arm32/entry.S b/xen/arch/arm/arm32/entry.S +index 16d9f93653..464c8b8645 100644 +--- a/xen/arch/arm/arm32/entry.S ++++ b/xen/arch/arm/arm32/entry.S +@@ -1,4 +1,5 @@ + #include ++#include + #include + #include + #include +@@ -379,6 +380,7 @@ return_to_hypervisor: + add sp, #(UREGS_SP_usr - UREGS_sp); /* SP, LR, SPSR, PC */ + clrex + eret ++ sb + + /* + * struct vcpu *__context_switch(struct vcpu *prev, struct vcpu *next) +diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S +index 12df95e901..a42c51e489 100644 +--- a/xen/arch/arm/arm64/entry.S ++++ b/xen/arch/arm/arm64/entry.S +@@ -2,6 +2,7 @@ + #include + #include + #include ++#include + #include + + /* +@@ -288,6 +289,7 @@ guest_sync: + */ + mov x1, xzr + eret ++ sb + + 1: + /* +@@ -413,6 +415,7 @@ return_from_trap: + ldr lr, [sp], #(UREGS_SPSR_el1 - UREGS_LR) /* CPSR, PC, SP, LR */ + + eret ++ sb + + /* + * This function is used to check pending virtual SError in the gap of +diff --git a/xen/include/asm-arm/macros.h b/xen/include/asm-arm/macros.h +index 5d837cb38b..539f613ee5 100644 +--- a/xen/include/asm-arm/macros.h ++++ b/xen/include/asm-arm/macros.h +@@ -13,4 +13,13 @@ + # error "unknown ARM variant" + #endif + ++ /* ++ * Speculative barrier ++ * XXX: Add support for the 'sb' instruction ++ */ ++ .macro sb ++ dsb nsh ++ isb ++ .endm ++ + #endif /* __ASM_ARM_MACROS_H */ +-- +2.17.1 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-1.patch 2022-05-21 14:18:38.000000000 +0100 @@ -0,0 +1,26 @@ +From: Jan Beulich +Subject: xenoprof: clear buffer intended to be shared with guests + +alloc_xenheap_pages() making use of MEMF_no_scrub is fine for Xen +internally used allocations, but buffers allocated to be shared with +(unpriviliged) guests need to be zapped of their prior content. + +This is part of XSA-313. + +Reported-by: Ilja Van Sprundel +Signed-off-by: Jan Beulich +Reviewed-by: Andrew Cooper +Reviewed-by: Wei Liu + +--- a/xen/common/xenoprof.c ++++ b/xen/common/xenoprof.c +@@ -253,6 +253,9 @@ static int alloc_xenoprof_struct( + return -ENOMEM; + } + ++ for ( i = 0; i < npages; ++i ) ++ clear_page(d->xenoprof->rawbuf + i * PAGE_SIZE); ++ + d->xenoprof->npages = npages; + d->xenoprof->nbuf = nvcpu; + d->xenoprof->bufsize = bufsize; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa313-2.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,132 @@ +From: Jan Beulich +Subject: xenoprof: limit consumption of shared buffer data + +Since a shared buffer can be written to by the guest, we may only read +the head and tail pointers from there (all other fields should only ever +be written to). Furthermore, for any particular operation the two values +must be read exactly once, with both checks and consumption happening +with the thus read values. 
(The backtrace related xenoprof_buf_space() +use in xenoprof_log_event() is an exception: The values used there get +re-checked by every subsequent xenoprof_add_sample().) + +Since that code needed touching, also fix the double increment of the +lost samples count in case the backtrace related xenoprof_add_sample() +invocation in xenoprof_log_event() fails. + +Where code is being touched anyway, add const as appropriate, but take +the opportunity to entirely drop the now unused domain parameter of +xenoprof_buf_space(). + +This is part of XSA-313. + +Reported-by: Ilja Van Sprundel +Signed-off-by: Jan Beulich +Reviewed-by: George Dunlap +Reviewed-by: Wei Liu + +--- a/xen/common/xenoprof.c ++++ b/xen/common/xenoprof.c +@@ -479,25 +479,22 @@ static int add_passive_list(XEN_GUEST_HA + + + /* Get space in the buffer */ +-static int xenoprof_buf_space(struct domain *d, xenoprof_buf_t * buf, int size) ++static int xenoprof_buf_space(int head, int tail, int size) + { +- int head, tail; +- +- head = xenoprof_buf(d, buf, event_head); +- tail = xenoprof_buf(d, buf, event_tail); +- + return ((tail > head) ? 0 : size) + tail - head - 1; + } + + /* Check for space and add a sample. Return 1 if successful, 0 otherwise. */ +-static int xenoprof_add_sample(struct domain *d, xenoprof_buf_t *buf, ++static int xenoprof_add_sample(const struct domain *d, ++ const struct xenoprof_vcpu *v, + uint64_t eip, int mode, int event) + { ++ xenoprof_buf_t *buf = v->buffer; + int head, tail, size; + + head = xenoprof_buf(d, buf, event_head); + tail = xenoprof_buf(d, buf, event_tail); +- size = xenoprof_buf(d, buf, event_size); ++ size = v->event_size; + + /* make sure indexes in shared buffer are sane */ + if ( (head < 0) || (head >= size) || (tail < 0) || (tail >= size) ) +@@ -506,7 +503,7 @@ static int xenoprof_add_sample(struct do + return 0; + } + +- if ( xenoprof_buf_space(d, buf, size) > 0 ) ++ if ( xenoprof_buf_space(head, tail, size) > 0 ) + { + xenoprof_buf(d, buf, event_log[head].eip) = eip; + xenoprof_buf(d, buf, event_log[head].mode) = mode; +@@ -530,7 +527,6 @@ static int xenoprof_add_sample(struct do + int xenoprof_add_trace(struct vcpu *vcpu, uint64_t pc, int mode) + { + struct domain *d = vcpu->domain; +- xenoprof_buf_t *buf = d->xenoprof->vcpu[vcpu->vcpu_id].buffer; + + /* Do not accidentally write an escape code due to a broken frame. */ + if ( pc == XENOPROF_ESCAPE_CODE ) +@@ -539,7 +535,8 @@ int xenoprof_add_trace(struct vcpu *vcpu + return 0; + } + +- return xenoprof_add_sample(d, buf, pc, mode, 0); ++ return xenoprof_add_sample(d, &d->xenoprof->vcpu[vcpu->vcpu_id], ++ pc, mode, 0); + } + + void xenoprof_log_event(struct vcpu *vcpu, const struct cpu_user_regs *regs, +@@ -570,17 +567,22 @@ void xenoprof_log_event(struct vcpu *vcp + /* Provide backtrace if requested. 
*/ + if ( backtrace_depth > 0 ) + { +- if ( (xenoprof_buf_space(d, buf, v->event_size) < 2) || +- !xenoprof_add_sample(d, buf, XENOPROF_ESCAPE_CODE, mode, +- XENOPROF_TRACE_BEGIN) ) ++ if ( xenoprof_buf_space(xenoprof_buf(d, buf, event_head), ++ xenoprof_buf(d, buf, event_tail), ++ v->event_size) < 2 ) + { + xenoprof_buf(d, buf, lost_samples)++; + lost_samples++; + return; + } ++ ++ /* xenoprof_add_sample() will increment lost_samples on failure */ ++ if ( !xenoprof_add_sample(d, v, XENOPROF_ESCAPE_CODE, mode, ++ XENOPROF_TRACE_BEGIN) ) ++ return; + } + +- if ( xenoprof_add_sample(d, buf, pc, mode, event) ) ++ if ( xenoprof_add_sample(d, v, pc, mode, event) ) + { + if ( is_active(vcpu->domain) ) + active_samples++; +--- a/xen/include/xen/xenoprof.h ++++ b/xen/include/xen/xenoprof.h +@@ -61,12 +61,12 @@ struct xenoprof { + + #ifndef CONFIG_COMPAT + #define XENOPROF_COMPAT(x) 0 +-#define xenoprof_buf(d, b, field) ((b)->field) ++#define xenoprof_buf(d, b, field) ACCESS_ONCE((b)->field) + #else + #define XENOPROF_COMPAT(x) ((x)->is_compat) +-#define xenoprof_buf(d, b, field) (*(!(d)->xenoprof->is_compat ? \ +- &(b)->native.field : \ +- &(b)->compat.field)) ++#define xenoprof_buf(d, b, field) ACCESS_ONCE(*(!(d)->xenoprof->is_compat \ ++ ? &(b)->native.field \ ++ : &(b)->compat.field)) + #endif + + struct domain; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa314-4.13.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa314-4.13.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa314-4.13.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa314-4.13.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,121 @@ +From ab49f005f7d01d4004d76f2e295d31aca7d4f93a Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Thu, 20 Feb 2020 20:54:40 +0000 +Subject: [PATCH] xen/rwlock: Add missing memory barrier in the unlock path of + rwlock + +The rwlock unlock paths are using atomic_sub() to release the lock. +However the implementation of atomic_sub() rightfully doesn't contain a +memory barrier. On Arm, this means a processor is allowed to re-order +the memory access with the preceeding access. + +In other words, the unlock may be seen by another processor before all +the memory accesses within the "critical" section. + +The rwlock paths already contains barrier indirectly, but they are not +very useful without the counterpart in the unlock paths. + +The memory barriers are not necessary on x86 because loads/stores are +not re-ordered with lock instructions. + +So add arch_lock_release_barrier() in the unlock paths that will only +add memory barrier on Arm. + +Take the opportunity to document each lock paths explaining why a +barrier is not necessary. + +This is XSA-314. + +Signed-off-by: Julien Grall +Reviewed-by: Jan Beulich +Reviewed-by: Stefano Stabellini + +--- + xen/include/xen/rwlock.h | 29 ++++++++++++++++++++++++++++- + 1 file changed, 28 insertions(+), 1 deletion(-) + +diff --git a/xen/include/xen/rwlock.h b/xen/include/xen/rwlock.h +index 3dfea1ac2a..516486306f 100644 +--- a/xen/include/xen/rwlock.h ++++ b/xen/include/xen/rwlock.h +@@ -48,6 +48,10 @@ static inline int _read_trylock(rwlock_t *lock) + if ( likely(!(cnts & _QW_WMASK)) ) + { + cnts = (u32)atomic_add_return(_QR_BIAS, &lock->cnts); ++ /* ++ * atomic_add_return() is a full barrier so no need for an ++ * arch_lock_acquire_barrier(). 
++ */ + if ( likely(!(cnts & _QW_WMASK)) ) + return 1; + atomic_sub(_QR_BIAS, &lock->cnts); +@@ -64,11 +68,19 @@ static inline void _read_lock(rwlock_t *lock) + u32 cnts; + + cnts = atomic_add_return(_QR_BIAS, &lock->cnts); ++ /* ++ * atomic_add_return() is a full barrier so no need for an ++ * arch_lock_acquire_barrier(). ++ */ + if ( likely(!(cnts & _QW_WMASK)) ) + return; + + /* The slowpath will decrement the reader count, if necessary. */ + queue_read_lock_slowpath(lock); ++ /* ++ * queue_read_lock_slowpath() is using spinlock and therefore is a ++ * full barrier. So no need for an arch_lock_acquire_barrier(). ++ */ + } + + static inline void _read_lock_irq(rwlock_t *lock) +@@ -92,6 +104,7 @@ static inline unsigned long _read_lock_irqsave(rwlock_t *lock) + */ + static inline void _read_unlock(rwlock_t *lock) + { ++ arch_lock_release_barrier(); + /* + * Atomically decrement the reader count + */ +@@ -121,11 +134,20 @@ static inline int _rw_is_locked(rwlock_t *lock) + */ + static inline void _write_lock(rwlock_t *lock) + { +- /* Optimize for the unfair lock case where the fair flag is 0. */ ++ /* ++ * Optimize for the unfair lock case where the fair flag is 0. ++ * ++ * atomic_cmpxchg() is a full barrier so no need for an ++ * arch_lock_acquire_barrier(). ++ */ + if ( atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0 ) + return; + + queue_write_lock_slowpath(lock); ++ /* ++ * queue_write_lock_slowpath() is using spinlock and therefore is a ++ * full barrier. So no need for an arch_lock_acquire_barrier(). ++ */ + } + + static inline void _write_lock_irq(rwlock_t *lock) +@@ -157,11 +179,16 @@ static inline int _write_trylock(rwlock_t *lock) + if ( unlikely(cnts) ) + return 0; + ++ /* ++ * atomic_cmpxchg() is a full barrier so no need for an ++ * arch_lock_acquire_barrier(). ++ */ + return likely(atomic_cmpxchg(&lock->cnts, 0, _QW_LOCKED) == 0); + } + + static inline void _write_unlock(rwlock_t *lock) + { ++ arch_lock_release_barrier(); + /* + * If the writer field is atomic, it can be cleared directly. + * Otherwise, an atomic subtraction will be used to clear it. +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa316-xen.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa316-xen.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa316-xen.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa316-xen.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,30 @@ +From: Ross Lagerwall +Subject: xen/gnttab: Fix error path in map_grant_ref() + +Part of XSA-295 (c/s 863e74eb2cffb) inadvertently re-positioned the brackets, +changing the logic. If the _set_status() call fails, the grant_map hypercall +would fail with a status of 1 (rc != GNTST_okay) instead of the expected +negative GNTST_* error. + +This error path can be taken due to bad guest state, and causes net/blk-back +in Linux to crash. + +This is XSA-316. 
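The one-character regression fixed here is easy to miss in the hunk above, so a minimal self-contained illustration of the two bracketings follows; the helper is made up, and only the GNTST_* values correspond to Xen's (GNTST_okay is 0, GNTST_general_error is -1):

    /* Standalone sketch, not Xen code: compile and run to see the difference. */
    #include <stdio.h>

    #define GNTST_okay            0
    #define GNTST_general_error (-1)

    static int set_status_stub(void) { return GNTST_general_error; /* simulate failure */ }

    int main(void)
    {
        int rc;

        /* Pre-fix bracketing: rc is assigned the *comparison* result, i.e. 1. */
        if ( (rc = set_status_stub() != GNTST_okay) )
            printf("broken: rc = %d\n", rc);    /* prints 1  */

        /* Fixed bracketing: rc is assigned the callee's status, i.e. -1. */
        if ( (rc = set_status_stub()) != GNTST_okay )
            printf("fixed:  rc = %d\n", rc);    /* prints -1 */

        return 0;
    }
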
+ +Signed-off-by: Ross Lagerwall +Reviewed-by: Andrew Cooper +Reviewed-by: Julien Grall + +diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c +index 9fd6e60416..4b5344dc21 100644 +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -1031,7 +1031,7 @@ map_grant_ref( + { + if ( (rc = _set_status(shah, status, rd, rgt->gt_version, act, + op->flags & GNTMAP_readonly, 1, +- ld->domain_id) != GNTST_okay) ) ++ ld->domain_id)) != GNTST_okay ) + goto act_release_out; + + if ( !act->pin ) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa317.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa317.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa317.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa317.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,50 @@ +From aeb46e92f915f19a61d5a8a1f4b696793f64e6fb Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Thu, 19 Mar 2020 13:17:31 +0000 +Subject: [PATCH] xen/common: event_channel: Don't ignore error in + get_free_port() + +Currently, get_free_port() is assuming that the port has been allocated +when evtchn_allocate_port() is not return -EBUSY. + +However, the function may return an error when: + - We exhausted all the event channels. This can happen if the limit + configured by the administrator for the guest ('max_event_channels' + in xl cfg) is higher than the ABI used by the guest. For instance, + if the guest is using 2L, the limit should not be higher than 4095. + - We cannot allocate memory (e.g Xen has not more memory). + +Users of get_free_port() (such as EVTCHNOP_alloc_unbound) will validly +assuming the port was valid and will next call evtchn_from_port(). This +will result to a crash as the memory backing the event channel structure +is not present. + +Fixes: 368ae9a05fe ("xen/pvshim: forward evtchn ops between L0 Xen and L2 DomU") +Signed-off-by: Julien Grall +Reviewed-by: Jan Beulich +--- + xen/common/event_channel.c | 8 ++++---- + 1 file changed, 4 insertions(+), 4 deletions(-) + +diff --git a/xen/common/event_channel.c b/xen/common/event_channel.c +index e86e2bfab0..a8d182b584 100644 +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -195,10 +195,10 @@ static int get_free_port(struct domain *d) + { + int rc = evtchn_allocate_port(d, port); + +- if ( rc == -EBUSY ) +- continue; +- +- return port; ++ if ( rc == 0 ) ++ return port; ++ else if ( rc != -EBUSY ) ++ return rc; + } + + return -ENOSPC; +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa318.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa318.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa318.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa318.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,39 @@ +From: Jan Beulich +Subject: gnttab: fix GNTTABOP_copy continuation handling + +The XSA-226 fix was flawed - the backwards transformation on rc was done +too early, causing a continuation to not get invoked when the need for +preemption was determined at the very first iteration of the request. +This in particular means that all of the status fields of the individual +operations would be left untouched, i.e. set to whatever the caller may +or may not have initialized them to. + +This is part of XSA-318. 
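To make the ordering issue described above concrete, here is a small standalone sketch (invented names, none of the real hypercall plumbing): do_batch() reports how many items are still left, and converting that into an "items done" count before the continuation check makes preemption at the very first item indistinguishable from full completion:

    /* Standalone sketch, not Xen code. */
    #include <stdio.h>

    /* Pretend the batch was preempted before completing a single item. */
    static int do_batch(int count) { return count; /* items still left */ }

    static void dispatch(int count, int adjust_early)
    {
        int rc = do_batch(count);

        if ( adjust_early && rc > 0 )
            rc = count - rc;   /* pre-fix ordering: rc becomes "items done", here 0 */

        if ( rc > 0 )          /* the continuation check at the "out:" label */
            printf("continuation created, %d item(s) still to do\n", rc);
        else
            printf("treated as complete (rc = %d); per-op status fields never written\n", rc);
    }

    int main(void)
    {
        dispatch(8, 1);   /* broken: converted too early, falsely reports completion */
        dispatch(8, 0);   /* fixed: conversion deferred, continuation is requested  */
        return 0;
    }
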
+ +Reported-by: Pawel Wieczorkiewicz +Tested-by: Pawel Wieczorkiewicz +Signed-off-by: Jan Beulich +Reviewed-by: Juergen Gross + +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -3576,8 +3576,7 @@ do_grant_table_op( + rc = gnttab_copy(copy, count); + if ( rc > 0 ) + { +- rc = count - rc; +- guest_handle_add_offset(copy, rc); ++ guest_handle_add_offset(copy, count - rc); + uop = guest_handle_cast(copy, void); + } + break; +@@ -3644,6 +3643,9 @@ do_grant_table_op( + out: + if ( rc > 0 || opaque_out != 0 ) + { ++ /* Adjust rc, see gnttab_copy() for why this is needed. */ ++ if ( cmd == GNTTABOP_copy ) ++ rc = count - rc; + ASSERT(rc < count); + ASSERT((opaque_out & GNTTABOP_CMD_MASK) == 0); + rc = hypercall_create_continuation(__HYPERVISOR_grant_table_op, "ihi", diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa319.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa319.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa319.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa319.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,27 @@ +From: Jan Beulich +Subject: x86/shadow: correct an inverted conditional in dirty VRAM tracking + +This originally was "mfn_x(mfn) == INVALID_MFN". Make it like this +again, taking the opportunity to also drop the unnecessary nearby +braces. + +This is XSA-319. + +Fixes: 246a5a3377c2 ("xen: Use a typesafe to define INVALID_MFN") +Signed-off-by: Jan Beulich +Reviewed-by: Andrew Cooper + +--- a/xen/arch/x86/mm/shadow/common.c ++++ b/xen/arch/x86/mm/shadow/common.c +@@ -3252,10 +3252,8 @@ int shadow_track_dirty_vram(struct domai + int dirty = 0; + paddr_t sl1ma = dirty_vram->sl1ma[i]; + +- if ( !mfn_eq(mfn, INVALID_MFN) ) +- { ++ if ( mfn_eq(mfn, INVALID_MFN) ) + dirty = 1; +- } + else + { + page = mfn_to_page(mfn); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-1.patch 2022-06-16 10:08:34.000000000 +0100 @@ -0,0 +1,133 @@ +From: Andrew Cooper +Subject: x86/spec-ctrl: CPUID/MSR definitions for Special Register Buffer Data Sampling + +This is part of XSA-320 / CVE-2020-0543 + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich +Acked-by: Wei Liu + +diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown +index 194615bfc5..9be18ac99f 100644 +--- a/docs/misc/xen-command-line.markdown ++++ b/docs/misc/xen-command-line.markdown +@@ -489,10 +489,10 @@ accounting for hardware capabilities as enumerated via CPUID. + + Currently accepted: + +-The Speculation Control hardware features `md-clear`, `ibrsb`, `stibp`, `ibpb`, +-`l1d-flush` and `ssbd` are used by default if available and applicable. They can +-be ignored, e.g. `no-ibrsb`, at which point Xen won't use them itself, and +-won't offer them to guests. ++The Speculation Control hardware features `srbds-ctrl`, `md-clear`, `ibrsb`, ++`stibp`, `ibpb`, `l1d-flush` and `ssbd` are used by default if available and ++applicable. They can be ignored, e.g. `no-ibrsb`, at which point Xen won't ++use them itself, and won't offer them to guests. 
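[Editorial note] To see the enumeration this patch wires up, a small userspace probe can read CPUID.(EAX=7,ECX=0):EDX, where bit 9 is SRBDS_CTRL and bit 10 is MD_CLEAR. This assumes GCC/clang's <cpuid.h>; it is not part of the patch.

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    if ( !__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) )
    {
        puts("CPUID leaf 7 not available");
        return 1;
    }

    printf("SRBDS_CTRL (MCU_OPT_CTRL MSR enumerated): %s\n",
           (edx & (1u << 9)) ? "yes" : "no");
    printf("MD_CLEAR: %s\n", (edx & (1u << 10)) ? "yes" : "no");
    return 0;
}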
+ + ### cpuid\_mask\_cpu (AMD only) + > `= fam_0f_rev_c | fam_0f_rev_d | fam_0f_rev_e | fam_0f_rev_f | fam_0f_rev_g | fam_10_rev_b | fam_10_rev_c | fam_11_rev_b` +diff --git a/tools/libxl/libxl_cpuid.c b/tools/libxl/libxl_cpuid.c +index 5a1702d703..1235c8b91e 100644 +--- a/tools/libxl/libxl_cpuid.c ++++ b/tools/libxl/libxl_cpuid.c +@@ -202,6 +202,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list *cpuid, const char* str) + + {"avx512-4vnniw",0x00000007, 0, CPUID_REG_EDX, 2, 1}, + {"avx512-4fmaps",0x00000007, 0, CPUID_REG_EDX, 3, 1}, ++ {"srbds-ctrl", 0x00000007, 0, CPUID_REG_EDX, 9, 1}, + {"md-clear", 0x00000007, 0, CPUID_REG_EDX, 10, 1}, + {"ibrsb", 0x00000007, 0, CPUID_REG_EDX, 26, 1}, + {"stibp", 0x00000007, 0, CPUID_REG_EDX, 27, 1}, +diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c +index 4c9af6b7f0..8fb54c3001 100644 +--- a/tools/misc/xen-cpuid.c ++++ b/tools/misc/xen-cpuid.c +@@ -143,6 +143,7 @@ static const char *str_7d0[32] = + { + [ 2] = "avx512_4vnniw", [ 3] = "avx512_4fmaps", + ++ /* 8 */ [ 9] = "srbds-ctrl", + [10] = "md-clear", + /* 12 */ [13] = "tsx-force-abort", + +diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c +index 04aefa555d..b8e5b6fe67 100644 +--- a/xen/arch/x86/cpuid.c ++++ b/xen/arch/x86/cpuid.c +@@ -58,6 +58,11 @@ static int __init parse_xen_cpuid(const char *s) + if ( !val ) + setup_clear_cpu_cap(X86_FEATURE_SSBD); + } ++ else if ( (val = parse_boolean("srbds-ctrl", s, ss)) >= 0 ) ++ { ++ if ( !val ) ++ setup_clear_cpu_cap(X86_FEATURE_SRBDS_CTRL); ++ } + else + rc = -EINVAL; + +diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c +index ccb316c547..256e58d82b 100644 +--- a/xen/arch/x86/msr.c ++++ b/xen/arch/x86/msr.c +@@ -154,6 +154,7 @@ int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val) + /* Write-only */ + case MSR_TSX_FORCE_ABORT: + case MSR_TSX_CTRL: ++ case MSR_MCU_OPT_CTRL: + /* Not offered to guests. */ + goto gp_fault; + +@@ -243,6 +244,7 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val) + /* Read-only */ + case MSR_TSX_FORCE_ABORT: + case MSR_TSX_CTRL: ++ case MSR_MCU_OPT_CTRL: + /* Not offered to guests. */ + goto gp_fault; + +diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c +index ab196b156d..94ab8dd786 100644 +--- a/xen/arch/x86/spec_ctrl.c ++++ b/xen/arch/x86/spec_ctrl.c +@@ -365,12 +365,13 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps) + printk("Speculative mitigation facilities:\n"); + + /* Hardware features which pertain to speculative mitigations. */ +- printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n", ++ printk(" Hardware features:%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n", + (_7d0 & cpufeat_mask(X86_FEATURE_IBRSB)) ? " IBRS/IBPB" : "", + (_7d0 & cpufeat_mask(X86_FEATURE_STIBP)) ? " STIBP" : "", + (_7d0 & cpufeat_mask(X86_FEATURE_L1D_FLUSH)) ? " L1D_FLUSH" : "", + (_7d0 & cpufeat_mask(X86_FEATURE_SSBD)) ? " SSBD" : "", + (_7d0 & cpufeat_mask(X86_FEATURE_MD_CLEAR)) ? " MD_CLEAR" : "", ++ (_7d0 & cpufeat_mask(X86_FEATURE_SRBDS_CTRL)) ? " SRBDS_CTRL" : "", + (e8b & cpufeat_mask(X86_FEATURE_IBPB)) ? " IBPB" : "", + (caps & ARCH_CAPS_IBRS_ALL) ? " IBRS_ALL" : "", + (caps & ARCH_CAPS_RDCL_NO) ? 
" RDCL_NO" : "", +diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h +index 1761a01f1f..480d1d8102 100644 +--- a/xen/include/asm-x86/msr-index.h ++++ b/xen/include/asm-x86/msr-index.h +@@ -177,6 +177,9 @@ + #define MSR_IA32_VMX_TRUE_ENTRY_CTLS 0x490 + #define MSR_IA32_VMX_VMFUNC 0x491 + ++#define MSR_MCU_OPT_CTRL 0x00000123 ++#define MCU_OPT_CTRL_RNGDS_MITG_DIS (_AC(1, ULL) << 0) ++ + /* K7/K8 MSRs. Not complete. See the architecture manual for a more + complete list. */ + #define MSR_K7_EVNTSEL0 0xc0010000 +diff --git a/xen/include/public/arch-x86/cpufeatureset.h b/xen/include/public/arch-x86/cpufeatureset.h +index a14d8a7013..9d210e74a0 100644 +--- a/xen/include/public/arch-x86/cpufeatureset.h ++++ b/xen/include/public/arch-x86/cpufeatureset.h +@@ -242,6 +242,7 @@ XEN_CPUFEATURE(IBPB, 8*32+12) /*A IBPB support only (no IBRS, used by + /* Intel-defined CPU features, CPUID level 0x00000007:0.edx, word 9 */ + XEN_CPUFEATURE(AVX512_4VNNIW, 9*32+ 2) /*A AVX512 Neural Network Instructions */ + XEN_CPUFEATURE(AVX512_4FMAPS, 9*32+ 3) /*A AVX512 Multiply Accumulation Single Precision */ ++XEN_CPUFEATURE(SRBDS_CTRL, 9*32+ 9) /* MSR_MCU_OPT_CTRL and RNGDS_MITG_DIS. */ + XEN_CPUFEATURE(MD_CLEAR, 9*32+10) /*A VERW clears microarchitectural buffers */ + XEN_CPUFEATURE(TSX_FORCE_ABORT, 9*32+13) /* MSR_TSX_FORCE_ABORT.RTM_ABORT */ + XEN_CPUFEATURE(IBRSB, 9*32+26) /*A IBRS and IBPB support (used by Intel) */ diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-2.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,179 @@ +From: Andrew Cooper +Subject: x86/spec-ctrl: Mitigate the Special Register Buffer Data Sampling sidechannel + +See patch documentation and comments. + +This is part of XSA-320 / CVE-2020-0543 + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown +index 9be18ac99f..3356e59fee 100644 +--- a/docs/misc/xen-command-line.markdown ++++ b/docs/misc/xen-command-line.markdown +@@ -1858,7 +1858,7 @@ false disable the quirk workaround, which is also the default. + ### spec-ctrl (x86) + > `= List of [ , xen=, {pv,hvm,msr-sc,rsb,md-clear}=, + > bti-thunk=retpoline|lfence|jmp, {ibrs,ibpb,ssbd,eager-fpu, +-> l1d-flush}= ]` ++> l1d-flush,srb-lock}= ]` + + Controls for speculative execution sidechannel mitigations. By default, Xen + will pick the most appropriate mitigations based on compiled in support, +@@ -1930,6 +1930,12 @@ Irrespective of Xen's setting, the feature is virtualised for HVM guests to + use. By default, Xen will enable this mitigation on hardware believed to be + vulnerable to L1TF. + ++On hardware supporting SRBDS_CTRL, the `srb-lock=` option can be used to force ++or prevent Xen from protect the Special Register Buffer from leaking stale ++data. By default, Xen will enable this mitigation, except on parts where MDS ++is fixed and TAA is fixed/mitigated (in which case, there is believed to be no ++way for an attacker to obtain the stale data). 
++ + ### sync\_console + > `= ` + +diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c +index 4c12794809..30e1bd5cd3 100644 +--- a/xen/arch/x86/acpi/power.c ++++ b/xen/arch/x86/acpi/power.c +@@ -266,6 +266,9 @@ static int enter_state(u32 state) + ci->spec_ctrl_flags |= (default_spec_ctrl_flags & SCF_ist_wrmsr); + spec_ctrl_exit_idle(ci); + ++ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) ) ++ wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl); ++ + done: + spin_debug_enable(); + local_irq_restore(flags); +diff --git a/xen/arch/x86/smpboot.c b/xen/arch/x86/smpboot.c +index 0887806e85..d24d215946 100644 +--- a/xen/arch/x86/smpboot.c ++++ b/xen/arch/x86/smpboot.c +@@ -369,12 +369,14 @@ void start_secondary(void *unused) + microcode_resume_cpu(cpu); + + /* +- * If MSR_SPEC_CTRL is available, apply Xen's default setting and discard +- * any firmware settings. Note: MSR_SPEC_CTRL may only become available +- * after loading microcode. ++ * If any speculative control MSRs are available, apply Xen's default ++ * settings. Note: These MSRs may only become available after loading ++ * microcode. + */ + if ( boot_cpu_has(X86_FEATURE_IBRSB) ) + wrmsrl(MSR_SPEC_CTRL, default_xen_spec_ctrl); ++ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) ) ++ wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl); + + tsx_init(); /* Needs microcode. May change HLE/RTM feature bits. */ + +diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c +index 94ab8dd786..a306d10c34 100644 +--- a/xen/arch/x86/spec_ctrl.c ++++ b/xen/arch/x86/spec_ctrl.c +@@ -63,6 +63,9 @@ static unsigned int __initdata l1d_maxphysaddr; + static bool __initdata cpu_has_bug_msbds_only; /* => minimal HT impact. */ + static bool __initdata cpu_has_bug_mds; /* Any other M{LP,SB,FB}DS combination. */ + ++static int8_t __initdata opt_srb_lock = -1; ++uint64_t __read_mostly default_xen_mcu_opt_ctrl; ++ + static int __init parse_bti(const char *s) + { + const char *ss; +@@ -166,6 +169,7 @@ static int __init parse_spec_ctrl(const char *s) + opt_ibpb = false; + opt_ssbd = false; + opt_l1d_flush = 0; ++ opt_srb_lock = 0; + } + else if ( val > 0 ) + rc = -EINVAL; +@@ -231,6 +235,8 @@ static int __init parse_spec_ctrl(const char *s) + opt_eager_fpu = val; + else if ( (val = parse_boolean("l1d-flush", s, ss)) >= 0 ) + opt_l1d_flush = val; ++ else if ( (val = parse_boolean("srb-lock", s, ss)) >= 0 ) ++ opt_srb_lock = val; + else + rc = -EINVAL; + +@@ -394,7 +400,7 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps) + "\n"); + + /* Settings for Xen's protection, irrespective of guests. */ +- printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s, Other:%s%s%s\n", ++ printk(" Xen settings: BTI-Thunk %s, SPEC_CTRL: %s%s%s, Other:%s%s%s%s\n", + thunk == THUNK_NONE ? "N/A" : + thunk == THUNK_RETPOLINE ? "RETPOLINE" : + thunk == THUNK_LFENCE ? "LFENCE" : +@@ -405,6 +411,8 @@ static void __init print_details(enum ind_thunk thunk, uint64_t caps) + (default_xen_spec_ctrl & SPEC_CTRL_SSBD) ? " SSBD+" : " SSBD-", + !(caps & ARCH_CAPS_TSX_CTRL) ? "" : + (opt_tsx & 1) ? " TSX+" : " TSX-", ++ !boot_cpu_has(X86_FEATURE_SRBDS_CTRL) ? "" : ++ opt_srb_lock ? " SRB_LOCK+" : " SRB_LOCK-", + opt_ibpb ? " IBPB" : "", + opt_l1d_flush ? " L1D_FLUSH" : "", + opt_md_clear_pv || opt_md_clear_hvm ? 
" VERW" : ""); +@@ -1196,6 +1204,34 @@ void __init init_speculation_mitigations(void) + tsx_init(); + } + ++ /* Calculate suitable defaults for MSR_MCU_OPT_CTRL */ ++ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) ) ++ { ++ uint64_t val; ++ ++ rdmsrl(MSR_MCU_OPT_CTRL, val); ++ ++ /* ++ * On some SRBDS-affected hardware, it may be safe to relax srb-lock ++ * by default. ++ * ++ * On parts which enumerate MDS_NO and not TAA_NO, TSX is the only way ++ * to access the Fill Buffer. If TSX isn't available (inc. SKU ++ * reasons on some models), or TSX is explicitly disabled, then there ++ * is no need for the extra overhead to protect RDRAND/RDSEED. ++ */ ++ if ( opt_srb_lock == -1 && ++ (caps & (ARCH_CAPS_MDS_NO|ARCH_CAPS_TAA_NO)) == ARCH_CAPS_MDS_NO && ++ (!cpu_has_hle || ((caps & ARCH_CAPS_TSX_CTRL) && opt_tsx == 0)) ) ++ opt_srb_lock = 0; ++ ++ val &= ~MCU_OPT_CTRL_RNGDS_MITG_DIS; ++ if ( !opt_srb_lock ) ++ val |= MCU_OPT_CTRL_RNGDS_MITG_DIS; ++ ++ default_xen_mcu_opt_ctrl = val; ++ } ++ + print_details(thunk, caps); + + /* +@@ -1227,6 +1263,9 @@ void __init init_speculation_mitigations(void) + + wrmsrl(MSR_SPEC_CTRL, bsp_delay_spec_ctrl ? 0 : default_xen_spec_ctrl); + } ++ ++ if ( boot_cpu_has(X86_FEATURE_SRBDS_CTRL) ) ++ wrmsrl(MSR_MCU_OPT_CTRL, default_xen_mcu_opt_ctrl); + } + + static void __init __maybe_unused build_assertions(void) +diff --git a/xen/include/asm-x86/spec_ctrl.h b/xen/include/asm-x86/spec_ctrl.h +index 333d180b7e..bf10d2ce5c 100644 +--- a/xen/include/asm-x86/spec_ctrl.h ++++ b/xen/include/asm-x86/spec_ctrl.h +@@ -46,6 +46,8 @@ extern int8_t opt_pv_l1tf_hwdom, opt_pv_l1tf_domu; + */ + extern paddr_t l1tf_addr_mask, l1tf_safe_maddr; + ++extern uint64_t default_xen_mcu_opt_ctrl; ++ + static inline void init_shadow_spec_ctrl_state(void) + { + struct cpu_info *info = get_cpu_info(); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-3.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-3.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-3.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa320-4.11-3.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,57 @@ +From: Andrew Cooper +Subject: x86/spec-ctrl: Allow the RDRAND/RDSEED features to be hidden + +RDRAND/RDSEED can be hidden using cpuid= to mitigate SRBDS if microcode +isn't available. + +This is part of XSA-320 / CVE-2020-0543. + +Signed-off-by: Andrew Cooper +Acked-by: Julien Grall + +diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown +index 3356e59fee..ac397e7de0 100644 +--- a/docs/misc/xen-command-line.markdown ++++ b/docs/misc/xen-command-line.markdown +@@ -487,12 +487,18 @@ choice of `dom0-kernel` is deprecated and not supported by all Dom0 kernels. + This option allows for fine tuning of the facilities Xen will use, after + accounting for hardware capabilities as enumerated via CPUID. + ++Unless otherwise noted, options only have any effect in their negative form, ++to hide the named feature(s). Ignoring a feature using this mechanism will ++cause Xen not to use the feature, nor offer them as usable to guests. ++ + Currently accepted: + + The Speculation Control hardware features `srbds-ctrl`, `md-clear`, `ibrsb`, + `stibp`, `ibpb`, `l1d-flush` and `ssbd` are used by default if available and +-applicable. They can be ignored, e.g. `no-ibrsb`, at which point Xen won't +-use them itself, and won't offer them to guests. ++applicable. They can all be ignored. 
++ ++`rdrand` and `rdseed` can be ignored, as a mitigation to XSA-320 / ++CVE-2020-0543. + + ### cpuid\_mask\_cpu (AMD only) + > `= fam_0f_rev_c | fam_0f_rev_d | fam_0f_rev_e | fam_0f_rev_f | fam_0f_rev_g | fam_10_rev_b | fam_10_rev_c | fam_11_rev_b` +diff --git a/xen/arch/x86/cpuid.c b/xen/arch/x86/cpuid.c +index b8e5b6fe67..78d08dbb32 100644 +--- a/xen/arch/x86/cpuid.c ++++ b/xen/arch/x86/cpuid.c +@@ -63,6 +63,16 @@ static int __init parse_xen_cpuid(const char *s) + if ( !val ) + setup_clear_cpu_cap(X86_FEATURE_SRBDS_CTRL); + } ++ else if ( (val = parse_boolean("rdrand", s, ss)) >= 0 ) ++ { ++ if ( !val ) ++ setup_clear_cpu_cap(X86_FEATURE_RDRAND); ++ } ++ else if ( (val = parse_boolean("rdseed", s, ss)) >= 0 ) ++ { ++ if ( !val ) ++ setup_clear_cpu_cap(X86_FEATURE_RDSEED); ++ } + else + rc = -EINVAL; + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-1.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,31 @@ +From: Jan Beulich +Subject: vtd: improve IOMMU TLB flush + +Do not limit PSI flushes to order 0 pages, in order to avoid doing a +full TLB flush if the passed in page has an order greater than 0 and +is aligned. Should increase the performance of IOMMU TLB flushes when +dealing with page orders greater than 0. + +This is part of XSA-321. + +Signed-off-by: Jan Beulich + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -612,13 +612,14 @@ static int __must_check iommu_flush_iotl + if ( iommu_domid == -1 ) + continue; + +- if ( page_count != 1 || gfn == gfn_x(INVALID_GFN) ) ++ if ( !page_count || (page_count & (page_count - 1)) || ++ gfn == gfn_x(INVALID_GFN) || !IS_ALIGNED(gfn, page_count) ) + rc = iommu_flush_iotlb_dsi(iommu, iommu_domid, + 0, flush_dev_iotlb); + else + rc = iommu_flush_iotlb_psi(iommu, iommu_domid, + (paddr_t)gfn << PAGE_SHIFT_4K, +- PAGE_ORDER_4K, ++ get_order_from_pages(page_count), + !dma_old_pte_present, + flush_dev_iotlb); + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-2.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,175 @@ +From: +Subject: vtd: prune (and rename) cache flush functions + +Rename __iommu_flush_cache to iommu_sync_cache and remove +iommu_flush_cache_page. Also remove the iommu_flush_cache_entry +wrapper and just use iommu_sync_cache instead. Note the _entry suffix +was meaningless as the wrapper was already taking a size parameter in +bytes. While there also constify the addr parameter. + +No functional change intended. + +This is part of XSA-321. 
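[Editorial note] The condition added by xsa321-4.11-1 above is compact; a standalone restatement (not Xen code) of when a page-selective flush is usable:

#include <stdbool.h>
#include <stdio.h>

static bool can_use_psi(unsigned long gfn, unsigned long page_count)
{
    if ( page_count == 0 || (page_count & (page_count - 1)) )
        return false;                     /* zero or not a power of two */
    return (gfn & (page_count - 1)) == 0; /* gfn aligned to the region */
}

int main(void)
{
    printf("%d\n", can_use_psi(0x1000, 16));  /* 1: aligned power of two */
    printf("%d\n", can_use_psi(0x1001, 16));  /* 0: misaligned gfn */
    printf("%d\n", can_use_psi(0x1000, 24));  /* 0: not a power of two */
    return 0;
}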
+ +Reviewed-by: Jan Beulich + +--- a/xen/drivers/passthrough/vtd/extern.h ++++ b/xen/drivers/passthrough/vtd/extern.h +@@ -37,8 +37,7 @@ void disable_qinval(struct iommu *iommu) + int enable_intremap(struct iommu *iommu, int eim); + void disable_intremap(struct iommu *iommu); + +-void iommu_flush_cache_entry(void *addr, unsigned int size); +-void iommu_flush_cache_page(void *addr, unsigned long npages); ++void iommu_sync_cache(const void *addr, unsigned int size); + int iommu_alloc(struct acpi_drhd_unit *drhd); + void iommu_free(struct acpi_drhd_unit *drhd); + +--- a/xen/drivers/passthrough/vtd/intremap.c ++++ b/xen/drivers/passthrough/vtd/intremap.c +@@ -231,7 +231,7 @@ static void free_remap_entry(struct iomm + iremap_entries, iremap_entry); + + update_irte(iommu, iremap_entry, &new_ire, false); +- iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry)); ++ iommu_sync_cache(iremap_entry, sizeof(*iremap_entry)); + iommu_flush_iec_index(iommu, 0, index); + + unmap_vtd_domain_page(iremap_entries); +@@ -403,7 +403,7 @@ static int ioapic_rte_to_remap_entry(str + } + + update_irte(iommu, iremap_entry, &new_ire, !init); +- iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry)); ++ iommu_sync_cache(iremap_entry, sizeof(*iremap_entry)); + iommu_flush_iec_index(iommu, 0, index); + + unmap_vtd_domain_page(iremap_entries); +@@ -694,7 +694,7 @@ static int msi_msg_to_remap_entry( + update_irte(iommu, iremap_entry, &new_ire, msi_desc->irte_initialized); + msi_desc->irte_initialized = true; + +- iommu_flush_cache_entry(iremap_entry, sizeof(*iremap_entry)); ++ iommu_sync_cache(iremap_entry, sizeof(*iremap_entry)); + iommu_flush_iec_index(iommu, 0, index); + + unmap_vtd_domain_page(iremap_entries); +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -158,7 +158,8 @@ static void __init free_intel_iommu(stru + } + + static int iommus_incoherent; +-static void __iommu_flush_cache(void *addr, unsigned int size) ++ ++void iommu_sync_cache(const void *addr, unsigned int size) + { + int i; + static unsigned int clflush_size = 0; +@@ -173,16 +174,6 @@ static void __iommu_flush_cache(void *ad + cacheline_flush((char *)addr + i); + } + +-void iommu_flush_cache_entry(void *addr, unsigned int size) +-{ +- __iommu_flush_cache(addr, size); +-} +- +-void iommu_flush_cache_page(void *addr, unsigned long npages) +-{ +- __iommu_flush_cache(addr, PAGE_SIZE * npages); +-} +- + /* Allocate page table, return its machine address */ + u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages) + { +@@ -207,7 +198,7 @@ u64 alloc_pgtable_maddr(struct acpi_drhd + vaddr = __map_domain_page(cur_pg); + memset(vaddr, 0, PAGE_SIZE); + +- iommu_flush_cache_page(vaddr, 1); ++ iommu_sync_cache(vaddr, PAGE_SIZE); + unmap_domain_page(vaddr); + cur_pg++; + } +@@ -242,7 +233,7 @@ static u64 bus_to_context_maddr(struct i + } + set_root_value(*root, maddr); + set_root_present(*root); +- iommu_flush_cache_entry(root, sizeof(struct root_entry)); ++ iommu_sync_cache(root, sizeof(struct root_entry)); + } + maddr = (u64) get_context_addr(*root); + unmap_vtd_domain_page(root_entries); +@@ -300,7 +291,7 @@ static u64 addr_to_dma_page_maddr(struct + */ + dma_set_pte_readable(*pte); + dma_set_pte_writable(*pte); +- iommu_flush_cache_entry(pte, sizeof(struct dma_pte)); ++ iommu_sync_cache(pte, sizeof(struct dma_pte)); + } + + if ( level == 2 ) +@@ -674,7 +665,7 @@ static int __must_check dma_pte_clear_on + + dma_clear_pte(*pte); + spin_unlock(&hd->arch.mapping_lock); +- 
iommu_flush_cache_entry(pte, sizeof(struct dma_pte)); ++ iommu_sync_cache(pte, sizeof(struct dma_pte)); + + if ( !this_cpu(iommu_dont_flush_iotlb) ) + rc = iommu_flush_iotlb_pages(domain, addr >> PAGE_SHIFT_4K, 1); +@@ -716,7 +707,7 @@ static void iommu_free_page_table(struct + iommu_free_pagetable(dma_pte_addr(*pte), next_level); + + dma_clear_pte(*pte); +- iommu_flush_cache_entry(pte, sizeof(struct dma_pte)); ++ iommu_sync_cache(pte, sizeof(struct dma_pte)); + } + + unmap_vtd_domain_page(pt_vaddr); +@@ -1449,7 +1440,7 @@ int domain_context_mapping_one( + context_set_address_width(*context, agaw); + context_set_fault_enable(*context); + context_set_present(*context); +- iommu_flush_cache_entry(context, sizeof(struct context_entry)); ++ iommu_sync_cache(context, sizeof(struct context_entry)); + spin_unlock(&iommu->lock); + + /* Context entry was previously non-present (with domid 0). */ +@@ -1602,7 +1593,7 @@ int domain_context_unmap_one( + + context_clear_present(*context); + context_clear_entry(*context); +- iommu_flush_cache_entry(context, sizeof(struct context_entry)); ++ iommu_sync_cache(context, sizeof(struct context_entry)); + + iommu_domid= domain_iommu_domid(domain, iommu); + if ( iommu_domid == -1 ) +@@ -1828,7 +1819,7 @@ static int __must_check intel_iommu_map_ + + *pte = new; + +- iommu_flush_cache_entry(pte, sizeof(struct dma_pte)); ++ iommu_sync_cache(pte, sizeof(struct dma_pte)); + spin_unlock(&hd->arch.mapping_lock); + unmap_vtd_domain_page(page); + +@@ -1862,7 +1853,7 @@ int iommu_pte_flush(struct domain *d, u6 + int iommu_domid; + int rc = 0; + +- iommu_flush_cache_entry(pte, sizeof(struct dma_pte)); ++ iommu_sync_cache(pte, sizeof(struct dma_pte)); + + for_each_drhd_unit ( drhd ) + { +@@ -2725,7 +2716,7 @@ static int __init intel_iommu_quarantine + dma_set_pte_addr(*pte, maddr); + dma_set_pte_readable(*pte); + } +- iommu_flush_cache_page(parent, 1); ++ iommu_sync_cache(parent, PAGE_SIZE); + + unmap_vtd_domain_page(parent); + parent = map_vtd_domain_page(maddr); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-3.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-3.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-3.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-3.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,82 @@ +From: +Subject: x86/iommu: introduce a cache sync hook + +The hook is only implemented for VT-d and it uses the already existing +iommu_sync_cache function present in VT-d code. The new hook is +added so that the cache can be flushed by code outside of VT-d when +using shared page tables. + +Note that alloc_pgtable_maddr must use the now locally defined +sync_cache function, because IOMMU ops are not yet setup the first +time the function gets called during IOMMU initialization. + +No functional change intended. + +This is part of XSA-321. 
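[Editorial note] The hook added below follows the common "optional ops callback" shape. A self-contained sketch with hypothetical types (struct cache_ops and the demo backend are not the Xen structures): the wrapper degrades to a no-op when the implementation is coherent and leaves the hook NULL.

#include <stdio.h>

struct cache_ops {
    void (*sync_cache)(const void *addr, unsigned int size);  /* may be NULL */
};

static void sync_cache_if_needed(const struct cache_ops *ops,
                                 const void *addr, unsigned int size)
{
    if ( ops->sync_cache )
        ops->sync_cache(addr, size);
}

static void demo_backend_sync(const void *addr, unsigned int size)
{
    (void)addr;
    printf("write back %u bytes\n", size);
}

int main(void)
{
    struct cache_ops coherent = { .sync_cache = NULL };
    struct cache_ops noncoherent = { .sync_cache = demo_backend_sync };
    int x = 42;

    sync_cache_if_needed(&coherent, &x, sizeof(x));     /* no-op */
    sync_cache_if_needed(&noncoherent, &x, sizeof(x));  /* flushes */
    return 0;
}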
+ +Reviewed-by: Jan Beulich + +--- a/xen/drivers/passthrough/vtd/extern.h ++++ b/xen/drivers/passthrough/vtd/extern.h +@@ -37,7 +37,6 @@ void disable_qinval(struct iommu *iommu) + int enable_intremap(struct iommu *iommu, int eim); + void disable_intremap(struct iommu *iommu); + +-void iommu_sync_cache(const void *addr, unsigned int size); + int iommu_alloc(struct acpi_drhd_unit *drhd); + void iommu_free(struct acpi_drhd_unit *drhd); + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -159,7 +159,7 @@ static void __init free_intel_iommu(stru + + static int iommus_incoherent; + +-void iommu_sync_cache(const void *addr, unsigned int size) ++static void sync_cache(const void *addr, unsigned int size) + { + int i; + static unsigned int clflush_size = 0; +@@ -198,7 +198,7 @@ u64 alloc_pgtable_maddr(struct acpi_drhd + vaddr = __map_domain_page(cur_pg); + memset(vaddr, 0, PAGE_SIZE); + +- iommu_sync_cache(vaddr, PAGE_SIZE); ++ sync_cache(vaddr, PAGE_SIZE); + unmap_domain_page(vaddr); + cur_pg++; + } +@@ -2760,6 +2760,7 @@ const struct iommu_ops intel_iommu_ops = + .iotlb_flush_all = iommu_flush_iotlb_all, + .get_reserved_device_memory = intel_iommu_get_reserved_device_memory, + .dump_p2m_table = vtd_dump_p2m_table, ++ .sync_cache = sync_cache, + }; + + /* +--- a/xen/include/asm-x86/iommu.h ++++ b/xen/include/asm-x86/iommu.h +@@ -98,6 +98,13 @@ extern bool untrusted_msi; + int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq, + const uint8_t gvec); + ++#define iommu_sync_cache(addr, size) ({ \ ++ const struct iommu_ops *ops = iommu_get_ops(); \ ++ \ ++ if ( ops->sync_cache ) \ ++ ops->sync_cache(addr, size); \ ++}) ++ + #endif /* !__ARCH_X86_IOMMU_H__ */ + /* + * Local variables: +--- a/xen/include/xen/iommu.h ++++ b/xen/include/xen/iommu.h +@@ -161,6 +161,7 @@ struct iommu_ops { + void (*update_ire_from_apic)(unsigned int apic, unsigned int reg, unsigned int value); + unsigned int (*read_apic_from_ire)(unsigned int apic, unsigned int reg); + int (*setup_hpet_msi)(struct msi_desc *); ++ void (*sync_cache)(const void *addr, unsigned int size); + #endif /* CONFIG_X86 */ + int __must_check (*suspend)(void); + void (*resume)(void); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-4.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-4.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-4.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-4.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,36 @@ +From: +Subject: vtd: don't assume addresses are aligned in sync_cache + +Current code in sync_cache assume that the address passed in is +aligned to a cache line size. Fix the code to support passing in +arbitrary addresses not necessarily aligned to a cache line size. + +This is part of XSA-321. 
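[Editorial note] A standalone restatement (not Xen code) of the alignment handling the hunk below introduces: round the start address down to a cache-line boundary so a buffer that straddles lines is covered end to end.

#include <stdint.h>
#include <stdio.h>

static unsigned int sync_range(const void *addr, unsigned int size,
                               uintptr_t line)
{
    const char *p = (const char *)((uintptr_t)addr & ~(line - 1));
    const char *end = (const char *)addr + size;
    unsigned int flushed = 0;

    for ( ; p < end; p += line, flushed++ )
        ;   /* cacheline_flush(p) would go here */
    return flushed;
}

int main(void)
{
    _Alignas(64) char buf[256];

    /* An 8-byte entry starting 4 bytes before a line boundary spans two
     * lines; flushing only at the start address (the pre-fix behaviour)
     * would miss the second one. */
    printf("lines touched: %u\n", sync_range(buf + 60, 8, 64));
    return 0;
}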
+ +Reviewed-by: Jan Beulich + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -161,8 +161,8 @@ static int iommus_incoherent; + + static void sync_cache(const void *addr, unsigned int size) + { +- int i; +- static unsigned int clflush_size = 0; ++ static unsigned long clflush_size = 0; ++ const void *end = addr + size; + + if ( !iommus_incoherent ) + return; +@@ -170,8 +170,9 @@ static void sync_cache(const void *addr, + if ( clflush_size == 0 ) + clflush_size = get_cache_line_size(); + +- for ( i = 0; i < size; i += clflush_size ) +- cacheline_flush((char *)addr + i); ++ addr -= (unsigned long)addr & (clflush_size - 1); ++ for ( ; addr < end; addr += clflush_size ) ++ cacheline_flush((char *)addr); + } + + /* Allocate page table, return its machine address */ diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-5.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-5.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-5.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-5.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,24 @@ +From: +Subject: x86/alternative: introduce alternative_2 + +It's based on alternative_io_2 without inputs or outputs but with an +added memory clobber. + +This is part of XSA-321. + +Acked-by: Jan Beulich + +--- a/xen/include/asm-x86/alternative.h ++++ b/xen/include/asm-x86/alternative.h +@@ -113,6 +113,11 @@ extern void alternative_instructions(voi + #define alternative(oldinstr, newinstr, feature) \ + asm volatile (ALTERNATIVE(oldinstr, newinstr, feature) : : : "memory") + ++#define alternative_2(oldinstr, newinstr1, feature1, newinstr2, feature2) \ ++ asm volatile (ALTERNATIVE_2(oldinstr, newinstr1, feature1, \ ++ newinstr2, feature2) \ ++ : : : "memory") ++ + /* + * Alternative inline assembly with input. + * diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-6.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-6.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-6.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-6.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,91 @@ +From: +Subject: vtd: optimize CPU cache sync + +Some VT-d IOMMUs are non-coherent, which requires a cache write back +in order for the changes made by the CPU to be visible to the IOMMU. +This cache write back was unconditionally done using clflush, but there are +other more efficient instructions to do so, hence implement support +for them using the alternative framework. + +This is part of XSA-321. + +Reviewed-by: Jan Beulich + +--- a/xen/drivers/passthrough/vtd/extern.h ++++ b/xen/drivers/passthrough/vtd/extern.h +@@ -63,7 +63,6 @@ int __must_check qinval_device_iotlb_syn + u16 did, u16 size, u64 addr); + + unsigned int get_cache_line_size(void); +-void cacheline_flush(char *); + void flush_all_cache(void); + + u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages); +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -31,6 +31,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -172,7 +173,42 @@ static void sync_cache(const void *addr, + + addr -= (unsigned long)addr & (clflush_size - 1); + for ( ; addr < end; addr += clflush_size ) +- cacheline_flush((char *)addr); ++/* ++ * The arguments to a macro must not include preprocessor directives. 
Doing so ++ * results in undefined behavior, so we have to create some defines here in ++ * order to avoid it. ++ */ ++#if defined(HAVE_AS_CLWB) ++# define CLWB_ENCODING "clwb %[p]" ++#elif defined(HAVE_AS_XSAVEOPT) ++# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */ ++#else ++# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */ ++#endif ++ ++#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr)) ++#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT) ++# define INPUT BASE_INPUT ++#else ++# define INPUT(addr) "a" (addr), BASE_INPUT(addr) ++#endif ++ /* ++ * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush ++ * + prefix than a clflush + nop, and hence the prefix is added instead ++ * of letting the alternative framework fill the gap by appending nops. ++ */ ++ alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]", ++ "data16 clflush %[p]", /* clflushopt */ ++ X86_FEATURE_CLFLUSHOPT, ++ CLWB_ENCODING, ++ X86_FEATURE_CLWB, /* no outputs */, ++ INPUT(addr)); ++#undef INPUT ++#undef BASE_INPUT ++#undef CLWB_ENCODING ++ ++ alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT, ++ "sfence", X86_FEATURE_CLWB); + } + + /* Allocate page table, return its machine address */ +--- a/xen/drivers/passthrough/vtd/x86/vtd.c ++++ b/xen/drivers/passthrough/vtd/x86/vtd.c +@@ -53,11 +53,6 @@ unsigned int get_cache_line_size(void) + return ((cpuid_ebx(1) >> 8) & 0xff) * 8; + } + +-void cacheline_flush(char * addr) +-{ +- clflush(addr); +-} +- + void flush_all_cache() + { + wbinvd(); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-7.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-7.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-7.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa321-4.11-7.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,164 @@ +From: +Subject: x86/ept: flush cache when modifying PTEs and sharing page tables + +Modifications made to the page tables by EPT code need to be written +to memory when the page tables are shared with the IOMMU, as Intel +IOMMUs can be non-coherent and thus require changes to be written to +memory in order to be visible to the IOMMU. + +In order to achieve this make sure data is written back to memory +after writing an EPT entry when the recalc bit is not set in +atomic_write_ept_entry. If such bit is set, the entry will be +adjusted and atomic_write_ept_entry will be called a second time +without the recalc bit set. Note that when splitting a super page the +new tables resulting of the split should also be written back. + +Failure to do so can allow devices behind the IOMMU access to the +stale super page, or cause coherency issues as changes made by the +processor to the page tables are not visible to the IOMMU. + +This allows to remove the VT-d specific iommu_pte_flush helper, since +the cache write back is now performed by atomic_write_ept_entry, and +hence iommu_iotlb_flush can be used to flush the IOMMU TLB. The newly +used method (iommu_iotlb_flush) can result in less flushes, since it +might sometimes be called rightly with 0 flags, in which case it +becomes a no-op. + +This is part of XSA-321. 
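[Editorial note] The rule the patch below enforces can be shown in isolation: when the CPU page tables are shared with a non-coherent IOMMU, every PTE write must be followed by a cache write-back before the device can safely walk the table. This sketch uses the real _mm_clflush intrinsic (SSE2, baseline on x86-64) as the lowest common denominator; the Xen patch instead routes the write-back through iommu_sync_cache(), which selects clflush/clflushopt/clwb via the alternatives framework.

#include <emmintrin.h>   /* _mm_clflush */
#include <stdbool.h>
#include <stdint.h>

static void write_shared_pte(uint64_t *pte, uint64_t val, bool pt_shared)
{
    *pte = val;                  /* visible to the CPU (and coherent IOMMUs) */
    if ( pt_shared )
        _mm_clflush(pte);        /* push it out for a non-coherent IOMMU
                                  * (assumes a naturally aligned entry) */
}

int main(void)
{
    uint64_t pte = 0;

    write_shared_pte(&pte, 0x1234, true);
    return pte == 0x1234 ? 0 : 1;
}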
+ +Reviewed-by: Jan Beulich + +--- a/xen/arch/x86/mm/p2m-ept.c ++++ b/xen/arch/x86/mm/p2m-ept.c +@@ -90,6 +90,19 @@ static int atomic_write_ept_entry(ept_en + + write_atomic(&entryptr->epte, new.epte); + ++ /* ++ * The recalc field on the EPT is used to signal either that a ++ * recalculation of the EMT field is required (which doesn't effect the ++ * IOMMU), or a type change. Type changes can only be between ram_rw, ++ * logdirty and ioreq_server: changes to/from logdirty won't work well with ++ * an IOMMU anyway, as IOMMU #PFs are not synchronous and will lead to ++ * aborts, and changes to/from ioreq_server are already fully flushed ++ * before returning to guest context (see ++ * XEN_DMOP_map_mem_type_to_ioreq_server). ++ */ ++ if ( !new.recalc && iommu_hap_pt_share ) ++ iommu_sync_cache(entryptr, sizeof(*entryptr)); ++ + if ( unlikely(oldmfn != mfn_x(INVALID_MFN)) ) + put_page(mfn_to_page(_mfn(oldmfn))); + +@@ -319,6 +332,9 @@ static bool_t ept_split_super_page(struc + break; + } + ++ if ( iommu_hap_pt_share ) ++ iommu_sync_cache(table, EPT_PAGETABLE_ENTRIES * sizeof(ept_entry_t)); ++ + unmap_domain_page(table); + + /* Even failed we should install the newly allocated ept page. */ +@@ -378,6 +394,9 @@ static int ept_next_level(struct p2m_dom + if ( !next ) + return GUEST_TABLE_MAP_FAILED; + ++ if ( iommu_hap_pt_share ) ++ iommu_sync_cache(next, EPT_PAGETABLE_ENTRIES * sizeof(ept_entry_t)); ++ + rc = atomic_write_ept_entry(ept_entry, e, next_level); + ASSERT(rc == 0); + } +@@ -875,7 +894,7 @@ out: + need_modify_vtd_table ) + { + if ( iommu_hap_pt_share ) +- rc = iommu_pte_flush(d, gfn, &ept_entry->epte, order, vtd_pte_present); ++ rc = iommu_flush_iotlb(d, gfn, vtd_pte_present, 1u << order); + else + { + if ( iommu_flags ) +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -612,10 +612,8 @@ static int __must_check iommu_flush_all( + return rc; + } + +-static int __must_check iommu_flush_iotlb(struct domain *d, +- unsigned long gfn, +- bool_t dma_old_pte_present, +- unsigned int page_count) ++int iommu_flush_iotlb(struct domain *d, unsigned long gfn, ++ bool dma_old_pte_present, unsigned int page_count) + { + struct domain_iommu *hd = dom_iommu(d); + struct acpi_drhd_unit *drhd; +@@ -1880,53 +1878,6 @@ static int __must_check intel_iommu_unma + return dma_pte_clear_one(d, (paddr_t)gfn << PAGE_SHIFT_4K); + } + +-int iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte, +- int order, int present) +-{ +- struct acpi_drhd_unit *drhd; +- struct iommu *iommu = NULL; +- struct domain_iommu *hd = dom_iommu(d); +- bool_t flush_dev_iotlb; +- int iommu_domid; +- int rc = 0; +- +- iommu_sync_cache(pte, sizeof(struct dma_pte)); +- +- for_each_drhd_unit ( drhd ) +- { +- iommu = drhd->iommu; +- if ( !test_bit(iommu->index, &hd->arch.iommu_bitmap) ) +- continue; +- +- flush_dev_iotlb = !!find_ats_dev_drhd(iommu); +- iommu_domid= domain_iommu_domid(d, iommu); +- if ( iommu_domid == -1 ) +- continue; +- +- rc = iommu_flush_iotlb_psi(iommu, iommu_domid, +- (paddr_t)gfn << PAGE_SHIFT_4K, +- order, !present, flush_dev_iotlb); +- if ( rc > 0 ) +- { +- iommu_flush_write_buffer(iommu); +- rc = 0; +- } +- } +- +- if ( unlikely(rc) ) +- { +- if ( !d->is_shutting_down && printk_ratelimit() ) +- printk(XENLOG_ERR VTDPREFIX +- " d%d: IOMMU pages flush failed: %d\n", +- d->domain_id, rc); +- +- if ( !is_hardware_domain(d) ) +- domain_crash(d); +- } +- +- return rc; +-} +- + static int __init vtd_ept_page_compatible(struct iommu *iommu) + { + u64 ept_cap, vtd_cap = iommu->cap; 
+--- a/xen/include/asm-x86/iommu.h ++++ b/xen/include/asm-x86/iommu.h +@@ -87,8 +87,9 @@ int iommu_setup_hpet_msi(struct msi_desc + + /* While VT-d specific, this must get declared in a generic header. */ + int adjust_vtd_irq_affinities(void); +-int __must_check iommu_pte_flush(struct domain *d, u64 gfn, u64 *pte, +- int order, int present); ++int __must_check iommu_flush_iotlb(struct domain *d, unsigned long gfn, ++ bool dma_old_pte_present, ++ unsigned int page_count); + bool_t iommu_supports_eim(void); + int iommu_enable_x2apic_IR(void); + void iommu_disable_x2apic_IR(void); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.11-o.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.11-o.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.11-o.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.11-o.patch 2022-04-05 13:04:21.000000000 +0100 @@ -0,0 +1,110 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: clean up permissions for dead domains +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +domain ids are prone to wrapping (15-bits), and with sufficient number +of VMs in a reboot loop it is possible to trigger it. Xenstore entries +may linger after a domain dies, until a toolstack cleans it up. During +this time there is a window where a wrapped domid could access these +xenstore keys (that belonged to another VM). + +To prevent this do a cleanup when a domain dies: + * walk the entire xenstore tree and update permissions for all nodes + * if the dead domain had an ACL entry: remove it + * if the dead domain was the owner: change the owner to Dom0 + +This is done without quota checks or a transaction. Quota checks would +be a no-op (either the domain is dead, or it is Dom0 where they are not +enforced). Transactions are not needed, because this is all done +atomically by oxenstored's single thread. + +The xenstore entries owned by the dead domain are not deleted, because +that could confuse a toolstack / backends that are still bound to it +(or generate unexpected watch events). It is the responsibility of a +toolstack to remove the xenstore entries themselves. + +This is part of XSA-322. + +Signed-off-by: Edwin Török +Acked-by: Christian Lindig + +diff --git a/tools/ocaml/xenstored/perms.ml b/tools/ocaml/xenstored/perms.ml +index ee7fee6bda..e8a16221f8 100644 +--- a/tools/ocaml/xenstored/perms.ml ++++ b/tools/ocaml/xenstored/perms.ml +@@ -58,6 +58,15 @@ let get_other perms = perms.other + let get_acl perms = perms.acl + let get_owner perm = perm.owner + ++(** [remote_domid ~domid perm] removes all ACLs for [domid] from perm. ++* If [domid] was the owner then it is changed to Dom0. ++* This is used for cleaning up after dead domains. 
++* *) ++let remove_domid ~domid perm = ++ let acl = List.filter (fun (acl_domid, _) -> acl_domid <> domid) perm.acl in ++ let owner = if perm.owner = domid then 0 else perm.owner in ++ { perm with acl; owner } ++ + let default0 = create 0 NONE [] + + let perm_of_string s = +diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml +index 3cd0097db9..6a998f8764 100644 +--- a/tools/ocaml/xenstored/process.ml ++++ b/tools/ocaml/xenstored/process.ml +@@ -437,6 +437,7 @@ let do_release con t domains cons data = + let fire_spec_watches = Domains.exist domains domid in + Domains.del domains domid; + Connections.del_domain cons domid; ++ Store.reset_permissions (Transaction.get_store t) domid; + if fire_spec_watches + then Connections.fire_spec_watches (Transaction.get_root t) cons Store.Path.release_domain + else raise Invalid_Cmd_Args +diff --git a/tools/ocaml/xenstored/store.ml b/tools/ocaml/xenstored/store.ml +index 0ce6f68e8d..101c094715 100644 +--- a/tools/ocaml/xenstored/store.ml ++++ b/tools/ocaml/xenstored/store.ml +@@ -89,6 +89,13 @@ let check_owner node connection = + + let rec recurse fct node = fct node; List.iter (recurse fct) node.children + ++(** [recurse_map f tree] applies [f] on each node in the tree recursively *) ++let recurse_map f = ++ let rec walk node = ++ f { node with children = List.rev_map walk node.children |> List.rev } ++ in ++ walk ++ + let unpack node = (Symbol.to_string node.name, node.perms, node.value) + + end +@@ -405,6 +412,15 @@ let setperms store perm path nperms = + Quota.del_entry store.quota old_owner; + Quota.add_entry store.quota new_owner + ++let reset_permissions store domid = ++ Logging.info "store|node" "Cleaning up xenstore ACLs for domid %d" domid; ++ store.root <- Node.recurse_map (fun node -> ++ let perms = Perms.Node.remove_domid ~domid node.perms in ++ if perms <> node.perms then ++ Logging.debug "store|node" "Changed permissions for node %s" (Node.get_name node); ++ { node with perms } ++ ) store.root ++ + type ops = { + store: t; + write: Path.t -> string -> unit; +diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml +index 30fc874327..183dd2754b 100644 +--- a/tools/ocaml/xenstored/xenstored.ml ++++ b/tools/ocaml/xenstored/xenstored.ml +@@ -340,6 +340,7 @@ let _ = + finally (fun () -> + if Some port = eventchn.Event.virq_port then ( + let (notify, deaddom) = Domains.cleanup domains in ++ List.iter (Store.reset_permissions store) deaddom; + List.iter (Connections.del_domain cons) deaddom; + if deaddom <> [] || notify then + Connections.fire_spec_watches diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.12-c.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.12-c.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.12-c.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa322-4.12-c.patch 2022-05-28 19:29:36.000000000 +0100 @@ -0,0 +1,534 @@ +From: Juergen Gross +Subject: tools/xenstore: revoke access rights for removed domains + +Access rights of Xenstore nodes are per domid. Unfortunately existing +granted access rights are not removed when a domain is being destroyed. +This means that a new domain created with the same domid will inherit +the access rights to Xenstore nodes from the previous domain(s) with +the same domid. + +This can be avoided by adding a generation counter to each domain. +The generation counter of the domain is set to the global generation +counter when a domain structure is being allocated. 
When reading or +writing a node all permissions of domains which are younger than the +node itself are dropped. This is done by flagging the related entry +as invalid in order to avoid modifying permissions in a way the user +could detect. + +A special case has to be considered: for a new domain the first +Xenstore entries are already written before the domain is officially +introduced in Xenstore. In order not to drop the permissions for the +new domain a domain struct is allocated even before introduction if +the hypervisor is aware of the domain. This requires adding another +bool "introduced" to struct domain in xenstored. In order to avoid +additional padding holes convert the shutdown flag to bool, too. + +As verifying permissions has its price regarding runtime add a new +quota for limiting the number of permissions an unprivileged domain +can set for a node. The default for that new quota is 5. + +This is part of XSA-322. + +Signed-off-by: Juergen Gross +Reviewed-by: Paul Durrant +Acked-by: Julien Grall + +diff --git a/tools/xenstore/include/xenstore_lib.h b/tools/xenstore/include/xenstore_lib.h +index 0ffbae9eb574..4c9b6d16858d 100644 +--- a/tools/xenstore/include/xenstore_lib.h ++++ b/tools/xenstore/include/xenstore_lib.h +@@ -34,6 +34,7 @@ enum xs_perm_type { + /* Internal use. */ + XS_PERM_ENOENT_OK = 4, + XS_PERM_OWNER = 8, ++ XS_PERM_IGNORE = 16, + }; + + struct xs_permissions +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index 2a86c4aa5bce..4fbe5c759c1b 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -101,6 +101,7 @@ int quota_nb_entry_per_domain = 1000; + int quota_nb_watch_per_domain = 128; + int quota_max_entry_size = 2048; /* 2K */ + int quota_max_transaction = 10; ++int quota_nb_perms_per_node = 5; + + void trace(const char *fmt, ...) + { +@@ -407,8 +408,13 @@ struct node *read_node(struct connection *conn, const void *ctx, + + /* Permissions are struct xs_permissions. */ + node->perms.p = hdr->perms; ++ if (domain_adjust_node_perms(node)) { ++ talloc_free(node); ++ return NULL; ++ } ++ + /* Data is binary blob (usually ascii, no nul). */ +- node->data = node->perms.p + node->perms.num; ++ node->data = node->perms.p + hdr->num_perms; + /* Children is strings, nul separated. 
*/ + node->children = node->data + node->datalen; + +@@ -424,6 +430,9 @@ int write_node_raw(struct connection *conn, TDB_DATA *key, struct node *node, + void *p; + struct xs_tdb_record_hdr *hdr; + ++ if (domain_adjust_node_perms(node)) ++ return errno; ++ + data.dsize = sizeof(*hdr) + + node->perms.num * sizeof(node->perms.p[0]) + + node->datalen + node->childlen; +@@ -483,8 +492,9 @@ enum xs_perm_type perm_for_conn(struct connection *conn, + return (XS_PERM_READ|XS_PERM_WRITE|XS_PERM_OWNER) & mask; + + for (i = 1; i < perms->num; i++) +- if (perms->p[i].id == conn->id +- || (conn->target && perms->p[i].id == conn->target->id)) ++ if (!(perms->p[i].perms & XS_PERM_IGNORE) && ++ (perms->p[i].id == conn->id || ++ (conn->target && perms->p[i].id == conn->target->id))) + return perms->p[i].perms & mask; + + return perms->p[0].perms & mask; +@@ -1246,8 +1256,12 @@ static int do_set_perms(struct connection *conn, struct buffered_data *in) + if (perms.num < 2) + return EINVAL; + +- permstr = in->buffer + strlen(in->buffer) + 1; + perms.num--; ++ if (domain_is_unprivileged(conn) && ++ perms.num > quota_nb_perms_per_node) ++ return ENOSPC; ++ ++ permstr = in->buffer + strlen(in->buffer) + 1; + + perms.p = talloc_array(in, struct xs_permissions, perms.num); + if (!perms.p) +@@ -1919,6 +1933,7 @@ static void usage(void) + " -S, --entry-size limit the size of entry per domain, and\n" + " -W, --watch-nb limit the number of watches per domain,\n" + " -t, --transaction limit the number of transaction allowed per domain,\n" ++" -A, --perm-nb limit the number of permissions per node,\n" + " -R, --no-recovery to request that no recovery should be attempted when\n" + " the store is corrupted (debug only),\n" + " -I, --internal-db store database in memory, not on disk\n" +@@ -1939,6 +1954,7 @@ static struct option options[] = { + { "entry-size", 1, NULL, 'S' }, + { "trace-file", 1, NULL, 'T' }, + { "transaction", 1, NULL, 't' }, ++ { "perm-nb", 1, NULL, 'A' }, + { "no-recovery", 0, NULL, 'R' }, + { "internal-db", 0, NULL, 'I' }, + { "verbose", 0, NULL, 'V' }, +@@ -1961,7 +1977,7 @@ int main(int argc, char *argv[]) + int timeout; + + +- while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:T:RVW:", options, ++ while ((opt = getopt_long(argc, argv, "DE:F:HNPS:t:A:T:RVW:", options, + NULL)) != -1) { + switch (opt) { + case 'D': +@@ -2003,6 +2019,9 @@ int main(int argc, char *argv[]) + case 'W': + quota_nb_watch_per_domain = strtol(optarg, NULL, 10); + break; ++ case 'A': ++ quota_nb_perms_per_node = strtol(optarg, NULL, 10); ++ break; + case 'e': + dom0_event = strtol(optarg, NULL, 10); + break; +diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c +index 0b2f49ac7d4c..f5e7af46e8aa 100644 +--- a/tools/xenstore/xenstored_domain.c ++++ b/tools/xenstore/xenstored_domain.c +@@ -71,8 +71,14 @@ struct domain + /* The connection associated with this. */ + struct connection *conn; + ++ /* Generation count at domain introduction time. */ ++ uint64_t generation; ++ + /* Have we noticed that this domain is shutdown? */ +- int shutdown; ++ bool shutdown; ++ ++ /* Has domain been officially introduced? 
*/ ++ bool introduced; + + /* number of entry from this domain in the store */ + int nbentry; +@@ -200,6 +206,9 @@ static int destroy_domain(void *_domain) + + list_del(&domain->list); + ++ if (!domain->introduced) ++ return 0; ++ + if (domain->port) { + if (xenevtchn_unbind(xce_handle, domain->port) == -1) + eprintf("> Unbinding port %i failed!\n", domain->port); +@@ -221,20 +230,33 @@ static int destroy_domain(void *_domain) + return 0; + } + ++static bool get_domain_info(unsigned int domid, xc_dominfo_t *dominfo) ++{ ++ return xc_domain_getinfo(*xc_handle, domid, 1, dominfo) == 1 && ++ dominfo->domid == domid; ++} ++ + static void domain_cleanup(void) + { + xc_dominfo_t dominfo; + struct domain *domain; + int notify = 0; ++ bool dom_valid; + + again: + list_for_each_entry(domain, &domains, list) { +- if (xc_domain_getinfo(*xc_handle, domain->domid, 1, +- &dominfo) == 1 && +- dominfo.domid == domain->domid) { ++ dom_valid = get_domain_info(domain->domid, &dominfo); ++ if (!domain->introduced) { ++ if (!dom_valid) { ++ talloc_free(domain); ++ goto again; ++ } ++ continue; ++ } ++ if (dom_valid) { + if ((dominfo.crashed || dominfo.shutdown) + && !domain->shutdown) { +- domain->shutdown = 1; ++ domain->shutdown = true; + notify = 1; + } + if (!dominfo.dying) +@@ -301,58 +323,84 @@ static char *talloc_domain_path(void *context, unsigned int domid) + return talloc_asprintf(context, "/local/domain/%u", domid); + } + +-static struct domain *new_domain(void *context, unsigned int domid, +- int port) ++static struct domain *find_domain_struct(unsigned int domid) ++{ ++ struct domain *i; ++ ++ list_for_each_entry(i, &domains, list) { ++ if (i->domid == domid) ++ return i; ++ } ++ return NULL; ++} ++ ++static struct domain *alloc_domain(void *context, unsigned int domid) + { + struct domain *domain; +- int rc; + + domain = talloc(context, struct domain); +- if (!domain) ++ if (!domain) { ++ errno = ENOMEM; + return NULL; ++ } + +- domain->port = 0; +- domain->shutdown = 0; + domain->domid = domid; +- domain->path = talloc_domain_path(domain, domid); +- if (!domain->path) +- return NULL; ++ domain->generation = generation; ++ domain->introduced = false; + +- wrl_domain_new(domain); ++ talloc_set_destructor(domain, destroy_domain); + + list_add(&domain->list, &domains); +- talloc_set_destructor(domain, destroy_domain); ++ ++ return domain; ++} ++ ++static int new_domain(struct domain *domain, int port) ++{ ++ int rc; ++ ++ domain->port = 0; ++ domain->shutdown = false; ++ domain->path = talloc_domain_path(domain, domain->domid); ++ if (!domain->path) { ++ errno = ENOMEM; ++ return errno; ++ } ++ ++ wrl_domain_new(domain); + + /* Tell kernel we're interested in this event. 
*/ +- rc = xenevtchn_bind_interdomain(xce_handle, domid, port); ++ rc = xenevtchn_bind_interdomain(xce_handle, domain->domid, port); + if (rc == -1) +- return NULL; ++ return errno; + domain->port = rc; + ++ domain->introduced = true; ++ + domain->conn = new_connection(writechn, readchn); +- if (!domain->conn) +- return NULL; ++ if (!domain->conn) { ++ errno = ENOMEM; ++ return errno; ++ } + + domain->conn->domain = domain; +- domain->conn->id = domid; ++ domain->conn->id = domain->domid; + + domain->remote_port = port; + domain->nbentry = 0; + domain->nbwatch = 0; + +- return domain; ++ return 0; + } + + + static struct domain *find_domain_by_domid(unsigned int domid) + { +- struct domain *i; ++ struct domain *d; + +- list_for_each_entry(i, &domains, list) { +- if (i->domid == domid) +- return i; +- } +- return NULL; ++ d = find_domain_struct(domid); ++ ++ return (d && d->introduced) ? d : NULL; + } + + static void domain_conn_reset(struct domain *domain) +@@ -399,15 +447,21 @@ int do_introduce(struct connection *conn, struct buffered_data *in) + if (port <= 0) + return EINVAL; + +- domain = find_domain_by_domid(domid); ++ domain = find_domain_struct(domid); + + if (domain == NULL) { ++ /* Hang domain off "in" until we're finished. */ ++ domain = alloc_domain(in, domid); ++ if (domain == NULL) ++ return ENOMEM; ++ } ++ ++ if (!domain->introduced) { + interface = map_interface(domid, mfn); + if (!interface) + return errno; + /* Hang domain off "in" until we're finished. */ +- domain = new_domain(in, domid, port); +- if (!domain) { ++ if (new_domain(domain, port)) { + rc = errno; + unmap_interface(interface); + return rc; +@@ -518,8 +572,8 @@ int do_resume(struct connection *conn, struct buffered_data *in) + if (IS_ERR(domain)) + return -PTR_ERR(domain); + +- domain->shutdown = 0; +- ++ domain->shutdown = false; ++ + send_ack(conn, XS_RESUME); + + return 0; +@@ -662,8 +716,10 @@ static int dom0_init(void) + if (port == -1) + return -1; + +- dom0 = new_domain(NULL, xenbus_master_domid(), port); +- if (dom0 == NULL) ++ dom0 = alloc_domain(NULL, xenbus_master_domid()); ++ if (!dom0) ++ return -1; ++ if (new_domain(dom0, port)) + return -1; + + dom0->interface = xenbus_map(); +@@ -744,6 +800,66 @@ void domain_entry_inc(struct connection *conn, struct node *node) + } + } + ++/* ++ * Check whether a domain was created before or after a specific generation ++ * count (used for testing whether a node permission is older than a domain). ++ * ++ * Return values: ++ * -1: error ++ * 0: domain has higher generation count (it is younger than a node with the ++ * given count), or domain isn't existing any longer ++ * 1: domain is older than the node ++ */ ++static int chk_domain_generation(unsigned int domid, uint64_t gen) ++{ ++ struct domain *d; ++ xc_dominfo_t dominfo; ++ ++ if (!xc_handle && domid == 0) ++ return 1; ++ ++ d = find_domain_struct(domid); ++ if (d) ++ return (d->generation <= gen) ? 1 : 0; ++ ++ if (!get_domain_info(domid, &dominfo)) ++ return 0; ++ ++ d = alloc_domain(NULL, domid); ++ return d ? 1 : -1; ++} ++ ++/* ++ * Remove permissions for no longer existing domains in order to avoid a new ++ * domain with the same domid inheriting the permissions. ++ */ ++int domain_adjust_node_perms(struct node *node) ++{ ++ unsigned int i; ++ int ret; ++ ++ ret = chk_domain_generation(node->perms.p[0].id, node->generation); ++ if (ret < 0) ++ return errno; ++ ++ /* If the owner doesn't exist any longer give it to priv domain. 
*/ ++ if (!ret) ++ node->perms.p[0].id = priv_domid; ++ ++ for (i = 1; i < node->perms.num; i++) { ++ if (node->perms.p[i].perms & XS_PERM_IGNORE) ++ continue; ++ ret = chk_domain_generation(node->perms.p[i].id, ++ node->generation); ++ if (ret < 0) ++ return errno; ++ if (!ret) ++ node->perms.p[i].perms |= XS_PERM_IGNORE; ++ } ++ ++ return 0; ++} ++ + void domain_entry_dec(struct connection *conn, struct node *node) + { + struct domain *d; +diff --git a/tools/xenstore/xenstored_domain.h b/tools/xenstore/xenstored_domain.h +index 259183962a9c..5e00087206c7 100644 +--- a/tools/xenstore/xenstored_domain.h ++++ b/tools/xenstore/xenstored_domain.h +@@ -56,6 +56,9 @@ bool domain_can_write(struct connection *conn); + + bool domain_is_unprivileged(struct connection *conn); + ++/* Remove node permissions for no longer existing domains. */ ++int domain_adjust_node_perms(struct node *node); ++ + /* Quota manipulation */ + void domain_entry_inc(struct connection *conn, struct node *); + void domain_entry_dec(struct connection *conn, struct node *); +diff --git a/tools/xenstore/xenstored_transaction.c b/tools/xenstore/xenstored_transaction.c +index 36793b9b1af3..9fcb4c9ba986 100644 +--- a/tools/xenstore/xenstored_transaction.c ++++ b/tools/xenstore/xenstored_transaction.c +@@ -47,7 +47,12 @@ + * transaction. + * Each time the global generation count is copied to either a node or a + * transaction it is incremented. This ensures all nodes and/or transactions +- * are having a unique generation count. ++ * are having a unique generation count. The increment is done _before_ the ++ * copy as that is needed for checking whether a domain was created before ++ * or after a node has been written (the domain's generation is set with the ++ * actual generation count without incrementing it, in order to support ++ * writing a node for a domain before the domain has been officially ++ * introduced). + * + * Transaction conflicts are detected by checking the generation count of all + * nodes read in the transaction to match with the generation count in the +@@ -161,7 +166,7 @@ struct transaction + }; + + extern int quota_max_transaction; +-static uint64_t generation; ++uint64_t generation; + + static void set_tdb_key(const char *name, TDB_DATA *key) + { +@@ -237,7 +242,7 @@ int access_node(struct connection *conn, struct node *node, + bool introduce = false; + + if (type != NODE_ACCESS_READ) { +- node->generation = generation++; ++ node->generation = ++generation; + if (conn && !conn->transaction) + wrl_apply_debit_direct(conn); + } +@@ -374,7 +379,7 @@ static int finalize_transaction(struct connection *conn, + if (!data.dptr) + goto err; + hdr = (void *)data.dptr; +- hdr->generation = generation++; ++ hdr->generation = ++generation; + ret = tdb_store(tdb_ctx, key, data, + TDB_REPLACE); + talloc_free(data.dptr); +@@ -462,7 +467,7 @@ int do_transaction_start(struct connection *conn, struct buffered_data *in) + INIT_LIST_HEAD(&trans->accessed); + INIT_LIST_HEAD(&trans->changed_domains); + trans->fail = false; +- trans->generation = generation++; ++ trans->generation = ++generation; + + /* Pick an unused transaction identifier. 
*/ + do { +diff --git a/tools/xenstore/xenstored_transaction.h b/tools/xenstore/xenstored_transaction.h +index 3386bac56508..43a162bea3f3 100644 +--- a/tools/xenstore/xenstored_transaction.h ++++ b/tools/xenstore/xenstored_transaction.h +@@ -27,6 +27,8 @@ enum node_access_type { + + struct transaction; + ++extern uint64_t generation; ++ + int do_transaction_start(struct connection *conn, struct buffered_data *node); + int do_transaction_end(struct connection *conn, struct buffered_data *in); + +diff --git a/tools/xenstore/xs_lib.c b/tools/xenstore/xs_lib.c +index 3e43f8809d42..d407d5713aff 100644 +--- a/tools/xenstore/xs_lib.c ++++ b/tools/xenstore/xs_lib.c +@@ -152,7 +152,7 @@ bool xs_strings_to_perms(struct xs_permissions *perms, unsigned int num, + bool xs_perm_to_string(const struct xs_permissions *perm, + char *buffer, size_t buf_len) + { +- switch ((int)perm->perms) { ++ switch ((int)perm->perms & ~XS_PERM_IGNORE) { + case XS_PERM_WRITE: + *buffer = 'w'; + break; +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa323.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa323.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa323.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa323.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,140 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: Fix path length validation +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Currently, oxenstored checks the length of paths against 1024, then +prepends "/local/domain/$DOMID/" to relative paths. This allows a domU +to create paths which can't subsequently be read by anyone, even dom0. +This also interferes with listing directories, etc. + +Define a new oxenstored.conf entry: quota-path-max, defaulting to 1024 +as before. For paths that begin with "/local/domain/$DOMID/" check the +relative path length against this quota. For all other paths check the +entire path length. + +This ensures that if the domid changes (and thus the length of a prefix +changes) a path that used to be valid stays valid (e.g. after a +live-migration). It also ensures that regardless how the client tries +to access a path (domid-relative or absolute) it will get consistent +results, since the limit is always applied on the final canonicalized +path. + +Delete the unused Domain.get_path to avoid it being confused with +Connection.get_path (which differs by a trailing slash only). + +Rewrite Util.path_validate to apply the appropriate length restriction +based on whether the path is relative or not. Remove the check for +connection_path being absolute, because it is not guest controlled data. + +This is part of XSA-323. 
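The length rule described above can be checked outside of oxenstored as well; the actual fix is the OCaml change in the hunks below. A minimal C sketch, assuming made-up names (PATH_QUOTA stands in for quota-path-max, path_within_quota for the validation step):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define PATH_QUOTA 1024   /* stands in for the quota-path-max setting */

    /* Apply the quota to the domain-relative part when the canonical path is
     * domain-relative, and to the whole path otherwise. */
    static bool path_within_quota(const char *abs_path)
    {
        size_t len = strlen(abs_path);
        unsigned int domid = 0;
        int consumed = 0;

        if (sscanf(abs_path, "/local/domain/%u/%n", &domid, &consumed) == 1 &&
            consumed > 0)
            len -= consumed;   /* only the part after /local/domain/<domid>/ */

        return len > 0 && len <= PATH_QUOTA;
    }

    int main(void)
    {
        printf("%d\n", path_within_quota("/local/domain/123/data/example"));
        return 0;
    }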
+ +Signed-off-by: Andrew Cooper +Signed-off-by: Edwin Török +Acked-by: Christian Lindig + +diff --git a/tools/ocaml/libs/xb/partial.ml b/tools/ocaml/libs/xb/partial.ml +index d4d1c7bdec..b6e2a716e2 100644 +--- a/tools/ocaml/libs/xb/partial.ml ++++ b/tools/ocaml/libs/xb/partial.ml +@@ -28,6 +28,7 @@ external header_of_string_internal: string -> int * int * int * int + = "stub_header_of_string" + + let xenstore_payload_max = 4096 (* xen/include/public/io/xs_wire.h *) ++let xenstore_rel_path_max = 2048 (* xen/include/public/io/xs_wire.h *) + + let of_string s = + let tid, rid, opint, dlen = header_of_string_internal s in +diff --git a/tools/ocaml/libs/xb/partial.mli b/tools/ocaml/libs/xb/partial.mli +index 359a75e88d..b9216018f5 100644 +--- a/tools/ocaml/libs/xb/partial.mli ++++ b/tools/ocaml/libs/xb/partial.mli +@@ -9,6 +9,7 @@ external header_size : unit -> int = "stub_header_size" + external header_of_string_internal : string -> int * int * int * int + = "stub_header_of_string" + val xenstore_payload_max : int ++val xenstore_rel_path_max : int + val of_string : string -> pkt + val append : pkt -> string -> int -> unit + val to_complete : pkt -> int +diff --git a/tools/ocaml/xenstored/define.ml b/tools/ocaml/xenstored/define.ml +index ea9e1b7620..ebe18b8e31 100644 +--- a/tools/ocaml/xenstored/define.ml ++++ b/tools/ocaml/xenstored/define.ml +@@ -31,6 +31,8 @@ let conflict_rate_limit_is_aggregate = ref true + + let domid_self = 0x7FF0 + ++let path_max = ref Xenbus.Partial.xenstore_rel_path_max ++ + exception Not_a_directory of string + exception Not_a_value of string + exception Already_exist +diff --git a/tools/ocaml/xenstored/domain.ml b/tools/ocaml/xenstored/domain.ml +index aeb185ff7e..81cb59b8f1 100644 +--- a/tools/ocaml/xenstored/domain.ml ++++ b/tools/ocaml/xenstored/domain.ml +@@ -38,7 +38,6 @@ type t = + } + + let is_dom0 d = d.id = 0 +-let get_path dom = "/local/domain/" ^ (sprintf "%u" dom.id) + let get_id domain = domain.id + let get_interface d = d.interface + let get_mfn d = d.mfn +diff --git a/tools/ocaml/xenstored/oxenstored.conf.in b/tools/ocaml/xenstored/oxenstored.conf.in +index f843482981..4ae48e42d4 100644 +--- a/tools/ocaml/xenstored/oxenstored.conf.in ++++ b/tools/ocaml/xenstored/oxenstored.conf.in +@@ -61,6 +61,7 @@ quota-maxsize = 2048 + quota-maxwatch = 100 + quota-transaction = 10 + quota-maxrequests = 1024 ++quota-path-max = 1024 + + # Activate filed base backend + persistent = false +diff --git a/tools/ocaml/xenstored/utils.ml b/tools/ocaml/xenstored/utils.ml +index e8c9fe4e94..eb79bf0146 100644 +--- a/tools/ocaml/xenstored/utils.ml ++++ b/tools/ocaml/xenstored/utils.ml +@@ -93,7 +93,7 @@ let read_file_single_integer filename = + let path_validate path connection_path = + let len = String.length path in + +- if len = 0 || len > 1024 then raise Define.Invalid_path; ++ if len = 0 then raise Define.Invalid_path; + + let abs_path = + match String.get path 0 with +@@ -101,4 +101,17 @@ let path_validate path connection_path = + | _ -> connection_path ^ path + in + ++ (* Regardless whether client specified absolute or relative path, ++ canonicalize it (above) and, for domain-relative paths, check the ++ length of the relative part. ++ ++ This prevents paths becoming invalid across migrate when the length ++ of the domid changes in @param connection_path. 
++ *) ++ let len = String.length abs_path in ++ let on_absolute _ _ = len in ++ let on_relative _ offset = len - offset in ++ let len = Scanf.ksscanf abs_path on_absolute "/local/domain/%d/%n" on_relative in ++ if len > !Define.path_max then raise Define.Invalid_path; ++ + abs_path +diff --git a/tools/ocaml/xenstored/xenstored.ml b/tools/ocaml/xenstored/xenstored.ml +index ff9fbbbac2..39d6d767e4 100644 +--- a/tools/ocaml/xenstored/xenstored.ml ++++ b/tools/ocaml/xenstored/xenstored.ml +@@ -102,6 +102,7 @@ let parse_config filename = + ("quota-maxentity", Config.Set_int Quota.maxent); + ("quota-maxsize", Config.Set_int Quota.maxsize); + ("quota-maxrequests", Config.Set_int Define.maxrequests); ++ ("quota-path-max", Config.Set_int Define.path_max); + ("test-eagain", Config.Set_bool Transaction.test_eagain); + ("persistent", Config.Set_bool Disk.enable); + ("xenstored-log-file", Config.String Logging.set_xenstored_log_destination); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa324.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa324.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa324.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa324.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,48 @@ +From: Juergen Gross +Subject: tools/xenstore: drop watch event messages exceeding maximum size + +By setting a watch with a very large tag it is possible to trick +xenstored to send watch event messages exceeding the maximum allowed +payload size. This might in turn lead to a crash of xenstored as the +resulting error can cause dereferencing a NULL pointer in case there +is no active request being handled by the guest the watch event is +being sent to. + +Fix that by just dropping such watch events. Additionally modify the +error handling to test the pointer to be not NULL before dereferencing +it. + +This is XSA-324. + +Signed-off-by: Juergen Gross +Acked-by: Julien Grall + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index 33f95dcf3c..3d74dbbb40 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -674,6 +674,9 @@ void send_reply(struct connection *conn, enum xsd_sockmsg_type type, + /* Replies reuse the request buffer, events need a new one. */ + if (type != XS_WATCH_EVENT) { + bdata = conn->in; ++ /* Drop asynchronous responses, e.g. errors for watch events. */ ++ if (!bdata) ++ return; + bdata->inhdr = true; + bdata->used = 0; + conn->in = NULL; +diff --git a/tools/xenstore/xenstored_watch.c b/tools/xenstore/xenstored_watch.c +index 71c108ea99..9ff20690c0 100644 +--- a/tools/xenstore/xenstored_watch.c ++++ b/tools/xenstore/xenstored_watch.c +@@ -92,6 +92,10 @@ static void add_event(struct connection *conn, + } + + len = strlen(name) + 1 + strlen(watch->token) + 1; ++ /* Don't try to send over-long events. 
*/ ++ if (len > XENSTORE_PAYLOAD_MAX) ++ return; ++ + data = talloc_array(ctx, char, len); + if (!data) + return; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa325-4.14.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa325-4.14.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa325-4.14.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa325-4.14.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,192 @@ +From: Harsha Shamsundara Havanur +Subject: tools/xenstore: Preserve bad client until they are destroyed + +XenStored will kill any connection that it thinks has misbehaved, +this is currently happening in two places: + * In `handle_input()` if the sanity check on the ring and the message + fails. + * In `handle_output()` when failing to write the response in the ring. + +As the domain structure is a child of the connection, XenStored will +destroy its view of the domain when killing the connection. This will +result in sending @releaseDomain event to all the watchers. + +As the watch event doesn't carry which domain has been released, +the watcher (such as XenStored) will generally go through the list of +domains registers and check if one of them is shutting down/dying. +In the case of a client misbehaving, the domain will likely to be +running, so no action will be performed. + +When the domain is effectively destroyed, XenStored will not be aware of +the domain anymore. So the watch event is not going to be sent. +By consequence, the watchers of the event will not release mappings +they may have on the domain. This will result in a zombie domain. + +In order to send @releaseDomain event at the correct time, we want +to keep the domain structure until the domain is effectively +shutting-down/dying. + +We also want to keep the connection around so we could possibly revive +the connection in the future. + +A new flag 'is_ignored' is added to mark whether a connection should be +ignored when checking if there are work to do. Additionally any +transactions, watches, buffers associated to the connection will be +freed as you can't do much with them (restarting the connection will +likely need a reset). + +As a side note, when the device model were running in a stubdomain, a +guest would have been able to introduce a use-after-free because there +is two parents for a guest connection. + +This is XSA-325. + +Reported-by: Pawel Wieczorkiewicz +Signed-off-by: Harsha Shamsundara Havanur +Signed-off-by: Julien Grall +Reviewed-by: Juergen Gross +Reviewed-by: Paul Durrant + +diff --git a/tools/xenstore/xenstored_core.c b/tools/xenstore/xenstored_core.c +index af3d17004b3f..27d8f15b6b76 100644 +--- a/tools/xenstore/xenstored_core.c ++++ b/tools/xenstore/xenstored_core.c +@@ -1355,6 +1355,32 @@ static struct { + [XS_DIRECTORY_PART] = { "DIRECTORY_PART", send_directory_part }, + }; + ++/* ++ * Keep the connection alive but stop processing any new request or sending ++ * reponse. This is to allow sending @releaseDomain watch event at the correct ++ * moment and/or to allow the connection to restart (not yet implemented). ++ * ++ * All watches, transactions, buffers will be freed. 
++ */ ++static void ignore_connection(struct connection *conn) ++{ ++ struct buffered_data *out, *tmp; ++ ++ trace("CONN %p ignored\n", conn); ++ ++ conn->is_ignored = true; ++ conn_delete_all_watches(conn); ++ conn_delete_all_transactions(conn); ++ ++ list_for_each_entry_safe(out, tmp, &conn->out_list, list) { ++ list_del(&out->list); ++ talloc_free(out); ++ } ++ ++ talloc_free(conn->in); ++ conn->in = NULL; ++} ++ + static const char *sockmsg_string(enum xsd_sockmsg_type type) + { + if ((unsigned int)type < ARRAY_SIZE(wire_funcs) && wire_funcs[type].str) +@@ -1413,8 +1439,10 @@ static void consider_message(struct connection *conn) + assert(conn->in == NULL); + } + +-/* Errors in reading or allocating here mean we get out of sync, so we +- * drop the whole client connection. */ ++/* ++ * Errors in reading or allocating here means we get out of sync, so we mark ++ * the connection as ignored. ++ */ + static void handle_input(struct connection *conn) + { + int bytes; +@@ -1471,14 +1499,14 @@ static void handle_input(struct connection *conn) + return; + + bad_client: +- /* Kill it. */ +- talloc_free(conn); ++ ignore_connection(conn); + } + + static void handle_output(struct connection *conn) + { ++ /* Ignore the connection if an error occured */ + if (!write_messages(conn)) +- talloc_free(conn); ++ ignore_connection(conn); + } + + struct connection *new_connection(connwritefn_t *write, connreadfn_t *read) +@@ -1494,6 +1522,7 @@ struct connection *new_connection(connwritefn_t *write, connreadfn_t *read) + new->write = write; + new->read = read; + new->can_write = true; ++ new->is_ignored = false; + new->transaction_started = 0; + INIT_LIST_HEAD(&new->out_list); + INIT_LIST_HEAD(&new->watches); +@@ -2186,8 +2215,9 @@ int main(int argc, char *argv[]) + if (fds[conn->pollfd_idx].revents + & ~(POLLIN|POLLOUT)) + talloc_free(conn); +- else if (fds[conn->pollfd_idx].revents +- & POLLIN) ++ else if ((fds[conn->pollfd_idx].revents ++ & POLLIN) && ++ !conn->is_ignored) + handle_input(conn); + } + if (talloc_free(conn) == 0) +@@ -2199,8 +2229,9 @@ int main(int argc, char *argv[]) + if (fds[conn->pollfd_idx].revents + & ~(POLLIN|POLLOUT)) + talloc_free(conn); +- else if (fds[conn->pollfd_idx].revents +- & POLLOUT) ++ else if ((fds[conn->pollfd_idx].revents ++ & POLLOUT) && ++ !conn->is_ignored) + handle_output(conn); + } + if (talloc_free(conn) == 0) +diff --git a/tools/xenstore/xenstored_core.h b/tools/xenstore/xenstored_core.h +index eb19b71f5f46..196a6fd2b0be 100644 +--- a/tools/xenstore/xenstored_core.h ++++ b/tools/xenstore/xenstored_core.h +@@ -80,6 +80,9 @@ struct connection + /* Is this a read-only connection? */ + bool can_write; + ++ /* Is this connection ignored? */ ++ bool is_ignored; ++ + /* Buffered incoming data. 
*/ + struct buffered_data *in; + +diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c +index dc635e9be30c..d5e1e3e9d42d 100644 +--- a/tools/xenstore/xenstored_domain.c ++++ b/tools/xenstore/xenstored_domain.c +@@ -286,6 +286,10 @@ bool domain_can_read(struct connection *conn) + + if (domain_is_unprivileged(conn) && conn->domain->wrl_credit < 0) + return false; ++ ++ if (conn->is_ignored) ++ return false; ++ + return (intf->req_cons != intf->req_prod); + } + +@@ -303,6 +307,10 @@ bool domain_is_unprivileged(struct connection *conn) + bool domain_can_write(struct connection *conn) + { + struct xenstore_domain_interface *intf = conn->domain->interface; ++ ++ if (conn->is_ignored) ++ return false; ++ + return ((intf->rsp_prod - intf->rsp_cons) != XENSTORE_RING_SIZE); + } + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa327.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa327.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa327.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa327.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,63 @@ +From 030300ebbb86c40c12db038714479d746167c767 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Tue, 26 May 2020 18:31:33 +0100 +Subject: [PATCH] xen: Check the alignment of the offset pased via + VCPUOP_register_vcpu_info + +Currently a guest is able to register any guest physical address to use +for the vcpu_info structure as long as the structure can fits in the +rest of the frame. + +This means a guest can provide an address that is not aligned to the +natural alignment of the structure. + +On Arm 32-bit, unaligned access are completely forbidden by the +hypervisor. This will result to a data abort which is fatal. + +On Arm 64-bit, unaligned access are only forbidden when used for atomic +access. As the structure contains fields (such as evtchn_pending_self) +that are updated using atomic operations, any unaligned access will be +fatal as well. + +While the misalignment is only fatal on Arm, a generic check is added +as an x86 guest shouldn't sensibly pass an unaligned address (this +would result to a split lock). + +This is XSA-327. 
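The test added below is the standard power-of-two alignment check on the guest-supplied offset. A stand-alone sketch, assuming a placeholder struct info rather than Xen's vcpu_info_t and its compat layout:

    #include <stdalign.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Placeholder for a structure that is updated with atomic operations and
     * therefore must sit at its natural alignment. */
    struct info {
        unsigned long pending;
        unsigned long mask;
    };

    static bool offset_is_aligned(unsigned long offset)
    {
        return (offset & (alignof(struct info) - 1)) == 0;
    }

    int main(void)
    {
        printf("%d %d\n", offset_is_aligned(0), offset_is_aligned(4));
        return 0;
    }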
+ +Reported-by: Julien Grall +Signed-off-by: Julien Grall +Reviewed-by: Andrew Cooper +Reviewed-by: Stefano Stabellini +--- + xen/common/domain.c | 10 ++++++++++ + 1 file changed, 10 insertions(+) + +diff --git a/xen/common/domain.c b/xen/common/domain.c +index 7cc9526139a6..e9be05f1d05f 100644 +--- a/xen/common/domain.c ++++ b/xen/common/domain.c +@@ -1227,10 +1227,20 @@ int map_vcpu_info(struct vcpu *v, unsigned long gfn, unsigned offset) + void *mapping; + vcpu_info_t *new_info; + struct page_info *page; ++ unsigned int align; + + if ( offset > (PAGE_SIZE - sizeof(vcpu_info_t)) ) + return -EINVAL; + ++#ifdef CONFIG_COMPAT ++ if ( has_32bit_shinfo(d) ) ++ align = alignof(new_info->compat); ++ else ++#endif ++ align = alignof(*new_info); ++ if ( offset & (align - 1) ) ++ return -EINVAL; ++ + if ( !mfn_eq(v->vcpu_info_mfn, INVALID_MFN) ) + return -EINVAL; + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-1.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,118 @@ +From: Jan Beulich +Subject: x86/EPT: ept_set_middle_entry() related adjustments + +ept_split_super_page() wants to further modify the newly allocated +table, so have ept_set_middle_entry() return the mapped pointer rather +than tearing it down and then getting re-established right again. + +Similarly ept_next_level() wants to hand back a mapped pointer of +the next level page, so re-use the one established by +ept_set_middle_entry() in case that path was taken. + +Pull the setting of suppress_ve ahead of insertion into the higher level +table, and don't have ept_split_super_page() set the field a 2nd time. + +This is part of XSA-328. + +Signed-off-by: Jan Beulich + +--- a/xen/arch/x86/mm/p2m-ept.c ++++ b/xen/arch/x86/mm/p2m-ept.c +@@ -228,8 +228,9 @@ static void ept_p2m_type_to_flags(struct + #define GUEST_TABLE_SUPER_PAGE 2 + #define GUEST_TABLE_POD_PAGE 3 + +-/* Fill in middle levels of ept table */ +-static int ept_set_middle_entry(struct p2m_domain *p2m, ept_entry_t *ept_entry) ++/* Fill in middle level of ept table; return pointer to mapped new table. 
*/ ++static ept_entry_t *ept_set_middle_entry(struct p2m_domain *p2m, ++ ept_entry_t *ept_entry) + { + mfn_t mfn; + ept_entry_t *table; +@@ -237,7 +238,12 @@ static int ept_set_middle_entry(struct p + + mfn = p2m_alloc_ptp(p2m, 0); + if ( mfn_eq(mfn, INVALID_MFN) ) +- return 0; ++ return NULL; ++ ++ table = map_domain_page(mfn); ++ ++ for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ ) ++ table[i].suppress_ve = 1; + + ept_entry->epte = 0; + ept_entry->mfn = mfn_x(mfn); +@@ -249,14 +255,7 @@ static int ept_set_middle_entry(struct p + + ept_entry->suppress_ve = 1; + +- table = map_domain_page(mfn); +- +- for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ ) +- table[i].suppress_ve = 1; +- +- unmap_domain_page(table); +- +- return 1; ++ return table; + } + + /* free ept sub tree behind an entry */ +@@ -294,10 +293,10 @@ static bool_t ept_split_super_page(struc + + ASSERT(is_epte_superpage(ept_entry)); + +- if ( !ept_set_middle_entry(p2m, &new_ept) ) ++ table = ept_set_middle_entry(p2m, &new_ept); ++ if ( !table ) + return 0; + +- table = map_domain_page(_mfn(new_ept.mfn)); + trunk = 1UL << ((level - 1) * EPT_TABLE_ORDER); + + for ( i = 0; i < EPT_PAGETABLE_ENTRIES; i++ ) +@@ -308,7 +307,6 @@ static bool_t ept_split_super_page(struc + epte->sp = (level > 1); + epte->mfn += i * trunk; + epte->snp = (iommu_enabled && iommu_snoop); +- epte->suppress_ve = 1; + + ept_p2m_type_to_flags(p2m, epte, epte->sa_p2mt, epte->access); + +@@ -347,8 +345,7 @@ static int ept_next_level(struct p2m_dom + ept_entry_t **table, unsigned long *gfn_remainder, + int next_level) + { +- unsigned long mfn; +- ept_entry_t *ept_entry, e; ++ ept_entry_t *ept_entry, *next = NULL, e; + u32 shift, index; + + shift = next_level * EPT_TABLE_ORDER; +@@ -373,19 +370,17 @@ static int ept_next_level(struct p2m_dom + if ( read_only ) + return GUEST_TABLE_MAP_FAILED; + +- if ( !ept_set_middle_entry(p2m, ept_entry) ) ++ next = ept_set_middle_entry(p2m, ept_entry); ++ if ( !next ) + return GUEST_TABLE_MAP_FAILED; +- else +- e = atomic_read_ept_entry(ept_entry); /* Refresh */ ++ /* e is now stale and hence may not be used anymore below. */ + } +- + /* The only time sp would be set here is if we had hit a superpage */ +- if ( is_epte_superpage(&e) ) ++ else if ( is_epte_superpage(&e) ) + return GUEST_TABLE_SUPER_PAGE; + +- mfn = e.mfn; + unmap_domain_page(*table); +- *table = map_domain_page(_mfn(mfn)); ++ *table = next ?: map_domain_page(_mfn(e.mfn)); + *gfn_remainder &= (1UL << shift) - 1; + return GUEST_TABLE_NORMAL_PAGE; + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa328-4.11-2.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,48 @@ +From: +Subject: x86/ept: atomically modify entries in ept_next_level + +ept_next_level was passing a live PTE pointer to ept_set_middle_entry, +which was then modified without taking into account that the PTE could +be part of a live EPT table. This wasn't a security issue because the +pages returned by p2m_alloc_ptp are zeroed, so adding such an entry +before actually initializing it didn't allow a guest to access +physical memory addresses it wasn't supposed to access. + +This is part of XSA-328. 
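The rule being enforced below is to compose the entry in a local variable and publish it to the live table with one atomic write, never by editing the live entry piecemeal. A rough C11 sketch of that pattern, with pte_t, live_table and set_entry as illustrative names rather than Xen's types:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A page-table-style entry packed into one 64-bit word; another CPU (or a
     * hardware walker) may read live_table[] at any time. */
    typedef uint64_t pte_t;

    static _Atomic pte_t live_table[512];

    static void set_entry(unsigned int idx, uint64_t mfn, uint64_t flags)
    {
        pte_t e = (mfn << 12) | (flags & 0xfff);   /* build the full value first */

        atomic_store_explicit(&live_table[idx], e, memory_order_release);
    }

    int main(void)
    {
        set_entry(0, 0x1234, 0x7);
        printf("%#llx\n", (unsigned long long)atomic_load(&live_table[0]));
        return 0;
    }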
+ +Reviewed-by: Jan Beulich + +--- a/xen/arch/x86/mm/p2m-ept.c ++++ b/xen/arch/x86/mm/p2m-ept.c +@@ -348,6 +348,8 @@ static int ept_next_level(struct p2m_dom + ept_entry_t *ept_entry, *next = NULL, e; + u32 shift, index; + ++ ASSERT(next_level); ++ + shift = next_level * EPT_TABLE_ORDER; + + index = *gfn_remainder >> shift; +@@ -364,16 +366,20 @@ static int ept_next_level(struct p2m_dom + + if ( !is_epte_present(&e) ) + { ++ int rc; ++ + if ( e.sa_p2mt == p2m_populate_on_demand ) + return GUEST_TABLE_POD_PAGE; + + if ( read_only ) + return GUEST_TABLE_MAP_FAILED; + +- next = ept_set_middle_entry(p2m, ept_entry); ++ next = ept_set_middle_entry(p2m, &e); + if ( !next ) + return GUEST_TABLE_MAP_FAILED; +- /* e is now stale and hence may not be used anymore below. */ ++ ++ rc = atomic_write_ept_entry(ept_entry, e, next_level); ++ ASSERT(rc == 0); + } + /* The only time sp would be set here is if we had hit a superpage */ + else if ( is_epte_superpage(&e) ) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa330.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa330.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa330.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa330.patch 2022-05-30 08:15:43.000000000 +0100 @@ -0,0 +1,68 @@ +Variable names modified for version in Ubuntu 20.04. + +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: delete watch from trie too when resetting + watches +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +c/s f8c72b526129 "oxenstored: implement XS_RESET_WATCHES" from Xen 4.6 +introduced reset watches support in oxenstored by mirroring the change +in cxenstored. + +However the OCaml version has some additional data structures to +optimize watch firing, and just resetting the watches in one of the data +structures creates a security bug where a malicious guest kernel can +exceed its watch quota, driving oxenstored into OOM: + * create watches + * reset watches (this still keeps the watches lingering in another data + structure, using memory) + * create some more watches + * loop until oxenstored dies + +The guest kernel doesn't necessarily have to be malicious to trigger +this: + * if control/platform-feature-xs_reset_watches is set + * the guest kexecs (e.g. because it crashes) + * on boot more watches are set up + * this will slowly "leak" memory for watches in oxenstored, driving it + towards OOM. + +This is XSA-330. 
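The invariant being restored is that a watch has to be unlinked from every structure that references it, not only from the per-connection list. The real fix is the OCaml change below; as a language-neutral sketch in C, with per_conn and by_path standing in for the connection's watch list and the path-keyed trie:

    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_WATCHES 64

    struct watch { int id; };

    static struct watch *per_conn[MAX_WATCHES];   /* connection's own list */
    static unsigned int nr_conn;
    static struct watch *by_path[MAX_WATCHES];    /* global path index */
    static unsigned int nr_path;

    static void add_watch(int id)
    {
        struct watch *w = calloc(1, sizeof(*w));

        w->id = id;
        per_conn[nr_conn++] = w;
        by_path[nr_path++] = w;
    }

    /* The leak fixed below was equivalent to doing only "nr_conn = 0" here:
     * the allocations and the by_path[] references survived every reset. */
    static void reset_watches(void)
    {
        for (unsigned int i = 0; i < nr_conn; i++) {
            for (unsigned int j = 0; j < nr_path; j++)
                if (by_path[j] == per_conn[i]) {
                    by_path[j] = by_path[--nr_path];
                    break;
                }
            free(per_conn[i]);
        }
        nr_conn = 0;
    }

    int main(void)
    {
        add_watch(1);
        add_watch(2);
        reset_watches();
        printf("%u %u\n", nr_conn, nr_path);   /* 0 0 */
        return 0;
    }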
+ +Fixes: f8c72b526129 ("oxenstored: implement XS_RESET_WATCHES") +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/connections.ml b/tools/ocaml/xenstored/connections.ml +index 9f9f7ee2f0..6ee3552ec2 100644 +--- a/tools/ocaml/xenstored/connections.ml ++++ b/tools/ocaml/xenstored/connections.ml +@@ -134,6 +134,10 @@ let del_watch cons con path token = + cons.watches <- Trie.set cons.watches key watches; + watch + ++let del_watches cons con = ++ Connection.del_watches con; ++ cons.watches <- Trie.map (del_watches_of_con con) cons.watches ++ + (* path is absolute *) + let fire_watches ?oldroot root cons path recurse = + let key = key_of_path path in +diff --git a/tools/ocaml/xenstored/process.ml b/tools/ocaml/xenstored/process.ml +index 73e04cc18b..437d2dcf9e 100644 +--- a/tools/ocaml/xenstored/process.ml ++++ b/tools/ocaml/xenstored/process.ml +@@ -179,8 +179,8 @@ let do_isintroduced con t domains cons data = + if domid = Define.domid_self || Domains.exist domains domid then "T\000" else "F\000" + + (* only in xen >= 4.2 *) +-let do_reset_watches con t domains cons data = +- Connection.del_watches con; ++let do_reset_watches con t domains cons data = ++ Connections.del_watches cons con; + Connection.del_transactions con + + (* only in >= xen3.3 *) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa333.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa333.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa333.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa333.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,39 @@ +From: Andrew Cooper +Subject: x86/pv: Handle the Intel-specific MSR_MISC_ENABLE correctly + +This MSR doesn't exist on AMD hardware, and switching away from the safe +functions in the common MSR path was an erroneous change. + +Partially revert the change. + +This is XSA-333. + +Fixes: 4fdc932b3cc ("x86/Intel: drop another 32-bit leftover") +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich +Reviewed-by: Wei Liu + +diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c +index efeb2a727e..6332c74b80 100644 +--- a/xen/arch/x86/pv/emul-priv-op.c ++++ b/xen/arch/x86/pv/emul-priv-op.c +@@ -924,7 +924,8 @@ static int read_msr(unsigned int reg, uint64_t *val, + return X86EMUL_OKAY; + + case MSR_IA32_MISC_ENABLE: +- rdmsrl(reg, *val); ++ if ( rdmsr_safe(reg, *val) ) ++ break; + *val = guest_misc_enable(*val); + return X86EMUL_OKAY; + +@@ -1059,7 +1060,8 @@ static int write_msr(unsigned int reg, uint64_t val, + break; + + case MSR_IA32_MISC_ENABLE: +- rdmsrl(reg, temp); ++ if ( rdmsr_safe(reg, temp) ) ++ break; + if ( val != guest_misc_enable(temp) ) + goto invalid; + return X86EMUL_OKAY; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa336-4.11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa336-4.11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa336-4.11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa336-4.11.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,256 @@ +From: Roger Pau Monné +Subject: x86/vpt: fix race when migrating timers between vCPUs + +The current vPT code will migrate the emulated timers between vCPUs +(change the pt->vcpu field) while just holding the destination lock, +either from create_periodic_time or pt_adjust_global_vcpu_target if +the global target is adjusted. 
Changing the periodic_timer vCPU field +in this way creates a race where a third party could grab the lock in +the unlocked region of pt_adjust_global_vcpu_target (or before +create_periodic_time performs the vcpu change) and then release the +lock from a different vCPU, creating a locking imbalance. + +Introduce a per-domain rwlock in order to protect periodic_time +migration between vCPU lists. Taking the lock in read mode prevents +any timer from being migrated to a different vCPU, while taking it in +write mode allows performing migration of timers across vCPUs. The +per-vcpu locks are still used to protect all the other fields from the +periodic_timer struct. + +Note that such migration shouldn't happen frequently, and hence +there's no performance drop as a result of such locking. + +This is XSA-336. + +Reported-by: Igor Druzhinin +Tested-by: Igor Druzhinin +Signed-off-by: Roger Pau Monné +Reviewed-by: Jan Beulich + +--- a/xen/arch/x86/hvm/hvm.c ++++ b/xen/arch/x86/hvm/hvm.c +@@ -627,6 +627,8 @@ int hvm_domain_initialise(struct domain + /* need link to containing domain */ + d->arch.hvm_domain.pl_time->domain = d; + ++ rwlock_init(&d->arch.hvm_domain.pl_time->pt_migrate); ++ + /* Set the default IO Bitmap. */ + if ( is_hardware_domain(d) ) + { +--- a/xen/arch/x86/hvm/vpt.c ++++ b/xen/arch/x86/hvm/vpt.c +@@ -152,23 +152,32 @@ static int pt_irq_masked(struct periodic + return 1; + } + +-static void pt_lock(struct periodic_time *pt) ++static void pt_vcpu_lock(struct vcpu *v) + { +- struct vcpu *v; ++ read_lock(&v->domain->arch.hvm_domain.pl_time->pt_migrate); ++ spin_lock(&v->arch.hvm_vcpu.tm_lock); ++} + +- for ( ; ; ) +- { +- v = pt->vcpu; +- spin_lock(&v->arch.hvm_vcpu.tm_lock); +- if ( likely(pt->vcpu == v) ) +- break; +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); +- } ++static void pt_vcpu_unlock(struct vcpu *v) ++{ ++ spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ read_unlock(&v->domain->arch.hvm_domain.pl_time->pt_migrate); ++} ++ ++static void pt_lock(struct periodic_time *pt) ++{ ++ /* ++ * We cannot use pt_vcpu_lock here, because we need to acquire the ++ * per-domain lock first and then (re-)fetch the value of pt->vcpu, or ++ * else we might be using a stale value of pt->vcpu. 
++ */ ++ read_lock(&pt->vcpu->domain->arch.hvm_domain.pl_time->pt_migrate); ++ spin_lock(&pt->vcpu->arch.hvm_vcpu.tm_lock); + } + + static void pt_unlock(struct periodic_time *pt) + { +- spin_unlock(&pt->vcpu->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(pt->vcpu); + } + + static void pt_process_missed_ticks(struct periodic_time *pt) +@@ -218,7 +227,7 @@ void pt_save_timer(struct vcpu *v) + if ( v->pause_flags & VPF_blocked ) + return; + +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_lock(v); + + list_for_each_entry ( pt, head, list ) + if ( !pt->do_not_freeze ) +@@ -226,7 +235,7 @@ void pt_save_timer(struct vcpu *v) + + pt_freeze_time(v); + +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + } + + void pt_restore_timer(struct vcpu *v) +@@ -234,7 +243,7 @@ void pt_restore_timer(struct vcpu *v) + struct list_head *head = &v->arch.hvm_vcpu.tm_list; + struct periodic_time *pt; + +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_lock(v); + + list_for_each_entry ( pt, head, list ) + { +@@ -247,7 +256,7 @@ void pt_restore_timer(struct vcpu *v) + + pt_thaw_time(v); + +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + } + + static void pt_timer_fn(void *data) +@@ -272,7 +281,7 @@ int pt_update_irq(struct vcpu *v) + uint64_t max_lag; + int irq, pt_vector = -1; + +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_lock(v); + + earliest_pt = NULL; + max_lag = -1ULL; +@@ -300,14 +309,14 @@ int pt_update_irq(struct vcpu *v) + + if ( earliest_pt == NULL ) + { +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + return -1; + } + + earliest_pt->irq_issued = 1; + irq = earliest_pt->irq; + +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + + switch ( earliest_pt->source ) + { +@@ -377,12 +386,12 @@ void pt_intr_post(struct vcpu *v, struct + if ( intack.source == hvm_intsrc_vector ) + return; + +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_lock(v); + + pt = is_pt_irq(v, intack); + if ( pt == NULL ) + { +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + return; + } + +@@ -421,7 +430,7 @@ void pt_intr_post(struct vcpu *v, struct + cb = pt->cb; + cb_priv = pt->priv; + +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + + if ( cb != NULL ) + cb(v, cb_priv); +@@ -432,12 +441,12 @@ void pt_migrate(struct vcpu *v) + struct list_head *head = &v->arch.hvm_vcpu.tm_list; + struct periodic_time *pt; + +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_lock(v); + + list_for_each_entry ( pt, head, list ) + migrate_timer(&pt->timer, v->processor); + +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ pt_vcpu_unlock(v); + } + + void create_periodic_time( +@@ -455,7 +464,7 @@ void create_periodic_time( + + destroy_periodic_time(pt); + +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ write_lock(&v->domain->arch.hvm_domain.pl_time->pt_migrate); + + pt->pending_intr_nr = 0; + pt->do_not_freeze = 0; +@@ -504,7 +513,7 @@ void create_periodic_time( + init_timer(&pt->timer, pt_timer_fn, pt, v->processor); + set_timer(&pt->timer, pt->scheduled); + +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ write_unlock(&v->domain->arch.hvm_domain.pl_time->pt_migrate); + } + + void destroy_periodic_time(struct periodic_time *pt) +@@ -529,30 +538,20 @@ void destroy_periodic_time(struct period + + static void pt_adjust_vcpu(struct periodic_time *pt, struct vcpu *v) + { +- int on_list; +- + ASSERT(pt->source == PTSRC_isa || pt->source == PTSRC_ioapic); + + if ( pt->vcpu == NULL ) + return; + +- pt_lock(pt); +- on_list = pt->on_list; +- if ( pt->on_list ) +- 
list_del(&pt->list); +- pt->on_list = 0; +- pt_unlock(pt); +- +- spin_lock(&v->arch.hvm_vcpu.tm_lock); ++ write_lock(&pt->vcpu->domain->arch.hvm_domain.pl_time->pt_migrate); + pt->vcpu = v; +- if ( on_list ) ++ if ( pt->on_list ) + { +- pt->on_list = 1; ++ list_del(&pt->list); + list_add(&pt->list, &v->arch.hvm_vcpu.tm_list); +- + migrate_timer(&pt->timer, v->processor); + } +- spin_unlock(&v->arch.hvm_vcpu.tm_lock); ++ write_unlock(&pt->vcpu->domain->arch.hvm_domain.pl_time->pt_migrate); + } + + void pt_adjust_global_vcpu_target(struct vcpu *v) +--- a/xen/include/asm-x86/hvm/vpt.h ++++ b/xen/include/asm-x86/hvm/vpt.h +@@ -133,6 +133,13 @@ struct pl_time { /* platform time */ + struct RTCState vrtc; + struct HPETState vhpet; + struct PMTState vpmt; ++ /* ++ * rwlock to prevent periodic_time vCPU migration. Take the lock in read ++ * mode in order to prevent the vcpu field of periodic_time from changing. ++ * Lock must be taken in write mode when changes to the vcpu field are ++ * performed, as it allows exclusive access to all the timers of a domain. ++ */ ++ rwlock_t pt_migrate; + /* guest_time = Xen sys time + stime_offset */ + int64_t stime_offset; + /* Ensures monotonicity in appropriate timer modes. */ diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-1.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,92 @@ +From: Roger Pau Monné +Subject: x86/msi: get rid of read_msi_msg + +It's safer and faster to just use the cached last written +(untranslated) MSI message stored in msi_desc for the single user that +calls read_msi_msg. + +This also prevents relying on the data read from the device MSI +registers in order to figure out the index into the IOMMU interrupt +remapping table, which is not safe. + +This is part of XSA-337. 
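The replacement relies on a write-through shadow: remember the last value the hypervisor itself wrote and use that, instead of trusting a read-back from a device the guest may control. A small sketch under that assumption (msi_shadow and the helper names are invented, not Xen's):

    #include <stdint.h>
    #include <stdio.h>

    struct msi_shadow {
        uint32_t data;                /* last value written to the data register */
    };

    static void msi_write_data(struct msi_shadow *s, volatile uint32_t *reg,
                               uint32_t val)
    {
        *reg = val;                   /* program the device */
        s->data = val;                /* and remember what was written */
    }

    static uint32_t msi_data(const struct msi_shadow *s)
    {
        return s->data;               /* never re-read the (untrusted) device */
    }

    int main(void)
    {
        uint32_t fake_reg = 0;
        struct msi_shadow s = { 0 };

        msi_write_data(&s, &fake_reg, 0x41);
        fake_reg = 0xdead;            /* device state changes behind our back */
        printf("%#x\n", (unsigned)msi_data(&s));   /* still 0x41 */
        return 0;
    }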
+ +Reported-by: Andrew Cooper +Requested-by: Andrew Cooper +Signed-off-by: Roger Pau Monné +Reviewed-by: Jan Beulich + +--- a/xen/arch/x86/msi.c ++++ b/xen/arch/x86/msi.c +@@ -192,59 +192,6 @@ void msi_compose_msg(unsigned vector, co + MSI_DATA_VECTOR(vector); + } + +-static bool read_msi_msg(struct msi_desc *entry, struct msi_msg *msg) +-{ +- switch ( entry->msi_attrib.type ) +- { +- case PCI_CAP_ID_MSI: +- { +- struct pci_dev *dev = entry->dev; +- int pos = entry->msi_attrib.pos; +- u16 data, seg = dev->seg; +- u8 bus = dev->bus; +- u8 slot = PCI_SLOT(dev->devfn); +- u8 func = PCI_FUNC(dev->devfn); +- +- msg->address_lo = pci_conf_read32(seg, bus, slot, func, +- msi_lower_address_reg(pos)); +- if ( entry->msi_attrib.is_64 ) +- { +- msg->address_hi = pci_conf_read32(seg, bus, slot, func, +- msi_upper_address_reg(pos)); +- data = pci_conf_read16(seg, bus, slot, func, +- msi_data_reg(pos, 1)); +- } +- else +- { +- msg->address_hi = 0; +- data = pci_conf_read16(seg, bus, slot, func, +- msi_data_reg(pos, 0)); +- } +- msg->data = data; +- break; +- } +- case PCI_CAP_ID_MSIX: +- { +- void __iomem *base = entry->mask_base; +- +- if ( unlikely(!msix_memory_decoded(entry->dev, +- entry->msi_attrib.pos)) ) +- return false; +- msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET); +- msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET); +- msg->data = readl(base + PCI_MSIX_ENTRY_DATA_OFFSET); +- break; +- } +- default: +- BUG(); +- } +- +- if ( iommu_intremap ) +- iommu_read_msi_from_ire(entry, msg); +- +- return true; +-} +- + static int write_msi_msg(struct msi_desc *entry, struct msi_msg *msg) + { + entry->msg = *msg; +@@ -322,10 +269,7 @@ void set_msi_affinity(struct irq_desc *d + + ASSERT(spin_is_locked(&desc->lock)); + +- memset(&msg, 0, sizeof(msg)); +- if ( !read_msi_msg(msi_desc, &msg) ) +- return; +- ++ msg = msi_desc->msg; + msg.data &= ~MSI_DATA_VECTOR_MASK; + msg.data |= MSI_DATA_VECTOR(desc->arch.vector); + msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa337-4.12-2.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,182 @@ +From: Jan Beulich +Subject: x86/MSI-X: restrict reading of table/PBA bases from BARs + +When assigned to less trusted or un-trusted guests, devices may change +state behind our backs (they may e.g. get reset by means we may not know +about). Therefore we should avoid reading BARs from hardware once a +device is no longer owned by Dom0. Furthermore when we can't read a BAR, +or when we read zero, we shouldn't instead use the caller provided +address unless that caller can be trusted. + +Re-arrange the logic in msix_capability_init() such that only Dom0 (and +only if the device isn't DomU-owned yet) or calls through +PHYSDEVOP_prepare_msix will actually result in the reading of the +respective BAR register(s). Additionally do so only as long as in-use +table entries are known (note that invocation of PHYSDEVOP_prepare_msix +counts as a "pseudo" entry). In all other uses the value already +recorded will get used instead. + +Clear the recorded values in _pci_cleanup_msix() as well as on the one +affected error path. (Adjust this error path to also avoid blindly +disabling MSI-X when it was enabled on entry to the function.) 
+ +While moving around variable declarations (in many cases to reduce their +scopes), also adjust some of their types. + +This is part of XSA-337. + +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné + +--- a/xen/arch/x86/msi.c ++++ b/xen/arch/x86/msi.c +@@ -790,16 +790,14 @@ static int msix_capability_init(struct p + { + struct arch_msix *msix = dev->msix; + struct msi_desc *entry = NULL; +- int vf; + u16 control; + u64 table_paddr; + u32 table_offset; +- u8 bir, pbus, pslot, pfunc; + u16 seg = dev->seg; + u8 bus = dev->bus; + u8 slot = PCI_SLOT(dev->devfn); + u8 func = PCI_FUNC(dev->devfn); +- bool maskall = msix->host_maskall; ++ bool maskall = msix->host_maskall, zap_on_error = false; + + ASSERT(pcidevs_locked()); + +@@ -837,43 +835,45 @@ static int msix_capability_init(struct p + /* Locate MSI-X table region */ + table_offset = pci_conf_read32(seg, bus, slot, func, + msix_table_offset_reg(pos)); +- bir = (u8)(table_offset & PCI_MSIX_BIRMASK); +- table_offset &= ~PCI_MSIX_BIRMASK; ++ if ( !msix->used_entries && ++ (!msi || ++ (is_hardware_domain(current->domain) && ++ (dev->domain == current->domain || dev->domain == dom_io))) ) ++ { ++ unsigned int bir = table_offset & PCI_MSIX_BIRMASK, pbus, pslot, pfunc; ++ int vf; ++ paddr_t pba_paddr; ++ unsigned int pba_offset; + +- if ( !dev->info.is_virtfn ) +- { +- pbus = bus; +- pslot = slot; +- pfunc = func; +- vf = -1; +- } +- else +- { +- pbus = dev->info.physfn.bus; +- pslot = PCI_SLOT(dev->info.physfn.devfn); +- pfunc = PCI_FUNC(dev->info.physfn.devfn); +- vf = PCI_BDF2(dev->bus, dev->devfn); +- } +- +- table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf); +- WARN_ON(msi && msi->table_base != table_paddr); +- if ( !table_paddr ) +- { +- if ( !msi || !msi->table_base ) ++ if ( !dev->info.is_virtfn ) + { +- pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), +- control & ~PCI_MSIX_FLAGS_ENABLE); +- xfree(entry); +- return -ENXIO; ++ pbus = bus; ++ pslot = slot; ++ pfunc = func; ++ vf = -1; ++ } ++ else ++ { ++ pbus = dev->info.physfn.bus; ++ pslot = PCI_SLOT(dev->info.physfn.devfn); ++ pfunc = PCI_FUNC(dev->info.physfn.devfn); ++ vf = PCI_BDF2(dev->bus, dev->devfn); + } +- table_paddr = msi->table_base; +- } +- table_paddr += table_offset; + +- if ( !msix->used_entries ) +- { +- u64 pba_paddr; +- u32 pba_offset; ++ table_paddr = read_pci_mem_bar(seg, pbus, pslot, pfunc, bir, vf); ++ WARN_ON(msi && msi->table_base != table_paddr); ++ if ( !table_paddr ) ++ { ++ if ( !msi || !msi->table_base ) ++ { ++ pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), ++ control & ~PCI_MSIX_FLAGS_ENABLE); ++ xfree(entry); ++ return -ENXIO; ++ } ++ table_paddr = msi->table_base; ++ } ++ table_paddr += table_offset & ~PCI_MSIX_BIRMASK; + + msix->nr_entries = nr_entries; + msix->table.first = PFN_DOWN(table_paddr); +@@ -894,7 +894,19 @@ static int msix_capability_init(struct p + BITS_TO_LONGS(nr_entries) - 1); + WARN_ON(rangeset_overlaps_range(mmio_ro_ranges, msix->pba.first, + msix->pba.last)); ++ ++ zap_on_error = true; + } ++ else if ( !msix->table.first ) ++ { ++ pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), ++ control); ++ xfree(entry); ++ return -ENODATA; ++ } ++ else ++ table_paddr = (msix->table.first << PAGE_SHIFT) + ++ (table_offset & ~PCI_MSIX_BIRMASK & ~PAGE_MASK); + + if ( entry ) + { +@@ -905,8 +917,16 @@ static int msix_capability_init(struct p + + if ( idx < 0 ) + { ++ if ( zap_on_error ) ++ { ++ msix->table.first = 0; ++ msix->pba.first = 0; ++ ++ control &= ~PCI_MSIX_FLAGS_ENABLE; ++ } ++ 
+ pci_conf_write16(seg, bus, slot, func, msix_control_reg(pos), +- control & ~PCI_MSIX_FLAGS_ENABLE); ++ control); + xfree(entry); + return idx; + } +@@ -1102,9 +1122,14 @@ static void _pci_cleanup_msix(struct arc + if ( rangeset_remove_range(mmio_ro_ranges, msix->table.first, + msix->table.last) ) + WARN(); ++ msix->table.first = 0; ++ msix->table.last = 0; ++ + if ( rangeset_remove_range(mmio_ro_ranges, msix->pba.first, + msix->pba.last) ) + WARN(); ++ msix->pba.first = 0; ++ msix->pba.last = 0; + } + } + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa338.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa338.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa338.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa338.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,42 @@ +From: Jan Beulich +Subject: evtchn: relax port_is_valid() + +To avoid ports potentially becoming invalid behind the back of certain +other functions (due to ->max_evtchn shrinking) because of +- a guest invoking evtchn_reset() and from a 2nd vCPU opening new + channels in parallel (see also XSA-343), +- alloc_unbound_xen_event_channel() produced channels living above the + 2-level range (see also XSA-342), +drop the max_evtchns check from port_is_valid(). For a port for which +the function once returned "true", the returned value may not turn into +"false" later on. The function's result may only depend on bounds which +can only ever grow (which is the case for d->valid_evtchns). + +This also eliminates a false sense of safety, utilized by some of the +users (see again XSA-343): Without a suitable lock held, d->max_evtchns +may change at any time, and hence deducing that certain other operations +are safe when port_is_valid() returned true is not legitimate. The +opportunities to abuse this may get widened by the change here +(depending on guest and host configuration), but will be taken care of +by the other XSA. + +This is XSA-338. + +Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels") +Signed-off-by: Jan Beulich +Reviewed-by: Stefano Stabellini +Reviewed-by: Julien Grall +--- +v5: New, split from larger patch. + +--- a/xen/include/xen/event.h ++++ b/xen/include/xen/event.h +@@ -107,8 +107,6 @@ void notify_via_xen_event_channel(struct + + static inline bool_t port_is_valid(struct domain *d, unsigned int p) + { +- if ( p >= d->max_evtchns ) +- return 0; + return p < read_atomic(&d->valid_evtchns); + } + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa339.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa339.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa339.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa339.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,76 @@ +From: Andrew Cooper +Subject: x86/pv: Avoid double exception injection + +There is at least one path (SYSENTER with NT set, Xen converts to #GP) which +ends up injecting the #GP fault twice, first in compat_sysenter(), and then a +second time in compat_test_all_events(), due to the stale TBF_EXCEPTION left +in TRAPBOUNCE_flags. + +The guest kernel sees the second fault first, which is a kernel level #GP +pointing at the head of the #GP handler, and is therefore a userspace +trigger-able DoS. + +This particular bug has bitten us several times before, so rearrange +{compat_,}create_bounce_frame() to clobber TRAPBOUNCE on success, rather than +leaving this task to one area of code which isn't used uniformly. 
+ +Other scenarios which might result in a double injection (e.g. two calls +directly to compat_create_bounce_frame) will now crash the guest, which is far +more obvious than letting the kernel run with corrupt state. + +This is XSA-339 + +Fixes: fdac9515607b ("x86: clear EFLAGS.NT in SYSENTER entry path") +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/xen/arch/x86/x86_64/compat/entry.S b/xen/arch/x86/x86_64/compat/entry.S +index c3e62f8734..73619f57ca 100644 +--- a/xen/arch/x86/x86_64/compat/entry.S ++++ b/xen/arch/x86/x86_64/compat/entry.S +@@ -78,7 +78,6 @@ compat_process_softirqs: + sti + .Lcompat_bounce_exception: + call compat_create_bounce_frame +- movb $0, TRAPBOUNCE_flags(%rdx) + jmp compat_test_all_events + + ALIGN +@@ -352,7 +351,13 @@ __UNLIKELY_END(compat_bounce_null_selector) + movl %eax,UREGS_cs+8(%rsp) + movl TRAPBOUNCE_eip(%rdx),%eax + movl %eax,UREGS_rip+8(%rsp) ++ ++ /* Trapbounce complete. Clobber state to avoid an erroneous second injection. */ ++ xor %eax, %eax ++ mov %ax, TRAPBOUNCE_cs(%rdx) ++ mov %al, TRAPBOUNCE_flags(%rdx) + ret ++ + .section .fixup,"ax" + .Lfx13: + xorl %edi,%edi +diff --git a/xen/arch/x86/x86_64/entry.S b/xen/arch/x86/x86_64/entry.S +index 1e880eb9f6..71a00e846b 100644 +--- a/xen/arch/x86/x86_64/entry.S ++++ b/xen/arch/x86/x86_64/entry.S +@@ -90,7 +90,6 @@ process_softirqs: + sti + .Lbounce_exception: + call create_bounce_frame +- movb $0, TRAPBOUNCE_flags(%rdx) + jmp test_all_events + + ALIGN +@@ -512,6 +511,11 @@ UNLIKELY_START(z, create_bounce_frame_bad_bounce_ip) + jmp asm_domain_crash_synchronous /* Does not return */ + __UNLIKELY_END(create_bounce_frame_bad_bounce_ip) + movq %rax,UREGS_rip+8(%rsp) ++ ++ /* Trapbounce complete. Clobber state to avoid an erroneous second injection. */ ++ xor %eax, %eax ++ mov %rax, TRAPBOUNCE_eip(%rdx) ++ mov %al, TRAPBOUNCE_flags(%rdx) + ret + + .pushsection .fixup, "ax", @progbits diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa340.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa340.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa340.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa340.patch 2022-04-05 13:04:22.000000000 +0100 @@ -0,0 +1,65 @@ +From: Julien Grall +Subject: xen/evtchn: Add missing barriers when accessing/allocating an event channel + +While the allocation of a bucket is always performed with the per-domain +lock, the bucket may be accessed without the lock taken (for instance, see +evtchn_send()). + +Instead such sites relies on port_is_valid() to return a non-zero value +when the port has a struct evtchn associated to it. The function will +mostly check whether the port is less than d->valid_evtchns as all the +buckets/event channels should be allocated up to that point. + +Unfortunately a compiler is free to re-order the assignment in +evtchn_allocate_port() so it would be possible to have d->valid_evtchns +updated before the new bucket has finish to allocate. + +Additionally on Arm, even if this was compiled "correctly", the +processor can still re-order the memory access. + +Add a write memory barrier in the allocation side and a read memory +barrier when the port is valid to prevent any re-ordering issue. + +This is XSA-340. 
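The ordering requirement is the usual publish/consume pairing: make the bucket fully visible before the counter that advertises it, and order the counter read before any access through it. A stand-alone C11 sketch using release/acquire in place of Xen's smp_wmb()/smp_rmb(); bucket, nr_valid, publish_bucket and lookup are illustrative names:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_BUCKETS 16

    struct entry { int state; };

    static struct entry *bucket[NR_BUCKETS];
    static atomic_uint nr_valid;      /* number of published buckets */

    /* Writer: fully initialise the bucket, then publish the new count. The
     * release store plays the role of smp_wmb() in the hunk below. */
    static void publish_bucket(unsigned int idx)
    {
        bucket[idx] = calloc(1, sizeof(struct entry));
        atomic_store_explicit(&nr_valid, idx + 1, memory_order_release);
    }

    /* Reader: check the count first; the acquire load (smp_rmb() below) makes
     * sure the bucket contents are visible before being dereferenced. */
    static struct entry *lookup(unsigned int idx)
    {
        if (idx >= atomic_load_explicit(&nr_valid, memory_order_acquire))
            return NULL;
        return bucket[idx];
    }

    int main(void)
    {
        publish_bucket(0);
        printf("%d\n", lookup(0) != NULL);   /* 1 */
        return 0;
    }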
+ +Reported-by: Julien Grall +Signed-off-by: Julien Grall +Reviewed-by: Stefano Stabellini + +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -178,6 +178,13 @@ int evtchn_allocate_port(struct domain * + return -ENOMEM; + bucket_from_port(d, port) = chn; + ++ /* ++ * d->valid_evtchns is used to check whether the bucket can be ++ * accessed without the per-domain lock. Therefore, ++ * d->valid_evtchns should be seen *after* the new bucket has ++ * been setup. ++ */ ++ smp_wmb(); + write_atomic(&d->valid_evtchns, d->valid_evtchns + EVTCHNS_PER_BUCKET); + } + +--- a/xen/include/xen/event.h ++++ b/xen/include/xen/event.h +@@ -107,7 +107,17 @@ void notify_via_xen_event_channel(struct + + static inline bool_t port_is_valid(struct domain *d, unsigned int p) + { +- return p < read_atomic(&d->valid_evtchns); ++ if ( p >= read_atomic(&d->valid_evtchns) ) ++ return false; ++ ++ /* ++ * The caller will usually access the event channel afterwards and ++ * may be done without taking the per-domain lock. The barrier is ++ * going in pair the smp_wmb() barrier in evtchn_allocate_port(). ++ */ ++ smp_rmb(); ++ ++ return true; + } + + static inline struct evtchn *evtchn_from_port(struct domain *d, unsigned int p) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa342-4.13.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa342-4.13.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa342-4.13.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa342-4.13.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,145 @@ +From: Jan Beulich +Subject: evtchn/x86: enforce correct upper limit for 32-bit guests + +The recording of d->max_evtchns in evtchn_2l_init(), in particular with +the limited set of callers of the function, is insufficient. Neither for +PV nor for HVM guests the bitness is known at domain_create() time, yet +the upper bound in 2-level mode depends upon guest bitness. Recording +too high a limit "allows" x86 32-bit domains to open not properly usable +event channels, management of which (inside Xen) would then result in +corruption of the shared info and vCPU info structures. + +Keep the upper limit dynamic for the 2-level case, introducing a helper +function to retrieve the effective limit. This helper is now supposed to +be private to the event channel code. The used in do_poll() and +domain_dump_evtchn_info() weren't consistent with port uses elsewhere +and hence get switched to port_is_valid(). + +Furthermore FIFO mode's setup_ports() gets adjusted to loop only up to +the prior ABI limit, rather than all the way up to the new one. + +Finally a word on the change to do_poll(): Accessing ->max_evtchns +without holding a suitable lock was never safe, as it as well as +->evtchn_port_ops may change behind do_poll()'s back. Using +port_is_valid() instead widens some the window for potential abuse, +until we've dealt with the race altogether (see XSA-343). + +This is XSA-342. 
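For the 2-level ABI the effective bound is the square of the guest's event-word width: 1024 ports for a 32-bit guest and 4096 for a 64-bit one, so a limit recorded before the guest's bitness is known can be too high. A toy calculation (not Xen code) of the numbers involved:

    #include <stdbool.h>
    #include <stdio.h>

    /* 2-level ABI: one selector word whose bits each cover a word of pending
     * bits, so the port space is bits-per-word squared. */
    static unsigned int two_level_max_ports(bool guest_is_64bit)
    {
        unsigned int bits_per_word = guest_is_64bit ? 64 : 32;

        return bits_per_word * bits_per_word;
    }

    int main(void)
    {
        printf("32-bit guest: %u ports\n", two_level_max_ports(false)); /* 1024 */
        printf("64-bit guest: %u ports\n", two_level_max_ports(true));  /* 4096 */
        return 0;
    }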
+ +Reported-by: Julien Grall +Fixes: 48974e6ce52e ("evtchn: use a per-domain variable for the max number of event channels") +Signed-off-by: Jan Beulich +Reviewed-by: Stefano Stabellini +Reviewed-by: Julien Grall + +--- a/xen/common/event_2l.c ++++ b/xen/common/event_2l.c +@@ -103,7 +103,6 @@ static const struct evtchn_port_ops evtc + void evtchn_2l_init(struct domain *d) + { + d->evtchn_port_ops = &evtchn_port_ops_2l; +- d->max_evtchns = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d); + } + + /* +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -151,7 +151,7 @@ static void free_evtchn_bucket(struct do + + int evtchn_allocate_port(struct domain *d, evtchn_port_t port) + { +- if ( port > d->max_evtchn_port || port >= d->max_evtchns ) ++ if ( port > d->max_evtchn_port || port >= max_evtchns(d) ) + return -ENOSPC; + + if ( port_is_valid(d, port) ) +@@ -1396,13 +1396,11 @@ static void domain_dump_evtchn_info(stru + + spin_lock(&d->event_lock); + +- for ( port = 1; port < d->max_evtchns; ++port ) ++ for ( port = 1; port_is_valid(d, port); ++port ) + { + const struct evtchn *chn; + char *ssid; + +- if ( !port_is_valid(d, port) ) +- continue; + chn = evtchn_from_port(d, port); + if ( chn->state == ECS_FREE ) + continue; +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -478,7 +478,7 @@ static void cleanup_event_array(struct d + d->evtchn_fifo = NULL; + } + +-static void setup_ports(struct domain *d) ++static void setup_ports(struct domain *d, unsigned int prev_evtchns) + { + unsigned int port; + +@@ -488,7 +488,7 @@ static void setup_ports(struct domain *d + * - save its pending state. + * - set default priority. + */ +- for ( port = 1; port < d->max_evtchns; port++ ) ++ for ( port = 1; port < prev_evtchns; port++ ) + { + struct evtchn *evtchn; + +@@ -546,6 +546,8 @@ int evtchn_fifo_init_control(struct evtc + if ( !d->evtchn_fifo ) + { + struct vcpu *vcb; ++ /* Latch the value before it changes during setup_event_array(). */ ++ unsigned int prev_evtchns = max_evtchns(d); + + for_each_vcpu ( d, vcb ) { + rc = setup_control_block(vcb); +@@ -562,8 +564,7 @@ int evtchn_fifo_init_control(struct evtc + goto error; + + d->evtchn_port_ops = &evtchn_port_ops_fifo; +- d->max_evtchns = EVTCHN_FIFO_NR_CHANNELS; +- setup_ports(d); ++ setup_ports(d, prev_evtchns); + } + else + rc = map_control_block(v, gfn, offset); +--- a/xen/common/schedule.c ++++ b/xen/common/schedule.c +@@ -1434,7 +1434,7 @@ static long do_poll(struct sched_poll *s + goto out; + + rc = -EINVAL; +- if ( port >= d->max_evtchns ) ++ if ( !port_is_valid(d, port) ) + goto out; + + rc = 0; +--- a/xen/include/xen/event.h ++++ b/xen/include/xen/event.h +@@ -105,6 +105,12 @@ void notify_via_xen_event_channel(struct + #define bucket_from_port(d, p) \ + ((group_from_port(d, p))[((p) % EVTCHNS_PER_GROUP) / EVTCHNS_PER_BUCKET]) + ++static inline unsigned int max_evtchns(const struct domain *d) ++{ ++ return d->evtchn_fifo ? EVTCHN_FIFO_NR_CHANNELS ++ : BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d); ++} ++ + static inline bool_t port_is_valid(struct domain *d, unsigned int p) + { + if ( p >= read_atomic(&d->valid_evtchns) ) +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -382,7 +382,6 @@ struct domain + /* Event channel information. 
*/ + struct evtchn *evtchn; /* first bucket only */ + struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */ +- unsigned int max_evtchns; /* number supported by ABI */ + unsigned int max_evtchn_port; /* max permitted port number */ + unsigned int valid_evtchns; /* number of allocated event channels */ + spinlock_t event_lock; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-1.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,190 @@ +From: Jan Beulich +Subject: evtchn: evtchn_reset() shouldn't succeed with still-open ports + +While the function closes all ports, it does so without holding any +lock, and hence racing requests may be issued causing new ports to get +opened. This would have been problematic in particular if such a newly +opened port had a port number above the new implementation limit (i.e. +when switching from FIFO to 2-level) after the reset, as prior to +"evtchn: relax port_is_valid()" this could have led to e.g. +evtchn_close()'s "BUG_ON(!port_is_valid(d2, port2))" to trigger. + +Introduce a counter of active ports and check that it's (still) no +larger then the number of Xen internally used ones after obtaining the +necessary lock in evtchn_reset(). + +As to the access model of the new {active,xen}_evtchns fields - while +all writes get done using write_atomic(), reads ought to use +read_atomic() only when outside of a suitably locked region. + +Note that as of now evtchn_bind_virq() and evtchn_bind_ipi() don't have +a need to call check_free_port(). + +This is part of XSA-343. + +Signed-off-by: Jan Beulich +Reviewed-by: Stefano Stabellini +Reviewed-by: Julien Grall + +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -188,6 +188,8 @@ int evtchn_allocate_port(struct domain * + write_atomic(&d->valid_evtchns, d->valid_evtchns + EVTCHNS_PER_BUCKET); + } + ++ write_atomic(&d->active_evtchns, d->active_evtchns + 1); ++ + return 0; + } + +@@ -211,11 +213,26 @@ static int get_free_port(struct domain * + return -ENOSPC; + } + ++/* ++ * Check whether a port is still marked free, and if so update the domain ++ * counter accordingly. To be used on function exit paths. ++ */ ++static void check_free_port(struct domain *d, evtchn_port_t port) ++{ ++ if ( port_is_valid(d, port) && ++ evtchn_from_port(d, port)->state == ECS_FREE ) ++ write_atomic(&d->active_evtchns, d->active_evtchns - 1); ++} ++ + void evtchn_free(struct domain *d, struct evtchn *chn) + { + /* Clear pending event to avoid unexpected behavior on re-bind. */ + evtchn_port_clear_pending(d, chn); + ++ if ( consumer_is_xen(chn) ) ++ write_atomic(&d->xen_evtchns, d->xen_evtchns - 1); ++ write_atomic(&d->active_evtchns, d->active_evtchns - 1); ++ + /* Reset binding to vcpu0 when the channel is freed. 
*/ + chn->state = ECS_FREE; + chn->notify_vcpu_id = 0; +@@ -258,6 +275,7 @@ static long evtchn_alloc_unbound(evtchn_ + alloc->port = port; + + out: ++ check_free_port(d, port); + spin_unlock(&d->event_lock); + rcu_unlock_domain(d); + +@@ -351,6 +369,7 @@ static long evtchn_bind_interdomain(evtc + bind->local_port = lport; + + out: ++ check_free_port(ld, lport); + spin_unlock(&ld->event_lock); + if ( ld != rd ) + spin_unlock(&rd->event_lock); +@@ -484,7 +503,7 @@ static long evtchn_bind_pirq(evtchn_bind + struct domain *d = current->domain; + struct vcpu *v = d->vcpu[0]; + struct pirq *info; +- int port, pirq = bind->pirq; ++ int port = 0, pirq = bind->pirq; + long rc; + + if ( (pirq < 0) || (pirq >= d->nr_pirqs) ) +@@ -532,6 +551,7 @@ static long evtchn_bind_pirq(evtchn_bind + arch_evtchn_bind_pirq(d, pirq); + + out: ++ check_free_port(d, port); + spin_unlock(&d->event_lock); + + return rc; +@@ -1005,10 +1025,10 @@ int evtchn_unmask(unsigned int port) + return 0; + } + +- + int evtchn_reset(struct domain *d) + { + unsigned int i; ++ int rc = 0; + + if ( d != current->domain && !d->controller_pause_count ) + return -EINVAL; +@@ -1018,7 +1038,9 @@ int evtchn_reset(struct domain *d) + + spin_lock(&d->event_lock); + +- if ( d->evtchn_fifo ) ++ if ( d->active_evtchns > d->xen_evtchns ) ++ rc = -EAGAIN; ++ else if ( d->evtchn_fifo ) + { + /* Switching back to 2-level ABI. */ + evtchn_fifo_destroy(d); +@@ -1027,7 +1049,7 @@ int evtchn_reset(struct domain *d) + + spin_unlock(&d->event_lock); + +- return 0; ++ return rc; + } + + static long evtchn_set_priority(const struct evtchn_set_priority *set_priority) +@@ -1213,10 +1235,9 @@ int alloc_unbound_xen_event_channel( + + spin_lock(&ld->event_lock); + +- rc = get_free_port(ld); ++ port = rc = get_free_port(ld); + if ( rc < 0 ) + goto out; +- port = rc; + chn = evtchn_from_port(ld, port); + + rc = xsm_evtchn_unbound(XSM_TARGET, ld, chn, remote_domid); +@@ -1232,7 +1253,10 @@ int alloc_unbound_xen_event_channel( + + spin_unlock(&chn->lock); + ++ write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1); ++ + out: ++ check_free_port(ld, port); + spin_unlock(&ld->event_lock); + + return rc < 0 ? rc : port; +@@ -1308,6 +1332,7 @@ int evtchn_init(struct domain *d) + return -EINVAL; + } + evtchn_from_port(d, 0)->state = ECS_RESERVED; ++ write_atomic(&d->active_evtchns, 0); + + #if MAX_VIRT_CPUS > BITS_PER_LONG + d->poll_mask = xzalloc_array(unsigned long, +@@ -1335,6 +1360,8 @@ void evtchn_destroy(struct domain *d) + for ( i = 0; port_is_valid(d, i); i++ ) + evtchn_close(d, i, 0); + ++ ASSERT(!d->active_evtchns); ++ + clear_global_virq_handlers(d); + + evtchn_fifo_destroy(d); +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -345,6 +345,16 @@ struct domain + struct evtchn **evtchn_group[NR_EVTCHN_GROUPS]; /* all other buckets */ + unsigned int max_evtchn_port; /* max permitted port number */ + unsigned int valid_evtchns; /* number of allocated event channels */ ++ /* ++ * Number of in-use event channels. Writers should use write_atomic(). ++ * Readers need to use read_atomic() only when not holding event_lock. ++ */ ++ unsigned int active_evtchns; ++ /* ++ * Number of event channels used internally by Xen (not subject to ++ * EVTCHNOP_reset). Read/write access like for active_evtchns. 
++ */ ++ unsigned int xen_evtchns; + spinlock_t event_lock; + const struct evtchn_port_ops *evtchn_port_ops; + struct evtchn_fifo_domain *evtchn_fifo; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-2.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,290 @@ +From: Jan Beulich +Subject: evtchn: convert per-channel lock to be IRQ-safe + +... in order for send_guest_{global,vcpu}_virq() to be able to make use +of it. + +This is part of XSA-343. + +Signed-off-by: Jan Beulich +Acked-by: Julien Grall + +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -248,6 +248,7 @@ static long evtchn_alloc_unbound(evtchn_ + int port; + domid_t dom = alloc->dom; + long rc; ++ unsigned long flags; + + d = rcu_lock_domain_by_any_id(dom); + if ( d == NULL ) +@@ -263,14 +264,14 @@ static long evtchn_alloc_unbound(evtchn_ + if ( rc ) + goto out; + +- spin_lock(&chn->lock); ++ spin_lock_irqsave(&chn->lock, flags); + + chn->state = ECS_UNBOUND; + if ( (chn->u.unbound.remote_domid = alloc->remote_dom) == DOMID_SELF ) + chn->u.unbound.remote_domid = current->domain->domain_id; + evtchn_port_init(d, chn); + +- spin_unlock(&chn->lock); ++ spin_unlock_irqrestore(&chn->lock, flags); + + alloc->port = port; + +@@ -283,26 +284,32 @@ static long evtchn_alloc_unbound(evtchn_ + } + + +-static void double_evtchn_lock(struct evtchn *lchn, struct evtchn *rchn) ++static unsigned long double_evtchn_lock(struct evtchn *lchn, ++ struct evtchn *rchn) + { +- if ( lchn < rchn ) ++ unsigned long flags; ++ ++ if ( lchn <= rchn ) + { +- spin_lock(&lchn->lock); +- spin_lock(&rchn->lock); ++ spin_lock_irqsave(&lchn->lock, flags); ++ if ( lchn != rchn ) ++ spin_lock(&rchn->lock); + } + else + { +- if ( lchn != rchn ) +- spin_lock(&rchn->lock); ++ spin_lock_irqsave(&rchn->lock, flags); + spin_lock(&lchn->lock); + } ++ ++ return flags; + } + +-static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn) ++static void double_evtchn_unlock(struct evtchn *lchn, struct evtchn *rchn, ++ unsigned long flags) + { +- spin_unlock(&lchn->lock); + if ( lchn != rchn ) +- spin_unlock(&rchn->lock); ++ spin_unlock(&lchn->lock); ++ spin_unlock_irqrestore(&rchn->lock, flags); + } + + static long evtchn_bind_interdomain(evtchn_bind_interdomain_t *bind) +@@ -312,6 +319,7 @@ static long evtchn_bind_interdomain(evtc + int lport, rport = bind->remote_port; + domid_t rdom = bind->remote_dom; + long rc; ++ unsigned long flags; + + if ( rdom == DOMID_SELF ) + rdom = current->domain->domain_id; +@@ -347,7 +355,7 @@ static long evtchn_bind_interdomain(evtc + if ( rc ) + goto out; + +- double_evtchn_lock(lchn, rchn); ++ flags = double_evtchn_lock(lchn, rchn); + + lchn->u.interdomain.remote_dom = rd; + lchn->u.interdomain.remote_port = rport; +@@ -364,7 +372,7 @@ static long evtchn_bind_interdomain(evtc + */ + evtchn_port_set_pending(ld, lchn->notify_vcpu_id, lchn); + +- double_evtchn_unlock(lchn, rchn); ++ double_evtchn_unlock(lchn, rchn, flags); + + bind->local_port = lport; + +@@ -387,6 +395,7 @@ int evtchn_bind_virq(evtchn_bind_virq_t + struct domain *d = current->domain; + int virq = bind->virq, vcpu = bind->vcpu; + int rc = 0; ++ unsigned long flags; + + if ( (virq < 0) || (virq >= ARRAY_SIZE(v->virq_to_evtchn)) ) + return -EINVAL; +@@ -419,14 +428,14 @@ int 
evtchn_bind_virq(evtchn_bind_virq_t + + chn = evtchn_from_port(d, port); + +- spin_lock(&chn->lock); ++ spin_lock_irqsave(&chn->lock, flags); + + chn->state = ECS_VIRQ; + chn->notify_vcpu_id = vcpu; + chn->u.virq = virq; + evtchn_port_init(d, chn); + +- spin_unlock(&chn->lock); ++ spin_unlock_irqrestore(&chn->lock, flags); + + v->virq_to_evtchn[virq] = bind->port = port; + +@@ -443,6 +452,7 @@ static long evtchn_bind_ipi(evtchn_bind_ + struct domain *d = current->domain; + int port, vcpu = bind->vcpu; + long rc = 0; ++ unsigned long flags; + + if ( (vcpu < 0) || (vcpu >= d->max_vcpus) || + (d->vcpu[vcpu] == NULL) ) +@@ -455,13 +465,13 @@ static long evtchn_bind_ipi(evtchn_bind_ + + chn = evtchn_from_port(d, port); + +- spin_lock(&chn->lock); ++ spin_lock_irqsave(&chn->lock, flags); + + chn->state = ECS_IPI; + chn->notify_vcpu_id = vcpu; + evtchn_port_init(d, chn); + +- spin_unlock(&chn->lock); ++ spin_unlock_irqrestore(&chn->lock, flags); + + bind->port = port; + +@@ -505,6 +515,7 @@ static long evtchn_bind_pirq(evtchn_bind + struct pirq *info; + int port = 0, pirq = bind->pirq; + long rc; ++ unsigned long flags; + + if ( (pirq < 0) || (pirq >= d->nr_pirqs) ) + return -EINVAL; +@@ -537,14 +548,14 @@ static long evtchn_bind_pirq(evtchn_bind + goto out; + } + +- spin_lock(&chn->lock); ++ spin_lock_irqsave(&chn->lock, flags); + + chn->state = ECS_PIRQ; + chn->u.pirq.irq = pirq; + link_pirq_port(port, chn, v); + evtchn_port_init(d, chn); + +- spin_unlock(&chn->lock); ++ spin_unlock_irqrestore(&chn->lock, flags); + + bind->port = port; + +@@ -565,6 +576,7 @@ int evtchn_close(struct domain *d1, int + struct evtchn *chn1, *chn2; + int port2; + long rc = 0; ++ unsigned long flags; + + again: + spin_lock(&d1->event_lock); +@@ -664,14 +676,14 @@ int evtchn_close(struct domain *d1, int + BUG_ON(chn2->state != ECS_INTERDOMAIN); + BUG_ON(chn2->u.interdomain.remote_dom != d1); + +- double_evtchn_lock(chn1, chn2); ++ flags = double_evtchn_lock(chn1, chn2); + + evtchn_free(d1, chn1); + + chn2->state = ECS_UNBOUND; + chn2->u.unbound.remote_domid = d1->domain_id; + +- double_evtchn_unlock(chn1, chn2); ++ double_evtchn_unlock(chn1, chn2, flags); + + goto out; + +@@ -679,9 +691,9 @@ int evtchn_close(struct domain *d1, int + BUG(); + } + +- spin_lock(&chn1->lock); ++ spin_lock_irqsave(&chn1->lock, flags); + evtchn_free(d1, chn1); +- spin_unlock(&chn1->lock); ++ spin_unlock_irqrestore(&chn1->lock, flags); + + out: + if ( d2 != NULL ) +@@ -701,13 +713,14 @@ int evtchn_send(struct domain *ld, unsig + struct evtchn *lchn, *rchn; + struct domain *rd; + int rport, ret = 0; ++ unsigned long flags; + + if ( !port_is_valid(ld, lport) ) + return -EINVAL; + + lchn = evtchn_from_port(ld, lport); + +- spin_lock(&lchn->lock); ++ spin_lock_irqsave(&lchn->lock, flags); + + /* Guest cannot send via a Xen-attached event channel. 
*/ + if ( unlikely(consumer_is_xen(lchn)) ) +@@ -742,7 +755,7 @@ int evtchn_send(struct domain *ld, unsig + } + + out: +- spin_unlock(&lchn->lock); ++ spin_unlock_irqrestore(&lchn->lock, flags); + + return ret; + } +@@ -1232,6 +1245,7 @@ int alloc_unbound_xen_event_channel( + { + struct evtchn *chn; + int port, rc; ++ unsigned long flags; + + spin_lock(&ld->event_lock); + +@@ -1244,14 +1258,14 @@ int alloc_unbound_xen_event_channel( + if ( rc ) + goto out; + +- spin_lock(&chn->lock); ++ spin_lock_irqsave(&chn->lock, flags); + + chn->state = ECS_UNBOUND; + chn->xen_consumer = get_xen_consumer(notification_fn); + chn->notify_vcpu_id = lvcpu; + chn->u.unbound.remote_domid = remote_domid; + +- spin_unlock(&chn->lock); ++ spin_unlock_irqrestore(&chn->lock, flags); + + write_atomic(&ld->xen_evtchns, ld->xen_evtchns + 1); + +@@ -1274,11 +1288,12 @@ void notify_via_xen_event_channel(struct + { + struct evtchn *lchn, *rchn; + struct domain *rd; ++ unsigned long flags; + + ASSERT(port_is_valid(ld, lport)); + lchn = evtchn_from_port(ld, lport); + +- spin_lock(&lchn->lock); ++ spin_lock_irqsave(&lchn->lock, flags); + + if ( likely(lchn->state == ECS_INTERDOMAIN) ) + { +@@ -1288,7 +1303,7 @@ void notify_via_xen_event_channel(struct + evtchn_port_set_pending(rd, rchn->notify_vcpu_id, rchn); + } + +- spin_unlock(&lchn->lock); ++ spin_unlock_irqrestore(&lchn->lock, flags); + } + + void evtchn_check_pollers(struct domain *d, unsigned int port) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-3.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-3.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-3.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa343-4.11-3.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,381 @@ +From: Jan Beulich +Subject: evtchn: address races with evtchn_reset() + +Neither d->evtchn_port_ops nor max_evtchns(d) may be used in an entirely +lock-less manner, as both may change by a racing evtchn_reset(). In the +common case, at least one of the domain's event lock or the per-channel +lock needs to be held. In the specific case of the inter-domain sending +by evtchn_send() and notify_via_xen_event_channel() holding the other +side's per-channel lock is sufficient, as the channel can't change state +without both per-channel locks held. Without such a channel changing +state, evtchn_reset() can't complete successfully. + +Lock-free accesses continue to be permitted for the shim (calling some +otherwise internal event channel functions), as this happens while the +domain is in effectively single-threaded mode. Special care also needs +taking for the shim's marking of in-use ports as ECS_RESERVED (allowing +use of such ports in the shim case is okay because switching into and +hence also out of FIFO mode is impossible there). + +As a side effect, certain operations on Xen bound event channels which +were mistakenly permitted so far (e.g. unmask or poll) will be refused +now. + +This is part of XSA-343. 
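The locking scheme described above leans on double_evtchn_lock() (earlier in this series) taking the two per-channel locks in a stable address order, so that two CPUs locking the same pair from opposite directions cannot deadlock. A simplified userspace analogy of that idiom using POSIX mutexes; double_lock()/double_unlock() are invented names, and the real code additionally saves and restores the interrupt flags:

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Acquire two locks in a fixed (address) order so concurrent lockers
     * of the same pair agree on the order and cannot deadlock; a single
     * lock passed twice is only taken once. */
    static void double_lock(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        if ( a == b )
        {
            pthread_mutex_lock(a);
            return;
        }
        if ( (uintptr_t)a > (uintptr_t)b )
        {
            pthread_mutex_t *t = a;
            a = b;
            b = t;
        }
        pthread_mutex_lock(a);
        pthread_mutex_lock(b);
    }

    static void double_unlock(pthread_mutex_t *a, pthread_mutex_t *b)
    {
        pthread_mutex_unlock(a);
        if ( a != b )
            pthread_mutex_unlock(b);
    }

    int main(void)
    {
        static pthread_mutex_t x = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t y = PTHREAD_MUTEX_INITIALIZER;

        double_lock(&x, &y);
        puts("both locks held, acquired in address order");
        double_unlock(&x, &y);
        return 0;
    }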
+ +Reported-by: Julien Grall +Signed-off-by: Jan Beulich +Acked-by: Julien Grall + +--- a/xen/arch/x86/irq.c ++++ b/xen/arch/x86/irq.c +@@ -2367,14 +2367,24 @@ static void dump_irqs(unsigned char key) + + for ( i = 0; i < action->nr_guests; i++ ) + { ++ struct evtchn *evtchn; ++ unsigned int pending = 2, masked = 2; ++ + d = action->guest[i]; + pirq = domain_irq_to_pirq(d, irq); + info = pirq_info(d, pirq); ++ evtchn = evtchn_from_port(d, info->evtchn); ++ local_irq_disable(); ++ if ( spin_trylock(&evtchn->lock) ) ++ { ++ pending = evtchn_is_pending(d, evtchn); ++ masked = evtchn_is_masked(d, evtchn); ++ spin_unlock(&evtchn->lock); ++ } ++ local_irq_enable(); + printk("%u:%3d(%c%c%c)", +- d->domain_id, pirq, +- evtchn_port_is_pending(d, info->evtchn) ? 'P' : '-', +- evtchn_port_is_masked(d, info->evtchn) ? 'M' : '-', +- (info->masked ? 'M' : '-')); ++ d->domain_id, pirq, "-P?"[pending], ++ "-M?"[masked], info->masked ? 'M' : '-'); + if ( i != action->nr_guests ) + printk(","); + } +--- a/xen/arch/x86/pv/shim.c ++++ b/xen/arch/x86/pv/shim.c +@@ -616,8 +616,11 @@ void pv_shim_inject_evtchn(unsigned int + if ( port_is_valid(guest, port) ) + { + struct evtchn *chn = evtchn_from_port(guest, port); ++ unsigned long flags; + ++ spin_lock_irqsave(&chn->lock, flags); + evtchn_port_set_pending(guest, chn->notify_vcpu_id, chn); ++ spin_unlock_irqrestore(&chn->lock, flags); + } + } + +--- a/xen/common/event_2l.c ++++ b/xen/common/event_2l.c +@@ -63,8 +63,10 @@ static void evtchn_2l_unmask(struct doma + } + } + +-static bool evtchn_2l_is_pending(const struct domain *d, evtchn_port_t port) ++static bool evtchn_2l_is_pending(const struct domain *d, ++ const struct evtchn *evtchn) + { ++ evtchn_port_t port = evtchn->port; + unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d); + + ASSERT(port < max_ports); +@@ -72,8 +74,10 @@ static bool evtchn_2l_is_pending(const s + guest_test_bit(d, port, &shared_info(d, evtchn_pending))); + } + +-static bool evtchn_2l_is_masked(const struct domain *d, evtchn_port_t port) ++static bool evtchn_2l_is_masked(const struct domain *d, ++ const struct evtchn *evtchn) + { ++ evtchn_port_t port = evtchn->port; + unsigned int max_ports = BITS_PER_EVTCHN_WORD(d) * BITS_PER_EVTCHN_WORD(d); + + ASSERT(port < max_ports); +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -156,8 +156,9 @@ int evtchn_allocate_port(struct domain * + + if ( port_is_valid(d, port) ) + { +- if ( evtchn_from_port(d, port)->state != ECS_FREE || +- evtchn_port_is_busy(d, port) ) ++ const struct evtchn *chn = evtchn_from_port(d, port); ++ ++ if ( chn->state != ECS_FREE || evtchn_is_busy(d, chn) ) + return -EBUSY; + } + else +@@ -770,6 +771,7 @@ void send_guest_vcpu_virq(struct vcpu *v + unsigned long flags; + int port; + struct domain *d; ++ struct evtchn *chn; + + ASSERT(!virq_is_global(virq)); + +@@ -780,7 +782,10 @@ void send_guest_vcpu_virq(struct vcpu *v + goto out; + + d = v->domain; +- evtchn_port_set_pending(d, v->vcpu_id, evtchn_from_port(d, port)); ++ chn = evtchn_from_port(d, port); ++ spin_lock(&chn->lock); ++ evtchn_port_set_pending(d, v->vcpu_id, chn); ++ spin_unlock(&chn->lock); + + out: + spin_unlock_irqrestore(&v->virq_lock, flags); +@@ -809,7 +814,9 @@ static void send_guest_global_virq(struc + goto out; + + chn = evtchn_from_port(d, port); ++ spin_lock(&chn->lock); + evtchn_port_set_pending(d, chn->notify_vcpu_id, chn); ++ spin_unlock(&chn->lock); + + out: + spin_unlock_irqrestore(&v->virq_lock, flags); +@@ -819,6 +826,7 @@ void 
send_guest_pirq(struct domain *d, c + { + int port; + struct evtchn *chn; ++ unsigned long flags; + + /* + * PV guests: It should not be possible to race with __evtchn_close(). The +@@ -833,7 +841,9 @@ void send_guest_pirq(struct domain *d, c + } + + chn = evtchn_from_port(d, port); ++ spin_lock_irqsave(&chn->lock, flags); + evtchn_port_set_pending(d, chn->notify_vcpu_id, chn); ++ spin_unlock_irqrestore(&chn->lock, flags); + } + + static struct domain *global_virq_handlers[NR_VIRQS] __read_mostly; +@@ -1028,12 +1038,15 @@ int evtchn_unmask(unsigned int port) + { + struct domain *d = current->domain; + struct evtchn *evtchn; ++ unsigned long flags; + + if ( unlikely(!port_is_valid(d, port)) ) + return -EINVAL; + + evtchn = evtchn_from_port(d, port); ++ spin_lock_irqsave(&evtchn->lock, flags); + evtchn_port_unmask(d, evtchn); ++ spin_unlock_irqrestore(&evtchn->lock, flags); + + return 0; + } +@@ -1446,8 +1459,8 @@ static void domain_dump_evtchn_info(stru + + printk(" %4u [%d/%d/", + port, +- evtchn_port_is_pending(d, port), +- evtchn_port_is_masked(d, port)); ++ evtchn_is_pending(d, chn), ++ evtchn_is_masked(d, chn)); + evtchn_port_print_state(d, chn); + printk("]: s=%d n=%d x=%d", + chn->state, chn->notify_vcpu_id, chn->xen_consumer); +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -295,23 +295,26 @@ static void evtchn_fifo_unmask(struct do + evtchn_fifo_set_pending(v, evtchn); + } + +-static bool evtchn_fifo_is_pending(const struct domain *d, evtchn_port_t port) ++static bool evtchn_fifo_is_pending(const struct domain *d, ++ const struct evtchn *evtchn) + { +- const event_word_t *word = evtchn_fifo_word_from_port(d, port); ++ const event_word_t *word = evtchn_fifo_word_from_port(d, evtchn->port); + + return word && guest_test_bit(d, EVTCHN_FIFO_PENDING, word); + } + +-static bool_t evtchn_fifo_is_masked(const struct domain *d, evtchn_port_t port) ++static bool_t evtchn_fifo_is_masked(const struct domain *d, ++ const struct evtchn *evtchn) + { +- const event_word_t *word = evtchn_fifo_word_from_port(d, port); ++ const event_word_t *word = evtchn_fifo_word_from_port(d, evtchn->port); + + return !word || guest_test_bit(d, EVTCHN_FIFO_MASKED, word); + } + +-static bool_t evtchn_fifo_is_busy(const struct domain *d, evtchn_port_t port) ++static bool_t evtchn_fifo_is_busy(const struct domain *d, ++ const struct evtchn *evtchn) + { +- const event_word_t *word = evtchn_fifo_word_from_port(d, port); ++ const event_word_t *word = evtchn_fifo_word_from_port(d, evtchn->port); + + return word && guest_test_bit(d, EVTCHN_FIFO_LINKED, word); + } +--- a/xen/include/asm-x86/event.h ++++ b/xen/include/asm-x86/event.h +@@ -47,4 +47,10 @@ static inline bool arch_virq_is_global(u + return true; + } + ++#ifdef CONFIG_PV_SHIM ++# include ++# define arch_evtchn_is_special(chn) \ ++ (pv_shim && (chn)->port && (chn)->state == ECS_RESERVED) ++#endif ++ + #endif +--- a/xen/include/xen/event.h ++++ b/xen/include/xen/event.h +@@ -125,6 +125,24 @@ static inline struct evtchn *evtchn_from + return bucket_from_port(d, p) + (p % EVTCHNS_PER_BUCKET); + } + ++/* ++ * "usable" as in "by a guest", i.e. Xen consumed channels are assumed to be ++ * taken care of separately where used for Xen's internal purposes. 
++ */ ++static bool evtchn_usable(const struct evtchn *evtchn) ++{ ++ if ( evtchn->xen_consumer ) ++ return false; ++ ++#ifdef arch_evtchn_is_special ++ if ( arch_evtchn_is_special(evtchn) ) ++ return true; ++#endif ++ ++ BUILD_BUG_ON(ECS_FREE > ECS_RESERVED); ++ return evtchn->state > ECS_RESERVED; ++} ++ + /* Wait on a Xen-attached event channel. */ + #define wait_on_xen_event_channel(port, condition) \ + do { \ +@@ -157,19 +175,24 @@ int evtchn_reset(struct domain *d); + + /* + * Low-level event channel port ops. ++ * ++ * All hooks have to be called with a lock held which prevents the channel ++ * from changing state. This may be the domain event lock, the per-channel ++ * lock, or in the case of sending interdomain events also the other side's ++ * per-channel lock. Exceptions apply in certain cases for the PV shim. + */ + struct evtchn_port_ops { + void (*init)(struct domain *d, struct evtchn *evtchn); + void (*set_pending)(struct vcpu *v, struct evtchn *evtchn); + void (*clear_pending)(struct domain *d, struct evtchn *evtchn); + void (*unmask)(struct domain *d, struct evtchn *evtchn); +- bool (*is_pending)(const struct domain *d, evtchn_port_t port); +- bool (*is_masked)(const struct domain *d, evtchn_port_t port); ++ bool (*is_pending)(const struct domain *d, const struct evtchn *evtchn); ++ bool (*is_masked)(const struct domain *d, const struct evtchn *evtchn); + /* + * Is the port unavailable because it's still being cleaned up + * after being closed? + */ +- bool (*is_busy)(const struct domain *d, evtchn_port_t port); ++ bool (*is_busy)(const struct domain *d, const struct evtchn *evtchn); + int (*set_priority)(struct domain *d, struct evtchn *evtchn, + unsigned int priority); + void (*print_state)(struct domain *d, const struct evtchn *evtchn); +@@ -185,38 +208,67 @@ static inline void evtchn_port_set_pendi + unsigned int vcpu_id, + struct evtchn *evtchn) + { +- d->evtchn_port_ops->set_pending(d->vcpu[vcpu_id], evtchn); ++ if ( evtchn_usable(evtchn) ) ++ d->evtchn_port_ops->set_pending(d->vcpu[vcpu_id], evtchn); + } + + static inline void evtchn_port_clear_pending(struct domain *d, + struct evtchn *evtchn) + { +- d->evtchn_port_ops->clear_pending(d, evtchn); ++ if ( evtchn_usable(evtchn) ) ++ d->evtchn_port_ops->clear_pending(d, evtchn); + } + + static inline void evtchn_port_unmask(struct domain *d, + struct evtchn *evtchn) + { +- d->evtchn_port_ops->unmask(d, evtchn); ++ if ( evtchn_usable(evtchn) ) ++ d->evtchn_port_ops->unmask(d, evtchn); + } + +-static inline bool evtchn_port_is_pending(const struct domain *d, +- evtchn_port_t port) ++static inline bool evtchn_is_pending(const struct domain *d, ++ const struct evtchn *evtchn) + { +- return d->evtchn_port_ops->is_pending(d, port); ++ return evtchn_usable(evtchn) && d->evtchn_port_ops->is_pending(d, evtchn); + } + +-static inline bool evtchn_port_is_masked(const struct domain *d, +- evtchn_port_t port) ++static inline bool evtchn_port_is_pending(struct domain *d, evtchn_port_t port) + { +- return d->evtchn_port_ops->is_masked(d, port); ++ struct evtchn *evtchn = evtchn_from_port(d, port); ++ bool rc; ++ unsigned long flags; ++ ++ spin_lock_irqsave(&evtchn->lock, flags); ++ rc = evtchn_is_pending(d, evtchn); ++ spin_unlock_irqrestore(&evtchn->lock, flags); ++ ++ return rc; ++} ++ ++static inline bool evtchn_is_masked(const struct domain *d, ++ const struct evtchn *evtchn) ++{ ++ return !evtchn_usable(evtchn) || d->evtchn_port_ops->is_masked(d, evtchn); ++} ++ ++static inline bool evtchn_port_is_masked(struct domain *d, 
evtchn_port_t port) ++{ ++ struct evtchn *evtchn = evtchn_from_port(d, port); ++ bool rc; ++ unsigned long flags; ++ ++ spin_lock_irqsave(&evtchn->lock, flags); ++ rc = evtchn_is_masked(d, evtchn); ++ spin_unlock_irqrestore(&evtchn->lock, flags); ++ ++ return rc; + } + +-static inline bool evtchn_port_is_busy(const struct domain *d, +- evtchn_port_t port) ++static inline bool evtchn_is_busy(const struct domain *d, ++ const struct evtchn *evtchn) + { + return d->evtchn_port_ops->is_busy && +- d->evtchn_port_ops->is_busy(d, port); ++ d->evtchn_port_ops->is_busy(d, evtchn); + } + + static inline int evtchn_port_set_priority(struct domain *d, +@@ -225,6 +277,8 @@ static inline int evtchn_port_set_priori + { + if ( !d->evtchn_port_ops->set_priority ) + return -ENOSYS; ++ if ( !evtchn_usable(evtchn) ) ++ return -EACCES; + return d->evtchn_port_ops->set_priority(d, evtchn, priority); + } + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-1.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,132 @@ +From: Jan Beulich +Subject: evtchn: arrange for preemption in evtchn_destroy() + +Especially closing of fully established interdomain channels can take +quite some time, due to the locking involved. Therefore we shouldn't +assume we can clean up still active ports all in one go. Besides adding +the necessary preemption check, also avoid pointlessly starting from +(or now really ending at) 0; 1 is the lowest numbered port which may +need closing. + +Since we're now reducing ->valid_evtchns, free_xen_event_channel(), +and (at least to be on the safe side) notify_via_xen_event_channel() +need to cope with attempts to close / unbind from / send through already +closed (and no longer valid, as per port_is_valid()) ports. + +This is part of XSA-344. + +Signed-off-by: Jan Beulich +Acked-by: Julien Grall +Reviewed-by: Stefano Stabellini + +--- a/xen/common/domain.c ++++ b/xen/common/domain.c +@@ -646,7 +646,6 @@ int domain_kill(struct domain *d) + if ( d->is_dying != DOMDYING_alive ) + return domain_kill(d); + d->is_dying = DOMDYING_dying; +- evtchn_destroy(d); + gnttab_release_mappings(d); + tmem_destroy(d->tmem_client); + vnuma_destroy(d->vnuma); +@@ -654,6 +653,9 @@ int domain_kill(struct domain *d) + d->tmem_client = NULL; + /* fallthrough */ + case DOMDYING_dying: ++ rc = evtchn_destroy(d); ++ if ( rc ) ++ break; + rc = domain_relinquish_resources(d); + if ( rc != 0 ) + break; +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -1291,7 +1291,16 @@ int alloc_unbound_xen_event_channel( + + void free_xen_event_channel(struct domain *d, int port) + { +- BUG_ON(!port_is_valid(d, port)); ++ if ( !port_is_valid(d, port) ) ++ { ++ /* ++ * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing ++ * with the spin_barrier() and BUG_ON() in evtchn_destroy(). ++ */ ++ smp_rmb(); ++ BUG_ON(!d->is_dying); ++ return; ++ } + + evtchn_close(d, port, 0); + } +@@ -1303,7 +1312,17 @@ void notify_via_xen_event_channel(struct + struct domain *rd; + unsigned long flags; + +- ASSERT(port_is_valid(ld, lport)); ++ if ( !port_is_valid(ld, lport) ) ++ { ++ /* ++ * Make sure ->is_dying is read /after/ ->valid_evtchns, pairing ++ * with the spin_barrier() and BUG_ON() in evtchn_destroy(). 
++ */ ++ smp_rmb(); ++ ASSERT(ld->is_dying); ++ return; ++ } ++ + lchn = evtchn_from_port(ld, lport); + + spin_lock_irqsave(&lchn->lock, flags); +@@ -1375,8 +1394,7 @@ int evtchn_init(struct domain *d) + return 0; + } + +- +-void evtchn_destroy(struct domain *d) ++int evtchn_destroy(struct domain *d) + { + unsigned int i; + +@@ -1385,14 +1403,29 @@ void evtchn_destroy(struct domain *d) + spin_barrier(&d->event_lock); + + /* Close all existing event channels. */ +- for ( i = 0; port_is_valid(d, i); i++ ) ++ for ( i = d->valid_evtchns; --i; ) ++ { + evtchn_close(d, i, 0); + ++ /* ++ * Avoid preempting when called from domain_create()'s error path, ++ * and don't check too often (choice of frequency is arbitrary). ++ */ ++ if ( i && !(i & 0x3f) && d->is_dying != DOMDYING_dead && ++ hypercall_preempt_check() ) ++ { ++ write_atomic(&d->valid_evtchns, i); ++ return -ERESTART; ++ } ++ } ++ + ASSERT(!d->active_evtchns); + + clear_global_virq_handlers(d); + + evtchn_fifo_destroy(d); ++ ++ return 0; + } + + +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -135,7 +135,7 @@ struct evtchn + } __attribute__((aligned(64))); + + int evtchn_init(struct domain *d); /* from domain_create */ +-void evtchn_destroy(struct domain *d); /* from domain_kill */ ++int evtchn_destroy(struct domain *d); /* from domain_kill */ + void evtchn_destroy_final(struct domain *d); /* from complete_domain_destroy */ + + struct waitqueue_vcpu; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa344-4.11-2.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,203 @@ +From: Jan Beulich +Subject: evtchn: arrange for preemption in evtchn_reset() + +Like for evtchn_destroy() looping over all possible event channels to +close them can take a significant amount of time. Unlike done there, we +can't alter domain properties (i.e. d->valid_evtchns) here. Borrow, in a +lightweight form, the paging domctl continuation concept, redirecting +the continuations to different sub-ops. Just like there this is to be +able to allow for predictable overall results of the involved sub-ops: +Racing requests should either complete or be refused. + +Note that a domain can't interfere with an already started (by a remote +domain) reset, due to being paused. It can prevent a remote reset from +happening by leaving a reset unfinished, but that's only going to affect +itself. + +This is part of XSA-344. 
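The continuation scheme described above reduces to a simple shape: do a bounded chunk of work, periodically check whether preemption is wanted, record the resume point, and return -ERESTART so the operation is re-issued. A generic C sketch of that shape follows; close_all(), close_one() and preempt_pending() are hypothetical stand-ins for the Xen routines, and ERESTART is defined locally just for the example:

    #include <stdbool.h>
    #include <stdio.h>

    #define ERESTART 85   /* local stand-in for Xen's internal -ERESTART */

    /* Hypothetical stand-ins for hypercall_preempt_check() and a
     * per-port close operation; not Xen functions. */
    static bool preempt_pending(void) { return false; }
    static void close_one(unsigned int port) { (void)port; }

    /*
     * Close ports 1..nr_ports-1, checking for preemption every 64
     * iterations.  On preemption, record where to resume and return
     * -ERESTART so the caller re-invokes this as a continuation.
     */
    static int close_all(unsigned int nr_ports, unsigned int *resume)
    {
        unsigned int i = *resume ? *resume : 1;

        for ( ; i < nr_ports; i++ )
        {
            close_one(i);

            /* Choice of frequency is arbitrary, as in the patches above. */
            if ( !(i & 0x3f) && preempt_pending() )
            {
                *resume = i + 1;
                return -ERESTART;
            }
        }

        *resume = 0;
        return 0;
    }

    int main(void)
    {
        unsigned int resume = 0;

        while ( close_all(4096, &resume) == -ERESTART )
            continue;   /* a real caller sets up a hypercall continuation */
        puts("all ports closed");
        return 0;
    }

A real caller turns the -ERESTART into a hypercall continuation, as do_domctl() and do_event_channel_op() do in the patch above.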
+ +Signed-off-by: Jan Beulich +Acked-by: Julien Grall +Reviewed-by: Stefano Stabellini + +--- a/xen/common/domain.c ++++ b/xen/common/domain.c +@@ -1105,7 +1105,7 @@ void domain_unpause_except_self(struct d + domain_unpause(d); + } + +-int domain_soft_reset(struct domain *d) ++int domain_soft_reset(struct domain *d, bool resuming) + { + struct vcpu *v; + int rc; +@@ -1119,7 +1119,7 @@ int domain_soft_reset(struct domain *d) + } + spin_unlock(&d->shutdown_lock); + +- rc = evtchn_reset(d); ++ rc = evtchn_reset(d, resuming); + if ( rc ) + return rc; + +--- a/xen/common/domctl.c ++++ b/xen/common/domctl.c +@@ -648,12 +648,22 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe + } + + case XEN_DOMCTL_soft_reset: ++ case XEN_DOMCTL_soft_reset_cont: + if ( d == current->domain ) /* no domain_pause() */ + { + ret = -EINVAL; + break; + } +- ret = domain_soft_reset(d); ++ ret = domain_soft_reset(d, op->cmd == XEN_DOMCTL_soft_reset_cont); ++ if ( ret == -ERESTART ) ++ { ++ op->cmd = XEN_DOMCTL_soft_reset_cont; ++ if ( !__copy_field_to_guest(u_domctl, op, cmd) ) ++ ret = hypercall_create_continuation(__HYPERVISOR_domctl, ++ "h", u_domctl); ++ else ++ ret = -EFAULT; ++ } + break; + + case XEN_DOMCTL_destroydomain: +--- a/xen/common/event_channel.c ++++ b/xen/common/event_channel.c +@@ -1051,7 +1051,7 @@ int evtchn_unmask(unsigned int port) + return 0; + } + +-int evtchn_reset(struct domain *d) ++int evtchn_reset(struct domain *d, bool resuming) + { + unsigned int i; + int rc = 0; +@@ -1059,11 +1059,40 @@ int evtchn_reset(struct domain *d) + if ( d != current->domain && !d->controller_pause_count ) + return -EINVAL; + +- for ( i = 0; port_is_valid(d, i); i++ ) ++ spin_lock(&d->event_lock); ++ ++ /* ++ * If we are resuming, then start where we stopped. Otherwise, check ++ * that a reset operation is not already in progress, and if none is, ++ * record that this is now the case. ++ */ ++ i = resuming ? d->next_evtchn : !d->next_evtchn; ++ if ( i > d->next_evtchn ) ++ d->next_evtchn = i; ++ ++ spin_unlock(&d->event_lock); ++ ++ if ( !i ) ++ return -EBUSY; ++ ++ for ( ; port_is_valid(d, i); i++ ) ++ { + evtchn_close(d, i, 1); + ++ /* NB: Choice of frequency is arbitrary. 
*/ ++ if ( !(i & 0x3f) && hypercall_preempt_check() ) ++ { ++ spin_lock(&d->event_lock); ++ d->next_evtchn = i; ++ spin_unlock(&d->event_lock); ++ return -ERESTART; ++ } ++ } ++ + spin_lock(&d->event_lock); + ++ d->next_evtchn = 0; ++ + if ( d->active_evtchns > d->xen_evtchns ) + rc = -EAGAIN; + else if ( d->evtchn_fifo ) +@@ -1198,7 +1227,8 @@ long do_event_channel_op(int cmd, XEN_GU + break; + } + +- case EVTCHNOP_reset: { ++ case EVTCHNOP_reset: ++ case EVTCHNOP_reset_cont: { + struct evtchn_reset reset; + struct domain *d; + +@@ -1211,9 +1241,13 @@ long do_event_channel_op(int cmd, XEN_GU + + rc = xsm_evtchn_reset(XSM_TARGET, current->domain, d); + if ( !rc ) +- rc = evtchn_reset(d); ++ rc = evtchn_reset(d, cmd == EVTCHNOP_reset_cont); + + rcu_unlock_domain(d); ++ ++ if ( rc == -ERESTART ) ++ rc = hypercall_create_continuation(__HYPERVISOR_event_channel_op, ++ "ih", EVTCHNOP_reset_cont, arg); + break; + } + +--- a/xen/include/public/domctl.h ++++ b/xen/include/public/domctl.h +@@ -1121,7 +1121,10 @@ struct xen_domctl { + #define XEN_DOMCTL_iomem_permission 20 + #define XEN_DOMCTL_ioport_permission 21 + #define XEN_DOMCTL_hypercall_init 22 +-#define XEN_DOMCTL_arch_setup 23 /* Obsolete IA64 only */ ++#ifdef __XEN__ ++/* #define XEN_DOMCTL_arch_setup 23 Obsolete IA64 only */ ++#define XEN_DOMCTL_soft_reset_cont 23 ++#endif + #define XEN_DOMCTL_settimeoffset 24 + #define XEN_DOMCTL_getvcpuaffinity 25 + #define XEN_DOMCTL_real_mode_area 26 /* Obsolete PPC only */ +--- a/xen/include/public/event_channel.h ++++ b/xen/include/public/event_channel.h +@@ -74,6 +74,9 @@ + #define EVTCHNOP_init_control 11 + #define EVTCHNOP_expand_array 12 + #define EVTCHNOP_set_priority 13 ++#ifdef __XEN__ ++#define EVTCHNOP_reset_cont 14 ++#endif + /* ` } */ + + typedef uint32_t evtchn_port_t; +--- a/xen/include/xen/event.h ++++ b/xen/include/xen/event.h +@@ -163,7 +163,7 @@ void evtchn_check_pollers(struct domain + void evtchn_2l_init(struct domain *d); + + /* Close all event channels and reset to 2-level ABI. */ +-int evtchn_reset(struct domain *d); ++int evtchn_reset(struct domain *d, bool resuming); + + /* + * Low-level event channel port ops. +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -355,6 +355,8 @@ struct domain + * EVTCHNOP_reset). Read/write access like for active_evtchns. + */ + unsigned int xen_evtchns; ++ /* Port to resume from in evtchn_reset(), when in a continuation. */ ++ unsigned int next_evtchn; + spinlock_t event_lock; + const struct evtchn_port_ops *evtchn_port_ops; + struct evtchn_fifo_domain *evtchn_fifo; +@@ -608,7 +610,7 @@ int domain_shutdown(struct domain *d, u8 + void domain_resume(struct domain *d); + void domain_pause_for_debugger(void); + +-int domain_soft_reset(struct domain *d); ++int domain_soft_reset(struct domain *d, bool resuming); + + int vcpu_start_shutdown_deferral(struct vcpu *v); + void vcpu_end_shutdown_deferral(struct vcpu *v); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-1.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,57 @@ +From: Jan Beulich +Subject: IOMMU: suppress "iommu_dont_flush_iotlb" when about to free a page + +Deferring flushes to a single, wide range one - as is done when +handling XENMAPSPACE_gmfn_range - is okay only as long as +pages don't get freed ahead of the eventual flush. 
While the only +function setting the flag (xenmem_add_to_physmap()) suggests by its name +that it's only mapping new entries, in reality the way +xenmem_add_to_physmap_one() works means an unmap would happen not only +for the page being moved (but not freed) but, if the destination GFN is +populated, also for the page being displaced from that GFN. Collapsing +the two flushes for this GFN into just one (end even more so deferring +it to a batched invocation) is not correct. + +This is part of XSA-346. + +Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ") +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Acked-by: Julien Grall + +--- a/xen/common/memory.c ++++ b/xen/common/memory.c +@@ -298,7 +298,10 @@ int guest_remove_page(struct domain *d, + p2m_type_t p2mt; + #endif + mfn_t mfn; ++#ifdef CONFIG_HAS_PASSTHROUGH ++ bool *dont_flush_p, dont_flush; + int rc; ++#endif + + #ifdef CONFIG_X86 + mfn = get_gfn_query(d, gmfn, &p2mt); +@@ -376,8 +379,22 @@ int guest_remove_page(struct domain *d, + return -ENXIO; + } + ++#ifdef CONFIG_HAS_PASSTHROUGH ++ /* ++ * Since we're likely to free the page below, we need to suspend ++ * xenmem_add_to_physmap()'s suppressing of IOMMU TLB flushes. ++ */ ++ dont_flush_p = &this_cpu(iommu_dont_flush_iotlb); ++ dont_flush = *dont_flush_p; ++ *dont_flush_p = false; ++#endif ++ + rc = guest_physmap_remove_page(d, _gfn(gmfn), mfn, 0); + ++#ifdef CONFIG_HAS_PASSTHROUGH ++ *dont_flush_p = dont_flush; ++#endif ++ + /* + * With the lack of an IOMMU on some platforms, domains with DMA-capable + * device must retrieve the same pfn when the hypercall populate_physmap diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa346-4.11-2.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,202 @@ +From: Jan Beulich +Subject: IOMMU: hold page ref until after deferred TLB flush + +When moving around a page via XENMAPSPACE_gmfn_range, deferring the TLB +flush for the "from" GFN range requires that the page remains allocated +to the guest until the TLB flush has actually occurred. Otherwise a +parallel hypercall to remove the page would only flush the TLB for the +GFN it has been moved to, but not the one is was mapped at originally. + +This is part of XSA-346. + +Fixes: cf95b2a9fd5a ("iommu: Introduce per cpu flag (iommu_dont_flush_iotlb) to avoid unnecessary iotlb... ") +Reported-by: Julien Grall +Signed-off-by: Jan Beulich +Acked-by: Julien Grall + +--- a/xen/arch/arm/mm.c ++++ b/xen/arch/arm/mm.c +@@ -1222,7 +1222,7 @@ void share_xen_page_with_guest(struct pa + int xenmem_add_to_physmap_one( + struct domain *d, + unsigned int space, +- union xen_add_to_physmap_batch_extra extra, ++ union add_to_physmap_extra extra, + unsigned long idx, + gfn_t gfn) + { +@@ -1294,10 +1294,6 @@ int xenmem_add_to_physmap_one( + break; + } + case XENMAPSPACE_dev_mmio: +- /* extra should be 0. Reserved for future use. 
*/ +- if ( extra.res0 ) +- return -EOPNOTSUPP; +- + rc = map_dev_mmio_region(d, gfn, 1, _mfn(idx)); + return rc; + +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -4634,7 +4634,7 @@ static int handle_iomem_range(unsigned l + int xenmem_add_to_physmap_one( + struct domain *d, + unsigned int space, +- union xen_add_to_physmap_batch_extra extra, ++ union add_to_physmap_extra extra, + unsigned long idx, + gfn_t gpfn) + { +@@ -4721,9 +4721,20 @@ int xenmem_add_to_physmap_one( + rc = guest_physmap_add_page(d, gpfn, mfn, PAGE_ORDER_4K); + + put_both: +- /* In the XENMAPSPACE_gmfn case, we took a ref of the gfn at the top. */ ++ /* ++ * In the XENMAPSPACE_gmfn case, we took a ref of the gfn at the top. ++ * We also may need to transfer ownership of the page reference to our ++ * caller. ++ */ + if ( space == XENMAPSPACE_gmfn ) ++ { + put_gfn(d, gfn); ++ if ( !rc && extra.ppage ) ++ { ++ *extra.ppage = page; ++ page = NULL; ++ } ++ } + + if ( page ) + put_page(page); +--- a/xen/common/memory.c ++++ b/xen/common/memory.c +@@ -811,11 +811,10 @@ int xenmem_add_to_physmap(struct domain + { + unsigned int done = 0; + long rc = 0; +- union xen_add_to_physmap_batch_extra extra; ++ union add_to_physmap_extra extra = {}; ++ struct page_info *pages[16]; + +- if ( xatp->space != XENMAPSPACE_gmfn_foreign ) +- extra.res0 = 0; +- else ++ if ( xatp->space == XENMAPSPACE_gmfn_foreign ) + extra.foreign_domid = DOMID_INVALID; + + if ( xatp->space != XENMAPSPACE_gmfn_range ) +@@ -831,7 +830,10 @@ int xenmem_add_to_physmap(struct domain + + #ifdef CONFIG_HAS_PASSTHROUGH + if ( need_iommu(d) ) ++ { + this_cpu(iommu_dont_flush_iotlb) = 1; ++ extra.ppage = &pages[0]; ++ } + #endif + + while ( xatp->size > done ) +@@ -844,8 +846,12 @@ int xenmem_add_to_physmap(struct domain + xatp->idx++; + xatp->gpfn++; + ++ if ( extra.ppage ) ++ ++extra.ppage; ++ + /* Check for continuation if it's not the last iteration. */ +- if ( xatp->size > ++done && hypercall_preempt_check() ) ++ if ( (++done > ARRAY_SIZE(pages) && extra.ppage) || ++ (xatp->size > done && hypercall_preempt_check()) ) + { + rc = start + done; + break; +@@ -856,6 +862,7 @@ int xenmem_add_to_physmap(struct domain + if ( need_iommu(d) ) + { + int ret; ++ unsigned int i; + + this_cpu(iommu_dont_flush_iotlb) = 0; + +@@ -863,6 +870,15 @@ int xenmem_add_to_physmap(struct domain + if ( unlikely(ret) && rc >= 0 ) + rc = ret; + ++ /* ++ * Now that the IOMMU TLB flush was done for the original GFN, drop ++ * the page references. The 2nd flush below is fine to make later, as ++ * whoever removes the page again from its new GFN will have to do ++ * another flush anyway. ++ */ ++ for ( i = 0; i < done; ++i ) ++ put_page(pages[i]); ++ + ret = iommu_iotlb_flush(d, xatp->gpfn - done, done); + if ( unlikely(ret) && rc >= 0 ) + rc = ret; +@@ -876,6 +892,8 @@ static int xenmem_add_to_physmap_batch(s + struct xen_add_to_physmap_batch *xatpb, + unsigned int extent) + { ++ union add_to_physmap_extra extra = {}; ++ + if ( xatpb->size < extent ) + return -EILSEQ; + +@@ -884,6 +902,19 @@ static int xenmem_add_to_physmap_batch(s + !guest_handle_subrange_okay(xatpb->errs, extent, xatpb->size - 1) ) + return -EFAULT; + ++ switch ( xatpb->space ) ++ { ++ case XENMAPSPACE_dev_mmio: ++ /* res0 is reserved for future use. 
*/ ++ if ( xatpb->u.res0 ) ++ return -EOPNOTSUPP; ++ break; ++ ++ case XENMAPSPACE_gmfn_foreign: ++ extra.foreign_domid = xatpb->u.foreign_domid; ++ break; ++ } ++ + while ( xatpb->size > extent ) + { + xen_ulong_t idx; +@@ -896,8 +927,7 @@ static int xenmem_add_to_physmap_batch(s + extent, 1)) ) + return -EFAULT; + +- rc = xenmem_add_to_physmap_one(d, xatpb->space, +- xatpb->u, ++ rc = xenmem_add_to_physmap_one(d, xatpb->space, extra, + idx, _gfn(gpfn)); + + if ( unlikely(__copy_to_guest_offset(xatpb->errs, extent, &rc, 1)) ) +--- a/xen/include/xen/mm.h ++++ b/xen/include/xen/mm.h +@@ -577,8 +577,22 @@ void scrub_one_page(struct page_info *); + &(d)->xenpage_list : &(d)->page_list) + #endif + ++union add_to_physmap_extra { ++ /* ++ * XENMAPSPACE_gmfn: When deferring TLB flushes, a page reference needs ++ * to be kept until after the flush, so the page can't get removed from ++ * the domain (and re-used for another purpose) beforehand. By passing ++ * non-NULL, the caller of xenmem_add_to_physmap_one() indicates it wants ++ * to have ownership of such a reference transferred in the success case. ++ */ ++ struct page_info **ppage; ++ ++ /* XENMAPSPACE_gmfn_foreign */ ++ domid_t foreign_domid; ++}; ++ + int xenmem_add_to_physmap_one(struct domain *d, unsigned int space, +- union xen_add_to_physmap_batch_extra extra, ++ union add_to_physmap_extra extra, + unsigned long idx, gfn_t gfn); + + int xenmem_add_to_physmap(struct domain *d, struct xen_add_to_physmap *xatp, diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-1.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,52 @@ +From: Jan Beulich +Subject: AMD/IOMMU: update live PTEs atomically + +Updating a live PTE word by word allows the IOMMU to see a partially +updated entry. Construct the new entry fully in a local variable and +then write the new entry by a single insn. + +This is part of XSA-347. 
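The requirement described above is that a live, device-visible 64-bit entry must switch from its old to its new value in one store, never exposing a half-updated mix of its two 32-bit halves. A generic C11 sketch of that discipline follows; set_entry() is an invented name and the layout is hypothetical, not the AMD IOMMU PTE format:

    #include <stdatomic.h>
    #include <stdint.h>

    /*
     * Build the whole 64-bit entry in a local variable and publish it
     * with a single store, so a concurrent reader (for a real PTE: the
     * IOMMU) only ever observes the complete old or complete new value.
     */
    static void set_entry(_Atomic uint64_t *entry, uint32_t lo, uint32_t hi)
    {
        uint64_t full = ((uint64_t)hi << 32) | lo;

        atomic_store_explicit(entry, full, memory_order_release);
    }

How such a store is ordered against writes to other entries is a separate concern, which the second XSA-347 patch below deals with.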
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -41,7 +41,7 @@ static void clear_iommu_pte_present(unsi + + table = map_domain_page(_mfn(l1_mfn)); + pte = table + pfn_to_pde_idx(gfn, IOMMU_PAGING_MODE_LEVEL_1); +- *pte = 0; ++ write_atomic(pte, 0); + unmap_domain_page(table); + } + +@@ -49,7 +49,7 @@ static bool_t set_iommu_pde_present(u32 + unsigned int next_level, + bool_t iw, bool_t ir) + { +- uint64_t addr_lo, addr_hi, maddr_next; ++ uint64_t addr_lo, addr_hi, maddr_next, full; + u32 entry; + bool need_flush = false, old_present; + +@@ -106,7 +106,7 @@ static bool_t set_iommu_pde_present(u32 + if ( next_level == IOMMU_PAGING_MODE_LEVEL_0 ) + set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, entry, + IOMMU_PTE_FC_MASK, IOMMU_PTE_FC_SHIFT, &entry); +- pde[1] = entry; ++ full = (uint64_t)entry << 32; + + /* mark next level as 'present' */ + set_field_in_reg_u32((u32)addr_lo >> PAGE_SHIFT, 0, +@@ -118,7 +118,9 @@ static bool_t set_iommu_pde_present(u32 + set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, entry, + IOMMU_PDE_PRESENT_MASK, + IOMMU_PDE_PRESENT_SHIFT, &entry); +- pde[0] = entry; ++ full |= entry; ++ ++ write_atomic((uint64_t *)pde, full); + + return need_flush; + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa347-4.11-2.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,80 @@ +From: Jan Beulich +Subject: AMD/IOMMU: ensure suitable ordering of DTE modifications + +DMA and interrupt translation should be enabled only after other +applicable DTE fields have been written. Similarly when disabling +translation or when moving a device between domains, translation should +first be disabled, before other entry fields get modified. Note however +that the "moving" aspect doesn't apply to the interrupt remapping side, +as domain specifics are maintained in the IRTEs here, not the DTE. We +also never disable interrupt remapping once it got enabled for a device +(the respective argument passed is always the immutable iommu_intremap). + +This is part of XSA-347. 
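The ordering rule described above is the usual publish pattern: fill in every field of a descriptor first, issue a write barrier, and only then set the bit that makes the consumer look at it; teardown mirrors this by clearing that bit first. A generic C11 sketch with a hypothetical layout (struct desc, CTRL_VALID and publish_desc() are invented for the example), using a release fence in the role of smp_wmb():

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical descriptor layout, for illustration only. */
    struct desc {
        uint64_t table;         /* e.g. root table pointer */
        uint64_t cfg;           /* other configuration fields */
        _Atomic uint32_t ctrl;  /* holds the valid/enable bit */
    };

    #define CTRL_VALID 0x1u

    static void publish_desc(struct desc *d, uint64_t table, uint64_t cfg)
    {
        d->table = table;
        d->cfg = cfg;

        /* All fields must be visible before the entry is marked valid;
         * this fence plays the role of smp_wmb() in the patch. */
        atomic_thread_fence(memory_order_release);

        atomic_store_explicit(&d->ctrl, CTRL_VALID, memory_order_relaxed);
    }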
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -147,7 +147,22 @@ void amd_iommu_set_root_page_table( + u32 *dte, u64 root_ptr, u16 domain_id, u8 paging_mode, u8 valid) + { + u64 addr_hi, addr_lo; +- u32 entry; ++ u32 entry, dte0 = dte[0]; ++ ++ if ( valid || ++ get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_VALID_MASK, ++ IOMMU_DEV_TABLE_VALID_SHIFT) ) ++ { ++ set_field_in_reg_u32(IOMMU_CONTROL_DISABLED, dte0, ++ IOMMU_DEV_TABLE_TRANSLATION_VALID_MASK, ++ IOMMU_DEV_TABLE_TRANSLATION_VALID_SHIFT, &dte0); ++ set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, dte0, ++ IOMMU_DEV_TABLE_VALID_MASK, ++ IOMMU_DEV_TABLE_VALID_SHIFT, &dte0); ++ dte[0] = dte0; ++ smp_wmb(); ++ } ++ + set_field_in_reg_u32(domain_id, 0, + IOMMU_DEV_TABLE_DOMAIN_ID_MASK, + IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT, &entry); +@@ -166,8 +181,9 @@ void amd_iommu_set_root_page_table( + IOMMU_DEV_TABLE_IO_READ_PERMISSION_MASK, + IOMMU_DEV_TABLE_IO_READ_PERMISSION_SHIFT, &entry); + dte[1] = entry; ++ smp_wmb(); + +- set_field_in_reg_u32((u32)addr_lo >> PAGE_SHIFT, 0, ++ set_field_in_reg_u32((u32)addr_lo >> PAGE_SHIFT, dte0, + IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK, + IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT, &entry); + set_field_in_reg_u32(paging_mode, entry, +@@ -180,7 +196,7 @@ void amd_iommu_set_root_page_table( + IOMMU_CONTROL_DISABLED, entry, + IOMMU_DEV_TABLE_VALID_MASK, + IOMMU_DEV_TABLE_VALID_SHIFT, &entry); +- dte[0] = entry; ++ write_atomic(&dte[0], entry); + } + + void iommu_dte_set_iotlb(u32 *dte, u8 i) +@@ -212,6 +228,7 @@ void __init amd_iommu_set_intremap_table + IOMMU_DEV_TABLE_INT_CONTROL_MASK, + IOMMU_DEV_TABLE_INT_CONTROL_SHIFT, &entry); + dte[5] = entry; ++ smp_wmb(); + + set_field_in_reg_u32((u32)addr_lo >> 6, 0, + IOMMU_DEV_TABLE_INT_TABLE_PTR_LOW_MASK, +@@ -229,7 +246,7 @@ void __init amd_iommu_set_intremap_table + IOMMU_CONTROL_DISABLED, entry, + IOMMU_DEV_TABLE_INT_VALID_MASK, + IOMMU_DEV_TABLE_INT_VALID_SHIFT, &entry); +- dte[4] = entry; ++ write_atomic(&dte[4], entry); + } + + void __init iommu_dte_add_device_entry(u32 *dte, struct ivrs_mappings *ivrs_dev) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa348-4.11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa348-4.11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa348-4.11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa348-4.11.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,164 @@ +From: Jan Beulich +Subject: x86: avoid calling {svm,vmx}_do_resume() + +These functions follow the following path: hvm_do_resume() -> +handle_hvm_io_completion() -> hvm_wait_for_io() -> +wait_on_xen_event_channel() -> do_softirq() -> schedule() -> +sched_context_switch() -> continue_running() and hence may +recursively invoke themselves. If this ends up happening a couple of +times, a stack overflow would result. + +Prevent this by also resetting the stack at the +->arch.ctxt_switch->tail() invocations (in both places for consistency) +and thus jumping to the functions instead of calling them. + +This is XSA-348 / CVE-2020-29566. 
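The stack growth described above comes from re-entering the resume/schedule path through ordinary calls; the fix is to discard the current stack and jump to the (noreturn) tail hook instead. Portable C cannot reset the stack pointer, but the effect on stack depth can be illustrated with a trampoline: each step returns the next step and a top-level loop invokes it, so depth stays constant however many resume/schedule hops occur. All names below (do_resume(), do_schedule(), struct step) are invented for the illustration:

    #include <stdio.h>

    struct step;
    typedef struct step (*step_fn)(unsigned long *budget);
    struct step { step_fn fn; };

    static struct step do_schedule(unsigned long *budget);

    /* One "resume" step: either finish, or hand over to the scheduler. */
    static struct step do_resume(unsigned long *budget)
    {
        if ( --*budget == 0 )
            return (struct step){ NULL };
        return (struct step){ do_schedule };
    }

    /* One "schedule" step: pick the next thing to run (here: resume again). */
    static struct step do_schedule(unsigned long *budget)
    {
        (void)budget;
        return (struct step){ do_resume };
    }

    int main(void)
    {
        unsigned long budget = 1000000;   /* deep enough to exhaust a stack
                                             if these called each other */
        struct step next = { do_resume };

        while ( next.fn )
            next = next.fn(&budget);      /* constant stack depth */

        puts("finished with constant stack usage");
        return 0;
    }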
+ +Reported-by: Julien Grall +Signed-off-by: Jan Beulich +Reviewed-by: Juergen Gross + +--- sle12sp4.orig/xen/arch/x86/domain.c 2020-10-15 17:35:17.000000000 +0200 ++++ sle12sp4/xen/arch/x86/domain.c 2020-11-10 17:56:59.000000000 +0100 +@@ -121,7 +121,7 @@ static void play_dead(void) + (*dead_idle)(); + } + +-static void idle_loop(void) ++static void noreturn idle_loop(void) + { + unsigned int cpu = smp_processor_id(); + +@@ -161,11 +161,6 @@ void startup_cpu_idle_loop(void) + reset_stack_and_jump(idle_loop); + } + +-static void noreturn continue_idle_domain(struct vcpu *v) +-{ +- reset_stack_and_jump(idle_loop); +-} +- + void dump_pageframe_info(struct domain *d) + { + struct page_info *page; +@@ -456,7 +451,7 @@ int arch_domain_create(struct domain *d, + static const struct arch_csw idle_csw = { + .from = paravirt_ctxt_switch_from, + .to = paravirt_ctxt_switch_to, +- .tail = continue_idle_domain, ++ .tail = idle_loop, + }; + + d->arch.ctxt_switch = &idle_csw; +@@ -1770,20 +1765,12 @@ void context_switch(struct vcpu *prev, s + /* Ensure that the vcpu has an up-to-date time base. */ + update_vcpu_system_time(next); + +- /* +- * Schedule tail *should* be a terminal function pointer, but leave a +- * bug frame around just in case it returns, to save going back into the +- * context switching code and leaving a far more subtle crash to diagnose. +- */ +- nextd->arch.ctxt_switch->tail(next); +- BUG(); ++ reset_stack_and_jump_ind(nextd->arch.ctxt_switch->tail); + } + + void continue_running(struct vcpu *same) + { +- /* See the comment above. */ +- same->domain->arch.ctxt_switch->tail(same); +- BUG(); ++ reset_stack_and_jump_ind(same->domain->arch.ctxt_switch->tail); + } + + int __sync_local_execstate(void) +--- sle12sp4.orig/xen/arch/x86/hvm/svm/svm.c 2020-06-18 15:13:13.001760095 +0200 ++++ sle12sp4/xen/arch/x86/hvm/svm/svm.c 2020-11-10 17:56:59.000000000 +0100 +@@ -1111,8 +1111,9 @@ static void svm_ctxt_switch_to(struct vc + wrmsr_tsc_aux(hvm_msr_tsc_aux(v)); + } + +-static void noreturn svm_do_resume(struct vcpu *v) ++static void noreturn svm_do_resume(void) + { ++ struct vcpu *v = current; + struct vmcb_struct *vmcb = v->arch.hvm_svm.vmcb; + bool debug_state = (v->domain->debugger_attached || + v->domain->arch.monitor.software_breakpoint_enabled || +--- sle12sp4.orig/xen/arch/x86/hvm/vmx/vmcs.c 2019-12-03 17:46:26.000000000 +0100 ++++ sle12sp4/xen/arch/x86/hvm/vmx/vmcs.c 2020-11-10 17:56:59.000000000 +0100 +@@ -1782,8 +1782,9 @@ void vmx_vmentry_failure(void) + domain_crash_synchronous(); + } + +-void vmx_do_resume(struct vcpu *v) ++void vmx_do_resume(void) + { ++ struct vcpu *v = current; + bool_t debug_state; + unsigned long host_cr4; + +--- sle12sp4.orig/xen/arch/x86/pv/domain.c 2019-06-25 23:47:11.000000000 +0200 ++++ sle12sp4/xen/arch/x86/pv/domain.c 2020-11-10 17:56:59.000000000 +0100 +@@ -58,7 +58,7 @@ static int parse_pcid(const char *s) + } + custom_runtime_param("pcid", parse_pcid); + +-static void noreturn continue_nonidle_domain(struct vcpu *v) ++static void noreturn continue_nonidle_domain(void) + { + check_wakeup_from_wait(); + reset_stack_and_jump(ret_from_intr); +--- sle12sp4.orig/xen/include/asm-x86/current.h 2019-06-25 23:47:11.000000000 +0200 ++++ sle12sp4/xen/include/asm-x86/current.h 2020-11-10 17:56:59.000000000 +0100 +@@ -124,16 +124,23 @@ unsigned long get_stack_dump_bottom (uns + # define CHECK_FOR_LIVEPATCH_WORK "" + #endif + +-#define reset_stack_and_jump(__fn) \ ++#define switch_stack_and_jump(fn, instr, constr) \ + ({ \ + __asm__ __volatile__ ( \ + "mov 
%0,%%"__OP"sp;" \ +- CHECK_FOR_LIVEPATCH_WORK \ +- "jmp %c1" \ +- : : "r" (guest_cpu_user_regs()), "i" (__fn) : "memory" ); \ ++ CHECK_FOR_LIVEPATCH_WORK \ ++ instr "1" \ ++ : : "r" (guest_cpu_user_regs()), constr (fn) : "memory" ); \ + unreachable(); \ + }) + ++#define reset_stack_and_jump(fn) \ ++ switch_stack_and_jump(fn, "jmp %c", "i") ++ ++/* The constraint may only specify non-call-clobbered registers. */ ++#define reset_stack_and_jump_ind(fn) \ ++ switch_stack_and_jump(fn, "INDIRECT_JMP %", "b") ++ + /* + * Which VCPU's state is currently running on each CPU? + * This is not necesasrily the same as 'current' as a CPU may be +--- sle12sp4.orig/xen/include/asm-x86/domain.h 2019-12-03 17:46:26.000000000 +0100 ++++ sle12sp4/xen/include/asm-x86/domain.h 2020-11-10 17:56:59.000000000 +0100 +@@ -328,7 +328,7 @@ struct arch_domain + const struct arch_csw { + void (*from)(struct vcpu *); + void (*to)(struct vcpu *); +- void (*tail)(struct vcpu *); ++ void noreturn (*tail)(void); + } *ctxt_switch; + + /* nestedhvm: translate l2 guest physical to host physical */ +--- sle12sp4.orig/xen/include/asm-x86/hvm/vmx/vmx.h 2019-12-03 17:46:26.000000000 +0100 ++++ sle12sp4/xen/include/asm-x86/hvm/vmx/vmx.h 2020-11-10 17:56:59.000000000 +0100 +@@ -95,7 +95,7 @@ typedef enum { + void vmx_asm_vmexit_handler(struct cpu_user_regs); + void vmx_asm_do_vmentry(void); + void vmx_intr_assist(void); +-void noreturn vmx_do_resume(struct vcpu *); ++void noreturn vmx_do_resume(void); + void vmx_vlapic_msr_changed(struct vcpu *v); + void vmx_realmode_emulate_one(struct hvm_emulate_ctxt *hvmemul_ctxt); + void vmx_realmode(struct cpu_user_regs *regs); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-arm.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-arm.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-arm.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-arm.patch 2022-05-30 08:33:19.000000000 +0100 @@ -0,0 +1,58 @@ +From: Julien Grall +Subject: xen/arm: Always trap AMU system registers + +The Activity Monitors Unit (AMU) has been introduced by ARMv8.4. It is +considered to be unsafe to be expose to guests as they might expose +information about code executed by other guests or the host. + +Arm provided a way to trap all the AMU system registers by setting +CPTR_EL2.TAM to 1. + +Unfortunately, on older revision of the specification, the bit 30 (now +CPTR_EL1.TAM) was RES0. Because of that, Xen is setting it to 0 and +therefore the system registers would be exposed to the guest when it is +run on processors with AMU. + +As the bit is mark as UNKNOWN at boot in Armv8.4, the only safe solution +for us is to always set CPTR_EL1.TAM to 1. + +Guest trying to access the AMU system registers will now receive an +undefined instruction. Unfortunately, this means that even well-behaved +guest may fail to boot because we don't sanitize the ID registers. + +This is a known issues with other Armv8.0+ features (e.g. SVE, Pointer +Auth). This will taken care separately. + +This is part of XSA-351 (or XSA-93 re-born). + +Signed-off-by: Julien Grall +Reviewed-by: Andre Przywara +Reviewed-by: Stefano Stabellini +Reviewed-by: Bertrand Marquis + +diff --git a/xen/arch/arm/traps.c b/xen/arch/arm/traps.c +index a36f145e67..22bd1bd4c6 100644 +--- a/xen/arch/arm/traps.c ++++ b/xen/arch/arm/traps.c +@@ -151,7 +151,8 @@ void init_traps(void) + * On ARM64 the TCPx bits which we set here (0..9,12,13) are all + * RES1, i.e. they would trap whether we did this write or not. 
+ */ +- WRITE_SYSREG((HCPTR_CP_MASK & ~(HCPTR_CP(10) | HCPTR_CP(11))) | HCPTR_TTA, ++ WRITE_SYSREG((HCPTR_CP_MASK & ~(HCPTR_CP(10) | HCPTR_CP(11))) | ++ HCPTR_TTA | HCPTR_TAM, + CPTR_EL2); + + /* Setup hypervisor traps */ +diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h +index 3ca67f8157..d3d12a9d19 100644 +--- a/xen/include/asm-arm/processor.h ++++ b/xen/include/asm-arm/processor.h +@@ -351,6 +351,7 @@ + #define VTCR_RES1 (_AC(1,UL)<<31) + + /* HCPTR Hyp. Coprocessor Trap Register */ ++#define HCPTR_TAM ((_AC(1,U)<<30)) + #define HCPTR_TTA ((_AC(1,U)<<20)) /* Trap trace registers */ + #define HCPTR_CP(x) ((_AC(1,U)<<(x))) /* Trap Coprocessor x */ + #define HCPTR_CP_MASK ((_AC(1,U)<<14)-1) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-1.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,163 @@ +From: =?UTF-8?q?Roger=20Pau=20Monn=C3=A9?= +Subject: x86/msr: fix handling of MSR_IA32_PERF_{STATUS/CTL} +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Currently a PV hardware domain can also be given control over the CPU +frequency, and such guest is allowed to write to MSR_IA32_PERF_CTL. +However since commit 322ec7c89f6 the default behavior has been changed +to reject accesses to not explicitly handled MSRs, preventing PV +guests that manage CPU frequency from reading +MSR_IA32_PERF_{STATUS/CTL}. + +Additionally some HVM guests (Windows at least) will attempt to read +MSR_IA32_PERF_CTL and will panic if given back a #GP fault: + + vmx.c:3035:d8v0 RDMSR 0x00000199 unimplemented + d8v0 VIRIDIAN CRASH: 3b c0000096 fffff806871c1651 ffffda0253683720 0 + +Move the handling of MSR_IA32_PERF_{STATUS/CTL} to the common MSR +handling shared between HVM and PV guests, and add an explicit case +for reads to MSR_IA32_PERF_{STATUS/CTL}. + +Restore previous behavior and allow PV guests with the required +permissions to read the contents of the mentioned MSRs. Non privileged +guests will get 0 when trying to read those registers, as writes to +MSR_IA32_PERF_CTL by such guest will already be silently dropped. + +Fixes: 322ec7c89f6 ('x86/pv: disallow access to unknown MSRs') +Fixes: 84e848fd7a1 ('x86/hvm: disallow access to unknown MSRs') +Signed-off-by: Roger Pau Monné +Signed-off-by: Andrew Cooper +Reviewed-by: Roger Pau Monné +Reviewed-by: Jan Beulich +(cherry picked from commit 3059178798a23ba870ff86ff54d442a07e6651fc) + +diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c +index 256e58d82b..3495ac9f4a 100644 +--- a/xen/arch/x86/msr.c ++++ b/xen/arch/x86/msr.c +@@ -141,6 +141,7 @@ int init_vcpu_msr_policy(struct vcpu *v) + + int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val) + { ++ const struct domain *d = v->domain; + const struct cpuid_policy *cp = v->domain->arch.cpuid; + const struct msr_domain_policy *dp = v->domain->arch.msr; + const struct msr_vcpu_policy *vp = v->arch.msr; +@@ -212,6 +213,25 @@ int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val) + break; + + /* ++ * These MSRs are not enumerated in CPUID. They have been around ++ * since the Pentium 4, and implemented by other vendors. ++ * ++ * Some versions of Windows try reading these before setting up a #GP ++ * handler, and Linux has several unguarded reads as well. 
Provide ++ * RAZ semantics, in general, but permit a cpufreq controller dom0 to ++ * have full access. ++ */ ++ case MSR_IA32_PERF_STATUS: ++ case MSR_IA32_PERF_CTL: ++ if ( !(cp->x86_vendor & (X86_VENDOR_INTEL | X86_VENDOR_CENTAUR)) ) ++ goto gp_fault; ++ ++ *val = 0; ++ if ( likely(!is_cpufreq_controller(d)) || rdmsr_safe(msr, *val) == 0 ) ++ break; ++ goto gp_fault; ++ ++ /* + * TODO: Implement when we have better topology representation. + case MSR_INTEL_CORE_THREAD_COUNT: + */ +@@ -241,6 +261,7 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val) + case MSR_INTEL_CORE_THREAD_COUNT: + case MSR_INTEL_PLATFORM_INFO: + case MSR_ARCH_CAPABILITIES: ++ case MSR_IA32_PERF_STATUS: + /* Read-only */ + case MSR_TSX_FORCE_ABORT: + case MSR_TSX_CTRL: +@@ -345,6 +366,21 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val) + break; + } + ++ /* ++ * This MSR is not enumerated in CPUID. It has been around since the ++ * Pentium 4, and implemented by other vendors. ++ * ++ * To match the RAZ semantics, implement as write-discard, except for ++ * a cpufreq controller dom0 which has full access. ++ */ ++ case MSR_IA32_PERF_CTL: ++ if ( !(cp->x86_vendor & (X86_VENDOR_INTEL | X86_VENDOR_CENTAUR)) ) ++ goto gp_fault; ++ ++ if ( likely(!is_cpufreq_controller(d)) || wrmsr_safe(msr, val) == 0 ) ++ break; ++ goto gp_fault; ++ + default: + return X86EMUL_UNHANDLEABLE; + } +diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c +index 8120ded330..755f00db33 100644 +--- a/xen/arch/x86/pv/emul-priv-op.c ++++ b/xen/arch/x86/pv/emul-priv-op.c +@@ -816,12 +816,6 @@ static inline uint64_t guest_misc_enable(uint64_t val) + return val; + } + +-static inline bool is_cpufreq_controller(const struct domain *d) +-{ +- return ((cpufreq_controller == FREQCTL_dom0_kernel) && +- is_hardware_domain(d)); +-} +- + static int read_msr(unsigned int reg, uint64_t *val, + struct x86_emulate_ctxt *ctxt) + { +@@ -1096,14 +1090,6 @@ static int write_msr(unsigned int reg, uint64_t val, + return X86EMUL_OKAY; + break; + +- case MSR_IA32_PERF_CTL: +- if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ) +- break; +- if ( likely(!is_cpufreq_controller(currd)) || +- wrmsr_safe(reg, val) == 0 ) +- return X86EMUL_OKAY; +- break; +- + case MSR_IA32_THERM_CONTROL: + case MSR_IA32_ENERGY_PERF_BIAS: + if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ) +diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h +index c0cc5d9336..7e4ad5d51b 100644 +--- a/xen/include/xen/sched.h ++++ b/xen/include/xen/sched.h +@@ -920,6 +920,22 @@ extern enum cpufreq_controller { + FREQCTL_none, FREQCTL_dom0_kernel, FREQCTL_xen + } cpufreq_controller; + ++static always_inline bool is_cpufreq_controller(const struct domain *d) ++{ ++ /* ++ * A PV dom0 can be nominated as the cpufreq controller, instead of using ++ * Xen's cpufreq driver, at which point dom0 gets direct access to certain ++ * MSRs. ++ * ++ * This interface only works when dom0 is identity pinned and has the same ++ * number of vCPUs as pCPUs on the system. ++ * ++ * It would be far better to paravirtualise the interface. 
++ */ ++ return (is_pv_domain(d) && is_hardware_domain(d) && ++ cpufreq_controller == FREQCTL_dom0_kernel); ++} ++ + #define CPUPOOLID_NONE -1 + + struct cpupool *cpupool_get_by_id(int poolid); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa351-x86-4.11-2.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,118 @@ +From: Andrew Cooper +Subject: x86/msr: Disallow guest access to the RAPL MSRs + +Researchers have demonstrated using the RAPL interface to perform a +differential power analysis attack to recover AES keys used by other cores in +the system. + +Furthermore, even privileged guests cannot use this interface correctly, due +to MSR scope and vcpu scheduling issues. The interface would want to be +paravirtualised to be used sensibly. + +Disallow access to the RAPL MSRs completely, as well as other MSRs which +potentially access fine grain power information. + +This is part of XSA-351. + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/xen/arch/x86/msr.c b/xen/arch/x86/msr.c +index 3495ac9f4a..99c848ff41 100644 +--- a/xen/arch/x86/msr.c ++++ b/xen/arch/x86/msr.c +@@ -156,6 +156,15 @@ int guest_rdmsr(const struct vcpu *v, uint32_t msr, uint64_t *val) + case MSR_TSX_FORCE_ABORT: + case MSR_TSX_CTRL: + case MSR_MCU_OPT_CTRL: ++ case MSR_RAPL_POWER_UNIT: ++ case MSR_PKG_POWER_LIMIT ... MSR_PKG_POWER_INFO: ++ case MSR_DRAM_POWER_LIMIT ... MSR_DRAM_POWER_INFO: ++ case MSR_PP0_POWER_LIMIT ... MSR_PP0_POLICY: ++ case MSR_PP1_POWER_LIMIT ... MSR_PP1_POLICY: ++ case MSR_PLATFORM_ENERGY_COUNTER: ++ case MSR_PLATFORM_POWER_LIMIT: ++ case MSR_F15H_CU_POWER ... MSR_F15H_CU_MAX_POWER: ++ case MSR_AMD_RAPL_POWER_UNIT ... MSR_AMD_PKG_ENERGY_STATUS: + /* Not offered to guests. */ + goto gp_fault; + +@@ -266,6 +275,15 @@ int guest_wrmsr(struct vcpu *v, uint32_t msr, uint64_t val) + case MSR_TSX_FORCE_ABORT: + case MSR_TSX_CTRL: + case MSR_MCU_OPT_CTRL: ++ case MSR_RAPL_POWER_UNIT: ++ case MSR_PKG_POWER_LIMIT ... MSR_PKG_POWER_INFO: ++ case MSR_DRAM_POWER_LIMIT ... MSR_DRAM_POWER_INFO: ++ case MSR_PP0_POWER_LIMIT ... MSR_PP0_POLICY: ++ case MSR_PP1_POWER_LIMIT ... MSR_PP1_POLICY: ++ case MSR_PLATFORM_ENERGY_COUNTER: ++ case MSR_PLATFORM_POWER_LIMIT: ++ case MSR_F15H_CU_POWER ... MSR_F15H_CU_MAX_POWER: ++ case MSR_AMD_RAPL_POWER_UNIT ... MSR_AMD_PKG_ENERGY_STATUS: + /* Not offered to guests. */ + goto gp_fault; + +diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h +index 480d1d8102..a685dcdcca 100644 +--- a/xen/include/asm-x86/msr-index.h ++++ b/xen/include/asm-x86/msr-index.h +@@ -96,6 +96,38 @@ + /* Lower 6 bits define the format of the address in the LBR stack */ + #define MSR_IA32_PERF_CAP_LBR_FORMAT 0x3f + ++/* ++ * Intel Runtime Average Power Limiting (RAPL) interface. Power plane base ++ * addresses (MSR_*_POWER_LIMIT) are model specific, but have so-far been ++ * consistent since their introduction in SandyBridge. ++ * ++ * Offsets of functionality from the power plane base is architectural, but ++ * not all power planes support all functionality. 
++ */ ++#define MSR_RAPL_POWER_UNIT 0x00000606 ++ ++#define MSR_PKG_POWER_LIMIT 0x00000610 ++#define MSR_PKG_ENERGY_STATUS 0x00000611 ++#define MSR_PKG_PERF_STATUS 0x00000613 ++#define MSR_PKG_POWER_INFO 0x00000614 ++ ++#define MSR_DRAM_POWER_LIMIT 0x00000618 ++#define MSR_DRAM_ENERGY_STATUS 0x00000619 ++#define MSR_DRAM_PERF_STATUS 0x0000061b ++#define MSR_DRAM_POWER_INFO 0x0000061c ++ ++#define MSR_PP0_POWER_LIMIT 0x00000638 ++#define MSR_PP0_ENERGY_STATUS 0x00000639 ++#define MSR_PP0_POLICY 0x0000063a ++ ++#define MSR_PP1_POWER_LIMIT 0x00000640 ++#define MSR_PP1_ENERGY_STATUS 0x00000641 ++#define MSR_PP1_POLICY 0x00000642 ++ ++/* Intel Platform-wide power interface. */ ++#define MSR_PLATFORM_ENERGY_COUNTER 0x0000064d ++#define MSR_PLATFORM_POWER_LIMIT 0x0000065c ++ + #define MSR_IA32_BNDCFGS 0x00000d90 + #define IA32_BNDCFGS_ENABLE 0x00000001 + #define IA32_BNDCFGS_PRESERVE 0x00000002 +@@ -218,6 +250,8 @@ + #define MSR_K8_VM_CR 0xc0010114 + #define MSR_K8_VM_HSAVE_PA 0xc0010117 + ++#define MSR_F15H_CU_POWER 0xc001007a ++#define MSR_F15H_CU_MAX_POWER 0xc001007b + #define MSR_AMD_FAM15H_EVNTSEL0 0xc0010200 + #define MSR_AMD_FAM15H_PERFCTR0 0xc0010201 + #define MSR_AMD_FAM15H_EVNTSEL1 0xc0010202 +@@ -231,6 +265,10 @@ + #define MSR_AMD_FAM15H_EVNTSEL5 0xc001020a + #define MSR_AMD_FAM15H_PERFCTR5 0xc001020b + ++#define MSR_AMD_RAPL_POWER_UNIT 0xc0010299 ++#define MSR_AMD_CORE_ENERGY_STATUS 0xc001029a ++#define MSR_AMD_PKG_ENERGY_STATUS 0xc001029b ++ + #define MSR_AMD_L7S0_FEATURE_MASK 0xc0011002 + #define MSR_AMD_THRM_FEATURE_MASK 0xc0011003 + #define MSR_K8_FEATURE_MASK 0xc0011004 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa352.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa352.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa352.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa352.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,42 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: only Dom0 can change node owner +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +Otherwise we can give quota away to another domain, either causing it to run +out of quota, or in case of Dom0 use unbounded amounts of memory and bypass +the quota system entirely. + +This was fixed in the C version of xenstored in 2006 (c/s db34d2aaa5f5, +predating the XSA process by 5 years). + +It was also fixed in the mirage version of xenstore in 2012, with a unit test +demonstrating the vulnerability: + + https://github.com/mirage/ocaml-xenstore/commit/6b91f3ac46b885d0530a51d57a9b3a57d64923a7 + https://github.com/mirage/ocaml-xenstore/commit/22ee5417c90b8fda905c38de0d534506152eace6 + +but possibly without realising that the vulnerability still affected the +in-tree oxenstored (added c/s f44af660412 in 2010). + +This is XSA-352. 
+ +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/store.ml b/tools/ocaml/xenstored/store.ml +index 3b05128f1b..5f915f2bbe 100644 +--- a/tools/ocaml/xenstored/store.ml ++++ b/tools/ocaml/xenstored/store.ml +@@ -407,7 +407,8 @@ let setperms store perm path nperms = + | Some node -> + let old_owner = Node.get_owner node in + let new_owner = Perms.Node.get_owner nperms in +- if not ((old_owner = new_owner) || (Perms.Connection.is_dom0 perm)) then Quota.check store.quota new_owner 0; ++ if not ((old_owner = new_owner) || (Perms.Connection.is_dom0 perm)) then ++ raise Define.Permission_denied; + store.root <- path_setperms store perm path nperms; + Quota.del_entry store.quota old_owner; + Quota.add_entry store.quota new_owner diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa353.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa353.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa353.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa353.patch 2022-04-05 13:04:23.000000000 +0100 @@ -0,0 +1,89 @@ +From: =?UTF-8?q?Edwin=20T=C3=B6r=C3=B6k?= +Subject: tools/ocaml/xenstored: do permission checks on xenstore root +MIME-Version: 1.0 +Content-Type: text/plain; charset=UTF-8 +Content-Transfer-Encoding: 8bit + +This was lacking in a disappointing number of places. + +The xenstore root node is treated differently from all other nodes, because it +doesn't have a parent, and mutation requires changing the parent. + +Unfortunately this lead to open-coding the special case for root into every +single xenstore operation, and out of all the xenstore operations only read +did a permission check when handling the root node. + +This means that an unprivileged guest can: + + * xenstore-chmod / to its liking and subsequently write new arbitrary nodes + there (subject to quota) + * xenstore-rm -r / deletes almost the entire xenstore tree (xenopsd quickly + refills some, but you are left with a broken system) + * DIRECTORY on / lists all children when called through python + bindings (xenstore-ls stops at /local because it tries to list recursively) + * get-perms on / works too, but that is just a minor information leak + +Add the missing permission checks, but this should really be refactored to do +the root handling and permission checks on the node only once from a single +function, instead of getting it wrong nearly everywhere. + +This is XSA-353. 
+ +Signed-off-by: Edwin Török +Acked-by: Christian Lindig +Reviewed-by: Andrew Cooper + +diff --git a/tools/ocaml/xenstored/store.ml b/tools/ocaml/xenstored/store.ml +index f299ec6461..92b6289b5e 100644 +--- a/tools/ocaml/xenstored/store.ml ++++ b/tools/ocaml/xenstored/store.ml +@@ -273,15 +273,17 @@ let path_rm store perm path = + Node.del_childname node name + with Not_found -> + raise Define.Doesnt_exist in +- if path = [] then ++ if path = [] then ( ++ Node.check_perm store.root perm Perms.WRITE; + Node.del_all_children store.root +- else ++ ) else + Path.apply_modify store.root path do_rm + + let path_setperms store perm path perms = +- if path = [] then ++ if path = [] then ( ++ Node.check_perm store.root perm Perms.WRITE; + Node.set_perms store.root perms +- else ++ ) else + let do_setperms node name = + let c = Node.find node name in + Node.check_owner c perm; +@@ -313,9 +315,10 @@ let read store perm path = + + let ls store perm path = + let children = +- if path = [] then +- (Node.get_children store.root) +- else ++ if path = [] then ( ++ Node.check_perm store.root perm Perms.READ; ++ Node.get_children store.root ++ ) else + let do_ls node name = + let cnode = Node.find node name in + Node.check_perm cnode perm Perms.READ; +@@ -324,9 +327,10 @@ let ls store perm path = + List.rev (List.map (fun n -> Symbol.to_string n.Node.name) children) + + let getperms store perm path = +- if path = [] then +- (Node.get_perms store.root) +- else ++ if path = [] then ( ++ Node.check_perm store.root perm Perms.READ; ++ Node.get_perms store.root ++ ) else + let fct n name = + let c = Node.find n name in + Node.check_perm c perm Perms.READ; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa355.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa355.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa355.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa355.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,23 @@ +From: Jan Beulich +Subject: memory: fix off-by-one in XSA-346 change + +The comparison against ARRAY_SIZE() needs to be >= in order to avoid +overrunning the pages[] array. + +This is XSA-355. + +Fixes: 5777a3742d88 ("IOMMU: hold page ref until after deferred TLB flush") +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall + +--- a/xen/common/memory.c ++++ b/xen/common/memory.c +@@ -854,7 +854,7 @@ int xenmem_add_to_physmap(struct domain + ++extra.ppage; + + /* Check for continuation if it's not the last iteration. */ +- if ( (++done > ARRAY_SIZE(pages) && extra.ppage) || ++ if ( (++done >= ARRAY_SIZE(pages) && extra.ppage) || + (xatp->size > done && hypercall_preempt_check()) ) + { + rc = start + done; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa358-4.14.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa358-4.14.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa358-4.14.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa358-4.14.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,54 @@ +From: Jan Beulich +Subject: evtchn/FIFO: re-order and synchronize (with) map_control_block() + +For evtchn_fifo_set_pending()'s check of the control block having been +set to be effective, ordering of respective reads and writes needs to be +ensured: The control block pointer needs to be recorded strictly after +the setting of all the queue heads, and it needs checking strictly +before any uses of them (this latter aspect was already guaranteed). + +This is XSA-358 / CVE-2020-29570. 
+ +Reported-by: Julien Grall +Signed-off-by: Jan Beulich +Acked-by: Julien Grall + +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -249,6 +249,10 @@ static void evtchn_fifo_set_pending(stru + goto unlock; + } + ++ /* ++ * This also acts as the read counterpart of the smp_wmb() in ++ * map_control_block(). ++ */ + if ( guest_test_and_set_bit(d, EVTCHN_FIFO_LINKED, word) ) + goto unlock; + +@@ -474,6 +478,7 @@ static int setup_control_block(struct vc + static int map_control_block(struct vcpu *v, uint64_t gfn, uint32_t offset) + { + void *virt; ++ struct evtchn_fifo_control_block *control_block; + unsigned int i; + int rc; + +@@ -484,10 +489,15 @@ static int map_control_block(struct vcpu + if ( rc < 0 ) + return rc; + +- v->evtchn_fifo->control_block = virt + offset; ++ control_block = virt + offset; + + for ( i = 0; i <= EVTCHN_FIFO_PRIORITY_MIN; i++ ) +- v->evtchn_fifo->queue[i].head = &v->evtchn_fifo->control_block->head[i]; ++ v->evtchn_fifo->queue[i].head = &control_block->head[i]; ++ ++ /* All queue heads must have been set before setting the control block. */ ++ smp_wmb(); ++ ++ v->evtchn_fifo->control_block = control_block; + + return 0; + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa359.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa359.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa359.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa359.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,40 @@ +From: Jan Beulich +Subject: evtchn/FIFO: add 2nd smp_rmb() to evtchn_fifo_word_from_port() + +Besides with add_page_to_event_array() the function also needs to +synchronize with evtchn_fifo_init_control() setting both d->evtchn_fifo +and (subsequently) d->evtchn_port_ops. + +This is XSA-359 / CVE-2020-29571. + +Reported-by: Julien Grall +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall + +--- a/xen/common/event_fifo.c ++++ b/xen/common/event_fifo.c +@@ -55,6 +55,13 @@ static inline event_word_t *evtchn_fifo_ + { + unsigned int p, w; + ++ /* ++ * Callers aren't required to hold d->event_lock, so we need to synchronize ++ * with evtchn_fifo_init_control() setting d->evtchn_port_ops /after/ ++ * d->evtchn_fifo. ++ */ ++ smp_rmb(); ++ + if ( unlikely(port >= d->evtchn_fifo->num_evtchns) ) + return NULL; + +@@ -606,6 +613,10 @@ int evtchn_fifo_init_control(struct evtc + if ( rc < 0 ) + goto error; + ++ /* ++ * This call, as a side effect, synchronizes with ++ * evtchn_fifo_word_from_port(). ++ */ + rc = map_control_block(v, gfn, offset); + if ( rc < 0 ) + goto error; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa364.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa364.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa364.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa364.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,69 @@ +From dadb5b4b21c904ce59024c686eb1c55be8f46c52 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Thu, 21 Jan 2021 10:16:08 +0000 +Subject: [PATCH] xen/page_alloc: Only flush the page to RAM once we know they + are scrubbed + +At the moment, each page are flushed to RAM just after the allocator +found some free pages. However, this is happening before check if the +page was scrubbed. + +As a consequence, on Arm, a guest may be able to access the old content +of the scrubbed pages if it has cache disabled (default at boot) and +the content didn't reach the Point of Coherency. 
+ +The flush is now moved after we know the content of the page will not +change. This also has the benefit to reduce the amount of work happening +with the heap_lock held. + +This is XSA-364. + +Fixes: 307c3be3ccb2 ("mm: Don't scrub pages while holding heap lock in alloc_heap_pages()") +Signed-off-by: Julien Grall +Reviewed-by: Jan Beulich +--- + xen/common/page_alloc.c | 14 +++++++++----- + 1 file changed, 9 insertions(+), 5 deletions(-) + +diff --git a/xen/common/page_alloc.c b/xen/common/page_alloc.c +index 02ac1fa613e7..1744e6faa5c4 100644 +--- a/xen/common/page_alloc.c ++++ b/xen/common/page_alloc.c +@@ -924,6 +924,7 @@ static struct page_info *alloc_heap_pages( + bool need_tlbflush = false; + uint32_t tlbflush_timestamp = 0; + unsigned int dirty_cnt = 0; ++ mfn_t mfn; + + /* Make sure there are enough bits in memflags for nodeID. */ + BUILD_BUG_ON((_MEMF_bits - _MEMF_node) < (8 * sizeof(nodeid_t))); +@@ -1022,11 +1023,6 @@ static struct page_info *alloc_heap_pages( + pg[i].u.inuse.type_info = 0; + page_set_owner(&pg[i], NULL); + +- /* Ensure cache and RAM are consistent for platforms where the +- * guest can control its own visibility of/through the cache. +- */ +- flush_page_to_ram(mfn_x(page_to_mfn(&pg[i])), +- !(memflags & MEMF_no_icache_flush)); + } + + spin_unlock(&heap_lock); +@@ -1062,6 +1058,14 @@ static struct page_info *alloc_heap_pages( + if ( need_tlbflush ) + filtered_flush_tlb_mask(tlbflush_timestamp); + ++ /* ++ * Ensure cache and RAM are consistent for platforms where the guest ++ * can control its own visibility of/through the cache. ++ */ ++ mfn = page_to_mfn(pg); ++ for ( i = 0; i < (1U << order); i++ ) ++ flush_page_to_ram(mfn_x(mfn) + i, !(memflags & MEMF_no_icache_flush)); ++ + return pg; + } + +-- +2.17.1 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa366-4.11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa366-4.11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa366-4.11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa366-4.11.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,39 @@ +From: Roger Pau Monne +Subject: x86/ept: fix missing IOMMU flush in atomic_write_ept_entry + +Backport of XSA-321 missed a flush in atomic_write_ept_entry when +level was different than 0. Such omission will undermine the fix for +XSA-321, because page table entries cached in the IOMMU can get out +of sync and contain stale entries. + +Fix this by slightly re-arranging the code to prevent the early return +when level is different that 0. Note that the early return is just an +optimization because foreign entries cannot have level > 0. + +This is XSA-366. + +Reported-by: M. 
Vefa Bicakci +Signed-off-by: Roger Pau Monné +Reviewed-by: Jan Beulich +--- + xen/arch/x86/mm/p2m-ept.c | 7 +------ + 1 file changed, 1 insertion(+), 6 deletions(-) + +diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c +index 036771f43c..fde2f5f7e3 100644 +--- a/xen/arch/x86/mm/p2m-ept.c ++++ b/xen/arch/x86/mm/p2m-ept.c +@@ -53,12 +53,7 @@ static int atomic_write_ept_entry(ept_entry_t *entryptr, ept_entry_t new, + bool_t check_foreign = (new.mfn != entryptr->mfn || + new.sa_p2mt != entryptr->sa_p2mt); + +- if ( level ) +- { +- ASSERT(!is_epte_superpage(&new) || !p2m_is_foreign(new.sa_p2mt)); +- write_atomic(&entryptr->epte, new.epte); +- return 0; +- } ++ ASSERT(!level || !is_epte_superpage(&new) || !p2m_is_foreign(new.sa_p2mt)); + + if ( unlikely(p2m_is_foreign(new.sa_p2mt)) ) + { diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-1.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,118 @@ +From: Jan Beulich +Subject: VT-d: size qinval queue dynamically + +With the present synchronous model, we need two slots for every +operation (the operation itself and a wait descriptor). There can be +one such pair of requests pending per CPU. To ensure that under all +normal circumstances a slot is always available when one is requested, +size the queue ring according to the number of present CPUs. + +This is part of XSA-373 / CVE-2021-28692. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/vtd/iommu.h ++++ b/xen/drivers/passthrough/vtd/iommu.h +@@ -447,17 +447,9 @@ struct qinval_entry { + }q; + }; + +-/* Order of queue invalidation pages(max is 8) */ +-#define QINVAL_PAGE_ORDER 2 +- +-#define QINVAL_ARCH_PAGE_ORDER (QINVAL_PAGE_ORDER + PAGE_SHIFT_4K - PAGE_SHIFT) +-#define QINVAL_ARCH_PAGE_NR ( QINVAL_ARCH_PAGE_ORDER < 0 ? 
\ +- 1 : \ +- 1 << QINVAL_ARCH_PAGE_ORDER ) +- + /* Each entry is 16 bytes, so 2^8 entries per page */ + #define QINVAL_ENTRY_ORDER ( PAGE_SHIFT - 4 ) +-#define QINVAL_ENTRY_NR (1 << (QINVAL_PAGE_ORDER + 8)) ++#define QINVAL_MAX_ENTRY_NR (1u << (7 + QINVAL_ENTRY_ORDER)) + + /* Status data flag */ + #define QINVAL_STAT_INIT 0 +--- a/xen/drivers/passthrough/vtd/qinval.c ++++ b/xen/drivers/passthrough/vtd/qinval.c +@@ -31,6 +31,9 @@ + + #define VTD_QI_TIMEOUT 1 + ++static unsigned int __read_mostly qi_pg_order; ++static unsigned int __read_mostly qi_entry_nr; ++ + static int __must_check invalidate_sync(struct iommu *iommu); + + static void print_qi_regs(struct iommu *iommu) +@@ -55,7 +58,7 @@ static unsigned int qinval_next_index(st + tail >>= QINVAL_INDEX_SHIFT; + + /* (tail+1 == head) indicates a full queue, wait for HW */ +- while ( ( tail + 1 ) % QINVAL_ENTRY_NR == ++ while ( ((tail + 1) & (qi_entry_nr - 1)) == + ( dmar_readq(iommu->reg, DMAR_IQH_REG) >> QINVAL_INDEX_SHIFT ) ) + cpu_relax(); + +@@ -68,7 +71,7 @@ static void qinval_update_qtail(struct i + + /* Need hold register lock when update tail */ + ASSERT( spin_is_locked(&iommu->register_lock) ); +- val = (index + 1) % QINVAL_ENTRY_NR; ++ val = (index + 1) & (qi_entry_nr - 1); + dmar_writeq(iommu->reg, DMAR_IQT_REG, (val << QINVAL_INDEX_SHIFT)); + } + +@@ -417,7 +420,27 @@ int enable_qinval(struct iommu *iommu) + if ( qi_ctrl->qinval_maddr == 0 ) + { + drhd = iommu_to_drhd(iommu); +- qi_ctrl->qinval_maddr = alloc_pgtable_maddr(drhd, QINVAL_ARCH_PAGE_NR); ++ if ( !qi_entry_nr ) ++ { ++ /* ++ * With the present synchronous model, we need two slots for every ++ * operation (the operation itself and a wait descriptor). There ++ * can be one such pair of requests pending per CPU. One extra ++ * entry is needed as the ring is considered full when there's ++ * only one entry left. ++ */ ++ BUILD_BUG_ON(CONFIG_NR_CPUS * 2 >= QINVAL_MAX_ENTRY_NR); ++ qi_pg_order = get_order_from_bytes((num_present_cpus() * 2 + 1) << ++ (PAGE_SHIFT - ++ QINVAL_ENTRY_ORDER)); ++ qi_entry_nr = 1u << (qi_pg_order + QINVAL_ENTRY_ORDER); ++ ++ dprintk(XENLOG_INFO VTDPREFIX, ++ "QI: using %u-entry ring(s)\n", qi_entry_nr); ++ } ++ ++ qi_ctrl->qinval_maddr = ++ alloc_pgtable_maddr(drhd, qi_entry_nr >> QINVAL_ENTRY_ORDER); + if ( qi_ctrl->qinval_maddr == 0 ) + { + dprintk(XENLOG_WARNING VTDPREFIX, +@@ -431,15 +454,16 @@ int enable_qinval(struct iommu *iommu) + + spin_lock_irqsave(&iommu->register_lock, flags); + +- /* Setup Invalidation Queue Address(IQA) register with the +- * address of the page we just allocated. QS field at +- * bits[2:0] to indicate size of queue is one 4KB page. +- * That's 256 entries. Queued Head (IQH) and Queue Tail (IQT) +- * registers are automatically reset to 0 with write +- * to IQA register. ++ /* ++ * Setup Invalidation Queue Address (IQA) register with the address of the ++ * pages we just allocated. The QS field at bits[2:0] indicates the size ++ * (page order) of the queue. ++ * ++ * Queued Head (IQH) and Queue Tail (IQT) registers are automatically ++ * reset to 0 with write to IQA register. 
+ */ + dmar_writeq(iommu->reg, DMAR_IQA_REG, +- qi_ctrl->qinval_maddr | QINVAL_PAGE_ORDER); ++ qi_ctrl->qinval_maddr | qi_pg_order); + + dmar_writeq(iommu->reg, DMAR_IQT_REG, 0); + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-2.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,111 @@ +From: Jan Beulich +Subject: AMD/IOMMU: size command buffer dynamically + +With the present synchronous model, we need two slots for every +operation (the operation itself and a wait command). There can be one +such pair of commands pending per CPU. To ensure that under all normal +circumstances a slot is always available when one is requested, size the +command ring according to the number of present CPUs. + +This is part of XSA-373 / CVE-2021-28692. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_cmd.c ++++ b/xen/drivers/passthrough/amd/iommu_cmd.c +@@ -24,8 +24,7 @@ + + static int queue_iommu_command(struct amd_iommu *iommu, u32 cmd[]) + { +- u32 tail, head, *cmd_buffer; +- int i; ++ uint32_t tail, head; + + tail = iommu->cmd_buffer.tail; + if ( ++tail == iommu->cmd_buffer.entries ) +@@ -35,12 +34,9 @@ static int queue_iommu_command(struct am + IOMMU_CMD_BUFFER_HEAD_OFFSET)); + if ( head != tail ) + { +- cmd_buffer = (u32 *)(iommu->cmd_buffer.buffer + +- (iommu->cmd_buffer.tail * +- IOMMU_CMD_BUFFER_ENTRY_SIZE)); +- +- for ( i = 0; i < IOMMU_CMD_BUFFER_U32_PER_ENTRY; i++ ) +- cmd_buffer[i] = cmd[i]; ++ memcpy(iommu->cmd_buffer.buffer + ++ (iommu->cmd_buffer.tail * sizeof(cmd_entry_t)), ++ cmd, sizeof(cmd_entry_t)); + + iommu->cmd_buffer.tail = tail; + return 1; +--- a/xen/drivers/passthrough/amd/iommu_init.c ++++ b/xen/drivers/passthrough/amd/iommu_init.c +@@ -136,7 +136,7 @@ static void register_iommu_cmd_buffer_in + writel(entry, iommu->mmio_base + IOMMU_CMD_BUFFER_BASE_LOW_OFFSET); + + power_of2_entries = get_order_from_bytes(iommu->cmd_buffer.alloc_size) + +- IOMMU_CMD_BUFFER_POWER_OF2_ENTRIES_PER_PAGE; ++ PAGE_SHIFT - IOMMU_CMD_BUFFER_ENTRY_ORDER; + + entry = 0; + iommu_set_addr_hi_to_reg(&entry, addr_hi); +@@ -1000,9 +1000,31 @@ static void * __init allocate_ring_buffe + static void * __init allocate_cmd_buffer(struct amd_iommu *iommu) + { + /* allocate 'command buffer' in power of 2 increments of 4K */ ++ static unsigned int __read_mostly nr_ents; ++ ++ if ( !nr_ents ) ++ { ++ unsigned int order; ++ ++ /* ++ * With the present synchronous model, we need two slots for every ++ * operation (the operation itself and a wait command). There can be ++ * one such pair of requests pending per CPU. One extra entry is ++ * needed as the ring is considered full when there's only one entry ++ * left. 
++ */ ++ BUILD_BUG_ON(CONFIG_NR_CPUS * 2 >= IOMMU_CMD_BUFFER_MAX_ENTRIES); ++ order = get_order_from_bytes((num_present_cpus() * 2 + 1) << ++ IOMMU_CMD_BUFFER_ENTRY_ORDER); ++ nr_ents = 1u << (order + PAGE_SHIFT - IOMMU_CMD_BUFFER_ENTRY_ORDER); ++ ++ AMD_IOMMU_DEBUG("using %u-entry cmd ring(s)\n", nr_ents); ++ } ++ ++ BUILD_BUG_ON(sizeof(cmd_entry_t) != (1u << IOMMU_CMD_BUFFER_ENTRY_ORDER)); ++ + return allocate_ring_buffer(&iommu->cmd_buffer, sizeof(cmd_entry_t), +- IOMMU_CMD_BUFFER_DEFAULT_ENTRIES, +- "Command Buffer"); ++ nr_ents, "Command Buffer"); + } + + static void * __init allocate_event_log(struct amd_iommu *iommu) +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h +@@ -20,9 +20,6 @@ + #ifndef _ASM_X86_64_AMD_IOMMU_DEFS_H + #define _ASM_X86_64_AMD_IOMMU_DEFS_H + +-/* IOMMU Command Buffer entries: in power of 2 increments, minimum of 256 */ +-#define IOMMU_CMD_BUFFER_DEFAULT_ENTRIES 512 +- + /* IOMMU Event Log entries: in power of 2 increments, minimum of 256 */ + #define IOMMU_EVENT_LOG_DEFAULT_ENTRIES 512 + +@@ -185,9 +182,8 @@ + #define IOMMU_CMD_BUFFER_LENGTH_MASK 0x0F000000 + #define IOMMU_CMD_BUFFER_LENGTH_SHIFT 24 + +-#define IOMMU_CMD_BUFFER_ENTRY_SIZE 16 +-#define IOMMU_CMD_BUFFER_POWER_OF2_ENTRIES_PER_PAGE 8 +-#define IOMMU_CMD_BUFFER_U32_PER_ENTRY (IOMMU_CMD_BUFFER_ENTRY_SIZE / 4) ++#define IOMMU_CMD_BUFFER_ENTRY_ORDER 4 ++#define IOMMU_CMD_BUFFER_MAX_ENTRIES (1u << 15) + + #define IOMMU_CMD_OPCODE_MASK 0xF0000000 + #define IOMMU_CMD_OPCODE_SHIFT 28 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-3.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-3.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-3.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-3.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,163 @@ +From: Jan Beulich +Subject: VT-d: eliminate flush related timeouts + +Leaving an in-progress operation pending when it appears to take too +long is problematic: If e.g. a QI command completed later, the write to +the "poll slot" may instead be understood to signal a subsequently +started command's completion. Also our accounting of the timeout period +was actually wrong: We included the time it took for the command to +actually make it to the front of the queue, which could be heavily +affected by guests other than the one for which the flush is being +performed. + +Do away with all timeout detection on all flush related code paths. +Log excessively long processing times (with a progressive threshold) to +have some indication of problems in this area. + +Additionally log (once) if qinval_next_index() didn't immediately find +an available slot. Together with the earlier change sizing the queue(s) +dynamically, we should now have a guarantee that with our fully +synchronous model any demand for slots can actually be satisfied. + +This is part of XSA-373 / CVE-2021-28692. 
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/vtd/dmar.h ++++ b/xen/drivers/passthrough/vtd/dmar.h +@@ -127,6 +127,34 @@ do { + } \ + } while (0) + ++#define IOMMU_FLUSH_WAIT(what, iommu, offset, op, cond, sts) \ ++do { \ ++ static unsigned int __read_mostly threshold = 1; \ ++ s_time_t start = NOW(); \ ++ s_time_t timeout = start + DMAR_OPERATION_TIMEOUT * threshold; \ ++ \ ++ for ( ; ; ) \ ++ { \ ++ sts = op(iommu->reg, offset); \ ++ if ( cond ) \ ++ break; \ ++ if ( timeout && NOW() > timeout ) \ ++ { \ ++ threshold |= threshold << 1; \ ++ printk(XENLOG_WARNING VTDPREFIX \ ++ " IOMMU#%u: %s flush taking too long\n", \ ++ iommu->index, what); \ ++ timeout = 0; \ ++ } \ ++ cpu_relax(); \ ++ } \ ++ \ ++ if ( !timeout ) \ ++ printk(XENLOG_WARNING VTDPREFIX \ ++ " IOMMU#%u: %s flush took %lums\n", \ ++ iommu->index, what, (NOW() - start) / 10000000); \ ++} while ( false ) ++ + int vtd_hw_check(void); + void disable_pmr(struct iommu *iommu); + int is_igd_drhd(struct acpi_drhd_unit *drhd); +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -357,8 +357,8 @@ static void iommu_flush_write_buffer(str + dmar_writel(iommu->reg, DMAR_GCMD_REG, val | DMA_GCMD_WBF); + + /* Make sure hardware complete it */ +- IOMMU_WAIT_OP(iommu, DMAR_GSTS_REG, dmar_readl, +- !(val & DMA_GSTS_WBFS), val); ++ IOMMU_FLUSH_WAIT("write buffer", iommu, DMAR_GSTS_REG, dmar_readl, ++ !(val & DMA_GSTS_WBFS), val); + + spin_unlock_irqrestore(&iommu->register_lock, flags); + } +@@ -408,8 +408,8 @@ static int __must_check flush_context_re + dmar_writeq(iommu->reg, DMAR_CCMD_REG, val); + + /* Make sure hardware complete it */ +- IOMMU_WAIT_OP(iommu, DMAR_CCMD_REG, dmar_readq, +- !(val & DMA_CCMD_ICC), val); ++ IOMMU_FLUSH_WAIT("context", iommu, DMAR_CCMD_REG, dmar_readq, ++ !(val & DMA_CCMD_ICC), val); + + spin_unlock_irqrestore(&iommu->register_lock, flags); + /* flush context entry will implicitly flush write buffer */ +@@ -491,8 +491,8 @@ static int __must_check flush_iotlb_reg( + dmar_writeq(iommu->reg, tlb_offset + 8, val); + + /* Make sure hardware complete it */ +- IOMMU_WAIT_OP(iommu, (tlb_offset + 8), dmar_readq, +- !(val & DMA_TLB_IVT), val); ++ IOMMU_FLUSH_WAIT("iotlb", iommu, (tlb_offset + 8), dmar_readq, ++ !(val & DMA_TLB_IVT), val); + spin_unlock_irqrestore(&iommu->register_lock, flags); + + /* check IOTLB invalidation granularity */ +--- a/xen/drivers/passthrough/vtd/qinval.c ++++ b/xen/drivers/passthrough/vtd/qinval.c +@@ -29,8 +29,6 @@ + #include "extern.h" + #include "../ats.h" + +-#define VTD_QI_TIMEOUT 1 +- + static unsigned int __read_mostly qi_pg_order; + static unsigned int __read_mostly qi_entry_nr; + +@@ -60,7 +58,11 @@ static unsigned int qinval_next_index(st + /* (tail+1 == head) indicates a full queue, wait for HW */ + while ( ((tail + 1) & (qi_entry_nr - 1)) == + ( dmar_readq(iommu->reg, DMAR_IQH_REG) >> QINVAL_INDEX_SHIFT ) ) ++ { ++ printk_once(XENLOG_ERR VTDPREFIX " IOMMU#%u: no QI slot available\n", ++ iommu->index); + cpu_relax(); ++ } + + return tail; + } +@@ -180,23 +182,32 @@ static int __must_check queue_invalidate + /* Now we don't support interrupt method */ + if ( sw ) + { +- s_time_t timeout; +- +- /* In case all wait descriptor writes to same addr with same data */ +- timeout = NOW() + MILLISECS(flush_dev_iotlb ? +- iommu_dev_iotlb_timeout : VTD_QI_TIMEOUT); ++ static unsigned int __read_mostly threshold = 1; ++ s_time_t start = NOW(); ++ s_time_t timeout = start + (flush_dev_iotlb ++ ? 
iommu_dev_iotlb_timeout ++ : 100) * MILLISECS(threshold); + + while ( ACCESS_ONCE(*this_poll_slot) != QINVAL_STAT_DONE ) + { +- if ( NOW() > timeout ) ++ if ( timeout && NOW() > timeout ) + { +- print_qi_regs(iommu); ++ threshold |= threshold << 1; + printk(XENLOG_WARNING VTDPREFIX +- " Queue invalidate wait descriptor timed out\n"); +- return -ETIMEDOUT; ++ " IOMMU#%u: QI%s wait descriptor taking too long\n", ++ iommu->index, flush_dev_iotlb ? " dev" : ""); ++ print_qi_regs(iommu); ++ timeout = 0; + } + cpu_relax(); + } ++ ++ if ( !timeout ) ++ printk(XENLOG_WARNING VTDPREFIX ++ " IOMMU#%u: QI%s wait descriptor took %lums\n", ++ iommu->index, flush_dev_iotlb ? " dev" : "", ++ (NOW() - start) / 10000000); ++ + return 0; + } + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-4.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-4.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-4.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-4.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,86 @@ +From: Jan Beulich +Subject: AMD/IOMMU: wait for command slot to be available + +No caller cared about send_iommu_command() indicating unavailability of +a slot. Hence if a sufficient number prior commands timed out, we did +blindly assume that the requested command was submitted to the IOMMU +when really it wasn't. This could mean both a hanging system (waiting +for a command to complete that was never seen by the IOMMU) or blindly +propagating success back to callers, making them believe they're fine +to e.g. free previously unmapped pages. + +Fold the three involved functions into one, add spin waiting for an +available slot along the lines of VT-d's qinval_next_index(), and as a +consequence drop all error indicator return types/values. + +This is part of XSA-373 / CVE-2021-28692. 
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_cmd.c ++++ b/xen/drivers/passthrough/amd/iommu_cmd.c +@@ -22,48 +22,36 @@ + #include + #include "../ats.h" + +-static int queue_iommu_command(struct amd_iommu *iommu, u32 cmd[]) ++static void send_iommu_command(struct amd_iommu *iommu, ++ const uint32_t cmd[4]) + { +- uint32_t tail, head; ++ uint32_t tail; + + tail = iommu->cmd_buffer.tail; + if ( ++tail == iommu->cmd_buffer.entries ) + tail = 0; + +- head = iommu_get_rb_pointer(readl(iommu->mmio_base + +- IOMMU_CMD_BUFFER_HEAD_OFFSET)); +- if ( head != tail ) ++ while ( tail == iommu_get_rb_pointer(readl(iommu->mmio_base + ++ IOMMU_CMD_BUFFER_HEAD_OFFSET)) ) + { +- memcpy(iommu->cmd_buffer.buffer + +- (iommu->cmd_buffer.tail * sizeof(cmd_entry_t)), +- cmd, sizeof(cmd_entry_t)); +- +- iommu->cmd_buffer.tail = tail; +- return 1; ++ printk_once(XENLOG_ERR ++ "AMD IOMMU %04x:%02x:%02x.%u: no cmd slot available\n", ++ iommu->seg, PCI_BUS(iommu->bdf), ++ PCI_SLOT(iommu->bdf), PCI_FUNC(iommu->bdf)); ++ cpu_relax(); + } + +- return 0; +-} ++ memcpy(iommu->cmd_buffer.buffer + ++ (iommu->cmd_buffer.tail * sizeof(cmd_entry_t)), ++ cmd, sizeof(cmd_entry_t)); + +-static void commit_iommu_command_buffer(struct amd_iommu *iommu) +-{ +- u32 tail = 0; ++ iommu->cmd_buffer.tail = tail; + ++ tail = 0; + iommu_set_rb_pointer(&tail, iommu->cmd_buffer.tail); + writel(tail, iommu->mmio_base+IOMMU_CMD_BUFFER_TAIL_OFFSET); + } + +-int send_iommu_command(struct amd_iommu *iommu, u32 cmd[]) +-{ +- if ( queue_iommu_command(iommu, cmd) ) +- { +- commit_iommu_command_buffer(iommu); +- return 1; +- } +- +- return 0; +-} +- + static void flush_command_buffer(struct amd_iommu *iommu) + { + u32 cmd[4], status; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-5.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-5.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-5.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa373-4.11-5.patch 2022-04-05 13:04:24.000000000 +0100 @@ -0,0 +1,145 @@ +From: Jan Beulich +Subject: AMD/IOMMU: drop command completion timeout + +First and foremost - such timeouts were not signaled to callers, making +them believe they're fine to e.g. free previously unmapped pages. + +Mirror VT-d's behavior: A fixed number of loop iterations is not a +suitable way to detect timeouts in an environment (CPU and bus speeds) +independent manner anyway. Furthermore, leaving an in-progress operation +pending when it appears to take too long is problematic: If a command +completed later, the signaling of its completion may instead be +understood to signal a subsequently started command's completion. + +Log excessively long processing times (with a progressive threshold) to +have some indication of problems in this area. Allow callers to specify +a non-default timeout bias for this logging, using the same values as +VT-d does, which in particular means a (by default) much larger value +for device IO TLB invalidation. + +This is part of XSA-373 / CVE-2021-28692. 
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_cmd.c ++++ b/xen/drivers/passthrough/amd/iommu_cmd.c +@@ -52,10 +52,12 @@ static void send_iommu_command(struct am + writel(tail, iommu->mmio_base+IOMMU_CMD_BUFFER_TAIL_OFFSET); + } + +-static void flush_command_buffer(struct amd_iommu *iommu) ++static void flush_command_buffer(struct amd_iommu *iommu, ++ unsigned int timeout_base) + { +- u32 cmd[4], status; +- int loop_count, comp_wait; ++ uint32_t cmd[4]; ++ s_time_t start, timeout; ++ static unsigned int __read_mostly threshold = 1; + + /* RW1C 'ComWaitInt' in status register */ + writel(IOMMU_STATUS_COMP_WAIT_INT_MASK, +@@ -71,24 +73,31 @@ static void flush_command_buffer(struct + IOMMU_COMP_WAIT_I_FLAG_SHIFT, &cmd[0]); + send_iommu_command(iommu, cmd); + +- /* Make loop_count long enough for polling completion wait bit */ +- loop_count = 1000; +- do { +- status = readl(iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET); +- comp_wait = get_field_from_reg_u32(status, +- IOMMU_STATUS_COMP_WAIT_INT_MASK, +- IOMMU_STATUS_COMP_WAIT_INT_SHIFT); +- --loop_count; +- } while ( !comp_wait && loop_count ); +- +- if ( comp_wait ) ++ start = NOW(); ++ timeout = start + (timeout_base ?: 100) * MILLISECS(threshold); ++ while ( !(readl(iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET) & ++ IOMMU_STATUS_COMP_WAIT_INT_MASK) ) + { +- /* RW1C 'ComWaitInt' in status register */ +- writel(IOMMU_STATUS_COMP_WAIT_INT_MASK, +- iommu->mmio_base + IOMMU_STATUS_MMIO_OFFSET); +- return; ++ if ( timeout && NOW() > timeout ) ++ { ++ threshold |= threshold << 1; ++ printk(XENLOG_WARNING ++ "AMD IOMMU %04x:%02x:%02x.%u: %scompletion wait taking too long\n", ++ iommu->seg, PCI_BUS(iommu->bdf), ++ PCI_SLOT(iommu->bdf), PCI_FUNC(iommu->bdf), ++ timeout_base ? "iotlb " : ""); ++ timeout = 0; ++ } ++ cpu_relax(); + } +- AMD_IOMMU_DEBUG("Warning: ComWaitInt bit did not assert!\n"); ++ ++ if ( !timeout ) ++ printk(XENLOG_WARNING ++ "AMD IOMMU %04x:%02x:%02x.%u: %scompletion wait took %lums\n", ++ iommu->seg, PCI_BUS(iommu->bdf), ++ PCI_SLOT(iommu->bdf), PCI_FUNC(iommu->bdf), ++ timeout_base ? 
"iotlb " : "", ++ (NOW() - start) / 10000000); + } + + /* Build low level iommu command messages */ +@@ -300,7 +309,7 @@ void amd_iommu_flush_iotlb(u8 devfn, con + /* send INVALIDATE_IOTLB_PAGES command */ + spin_lock_irqsave(&iommu->lock, flags); + invalidate_iotlb_pages(iommu, maxpend, 0, queueid, gaddr, req_id, order); +- flush_command_buffer(iommu); ++ flush_command_buffer(iommu, iommu_dev_iotlb_timeout); + spin_unlock_irqrestore(&iommu->lock, flags); + } + +@@ -337,7 +346,7 @@ static void _amd_iommu_flush_pages(struc + { + spin_lock_irqsave(&iommu->lock, flags); + invalidate_iommu_pages(iommu, gaddr, dom_id, order); +- flush_command_buffer(iommu); ++ flush_command_buffer(iommu, 0); + spin_unlock_irqrestore(&iommu->lock, flags); + } + +@@ -361,7 +370,7 @@ void amd_iommu_flush_device(struct amd_i + ASSERT( spin_is_locked(&iommu->lock) ); + + invalidate_dev_table_entry(iommu, bdf); +- flush_command_buffer(iommu); ++ flush_command_buffer(iommu, 0); + } + + void amd_iommu_flush_intremap(struct amd_iommu *iommu, uint16_t bdf) +@@ -369,7 +378,7 @@ void amd_iommu_flush_intremap(struct amd + ASSERT( spin_is_locked(&iommu->lock) ); + + invalidate_interrupt_table(iommu, bdf); +- flush_command_buffer(iommu); ++ flush_command_buffer(iommu, 0); + } + + void amd_iommu_flush_all_caches(struct amd_iommu *iommu) +@@ -377,7 +386,7 @@ void amd_iommu_flush_all_caches(struct a + ASSERT( spin_is_locked(&iommu->lock) ); + + invalidate_iommu_all(iommu); +- flush_command_buffer(iommu); ++ flush_command_buffer(iommu, 0); + } + + void amd_iommu_send_guest_cmd(struct amd_iommu *iommu, u32 cmd[]) +@@ -387,7 +396,8 @@ void amd_iommu_send_guest_cmd(struct amd + spin_lock_irqsave(&iommu->lock, flags); + + send_iommu_command(iommu, cmd); +- flush_command_buffer(iommu); ++ /* TBD: Timeout selection may require peeking into cmd[]. */ ++ flush_command_buffer(iommu, 0); + + spin_unlock_irqrestore(&iommu->lock, flags); + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa375-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa375-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa375-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa375-4.12.patch 2022-04-05 13:04:25.000000000 +0100 @@ -0,0 +1,50 @@ +From: Andrew Cooper +Subject: x86/spec-ctrl: Protect against Speculative Code Store Bypass + +Modern x86 processors have far-better-than-architecturally-guaranteed self +modifying code detection. Typically, when a write hits an instruction in +flight, a Machine Clear occurs to flush stale content in the frontend and +backend. + +For self modifying code, before a write which hits an instruction in flight +retires, the frontend can speculatively decode and execute the old instruction +stream. Speculation of this form can suffer from type confusion in registers, +and potentially leak data. + +Furthermore, updates are typically byte-wise, rather than atomic. Depending +on timing, speculation can race ahead multiple times between individual +writes, and execute the transiently-malformed instruction stream. + +Xen has stubs which are used in certain cases for emulation purposes. Inhibit +speculation between updating the stub and executing it. + +This is XSA-375 / CVE-2021-0089. 
+ +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c +index 6dc4f92a84..59c15ca0e7 100644 +--- a/xen/arch/x86/pv/emul-priv-op.c ++++ b/xen/arch/x86/pv/emul-priv-op.c +@@ -97,6 +97,8 @@ static io_emul_stub_t *io_emul_stub_setup(struct priv_op_ctxt *ctxt, u8 opcode, + BUILD_BUG_ON(STUB_BUF_SIZE / 2 < MAX(9, /* Default emul stub */ + 5 + IOEMUL_QUIRK_STUB_BYTES)); + ++ asm volatile ( "lfence" ::: "memory" ); /* SCSB */ ++ + /* Handy function-typed pointer to the stub. */ + return (void *)stub_va; + } +diff --git a/xen/arch/x86/x86_emulate/x86_emulate.c b/xen/arch/x86/x86_emulate/x86_emulate.c +index bba6dd0187..cd123492a6 100644 +--- a/xen/arch/x86/x86_emulate/x86_emulate.c ++++ b/xen/arch/x86/x86_emulate/x86_emulate.c +@@ -1093,6 +1093,7 @@ static inline int mkec(uint8_t e, int32_t ec, ...) + # define invoke_stub(pre, post, constraints...) do { \ + stub_exn.info = (union stub_exception_token) { .raw = ~0 }; \ + stub_exn.line = __LINE__; /* Utility outweighs livepatching cost */ \ ++ asm volatile ( "lfence" ::: "memory" ); /* SCSB */ \ + asm volatile ( pre "\n\tINDIRECT_CALL %[stub]\n\t" post "\n" \ + ".Lret%=:\n\t" \ + ".pushsection .fixup,\"ax\"\n" \ diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa377-4.11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa377-4.11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa377-4.11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa377-4.11.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,27 @@ +From: Andrew Cooper +Subject: x86/spec-ctrl: Mitigate TAA after S3 resume + +The user chosen setting for MSR_TSX_CTRL needs restoring after S3. + +All APs get the correct setting via start_secondary(), but the BSP was missed +out. + +This is XSA-377 / CVE-2021-28690. + +Fixes: 8c4330818f6 ("x86/spec-ctrl: Mitigate the TSX Asynchronous Abort sidechannel") +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/xen/arch/x86/acpi/power.c b/xen/arch/x86/acpi/power.c +index 30e1bd5cd3..451cba622c 100644 +--- a/xen/arch/x86/acpi/power.c ++++ b/xen/arch/x86/acpi/power.c +@@ -259,6 +259,8 @@ static int enter_state(u32 state) + + microcode_resume_cpu(0); + ++ tsx_init(); /* Needs microcode. May change HLE/RTM feature bits. */ ++ + if ( !recheck_cpu_features(0) ) + panic("Missing previously available feature(s)."); + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0a.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0a.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0a.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0a.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,75 @@ +From: Jan Beulich +Subject: x86/p2m: fix PoD accounting in guest_physmap_add_entry() + +The initial observation was that the mfn_valid() check comes too late: +Neither mfn_add() nor mfn_to_page() (let alone de-referencing the +result of the latter) are valid for MFNs failing this check. Move it up +and - noticing that there's no caller doing so - also add an assertion +that this should never produce "false" here. + +In turn this would have meant that the "else" to that if() could now go +away, which didn't seem right at all. And indeed, considering callers +like memory_exchange() or various grant table functions, the PoD +accounting should have been outside of that if() from the very +beginning. 
+ +Signed-off-by: Jan Beulich +Acked-by: Andrew Cooper + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -794,6 +794,12 @@ guest_physmap_add_entry(struct domain *d + if ( p2m_is_foreign(t) ) + return -EINVAL; + ++ if ( !mfn_valid(mfn) ) ++ { ++ ASSERT_UNREACHABLE(); ++ return -EINVAL; ++ } ++ + p2m_lock(p2m); + + P2M_DEBUG("adding gfn=%#lx mfn=%#lx\n", gfn_x(gfn), mfn_x(mfn)); +@@ -894,12 +900,13 @@ guest_physmap_add_entry(struct domain *d + } + + /* Now, actually do the two-way mapping */ +- if ( mfn_valid(mfn) ) ++ rc = p2m_set_entry(p2m, gfn, mfn, page_order, t, p2m->default_access); ++ if ( rc == 0 ) + { +- rc = p2m_set_entry(p2m, gfn, mfn, page_order, t, +- p2m->default_access); +- if ( rc ) +- goto out; /* Failed to update p2m, bail without updating m2p. */ ++ pod_lock(p2m); ++ p2m->pod.entry_count -= pod_count; ++ BUG_ON(p2m->pod.entry_count < 0); ++ pod_unlock(p2m); + + if ( !p2m_is_grant(t) ) + { +@@ -908,22 +915,7 @@ guest_physmap_add_entry(struct domain *d + gfn_x(gfn_add(gfn, i))); + } + } +- else +- { +- gdprintk(XENLOG_WARNING, "Adding bad mfn to p2m map (%#lx -> %#lx)\n", +- gfn_x(gfn), mfn_x(mfn)); +- rc = p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, +- p2m_invalid, p2m->default_access); +- if ( rc == 0 ) +- { +- pod_lock(p2m); +- p2m->pod.entry_count -= pod_count; +- BUG_ON(p2m->pod.entry_count < 0); +- pod_unlock(p2m); +- } +- } + +-out: + p2m_unlock(p2m); + + return rc; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0b.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0b.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0b.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0b.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,60 @@ +From: Jan Beulich +Subject: x86/p2m: don't ignore p2m_remove_page()'s return value + +It's not very nice to return from guest_physmap_add_entry() after +perhaps already having made some changes to the P2M, but this is pre- +existing practice in the function, and imo better than ignoring errors. + +Take the liberty and replace an mfn_add() instance with a local variable +already holding the result (as proven by the check immediately ahead). + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Acked-by: Andrew Cooper + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -702,8 +702,7 @@ void p2m_final_teardown(struct domain *d + p2m_teardown_hostp2m(d); + } + +- +-static int ++static int __must_check + p2m_remove_page(struct p2m_domain *p2m, unsigned long gfn_l, unsigned long mfn, + unsigned int page_order) + { +@@ -892,9 +891,9 @@ guest_physmap_add_entry(struct domain *d + ASSERT(mfn_valid(omfn)); + P2M_DEBUG("old gfn=%#lx -> mfn %#lx\n", + gfn_x(ogfn) , mfn_x(omfn)); +- if ( mfn_eq(omfn, mfn_add(mfn, i)) ) +- p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(mfn_add(mfn, i)), +- 0); ++ if ( mfn_eq(omfn, mfn_add(mfn, i)) && ++ (rc = p2m_remove_page(p2m, gfn_x(ogfn), mfn_x(omfn), 0)) ) ++ goto out; + } + } + } +@@ -916,6 +915,7 @@ guest_physmap_add_entry(struct domain *d + } + } + ++ out: + p2m_unlock(p2m); + + return rc; +@@ -2385,9 +2385,9 @@ int p2m_change_altp2m_gfn(struct domain + + if ( gfn_eq(new_gfn, INVALID_GFN) ) + { +- if ( mfn_valid(mfn) ) +- p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K); +- rc = 0; ++ rc = mfn_valid(mfn) ++ ? 
p2m_remove_page(ap2m, gfn_x(old_gfn), mfn_x(mfn), PAGE_ORDER_4K) ++ : 0; + goto out; + } + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0c.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0c.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0c.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-0c.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,57 @@ +From: Jan Beulich +Subject: x86/p2m: don't assert that the passed in MFN matches for a remove + +guest_physmap_remove_page() gets handed an MFN from the outside, yet +takes the necessary lock to prevent further changes to the GFN <-> MFN +mapping itself. While some callers, in particular guest_remove_page() +(by way of having called get_gfn_query()), hold the GFN lock already, +various others (most notably perhaps the 2nd instance in +xenmem_add_to_physmap_one()) don't. While it also is an option to fix +all the callers, deal with the issue in p2m_remove_page() instead: +Replace the ASSERT() by a conditional and split the loop into two, such +that all checking gets done before any modification would occur. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Acked-by: Andrew Cooper + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -708,7 +708,6 @@ p2m_remove_page(struct p2m_domain *p2m, + { + unsigned long i; + gfn_t gfn = _gfn(gfn_l); +- mfn_t mfn_return; + p2m_type_t t; + p2m_access_t a; + +@@ -719,15 +718,26 @@ p2m_remove_page(struct p2m_domain *p2m, + ASSERT(gfn_locked_by_me(p2m, gfn)); + P2M_DEBUG("removing gfn=%#lx mfn=%#lx\n", gfn_l, mfn); + ++ for ( i = 0; i < (1UL << page_order); ) ++ { ++ unsigned int cur_order; ++ mfn_t mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, ++ &cur_order, NULL); ++ ++ if ( p2m_is_valid(t) && ++ (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) ) ++ return -EILSEQ; ++ ++ i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1)); ++ } ++ + if ( mfn_valid(_mfn(mfn)) ) + { + for ( i = 0; i < (1UL << page_order); i++ ) + { +- mfn_return = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, +- NULL, NULL); ++ p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL); + if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) ) + set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY); +- ASSERT( !p2m_is_valid(t) || mfn + i == mfn_x(mfn_return) ); + } + } + return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid, diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-1.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,142 @@ +From: Jan Beulich +Subject: AMD/IOMMU: correct global exclusion range extending + +Besides unity mapping regions, the AMD IOMMU spec also provides for +exclusion ranges (areas of memory not to be subject to DMA translation) +to be specified by firmware in the ACPI tables. The spec does not put +any constraints on the number of such regions. + +Blindly assuming all addresses between any two such ranges should also +be excluded can't be right. 
Since hardware has room for just a single +such range (comprised of the Exclusion Base Register and the Exclusion +Range Limit Register), combine only adjacent or overlapping regions (for +now; this may require further adjustment in case table entries aren't +sorted by address) with matching exclusion_allow_all settings. This +requires bubbling up error indicators, such that IOMMU init can be +failed when concatenation wasn't possible. + +Furthermore, since the exclusion range specified in IOMMU registers +implies R/W access, reject requests asking for less permissions (this +will be brought closer to the spec by a subsequent change). + +This is part of XSA-378 / CVE-2021-28695. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_acpi.c ++++ b/xen/drivers/passthrough/amd/iommu_acpi.c +@@ -98,12 +98,21 @@ static struct amd_iommu * __init find_io + return NULL; + } + +-static void __init reserve_iommu_exclusion_range( +- struct amd_iommu *iommu, uint64_t base, uint64_t limit) ++static int __init reserve_iommu_exclusion_range( ++ struct amd_iommu *iommu, uint64_t base, uint64_t limit, ++ bool all, bool iw, bool ir) + { ++ if ( !ir || !iw ) ++ return -EPERM; ++ + /* need to extend exclusion range? */ + if ( iommu->exclusion_enable ) + { ++ if ( iommu->exclusion_limit + PAGE_SIZE < base || ++ limit + PAGE_SIZE < iommu->exclusion_base || ++ iommu->exclusion_allow_all != all ) ++ return -EBUSY; ++ + if ( iommu->exclusion_base < base ) + base = iommu->exclusion_base; + if ( iommu->exclusion_limit > limit ) +@@ -111,16 +120,11 @@ static void __init reserve_iommu_exclusi + } + + iommu->exclusion_enable = IOMMU_CONTROL_ENABLED; ++ iommu->exclusion_allow_all = all; + iommu->exclusion_base = base; + iommu->exclusion_limit = limit; +-} + +-static void __init reserve_iommu_exclusion_range_all( +- struct amd_iommu *iommu, +- unsigned long base, unsigned long limit) +-{ +- reserve_iommu_exclusion_range(iommu, base, limit); +- iommu->exclusion_allow_all = IOMMU_CONTROL_ENABLED; ++ return 0; + } + + static void __init reserve_unity_map_for_device( +@@ -158,6 +162,7 @@ static int __init register_exclusion_ran + unsigned long range_top, iommu_top, length; + struct amd_iommu *iommu; + unsigned int bdf; ++ int rc = 0; + + /* is part of exclusion range inside of IOMMU virtual address space? 
*/ + /* note: 'limit' parameter is assumed to be page-aligned */ +@@ -179,10 +184,15 @@ static int __init register_exclusion_ran + if ( limit >= iommu_top ) + { + for_each_amd_iommu( iommu ) +- reserve_iommu_exclusion_range_all(iommu, base, limit); ++ { ++ rc = reserve_iommu_exclusion_range(iommu, base, limit, ++ true /* all */, iw, ir); ++ if ( rc ) ++ break; ++ } + } + +- return 0; ++ return rc; + } + + static int __init register_exclusion_range_for_device( +@@ -193,6 +203,7 @@ static int __init register_exclusion_ran + unsigned long range_top, iommu_top, length; + struct amd_iommu *iommu; + u16 req; ++ int rc = 0; + + iommu = find_iommu_for_device(seg, bdf); + if ( !iommu ) +@@ -222,12 +233,13 @@ static int __init register_exclusion_ran + /* register IOMMU exclusion range settings for device */ + if ( limit >= iommu_top ) + { +- reserve_iommu_exclusion_range(iommu, base, limit); ++ rc = reserve_iommu_exclusion_range(iommu, base, limit, ++ false /* all */, iw, ir); + ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_ENABLED; + ivrs_mappings[req].dte_allow_exclusion = IOMMU_CONTROL_ENABLED; + } + +- return 0; ++ return rc; + } + + static int __init register_exclusion_range_for_iommu_devices( +@@ -237,6 +249,7 @@ static int __init register_exclusion_ran + unsigned long range_top, iommu_top, length; + unsigned int bdf; + u16 req; ++ int rc = 0; + + /* is part of exclusion range inside of IOMMU virtual address space? */ + /* note: 'limit' parameter is assumed to be page-aligned */ +@@ -267,8 +280,10 @@ static int __init register_exclusion_ran + + /* register IOMMU exclusion range settings */ + if ( limit >= iommu_top ) +- reserve_iommu_exclusion_range_all(iommu, base, limit); +- return 0; ++ rc = reserve_iommu_exclusion_range(iommu, base, limit, ++ true /* all */, iw, ir); ++ ++ return rc; + } + + static int __init parse_ivmd_device_select( diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-2.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,223 @@ +From: Jan Beulich +Subject: AMD/IOMMU: correct device unity map handling + +Blindly assuming all addresses between any two such ranges, specified by +firmware in the ACPI tables, should also be unity-mapped can't be right. +Nor can it be correct to merge ranges with differing permissions. Track +ranges individually; don't merge at all, but check for overlaps instead. +This requires bubbling up error indicators, such that IOMMU init can be +failed when allocation of a new tracking struct wasn't possible, or an +overlap was detected. + +At this occasion also stop ignoring +amd_iommu_reserve_domain_unity_map()'s return value. + +This is part of XSA-378 / CVE-2021-28695. 
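+
+As a minimal illustrative sketch (not taken from the patch itself; the
+names below are invented for illustration), the interval logic relied on
+here treats each tracked entry as a half-open byte range
+[base, base + length): two entries conflict when they share at least one
+byte, unless they are exact duplicates with identical permissions:
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct range { uint64_t base, length; bool read, write; };
+
+    /* Overlap test for half-open ranges [base, base + length). */
+    static bool ranges_overlap(const struct range *a, const struct range *b)
+    {
+        return a->base + a->length > b->base &&
+               b->base + b->length > a->base;
+    }
+
+    /* Exact duplicate: same span and same read/write permissions. */
+    static bool ranges_identical(const struct range *a, const struct range *b)
+    {
+        return a->base == b->base && a->length == b->length &&
+               a->read == b->read && a->write == b->write;
+    }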
+ +Signed-off-by: Jan Beulich +Reviewed-by: George Dunlap +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_acpi.c ++++ b/xen/drivers/passthrough/amd/iommu_acpi.c +@@ -127,32 +127,48 @@ static int __init reserve_iommu_exclusio + return 0; + } + +-static void __init reserve_unity_map_for_device( +- u16 seg, u16 bdf, unsigned long base, +- unsigned long length, u8 iw, u8 ir) ++static int __init reserve_unity_map_for_device( ++ uint16_t seg, uint16_t bdf, unsigned long base, ++ unsigned long length, bool iw, bool ir) + { + struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg); +- unsigned long old_top, new_top; ++ struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map; + +- /* need to extend unity-mapped range? */ +- if ( ivrs_mappings[bdf].unity_map_enable ) ++ /* Check for overlaps. */ ++ for ( ; unity_map; unity_map = unity_map->next ) + { +- old_top = ivrs_mappings[bdf].addr_range_start + +- ivrs_mappings[bdf].addr_range_length; +- new_top = base + length; +- if ( old_top > new_top ) +- new_top = old_top; +- if ( ivrs_mappings[bdf].addr_range_start < base ) +- base = ivrs_mappings[bdf].addr_range_start; +- length = new_top - base; +- } +- +- /* extend r/w permissioms and keep aggregate */ +- ivrs_mappings[bdf].write_permission = iw; +- ivrs_mappings[bdf].read_permission = ir; +- ivrs_mappings[bdf].unity_map_enable = IOMMU_CONTROL_ENABLED; +- ivrs_mappings[bdf].addr_range_start = base; +- ivrs_mappings[bdf].addr_range_length = length; ++ /* ++ * Exact matches are okay. This can in particular happen when ++ * register_exclusion_range_for_device() calls here twice for the ++ * same (s,b,d,f). ++ */ ++ if ( base == unity_map->addr && length == unity_map->length && ++ ir == unity_map->read && iw == unity_map->write ) ++ return 0; ++ ++ if ( unity_map->addr + unity_map->length > base && ++ base + length > unity_map->addr ) ++ { ++ AMD_IOMMU_DEBUG("IVMD Error: overlap [%lx,%lx) vs [%lx,%lx)\n", ++ base, base + length, unity_map->addr, ++ unity_map->addr + unity_map->length); ++ return -EPERM; ++ } ++ } ++ ++ /* Populate and insert a new unity map. 
*/ ++ unity_map = xmalloc(struct ivrs_unity_map); ++ if ( !unity_map ) ++ return -ENOMEM; ++ ++ unity_map->read = ir; ++ unity_map->write = iw; ++ unity_map->addr = base; ++ unity_map->length = length; ++ unity_map->next = ivrs_mappings[bdf].unity_map; ++ ivrs_mappings[bdf].unity_map = unity_map; ++ ++ return 0; + } + + static int __init register_exclusion_range_for_all_devices( +@@ -175,13 +191,13 @@ static int __init register_exclusion_ran + length = range_top - base; + /* reserve r/w unity-mapped page entries for devices */ + /* note: these entries are part of the exclusion range */ +- for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ ) +- reserve_unity_map_for_device(seg, bdf, base, length, iw, ir); ++ for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) ++ rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir); + /* push 'base' just outside of virtual address space */ + base = iommu_top; + } + /* register IOMMU exclusion range settings */ +- if ( limit >= iommu_top ) ++ if ( !rc && limit >= iommu_top ) + { + for_each_amd_iommu( iommu ) + { +@@ -223,15 +239,15 @@ static int __init register_exclusion_ran + length = range_top - base; + /* reserve unity-mapped page entries for device */ + /* note: these entries are part of the exclusion range */ +- reserve_unity_map_for_device(seg, bdf, base, length, iw, ir); +- reserve_unity_map_for_device(seg, req, base, length, iw, ir); ++ rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?: ++ reserve_unity_map_for_device(seg, req, base, length, iw, ir); + + /* push 'base' just outside of virtual address space */ + base = iommu_top; + } + + /* register IOMMU exclusion range settings for device */ +- if ( limit >= iommu_top ) ++ if ( !rc && limit >= iommu_top ) + { + rc = reserve_iommu_exclusion_range(iommu, base, limit, + false /* all */, iw, ir); +@@ -262,15 +278,15 @@ static int __init register_exclusion_ran + length = range_top - base; + /* reserve r/w unity-mapped page entries for devices */ + /* note: these entries are part of the exclusion range */ +- for ( bdf = 0; bdf < ivrs_bdf_entries; bdf++ ) ++ for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) + { + if ( iommu == find_iommu_for_device(iommu->seg, bdf) ) + { +- reserve_unity_map_for_device(iommu->seg, bdf, base, length, +- iw, ir); + req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id; +- reserve_unity_map_for_device(iommu->seg, req, base, length, +- iw, ir); ++ rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length, ++ iw, ir) ?: ++ reserve_unity_map_for_device(iommu->seg, req, base, length, ++ iw, ir); + } + } + +@@ -279,7 +295,7 @@ static int __init register_exclusion_ran + } + + /* register IOMMU exclusion range settings */ +- if ( limit >= iommu_top ) ++ if ( !rc && limit >= iommu_top ) + rc = reserve_iommu_exclusion_range(iommu, base, limit, + true /* all */, iw, ir); + +--- a/xen/drivers/passthrough/amd/iommu_init.c ++++ b/xen/drivers/passthrough/amd/iommu_init.c +@@ -1187,7 +1187,6 @@ static int __init alloc_ivrs_mappings(u1 + { + ivrs_mappings[bdf].dte_requestor_id = bdf; + ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_DISABLED; +- ivrs_mappings[bdf].unity_map_enable = IOMMU_CONTROL_DISABLED; + ivrs_mappings[bdf].iommu = NULL; + + ivrs_mappings[bdf].intremap_table = NULL; +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -372,15 +372,17 @@ static int amd_iommu_assign_device(struc + struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg); + int bdf = 
PCI_BDF2(pdev->bus, devfn); + int req_id = get_dma_requestor_id(pdev->seg, bdf); ++ const struct ivrs_unity_map *unity_map; + +- if ( ivrs_mappings[req_id].unity_map_enable ) ++ for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map; ++ unity_map = unity_map->next ) + { +- amd_iommu_reserve_domain_unity_map( +- d, +- ivrs_mappings[req_id].addr_range_start, +- ivrs_mappings[req_id].addr_range_length, +- ivrs_mappings[req_id].write_permission, +- ivrs_mappings[req_id].read_permission); ++ int rc = amd_iommu_reserve_domain_unity_map( ++ d, unity_map->addr, unity_map->length, ++ unity_map->write, unity_map->read); ++ ++ if ( rc ) ++ return rc; + } + + return reassign_device(pdev->domain, d, devfn, pdev); +--- a/xen/include/asm-x86/amd-iommu.h ++++ b/xen/include/asm-x86/amd-iommu.h +@@ -108,15 +108,19 @@ struct amd_iommu { + struct list_head ats_devices; + }; + ++struct ivrs_unity_map { ++ bool read:1; ++ bool write:1; ++ paddr_t addr; ++ unsigned long length; ++ struct ivrs_unity_map *next; ++}; ++ + struct ivrs_mappings { + u16 dte_requestor_id; + u8 dte_allow_exclusion; +- u8 unity_map_enable; +- u8 write_permission; +- u8 read_permission; +- unsigned long addr_range_start; +- unsigned long addr_range_length; + struct amd_iommu *iommu; ++ struct ivrs_unity_map *unity_map; + + /* per device interrupt remapping table */ + void *intremap_table; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-3.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-3.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-3.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-3.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,102 @@ +From: Jan Beulich +Subject: IOMMU: also pass p2m_access_t to p2m_get_iommu_flags() + +A subsequent change will want to customize the IOMMU permissions based +on this. + +This is part of XSA-378. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/arch/x86/mm/p2m-ept.c ++++ b/xen/arch/x86/mm/p2m-ept.c +@@ -711,7 +711,7 @@ ept_set_entry(struct p2m_domain *p2m, gf + uint8_t ipat = 0; + bool_t need_modify_vtd_table = 1; + bool_t vtd_pte_present = 0; +- unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, mfn); ++ unsigned int iommu_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn); + bool_t needs_sync = 1; + ept_entry_t old_entry = { .epte = 0 }; + ept_entry_t new_entry = { .epte = 0 }; +@@ -837,8 +837,8 @@ ept_set_entry(struct p2m_domain *p2m, gf + + /* Safe to read-then-write because we hold the p2m lock */ + if ( ept_entry->mfn == new_entry.mfn && +- p2m_get_iommu_flags(ept_entry->sa_p2mt, _mfn(ept_entry->mfn)) == +- iommu_flags ) ++ p2m_get_iommu_flags(ept_entry->sa_p2mt, ept_entry->access, ++ _mfn(ept_entry->mfn)) == iommu_flags ) + need_modify_vtd_table = 0; + + ept_p2m_type_to_flags(p2m, &new_entry, p2mt, p2ma); +--- a/xen/arch/x86/mm/p2m-pt.c ++++ b/xen/arch/x86/mm/p2m-pt.c +@@ -471,6 +471,16 @@ int p2m_pt_handle_deferred_changes(uint6 + return rc; + } + ++/* Reconstruct a fake p2m_access_t from stored PTE flags. */ ++static p2m_access_t p2m_flags_to_access(unsigned int flags) ++{ ++ if ( flags & _PAGE_PRESENT ) ++ return p2m_access_n; ++ ++ /* No need to look at _PAGE_NX for now. */ ++ return flags & _PAGE_RW ? 
p2m_access_rw : p2m_access_r; ++} ++ + /* Returns: 0 for success, -errno for failure */ + static int + p2m_pt_set_entry(struct p2m_domain *p2m, gfn_t gfn_, mfn_t mfn, +@@ -487,7 +497,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m, + l2_pgentry_t l2e_content; + l3_pgentry_t l3e_content; + int rc; +- unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, mfn); ++ unsigned int iommu_pte_flags = p2m_get_iommu_flags(p2mt, p2ma, mfn); + /* + * old_mfn and iommu_old_flags control possible flush/update needs on the + * IOMMU: We need to flush when MFN or flags (i.e. permissions) change. +@@ -556,6 +566,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m, + old_mfn = l1e_get_pfn(*p2m_entry); + iommu_old_flags = + p2m_get_iommu_flags(p2m_flags_to_type(flags), ++ p2m_flags_to_access(flags), + _mfn(old_mfn)); + } + else +@@ -602,9 +613,10 @@ p2m_pt_set_entry(struct p2m_domain *p2m, + 0, L1_PAGETABLE_ENTRIES); + ASSERT(p2m_entry); + old_mfn = l1e_get_pfn(*p2m_entry); ++ flags = l1e_get_flags(*p2m_entry); + iommu_old_flags = +- p2m_get_iommu_flags(p2m_flags_to_type(l1e_get_flags(*p2m_entry)), +- _mfn(old_mfn)); ++ p2m_get_iommu_flags(p2m_flags_to_type(flags), ++ p2m_flags_to_access(flags), _mfn(old_mfn)); + + if ( mfn_valid(mfn) || p2m_allows_invalid_mfn(p2mt) ) + entry_content = p2m_l1e_from_pfn(mfn_x(mfn), +@@ -648,6 +660,7 @@ p2m_pt_set_entry(struct p2m_domain *p2m, + old_mfn = l1e_get_pfn(*p2m_entry); + iommu_old_flags = + p2m_get_iommu_flags(p2m_flags_to_type(flags), ++ p2m_flags_to_access(flags), + _mfn(old_mfn)); + } + else +--- a/xen/include/asm-x86/p2m.h ++++ b/xen/include/asm-x86/p2m.h +@@ -839,7 +839,8 @@ int p2m_altp2m_propagate_change(struct d + /* + * p2m type to IOMMU flags + */ +-static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, mfn_t mfn) ++static inline unsigned int p2m_get_iommu_flags(p2m_type_t p2mt, ++ p2m_access_t p2ma, mfn_t mfn) + { + unsigned int flags; + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-4.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-4.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-4.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-4.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,400 @@ +From: Jan Beulich +Subject: IOMMU: generalize VT-d's tracking of mapped RMRR regions + +In order to re-use it elsewhere, move the logic to vendor independent +code and strip it of RMRR specifics. + +Note that the prior "map" parameter gets folded into the new "p2ma" one +(which AMD IOMMU code will want to make use of), assigning alternative +meaning ("unmap") to p2m_access_x. Prepare set_identity_p2m_entry() and +p2m_get_iommu_flags() for getting passed access types other than +p2m_access_rw (in the latter case just for p2m_mmio_direct requests). + +Note also that, to be on the safe side, an overlap check gets added to +the main loop of iommu_identity_mapping(). + +This is part of XSA-378. 
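+
+As a rough sketch of the resulting calling convention (the wrapper name
+below is hypothetical; only iommu_identity_mapping() and its parameters
+come from this change), the former boolean "map" argument is now carried
+by the access type, with p2m_access_x standing in for "unmap":
+
+    static int identity_map_or_unmap(struct domain *d, bool map,
+                                     paddr_t base, paddr_t end,
+                                     unsigned int flag)
+    {
+        /* p2m_access_x requests teardown of a previously registered map. */
+        return iommu_identity_mapping(d,
+                                      map ? p2m_access_rw : p2m_access_x,
+                                      base, end, flag);
+    }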
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -1157,7 +1157,8 @@ int set_identity_p2m_entry(struct domain + { + if ( !need_iommu(d) ) + return 0; +- return iommu_map_page(d, gfn_l, gfn_l, IOMMUF_readable|IOMMUF_writable); ++ return iommu_map_page(d, gfn_l, gfn_l, ++ p2m_access_to_iommu_flags(p2ma)); + } + + gfn_lock(p2m, gfn, 0); +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -42,12 +42,6 @@ + #include "vtd.h" + #include "../ats.h" + +-struct mapped_rmrr { +- struct list_head list; +- u64 base, end; +- unsigned int count; +-}; +- + /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */ + bool __read_mostly untrusted_msi; + +@@ -1785,16 +1779,11 @@ out: + static void iommu_domain_teardown(struct domain *d) + { + struct domain_iommu *hd = dom_iommu(d); +- struct mapped_rmrr *mrmrr, *tmp; + + if ( list_empty(&acpi_drhd_units) ) + return; + +- list_for_each_entry_safe ( mrmrr, tmp, &hd->arch.mapped_rmrrs, list ) +- { +- list_del(&mrmrr->list); +- xfree(mrmrr); +- } ++ iommu_identity_map_teardown(d); + + if ( iommu_use_hap_pt(d) ) + return; +@@ -1903,74 +1892,6 @@ static void iommu_set_pgd(struct domain + pagetable_get_paddr(pagetable_from_mfn(pgd_mfn)); + } + +-static int rmrr_identity_mapping(struct domain *d, bool_t map, +- const struct acpi_rmrr_unit *rmrr, +- u32 flag) +-{ +- unsigned long base_pfn = rmrr->base_address >> PAGE_SHIFT_4K; +- unsigned long end_pfn = PAGE_ALIGN_4K(rmrr->end_address) >> PAGE_SHIFT_4K; +- struct mapped_rmrr *mrmrr; +- struct domain_iommu *hd = dom_iommu(d); +- +- ASSERT(pcidevs_locked()); +- ASSERT(rmrr->base_address < rmrr->end_address); +- +- /* +- * No need to acquire hd->arch.mapping_lock: Both insertion and removal +- * get done while holding pcidevs_lock. +- */ +- list_for_each_entry( mrmrr, &hd->arch.mapped_rmrrs, list ) +- { +- if ( mrmrr->base == rmrr->base_address && +- mrmrr->end == rmrr->end_address ) +- { +- int ret = 0; +- +- if ( map ) +- { +- ++mrmrr->count; +- return 0; +- } +- +- if ( --mrmrr->count ) +- return 0; +- +- while ( base_pfn < end_pfn ) +- { +- if ( clear_identity_p2m_entry(d, base_pfn) ) +- ret = -ENXIO; +- base_pfn++; +- } +- +- list_del(&mrmrr->list); +- xfree(mrmrr); +- return ret; +- } +- } +- +- if ( !map ) +- return -ENOENT; +- +- while ( base_pfn < end_pfn ) +- { +- int err = set_identity_p2m_entry(d, base_pfn, p2m_access_rw, flag); +- +- if ( err ) +- return err; +- base_pfn++; +- } +- +- mrmrr = xmalloc(struct mapped_rmrr); +- if ( !mrmrr ) +- return -ENOMEM; +- mrmrr->base = rmrr->base_address; +- mrmrr->end = rmrr->end_address; +- mrmrr->count = 1; +- list_add_tail(&mrmrr->list, &hd->arch.mapped_rmrrs); +- +- return 0; +-} +- + static int intel_iommu_add_device(u8 devfn, struct pci_dev *pdev) + { + struct acpi_rmrr_unit *rmrr; +@@ -2002,7 +1923,9 @@ static int intel_iommu_add_device(u8 dev + * Since RMRRs are always reserved in the e820 map for the hardware + * domain, there shouldn't be a conflict. + */ +- ret = rmrr_identity_mapping(pdev->domain, 1, rmrr, 0); ++ ret = iommu_identity_mapping(pdev->domain, p2m_access_rw, ++ rmrr->base_address, rmrr->end_address, ++ 0); + if ( ret ) + dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n", + pdev->domain->domain_id); +@@ -2047,7 +1970,8 @@ static int intel_iommu_remove_device(u8 + * Any flag is nothing to clear these mappings but here + * its always safe and strict to set 0. 
+ */ +- rmrr_identity_mapping(pdev->domain, 0, rmrr, 0); ++ iommu_identity_mapping(pdev->domain, p2m_access_x, rmrr->base_address, ++ rmrr->end_address, 0); + } + + return domain_context_unmap(pdev->domain, devfn, pdev); +@@ -2214,7 +2138,8 @@ static void __hwdom_init setup_hwdom_rmr + * domain, there shouldn't be a conflict. So its always safe and + * strict to set 0. + */ +- ret = rmrr_identity_mapping(d, 1, rmrr, 0); ++ ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address, ++ rmrr->end_address, 0); + if ( ret ) + dprintk(XENLOG_ERR VTDPREFIX, + "IOMMU: mapping reserved region failed\n"); +@@ -2371,7 +2296,9 @@ static int reassign_device_ownership( + * Any RMRR flag is always ignored when remove a device, + * but its always safe and strict to set 0. + */ +- ret = rmrr_identity_mapping(source, 0, rmrr, 0); ++ ret = iommu_identity_mapping(source, p2m_access_x, ++ rmrr->base_address, ++ rmrr->end_address, 0); + if ( ret != -ENOENT ) + return ret; + } +@@ -2468,7 +2395,8 @@ static int intel_iommu_assign_device( + PCI_BUS(bdf) == bus && + PCI_DEVFN2(bdf) == devfn ) + { +- ret = rmrr_identity_mapping(d, 1, rmrr, flag); ++ ret = iommu_identity_mapping(d, p2m_access_rw, rmrr->base_address, ++ rmrr->end_address, flag); + if ( ret ) + { + int rc; +--- a/xen/drivers/passthrough/x86/iommu.c ++++ b/xen/drivers/passthrough/x86/iommu.c +@@ -144,7 +144,7 @@ int arch_iommu_domain_init(struct domain + struct domain_iommu *hd = dom_iommu(d); + + spin_lock_init(&hd->arch.mapping_lock); +- INIT_LIST_HEAD(&hd->arch.mapped_rmrrs); ++ INIT_LIST_HEAD(&hd->arch.identity_maps); + + return 0; + } +@@ -153,6 +153,99 @@ void arch_iommu_domain_destroy(struct do + { + } + ++struct identity_map { ++ struct list_head list; ++ paddr_t base, end; ++ p2m_access_t access; ++ unsigned int count; ++}; ++ ++int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma, ++ paddr_t base, paddr_t end, ++ unsigned int flag) ++{ ++ unsigned long base_pfn = base >> PAGE_SHIFT_4K; ++ unsigned long end_pfn = PAGE_ALIGN_4K(end) >> PAGE_SHIFT_4K; ++ struct identity_map *map; ++ struct domain_iommu *hd = dom_iommu(d); ++ ++ ASSERT(pcidevs_locked()); ++ ASSERT(base < end); ++ ++ /* ++ * No need to acquire hd->arch.mapping_lock: Both insertion and removal ++ * get done while holding pcidevs_lock. 
++ */ ++ list_for_each_entry( map, &hd->arch.identity_maps, list ) ++ { ++ if ( map->base == base && map->end == end ) ++ { ++ int ret = 0; ++ ++ if ( p2ma != p2m_access_x ) ++ { ++ if ( map->access != p2ma ) ++ return -EADDRINUSE; ++ ++map->count; ++ return 0; ++ } ++ ++ if ( --map->count ) ++ return 0; ++ ++ while ( base_pfn < end_pfn ) ++ { ++ if ( clear_identity_p2m_entry(d, base_pfn) ) ++ ret = -ENXIO; ++ base_pfn++; ++ } ++ ++ list_del(&map->list); ++ xfree(map); ++ ++ return ret; ++ } ++ ++ if ( end >= map->base && map->end >= base ) ++ return -EADDRINUSE; ++ } ++ ++ if ( p2ma == p2m_access_x ) ++ return -ENOENT; ++ ++ while ( base_pfn < end_pfn ) ++ { ++ int err = set_identity_p2m_entry(d, base_pfn, p2ma, flag); ++ ++ if ( err ) ++ return err; ++ base_pfn++; ++ } ++ ++ map = xmalloc(struct identity_map); ++ if ( !map ) ++ return -ENOMEM; ++ map->base = base; ++ map->end = end; ++ map->access = p2ma; ++ map->count = 1; ++ list_add_tail(&map->list, &hd->arch.identity_maps); ++ ++ return 0; ++} ++ ++void iommu_identity_map_teardown(struct domain *d) ++{ ++ struct domain_iommu *hd = dom_iommu(d); ++ struct identity_map *map, *tmp; ++ ++ list_for_each_entry_safe ( map, tmp, &hd->arch.identity_maps, list ) ++ { ++ list_del(&map->list); ++ xfree(map); ++ } ++} ++ + /* + * Local variables: + * mode: C +--- a/xen/include/asm-x86/iommu.h ++++ b/xen/include/asm-x86/iommu.h +@@ -16,6 +16,7 @@ + + #include + #include ++#include + #include + #include + #include +@@ -36,7 +37,7 @@ struct arch_iommu + spinlock_t mapping_lock; /* io page table lock */ + int agaw; /* adjusted guest address width, 0 is level 2 30-bit */ + u64 iommu_bitmap; /* bitmap of iommu(s) that the domain uses */ +- struct list_head mapped_rmrrs; ++ struct list_head identity_maps; + + /* amd iommu support */ + int paging_mode; +@@ -94,6 +95,11 @@ bool_t iommu_supports_eim(void); + int iommu_enable_x2apic_IR(void); + void iommu_disable_x2apic_IR(void); + ++int iommu_identity_mapping(struct domain *d, p2m_access_t p2ma, ++ paddr_t base, paddr_t end, ++ unsigned int flag); ++void iommu_identity_map_teardown(struct domain *d); ++ + extern bool untrusted_msi; + + int pi_update_irte(const struct pi_desc *pi_desc, const struct pirq *pirq, +--- a/xen/include/asm-x86/mem_access.h ++++ b/xen/include/asm-x86/mem_access.h +@@ -44,10 +44,8 @@ bool p2m_mem_access_emulate_check(struct + const vm_event_response_t *rsp); + + /* Sanity check for mem_access hardware support */ +-static inline bool p2m_mem_access_sanity_check(struct domain *d) +-{ +- return is_hvm_domain(d) && cpu_has_vmx && hap_enabled(d); +-} ++#define p2m_mem_access_sanity_check(d) \ ++ (is_hvm_domain(d) && cpu_has_vmx && hap_enabled(d)) + + #endif /*__ASM_X86_MEM_ACCESS_H__ */ + +--- a/xen/include/asm-x86/p2m.h ++++ b/xen/include/asm-x86/p2m.h +@@ -836,6 +836,34 @@ int p2m_altp2m_propagate_change(struct d + mfn_t mfn, unsigned int page_order, + p2m_type_t p2mt, p2m_access_t p2ma); + ++/* p2m access to IOMMU flags */ ++static inline unsigned int p2m_access_to_iommu_flags(p2m_access_t p2ma) ++{ ++ switch ( p2ma ) ++ { ++ case p2m_access_rw: ++ case p2m_access_rwx: ++ return IOMMUF_readable | IOMMUF_writable; ++ ++ case p2m_access_r: ++ case p2m_access_rx: ++ case p2m_access_rx2rw: ++ return IOMMUF_readable; ++ ++ case p2m_access_w: ++ case p2m_access_wx: ++ return IOMMUF_writable; ++ ++ case p2m_access_n: ++ case p2m_access_x: ++ case p2m_access_n2rwx: ++ return 0; ++ } ++ ++ ASSERT_UNREACHABLE(); ++ return 0; ++} ++ + /* + * p2m type to IOMMU flags + */ +@@ -857,9 +885,10 @@ 
static inline unsigned int p2m_get_iommu + flags = IOMMUF_readable; + break; + case p2m_mmio_direct: +- flags = IOMMUF_readable; +- if ( !rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) ) +- flags |= IOMMUF_writable; ++ flags = p2m_access_to_iommu_flags(p2ma); ++ if ( (flags & IOMMUF_writable) && ++ rangeset_contains_singleton(mmio_ro_ranges, mfn_x(mfn)) ) ++ flags &= ~IOMMUF_writable; + break; + default: + flags = 0; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-5.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-5.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-5.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-5.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,200 @@ +From: Jan Beulich +Subject: AMD/IOMMU: re-arrange/complete re-assignment handling + +Prior to the assignment step having completed successfully, devices +should not get associated with their new owner. Hand the device to DomIO +(perhaps temporarily), until after the de-assignment step has completed. + +De-assignment of a device (from other than Dom0) as well as failure of +reassign_device() during assignment should result in unity mappings +getting torn down. This in turn requires switching to a refcounted +mapping approach, as was already used by VT-d for its RMRRs, to prevent +unmapping a region used by multiple devices. + +This is CVE-2021-28696 / part of XSA-378. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -716,27 +716,49 @@ int amd_iommu_unmap_page(struct domain * + return 0; + } + +-int amd_iommu_reserve_domain_unity_map(struct domain *domain, +- u64 phys_addr, +- unsigned long size, int iw, int ir) ++int amd_iommu_reserve_domain_unity_map(struct domain *d, ++ const struct ivrs_unity_map *map, ++ unsigned int flag) + { +- unsigned long npages, i; +- unsigned long gfn; +- unsigned int flags = !!ir; +- int rt = 0; +- +- if ( iw ) +- flags |= IOMMUF_writable; +- +- npages = region_to_pages(phys_addr, size); +- gfn = phys_addr >> PAGE_SHIFT; +- for ( i = 0; i < npages; i++ ) ++ int rc; ++ ++ if ( d == dom_io ) ++ return 0; ++ ++ for ( rc = 0; !rc && map; map = map->next ) + { +- rt = amd_iommu_map_page(domain, gfn +i, gfn +i, flags); +- if ( rt != 0 ) +- return rt; ++ p2m_access_t p2ma = p2m_access_n; ++ ++ if ( map->read ) ++ p2ma |= p2m_access_r; ++ if ( map->write ) ++ p2ma |= p2m_access_w; ++ ++ rc = iommu_identity_mapping(d, p2ma, map->addr, ++ map->addr + map->length - 1, flag); + } +- return 0; ++ ++ return rc; ++} ++ ++int amd_iommu_reserve_domain_unity_unmap(struct domain *d, ++ const struct ivrs_unity_map *map) ++{ ++ int rc; ++ ++ if ( d == dom_io ) ++ return 0; ++ ++ for ( rc = 0; map; map = map->next ) ++ { ++ int ret = iommu_identity_mapping(d, p2m_access_x, map->addr, ++ map->addr + map->length - 1, 0); ++ ++ if ( ret && ret != -ENOENT && !rc ) ++ rc = ret; ++ } ++ ++ return rc; + } + + /* Share p2m table with iommu. 
*/ +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -333,6 +333,7 @@ static int reassign_device(struct domain + struct amd_iommu *iommu; + int bdf, rc; + struct domain_iommu *t = dom_iommu(target); ++ const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg); + + bdf = PCI_BDF2(pdev->bus, pdev->devfn); + iommu = find_iommu_for_device(pdev->seg, bdf); +@@ -347,10 +348,24 @@ static int reassign_device(struct domain + + amd_iommu_disable_domain_device(source, iommu, devfn, pdev); + +- if ( devfn == pdev->devfn ) ++ /* ++ * If the device belongs to the hardware domain, and it has a unity mapping, ++ * don't remove it from the hardware domain, because BIOS may reference that ++ * mapping. ++ */ ++ if ( !is_hardware_domain(source) ) ++ { ++ rc = amd_iommu_reserve_domain_unity_unmap( ++ source, ++ ivrs_mappings[get_dma_requestor_id(pdev->seg, bdf)].unity_map); ++ if ( rc ) ++ return rc; ++ } ++ ++ if ( devfn == pdev->devfn && pdev->domain != dom_io ) + { +- list_move(&pdev->domain_list, &target->arch.pdev_list); +- pdev->domain = target; ++ list_move(&pdev->domain_list, &dom_io->arch.pdev_list); ++ pdev->domain = dom_io; + } + + rc = allocate_domain_resources(t); +@@ -362,6 +377,12 @@ static int reassign_device(struct domain + pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn), + source->domain_id, target->domain_id); + ++ if ( devfn == pdev->devfn && pdev->domain != target ) ++ { ++ list_move(&pdev->domain_list, &target->arch.pdev_list); ++ pdev->domain = target; ++ } ++ + return 0; + } + +@@ -372,20 +393,28 @@ static int amd_iommu_assign_device(struc + struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg); + int bdf = PCI_BDF2(pdev->bus, devfn); + int req_id = get_dma_requestor_id(pdev->seg, bdf); +- const struct ivrs_unity_map *unity_map; ++ int rc = amd_iommu_reserve_domain_unity_map( ++ d, ivrs_mappings[req_id].unity_map, flag); ++ ++ if ( !rc ) ++ rc = reassign_device(pdev->domain, d, devfn, pdev); + +- for ( unity_map = ivrs_mappings[req_id].unity_map; unity_map; +- unity_map = unity_map->next ) ++ if ( rc && !is_hardware_domain(d) ) + { +- int rc = amd_iommu_reserve_domain_unity_map( +- d, unity_map->addr, unity_map->length, +- unity_map->write, unity_map->read); ++ int ret = amd_iommu_reserve_domain_unity_unmap( ++ d, ivrs_mappings[req_id].unity_map); + +- if ( rc ) +- return rc; ++ if ( ret ) ++ { ++ printk(XENLOG_ERR "AMD-Vi: " ++ "unity-unmap for d%d/%04x:%02x:%02x.%u failed (%d)\n", ++ d->domain_id, pdev->seg, pdev->bus, ++ PCI_SLOT(devfn), PCI_FUNC(devfn), ret); ++ domain_crash(d); ++ } + } + +- return reassign_device(pdev->domain, d, devfn, pdev); ++ return rc; + } + + static void deallocate_next_page_table(struct page_info *pg, int level) +@@ -451,6 +480,7 @@ static void deallocate_iommu_page_tables + + static void amd_iommu_domain_destroy(struct domain *d) + { ++ iommu_identity_map_teardown(d); + deallocate_iommu_page_tables(d); + amd_iommu_flush_all_pages(d); + } +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +@@ -60,8 +60,10 @@ int __must_check amd_iommu_unmap_page(st + u64 amd_iommu_get_next_table_from_pte(u32 *entry); + int __must_check amd_iommu_alloc_root(struct domain_iommu *hd); + int amd_iommu_reserve_domain_unity_map(struct domain *domain, +- u64 phys_addr, unsigned long size, +- int iw, int ir); ++ const struct ivrs_unity_map *map, ++ unsigned int flag); ++int amd_iommu_reserve_domain_unity_unmap(struct domain *d, ++ const 
struct ivrs_unity_map *map); + + /* Share p2m table with iommu */ + void amd_iommu_share_p2m(struct domain *d); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-6.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-6.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-6.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-6.patch 2022-06-01 21:13:47.000000000 +0100 @@ -0,0 +1,411 @@ +From: Jan Beulich +Subject: AMD/IOMMU: re-arrange exclusion range and unity map recording + +The spec makes no provisions for OS behavior here to depend on the +amount of RAM found on the system. While the spec may not sufficiently +clearly distinguish both kinds of regions, they are surely meant to be +separate things: Only regions with ACPI_IVMD_EXCLUSION_RANGE set should +be candidates for putting in the exclusion range registers. (As there's +only a single such pair of registers per IOMMU, secondary non-adjacent +regions with the flag set already get converted to unity mapped +regions.) + +First of all, drop the dependency on max_page. With commit b4f042236ae0 +("AMD/IOMMU: Cease using a dynamic height for the IOMMU pagetables") the +use of it here was stale anyway; it was bogus already before, as it +didn't account for max_page getting increased later on. Simply try an +exclusion range registration first, and if it fails (for being +unsuitable or non-mergeable), register a unity mapping range. + +With this various local variables become unnecessary and hence get +dropped at the same time. + +With the max_page boundary dropped for using unity maps, the minimum +page table tree height now needs both recording and enforcing in +amd_iommu_domain_init(). Since we can't predict which devices may get +assigned to a domain, our only option is to uniformly force at least +that height for all domains, now that the height isn't dynamic anymore. + +Further don't make use of the exclusion range unless ACPI data says so. + +Note that exclusion range registration in +register_range_for_all_devices() is on a best effort basis. Hence unity +map entries also registered are redundant when the former succeeded, but +they also do no harm. Improvements in this area can be done later imo. + +Also adjust types where suitable without touching extra lines. + +This is part of XSA-378. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/drivers/passthrough/amd/iommu_acpi.c ++++ b/xen/drivers/passthrough/amd/iommu_acpi.c +@@ -99,12 +99,8 @@ static struct amd_iommu * __init find_io + } + + static int __init reserve_iommu_exclusion_range( +- struct amd_iommu *iommu, uint64_t base, uint64_t limit, +- bool all, bool iw, bool ir) ++ struct amd_iommu *iommu, paddr_t base, paddr_t limit, bool all) + { +- if ( !ir || !iw ) +- return -EPERM; +- + /* need to extend exclusion range? */ + if ( iommu->exclusion_enable ) + { +@@ -133,14 +129,18 @@ static int __init reserve_unity_map_for_ + { + struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg); + struct ivrs_unity_map *unity_map = ivrs_mappings[bdf].unity_map; ++ int paging_mode = amd_iommu_get_paging_mode(PFN_UP(base + length)); ++ ++ if ( paging_mode < 0 ) ++ return paging_mode; + + /* Check for overlaps. */ + for ( ; unity_map; unity_map = unity_map->next ) + { + /* + * Exact matches are okay. This can in particular happen when +- * register_exclusion_range_for_device() calls here twice for the +- * same (s,b,d,f). 
++ * register_range_for_device() calls here twice for the same ++ * (s,b,d,f). + */ + if ( base == unity_map->addr && length == unity_map->length && + ir == unity_map->read && iw == unity_map->write ) +@@ -168,55 +168,52 @@ static int __init reserve_unity_map_for_ + unity_map->next = ivrs_mappings[bdf].unity_map; + ivrs_mappings[bdf].unity_map = unity_map; + ++ if ( paging_mode > amd_iommu_min_paging_mode ) ++ amd_iommu_min_paging_mode = paging_mode; ++ + return 0; + } + +-static int __init register_exclusion_range_for_all_devices( +- unsigned long base, unsigned long limit, u8 iw, u8 ir) ++static int __init register_range_for_all_devices( ++ paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion) + { + int seg = 0; /* XXX */ +- unsigned long range_top, iommu_top, length; + struct amd_iommu *iommu; +- unsigned int bdf; + int rc = 0; + + /* is part of exclusion range inside of IOMMU virtual address space? */ + /* note: 'limit' parameter is assumed to be page-aligned */ +- range_top = limit + PAGE_SIZE; +- iommu_top = max_page * PAGE_SIZE; +- if ( base < iommu_top ) +- { +- if ( range_top > iommu_top ) +- range_top = iommu_top; +- length = range_top - base; +- /* reserve r/w unity-mapped page entries for devices */ +- /* note: these entries are part of the exclusion range */ +- for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) +- rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir); +- /* push 'base' just outside of virtual address space */ +- base = iommu_top; +- } +- /* register IOMMU exclusion range settings */ +- if ( !rc && limit >= iommu_top ) ++ if ( exclusion ) + { + for_each_amd_iommu( iommu ) + { +- rc = reserve_iommu_exclusion_range(iommu, base, limit, +- true /* all */, iw, ir); +- if ( rc ) +- break; ++ int ret = reserve_iommu_exclusion_range(iommu, base, limit, ++ true /* all */); ++ ++ if ( ret && !rc ) ++ rc = ret; + } + } + ++ if ( !exclusion || rc ) ++ { ++ paddr_t length = limit + PAGE_SIZE - base; ++ unsigned int bdf; ++ ++ /* reserve r/w unity-mapped page entries for devices */ ++ for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) ++ rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir); ++ } ++ + return rc; + } + +-static int __init register_exclusion_range_for_device( +- u16 bdf, unsigned long base, unsigned long limit, u8 iw, u8 ir) ++static int __init register_range_for_device( ++ unsigned int bdf, paddr_t base, paddr_t limit, ++ bool iw, bool ir, bool exclusion) + { + int seg = 0; /* XXX */ + struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(seg); +- unsigned long range_top, iommu_top, length; + struct amd_iommu *iommu; + u16 req; + int rc = 0; +@@ -230,27 +227,19 @@ static int __init register_exclusion_ran + req = ivrs_mappings[bdf].dte_requestor_id; + + /* note: 'limit' parameter is assumed to be page-aligned */ +- range_top = limit + PAGE_SIZE; +- iommu_top = max_page * PAGE_SIZE; +- if ( base < iommu_top ) +- { +- if ( range_top > iommu_top ) +- range_top = iommu_top; +- length = range_top - base; ++ if ( exclusion ) ++ rc = reserve_iommu_exclusion_range(iommu, base, limit, ++ false /* all */); ++ if ( !exclusion || rc ) ++ { ++ paddr_t length = limit + PAGE_SIZE - base; ++ + /* reserve unity-mapped page entries for device */ +- /* note: these entries are part of the exclusion range */ + rc = reserve_unity_map_for_device(seg, bdf, base, length, iw, ir) ?: + reserve_unity_map_for_device(seg, req, base, length, iw, ir); +- +- /* push 'base' just outside of virtual address space */ +- base = iommu_top; + } +- +- /* 
register IOMMU exclusion range settings for device */ +- if ( !rc && limit >= iommu_top ) ++ else + { +- rc = reserve_iommu_exclusion_range(iommu, base, limit, +- false /* all */, iw, ir); + ivrs_mappings[bdf].dte_allow_exclusion = IOMMU_CONTROL_ENABLED; + ivrs_mappings[req].dte_allow_exclusion = IOMMU_CONTROL_ENABLED; + } +@@ -258,53 +247,42 @@ static int __init register_exclusion_ran + return rc; + } + +-static int __init register_exclusion_range_for_iommu_devices( +- struct amd_iommu *iommu, +- unsigned long base, unsigned long limit, u8 iw, u8 ir) ++static int __init register_range_for_iommu_devices( ++ struct amd_iommu *iommu, paddr_t base, paddr_t limit, ++ bool iw, bool ir, bool exclusion) + { +- unsigned long range_top, iommu_top, length; ++ /* note: 'limit' parameter is assumed to be page-aligned */ ++ paddr_t length = limit + PAGE_SIZE - base; + unsigned int bdf; + u16 req; +- int rc = 0; ++ int rc; + +- /* is part of exclusion range inside of IOMMU virtual address space? */ +- /* note: 'limit' parameter is assumed to be page-aligned */ +- range_top = limit + PAGE_SIZE; +- iommu_top = max_page * PAGE_SIZE; +- if ( base < iommu_top ) +- { +- if ( range_top > iommu_top ) +- range_top = iommu_top; +- length = range_top - base; +- /* reserve r/w unity-mapped page entries for devices */ +- /* note: these entries are part of the exclusion range */ +- for ( bdf = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) +- { +- if ( iommu == find_iommu_for_device(iommu->seg, bdf) ) +- { +- req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id; +- rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length, +- iw, ir) ?: +- reserve_unity_map_for_device(iommu->seg, req, base, length, +- iw, ir); +- } +- } +- +- /* push 'base' just outside of virtual address space */ +- base = iommu_top; ++ if ( exclusion ) ++ { ++ rc = reserve_iommu_exclusion_range(iommu, base, limit, true /* all */); ++ if ( !rc ) ++ return 0; + } + +- /* register IOMMU exclusion range settings */ +- if ( !rc && limit >= iommu_top ) +- rc = reserve_iommu_exclusion_range(iommu, base, limit, +- true /* all */, iw, ir); ++ /* reserve unity-mapped page entries for devices */ ++ for ( bdf = rc = 0; !rc && bdf < ivrs_bdf_entries; bdf++ ) ++ { ++ if ( iommu != find_iommu_for_device(iommu->seg, bdf) ) ++ continue; ++ ++ req = get_ivrs_mappings(iommu->seg)[bdf].dte_requestor_id; ++ rc = reserve_unity_map_for_device(iommu->seg, bdf, base, length, ++ iw, ir) ?: ++ reserve_unity_map_for_device(iommu->seg, req, base, length, ++ iw, ir); ++ } + + return rc; + } + + static int __init parse_ivmd_device_select( + const struct acpi_ivrs_memory *ivmd_block, +- unsigned long base, unsigned long limit, u8 iw, u8 ir) ++ paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion) + { + u16 bdf; + +@@ -315,12 +293,12 @@ static int __init parse_ivmd_device_sele + return -ENODEV; + } + +- return register_exclusion_range_for_device(bdf, base, limit, iw, ir); ++ return register_range_for_device(bdf, base, limit, iw, ir, exclusion); + } + + static int __init parse_ivmd_device_range( + const struct acpi_ivrs_memory *ivmd_block, +- unsigned long base, unsigned long limit, u8 iw, u8 ir) ++ paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion) + { + unsigned int first_bdf, last_bdf, bdf; + int error; +@@ -342,15 +320,15 @@ static int __init parse_ivmd_device_rang + } + + for ( bdf = first_bdf, error = 0; (bdf <= last_bdf) && !error; bdf++ ) +- error = register_exclusion_range_for_device( +- bdf, base, limit, iw, ir); ++ error = 
register_range_for_device( ++ bdf, base, limit, iw, ir, exclusion); + + return error; + } + + static int __init parse_ivmd_device_iommu( + const struct acpi_ivrs_memory *ivmd_block, +- unsigned long base, unsigned long limit, u8 iw, u8 ir) ++ paddr_t base, paddr_t limit, bool iw, bool ir, bool exclusion) + { + int seg = 0; /* XXX */ + struct amd_iommu *iommu; +@@ -365,14 +343,14 @@ static int __init parse_ivmd_device_iomm + return -ENODEV; + } + +- return register_exclusion_range_for_iommu_devices( +- iommu, base, limit, iw, ir); ++ return register_range_for_iommu_devices( ++ iommu, base, limit, iw, ir, exclusion); + } + + static int __init parse_ivmd_block(const struct acpi_ivrs_memory *ivmd_block) + { + unsigned long start_addr, mem_length, base, limit; +- u8 iw, ir; ++ bool iw = true, ir = true, exclusion = false; + + if ( ivmd_block->header.length < sizeof(*ivmd_block) ) + { +@@ -389,13 +367,11 @@ static int __init parse_ivmd_block(const + ivmd_block->header.type, start_addr, mem_length); + + if ( ivmd_block->header.flags & ACPI_IVMD_EXCLUSION_RANGE ) +- iw = ir = IOMMU_CONTROL_ENABLED; ++ exclusion = true; + else if ( ivmd_block->header.flags & ACPI_IVMD_UNITY ) + { +- iw = ivmd_block->header.flags & ACPI_IVMD_READ ? +- IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED; +- ir = ivmd_block->header.flags & ACPI_IVMD_WRITE ? +- IOMMU_CONTROL_ENABLED : IOMMU_CONTROL_DISABLED; ++ iw = ivmd_block->header.flags & ACPI_IVMD_READ; ++ ir = ivmd_block->header.flags & ACPI_IVMD_WRITE; + } + else + { +@@ -406,20 +382,20 @@ static int __init parse_ivmd_block(const + switch( ivmd_block->header.type ) + { + case ACPI_IVRS_TYPE_MEMORY_ALL: +- return register_exclusion_range_for_all_devices( +- base, limit, iw, ir); ++ return register_range_for_all_devices( ++ base, limit, iw, ir, exclusion); + + case ACPI_IVRS_TYPE_MEMORY_ONE: +- return parse_ivmd_device_select(ivmd_block, +- base, limit, iw, ir); ++ return parse_ivmd_device_select(ivmd_block, base, limit, ++ iw, ir, exclusion); + + case ACPI_IVRS_TYPE_MEMORY_RANGE: +- return parse_ivmd_device_range(ivmd_block, +- base, limit, iw, ir); ++ return parse_ivmd_device_range(ivmd_block, base, limit, ++ iw, ir, exclusion); + + case ACPI_IVRS_TYPE_MEMORY_IOMMU: +- return parse_ivmd_device_iommu(ivmd_block, +- base, limit, iw, ir); ++ return parse_ivmd_device_iommu(ivmd_block, base, limit, ++ iw, ir, exclusion); + + default: + AMD_IOMMU_DEBUG("IVMD Error: Invalid Block Type!\n"); +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -218,6 +218,8 @@ static int __must_check allocate_domain_ + return rc; + } + ++int __read_mostly amd_iommu_min_paging_mode = 1; ++ + static int amd_iommu_domain_init(struct domain *d) + { + struct domain_iommu *hd = dom_iommu(d); +@@ -229,11 +231,13 @@ static int amd_iommu_domain_init(struct + * - HVM could in principle use 3 or 4 depending on how much guest + * physical address space we give it, but this isn't known yet so use 4 + * unilaterally. ++ * - Unity maps may require an even higher number. + */ +- hd->arch.paging_mode = amd_iommu_get_paging_mode( +- is_hvm_domain(d) +- ? 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT) +- : get_upper_mfn_bound() + 1); ++ hd->arch.paging_mode = max(amd_iommu_get_paging_mode( ++ is_hvm_domain(d) ++ ? 
1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT) ++ : get_upper_mfn_bound() + 1), ++ amd_iommu_min_paging_mode); + + return 0; + } +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +@@ -126,6 +126,8 @@ extern struct hpet_sbdf { + } init; + } hpet_sbdf; + ++extern int amd_iommu_min_paging_mode; ++ + extern void *shared_intremap_table; + extern unsigned long *shared_intremap_inuse; + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-7.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-7.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-7.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-7.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,88 @@ +From: Jan Beulich +Subject: x86/p2m: introduce p2m_is_special() + +Seeing the similarity of grant, foreign, and (subsequently) direct-MMIO +handling, introduce a new P2M type group named "special" (as in "needing +special accessors to create/destroy"). + +Also use -EPERM instead of other error codes on the two domain_crash() +paths touched. + +This is part of XSA-378. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -736,7 +736,7 @@ p2m_remove_page(struct p2m_domain *p2m, + for ( i = 0; i < (1UL << page_order); i++ ) + { + p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL); +- if ( !p2m_is_grant(t) && !p2m_is_shared(t) && !p2m_is_foreign(t) ) ++ if ( !p2m_is_special(t) && !p2m_is_shared(t) ) + set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY); + } + } +@@ -848,13 +848,13 @@ guest_physmap_add_entry(struct domain *d + &ot, &a, 0, NULL, NULL); + ASSERT(!p2m_is_shared(ot)); + } +- if ( p2m_is_grant(ot) || p2m_is_foreign(ot) ) ++ if ( p2m_is_special(ot) ) + { +- /* Really shouldn't be unmapping grant/foreign maps this way */ ++ /* Don't permit unmapping grant/foreign this way. */ + domain_crash(d); + p2m_unlock(p2m); + +- return -EINVAL; ++ return -EPERM; + } + else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) ) + { +@@ -947,8 +947,7 @@ int p2m_change_type_one(struct domain *d + struct p2m_domain *p2m = p2m_get_hostp2m(d); + int rc; + +- BUG_ON(p2m_is_grant(ot) || p2m_is_grant(nt)); +- BUG_ON(p2m_is_foreign(ot) || p2m_is_foreign(nt)); ++ BUG_ON(p2m_is_special(ot) || p2m_is_special(nt)); + + gfn_lock(p2m, gfn, 0); + +@@ -1091,11 +1090,11 @@ static int set_typed_p2m_entry(struct do + gfn_unlock(p2m, gfn, order); + return cur_order + 1; + } +- if ( p2m_is_grant(ot) || p2m_is_foreign(ot) ) ++ if ( p2m_is_special(ot) ) + { + gfn_unlock(p2m, gfn, order); + domain_crash(d); +- return -ENOENT; ++ return -EPERM; + } + else if ( p2m_is_ram(ot) ) + { +--- a/xen/include/asm-x86/p2m.h ++++ b/xen/include/asm-x86/p2m.h +@@ -142,6 +142,10 @@ typedef unsigned int p2m_query_t; + | p2m_to_mask(p2m_ram_logdirty) ) + #define P2M_SHARED_TYPES (p2m_to_mask(p2m_ram_shared)) + ++/* Types established/cleaned up via special accessors. */ ++#define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \ ++ p2m_to_mask(p2m_map_foreign)) ++ + /* Valid types not necessarily associated with a (valid) MFN. 
*/ + #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES \ + | p2m_to_mask(p2m_mmio_direct) \ +@@ -170,6 +174,7 @@ typedef unsigned int p2m_query_t; + #define p2m_is_paged(_t) (p2m_to_mask(_t) & P2M_PAGED_TYPES) + #define p2m_is_sharable(_t) (p2m_to_mask(_t) & P2M_SHARABLE_TYPES) + #define p2m_is_shared(_t) (p2m_to_mask(_t) & P2M_SHARED_TYPES) ++#define p2m_is_special(_t) (p2m_to_mask(_t) & P2M_SPECIAL_TYPES) + #define p2m_is_broken(_t) (p2m_to_mask(_t) & P2M_BROKEN_TYPES) + #define p2m_is_foreign(_t) (p2m_to_mask(_t) & p2m_to_mask(p2m_map_foreign)) + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-8.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-8.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-8.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa378-4.11-8.patch 2022-05-26 17:34:24.000000000 +0100 @@ -0,0 +1,157 @@ +From: Jan Beulich +Subject: x86/p2m: guard (in particular) identity mapping entries + +Such entries, created by set_identity_p2m_entry(), should only be +destroyed by clear_identity_p2m_entry(). However, similarly, entries +created by set_mmio_p2m_entry() should only be torn down by +clear_mmio_p2m_entry(), so the logic gets based upon p2m_mmio_direct as +the entry type (separation between "ordinary" and 1:1 mappings would +require a further indicator to tell apart the two). + +As to the guest_remove_page() change, commit 48dfb297a20a ("x86/PVH: +allow guest_remove_page to remove p2m_mmio_direct pages"), which +introduced the call to clear_mmio_p2m_entry(), claimed this was done for +hwdom only without this actually having been the case. However, this +code shouldn't be there in the first place, as MMIO entries shouldn't be +dropped this way. Avoid triggering the warning again that 48dfb297a20a +silenced by an adjustment to xenmem_add_to_physmap_one() instead. + +Note that guest_physmap_mark_populate_on_demand() gets tightened beyond +the immediate purpose of this change. + +Note also that I didn't inspect code which isn't security supported, +e.g. sharing, paging, or altp2m. + +This is CVE-2021-28694 / part of XSA-378. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -4783,7 +4783,9 @@ int xenmem_add_to_physmap_one( + + /* Remove previously mapped page if it was present. */ + prev_mfn = mfn_x(get_gfn(d, gfn_x(gpfn), &p2mt)); +- if ( mfn_valid(_mfn(prev_mfn)) ) ++ if ( p2mt == p2m_mmio_direct ) ++ rc = -EPERM; ++ else if ( mfn_valid(_mfn(prev_mfn)) ) + { + if ( is_xen_heap_mfn(prev_mfn) ) + /* Xen heap frames are simply unhooked from this phys slot. */ +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -725,7 +725,8 @@ p2m_remove_page(struct p2m_domain *p2m, + &cur_order, NULL); + + if ( p2m_is_valid(t) && +- (!mfn_valid(_mfn(mfn)) || mfn + i != mfn_x(mfn_return)) ) ++ (!mfn_valid(_mfn(mfn)) || t == p2m_mmio_direct || ++ mfn + i != mfn_x(mfn_return)) ) + return -EILSEQ; + + i += (1UL << cur_order) - ((gfn_l + i) & ((1UL << cur_order) - 1)); +@@ -803,7 +804,7 @@ guest_physmap_add_entry(struct domain *d + if ( p2m_is_foreign(t) ) + return -EINVAL; + +- if ( !mfn_valid(mfn) ) ++ if ( !mfn_valid(mfn) || t == p2m_mmio_direct ) + { + ASSERT_UNREACHABLE(); + return -EINVAL; +@@ -850,7 +851,7 @@ guest_physmap_add_entry(struct domain *d + } + if ( p2m_is_special(ot) ) + { +- /* Don't permit unmapping grant/foreign this way. */ ++ /* Don't permit unmapping grant/foreign/direct-MMIO this way. 
*/ + domain_crash(d); + p2m_unlock(p2m); + +@@ -1192,8 +1193,8 @@ int set_identity_p2m_entry(struct domain + * order+1 for caller to retry with order (guaranteed smaller than + * the order value passed in) + */ +-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, mfn_t mfn, +- unsigned int order) ++static int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn_l, ++ mfn_t mfn, unsigned int order) + { + int rc = -EINVAL; + gfn_t gfn = _gfn(gfn_l); +--- a/xen/arch/x86/mm/p2m-pod.c ++++ b/xen/arch/x86/mm/p2m-pod.c +@@ -1302,17 +1302,17 @@ guest_physmap_mark_populate_on_demand(st + + p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, &cur_order, NULL); + n = 1UL << min(order, cur_order); +- if ( p2m_is_ram(ot) ) ++ if ( ot == p2m_populate_on_demand ) ++ { ++ /* Count how many PoD entries we'll be replacing if successful */ ++ pod_count += n; ++ } ++ else if ( ot != p2m_invalid && ot != p2m_mmio_dm ) + { + P2M_DEBUG("gfn_to_mfn returned type %d!\n", ot); + rc = -EBUSY; + goto out; + } +- else if ( ot == p2m_populate_on_demand ) +- { +- /* Count how man PoD entries we'll be replacing if successful */ +- pod_count += n; +- } + } + + /* Now, actually do the two-way mapping */ +--- a/xen/common/memory.c ++++ b/xen/common/memory.c +@@ -335,7 +335,7 @@ int guest_remove_page(struct domain *d, + } + if ( p2mt == p2m_mmio_direct ) + { +- rc = clear_mmio_p2m_entry(d, gmfn, mfn, PAGE_ORDER_4K); ++ rc = -EPERM; + goto out_put_gfn; + } + #else +@@ -1651,6 +1651,15 @@ int prepare_ring_for_helper( + return -ENOENT; + } + #endif ++#ifdef CONFIG_X86 ++ if ( p2mt == p2m_mmio_direct ) ++ { ++ if ( page ) ++ put_page(page); ++ ++ return -EPERM; ++ } ++#endif + + if ( !page ) + return -EINVAL; +--- a/xen/include/asm-x86/p2m.h ++++ b/xen/include/asm-x86/p2m.h +@@ -144,7 +144,8 @@ typedef unsigned int p2m_query_t; + + /* Types established/cleaned up via special accessors. */ + #define P2M_SPECIAL_TYPES (P2M_GRANT_TYPES | \ +- p2m_to_mask(p2m_map_foreign)) ++ p2m_to_mask(p2m_map_foreign) | \ ++ p2m_to_mask(p2m_mmio_direct)) + + /* Valid types not necessarily associated with a (valid) MFN. */ + #define P2M_INVALID_MFN_TYPES (P2M_POD_TYPES \ +@@ -629,8 +630,6 @@ int set_foreign_p2m_entry(struct domain + /* Set mmio addresses in the p2m table (for pass-through) */ + int set_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, + unsigned int order, p2m_access_t access); +-int clear_mmio_p2m_entry(struct domain *d, unsigned long gfn, mfn_t mfn, +- unsigned int order); + + /* Set identity addresses in the p2m table (for pass-through) */ + int set_identity_p2m_entry(struct domain *d, unsigned long gfn, diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa379-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa379-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa379-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa379-4.12.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,77 @@ +From: Jan Beulich +Subject: x86/mm: widen locked region in xenmem_add_to_physmap_one() + +For pages which can be made part of the P2M by the guest, but which can +also later be de-allocated (grant table v2 status pages being the +present example), it is imperative that they be mapped at no more than a +single GFN. We therefore need to make sure that of two parallel +XENMAPSPACE_grant_table requests for the same status page one completes +before the second checks at which other GFN the underlying MFN is +presently mapped. + +Push down the respective put_gfn(). 
This leverages that gfn_lock() +really aliases p2m_lock(), but the function makes this assumption +already anyway: In the XENMAPSPACE_gmfn case lock nesting constraints +for both involved GFNs would otherwise need to be enforced to avoid ABBA +deadlocks. + +This is CVE-2021-28697 / XSA-379. + +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall + +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -4807,8 +4807,20 @@ int xenmem_add_to_physmap_one( + goto put_both; + } + +- /* Remove previously mapped page if it was present. */ ++ /* ++ * Note that we're (ab)using GFN locking (to really be locking of the ++ * entire P2M) here in (at least) two ways: Finer grained locking would ++ * expose lock order violations in the XENMAPSPACE_gmfn case (due to the ++ * earlier get_gfn_unshare() above). Plus at the very least for the grant ++ * table v2 status page case we need to guarantee that the same page can ++ * only appear at a single GFN. While this is a property we want in ++ * general, for pages which can subsequently be freed this imperative: ++ * Upon freeing we wouldn't be able to find other mappings in the P2M ++ * (unless we did a brute force search). ++ */ + prev_mfn = mfn_x(get_gfn(d, gfn_x(gpfn), &p2mt)); ++ ++ /* Remove previously mapped page if it was present. */ + if ( p2mt == p2m_mmio_direct ) + rc = -EPERM; + else if ( mfn_valid(_mfn(prev_mfn)) ) +@@ -4820,27 +4832,21 @@ int xenmem_add_to_physmap_one( + /* Normal domain memory is freed, to avoid leaking memory. */ + rc = guest_remove_page(d, gfn_x(gpfn)); + } +- /* In the XENMAPSPACE_gmfn case we still hold a ref on the old page. */ +- put_gfn(d, gfn_x(gpfn)); +- +- if ( rc ) +- goto put_both; + + /* Unmap from old location, if any. */ + old_gpfn = get_gpfn_from_mfn(mfn_x(mfn)); + ASSERT(!SHARED_M2P(old_gpfn)); + if ( space == XENMAPSPACE_gmfn && old_gpfn != gfn ) +- { + rc = -EXDEV; +- goto put_both; +- } +- if ( old_gpfn != INVALID_M2P_ENTRY ) ++ else if ( !rc && old_gpfn != INVALID_M2P_ENTRY ) + rc = guest_physmap_remove_page(d, _gfn(old_gpfn), mfn, PAGE_ORDER_4K); + + /* Map at new location. */ + if ( !rc ) + rc = guest_physmap_add_page(d, gpfn, mfn, PAGE_ORDER_4K); + ++ put_gfn(d, gfn_x(gpfn)); ++ + put_both: + /* + * In the XENMAPSPACE_gmfn case, we took a ref of the gfn at the top. diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-3.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-3.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-3.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-3.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,74 @@ +From: Jan Beulich +Subject: gnttab: avoid triggering assertion in radix_tree_ulong_to_ptr() + +Relevant quotes from the C11 standard: + +"Except where explicitly stated otherwise, for the purposes of this + subclause unnamed members of objects of structure and union type do not + participate in initialization. Unnamed members of structure objects + have indeterminate value even after initialization." + +"If there are fewer initializers in a brace-enclosed list than there are + elements or members of an aggregate, [...], the remainder of the + aggregate shall be initialized implicitly the same as objects that have + static storage duration." + +"If an object that has static or thread storage duration is not + initialized explicitly, then: + [...] 
+ — if it is an aggregate, every member is initialized (recursively) + according to these rules, and any padding is initialized to zero + bits; + [...]" + +"A bit-field declaration with no declarator, but only a colon and a + width, indicates an unnamed bit-field." Footnote: "An unnamed bit-field + structure member is useful for padding to conform to externally imposed + layouts." + +"There may be unnamed padding within a structure object, but not at its + beginning." + +Which makes me conclude: +- Whether an unnamed bit-field member is an unnamed member or padding is + unclear, and hence also whether the last quote above would render the + big endian case of the structure declaration invalid. +- Whether the number of members of an aggregate includes unnamed ones is + also not really clear. +- The initializer in map_grant_ref() initializes all fields of the "cnt" + sub-structure of the union, so assuming the second quote above applies + here (indirectly), the compiler isn't required to implicitly + initialize the rest (i.e. in particular any padding) like would happen + for static storage duration objects. + +Gcc 7.4.1 can be observed (apparently in debug builds only) to translate +aforementioned initializer to a read-modify-write operation of a stack +variable, leaving unchanged the top two bits of whatever was previously +in that stack slot. Clearly if either of the two bits were set, +radix_tree_ulong_to_ptr()'s assertion would trigger. + +Therefore, to be on the safe side, add an explicit padding field for the +non-big-endian-bitfields case and give a dummy name to both padding +fields. + +Fixes: 9781b51efde2 ("gnttab: replace mapkind()") +Signed-off-by: Jan Beulich +Acked-by: Andrew Cooper + +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -952,10 +952,13 @@ union maptrack_node { + struct { + /* Radix tree slot pointers use two of the bits. */ + #ifdef __BIG_ENDIAN_BITFIELD +- unsigned long : 2; ++ unsigned long _0 : 2; + #endif + unsigned long rd : BITS_PER_LONG / 2 - 1; + unsigned long wr : BITS_PER_LONG / 2 - 1; ++#ifndef __BIG_ENDIAN_BITFIELD ++ unsigned long _0 : 2; ++#endif + } cnt; + unsigned long raw; + }; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-1.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,139 @@ +From: Jan Beulich +Subject: gnttab: add preemption check to gnttab_release_mappings() + +A guest may die with many grant mappings still in place, or simply with +a large maptrack table. Iterating through this may take more time than +is reasonable without intermediate preemption (to run softirqs and +perhaps the scheduler). + +Move the invocation of the function to the section where other +restartable functions get invoked, and have the function itself check +for preemption every once in a while. Have it iterate the table +backwards, such that decreasing the maptrack limit is all it takes to +convey restart information. + +In domain_teardown() introduce PROG_none such that inserting at the +front will be easier going forward. + +This is part of CVE-2021-28698 / XSA-380. 
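
The restartable-loop shape described above can be illustrated with a self-contained sketch. The demo_* names, DEMO_PER_PAGE and the DEMO_RESTART value are invented stand-ins (the real code works on maptrack pages, hypercall_preempt_check() and -ERESTART); only the control flow mirrors the patch.

#include <stdbool.h>

#define DEMO_PER_PAGE 256u           /* entries per backing page (made up) */
#define DEMO_RESTART  (-1)           /* stands in for -ERESTART */

struct demo_table {
    unsigned int limit;              /* only ever shrinks during cleanup */
};

static bool demo_preempt_pending(void) { return false; }   /* stub */
static void demo_release_entry(struct demo_table *t, unsigned int h)
{
    (void)t; (void)h;                /* release one mapping (stub) */
}

static int demo_release_mappings(struct demo_table *t)
{
    for ( unsigned int handle = t->limit; handle; )
    {
        /*
         * Consider preemption only on page boundaries; progress is
         * recorded simply by lowering the limit before bailing out, so
         * the next invocation resumes where this one stopped.
         */
        if ( handle < t->limit && !(handle % DEMO_PER_PAGE) )
        {
            t->limit = handle;       /* Xen also frees the trailing page */
            if ( demo_preempt_pending() )
                return DEMO_RESTART;
        }

        --handle;
        demo_release_entry(t, handle);
    }

    t->limit = 0;
    return 0;
}
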
+ +Reported-by: Andrew Cooper +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall + +--- a/xen/common/domain.c ++++ b/xen/common/domain.c +@@ -646,13 +646,15 @@ int domain_kill(struct domain *d) + if ( d->is_dying != DOMDYING_alive ) + return domain_kill(d); + d->is_dying = DOMDYING_dying; +- gnttab_release_mappings(d); + tmem_destroy(d->tmem_client); + vnuma_destroy(d->vnuma); + domain_set_outstanding_pages(d, 0); + d->tmem_client = NULL; + /* fallthrough */ + case DOMDYING_dying: ++ rc = gnttab_release_mappings(d); ++ if ( rc ) ++ break; + rc = evtchn_destroy(d); + if ( rc ) + break; +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -62,7 +62,13 @@ struct grant_table { + unsigned int nr_grant_frames; + /* Number of grant status frames shared with guest (for version 2) */ + unsigned int nr_status_frames; +- /* Number of available maptrack entries. */ ++ /* ++ * Number of available maptrack entries. For cleanup purposes it is ++ * important to realize that this field and @maptrack further down will ++ * only ever be accessed by the local domain. Thus it is okay to clean ++ * up early, and to shrink the limit for the purpose of tracking cleanup ++ * progress. ++ */ + unsigned int maptrack_limit; + /* Shared grant table (see include/public/grant_table.h). */ + union { +@@ -3618,9 +3624,7 @@ grant_table_create( + return ret; + } + +-void +-gnttab_release_mappings( +- struct domain *d) ++int gnttab_release_mappings(struct domain *d) + { + struct grant_table *gt = d->grant_table, *rgt; + struct grant_mapping *map; +@@ -3634,8 +3638,32 @@ gnttab_release_mappings( + + BUG_ON(!d->is_dying); + +- for ( handle = 0; handle < gt->maptrack_limit; handle++ ) ++ if ( !gt || !gt->maptrack ) ++ return 0; ++ ++ for ( handle = gt->maptrack_limit; handle; ) + { ++ /* ++ * Deal with full pages such that their freeing (in the body of the ++ * if()) remains simple. ++ */ ++ if ( handle < gt->maptrack_limit && !(handle % MAPTRACK_PER_PAGE) ) ++ { ++ /* ++ * Changing maptrack_limit alters nr_maptrack_frames()'es return ++ * value. Free the then excess trailing page right here, rather ++ * than leaving it to grant_table_destroy() (and in turn requiring ++ * to leave gt->maptrack_limit unaltered). ++ */ ++ gt->maptrack_limit = handle; ++ FREE_XENHEAP_PAGE(gt->maptrack[nr_maptrack_frames(gt)]); ++ ++ if ( hypercall_preempt_check() ) ++ return -ERESTART; ++ } ++ ++ --handle; ++ + map = &maptrack_entry(gt, handle); + if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) ) + continue; +@@ -3723,6 +3751,11 @@ gnttab_release_mappings( + + map->flags = 0; + } ++ ++ gt->maptrack_limit = 0; ++ FREE_XENHEAP_PAGE(gt->maptrack[0]); ++ ++ return 0; + } + + void grant_table_warn_active_grants(struct domain *d) +@@ -3785,8 +3818,7 @@ grant_table_destroy( + free_xenheap_page(t->shared_raw[i]); + xfree(t->shared_raw); + +- for ( i = 0; i < nr_maptrack_frames(t); i++ ) +- free_xenheap_page(t->maptrack[i]); ++ ASSERT(!t->maptrack_limit); + vfree(t->maptrack); + + for ( i = 0; i < nr_active_grant_frames(t); i++ ) +--- a/xen/include/xen/grant_table.h ++++ b/xen/include/xen/grant_table.h +@@ -46,9 +46,7 @@ int grant_table_set_limits(struct domain + void grant_table_warn_active_grants(struct domain *d); + + /* Domain death release of granted mappings of other domains' memory. 
*/ +-void +-gnttab_release_mappings( +- struct domain *d); ++int gnttab_release_mappings(struct domain *d); + + int mem_sharing_gref_to_gfn(struct grant_table *gt, grant_ref_t ref, + gfn_t *gfn, uint16_t *status); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa380-4.11-2.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,384 @@ +From: Jan Beulich +Subject: gnttab: replace mapkind() + +mapkind() doesn't scale very well with larger maptrack entry counts, +using a brute force linear search through all entries, with the only +option of an early loop exit if a matching writable entry was found. +Introduce a radix tree alongside the main maptrack table, thus +allowing much faster MFN-based lookup. To avoid the need to actually +allocate space for the individual nodes, encode the two counters in the +node pointers themselves, thus limiting the number of permitted +simultaneous r/o and r/w mappings of the same MFN to 2³¹-1 (64-bit) / +2¹⁵-1 (32-bit) each. + +To avoid enforcing an unnecessarily low bound on the number of +simultaneous mappings of a single MFN, introduce +radix_tree_{ulong_to_ptr,ptr_to_ulong} paralleling +radix_tree_{int_to_ptr,ptr_to_int}. + +As a consequence locking changes are also applicable: With there no +longer being any inspection of the remote domain's active entries, +there's also no need anymore to hold the remote domain's grant table +lock. And since we're no longer iterating over the local domain's map +track table, the lock in map_grant_ref() can also be dropped before the +new maptrack entry actually gets populated. + +As a nice side effect this also reduces the number of IOMMU operations +in unmap_common(): Previously we would have "established" a readable +mapping whenever we didn't find a writable entry anymore (yet, of +course, at least one readable one). But we only need to do this if we +actually dropped the last writable entry, not if there were none already +before. + +This is part of CVE-2021-28698 / XSA-380. + +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall + +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -36,6 +36,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -80,8 +81,13 @@ struct grant_table { + grant_status_t **status; + /* Active grant table. */ + struct active_grant_entry **active; +- /* Mapping tracking table per vcpu. */ ++ /* Handle-indexed tracking table of mappings. */ + struct grant_mapping **maptrack; ++ /* ++ * MFN-indexed tracking tree of mappings, if needed. Note that this is ++ * protected by @lock, not @maptrack_lock. ++ */ ++ struct radix_tree_root maptrack_tree; + + /* Domain to which this struct grant_table belongs. */ + const struct domain *domain; +@@ -421,34 +427,6 @@ static int get_paged_frame(unsigned long + return rc; + } + +-static inline void +-double_gt_lock(struct grant_table *lgt, struct grant_table *rgt) +-{ +- /* +- * See mapkind() for why the write lock is also required for the +- * remote domain. 
+- */ +- if ( lgt < rgt ) +- { +- grant_write_lock(lgt); +- grant_write_lock(rgt); +- } +- else +- { +- if ( lgt != rgt ) +- grant_write_lock(rgt); +- grant_write_lock(lgt); +- } +-} +- +-static inline void +-double_gt_unlock(struct grant_table *lgt, struct grant_table *rgt) +-{ +- grant_write_unlock(lgt); +- if ( lgt != rgt ) +- grant_write_unlock(rgt); +-} +- + #define INVALID_MAPTRACK_HANDLE UINT_MAX + + static inline grant_handle_t +@@ -871,41 +849,17 @@ static struct active_grant_entry *grant_ + return ERR_PTR(-EINVAL); + } + +-#define MAPKIND_READ 1 +-#define MAPKIND_WRITE 2 +-static unsigned int mapkind( +- struct grant_table *lgt, const struct domain *rd, mfn_t mfn) +-{ +- struct grant_mapping *map; +- grant_handle_t handle, limit = lgt->maptrack_limit; +- unsigned int kind = 0; +- +- /* +- * Must have the local domain's grant table write lock when +- * iterating over its maptrack entries. +- */ +- ASSERT(percpu_rw_is_write_locked(&lgt->lock)); +- /* +- * Must have the remote domain's grant table write lock while +- * counting its active entries. +- */ +- ASSERT(percpu_rw_is_write_locked(&rd->grant_table->lock)); +- +- smp_rmb(); +- +- for ( handle = 0; !(kind & MAPKIND_WRITE) && handle < limit; handle++ ) +- { +- map = &maptrack_entry(lgt, handle); +- if ( !(map->flags & (GNTMAP_device_map|GNTMAP_host_map)) || +- map->domid != rd->domain_id ) +- continue; +- if ( mfn_eq(_active_entry(rd->grant_table, map->ref).frame, mfn) ) +- kind |= map->flags & GNTMAP_readonly ? +- MAPKIND_READ : MAPKIND_WRITE; +- } +- +- return kind; +-} ++union maptrack_node { ++ struct { ++ /* Radix tree slot pointers use two of the bits. */ ++#ifdef __BIG_ENDIAN_BITFIELD ++ unsigned long : 2; ++#endif ++ unsigned long rd : BITS_PER_LONG / 2 - 1; ++ unsigned long wr : BITS_PER_LONG / 2 - 1; ++ } cnt; ++ unsigned long raw; ++}; + + /* + * Returns 0 if TLB flush / invalidate required by caller. +@@ -931,7 +885,6 @@ map_grant_ref( + struct grant_mapping *mt; + grant_entry_header_t *shah; + uint16_t *status; +- bool_t need_iommu; + + led = current; + ld = led->domain; +@@ -1139,31 +1092,75 @@ map_grant_ref( + goto undo_out; + } + +- need_iommu = gnttab_need_iommu_mapping(ld); +- if ( need_iommu ) ++ if ( gnttab_need_iommu_mapping(ld) ) + { ++ union maptrack_node node = { ++ .cnt.rd = !!(op->flags & GNTMAP_readonly), ++ .cnt.wr = !(op->flags & GNTMAP_readonly), ++ }; ++ int err; ++ void **slot = NULL; + unsigned int kind; + +- double_gt_lock(lgt, rgt); ++ grant_write_lock(lgt); ++ ++ err = radix_tree_insert(&lgt->maptrack_tree, mfn_x(frame), ++ radix_tree_ulong_to_ptr(node.raw)); ++ if ( err == -EEXIST ) ++ { ++ slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(frame)); ++ if ( likely(slot) ) ++ { ++ node.raw = radix_tree_ptr_to_ulong(*slot); ++ err = -EBUSY; ++ ++ /* Update node only when refcount doesn't overflow. */ ++ if ( op->flags & GNTMAP_readonly ? ++node.cnt.rd ++ : ++node.cnt.wr ) ++ { ++ radix_tree_replace_slot(slot, ++ radix_tree_ulong_to_ptr(node.raw)); ++ err = 0; ++ } ++ } ++ else ++ ASSERT_UNREACHABLE(); ++ } + + /* + * We're not translated, so we know that dfns and mfns are + * the same things, so the IOMMU entry is always 1-to-1. 
+ */ +- kind = mapkind(lgt, rd, frame); +- if ( !(op->flags & GNTMAP_readonly) && +- !(kind & MAPKIND_WRITE) ) ++ if ( !(op->flags & GNTMAP_readonly) && node.cnt.wr == 1 ) + kind = IOMMUF_readable | IOMMUF_writable; +- else if ( !kind ) ++ else if ( (op->flags & GNTMAP_readonly) && ++ node.cnt.rd == 1 && !node.cnt.wr ) + kind = IOMMUF_readable; + else + kind = 0; +- if ( kind && iommu_map_page(ld, mfn_x(frame), mfn_x(frame), kind) ) ++ if ( err || ++ (kind && iommu_map_page(ld, mfn_x(frame), mfn_x(frame), kind)) ) + { +- double_gt_unlock(lgt, rgt); ++ if ( !err ) ++ { ++ if ( slot ) ++ { ++ op->flags & GNTMAP_readonly ? node.cnt.rd-- ++ : node.cnt.wr--; ++ radix_tree_replace_slot(slot, ++ radix_tree_ulong_to_ptr(node.raw)); ++ } ++ else ++ radix_tree_delete(&lgt->maptrack_tree, mfn_x(frame)); ++ } ++ + rc = GNTST_general_error; +- goto undo_out; + } ++ ++ grant_write_unlock(lgt); ++ ++ if ( rc != GNTST_okay ) ++ goto undo_out; + } + + TRACE_1D(TRC_MEM_PAGE_GRANT_MAP, op->dom); +@@ -1171,10 +1168,6 @@ map_grant_ref( + /* + * All maptrack entry users check mt->flags first before using the + * other fields so just ensure the flags field is stored last. +- * +- * However, if gnttab_need_iommu_mapping() then this would race +- * with a concurrent mapkind() call (on an unmap, for example) +- * and a lock is required. + */ + mt = &maptrack_entry(lgt, handle); + mt->domid = op->dom; +@@ -1182,9 +1175,6 @@ map_grant_ref( + smp_wmb(); + write_atomic(&mt->flags, op->flags); + +- if ( need_iommu ) +- double_gt_unlock(lgt, rgt); +- + op->dev_bus_addr = mfn_to_maddr(frame); + op->handle = handle; + op->status = GNTST_okay; +@@ -1411,19 +1401,34 @@ unmap_common( + + if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) ) + { +- unsigned int kind; ++ void **slot; ++ union maptrack_node node; + int err = 0; + +- double_gt_lock(lgt, rgt); ++ grant_write_lock(lgt); ++ slot = radix_tree_lookup_slot(&lgt->maptrack_tree, mfn_x(op->frame)); ++ node.raw = likely(slot) ? radix_tree_ptr_to_ulong(*slot) : 0; ++ ++ /* Refcount must not underflow. */ ++ if ( !(flags & GNTMAP_readonly ? node.cnt.rd-- ++ : node.cnt.wr--) ) ++ BUG(); + +- kind = mapkind(lgt, rd, op->frame); +- if ( !kind ) ++ if ( !node.raw ) + err = iommu_unmap_page(ld, mfn_x(op->frame)); +- else if ( !(kind & MAPKIND_WRITE) ) ++ else if ( !(flags & GNTMAP_readonly) && !node.cnt.wr ) + err = iommu_map_page(ld, mfn_x(op->frame), + mfn_x(op->frame), IOMMUF_readable); + +- double_gt_unlock(lgt, rgt); ++ if ( err ) ++ ; ++ else if ( !node.raw ) ++ radix_tree_delete(&lgt->maptrack_tree, mfn_x(op->frame)); ++ else ++ radix_tree_replace_slot(slot, ++ radix_tree_ulong_to_ptr(node.raw)); ++ ++ grant_write_unlock(lgt); + + if ( err ) + rc = GNTST_general_error; +@@ -1854,6 +1859,8 @@ grant_table_init(struct domain *d, struc + gt->maptrack = vzalloc(gt->max_maptrack_frames * sizeof(*gt->maptrack)); + if ( gt->maptrack == NULL ) + goto out; ++ ++ radix_tree_init(>->maptrack_tree); + } + + /* Shared grant table. */ +@@ -3643,6 +3650,8 @@ int gnttab_release_mappings(struct domai + + for ( handle = gt->maptrack_limit; handle; ) + { ++ mfn_t mfn; ++ + /* + * Deal with full pages such that their freeing (in the body of the + * if()) remains simple. 
+@@ -3744,17 +3753,31 @@ int gnttab_release_mappings(struct domai + if ( act->pin == 0 ) + gnttab_clear_flag(rd, _GTF_reading, status); + ++ mfn = act->frame; ++ + active_entry_release(act); + grant_read_unlock(rgt); + + rcu_unlock_domain(rd); + + map->flags = 0; ++ ++ /* ++ * This is excessive in that a single such call would suffice per ++ * mapped MFN (or none at all, if no entry was ever inserted). But it ++ * should be the common case for an MFN to be mapped just once, and ++ * this way we don't need to further maintain the counters. We also ++ * don't want to leave cleaning up of the tree as a whole to the end ++ * of the function, as this could take quite some time. ++ */ ++ radix_tree_delete(>->maptrack_tree, mfn_x(mfn)); + } + + gt->maptrack_limit = 0; + FREE_XENHEAP_PAGE(gt->maptrack[0]); + ++ radix_tree_destroy(>->maptrack_tree, NULL); ++ + return 0; + } + +--- a/xen/include/xen/radix-tree.h ++++ b/xen/include/xen/radix-tree.h +@@ -190,6 +190,25 @@ static inline int radix_tree_ptr_to_int( + return (int)((long)ptr >> 2); + } + ++/** ++ * radix_tree_{ulong_to_ptr,ptr_to_ulong}: ++ * ++ * Same for unsigned long values. Beware though that only BITS_PER_LONG-2 ++ * bits are actually usable for the value. ++ */ ++static inline void *radix_tree_ulong_to_ptr(unsigned long val) ++{ ++ unsigned long ptr = (val << 2) | 0x2; ++ ASSERT((ptr >> 2) == val); ++ return (void *)ptr; ++} ++ ++static inline unsigned long radix_tree_ptr_to_ulong(void *ptr) ++{ ++ ASSERT(((unsigned long)ptr & 0x3) == 0x2); ++ return (unsigned long)ptr >> 2; ++} ++ + int radix_tree_insert(struct radix_tree_root *, unsigned long, void *); + void *radix_tree_lookup(struct radix_tree_root *, unsigned long); + void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa382.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa382.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa382.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa382.patch 2022-06-17 09:29:25.000000000 +0100 @@ -0,0 +1,34 @@ +From: Jan Beulich +Subject: gnttab: fix array capacity check in gnttab_get_status_frames() + +The number of grant frames is of no interest here; converting the passed +in op.nr_frames this way means we allow for 8 times as many GFNs to be +written as actually fit in the array. We would corrupt xlat areas of +higher vCPU-s (after having faulted many times while trying to write to +the guard pages between any two areas) for 32-bit PV guests. For HVM +guests we'd simply crash as soon as we hit the first guard page, as +accesses to the xlat area are simply memcpy() there. + +This is CVE-2021-28699 / XSA-382. 
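
The arithmetic behind the overrun is easy to reproduce standalone. The 8:1 ratio and the demo_* names below are illustrative, taken only from the "8 times as many GFNs" statement above rather than from the Xen sources.

#include <stdio.h>

#define DEMO_RATIO 8u      /* status entries pack ~8x denser (per the text) */

static unsigned int demo_grant_to_status_frames(unsigned int nr)
{
    return (nr + DEMO_RATIO - 1) / DEMO_RATIO;
}

int main(void)
{
    unsigned int limit_max = 4;    /* capacity of the caller-visible array */
    unsigned int nr_frames = 32;   /* guest-chosen request */

    /* Old check: 32 requested frames shrink to 4 before the comparison. */
    int old_ok = demo_grant_to_status_frames(nr_frames) <= limit_max;
    /* New check: compare the number of GFNs actually written out. */
    int new_ok = nr_frames <= limit_max;

    printf("old check lets the request through: %d\n", old_ok);  /* 1 */
    printf("new check lets the request through: %d\n", new_ok);  /* 0 */
    return 0;
}
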
+ +Fixes: 18b1be5e324b ("gnttab: make resource limits per domain") +Signed-off-by: Jan Beulich + +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -3177,12 +3177,11 @@ gnttab_get_status_frames(XEN_GUEST_HANDL + goto unlock; + } + +- if ( unlikely(limit_max < grant_to_status_frames(op.nr_frames)) ) ++ if ( unlikely(limit_max < op.nr_frames) ) + { + gdprintk(XENLOG_WARNING, +- "grant_to_status_frames(%u) for d%d is too large (%u,%u)\n", +- op.nr_frames, d->domain_id, +- grant_to_status_frames(op.nr_frames), limit_max); ++ "nr_status_frames for %pd is too large (%u,%u)\n", ++ d, op.nr_frames, limit_max); + op.status = GNTST_general_error; + goto unlock; + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa384-4.11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa384-4.11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa384-4.11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa384-4.11.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,79 @@ +From: Jan Beulich +Subject: gnttab: deal with status frame mapping race + +Once gnttab_map_frame() drops the grant table lock, the MFN it reports +back to its caller is free to other manipulation. In particular +gnttab_unpopulate_status_frames() might free it, by a racing request on +another CPU, thus resulting in a reference to a deallocated page getting +added to a domain's P2M. + +Obtain a page reference in gnttab_map_frame() to prevent freeing of the +page until xenmem_add_to_physmap_one() has actually completed its acting +on the page. Do so uniformly, even if only strictly required for v2 +status pages, to avoid extra conditionals (which then would all need to +be kept in sync going forward). + +This is CVE-2021-28701 / XSA-384. + +Reported-by: Julien Grall +Signed-off-by: Jan Beulich +Reviewed-by: Julien Grall + +--- a/xen/arch/arm/mm.c ++++ b/xen/arch/arm/mm.c +@@ -1238,6 +1238,8 @@ int xenmem_add_to_physmap_one( + if ( rc ) + return rc; + ++ /* Need to take care of the reference obtained in gnttab_map_frame(). */ ++ page = mfn_to_page(mfn); + t = p2m_ram_rw; + + break; +@@ -1304,9 +1306,12 @@ int xenmem_add_to_physmap_one( + /* Map at new location. */ + rc = guest_physmap_add_entry(d, gfn, mfn, 0, t); + +- /* If we fail to add the mapping, we need to drop the reference we +- * took earlier on foreign pages */ +- if ( rc && space == XENMAPSPACE_gmfn_foreign ) ++ /* ++ * For XENMAPSPACE_gmfn_foreign if we failed to add the mapping, we need ++ * to drop the reference we took earlier. In all other cases we need to ++ * drop any reference we took earlier (perhaps indirectly). ++ */ ++ if ( space == XENMAPSPACE_gmfn_foreign ? rc : page != NULL ) + { + ASSERT(page != NULL); + put_page(page); +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -4751,6 +4751,8 @@ int xenmem_add_to_physmap_one( + rc = gnttab_map_frame(d, idx, gpfn, &mfn); + if ( rc ) + return rc; ++ /* Need to take care of the ref obtained in gnttab_map_frame(). */ ++ page = mfn_to_page(mfn); + break; + case XENMAPSPACE_gmfn: + { +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -3964,7 +3964,16 @@ int gnttab_map_frame(struct domain *d, u + *mfn, 0); + + if ( !rc ) +- gnttab_set_frame_gfn(gt, status, idx, gfn); ++ { ++ /* ++ * Make sure gnttab_unpopulate_status_frames() won't (successfully) ++ * free the page until our caller has completed its operation. 
++ */ ++ if ( get_page(mfn_to_page(*mfn), d) ) ++ gnttab_set_frame_gfn(gt, status, idx, gfn); ++ else ++ rc = -EBUSY; ++ } + + grant_write_unlock(gt); + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa385-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa385-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa385-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa385-4.12.patch 2022-06-01 21:38:43.000000000 +0100 @@ -0,0 +1,73 @@ +From: Julien Grall +Subject: xen/page_alloc: Harden assign_pages() + +domain_tot_pages() and d->max_pages are 32-bit values. While the order +should always be quite small, it would still be possible to overflow +if domain_tot_pages() is near to (2^32 - 1). + +As this code may be called by a guest via XENMEM_increase_reservation +and XENMEM_populate_physmap, we want to make sure the guest is not going +to be able to allocate more than it is allowed. + +Rework the allocation check to avoid any possible overflow. While the +check domain_tot_pages() < d->max_pages should technically not be +necessary, it is probably best to have it to catch any possible +inconsistencies in the future. + +This is CVE-2021-28706 / XSA-385. + +Signed-off-by: Julien Grall +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné + +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -2239,7 +2239,8 @@ gnttab_transfer( + * pages when it is dying. + */ + if ( unlikely(e->is_dying) || +- unlikely(e->tot_pages >= e->max_pages) ) ++ unlikely(e->tot_pages >= e->max_pages) || ++ unlikely(!(e->tot_pages + 1)) ) + { + spin_unlock(&e->page_alloc_lock); + +@@ -2248,8 +2249,8 @@ gnttab_transfer( + e->domain_id); + else + gdprintk(XENLOG_INFO, +- "Transferee d%d has no headroom (tot %u, max %u)\n", +- e->domain_id, e->tot_pages, e->max_pages); ++ "Transferee %pd has no headroom (tot %u, max %u)\n", ++ e, e->tot_pages, e->max_pages); + + gop.status = GNTST_general_error; + goto unlock_and_copyback; +--- a/xen/common/page_alloc.c ++++ b/xen/common/page_alloc.c +@@ -2278,12 +2278,21 @@ int assign_pages( + + if ( !(memflags & MEMF_no_refcount) ) + { +- if ( unlikely((d->tot_pages + (1 << order)) > d->max_pages) ) ++ unsigned int nr = 1u << order; ++ ++ if ( unlikely(d->tot_pages > d->max_pages) ) ++ { ++ gprintk(XENLOG_INFO, "Inconsistent allocation for %pd: %u > %u\n", ++ d, d->tot_pages, d->max_pages); ++ rc = -EPERM; ++ goto out; ++ } ++ ++ if ( unlikely(nr > d->max_pages - d->tot_pages) ) + { + if ( !tmem_enabled() || order != 0 || d->tot_pages != d->max_pages ) +- gprintk(XENLOG_INFO, "Over-allocation for domain %u: " +- "%u > %u\n", d->domain_id, +- d->tot_pages + (1 << order), d->max_pages); ++ gprintk(XENLOG_INFO, "Over-allocation for %pd: %Lu > %u\n", ++ d, d->tot_pages + 0ull + nr, d->max_pages); + rc = -E2BIG; + goto out; + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-1.patch 2022-06-02 07:31:01.000000000 +0100 @@ -0,0 +1,172 @@ +From: Jan Beulich +Subject: x86/PoD: deal with misaligned GFNs + +Users of XENMEM_decrease_reservation and XENMEM_populate_physmap aren't +required to pass in order-aligned GFN values. 
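
Such requests get split into naturally aligned chunks by the wrappers introduced further down; the idea can be tried in isolation with the sketch below. The demo_* names are invented and find_first_set_bit() is replaced by a GCC builtin; this is not the hypervisor code itself.

#include <stdio.h>

static unsigned int demo_lowest_set_bit(unsigned long v)
{
    return (unsigned int)__builtin_ctzl(v);   /* v is never 0 here */
}

static void demo_split(unsigned long gfn, unsigned int order)
{
    unsigned long left = 1UL << order;
    /*
     * Chunk order: the largest power of two dividing both the start GFN
     * and the total count.  As in the patch it is computed once, not
     * re-derived for the (better aligned) later chunks.
     */
    unsigned int chunk_order = demo_lowest_set_bit(gfn | left);

    do {
        printf("  chunk at gfn %#lx, order %u\n", gfn, chunk_order);
        left -= 1UL << chunk_order;
        gfn  += 1UL << chunk_order;
    } while ( left );
}

int main(void)
{
    demo_split(0x1234, 4);   /* misaligned start: four order-2 chunks */
    demo_split(0x2000, 9);   /* aligned start: one order-9 chunk */
    return 0;
}
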
(While I consider this +bogus, I don't think we can fix this there, as that might break existing +code, e.g Linux'es swiotlb, which - while affecting PV only - until +recently had been enforcing only page alignment on the original +allocation.) Only non-PoD code paths (guest_physmap_{add,remove}_page(), +p2m_set_entry()) look to be dealing with this properly (in part by being +implemented inefficiently, handling every 4k page separately). + +Introduce wrappers taking care of splitting the incoming request into +aligned chunks, without putting much effort in trying to determine the +largest possible chunk at every iteration. + +Also "handle" p2m_set_entry() failure for non-order-0 requests by +crashing the domain in one more place. Alongside putting a log message +there, also add one to the other similar path. + +Note regarding locking: This is left in the actual worker functions on +the assumption that callers aren't guaranteed atomicity wrt acting on +multiple pages at a time. For mis-aligned GFNs gfn_lock() wouldn't have +locked the correct GFN range anyway, if it didn't simply resolve to +p2m_lock(), and for well-behaved callers there continues to be only a +single iteration, i.e. behavior is unchanged for them. (FTAOD pulling +out just pod_lock() into p2m_pod_decrease_reservation() would result in +a lock order violation.) + +This is CVE-2021-28704 and CVE-2021-28707 / part of XSA-388. + +Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné + +--- a/xen/arch/x86/mm/p2m-pod.c ++++ b/xen/arch/x86/mm/p2m-pod.c +@@ -495,7 +495,7 @@ p2m_pod_zero_check_superpage(struct p2m_ + + + /* +- * This function is needed for two reasons: ++ * This pair of functions is needed for two reasons: + * + To properly handle clearing of PoD entries + * + To "steal back" memory being freed for the PoD cache, rather than + * releasing it. +@@ -503,8 +503,8 @@ p2m_pod_zero_check_superpage(struct p2m_ + * Once both of these functions have been completed, we can return and + * allow decrease_reservation() to handle everything else. + */ +-unsigned long +-p2m_pod_decrease_reservation(struct domain *d, gfn_t gfn, unsigned int order) ++static unsigned long ++decrease_reservation(struct domain *d, gfn_t gfn, unsigned int order) + { + unsigned long ret = 0, i, n; + struct p2m_domain *p2m = p2m_get_hostp2m(d); +@@ -551,8 +551,10 @@ p2m_pod_decrease_reservation(struct doma + * All PoD: Mark the whole region invalid and tell caller + * we're done. + */ +- if ( p2m_set_entry(p2m, gfn, INVALID_MFN, order, p2m_invalid, +- p2m->default_access) ) ++ int rc = p2m_set_entry(p2m, gfn, INVALID_MFN, order, p2m_invalid, ++ p2m->default_access); ++ ++ if ( rc ) + { + /* + * If this fails, we can't tell how much of the range was changed. +@@ -560,7 +562,12 @@ p2m_pod_decrease_reservation(struct doma + * impossible. 
+ */ + if ( order != 0 ) ++ { ++ printk(XENLOG_G_ERR ++ "%pd: marking GFN %#lx (order %u) as non-PoD failed: %d\n", ++ d, gfn_x(gfn), order, rc); + domain_crash(d); ++ } + goto out_unlock; + } + ret = 1UL << order; +@@ -667,6 +674,22 @@ out_unlock: + return ret; + } + ++unsigned long ++p2m_pod_decrease_reservation(struct domain *d, gfn_t gfn, unsigned int order) ++{ ++ unsigned long left = 1UL << order, ret = 0; ++ unsigned int chunk_order = find_first_set_bit(gfn_x(gfn) | left); ++ ++ do { ++ ret += decrease_reservation(d, gfn, chunk_order); ++ ++ left -= 1UL << chunk_order; ++ gfn = gfn_add(gfn, 1UL << chunk_order); ++ } while ( left ); ++ ++ return ret; ++} ++ + void p2m_pod_dump_data(struct domain *d) + { + struct p2m_domain *p2m = p2m_get_hostp2m(d); +@@ -1266,19 +1289,15 @@ remap_and_retry: + return true; + } + +- +-int +-guest_physmap_mark_populate_on_demand(struct domain *d, unsigned long gfn_l, +- unsigned int order) ++static int ++mark_populate_on_demand(struct domain *d, unsigned long gfn_l, ++ unsigned int order) + { + struct p2m_domain *p2m = p2m_get_hostp2m(d); + gfn_t gfn = _gfn(gfn_l); + unsigned long i, n, pod_count = 0; + int rc = 0; + +- if ( !paging_mode_translate(d) ) +- return -EINVAL; +- + gfn_lock(p2m, gfn, order); + + P2M_DEBUG("mark pod gfn=%#lx\n", gfn_l); +@@ -1316,10 +1335,42 @@ guest_physmap_mark_populate_on_demand(st + BUG_ON(p2m->pod.entry_count < 0); + pod_unlock(p2m); + } ++ else if ( order ) ++ { ++ /* ++ * If this failed, we can't tell how much of the range was changed. ++ * Best to crash the domain. ++ */ ++ printk(XENLOG_G_ERR ++ "%pd: marking GFN %#lx (order %u) as PoD failed: %d\n", ++ d, gfn_l, order, rc); ++ domain_crash(d); ++ } + + out: + gfn_unlock(p2m, gfn, order); + + return rc; + } ++ ++int ++guest_physmap_mark_populate_on_demand(struct domain *d, unsigned long gfn, ++ unsigned int order) ++{ ++ unsigned long left = 1UL << order; ++ unsigned int chunk_order = find_first_set_bit(gfn | left); ++ int rc; ++ ++ if ( !paging_mode_translate(d) ) ++ return -EINVAL; ++ ++ do { ++ rc = mark_populate_on_demand(d, gfn, chunk_order); ++ ++ left -= 1UL << chunk_order; ++ gfn += 1UL << chunk_order; ++ } while ( !rc && left ); ++ ++ return rc; ++} + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa388-4.14-2.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,36 @@ +From: Jan Beulich +Subject: x86/PoD: handle intermediate page orders in p2m_pod_cache_add() + +p2m_pod_decrease_reservation() may pass pages to the function which +aren't 4k, 2M, or 1G. Handle all intermediate orders as well, to avoid +hitting the BUG() at the switch() statement's "default" case. + +This is CVE-2021-28708 / part of XSA-388. + +Fixes: 3c352011c0d3 ("x86/PoD: shorten certain operations on higher order ranges") +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné + +--- a/xen/arch/x86/mm/p2m-pod.c ++++ b/xen/arch/x86/mm/p2m-pod.c +@@ -111,15 +111,13 @@ p2m_pod_cache_add(struct p2m_domain *p2m + /* Then add to the appropriate populate-on-demand list. */ + switch ( order ) + { +- case PAGE_ORDER_1G: +- for ( i = 0; i < (1UL << PAGE_ORDER_1G); i += 1UL << PAGE_ORDER_2M ) ++ case PAGE_ORDER_2M ... 
PAGE_ORDER_1G: ++ for ( i = 0; i < (1UL << order); i += 1UL << PAGE_ORDER_2M ) + page_list_add_tail(page + i, &p2m->pod.super); + break; +- case PAGE_ORDER_2M: +- page_list_add_tail(page, &p2m->pod.super); +- break; +- case PAGE_ORDER_4K: +- page_list_add_tail(page, &p2m->pod.single); ++ case PAGE_ORDER_4K ... PAGE_ORDER_2M - 1: ++ for ( i = 0; i < (1UL << order); i += 1UL << PAGE_ORDER_4K ) ++ page_list_add_tail(page + i, &p2m->pod.single); + break; + default: + BUG(); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa389-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa389-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa389-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa389-4.12.patch 2022-06-02 13:03:02.000000000 +0100 @@ -0,0 +1,178 @@ +From: Jan Beulich +Subject: x86/P2M: deal with partial success of p2m_set_entry() + +M2P and PoD stats need to remain in sync with P2M; if an update succeeds +only partially, respective adjustments need to be made. If updates get +made before the call, they may also need undoing upon complete failure +(i.e. including the single-page case). + +Log-dirty state would better also be kept in sync. + +Note that the change to set_typed_p2m_entry() may not be strictly +necessary (due to the order restriction enforced near the top of the +function), but is being kept here to be on the safe side. + +This is CVE-2021-28705 and CVE-2021-28709 / XSA-389. + +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -780,6 +780,7 @@ p2m_remove_page(struct p2m_domain *p2m, + gfn_t gfn = _gfn(gfn_l); + p2m_type_t t; + p2m_access_t a; ++ int rc; + + /* IOMMU for PV guests is handled in get_page_type() and put_page(). */ + if ( !paging_mode_translate(p2m->domain) ) +@@ -811,8 +812,27 @@ p2m_remove_page(struct p2m_domain *p2m, + set_gpfn_from_mfn(mfn+i, INVALID_M2P_ENTRY); + } + } +- return p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid, +- p2m->default_access); ++ rc = p2m_set_entry(p2m, gfn, INVALID_MFN, page_order, p2m_invalid, ++ p2m->default_access); ++ if ( likely(!rc) || !mfn_valid(_mfn(mfn)) ) ++ return rc; ++ ++ /* ++ * The operation may have partially succeeded. For the failed part we need ++ * to undo the M2P update and, out of precaution, mark the pages dirty ++ * again. ++ */ ++ for ( i = 0; i < (1UL << page_order); ++i ) ++ { ++ p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, NULL, NULL); ++ if ( !p2m_is_hole(t) && !p2m_is_special(t) && !p2m_is_shared(t) ) ++ { ++ set_gpfn_from_mfn(mfn + i, gfn_l + i); ++ paging_mark_pfn_dirty(p2m->domain, _pfn(gfn_l + i)); ++ } ++ } ++ ++ return rc; + } + + int +@@ -980,13 +1000,8 @@ guest_physmap_add_entry(struct domain *d + + /* Now, actually do the two-way mapping */ + rc = p2m_set_entry(p2m, gfn, mfn, page_order, t, p2m->default_access); +- if ( rc == 0 ) ++ if ( likely(!rc) ) + { +- pod_lock(p2m); +- p2m->pod.entry_count -= pod_count; +- BUG_ON(p2m->pod.entry_count < 0); +- pod_unlock(p2m); +- + if ( !p2m_is_grant(t) ) + { + for ( i = 0; i < (1UL << page_order); i++ ) +@@ -996,6 +1009,42 @@ guest_physmap_add_entry(struct domain *d + gfn_x(gfn_add(gfn, i))); + } + } ++ else ++ { ++ /* ++ * The operation may have partially succeeded. For the successful part ++ * we need to update M2P and dirty state, while for the failed part we ++ * may need to adjust PoD stats as well as undo the earlier M2P update. 
++ */ ++ for ( i = 0; i < (1UL << page_order); ++i ) ++ { ++ omfn = p2m->get_entry(p2m, gfn_add(gfn, i), &ot, &a, 0, NULL, NULL); ++ if ( p2m_is_pod(ot) ) ++ { ++ BUG_ON(!pod_count); ++ --pod_count; ++ } ++ else if ( mfn_eq(omfn, mfn_add(mfn, i)) && ot == t && ++ a == p2m->default_access && !p2m_is_grant(t) ) ++ { ++ set_gpfn_from_mfn(mfn_x(omfn), gfn_x(gfn) + i); ++ paging_mark_pfn_dirty(d, _pfn(gfn_x(gfn) + i)); ++ } ++ else if ( p2m_is_ram(ot) && !p2m_is_paged(ot) ) ++ { ++ ASSERT(mfn_valid(omfn)); ++ set_gpfn_from_mfn(mfn_x(omfn), gfn_x(gfn) + i); ++ } ++ } ++ } ++ ++ if ( pod_count ) ++ { ++ pod_lock(p2m); ++ p2m->pod.entry_count -= pod_count; ++ BUG_ON(p2m->pod.entry_count < 0); ++ pod_unlock(p2m); ++ } + + out: + p2m_unlock(p2m); +@@ -1278,6 +1329,47 @@ static int set_typed_p2m_entry(struct do + domain_crash(d); + return -EPERM; + } ++ ++ P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn_l, mfn_x(mfn)); ++ rc = p2m_set_entry(p2m, gfn, mfn, order, gfn_p2mt, access); ++ if ( unlikely(rc) ) ++ { ++ gdprintk(XENLOG_ERR, "p2m_set_entry: %#lx:%u -> %d (0x%"PRI_mfn")\n", ++ gfn_l, order, rc, mfn_x(mfn)); ++ ++ /* ++ * The operation may have partially succeeded. For the successful part ++ * we need to update PoD stats, M2P, and dirty state. ++ */ ++ if ( order != PAGE_ORDER_4K ) ++ { ++ unsigned long i; ++ ++ for ( i = 0; i < (1UL << order); ++i ) ++ { ++ p2m_type_t t; ++ mfn_t cmfn = p2m->get_entry(p2m, gfn_add(gfn, i), &t, &a, 0, ++ NULL, NULL); ++ ++ if ( !mfn_eq(cmfn, mfn_add(mfn, i)) || t != gfn_p2mt || ++ a != access ) ++ continue; ++ ++ if ( p2m_is_ram(ot) ) ++ { ++ ASSERT(mfn_valid(mfn_add(omfn, i))); ++ set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY); ++ } ++ else if ( p2m_is_pod(ot) ) ++ { ++ pod_lock(p2m); ++ BUG_ON(!p2m->pod.entry_count); ++ --p2m->pod.entry_count; ++ pod_unlock(p2m); ++ } ++ } ++ } ++ } + else if ( p2m_is_ram(ot) ) + { + unsigned long i; +@@ -1288,12 +1381,6 @@ static int set_typed_p2m_entry(struct do + set_gpfn_from_mfn(mfn_x(omfn) + i, INVALID_M2P_ENTRY); + } + } +- +- P2M_DEBUG("set %d %lx %lx\n", gfn_p2mt, gfn_l, mfn_x(mfn)); +- rc = p2m_set_entry(p2m, gfn, mfn, order, gfn_p2mt, access); +- if ( rc ) +- gdprintk(XENLOG_ERR, "p2m_set_entry: %#lx:%u -> %d (0x%"PRI_mfn")\n", +- gfn_l, order, rc, mfn_x(mfn)); + else if ( p2m_is_pod(ot) ) + { + pod_lock(p2m); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa394-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa394-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa394-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa394-4.12.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,59 @@ +From 604fb691eee5bbeba770126451d880b932565e65 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Wed, 5 Jan 2022 17:55:48 +0000 +Subject: [PATCH] xen/grant-table: Only decrement the refcounter when grant is + fully unmapped + +The grant unmapping hypercall (GNTTABOP_unmap_grant_ref) is not a +simple revert of the changes done by the grant mapping hypercall +(GNTTABOP_map_grant_ref). + +Instead, it is possible to partially (or even not) clear some flags. +This will leave the grant is mapped until a future call where all +the flags would be cleared. + +XSA-380 introduced a refcounting that is meant to only be dropped +when the grant is fully unmapped. Unfortunately, unmap_common() will +decrement the refcount for every successful call. + +A consequence is a domain would be able to underflow the refcount +and trigger a BUG(). 
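
The invariant being restored can be modelled in a few lines of plain C. The demo_* names are invented and the flag handling is heavily simplified; the point is only that a counter bumped once per maptrack handle must be dropped once per handle release, not once per unmap call.

#include <assert.h>
#include <stdbool.h>

struct demo_mapping {
    unsigned int flags;        /* union of the remaining map flags */
    unsigned int *mfn_count;   /* per-MFN counter, bumped once at map time */
};

static void demo_unmap(struct demo_mapping *m, unsigned int clear)
{
    m->flags &= ~clear;

    /*
     * The handle is only released once every flag is gone; decrement the
     * counter there, and not on every partially-clearing unmap, or it
     * underflows exactly as described above.
     */
    bool put_handle = (m->flags == 0);
    if ( put_handle )
    {
        assert(*m->mfn_count > 0);
        --*m->mfn_count;
    }
}
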
+ +Looking at the code, it is not clear to me why a domain would +want to partially clear some flags in the grant-table. But as +this is part of the ABI, it is better to not change the behavior +for now. + +Fix it by checking if the maptrack handle has been released before +decrementing the refcounting. + +This is CVE-2022-23034 / XSA-394. + +Fixes: 9781b51efde2 ("gnttab: replace mapkind()") +Signed-off-by: Julien Grall +Reviewed-by: Jan Beulich +--- + xen/common/grant_table.c | 7 ++++++- + 1 file changed, 6 insertions(+), 1 deletion(-) + +diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c +index ee5748e74eb9..61d29df7bdf6 100644 +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -1402,7 +1402,12 @@ unmap_common( + if ( put_handle ) + put_maptrack_handle(lgt, op->handle); + +- if ( rc == GNTST_okay && gnttab_need_iommu_mapping(ld) ) ++ /* ++ * map_grant_ref() will only increment the refcount (and update the ++ * IOMMU) once per mapping. So we only want to decrement it once the ++ * maptrack handle has been put, alongside the further IOMMU update. ++ */ ++ if ( put_handle && gnttab_need_iommu_mapping(ld) ) + { + void **slot; + union maptrack_node node; +-- +2.32.0 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa395-4.14.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa395-4.14.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa395-4.14.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa395-4.14.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,42 @@ +From 743348f5d545c7fff9cdea746840b795f5c26d43 Mon Sep 17 00:00:00 2001 +From: Julien Grall +Date: Wed, 5 Jan 2022 18:09:39 +0000 +Subject: [PATCH] passthrough/x86: stop pirq iteration immediately in case of + error + +pt_pirq_iterate() will iterate in batch over all the PIRQs. The outer +loop will bail out if 'rc' is non-zero but the inner loop will continue. + +This means 'rc' will get clobbered and we may miss any errors (such as +-ERESTART in the case of the callback pci_clean_dpci_irq()). + +This is CVE-2022-23035 / XSA-395. + +Fixes: c24536b636f2 ("replace d->nr_pirqs sized arrays with radix tree") +Fixes: f6dd295381f4 ("dpci: replace tasklet with softirq") +Signed-off-by: Julien Grall +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné +--- + xen/drivers/passthrough/io.c | 4 ++++ + 1 file changed, 4 insertions(+) + +diff --git a/xen/drivers/passthrough/io.c b/xen/drivers/passthrough/io.c +index 71eaf2c17e27..b6e88ebc8646 100644 +--- a/xen/drivers/passthrough/io.c ++++ b/xen/drivers/passthrough/io.c +@@ -810,7 +810,11 @@ int pt_pirq_iterate(struct domain *d, + + pirq = pirqs[i]->pirq; + if ( (pirq_dpci->flags & HVM_IRQ_DPCI_MAPPED) ) ++ { + rc = cb(d, pirq_dpci, arg); ++ if ( rc ) ++ break; ++ } + } + } while ( !rc && ++pirq < d->nr_pirqs && n == ARRAY_SIZE(pirqs) ); + +-- +2.32.0 + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa397-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa397-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa397-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa397-4.12.patch 2022-05-26 17:34:25.000000000 +0100 @@ -0,0 +1,98 @@ +From: Roger Pau Monne +Subject: x86/hap: do not switch on log dirty for VRAM tracking + +XEN_DMOP_track_dirty_vram possibly calls into paging_log_dirty_enable +when using HAP mode, and it can interact badly with other ongoing +paging domctls, as XEN_DMOP_track_dirty_vram is not holding the domctl +lock. 
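
As a rough standalone model of that interaction (all demo_* names invented; the real paths are paging_log_dirty_enable() and the XEN_DOMCTL_SHADOW_OP_OFF teardown): one path mutates shared log-dirty state without taking the lock that serialises the other, so the teardown's expectation that nothing remains allocated can be violated underneath it, as the assertion quoted below shows.

#include <pthread.h>

struct demo_logdirty {
    pthread_mutex_t domctl_lock;   /* taken by the domctl path only */
    unsigned int allocs;           /* pages backing the dirty bitmap */
};

static void demo_domctl_off(struct demo_logdirty *ld)
{
    pthread_mutex_lock(&ld->domctl_lock);
    ld->allocs = 0;                /* retire the old structures ... */
    /*
     * ... and expect them to stay retired.  With the unlocked path below
     * running concurrently, the real code's ASSERT on log_dirty.allocs
     * can fire instead.
     */
    pthread_mutex_unlock(&ld->domctl_lock);
}

/*
 * Pre-patch shape: reached from the dirty-VRAM DMOP without domctl_lock,
 * so it can repopulate freshly cleared slots in the middle of
 * demo_domctl_off() on another CPU.  The fix stops taking this path for
 * VRAM tracking altogether.
 */
static void demo_vram_track_buggy(struct demo_logdirty *ld)
{
    ld->allocs += 1;
}
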
+ +This was detected as a result of the following assert triggering when +doing repeated migrations of a HAP HVM domain with a stubdom: + +Assertion 'd->arch.paging.log_dirty.allocs == 0' failed at paging.c:198 +----[ Xen-4.17-unstable x86_64 debug=y Not tainted ]---- +CPU: 34 +RIP: e008:[] arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x6 +RFLAGS: 0000000000010206 CONTEXT: hypervisor (d0v23) +[...] +Xen call trace: + [] R arch/x86/mm/paging.c#paging_free_log_dirty_bitmap+0x606/0x63a + [] S xsm/flask/hooks.c#domain_has_perm+0x5a/0x67 + [] F paging_domctl+0x251/0xd41 + [] F paging_domctl_continuation+0x19d/0x202 + [] F pv_hypercall+0x150/0x2a7 + [] F lstar_enter+0x12d/0x140 + +Such assert triggered because the stubdom used +XEN_DMOP_track_dirty_vram while dom0 was in the middle of executing +XEN_DOMCTL_SHADOW_OP_OFF, and so log dirty become enabled while +retiring the old structures, thus leading to new entries being +populated in already clear slots. + +Fix this by not enabling log dirty for VRAM tracking, similar to what +is done when using shadow instead of HAP. Call +p2m_enable_hardware_log_dirty when enabling VRAM tracking in order to +get some hardware assistance if available. As a side effect the memory +pressure on the p2m pool should go down if only VRAM tracking is +enabled, as the dirty bitmap is no longer allocated. + +Note that paging_log_dirty_range (used to get the dirty bitmap for +VRAM tracking) doesn't use the log dirty bitmap, and instead relies on +checking whether each gfn on the range has been switched from +p2m_ram_logdirty to p2m_ram_rw in order to account for dirty pages. + +This is CVE-2022-26356 / XSA-397. + +Signed-off-by: Roger Pau Monné +Reviewed-by: Jan Beulich + +--- a/xen/include/asm-x86/paging.h ++++ b/xen/include/asm-x86/paging.h +@@ -144,9 +144,6 @@ void paging_log_dirty_range(struct domai + unsigned long nr, + uint8_t *dirty_bitmap); + +-/* enable log dirty */ +-int paging_log_dirty_enable(struct domain *d, bool_t log_global); +- + /* log dirty initialization */ + void paging_log_dirty_init(struct domain *d, const struct log_dirty_ops *ops); + +--- a/xen/arch/x86/mm/hap/hap.c ++++ b/xen/arch/x86/mm/hap/hap.c +@@ -69,13 +69,6 @@ int hap_track_dirty_vram(struct domain * + { + int size = (nr + BITS_PER_BYTE - 1) / BITS_PER_BYTE; + +- if ( !paging_mode_log_dirty(d) ) +- { +- rc = paging_log_dirty_enable(d, 0); +- if ( rc ) +- goto out; +- } +- + rc = -ENOMEM; + dirty_bitmap = vzalloc(size); + if ( !dirty_bitmap ) +@@ -107,6 +100,10 @@ int hap_track_dirty_vram(struct domain * + + paging_unlock(d); + ++ domain_pause(d); ++ p2m_enable_hardware_log_dirty(d); ++ domain_unpause(d); ++ + if ( oend > ostart ) + p2m_change_type_range(d, ostart, oend, + p2m_ram_logdirty, p2m_ram_rw); +--- a/xen/arch/x86/mm/paging.c ++++ b/xen/arch/x86/mm/paging.c +@@ -209,7 +209,7 @@ static int paging_free_log_dirty_bitmap( + return rc; + } + +-int paging_log_dirty_enable(struct domain *d, bool_t log_global) ++static int paging_log_dirty_enable(struct domain *d, bool_t log_global) + { + int ret; + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-1-xen-arm-Introduce-new-Arm-processors.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-1-xen-arm-Introduce-new-Arm-processors.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-1-xen-arm-Introduce-new-Arm-processors.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-1-xen-arm-Introduce-new-Arm-processors.patch 2022-06-02 13:15:13.000000000 +0100 @@ 
-0,0 +1,63 @@ +From f1346b2cfdbeb468b50be7b6f7aa38ce3c1acf2a Mon Sep 17 00:00:00 2001 +From: Bertrand Marquis +Date: Tue, 15 Feb 2022 10:37:51 +0000 +Subject: xen/arm: Introduce new Arm processors + +Add some new processor identifiers in processor.h and sync Xen +definitions with status of Linux 5.17 (declared in +arch/arm64/include/asm/cputype.h). + +This is part of XSA-398 / CVE-2022-23960. + +Signed-off-by: Bertrand Marquis +Acked-by: Julien Grall +(cherry picked from commit 35d1b85a6b43483f6bd007d48757434e54743e98) + +diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h +index 0f35ec59d15e..cd45fba9786f 100644 +--- a/xen/include/asm-arm/processor.h ++++ b/xen/include/asm-arm/processor.h +@@ -48,19 +48,43 @@ + #define ARM_CPU_PART_CORTEX_A17 0xC0E + #define ARM_CPU_PART_CORTEX_A15 0xC0F + #define ARM_CPU_PART_CORTEX_A53 0xD03 ++#define ARM_CPU_PART_CORTEX_A35 0xD04 ++#define ARM_CPU_PART_CORTEX_A55 0xD05 + #define ARM_CPU_PART_CORTEX_A57 0xD07 + #define ARM_CPU_PART_CORTEX_A72 0xD08 + #define ARM_CPU_PART_CORTEX_A73 0xD09 + #define ARM_CPU_PART_CORTEX_A75 0xD0A ++#define ARM_CPU_PART_CORTEX_A76 0xD0B ++#define ARM_CPU_PART_NEOVERSE_N1 0xD0C ++#define ARM_CPU_PART_CORTEX_A77 0xD0D ++#define ARM_CPU_PART_NEOVERSE_V1 0xD40 ++#define ARM_CPU_PART_CORTEX_A78 0xD41 ++#define ARM_CPU_PART_CORTEX_X1 0xD44 ++#define ARM_CPU_PART_CORTEX_A710 0xD47 ++#define ARM_CPU_PART_CORTEX_X2 0xD48 ++#define ARM_CPU_PART_NEOVERSE_N2 0xD49 ++#define ARM_CPU_PART_CORTEX_A78C 0xD4B + + #define MIDR_CORTEX_A12 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A12) + #define MIDR_CORTEX_A17 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A17) + #define MIDR_CORTEX_A15 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A15) + #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53) ++#define MIDR_CORTEX_A35 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A35) ++#define MIDR_CORTEX_A55 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A55) + #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57) + #define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72) + #define MIDR_CORTEX_A73 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A73) + #define MIDR_CORTEX_A75 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A75) ++#define MIDR_CORTEX_A76 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A76) ++#define MIDR_NEOVERSE_N1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N1) ++#define MIDR_CORTEX_A77 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A77) ++#define MIDR_NEOVERSE_V1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_V1) ++#define MIDR_CORTEX_A78 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78) ++#define MIDR_CORTEX_X1 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X1) ++#define MIDR_CORTEX_A710 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A710) ++#define MIDR_CORTEX_X2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_X2) ++#define MIDR_NEOVERSE_N2 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_NEOVERSE_N2) ++#define MIDR_CORTEX_A78C MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A78C) + + /* MPIDR Multiprocessor Affinity Register */ + #define _MPIDR_UP (30) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-2-xen-arm-move-errata-CSV2-check-earlier.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-2-xen-arm-move-errata-CSV2-check-earlier.patch --- 
xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-2-xen-arm-move-errata-CSV2-check-earlier.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-2-xen-arm-move-errata-CSV2-check-earlier.patch 2022-05-26 17:34:26.000000000 +0100 @@ -0,0 +1,53 @@ +From 35164a1704fe13e1f83dbd4b5b79838f07d564c6 Mon Sep 17 00:00:00 2001 +From: Bertrand Marquis +Date: Tue, 15 Feb 2022 10:39:47 +0000 +Subject: xen/arm: move errata CSV2 check earlier + +CSV2 availability check is done after printing to the user that +workaround 1 will be used. Move the check before to prevent saying to the +user that workaround 1 is used when it is not because it is not needed. +This will also allow to reuse install_bp_hardening_vec function for +other use cases. + +Code previously returning "true", now returns "0" to conform to +enable_smccc_arch_workaround_1 returning an int and surrounding code +doing a "return 0" if workaround is not needed. + +This is part of XSA-398 / CVE-2022-23960. + +Signed-off-by: Bertrand Marquis +Reviewed-by: Julien Grall +(cherry picked from commit 599616d70eb886b9ad0ef9d6b51693ce790504ba) + +diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c +index b254b9865783..9e1ecd071470 100644 +--- a/xen/arch/arm/cpuerrata.c ++++ b/xen/arch/arm/cpuerrata.c +@@ -102,13 +102,6 @@ install_bp_hardening_vec(const struct arm_cpu_capabilities *entry, + printk(XENLOG_INFO "CPU%u will %s on exception entry\n", + smp_processor_id(), desc); + +- /* +- * No need to install hardened vector when the processor has +- * ID_AA64PRF0_EL1.CSV2 set. +- */ +- if ( cpu_data[smp_processor_id()].pfr64.csv2 ) +- return true; +- + spin_lock(&bp_lock); + + /* +@@ -167,6 +160,13 @@ static int enable_smccc_arch_workaround_1(void *data) + if ( !entry->matches(entry) ) + return 0; + ++ /* ++ * No need to install hardened vector when the processor has ++ * ID_AA64PRF0_EL1.CSV2 set. ++ */ ++ if ( cpu_data[smp_processor_id()].pfr64.csv2 ) ++ return 0; ++ + if ( smccc_ver < SMCCC_VERSION(1, 1) ) + goto warn; + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-3-xen-arm-Add-ECBHB-and-CLEARBHB-ID-fields.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-3-xen-arm-Add-ECBHB-and-CLEARBHB-ID-fields.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-3-xen-arm-Add-ECBHB-and-CLEARBHB-ID-fields.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-3-xen-arm-Add-ECBHB-and-CLEARBHB-ID-fields.patch 2022-06-05 19:16:55.000000000 +0100 @@ -0,0 +1,76 @@ +From 2e519fd8c1e3e7ae5370a6638615d2a52169db28 Mon Sep 17 00:00:00 2001 +From: Bertrand Marquis +Date: Wed, 23 Feb 2022 09:42:18 +0000 +Subject: xen/arm: Add ECBHB and CLEARBHB ID fields + +Introduce ID coprocessor register ID_AA64ISAR2_EL1. +Add definitions in cpufeature and sysregs of ECBHB field in mmfr1 and +CLEARBHB in isar2 ID coprocessor registers. + +This is part of XSA-398 / CVE-2022-23960. 
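
The point of the reordering is simply that the "is the hardware already safe" test has to precede both the message and the install. A minimal sketch with invented demo_* helpers (not the real cpuerrata.c functions):

#include <stdbool.h>
#include <stdio.h>

static bool demo_cpu_has_csv2(void) { return true; }   /* stub */
static void demo_install_hardened_vectors(void) { }    /* stub */

static int demo_enable_workaround_1(void)
{
    /*
     * Decide first whether the workaround is needed at all; otherwise
     * the "will use workaround 1" message gets printed for CPUs that
     * then install nothing.
     */
    if ( demo_cpu_has_csv2() )
        return 0;

    printf("CPU will use workaround 1 on exception entry\n");
    demo_install_hardened_vectors();
    return 0;
}
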
+ +Signed-off-by: Bertrand Marquis +Acked-by: Julien Grall +(cherry picked from commit 4b68d12d98b8790d8002fcc2c25a9d713374a4d7) + +diff --git a/xen/arch/arm/cpu.c b/xen/arch/arm/cpu.c +index 44126dbf0723..13dac7ccaf94 100644 +--- a/xen/arch/arm/cpu.c ++++ b/xen/arch/arm/cpu.c +@@ -36,6 +36,7 @@ void identify_cpu(struct cpuinfo_arm *c) + + c->isa64.bits[0] = READ_SYSREG64(ID_AA64ISAR0_EL1); + c->isa64.bits[1] = READ_SYSREG64(ID_AA64ISAR1_EL1); ++ c->isa64.bits[2] = READ_SYSREG64(ID_AA64ISAR2_EL1); + #endif + + c->pfr32.bits[0] = READ_SYSREG32(ID_PFR0_EL1); +diff --git a/xen/include/asm-arm/arm64/sysregs.h b/xen/include/asm-arm/arm64/sysregs.h +index 08585a969ebd..5f1e9b998f33 100644 +--- a/xen/include/asm-arm/arm64/sysregs.h ++++ b/xen/include/asm-arm/arm64/sysregs.h +@@ -166,6 +166,10 @@ + #define ICH_AP1R2_EL2 __AP1Rx_EL2(2) + #define ICH_AP1R3_EL2 __AP1Rx_EL2(3) + ++#ifndef ID_AA64ISAR2_EL1 ++#define ID_AA64ISAR2_EL1 S3_0_C0_C6_2 ++#endif ++ + #endif /* _ASM_ARM_ARM64_SYSREGS_H */ + + /* +diff --git a/xen/include/asm-arm/processor.h b/xen/include/asm-arm/processor.h +index 60e677d84200..c748fc17fe66 100644 +--- a/xen/include/asm-arm/processor.h ++++ b/xen/include/asm-arm/processor.h +@@ -425,12 +425,26 @@ struct cpuinfo_arm { + unsigned long lo:4; + unsigned long pan:4; + unsigned long __res1:8; +- unsigned long __res2:32; ++ unsigned long __res2:28; ++ unsigned long ecbhb:4; + }; + } mm64; + +- struct { +- uint64_t bits[2]; ++ union { ++ uint64_t bits[3]; ++ struct { ++ /* ISAR0 */ ++ unsigned long __res0:64; ++ ++ /* ISAR1 */ ++ unsigned long __res1:64; ++ ++ /* ISAR2 */ ++ unsigned long __res3:28; ++ unsigned long clearbhb:4; ++ ++ unsigned long __res4:32; ++ }; + } isa64; + + #endif diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-4-xen-arm-Add-Spectre-BHB-handling.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-4-xen-arm-Add-Spectre-BHB-handling.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-4-xen-arm-Add-Spectre-BHB-handling.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-4-xen-arm-Add-Spectre-BHB-handling.patch 2022-06-19 23:22:25.000000000 +0100 @@ -0,0 +1,358 @@ +From d340fad8be324e1760ea29d7c25658a8aec83306 Mon Sep 17 00:00:00 2001 +From: Rahul Singh +Date: Mon, 14 Feb 2022 18:47:32 +0000 +Subject: xen/arm: Add Spectre BHB handling + +This commit is adding Spectre BHB handling to Xen on Arm. +The commit is introducing new alternative code to be executed during +exception entry: +- SMCC workaround 3 call +- loop workaround (with 8, 24 or 32 iterations) +- use of new clearbhb instruction + +Cpuerrata is modified by this patch to apply the required workaround for +CPU affected by Spectre BHB when CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR is +enabled. + +To do this the system previously used to apply smcc workaround 1 is +reused and new alternative code to be copied in the exception handler is +introduced. + +To define the type of workaround required by a processor, 4 new cpu +capabilities are introduced (for each number of loop and for smcc +workaround 3). + +When a processor is affected, enable_spectre_bhb_workaround is called +and if the processor does not have CSV2 set to 3 or ECBHB feature (which +would mean that the processor is doing what is required in hardware), +the proper code is enabled at exception entry. + +In the case where workaround 3 is not supported by the firmware, we +enable workaround 1 when possible as it will also mitigate Spectre BHB +on systems without CSV2. 
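
The selection logic described above reduces to a small decision: do nothing when the hardware already copes, otherwise install whichever variant the per-core classification asked for, falling back from workaround 3 to workaround 1 when the firmware lacks the former. The sketch below uses invented demo_* types and stubs; the real classification and vector patching live in the patched cpuerrata.c and bpi.S.

#include <stdbool.h>

enum demo_bhb_kind {
    DEMO_BHB_LOOP_8, DEMO_BHB_LOOP_24, DEMO_BHB_LOOP_32,
    DEMO_BHB_CLEARBHB,
    DEMO_BHB_SMCC_WA3,
};

struct demo_cpu {
    bool csv2_full;      /* CSV2 set to 3, per the text */
    bool ecbhb;          /* branch history not exploitable */
    bool fw_has_wa3;     /* firmware implements workaround 3 */
    bool fw_has_wa1;     /* firmware implements workaround 1 */
};

static void demo_install(enum demo_bhb_kind kind) { (void)kind; }  /* stub */
static void demo_install_wa1(void) { }                             /* stub */

static void demo_enable_bhb(const struct demo_cpu *c, enum demo_bhb_kind kind)
{
    /* Nothing to patch into the vectors if the hardware copes already. */
    if ( c->csv2_full || c->ecbhb )
        return;

    if ( kind == DEMO_BHB_SMCC_WA3 && !c->fw_has_wa3 )
    {
        /* Per the text above, workaround 1 still helps on these systems. */
        if ( c->fw_has_wa1 )
            demo_install_wa1();
        return;
    }

    demo_install(kind);   /* loop (8/24/32), clearbhb, or SMC call variant */
}
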
+ +This is part of XSA-398 / CVE-2022-23960. + +Signed-off-by: Bertrand Marquis +Signed-off-by: Rahul Singh +Acked-by: Julien Grall +(cherry picked from commit 62c91eb66a2904eefb1d1d9642e3697a1e3c3a3c) + +diff --git a/xen/arch/arm/arm64/bpi.S b/xen/arch/arm/arm64/bpi.S +index d8743d955c4a..4e6382522048 100644 +--- a/xen/arch/arm/arm64/bpi.S ++++ b/xen/arch/arm/arm64/bpi.S +@@ -16,6 +16,7 @@ + * along with this program. If not, see . + */ + ++#include + #include + + .macro ventry target +@@ -58,16 +58,42 @@ ENTRY(__bp_harden_hyp_vecs_start) + .endr + ENTRY(__bp_harden_hyp_vecs_end) + +-ENTRY(__smccc_workaround_1_smc_start) ++.macro mitigate_spectre_bhb_loop count ++ENTRY(__mitigate_spectre_bhb_loop_start_\count) ++ stp x0, x1, [sp, #-16]! ++ mov x0, \count ++.Lspectre_bhb_loop\@: ++ b . + 4 ++ subs x0, x0, #1 ++ b.ne .Lspectre_bhb_loop\@ ++ sb ++ ldp x0, x1, [sp], #16 ++ENTRY(__mitigate_spectre_bhb_loop_end_\count) ++.endm ++ ++.macro smccc_workaround num smcc_id ++ENTRY(__smccc_workaround_smc_start_\num) + sub sp, sp, #(8 * 4) + stp x0, x1, [sp, #(8 * 2)] + stp x2, x3, [sp, #(8 * 0)] +- mov w0, #ARM_SMCCC_ARCH_WORKAROUND_1_FID ++ mov w0, \smcc_id + smc #0 + ldp x2, x3, [sp, #(8 * 0)] + ldp x0, x1, [sp, #(8 * 2)] + add sp, sp, #(8 * 4) +-ENTRY(__smccc_workaround_1_smc_end) ++ENTRY(__smccc_workaround_smc_end_\num) ++.endm ++ ++ENTRY(__mitigate_spectre_bhb_clear_insn_start) ++ clearbhb ++ isb ++ENTRY(__mitigate_spectre_bhb_clear_insn_end) ++ ++mitigate_spectre_bhb_loop 8 ++mitigate_spectre_bhb_loop 24 ++mitigate_spectre_bhb_loop 32 ++smccc_workaround 1, #ARM_SMCCC_ARCH_WORKAROUND_1_FID ++smccc_workaround 3, #ARM_SMCCC_ARCH_WORKAROUND_3_FID + + /* + * Local variables: +diff --git a/xen/include/asm-arm/arm64/macros.h b/xen/include/asm-arm/arm64/macros.h +index 9c5e676b37..a13ad8e2b1 100644 +--- a/xen/include/asm-arm/arm64/macros.h ++++ b/xen/include/asm-arm/arm64/macros.h +@@ -21,5 +21,10 @@ + ldr \dst, [\dst, \tmp] + .endm + ++ /* clearbhb instruction clearing the branch history */ ++ .macro clearbhb ++ hint #22 ++ .endm ++ + #endif /* __ASM_ARM_ARM64_MACROS_H */ + +diff --git a/xen/arch/arm/cpuerrata.c b/xen/arch/arm/cpuerrata.c +index 9e1ecd071470..d70d1e16e946 100644 +--- a/xen/arch/arm/cpuerrata.c ++++ b/xen/arch/arm/cpuerrata.c +@@ -142,7 +142,16 @@ install_bp_hardening_vec(const struct arm_cpu_capabilities *entry, + return ret; + } + +-extern char __smccc_workaround_1_smc_start[], __smccc_workaround_1_smc_end[]; ++extern char __smccc_workaround_smc_start_1[], __smccc_workaround_smc_end_1[]; ++extern char __smccc_workaround_smc_start_3[], __smccc_workaround_smc_end_3[]; ++extern char __mitigate_spectre_bhb_clear_insn_start[], ++ __mitigate_spectre_bhb_clear_insn_end[]; ++extern char __mitigate_spectre_bhb_loop_start_8[], ++ __mitigate_spectre_bhb_loop_end_8[]; ++extern char __mitigate_spectre_bhb_loop_start_24[], ++ __mitigate_spectre_bhb_loop_end_24[]; ++extern char __mitigate_spectre_bhb_loop_start_32[], ++ __mitigate_spectre_bhb_loop_end_32[]; + + static int enable_smccc_arch_workaround_1(void *data) + { +@@ -174,8 +183,8 @@ static int enable_smccc_arch_workaround_1(void *data) + if ( (int)res.a0 < 0 ) + goto warn; + +- return !install_bp_hardening_vec(entry,__smccc_workaround_1_smc_start, +- __smccc_workaround_1_smc_end, ++ return !install_bp_hardening_vec(entry,__smccc_workaround_smc_start_1, ++ __smccc_workaround_smc_end_1, + "call ARM_SMCCC_ARCH_WORKAROUND_1"); + + warn: +@@ -190,6 +199,93 @@ static int enable_smccc_arch_workaround_1(void *data) + return 0; + } + ++/* ++ * Spectre 
BHB Mitigation ++ * ++ * CPU is either: ++ * - Having CVS2.3 so it is not affected. ++ * - Having ECBHB and is clearing the branch history buffer when an exception ++ * to a different exception level is happening so no mitigation is needed. ++ * - Mitigating using a loop on exception entry (number of loop depending on ++ * the CPU). ++ * - Mitigating using the firmware. ++ */ ++static int enable_spectre_bhb_workaround(void *data) ++{ ++ const struct arm_cpu_capabilities *entry = data; ++ ++ /* ++ * Enable callbacks are called on every CPU based on the capabilities, so ++ * double-check whether the CPU matches the entry. ++ */ ++ if ( !entry->matches(entry) ) ++ return 0; ++ ++ if ( cpu_data[smp_processor_id()].pfr64.csv2 == 3 ) ++ return 0; ++ ++ if ( cpu_data[smp_processor_id()].mm64.ecbhb ) ++ return 0; ++ ++ if ( cpu_data[smp_processor_id()].isa64.clearbhb ) ++ return !install_bp_hardening_vec(entry, ++ __mitigate_spectre_bhb_clear_insn_start, ++ __mitigate_spectre_bhb_clear_insn_end, ++ "use clearBHB instruction"); ++ ++ /* Apply solution depending on hwcaps set on arm_errata */ ++ if ( cpus_have_cap(ARM_WORKAROUND_BHB_LOOP_8) ) ++ return !install_bp_hardening_vec(entry, ++ __mitigate_spectre_bhb_loop_start_8, ++ __mitigate_spectre_bhb_loop_end_8, ++ "use 8 loops workaround"); ++ ++ if ( cpus_have_cap(ARM_WORKAROUND_BHB_LOOP_24) ) ++ return !install_bp_hardening_vec(entry, ++ __mitigate_spectre_bhb_loop_start_24, ++ __mitigate_spectre_bhb_loop_end_24, ++ "use 24 loops workaround"); ++ ++ if ( cpus_have_cap(ARM_WORKAROUND_BHB_LOOP_32) ) ++ return !install_bp_hardening_vec(entry, ++ __mitigate_spectre_bhb_loop_start_32, ++ __mitigate_spectre_bhb_loop_end_32, ++ "use 32 loops workaround"); ++ ++ if ( cpus_have_cap(ARM_WORKAROUND_BHB_SMCC_3) ) ++ { ++ struct arm_smccc_res res; ++ ++ if ( smccc_ver < SMCCC_VERSION(1, 1) ) ++ goto warn; ++ ++ arm_smccc_1_1_smc(ARM_SMCCC_ARCH_FEATURES_FID, ++ ARM_SMCCC_ARCH_WORKAROUND_3_FID, &res); ++ /* The return value is in the lower 32-bits. */ ++ if ( (int)res.a0 < 0 ) ++ { ++ /* ++ * On processor affected with CSV2=0, workaround 1 will mitigate ++ * both Spectre v2 and BHB so use it when available ++ */ ++ if ( enable_smccc_arch_workaround_1(data) ) ++ return 1; ++ ++ goto warn; ++ } ++ ++ return !install_bp_hardening_vec(entry,__smccc_workaround_smc_start_3, ++ __smccc_workaround_smc_end_3, ++ "call ARM_SMCCC_ARCH_WORKAROUND_3"); ++ } ++ ++warn: ++ printk_once("**** No support for any spectre BHB workaround. ****\n" ++ "**** Please update your firmware. 
****\n"); ++ ++ return 0; ++} ++ + #endif /* CONFIG_ARM64_HARDEN_BRANCH_PREDICTOR */ + + /* Hardening Branch predictor code for Arm32 */ +@@ -449,19 +545,77 @@ static const struct arm_cpu_capabilities arm_errata[] = { + }, + { + .capability = ARM_HARDEN_BRANCH_PREDICTOR, +- MIDR_ALL_VERSIONS(MIDR_CORTEX_A72), ++ MIDR_RANGE(MIDR_CORTEX_A72, 0, 1 << MIDR_VARIANT_SHIFT), + .enable = enable_smccc_arch_workaround_1, + }, + { +- .capability = ARM_HARDEN_BRANCH_PREDICTOR, ++ .capability = ARM_WORKAROUND_BHB_SMCC_3, + MIDR_ALL_VERSIONS(MIDR_CORTEX_A73), +- .enable = enable_smccc_arch_workaround_1, ++ .enable = enable_spectre_bhb_workaround, + }, + { +- .capability = ARM_HARDEN_BRANCH_PREDICTOR, ++ .capability = ARM_WORKAROUND_BHB_SMCC_3, + MIDR_ALL_VERSIONS(MIDR_CORTEX_A75), +- .enable = enable_smccc_arch_workaround_1, ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ /* spectre BHB */ ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_8, ++ MIDR_RANGE(MIDR_CORTEX_A72, 1 << MIDR_VARIANT_SHIFT, ++ (MIDR_VARIANT_MASK | MIDR_REVISION_MASK)), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_24, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_A76), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_24, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_A77), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_A78), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_A78C), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_X1), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_X2), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_CORTEX_A710), ++ .enable = enable_spectre_bhb_workaround, + }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_24, ++ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N1), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_N2), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ { ++ .capability = ARM_WORKAROUND_BHB_LOOP_32, ++ MIDR_ALL_VERSIONS(MIDR_NEOVERSE_V1), ++ .enable = enable_spectre_bhb_workaround, ++ }, ++ + #endif + #ifdef CONFIG_ARM32_HARDEN_BRANCH_PREDICTOR + { +diff --git a/xen/include/asm-arm/cpufeature.h b/xen/include/asm-arm/cpufeature.h +index c748fc17fe66..87989eac6fc2 100644 +--- a/xen/include/asm-arm/cpufeature.h ++++ b/xen/include/asm-arm/cpufeature.h +@@ -44,8 +44,12 @@ + #define SKIP_CTXT_SWITCH_SERROR_SYNC 6 + #define ARM_HARDEN_BRANCH_PREDICTOR 7 + #define ARM_SSBD 8 ++#define ARM_WORKAROUND_BHB_LOOP_8 9 ++#define ARM_WORKAROUND_BHB_LOOP_24 10 ++#define ARM_WORKAROUND_BHB_LOOP_32 11 ++#define ARM_WORKAROUND_BHB_SMCC_3 12 + +-#define ARM_NCAPS 9 ++#define ARM_NCAPS 13 + + #ifndef __ASSEMBLY__ + +diff --git a/xen/include/asm-arm/smccc.h b/xen/include/asm-arm/smccc.h +index 126399dd7088..2abbffc3bd8a 100644 +--- a/xen/include/asm-arm/smccc.h ++++ b/xen/include/asm-arm/smccc.h +@@ -274,6 +274,12 @@ void __arm_smccc_1_0_smc(register_t a0, register_t a1, register_t a2, + ARM_SMCCC_OWNER_ARCH, \ + 0x7FFF) + ++#define ARM_SMCCC_ARCH_WORKAROUND_3_FID \ ++ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \ ++ ARM_SMCCC_CONV_32, \ ++ 
ARM_SMCCC_OWNER_ARCH, \ ++ 0x3FFF) ++ + /* SMCCC error codes */ + #define ARM_SMCCC_NOT_REQUIRED (-2) + #define ARM_SMCCC_ERR_UNKNOWN_FUNCTION (-1) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-5-xen-arm-Allow-to-discover-and-use-SMCCC_ARCH_WORKARO.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-5-xen-arm-Allow-to-discover-and-use-SMCCC_ARCH_WORKARO.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-5-xen-arm-Allow-to-discover-and-use-SMCCC_ARCH_WORKARO.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-5-xen-arm-Allow-to-discover-and-use-SMCCC_ARCH_WORKARO.patch 2022-05-26 17:34:26.000000000 +0100 @@ -0,0 +1,91 @@ +From 21f5a7b22687aa1e384782c8a1c04148f288ad9f Mon Sep 17 00:00:00 2001 +From: Bertrand Marquis +Date: Thu, 17 Feb 2022 14:52:54 +0000 +Subject: xen/arm: Allow to discover and use SMCCC_ARCH_WORKAROUND_3 + +Allow guest to discover whether or not SMCCC_ARCH_WORKAROUND_3 is +supported and create a fastpath in the code to handle guests request to +do the workaround. + +The function SMCCC_ARCH_WORKAROUND_3 will be called by the guest for +flushing the branch history. So we want the handling to be as fast as +possible. + +As the mitigation is applied on every guest exit, we can check for the +call before saving all context and return very early. + +This is part of XSA-398 / CVE-2022-23960. + +Signed-off-by: Bertrand Marquis +Reviewed-by: Julien Grall +(cherry picked from commit c0a56ea0fd92ecb471936b7355ddbecbaea3707c) + +diff --git a/xen/arch/arm/arm64/entry.S b/xen/arch/arm/arm64/entry.S +index 97bd06217bcd..788d0a1912f0 100644 +--- a/xen/arch/arm/arm64/entry.S ++++ b/xen/arch/arm/arm64/entry.S +@@ -343,16 +343,26 @@ guest_sync: + cbnz x1, guest_sync_slowpath /* should be 0 for HVC #0 */ + + /* +- * Fastest path possible for ARM_SMCCC_ARCH_WORKAROUND_1. +- * The workaround has already been applied on the exception ++ * Fastest path possible for ARM_SMCCC_ARCH_WORKAROUND_1 and ++ * ARM_SMCCC_ARCH_WORKAROUND_3. ++ * The workaround needed has already been applied on the exception + * entry from the guest, so let's quickly get back to the guest. + * + * Note that eor is used because the function identifier cannot + * be encoded as an immediate for cmp. + */ + eor w0, w0, #ARM_SMCCC_ARCH_WORKAROUND_1_FID +- cbnz w0, check_wa2 ++ cbz w0, fastpath_out_workaround + ++ /* ARM_SMCCC_ARCH_WORKAROUND_2 handling */ ++ eor w0, w0, #(ARM_SMCCC_ARCH_WORKAROUND_1_FID ^ ARM_SMCCC_ARCH_WORKAROUND_2_FID) ++ cbz w0, wa2_ssbd ++ ++ /* Fastpath out for ARM_SMCCC_ARCH_WORKAROUND_3 */ ++ eor w0, w0, #(ARM_SMCCC_ARCH_WORKAROUND_2_FID ^ ARM_SMCCC_ARCH_WORKAROUND_3_FID) ++ cbnz w0, guest_sync_slowpath ++ ++fastpath_out_workaround: + /* + * Clobber both x0 and x1 to prevent leakage. Note that thanks + * the eor, x0 = 0. 
+@@ -361,10 +371,7 @@ guest_sync: + eret + sb + +-check_wa2: +- /* ARM_SMCCC_ARCH_WORKAROUND_2 handling */ +- eor w0, w0, #(ARM_SMCCC_ARCH_WORKAROUND_1_FID ^ ARM_SMCCC_ARCH_WORKAROUND_2_FID) +- cbnz w0, guest_sync_slowpath ++wa2_ssbd: + #ifdef CONFIG_ARM_SSBD + alternative_cb arm_enable_wa2_handling + b wa2_end +diff --git a/xen/arch/arm/vsmc.c b/xen/arch/arm/vsmc.c +index ecf4faa13da3..643976db6537 100644 +--- a/xen/arch/arm/vsmc.c ++++ b/xen/arch/arm/vsmc.c +@@ -123,6 +123,10 @@ static bool handle_arch(struct cpu_user_regs *regs) + break; + } + break; ++ case ARM_SMCCC_ARCH_WORKAROUND_3_FID: ++ if ( cpus_have_cap(ARM_WORKAROUND_BHB_SMCC_3) ) ++ ret = 0; ++ break; + } + + set_user_reg(regs, 0, ret); +@@ -131,6 +135,7 @@ static bool handle_arch(struct cpu_user_regs *regs) + } + + case ARM_SMCCC_ARCH_WORKAROUND_1_FID: ++ case ARM_SMCCC_ARCH_WORKAROUND_3_FID: + /* No return value */ + return true; + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-6-x86-spec-ctrl-Cease-using-thunk-lfence-on-AMD.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-6-x86-spec-ctrl-Cease-using-thunk-lfence-on-AMD.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-6-x86-spec-ctrl-Cease-using-thunk-lfence-on-AMD.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa398-4.12-6-x86-spec-ctrl-Cease-using-thunk-lfence-on-AMD.patch 2022-06-05 22:17:52.000000000 +0100 @@ -0,0 +1,39 @@ +From 944afa38d9339a67f0164d07fb7ac8a54e9a4c60 Mon Sep 17 00:00:00 2001 +From: Andrew Cooper +Date: Mon, 7 Mar 2022 16:35:52 +0000 +Subject: x86/spec-ctrl: Cease using thunk=lfence on AMD + +AMD have updated their Spectre v2 guidance, and lfence/jmp is no longer +considered safe. AMD are recommending using retpoline everywhere. + +Update the default heuristics to never select THUNK_LFENCE. + +This is part of XSA-398 / CVE-2021-26401. + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich +(cherry picked from commit 8d03080d2a339840d3a59e0932a94f804e45110d) + +diff --git a/xen/arch/x86/spec_ctrl.c b/xen/arch/x86/spec_ctrl.c +index e2fcefc86a60..866b864918fd 100644 +--- a/xen/arch/x86/spec_ctrl.c ++++ b/xen/arch/x86/spec_ctrl.c +@@ -953,16 +953,10 @@ void __init init_speculation_mitigations(void) + if ( IS_ENABLED(CONFIG_INDIRECT_THUNK) ) + { + /* +- * AMD's recommended mitigation is to set lfence as being dispatch +- * serialising, and to use IND_THUNK_LFENCE. +- */ +- if ( cpu_has_lfence_dispatch ) +- thunk = THUNK_LFENCE; +- /* +- * On Intel hardware, we'd like to use retpoline in preference to ++ * On all hardware, we'd like to use retpoline in preference to + * IBRS, but only if it is safe on this hardware. + */ +- else if ( retpoline_safe(caps) ) ++ if ( retpoline_safe(caps) ) + thunk = THUNK_RETPOLINE; + else if ( boot_cpu_has(X86_FEATURE_IBRSB) ) + ibrs = true; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa399-4.12.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa399-4.12.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa399-4.12.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa399-4.12.patch 2022-05-26 17:34:26.000000000 +0100 @@ -0,0 +1,45 @@ +From: Jan Beulich +Subject: VT-d: correct ordering of operations in cleanup_domid_map() + +The function may be called without any locks held (leaving aside the +domctl one, which we surely don't want to depend on here), so needs to +play safe wrt other accesses to domid_map[] and domid_bitmap[]. 
This is +to avoid context_set_domain_id()'s writing of domid_map[] to be reset to +zero right away in the case of it racing the freeing of a DID. + +For the interaction with context_set_domain_id() and ->domid_map[] reads +see the code comment. + +{check_,}cleanup_domid_map() are called with pcidevs_lock held or during +domain cleanup only (and pcidevs_lock is also held around +context_set_domain_id()), i.e. racing calls with the same (dom, iommu) +tuple cannot occur. + +domain_iommu_domid(), besides its use by cleanup_domid_map(), has its +result used only to control flushing, and hence a stale result would +only lead to a stray extra flush. + +This is CVE-2022-26357 / XSA-399. + +Fixes: b9c20c78789f ("VT-d: per-iommu domain-id") +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -1770,8 +1770,14 @@ static int domain_context_unmap(struct d + goto out; + } + ++ /* ++ * Update domid_map[] /before/ domid_bitmap[] to avoid a race with ++ * context_set_domain_id(), setting the slot to DOMID_INVALID for ++ * ->domid_map[] reads to produce a suitable value while the bit is ++ * still set. ++ */ ++ iommu->domid_map[iommu_domid] = DOMID_INVALID; + clear_bit(iommu_domid, iommu->domid_bitmap); +- iommu->domid_map[iommu_domid] = 0; + } + + out: diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-00.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-00.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-00.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-00.patch 2022-05-26 17:34:26.000000000 +0100 @@ -0,0 +1,138 @@ +From: Jan Beulich +Subject: VT-d: split domid map cleanup check into a function + +This logic will want invoking from elsewhere. + +No functional change intended. + +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné +Reviewed-by: Kevin Tian + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -152,6 +152,68 @@ static void __init free_intel_iommu(stru + xfree(intel); + } + ++static void cleanup_domid_map(struct domain *domain, struct iommu *iommu) ++{ ++ int iommu_domid = domain_iommu_domid(domain, iommu); ++ ++ if ( iommu_domid >= 0 ) ++ { ++ /* ++ * Update domid_map[] /before/ domid_bitmap[] to avoid a race with ++ * context_set_domain_id(), setting the slot to DOMID_INVALID for ++ * ->domid_map[] reads to produce a suitable value while the bit is ++ * still set. ++ */ ++ iommu->domid_map[iommu_domid] = DOMID_INVALID; ++ clear_bit(iommu_domid, iommu->domid_bitmap); ++ } ++} ++ ++static bool any_pdev_behind_iommu(const struct domain *d, ++ const struct pci_dev *exclude, ++ const struct iommu *iommu) ++{ ++ const struct pci_dev *pdev; ++ ++ for_each_pdev ( d, pdev ) ++ { ++ const struct acpi_drhd_unit *drhd; ++ ++ if ( pdev == exclude ) ++ continue; ++ ++ drhd = acpi_find_matched_drhd_unit(pdev); ++ if ( drhd && drhd->iommu == iommu ) ++ return true; ++ } ++ ++ return false; ++} ++ ++/* ++ * If no other devices under the same iommu owned by this domain, ++ * clear iommu in iommu_bitmap and clear domain_id in domid_bitmap. ++ */ ++static void check_cleanup_domid_map(struct domain *d, ++ const struct pci_dev *exclude, ++ struct iommu *iommu) ++{ ++ bool found = any_pdev_behind_iommu(d, exclude, iommu); ++ ++ /* ++ * Hidden devices are associated with DomXEN but usable by the hardware ++ * domain. Hence they need considering here as well. 
++ */ ++ if ( !found && is_hardware_domain(d) ) ++ found = any_pdev_behind_iommu(dom_xen, exclude, iommu); ++ ++ if ( !found ) ++ { ++ clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap); ++ cleanup_domid_map(d, iommu); ++ } ++} ++ + static int iommus_incoherent; + + static void sync_cache(const void *addr, unsigned int size) +@@ -1671,7 +1733,6 @@ static int domain_context_unmap(struct d + struct iommu *iommu; + int ret = 0; + u8 seg = pdev->seg, bus = pdev->bus, tmp_bus, tmp_devfn, secbus; +- int found = 0; + + drhd = acpi_find_matched_drhd_unit(pdev); + if ( !drhd ) +@@ -1740,45 +1801,8 @@ static int domain_context_unmap(struct d + goto out; + } + +- /* +- * if no other devices under the same iommu owned by this domain, +- * clear iommu in iommu_bitmap and clear domain_id in domid_bitmp +- */ +- for_each_pdev ( domain, pdev ) +- { +- if ( pdev->seg == seg && pdev->bus == bus && pdev->devfn == devfn ) +- continue; +- +- drhd = acpi_find_matched_drhd_unit(pdev); +- if ( drhd && drhd->iommu == iommu ) +- { +- found = 1; +- break; +- } +- } +- +- if ( found == 0 ) +- { +- int iommu_domid; +- +- clear_bit(iommu->index, &dom_iommu(domain)->arch.iommu_bitmap); +- +- iommu_domid = domain_iommu_domid(domain, iommu); +- if ( iommu_domid == -1 ) +- { +- ret = -EINVAL; +- goto out; +- } +- +- /* +- * Update domid_map[] /before/ domid_bitmap[] to avoid a race with +- * context_set_domain_id(), setting the slot to DOMID_INVALID for +- * ->domid_map[] reads to produce a suitable value while the bit is +- * still set. +- */ +- iommu->domid_map[iommu_domid] = DOMID_INVALID; +- clear_bit(iommu_domid, iommu->domid_bitmap); +- } ++ if ( !ret ) ++ check_cleanup_domid_map(domain, pdev, iommu); + + out: + return ret; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-01.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-01.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-01.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-01.patch 2022-05-26 17:34:26.000000000 +0100 @@ -0,0 +1,105 @@ +From: Jan Beulich +Subject: VT-d: fix (de)assign ordering when RMRRs are in use + +In the event that the RMRR mappings are essential for device operation, +they should be established before updating the device's context entry, +while they should be torn down only after the device's context entry was +successfully updated. + +Also adjust a related log message. + +This is CVE-2022-26358 / part of XSA-400. + +Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné +Reviewed-by: Paul Durrant +Reviewed-by: Kevin Tian + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -2352,6 +2352,10 @@ static int reassign_device_ownership( + { + int ret; + ++ ret = domain_context_unmap(source, devfn, pdev); ++ if ( ret ) ++ return ret; ++ + /* + * Devices assigned to untrusted domains (here assumed to be any domU) + * can attempt to send arbitrary LAPIC/MSI messages. 
We are unprotected +@@ -2388,10 +2392,6 @@ static int reassign_device_ownership( + } + } + +- ret = domain_context_unmap(source, devfn, pdev); +- if ( ret ) +- return ret; +- + if ( devfn == pdev->devfn && pdev->domain != dom_io ) + { + list_move(&pdev->domain_list, &dom_io->arch.pdev_list); +@@ -2468,9 +2468,8 @@ static int intel_iommu_assign_device( + } + } + +- ret = reassign_device_ownership(s, d, devfn, pdev); +- if ( ret || d == dom_io ) +- return ret; ++ if ( d == dom_io ) ++ return reassign_device_ownership(s, d, devfn, pdev); + + /* Setup rmrr identity mapping */ + for_each_rmrr_device( rmrr, bdf, i ) +@@ -2483,20 +2482,37 @@ static int intel_iommu_assign_device( + rmrr->end_address, flag); + if ( ret ) + { +- int rc; +- +- rc = reassign_device_ownership(d, s, devfn, pdev); + printk(XENLOG_G_ERR VTDPREFIX +- " cannot map reserved region (%"PRIx64",%"PRIx64"] for Dom%d (%d)\n", +- rmrr->base_address, rmrr->end_address, +- d->domain_id, ret); +- if ( rc ) +- { +- printk(XENLOG_ERR VTDPREFIX +- " failed to reclaim %04x:%02x:%02x.%u from %pd (%d)\n", +- seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), d, rc); +- domain_crash(d); +- } ++ "%pd: cannot map reserved region [%"PRIx64",%"PRIx64"]: %d\n", ++ d, rmrr->base_address, rmrr->end_address, ret); ++ break; ++ } ++ } ++ } ++ ++ if ( !ret ) ++ ret = reassign_device_ownership(s, d, devfn, pdev); ++ ++ /* See reassign_device_ownership() for the hwdom aspect. */ ++ if ( !ret || is_hardware_domain(d) ) ++ return ret; ++ ++ for_each_rmrr_device( rmrr, bdf, i ) ++ { ++ if ( rmrr->segment == seg && ++ PCI_BUS(bdf) == bus && ++ PCI_DEVFN2(bdf) == devfn ) ++ { ++ int rc = iommu_identity_mapping(d, p2m_access_x, ++ rmrr->base_address, ++ rmrr->end_address, 0); ++ ++ if ( rc && rc != -ENOENT ) ++ { ++ printk(XENLOG_ERR VTDPREFIX ++ "%pd: cannot unmap reserved region [%"PRIx64",%"PRIx64"]: %d\n", ++ d, rmrr->base_address, rmrr->end_address, rc); ++ domain_crash(d); + break; + } + } diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-02.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-02.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-02.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-02.patch 2022-05-26 17:34:26.000000000 +0100 @@ -0,0 +1,80 @@ +From: Jan Beulich +Subject: VT-d: fix add/remove ordering when RMRRs are in use + +In the event that the RMRR mappings are essential for device operation, +they should be established before updating the device's context entry, +while they should be torn down only after the device's context entry was +successfully cleared. + +Also switch to %pd in related log messages. 
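[Editorial aside, not part of the upstream patch.] In outline the ordering enforced here is "map before attach, detach before unmap"; a toy, compilable sketch with all helper names hypothetical:

    #include <stdio.h>

    static int  map_rmrrs(void)     { puts("map RMRR regions");    return 0; }
    static int  write_context(void) { puts("write context entry"); return 0; }
    static int  clear_context(void) { puts("clear context entry"); return 0; }
    static void unmap_rmrrs(void)   { puts("unmap RMRR regions"); }

    static int add_device(void)
    {
        if ( map_rmrrs() )          /* mappings must exist first ... */
            return -1;
        return write_context();     /* ... then the device may use them */
    }

    static int remove_device(void)
    {
        if ( clear_context() )      /* stop translation through the entry ... */
            return -1;
        unmap_rmrrs();              /* ... only then drop the RMRR mappings */
        return 0;
    }

    int main(void)
    {
        return add_device() || remove_device();
    }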
+ +Fixes: fa88cfadf918 ("vt-d: Map RMRR in intel_iommu_add_device() if the device has RMRR") +Fixes: 8b99f4400b69 ("VT-d: fix RMRR related error handling") +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné +Reviewed-by: Kevin Tian + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -1985,14 +1985,6 @@ static int intel_iommu_add_device(u8 dev + if ( !pdev->domain ) + return -EINVAL; + +- ret = domain_context_mapping(pdev->domain, devfn, pdev); +- if ( ret ) +- { +- dprintk(XENLOG_ERR VTDPREFIX, "d%d: context mapping failed\n", +- pdev->domain->domain_id); +- return ret; +- } +- + for_each_rmrr_device ( rmrr, bdf, i ) + { + if ( rmrr->segment == pdev->seg && +@@ -2009,12 +2001,17 @@ static int intel_iommu_add_device(u8 dev + rmrr->base_address, rmrr->end_address, + 0); + if ( ret ) +- dprintk(XENLOG_ERR VTDPREFIX, "d%d: RMRR mapping failed\n", +- pdev->domain->domain_id); ++ dprintk(XENLOG_ERR VTDPREFIX, "%pd: RMRR mapping failed\n", ++ pdev->domain); + } + } + +- return 0; ++ ret = domain_context_mapping(pdev->domain, devfn, pdev); ++ if ( ret ) ++ dprintk(XENLOG_ERR VTDPREFIX, "%pd: context mapping failed\n", ++ pdev->domain); ++ ++ return ret; + } + + static int intel_iommu_enable_device(struct pci_dev *pdev) +@@ -2036,11 +2033,15 @@ static int intel_iommu_remove_device(u8 + { + struct acpi_rmrr_unit *rmrr; + u16 bdf; +- int i; ++ int ret, i; + + if ( !pdev->domain ) + return -EINVAL; + ++ ret = domain_context_unmap(pdev->domain, devfn, pdev); ++ if ( ret ) ++ return ret; ++ + for_each_rmrr_device ( rmrr, bdf, i ) + { + if ( rmrr->segment != pdev->seg || +@@ -2056,7 +2057,7 @@ static int intel_iommu_remove_device(u8 + rmrr->end_address, 0); + } + +- return domain_context_unmap(pdev->domain, devfn, pdev); ++ return 0; + } + + static int __hwdom_init setup_hwdom_device(u8 devfn, struct pci_dev *pdev) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-03.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-03.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-03.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-03.patch 2022-06-05 22:22:30.000000000 +0100 @@ -0,0 +1,99 @@ +From: Jan Beulich +Subject: VT-d: drop ownership checking from domain_context_mapping_one() + +Despite putting in quite a bit of effort it was not possible to +establish why exactly this code exists (beyond possibly sanity +checking). Instead of a subsequent change further complicating this +logic, simply get rid of it. + +Take the opportunity and move the respective unmap_vtd_domain_page() out +of the locked region. 
+ +Signed-off-by: Jan Beulich +Reviewed-by: Roger Pau Monné +Reviewed-by: Paul Durrant +Reviewed-by: Kevin Tian + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -112,28 +112,6 @@ static int context_set_domain_id(struct + return 0; + } + +-static int context_get_domain_id(struct context_entry *context, +- struct iommu *iommu) +-{ +- unsigned long dom_index, nr_dom; +- int domid = -1; +- +- if (iommu && context) +- { +- nr_dom = cap_ndoms(iommu->cap); +- +- dom_index = context_domain_id(*context); +- +- if ( dom_index < nr_dom && iommu->domid_map ) +- domid = iommu->domid_map[dom_index]; +- else +- dprintk(XENLOG_DEBUG VTDPREFIX, +- "dom_index %lu exceeds nr_dom %lu or iommu has no domid_map\n", +- dom_index, nr_dom); +- } +- return domid; +-} +- + static struct intel_iommu *__init alloc_intel_iommu(void) + { + struct intel_iommu *intel; +@@ -1433,49 +1411,9 @@ int domain_context_mapping_one( + + if ( context_present(*context) ) + { +- int res = 0; +- +- /* Try to get domain ownership from device structure. If that's +- * not available, try to read it from the context itself. */ +- if ( pdev ) +- { +- if ( pdev->domain != domain ) +- { +- printk(XENLOG_G_INFO VTDPREFIX +- "d%d: %04x:%02x:%02x.%u owned by d%d!", +- domain->domain_id, +- seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), +- pdev->domain ? pdev->domain->domain_id : -1); +- res = -EINVAL; +- } +- } +- else +- { +- int cdomain; +- cdomain = context_get_domain_id(context, iommu); +- +- if ( cdomain < 0 ) +- { +- printk(XENLOG_G_WARNING VTDPREFIX +- "d%d: %04x:%02x:%02x.%u mapped, but can't find owner!\n", +- domain->domain_id, +- seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); +- res = -EINVAL; +- } +- else if ( cdomain != domain->domain_id ) +- { +- printk(XENLOG_G_INFO VTDPREFIX +- "d%d: %04x:%02x:%02x.%u already mapped to d%d!", +- domain->domain_id, +- seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), +- cdomain); +- res = -EINVAL; +- } +- } +- +- unmap_vtd_domain_page(context_entries); + spin_unlock(&iommu->lock); +- return res; ++ unmap_vtd_domain_page(context_entries); ++ return 0; + } + + if ( iommu_passthrough && is_hardware_domain(domain) ) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-04.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-04.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-04.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-04.patch 2022-06-06 12:57:27.000000000 +0100 @@ -0,0 +1,561 @@ +From: Jan Beulich +Subject: VT-d: re-assign devices directly + +Devices with RMRRs, due to it being unspecified how/when the specified +memory regions may get accessed, may not be left disconnected from their +respective mappings (as long as it's not certain that the device has +been fully quiesced). Hence rather than unmapping the old context and +then mapping the new one, re-assignment needs to be done in a single +step. + +This is CVE-2022-26359 / part of XSA-400. + +Reported-by: Roger Pau Monné + +Similarly quarantining scratch-page mode relies on page tables to be +continuously wired up. + +To avoid complicating things more than necessary, treat all devices +mostly equally, i.e. regardless of their association with any RMRRs. The +main difference is when it comes to updating context entries, which need +to be atomic when there are RMRRs. Yet atomicity can only be achieved +with CMPXCHG16B, availability of which we can't take for given. 
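[Editorial aside, not part of the upstream patch.] For reference, a 16-byte compare-and-swap of a two-word descriptor such as a VT-d context entry can be expressed as below. This is only a host-side sketch using the GCC builtin (Xen has its own cmpxchg16b() helper) and needs a CPU with CMPXCHG16B; build with e.g. gcc -mcx16 (possibly -latomic):

    #include <stdint.h>
    #include <stdio.h>

    union ctx {
        struct { uint64_t lo, hi; };
        unsigned __int128 full;     /* 16-byte aligned, as CMPXCHG16B requires */
    };

    int main(void)
    {
        union ctx entry = { .lo = 0x1, .hi = 0x2 };          /* entry as read */
        union ctx repl  = { .lo = 0x3, .hi = 0x4 };          /* entry to install */
        unsigned __int128 expected = entry.full;

        /* Succeeds only if the entry is still exactly what we read earlier. */
        int ok = __atomic_compare_exchange_n(&entry.full, &expected, repl.full,
                                             0, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);

        printf("swap %s: lo=%#llx hi=%#llx\n", ok ? "succeeded" : "failed",
               (unsigned long long)entry.lo, (unsigned long long)entry.hi);
        return 0;
    }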
+ +The seemingly complicated choice of non-negative return values for +domain_context_mapping_one() is to limit code churn: This way callers +passing NULL for pdev don't need fiddling with. + +Signed-off-by: Jan Beulich +Reviewed-by: Kevin Tian +Reviewed-by: Roger Pau Monné + +--- a/xen/drivers/passthrough/vtd/extern.h ++++ b/xen/drivers/passthrough/vtd/extern.h +@@ -70,7 +70,8 @@ void free_pgtable_maddr(u64 maddr); + void *map_vtd_domain_page(u64 maddr); + void unmap_vtd_domain_page(void *va); + int domain_context_mapping_one(struct domain *domain, struct iommu *iommu, +- u8 bus, u8 devfn, const struct pci_dev *); ++ uint8_t bus, uint8_t devfn, ++ const struct pci_dev *pdev, unsigned int mode); + int domain_context_unmap_one(struct domain *domain, struct iommu *iommu, + u8 bus, u8 devfn); + int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt); +@@ -90,8 +91,8 @@ int is_igd_vt_enabled_quirk(void); + void platform_quirks_init(void); + void vtd_ops_preamble_quirk(struct iommu* iommu); + void vtd_ops_postamble_quirk(struct iommu* iommu); +-int __must_check me_wifi_quirk(struct domain *domain, +- u8 bus, u8 devfn, int map); ++int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus, ++ uint8_t devfn, unsigned int mode); + void pci_vtd_quirk(const struct pci_dev *); + void quirk_iommu_caps(struct iommu *iommu); + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -108,6 +108,7 @@ static int context_set_domain_id(struct + } + + set_bit(i, iommu->domid_bitmap); ++ context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET); + context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET; + return 0; + } +@@ -1389,15 +1390,27 @@ static void __hwdom_init intel_iommu_hwd + } + } + ++/* ++ * This function returns ++ * - a negative errno value upon error, ++ * - zero upon success when previously the entry was non-present, or this isn't ++ * the "main" request for a device (pdev == NULL), or for no-op quarantining ++ * assignments, ++ * - positive (one) upon success when previously the entry was present and this ++ * is the "main" request for a device (pdev != NULL). 
++ */ + int domain_context_mapping_one( + struct domain *domain, + struct iommu *iommu, +- u8 bus, u8 devfn, const struct pci_dev *pdev) ++ uint8_t bus, uint8_t devfn, const struct pci_dev *pdev, ++ unsigned int mode) + { + struct domain_iommu *hd = dom_iommu(domain); +- struct context_entry *context, *context_entries; ++ struct context_entry *context, *context_entries, lctxt; ++ __uint128_t old; + u64 maddr, pgd_maddr; +- u16 seg = iommu->intel->drhd->segment; ++ uint16_t seg = iommu->intel->drhd->segment, prev_did = 0; ++ struct domain *prev_dom = NULL; + int agaw, rc, ret; + bool_t flush_dev_iotlb; + +@@ -1406,17 +1419,32 @@ int domain_context_mapping_one( + maddr = bus_to_context_maddr(iommu, bus); + context_entries = (struct context_entry *)map_vtd_domain_page(maddr); + context = &context_entries[devfn]; ++ old = (lctxt = *context).full; + +- if ( context_present(*context) ) ++ if ( context_present(lctxt) ) + { +- spin_unlock(&iommu->lock); +- unmap_vtd_domain_page(context_entries); +- return 0; ++ domid_t domid; ++ ++ prev_did = context_domain_id(lctxt); ++ domid = iommu->domid_map[prev_did]; ++ if ( domid < DOMID_FIRST_RESERVED ) ++ prev_dom = rcu_lock_domain_by_id(domid); ++ else if ( domid == DOMID_IO ) ++ prev_dom = rcu_lock_domain(dom_io); ++ if ( !prev_dom ) ++ { ++ spin_unlock(&iommu->lock); ++ unmap_vtd_domain_page(context_entries); ++ dprintk(XENLOG_DEBUG VTDPREFIX, ++ "no domain for did %u (nr_dom %u)\n", ++ prev_did, cap_ndoms(iommu->cap)); ++ return -ESRCH; ++ } + } + + if ( iommu_passthrough && is_hardware_domain(domain) ) + { +- context_set_translation_type(*context, CONTEXT_TT_PASS_THRU); ++ context_set_translation_type(lctxt, CONTEXT_TT_PASS_THRU); + agaw = level_to_agaw(iommu->nr_pt_levels); + } + else +@@ -1433,6 +1461,8 @@ int domain_context_mapping_one( + spin_unlock(&hd->arch.mapping_lock); + spin_unlock(&iommu->lock); + unmap_vtd_domain_page(context_entries); ++ if ( prev_dom ) ++ rcu_unlock_domain(prev_dom); + return -ENOMEM; + } + } +@@ -1450,33 +1480,102 @@ int domain_context_mapping_one( + goto nomem; + } + +- context_set_address_root(*context, pgd_maddr); ++ context_set_address_root(lctxt, pgd_maddr); + if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) ) +- context_set_translation_type(*context, CONTEXT_TT_DEV_IOTLB); ++ context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB); + else +- context_set_translation_type(*context, CONTEXT_TT_MULTI_LEVEL); ++ context_set_translation_type(lctxt, CONTEXT_TT_MULTI_LEVEL); + + spin_unlock(&hd->arch.mapping_lock); + } + +- if ( context_set_domain_id(context, domain, iommu) ) ++ rc = context_set_domain_id(&lctxt, domain, iommu); ++ if ( rc ) + { ++ unlock: + spin_unlock(&iommu->lock); + unmap_vtd_domain_page(context_entries); +- return -EFAULT; ++ if ( prev_dom ) ++ rcu_unlock_domain(prev_dom); ++ return rc; ++ } ++ ++ if ( !prev_dom ) ++ { ++ context_set_address_width(lctxt, agaw); ++ context_set_fault_enable(lctxt); ++ context_set_present(lctxt); ++ } ++ else if ( prev_dom == domain ) ++ { ++ ASSERT(lctxt.full == context->full); ++ rc = !!pdev; ++ goto unlock; ++ } ++ else ++ { ++ ASSERT(context_address_width(lctxt) == agaw); ++ ASSERT(!context_fault_disable(lctxt)); ++ } ++ ++ if ( cpu_has_cx16 ) ++ { ++ __uint128_t res = cmpxchg16b(context, &old, &lctxt.full); ++ ++ /* ++ * Hardware does not update the context entry behind our backs, ++ * so the return value should match "old". 
++ */ ++ if ( res != old ) ++ { ++ if ( pdev ) ++ check_cleanup_domid_map(domain, pdev, iommu); ++ printk(XENLOG_ERR ++ "%04x:%02x:%02x.%u: unexpected context entry %016lx_%016lx (expected %016lx_%016lx)\n", ++ pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn), ++ (uint64_t)(res >> 64), (uint64_t)res, ++ (uint64_t)(old >> 64), (uint64_t)old); ++ rc = -EILSEQ; ++ goto unlock; ++ } ++ } ++ else if ( !prev_dom || !(mode & MAP_WITH_RMRR) ) ++ { ++ context_clear_present(*context); ++ iommu_sync_cache(context, sizeof(*context)); ++ ++ write_atomic(&context->hi, lctxt.hi); ++ /* No barrier should be needed between these two. */ ++ write_atomic(&context->lo, lctxt.lo); ++ } ++ else /* Best effort, updating DID last. */ ++ { ++ /* ++ * By non-atomically updating the context entry's DID field last, ++ * during a short window in time TLB entries with the old domain ID ++ * but the new page tables may be inserted. This could affect I/O ++ * of other devices using this same (old) domain ID. Such updating ++ * therefore is not a problem if this was the only device associated ++ * with the old domain ID. Diverting I/O of any of a dying domain's ++ * devices to the quarantine page tables is intended anyway. ++ */ ++ if ( !(mode & (MAP_OWNER_DYING | MAP_SINGLE_DEVICE)) ) ++ printk(XENLOG_WARNING VTDPREFIX ++ " %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n", ++ seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), prev_dom); ++ ++ write_atomic(&context->lo, lctxt.lo); ++ /* No barrier should be needed between these two. */ ++ write_atomic(&context->hi, lctxt.hi); + } + +- context_set_address_width(*context, agaw); +- context_set_fault_enable(*context); +- context_set_present(*context); + iommu_sync_cache(context, sizeof(struct context_entry)); + spin_unlock(&iommu->lock); + +- /* Context entry was previously non-present (with domid 0). */ +- rc = iommu_flush_context_device(iommu, 0, PCI_BDF2(bus, devfn), +- DMA_CCMD_MASK_NOBIT, 1); ++ rc = iommu_flush_context_device(iommu, prev_did, PCI_BDF2(bus, devfn), ++ DMA_CCMD_MASK_NOBIT, !prev_dom); + flush_dev_iotlb = !!find_ats_dev_drhd(iommu); +- ret = iommu_flush_iotlb_dsi(iommu, 0, 1, flush_dev_iotlb); ++ ret = iommu_flush_iotlb_dsi(iommu, prev_did, !prev_dom, flush_dev_iotlb); + + /* + * The current logic for returns: +@@ -1497,17 +1596,35 @@ int domain_context_mapping_one( + unmap_vtd_domain_page(context_entries); + + if ( !seg && !rc ) +- rc = me_wifi_quirk(domain, bus, devfn, MAP_ME_PHANTOM_FUNC); ++ rc = me_wifi_quirk(domain, bus, devfn, mode); + +- return rc; ++ if ( rc ) ++ { ++ if ( !prev_dom ) ++ domain_context_unmap_one(domain, iommu, bus, devfn); ++ else if ( prev_dom != domain ) /* Avoid infinite recursion. 
*/ ++ domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev, ++ mode & MAP_WITH_RMRR); ++ } ++ ++ if ( prev_dom ) ++ rcu_unlock_domain(prev_dom); ++ ++ return rc ?: pdev && prev_dom; + } + ++static int domain_context_unmap(struct domain *d, uint8_t devfn, ++ struct pci_dev *pdev); ++ + static int domain_context_mapping(struct domain *domain, u8 devfn, + struct pci_dev *pdev) + { + struct acpi_drhd_unit *drhd; ++ const struct acpi_rmrr_unit *rmrr; + int ret = 0; +- u8 seg = pdev->seg, bus = pdev->bus, secbus; ++ unsigned int i, mode = 0; ++ uint16_t seg = pdev->seg, bdf; ++ uint8_t bus = pdev->bus, secbus; + + drhd = acpi_find_matched_drhd_unit(pdev); + if ( !drhd ) +@@ -1515,8 +1632,30 @@ static int domain_context_mapping(struct + + ASSERT(pcidevs_locked()); + ++ for_each_rmrr_device( rmrr, bdf, i ) ++ { ++ if ( rmrr->segment != pdev->seg || ++ bdf != PCI_BDF2(pdev->bus, pdev->devfn) ) ++ continue; ++ ++ mode |= MAP_WITH_RMRR; ++ break; ++ } ++ ++ if ( domain != pdev->domain ) ++ { ++ if ( pdev->domain->is_dying ) ++ mode |= MAP_OWNER_DYING; ++ else if ( drhd && ++ !any_pdev_behind_iommu(pdev->domain, pdev, drhd->iommu) && ++ !pdev->phantom_stride ) ++ mode |= MAP_SINGLE_DEVICE; ++ } ++ + switch ( pdev->type ) + { ++ bool prev_present; ++ + case DEV_TYPE_PCI_HOST_BRIDGE: + if ( iommu_debug ) + printk(VTDPREFIX "d%d:Hostbridge: skip %04x:%02x:%02x.%u map\n", +@@ -1537,7 +1676,9 @@ static int domain_context_mapping(struct + domain->domain_id, seg, bus, + PCI_SLOT(devfn), PCI_FUNC(devfn)); + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pdev); ++ pdev, mode); ++ if ( ret > 0 ) ++ ret = 0; + if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) + enable_ats_device(pdev, &drhd->iommu->ats_devices); + +@@ -1550,20 +1691,33 @@ static int domain_context_mapping(struct + PCI_SLOT(devfn), PCI_FUNC(devfn)); + + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pdev); +- if ( ret ) ++ pdev, mode); ++ if ( ret < 0 ) + break; ++ prev_present = ret; ++ ret = 0; + + if ( find_upstream_bridge(seg, &bus, &devfn, &secbus) < 1 ) + break; + + /* ++ * Strictly speaking if the device is the only one behind this bridge ++ * and the only one with this (secbus,0,0) tuple, it could be allowed ++ * to be re-assigned regardless of RMRR presence. But let's deal with ++ * that case only if it is actually found in the wild. ++ */ ++ if ( prev_present && (mode & MAP_WITH_RMRR) && ++ domain != pdev->domain ) ++ ret = -EOPNOTSUPP; ++ ++ /* + * Mapping a bridge should, if anything, pass the struct pci_dev of + * that bridge. Since bridges don't normally get assigned to guests, + * their owner would be the wrong one. Pass NULL instead. + */ +- ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- NULL); ++ if ( ret >= 0 ) ++ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, ++ NULL, mode); + + /* + * Devices behind PCIe-to-PCI/PCIx bridge may generate different +@@ -1578,7 +1732,15 @@ static int domain_context_mapping(struct + if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE && + (secbus != pdev->bus || pdev->devfn != 0) ) + ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0, +- NULL); ++ NULL, mode); ++ ++ if ( ret ) ++ { ++ if ( !prev_present ) ++ domain_context_unmap(domain, devfn, pdev); ++ else if ( pdev->domain != domain ) /* Avoid infinite recursion. 
*/ ++ domain_context_mapping(pdev->domain, devfn, pdev); ++ } + + break; + +@@ -2237,9 +2399,8 @@ static int reassign_device_ownership( + { + int ret; + +- ret = domain_context_unmap(source, devfn, pdev); +- if ( ret ) +- return ret; ++ if ( !has_arch_pdevs(target) ) ++ vmx_pi_hooks_assign(target); + + /* + * Devices assigned to untrusted domains (here assumed to be any domU) +@@ -2249,6 +2410,31 @@ static int reassign_device_ownership( + if ( (target != hardware_domain) && !iommu_intremap ) + untrusted_msi = true; + ++ ret = domain_context_mapping(target, devfn, pdev); ++ if ( ret ) ++ { ++ if ( !has_arch_pdevs(target) ) ++ vmx_pi_hooks_deassign(target); ++ return ret; ++ } ++ ++ if ( pdev->devfn == devfn ) ++ { ++ const struct acpi_drhd_unit *drhd = acpi_find_matched_drhd_unit(pdev); ++ ++ if ( drhd ) ++ check_cleanup_domid_map(source, pdev, drhd->iommu); ++ } ++ ++ if ( devfn == pdev->devfn && pdev->domain != target ) ++ { ++ list_move(&pdev->domain_list, &target->arch.pdev_list); ++ pdev->domain = target; ++ } ++ ++ if ( !has_arch_pdevs(source) ) ++ vmx_pi_hooks_deassign(source); ++ + /* + * If the device belongs to the hardware domain, and it has RMRR, don't + * remove it from the hardware domain, because BIOS may use RMRR at +@@ -2277,34 +2463,7 @@ static int reassign_device_ownership( + } + } + +- if ( devfn == pdev->devfn && pdev->domain != dom_io ) +- { +- list_move(&pdev->domain_list, &dom_io->arch.pdev_list); +- pdev->domain = dom_io; +- } +- +- if ( !has_arch_pdevs(source) ) +- vmx_pi_hooks_deassign(source); +- +- if ( !has_arch_pdevs(target) ) +- vmx_pi_hooks_assign(target); +- +- ret = domain_context_mapping(target, devfn, pdev); +- if ( ret ) +- { +- if ( !has_arch_pdevs(target) ) +- vmx_pi_hooks_deassign(target); +- +- return ret; +- } +- +- if ( devfn == pdev->devfn && pdev->domain != target ) +- { +- list_move(&pdev->domain_list, &target->arch.pdev_list); +- pdev->domain = target; +- } +- +- return ret; ++ return 0; + } + + static int intel_iommu_assign_device( +--- a/xen/drivers/passthrough/vtd/iommu.h ++++ b/xen/drivers/passthrough/vtd/iommu.h +@@ -201,8 +201,12 @@ struct root_entry { + do {(root).val |= ((value) & PAGE_MASK_4K);} while(0) + + struct context_entry { +- u64 lo; +- u64 hi; ++ union { ++ struct { ++ uint64_t lo, hi; ++ }; ++ __uint128_t full; ++ }; + }; + #define ROOT_ENTRY_NR (PAGE_SIZE_4K/sizeof(struct root_entry)) + #define context_present(c) ((c).lo & 1) +--- a/xen/drivers/passthrough/vtd/quirks.c ++++ b/xen/drivers/passthrough/vtd/quirks.c +@@ -330,7 +330,8 @@ void __init platform_quirks_init(void) + */ + + static int __must_check map_me_phantom_function(struct domain *domain, +- u32 dev, int map) ++ unsigned int dev, ++ unsigned int mode) + { + struct acpi_drhd_unit *drhd; + struct pci_dev *pdev; +@@ -341,9 +342,9 @@ static int __must_check map_me_phantom_f + drhd = acpi_find_matched_drhd_unit(pdev); + + /* map or unmap ME phantom function */ +- if ( map ) ++ if ( !(mode & UNMAP_ME_PHANTOM_FUNC) ) + rc = domain_context_mapping_one(domain, drhd->iommu, 0, +- PCI_DEVFN(dev, 7), NULL); ++ PCI_DEVFN(dev, 7), NULL, mode); + else + rc = domain_context_unmap_one(domain, drhd->iommu, 0, + PCI_DEVFN(dev, 7)); +@@ -351,7 +352,8 @@ static int __must_check map_me_phantom_f + return rc; + } + +-int me_wifi_quirk(struct domain *domain, u8 bus, u8 devfn, int map) ++int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn, ++ unsigned int mode) + { + u32 id; + int rc = 0; +@@ -375,7 +377,7 @@ int me_wifi_quirk(struct domain *domain, + case 0x423b8086: + 
case 0x423c8086: + case 0x423d8086: +- rc = map_me_phantom_function(domain, 3, map); ++ rc = map_me_phantom_function(domain, 3, mode); + break; + default: + break; +@@ -401,7 +403,7 @@ int me_wifi_quirk(struct domain *domain, + case 0x42388086: /* Puma Peak */ + case 0x422b8086: + case 0x422c8086: +- rc = map_me_phantom_function(domain, 22, map); ++ rc = map_me_phantom_function(domain, 22, mode); + break; + default: + break; +--- a/xen/drivers/passthrough/vtd/vtd.h ++++ b/xen/drivers/passthrough/vtd/vtd.h +@@ -22,8 +22,14 @@ + + #include + +-#define MAP_ME_PHANTOM_FUNC 1 +-#define UNMAP_ME_PHANTOM_FUNC 0 ++/* ++ * Values for domain_context_mapping_one()'s and me_wifi_quirk()'s "mode" ++ * parameters. ++ */ ++#define MAP_WITH_RMRR (1u << 0) ++#define MAP_OWNER_DYING (1u << 1) ++#define MAP_SINGLE_DEVICE (1u << 2) ++#define UNMAP_ME_PHANTOM_FUNC (1u << 3) + + /* Allow for both IOAPIC and IOSAPIC. */ + #define IO_xAPIC_route_entry IO_APIC_route_entry diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-05.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-05.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-05.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-05.patch 2022-06-13 22:16:03.000000000 +0100 @@ -0,0 +1,443 @@ +From: Jan Beulich +Subject: AMD/IOMMU: re-assign devices directly + +Devices with unity map ranges, due to it being unspecified how/when +these memory ranges may get accessed, may not be left disconnected from +their unity mappings (as long as it's not certain that the device has +been fully quiesced). Hence rather than tearing down the old root page +table pointer and then establishing the new one, re-assignment needs to +be done in a single step. + +This is CVE-2022-26360 / part of XSA-400. + +Reported-by: Roger Pau Monné + +Similarly quarantining scratch-page mode relies on page tables to be +continuously wired up. + +To avoid complicating things more than necessary, treat all devices +mostly equally, i.e. regardless of their association with any unity map +ranges. The main difference is when it comes to updating DTEs, which need +to be atomic when there are unity mappings. Yet atomicity can only be +achieved with CMPXCHG16B, availability of which we can't take for given. 
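[Editorial aside, not part of the upstream patch.] When the 16-byte swap is not available, the fallback ordering matters: the half of the descriptor carrying the domain ID is written last, so any transiently mixed state pairs the old ID with the new tables rather than the other way round. A minimal sketch of that ordering (hypothetical layout, not the real DTE encoding):

    #include <stdint.h>
    #include <stdio.h>

    struct dte { uint64_t lo, hi; };   /* pretend the domain ID sits in .hi */

    static void publish_dte(struct dte *slot, struct dte val)
    {
        __atomic_store_n(&slot->lo, val.lo, __ATOMIC_RELAXED); /* tables etc. first */
        __atomic_store_n(&slot->hi, val.hi, __ATOMIC_RELAXED); /* domain ID last */
    }

    int main(void)
    {
        struct dte slot = { 0, 0 };
        publish_dte(&slot, (struct dte){ .lo = 0x3, .hi = 0x42 });
        printf("lo=%#llx hi=%#llx\n",
               (unsigned long long)slot.lo, (unsigned long long)slot.hi);
        return 0;
    }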
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Reviewed-by: Roger Pau Monné + +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +@@ -72,8 +72,11 @@ void amd_iommu_share_p2m(struct domain * + int get_dma_requestor_id(u16 seg, u16 bdf); + void amd_iommu_set_intremap_table( + u32 *dte, u64 intremap_ptr, u8 int_valid); +-void amd_iommu_set_root_page_table( +- u32 *dte, u64 root_ptr, u16 domain_id, u8 paging_mode, u8 valid); ++#define SET_ROOT_VALID (1u << 0) ++#define SET_ROOT_WITH_UNITY_MAP (1u << 1) ++int __must_check amd_iommu_set_root_page_table( ++ u32 *dte, u64 root_ptr, u16 domain_id, u8 paging_mode, unsigned int flags); ++paddr_t amd_iommu_get_root_page_table(const u32 *dte); + void iommu_dte_set_iotlb(u32 *dte, u8 i); + void iommu_dte_add_device_entry(u32 *dte, struct ivrs_mappings *ivrs_dev); + void iommu_dte_set_guest_cr3(u32 *dte, u16 dom_id, u64 gcr3, +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -143,12 +143,105 @@ static unsigned int set_iommu_pte_presen + return need_flush; + } + +-void amd_iommu_set_root_page_table( +- u32 *dte, u64 root_ptr, u16 domain_id, u8 paging_mode, u8 valid) ++/* ++ * This function returns ++ * - -errno for errors, ++ * - 0 for a successful update, atomic when necessary ++ * - 1 for a successful but non-atomic update, which may need to be warned ++ * about by the caller. ++ */ ++int amd_iommu_set_root_page_table(u32 *dte, u64 root_ptr, u16 domain_id, ++ u8 paging_mode, unsigned int flags) + { ++ bool valid = flags & SET_ROOT_VALID; + u64 addr_hi, addr_lo; + u32 entry, dte0 = dte[0]; + ++ addr_lo = root_ptr & DMA_32BIT_MASK; ++ addr_hi = root_ptr >> 32; ++ ++ if ( get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_VALID_MASK, ++ IOMMU_DEV_TABLE_VALID_SHIFT) && ++ get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_TRANSLATION_VALID_MASK, ++ IOMMU_DEV_TABLE_TRANSLATION_VALID_SHIFT) && ++ (cpu_has_cx16 || (flags & SET_ROOT_WITH_UNITY_MAP)) ) ++ { ++ union { ++ u32 dte[4]; ++ u64 raw64[2]; ++ __uint128_t raw128; ++ } ldte; ++ __uint128_t old; ++ int ret = 0; ++ ++ memcpy(ldte.dte, dte, sizeof(ldte)); ++ old = ldte.raw128; ++ ++ set_field_in_reg_u32(domain_id, ldte.dte[2], ++ IOMMU_DEV_TABLE_DOMAIN_ID_MASK, ++ IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT, &ldte.dte[2]); ++ ++ set_field_in_reg_u32(addr_hi, ldte.dte[1], ++ IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_MASK, ++ IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_SHIFT, ++ &ldte.dte[1]); ++ set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, ldte.dte[1], ++ IOMMU_DEV_TABLE_IO_WRITE_PERMISSION_MASK, ++ IOMMU_DEV_TABLE_IO_WRITE_PERMISSION_SHIFT, ++ &ldte.dte[1]); ++ set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, ldte.dte[1], ++ IOMMU_DEV_TABLE_IO_READ_PERMISSION_MASK, ++ IOMMU_DEV_TABLE_IO_READ_PERMISSION_SHIFT, ++ &ldte.dte[1]); ++ ++ set_field_in_reg_u32(addr_lo >> PAGE_SHIFT, ldte.dte[0], ++ IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK, ++ IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT, ++ &ldte.dte[0]); ++ set_field_in_reg_u32(paging_mode, ldte.dte[0], ++ IOMMU_DEV_TABLE_PAGING_MODE_MASK, ++ IOMMU_DEV_TABLE_PAGING_MODE_SHIFT, &ldte.dte[0]); ++ set_field_in_reg_u32(IOMMU_CONTROL_ENABLED, ldte.dte[0], ++ IOMMU_DEV_TABLE_TRANSLATION_VALID_MASK, ++ IOMMU_DEV_TABLE_TRANSLATION_VALID_SHIFT, ++ &ldte.dte[0]); ++ set_field_in_reg_u32(valid ? 
IOMMU_CONTROL_ENABLED ++ : IOMMU_CONTROL_DISABLED, ++ ldte.dte[0], IOMMU_DEV_TABLE_VALID_MASK, ++ IOMMU_DEV_TABLE_VALID_SHIFT, &ldte.dte[0]); ++ ++ if ( cpu_has_cx16 ) ++ { ++ __uint128_t res = cmpxchg16b(dte, &old, &ldte.raw128); ++ ++ /* ++ * Hardware does not update the DTE behind our backs, so the ++ * return value should match "old". ++ */ ++ if ( res != old ) ++ { ++ printk(XENLOG_ERR ++ "Dom%d: unexpected DTE %016lx_%016lx (expected %016lx_%016lx)\n", ++ domain_id, ++ (u64)(res >> 64), (u64)res, ++ (u64)(old >> 64), (u64)old); ++ ret = -EILSEQ; ++ } ++ } ++ else /* Best effort, updating domain_id last. */ ++ { ++ u64 *ptr = (void *)dte; ++ ++ write_atomic(ptr + 0, ldte.raw64[0]); ++ /* No barrier should be needed between these two. */ ++ write_atomic(ptr + 1, ldte.raw64[1]); ++ ++ ret = 1; ++ } ++ ++ return ret; ++ } ++ + if ( valid || + get_field_from_reg_u32(dte0, IOMMU_DEV_TABLE_VALID_MASK, + IOMMU_DEV_TABLE_VALID_SHIFT) ) +@@ -183,9 +276,6 @@ void amd_iommu_set_root_page_table(uint3 + IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT, &entry); + dte[2] = entry; + +- addr_lo = root_ptr & DMA_32BIT_MASK; +- addr_hi = root_ptr >> 32; +- + set_field_in_reg_u32((u32)addr_hi, 0, + IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_MASK, + IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_SHIFT, &entry); +@@ -197,6 +287,20 @@ void amd_iommu_set_root_page_table(uint3 + IOMMU_DEV_TABLE_VALID_MASK, + IOMMU_DEV_TABLE_VALID_SHIFT, &entry); + write_atomic(&dte[0], entry); ++ ++ return 0; ++} ++ ++paddr_t amd_iommu_get_root_page_table(const u32 *dte) ++{ ++ u32 lo = get_field_from_reg_u32( ++ dte[0], IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK, ++ IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT); ++ u32 hi = get_field_from_reg_u32( ++ dte[1], IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_MASK, ++ IOMMU_DEV_TABLE_PAGE_TABLE_PTR_HIGH_SHIFT); ++ ++ return ((paddr_t)hi << 32) | (lo << PAGE_SHIFT); + } + + void iommu_dte_set_iotlb(u32 *dte, u8 i) +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -107,22 +107,60 @@ static void disable_translation(u32 *dte + dte[0] = entry; + } + +-static void amd_iommu_setup_domain_device( ++static int __must_check allocate_domain_resources(struct domain_iommu *hd) ++{ ++ int rc; ++ ++ spin_lock(&hd->arch.mapping_lock); ++ rc = amd_iommu_alloc_root(hd); ++ spin_unlock(&hd->arch.mapping_lock); ++ ++ return rc; ++} ++ ++static bool any_pdev_behind_iommu(const struct domain *d, ++ const struct pci_dev *exclude, ++ const struct amd_iommu *iommu) ++{ ++ const struct pci_dev *pdev; ++ ++ for_each_pdev ( d, pdev ) ++ { ++ if ( pdev == exclude ) ++ continue; ++ ++ if ( find_iommu_for_device(pdev->seg, ++ PCI_BDF2(pdev->bus, pdev->devfn)) == iommu ) ++ return true; ++ } ++ ++ return false; ++} ++ ++static int __must_check amd_iommu_setup_domain_device( + struct domain *domain, struct amd_iommu *iommu, + u8 devfn, struct pci_dev *pdev) + { +- void *dte; ++ u32 *dte; + unsigned long flags; +- int req_id, valid = 1; +- int dte_i = 0; ++ unsigned int req_id, sr_flags; ++ int dte_i = 0, rc; + u8 bus = pdev->bus; +- const struct domain_iommu *hd = dom_iommu(domain); ++ struct domain_iommu *hd = dom_iommu(domain); ++ const struct ivrs_mappings *ivrs_dev; ++ ++ BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer); + +- BUG_ON( !hd->arch.root_table || !hd->arch.paging_mode || +- !iommu->dev_table.buffer ); ++ rc = allocate_domain_resources(hd); ++ if ( rc ) ++ return rc; + +- if ( iommu_passthrough && is_hardware_domain(domain) ) +- valid = 0; ++ req_id = get_dma_requestor_id(iommu->seg, 
++ PCI_BDF2(pdev->bus, pdev->devfn)); ++ ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id]; ++ sr_flags = (iommu_passthrough && is_hardware_domain(domain) ++ ? 0 : SET_ROOT_VALID) ++ | (ivrs_dev->unity_map ? SET_ROOT_WITH_UNITY_MAP : 0); + + if ( ats_enabled ) + dte_i = 1; +@@ -130,32 +168,87 @@ static void amd_iommu_setup_domain_devic + /* get device-table entry */ + req_id = get_dma_requestor_id(iommu->seg, PCI_BDF2(bus, devfn)); + dte = iommu->dev_table.buffer + (req_id * IOMMU_DEV_TABLE_ENTRY_SIZE); ++ ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id]; + + spin_lock_irqsave(&iommu->lock, flags); + + if ( !is_translation_valid((u32 *)dte) ) + { + /* bind DTE to domain page-tables */ +- amd_iommu_set_root_page_table( +- (u32 *)dte, page_to_maddr(hd->arch.root_table), domain->domain_id, +- hd->arch.paging_mode, valid); ++ rc = amd_iommu_set_root_page_table( ++ dte, page_to_maddr(hd->arch.root_table), ++ domain->domain_id, hd->arch.paging_mode, sr_flags); ++ if ( rc ) ++ { ++ ASSERT(rc < 0); ++ spin_unlock_irqrestore(&iommu->lock, flags); ++ return rc; ++ } + + if ( pci_ats_device(iommu->seg, bus, pdev->devfn) && + iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) ) + iommu_dte_set_iotlb((u32 *)dte, dte_i); + + amd_iommu_flush_device(iommu, req_id); ++ } ++ else if ( amd_iommu_get_root_page_table(dte) != ++ page_to_maddr(hd->arch.root_table) ) ++ { ++ /* ++ * Strictly speaking if the device is the only one with this requestor ++ * ID, it could be allowed to be re-assigned regardless of unity map ++ * presence. But let's deal with that case only if it is actually ++ * found in the wild. ++ */ ++ if ( req_id != PCI_BDF2(bus, devfn) && ++ (sr_flags & SET_ROOT_WITH_UNITY_MAP) ) ++ rc = -EOPNOTSUPP; ++ else ++ rc = amd_iommu_set_root_page_table( ++ dte, page_to_maddr(hd->arch.root_table), ++ domain->domain_id, hd->arch.paging_mode, sr_flags); ++ if ( rc < 0 ) ++ { ++ spin_unlock_irqrestore(&iommu->lock, flags); ++ return rc; ++ } ++ if ( rc && ++ domain != pdev->domain && ++ /* ++ * By non-atomically updating the DTE's domain ID field last, ++ * during a short window in time TLB entries with the old domain ++ * ID but the new page tables may have been inserted. This could ++ * affect I/O of other devices using this same (old) domain ID. ++ * Such updating therefore is not a problem if this was the only ++ * device associated with the old domain ID. Diverting I/O of any ++ * of a dying domain's devices to the quarantine page tables is ++ * intended anyway. 
++ */ ++ !pdev->domain->is_dying && ++ (any_pdev_behind_iommu(pdev->domain, pdev, iommu) || ++ pdev->phantom_stride) ) ++ printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n", ++ pdev->seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), ++ pdev->domain); + +- AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, " +- "root table = %#"PRIx64", " +- "domain = %d, paging mode = %d\n", +- req_id, pdev->type, +- page_to_maddr(hd->arch.root_table), +- domain->domain_id, hd->arch.paging_mode); ++ if ( pci_ats_device(iommu->seg, bus, pdev->devfn) && ++ iommu_has_cap(iommu, PCI_CAP_IOTLB_SHIFT) ) ++ ASSERT(get_field_from_reg_u32( ++ dte[3], IOMMU_DEV_TABLE_IOTLB_SUPPORT_MASK, ++ IOMMU_DEV_TABLE_IOTLB_SUPPORT_SHIFT) == dte_i); ++ ++ amd_iommu_flush_device(iommu, req_id); + } + + spin_unlock_irqrestore(&iommu->lock, flags); + ++ AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, " ++ "root table = %#"PRIx64", " ++ "domain = %d, paging mode = %d\n", ++ req_id, pdev->type, ++ page_to_maddr(hd->arch.root_table), ++ domain->domain_id, hd->arch.paging_mode); ++ + ASSERT(pcidevs_locked()); + + if ( pci_ats_device(iommu->seg, bus, pdev->devfn) && +@@ -166,6 +259,8 @@ static void amd_iommu_setup_domain_devic + + amd_iommu_flush_iotlb(devfn, pdev, INV_IOMMU_ALL_PAGES_ADDRESS, 0); + } ++ ++ return 0; + } + + int __init amd_iov_detect(void) +@@ -207,17 +302,6 @@ int amd_iommu_alloc_root(struct domain_i + return 0; + } + +-static int __must_check allocate_domain_resources(struct domain_iommu *hd) +-{ +- int rc; +- +- spin_lock(&hd->arch.mapping_lock); +- rc = amd_iommu_alloc_root(hd); +- spin_unlock(&hd->arch.mapping_lock); +- +- return rc; +-} +- + int __read_mostly amd_iommu_min_paging_mode = 1; + + static int amd_iommu_domain_init(struct domain *d) +@@ -336,7 +420,6 @@ static int reassign_device(struct domain + { + struct amd_iommu *iommu; + int bdf, rc; +- struct domain_iommu *t = dom_iommu(target); + const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg); + + bdf = PCI_BDF2(pdev->bus, pdev->devfn); +@@ -350,7 +433,15 @@ static int reassign_device(struct domain + return -ENODEV; + } + +- amd_iommu_disable_domain_device(source, iommu, devfn, pdev); ++ rc = amd_iommu_setup_domain_device(target, iommu, devfn, pdev); ++ if ( rc ) ++ return rc; ++ ++ if ( devfn == pdev->devfn && pdev->domain != target ) ++ { ++ list_move(&pdev->domain_list, &target->arch.pdev_list); ++ pdev->domain = target; ++ } + + /* + * If the device belongs to the hardware domain, and it has a unity mapping, +@@ -366,27 +457,10 @@ static int reassign_device(struct domain + return rc; + } + +- if ( devfn == pdev->devfn && pdev->domain != dom_io ) +- { +- list_move(&pdev->domain_list, &dom_io->arch.pdev_list); +- pdev->domain = dom_io; +- } +- +- rc = allocate_domain_resources(t); +- if ( rc ) +- return rc; +- +- amd_iommu_setup_domain_device(target, iommu, devfn, pdev); + AMD_IOMMU_DEBUG("Re-assign %04x:%02x:%02x.%u from dom%d to dom%d\n", + pdev->seg, pdev->bus, PCI_SLOT(devfn), PCI_FUNC(devfn), + source->domain_id, target->domain_id); + +- if ( devfn == pdev->devfn && pdev->domain != target ) +- { +- list_move(&pdev->domain_list, &target->arch.pdev_list); +- pdev->domain = target; +- } +- + return 0; + } + +@@ -517,8 +591,7 @@ static int amd_iommu_add_device(u8 devfn + return -ENODEV; + } + +- amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); +- return 0; ++ return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); + } + + static int 
amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev) diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-06.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-06.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-06.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-06.patch 2022-06-06 07:52:05.000000000 +0100 @@ -0,0 +1,281 @@ +From: Jan Beulich +Subject: VT-d: prepare for per-device quarantine page tables (part I) + +Arrange for domain ID and page table root to be passed around, the latter in +particular to domain_pgd_maddr() such that taking it from the per-domain +fields can be overridden. + +No functional change intended. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Reviewed-by: Roger Pau Monné +Reviewed-by: Kevin Tian + +--- a/xen/drivers/passthrough/vtd/extern.h ++++ b/xen/drivers/passthrough/vtd/extern.h +@@ -72,9 +72,10 @@ void *map_vtd_domain_page(u64 maddr); + void unmap_vtd_domain_page(void *va); + int domain_context_mapping_one(struct domain *domain, struct iommu *iommu, + uint8_t bus, uint8_t devfn, +- const struct pci_dev *pdev, unsigned int mode); ++ const struct pci_dev *pdev, domid_t domid, ++ paddr_t pgd_maddr, unsigned int mode); + int domain_context_unmap_one(struct domain *domain, struct iommu *iommu, +- u8 bus, u8 devfn); ++ uint8_t bus, uint8_t devfn, domid_t domid); + int intel_iommu_get_reserved_device_memory(iommu_grdm_t *func, void *ctxt); + + unsigned int io_apic_read_remap_rte(unsigned int apic, unsigned int reg); +@@ -93,7 +94,8 @@ void platform_quirks_init(void); + void vtd_ops_preamble_quirk(struct iommu* iommu); + void vtd_ops_postamble_quirk(struct iommu* iommu); + int __must_check me_wifi_quirk(struct domain *domain, uint8_t bus, +- uint8_t devfn, unsigned int mode); ++ uint8_t devfn, domid_t domid, paddr_t pgd_maddr, ++ unsigned int mode); + void pci_vtd_quirk(const struct pci_dev *); + void quirk_iommu_caps(struct iommu *iommu); + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -1405,12 +1405,12 @@ int domain_context_mapping_one( + struct domain *domain, + struct iommu *iommu, + uint8_t bus, uint8_t devfn, const struct pci_dev *pdev, +- unsigned int mode) ++ domid_t domid, paddr_t pgd_maddr, unsigned int mode) + { + struct domain_iommu *hd = dom_iommu(domain); + struct context_entry *context, *context_entries, lctxt; + __uint128_t old; +- u64 maddr, pgd_maddr; ++ uint64_t maddr; + uint16_t seg = iommu->intel->drhd->segment, prev_did = 0; + struct domain *prev_dom = NULL; + int agaw, rc, ret; +@@ -1451,10 +1451,12 @@ int domain_context_mapping_one( + } + else + { ++ paddr_t root = pgd_maddr; ++ + spin_lock(&hd->arch.mapping_lock); + + /* Ensure we have pagetables allocated down to leaf PTE. */ +- if ( hd->arch.pgd_maddr == 0 ) ++ if ( !root ) + { + addr_to_dma_page_maddr(domain, 0, 1); + if ( hd->arch.pgd_maddr == 0 ) +@@ -1467,22 +1469,24 @@ int domain_context_mapping_one( + rcu_unlock_domain(prev_dom); + return -ENOMEM; + } ++ ++ root = hd->arch.pgd_maddr; + } + + /* Skip top levels of page tables for 2- and 3-level DRHDs. 
*/ +- pgd_maddr = hd->arch.pgd_maddr; + for ( agaw = level_to_agaw(4); + agaw != level_to_agaw(iommu->nr_pt_levels); + agaw-- ) + { +- struct dma_pte *p = map_vtd_domain_page(pgd_maddr); +- pgd_maddr = dma_pte_addr(*p); ++ struct dma_pte *p = map_vtd_domain_page(root); ++ ++ root = dma_pte_addr(*p); + unmap_vtd_domain_page(p); +- if ( pgd_maddr == 0 ) ++ if ( !root ) + goto nomem; + } + +- context_set_address_root(lctxt, pgd_maddr); ++ context_set_address_root(lctxt, root); + if ( ats_enabled && ecap_dev_iotlb(iommu->ecap) ) + context_set_translation_type(lctxt, CONTEXT_TT_DEV_IOTLB); + else +@@ -1598,15 +1602,21 @@ int domain_context_mapping_one( + unmap_vtd_domain_page(context_entries); + + if ( !seg && !rc ) +- rc = me_wifi_quirk(domain, bus, devfn, mode); ++ rc = me_wifi_quirk(domain, bus, devfn, domid, pgd_maddr, mode); + + if ( rc ) + { + if ( !prev_dom ) +- domain_context_unmap_one(domain, iommu, bus, devfn); ++ domain_context_unmap_one(domain, iommu, bus, devfn, ++ domain->domain_id); + else if ( prev_dom != domain ) /* Avoid infinite recursion. */ ++ { ++ hd = dom_iommu(prev_dom); + domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev, ++ domain->domain_id, ++ hd->arch.pgd_maddr, + mode & MAP_WITH_RMRR); ++ } + } + + if ( prev_dom ) +@@ -1623,6 +1633,7 @@ static int domain_context_mapping(struct + { + struct acpi_drhd_unit *drhd; + const struct acpi_rmrr_unit *rmrr; ++ paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr; + int ret = 0; + unsigned int i, mode = 0; + uint16_t seg = pdev->seg, bdf; +@@ -1678,7 +1689,8 @@ static int domain_context_mapping(struct + domain->domain_id, seg, bus, + PCI_SLOT(devfn), PCI_FUNC(devfn)); + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pdev, mode); ++ pdev, domain->domain_id, pgd_maddr, ++ mode); + if ( ret > 0 ) + ret = 0; + if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) +@@ -1693,7 +1705,8 @@ static int domain_context_mapping(struct + PCI_SLOT(devfn), PCI_FUNC(devfn)); + + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pdev, mode); ++ pdev, domain->domain_id, pgd_maddr, ++ mode); + if ( ret < 0 ) + break; + prev_present = ret; +@@ -1719,7 +1732,8 @@ static int domain_context_mapping(struct + */ + if ( ret >= 0 ) + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- NULL, mode); ++ NULL, domain->domain_id, pgd_maddr, ++ mode); + + /* + * Devices behind PCIe-to-PCI/PCIx bridge may generate different +@@ -1734,7 +1748,8 @@ static int domain_context_mapping(struct + if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE && + (secbus != pdev->bus || pdev->devfn != 0) ) + ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0, +- NULL, mode); ++ NULL, domain->domain_id, pgd_maddr, ++ mode); + + if ( ret ) + { +@@ -1763,7 +1778,7 @@ static int domain_context_mapping(struct + int domain_context_unmap_one( + struct domain *domain, + struct iommu *iommu, +- u8 bus, u8 devfn) ++ uint8_t bus, uint8_t devfn, domid_t domid) + { + struct context_entry *context, *context_entries; + u64 maddr; +@@ -1821,7 +1836,7 @@ int domain_context_unmap_one( + unmap_vtd_domain_page(context_entries); + + if ( !iommu->intel->drhd->segment && !rc ) +- rc = me_wifi_quirk(domain, bus, devfn, UNMAP_ME_PHANTOM_FUNC); ++ rc = me_wifi_quirk(domain, bus, devfn, domid, 0, UNMAP_ME_PHANTOM_FUNC); + + return rc; + } +@@ -1860,7 +1875,8 @@ static int domain_context_unmap(struct d + printk(VTDPREFIX "d%d:PCIe: unmap %04x:%02x:%02x.%u\n", + domain->domain_id, seg, bus, + 
PCI_SLOT(devfn), PCI_FUNC(devfn)); +- ret = domain_context_unmap_one(domain, iommu, bus, devfn); ++ ret = domain_context_unmap_one(domain, iommu, bus, devfn, ++ domain->domain_id); + if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) + disable_ats_device(pdev); + +@@ -1870,7 +1886,8 @@ static int domain_context_unmap(struct d + if ( iommu_debug ) + printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n", + domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); +- ret = domain_context_unmap_one(domain, iommu, bus, devfn); ++ ret = domain_context_unmap_one(domain, iommu, bus, devfn, ++ domain->domain_id); + if ( ret ) + break; + +@@ -1882,14 +1899,17 @@ static int domain_context_unmap(struct d + /* PCIe to PCI/PCIx bridge */ + if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE ) + { +- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn); ++ ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, ++ domain->domain_id); + if ( ret ) + return ret; + +- ret = domain_context_unmap_one(domain, iommu, secbus, 0); ++ ret = domain_context_unmap_one(domain, iommu, secbus, 0, ++ domain->domain_id); + } + else /* Legacy PCI bridge */ +- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn); ++ ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, ++ domain->domain_id); + + break; + +--- a/xen/drivers/passthrough/vtd/quirks.c ++++ b/xen/drivers/passthrough/vtd/quirks.c +@@ -331,6 +331,8 @@ void __init platform_quirks_init(void) + + static int __must_check map_me_phantom_function(struct domain *domain, + unsigned int dev, ++ domid_t domid, ++ paddr_t pgd_maddr, + unsigned int mode) + { + struct acpi_drhd_unit *drhd; +@@ -344,16 +346,17 @@ static int __must_check map_me_phantom_f + /* map or unmap ME phantom function */ + if ( !(mode & UNMAP_ME_PHANTOM_FUNC) ) + rc = domain_context_mapping_one(domain, drhd->iommu, 0, +- PCI_DEVFN(dev, 7), NULL, mode); ++ PCI_DEVFN(dev, 7), NULL, ++ domid, pgd_maddr, mode); + else + rc = domain_context_unmap_one(domain, drhd->iommu, 0, +- PCI_DEVFN(dev, 7)); ++ PCI_DEVFN(dev, 7), domid); + + return rc; + } + + int me_wifi_quirk(struct domain *domain, uint8_t bus, uint8_t devfn, +- unsigned int mode) ++ domid_t domid, paddr_t pgd_maddr, unsigned int mode) + { + u32 id; + int rc = 0; +@@ -377,7 +380,7 @@ int me_wifi_quirk(struct domain *domain, + case 0x423b8086: + case 0x423c8086: + case 0x423d8086: +- rc = map_me_phantom_function(domain, 3, mode); ++ rc = map_me_phantom_function(domain, 3, domid, pgd_maddr, mode); + break; + default: + break; +@@ -403,7 +406,7 @@ int me_wifi_quirk(struct domain *domain, + case 0x42388086: /* Puma Peak */ + case 0x422b8086: + case 0x422c8086: +- rc = map_me_phantom_function(domain, 22, mode); ++ rc = map_me_phantom_function(domain, 22, domid, pgd_maddr, mode); + break; + default: + break; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-07.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-07.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-07.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-07.patch 2022-06-06 07:52:05.000000000 +0100 @@ -0,0 +1,126 @@ +From: Jan Beulich +Subject: VT-d: prepare for per-device quarantine page tables (part II) + +Replace the passing of struct domain * by domid_t in preparation of +per-device quarantine page tables also requiring per-device pseudo +domain IDs, which aren't going to be associated with any struct domain +instances. 
+ +No functional change intended (except for slightly adjusted log message +text). + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Reviewed-by: Kevin Tian +Reviewed-by: Roger Pau Monné + +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -52,8 +52,8 @@ static struct tasklet vtd_fault_tasklet; + static int setup_hwdom_device(u8 devfn, struct pci_dev *); + static void setup_hwdom_rmrr(struct domain *d); + +-static int domain_iommu_domid(struct domain *d, +- struct iommu *iommu) ++static int get_iommu_did(domid_t domid, const struct iommu *iommu, ++ bool warn) + { + unsigned long nr_dom, i; + +@@ -61,23 +61,24 @@ static int domain_iommu_domid(struct dom + i = find_first_bit(iommu->domid_bitmap, nr_dom); + while ( i < nr_dom ) + { +- if ( iommu->domid_map[i] == d->domain_id ) ++ if ( iommu->domid_map[i] == domid ) + return i; + + i = find_next_bit(iommu->domid_bitmap, nr_dom, i+1); + } + +- dprintk(XENLOG_ERR VTDPREFIX, +- "Cannot get valid iommu domid: domid=%d iommu->index=%d\n", +- d->domain_id, iommu->index); ++ if ( warn ) ++ dprintk(XENLOG_ERR VTDPREFIX, ++ "No valid iommu %u domid for Dom%d\n", ++ iommu->index, domid); ++ + return -1; + } + + #define DID_FIELD_WIDTH 16 + #define DID_HIGH_OFFSET 8 + static int context_set_domain_id(struct context_entry *context, +- struct domain *d, +- struct iommu *iommu) ++ domid_t domid, struct iommu *iommu) + { + unsigned long nr_dom, i; + int found = 0; +@@ -88,7 +89,7 @@ static int context_set_domain_id(struct + i = find_first_bit(iommu->domid_bitmap, nr_dom); + while ( i < nr_dom ) + { +- if ( iommu->domid_map[i] == d->domain_id ) ++ if ( iommu->domid_map[i] == domid ) + { + found = 1; + break; +@@ -104,7 +105,7 @@ static int context_set_domain_id(struct + dprintk(XENLOG_ERR VTDPREFIX, "IOMMU: no free domain ids\n"); + return -EFAULT; + } +- iommu->domid_map[i] = d->domain_id; ++ iommu->domid_map[i] = domid; + } + + set_bit(i, iommu->domid_bitmap); +@@ -131,9 +132,9 @@ static void __init free_intel_iommu(stru + xfree(intel); + } + +-static void cleanup_domid_map(struct domain *domain, struct iommu *iommu) ++static void cleanup_domid_map(domid_t domid, struct iommu *iommu) + { +- int iommu_domid = domain_iommu_domid(domain, iommu); ++ int iommu_domid = get_iommu_did(domid, iommu, false); + + if ( iommu_domid >= 0 ) + { +@@ -189,7 +190,7 @@ static void check_cleanup_domid_map(stru + if ( !found ) + { + clear_bit(iommu->index, &dom_iommu(d)->arch.iommu_bitmap); +- cleanup_domid_map(d, iommu); ++ cleanup_domid_map(d->domain_id, iommu); + } + } + +@@ -670,7 +671,7 @@ static int __must_check iommu_flush_iotl + continue; + + flush_dev_iotlb = !!find_ats_dev_drhd(iommu); +- iommu_domid= domain_iommu_domid(d, iommu); ++ iommu_domid = get_iommu_did(d->domain_id, iommu, !d->is_dying); + if ( iommu_domid == -1 ) + continue; + +@@ -1495,7 +1496,7 @@ int domain_context_mapping_one( + spin_unlock(&hd->arch.mapping_lock); + } + +- rc = context_set_domain_id(&lctxt, domain, iommu); ++ rc = context_set_domain_id(&lctxt, domid, iommu); + if ( rc ) + { + unlock: +@@ -1803,7 +1804,7 @@ int domain_context_unmap_one( + context_clear_entry(*context); + iommu_sync_cache(context, sizeof(struct context_entry)); + +- iommu_domid= domain_iommu_domid(domain, iommu); ++ iommu_domid = get_iommu_did(domid, iommu, !domain->is_dying); + if ( iommu_domid == -1 ) + { + spin_unlock(&iommu->lock); diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-08.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-08.patch --- 
xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-08.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-08.patch 2022-06-13 22:37:51.000000000 +0100 @@ -0,0 +1,440 @@ +From: Jan Beulich +Subject: IOMMU/x86: maintain a per-device pseudo domain ID + +In order to subsequently enable per-device quarantine page tables, we'll +need domain-ID-like identifiers to be inserted in the respective device +(AMD) or context (Intel) table entries alongside the per-device page +table root addresses. + +Make use of "real" domain IDs occupying only half of the value range +coverable by domid_t. + +Note that in VT-d's iommu_alloc() I didn't want to introduce new memory +leaks in case of error, but existing ones don't get plugged - that'll be +the subject of a later change. + +The VT-d changes are slightly asymmetric, but this way we can avoid +assigning pseudo domain IDs to devices which would never be mapped while +still avoiding to add a new parameter to domain_context_unmap(). + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Reviewed-by: Kevin Tian +Reviewed-by: Roger Pau Monné + +--- a/xen/include/asm-x86/iommu.h ++++ b/xen/include/asm-x86/iommu.h +@@ -112,6 +112,10 @@ int pi_update_irte(const struct pi_desc + ops->sync_cache(addr, size); \ + }) + ++unsigned long *iommu_init_domid(void); ++domid_t iommu_alloc_domid(unsigned long *map); ++void iommu_free_domid(domid_t domid, unsigned long *map); ++ + #endif /* !__ARCH_X86_IOMMU_H__ */ + /* + * Local variables: +--- a/xen/include/asm-x86/pci.h ++++ b/xen/include/asm-x86/pci.h +@@ -15,6 +15,12 @@ + + struct arch_pci_dev { + vmask_t used_vectors; ++ /* ++ * These fields are (de)initialized under pcidevs-lock. Other uses of ++ * them don't race (de)initialization and hence don't strictly need any ++ * locking. 
++ */ ++ domid_t pseudo_domid; + }; + + int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, +--- a/xen/include/asm-x86/amd-iommu.h ++++ b/xen/include/asm-x86/amd-iommu.h +@@ -97,6 +97,7 @@ struct amd_iommu { + struct ring_buffer cmd_buffer; + struct ring_buffer event_log; + struct ring_buffer ppr_log; ++ unsigned long *domid_map; + + int exclusion_enable; + int exclusion_allow_all; +--- a/xen/drivers/passthrough/amd/iommu_detect.c ++++ b/xen/drivers/passthrough/amd/iommu_detect.c +@@ -150,6 +150,11 @@ int __init amd_iommu_detect_one_acpi( + if ( rt ) + goto out; + ++ iommu->domid_map = iommu_init_domid(); ++ rt = -ENOMEM; ++ if ( !iommu->domid_map ) ++ goto out; ++ + rt = pci_ro_device(iommu->seg, bus, PCI_DEVFN(dev, func)); + if ( rt ) + printk(XENLOG_ERR +@@ -161,7 +166,10 @@ int __init amd_iommu_detect_one_acpi( + + out: + if ( rt ) ++ { ++ xfree(iommu->domid_map); + xfree(iommu); ++ } + + return rt; + } +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -567,6 +567,8 @@ static int amd_iommu_add_device(u8 devfn + { + struct amd_iommu *iommu; + u16 bdf; ++ bool fresh_domid = false; ++ int ret; + + if ( !pdev->domain ) + return -EINVAL; +@@ -591,7 +593,22 @@ static int amd_iommu_add_device(u8 devfn + return -ENODEV; + } + +- return amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); ++ if ( iommu_quarantine && pdev->arch.pseudo_domid == DOMID_INVALID ) ++ { ++ pdev->arch.pseudo_domid = iommu_alloc_domid(iommu->domid_map); ++ if ( pdev->arch.pseudo_domid == DOMID_INVALID ) ++ return -ENOSPC; ++ fresh_domid = true; ++ } ++ ++ ret = amd_iommu_setup_domain_device(pdev->domain, iommu, devfn, pdev); ++ if ( ret && fresh_domid ) ++ { ++ iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map); ++ pdev->arch.pseudo_domid = DOMID_INVALID; ++ } ++ ++ return ret; + } + + static int amd_iommu_remove_device(u8 devfn, struct pci_dev *pdev) +@@ -613,6 +630,10 @@ static int amd_iommu_remove_device(u8 de + } + + amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev); ++ ++ iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map); ++ pdev->arch.pseudo_domid = DOMID_INVALID; ++ + return 0; + } + +--- a/xen/drivers/passthrough/pci.c ++++ b/xen/drivers/passthrough/pci.c +@@ -314,6 +314,7 @@ static struct pci_dev *alloc_pdev(struct + *((u8*) &pdev->bus) = bus; + *((u8*) &pdev->devfn) = devfn; + pdev->domain = NULL; ++ pdev->arch.pseudo_domid = DOMID_INVALID; + INIT_LIST_HEAD(&pdev->msi_list); + + if ( pci_find_cap_offset(pseg->nr, bus, PCI_SLOT(devfn), PCI_FUNC(devfn), +@@ -1268,10 +1269,13 @@ static int _dump_pci_devices(struct pci_ + + list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list ) + { +- printk("%04x:%02x:%02x.%u - dom %-3d - node %-3d - MSIs < ", +- pseg->nr, pdev->bus, +- PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn), +- pdev->domain ? pdev->domain->domain_id : -1, ++ printk("%04x:%02x:%02x.%u - ", pseg->nr, pdev->bus, ++ PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn)); ++ if ( pdev->domain == dom_io ) ++ printk("DomIO:%x", pdev->arch.pseudo_domid); ++ else if ( pdev->domain ) ++ printk("Dom%d", pdev->domain->domain_id); ++ printk(" - node %-3d - MSIs < ", + (pdev->node != NUMA_NO_NODE) ? 
pdev->node : -1); + list_for_each_entry ( msi, &pdev->msi_list, list ) + printk("%d ", msi->irq); +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -22,6 +22,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -1228,7 +1229,7 @@ int __init iommu_alloc(struct acpi_drhd_ + { + struct iommu *iommu; + unsigned long sagaw, nr_dom; +- int agaw; ++ int agaw, rc; + + if ( nr_iommus > MAX_IOMMUS ) + { +@@ -1318,10 +1319,19 @@ int __init iommu_alloc(struct acpi_drhd_ + if ( !iommu->domid_map ) + return -ENOMEM ; + ++ iommu->pseudo_domid_map = iommu_init_domid(); ++ rc = -ENOMEM; ++ if ( !iommu->pseudo_domid_map ) ++ goto free; ++ + spin_lock_init(&iommu->lock); + spin_lock_init(&iommu->register_lock); + + return 0; ++ ++ free: ++ iommu_free(drhd); ++ return rc; + } + + void __init iommu_free(struct acpi_drhd_unit *drhd) +@@ -1344,6 +1354,7 @@ void __init iommu_free(struct acpi_drhd_ + + xfree(iommu->domid_bitmap); + xfree(iommu->domid_map); ++ xfree(iommu->pseudo_domid_map); + + free_intel_iommu(iommu->intel); + if ( iommu->msi.irq >= 0 ) +@@ -1624,8 +1635,8 @@ int domain_context_mapping_one( + return rc ?: pdev && prev_dom; + } + +-static int domain_context_unmap(struct domain *d, uint8_t devfn, +- struct pci_dev *pdev); ++static const struct acpi_drhd_unit *domain_context_unmap( ++ struct domain *d, uint8_t devfn, struct pci_dev *pdev); + + static int domain_context_mapping(struct domain *domain, u8 devfn, + struct pci_dev *pdev) +@@ -1633,6 +1644,7 @@ static int domain_context_mapping(struct + struct acpi_drhd_unit *drhd; + const struct acpi_rmrr_unit *rmrr; + paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr; ++ domid_t orig_domid = pdev->arch.pseudo_domid; + int ret = 0; + unsigned int i, mode = 0; + uint16_t seg = pdev->seg, bdf; +@@ -1683,6 +1695,14 @@ static int domain_context_mapping(struct + break; + + case DEV_TYPE_PCIe_ENDPOINT: ++ if ( iommu_quarantine && orig_domid == DOMID_INVALID ) ++ { ++ pdev->arch.pseudo_domid = ++ iommu_alloc_domid(drhd->iommu->pseudo_domid_map); ++ if ( pdev->arch.pseudo_domid == DOMID_INVALID ) ++ return -ENOSPC; ++ } ++ + if ( iommu_debug ) + printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n", + domain->domain_id, seg, bus, +@@ -1698,6 +1718,14 @@ static int domain_context_mapping(struct + break; + + case DEV_TYPE_PCI: ++ if ( iommu_quarantine && orig_domid == DOMID_INVALID ) ++ { ++ pdev->arch.pseudo_domid = ++ iommu_alloc_domid(drhd->iommu->pseudo_domid_map); ++ if ( pdev->arch.pseudo_domid == DOMID_INVALID ) ++ return -ENOSPC; ++ } ++ + if ( iommu_debug ) + printk(VTDPREFIX "d%d:PCI: map %04x:%02x:%02x.%u\n", + domain->domain_id, seg, bus, +@@ -1771,6 +1799,13 @@ static int domain_context_mapping(struct + if ( !ret && devfn == pdev->devfn ) + pci_vtd_quirk(pdev); + ++ if ( ret && drhd && orig_domid == DOMID_INVALID ) ++ { ++ iommu_free_domid(pdev->arch.pseudo_domid, ++ drhd->iommu->pseudo_domid_map); ++ pdev->arch.pseudo_domid = DOMID_INVALID; ++ } ++ + return ret; + } + +@@ -1840,8 +1875,10 @@ int domain_context_unmap_one( + return rc; + } + +-static int domain_context_unmap(struct domain *domain, u8 devfn, +- struct pci_dev *pdev) ++static const struct acpi_drhd_unit *domain_context_unmap( ++ struct domain *domain, ++ uint8_t devfn, ++ struct pci_dev *pdev) + { + struct acpi_drhd_unit *drhd; + struct iommu *iommu; +@@ -1850,7 +1887,7 @@ static int domain_context_unmap(struct d + + drhd = acpi_find_matched_drhd_unit(pdev); + if ( !drhd ) +- return -ENODEV; ++ return 
ERR_PTR(-ENODEV); + iommu = drhd->iommu; + + switch ( pdev->type ) +@@ -1861,7 +1898,7 @@ static int domain_context_unmap(struct d + domain->domain_id, seg, bus, + PCI_SLOT(devfn), PCI_FUNC(devfn)); + if ( !is_hardware_domain(domain) ) +- return -EPERM; ++ return ERR_PTR(-EPERM); + goto out; + + case DEV_TYPE_PCIe_BRIDGE: +@@ -1900,11 +1937,9 @@ static int domain_context_unmap(struct d + { + ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, + domain->domain_id); +- if ( ret ) +- return ret; +- +- ret = domain_context_unmap_one(domain, iommu, secbus, 0, +- domain->domain_id); ++ if ( !ret ) ++ ret = domain_context_unmap_one(domain, iommu, secbus, 0, ++ domain->domain_id); + } + else /* Legacy PCI bridge */ + ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, +@@ -1924,7 +1959,7 @@ static int domain_context_unmap(struct d + check_cleanup_domid_map(domain, pdev, iommu); + + out: +- return ret; ++ return ret ? ERR_PTR(ret) : drhd; + } + + static void iommu_domain_teardown(struct domain *d) +@@ -2100,16 +2135,17 @@ static int intel_iommu_enable_device(str + + static int intel_iommu_remove_device(u8 devfn, struct pci_dev *pdev) + { ++ const struct acpi_drhd_unit *drhd; + struct acpi_rmrr_unit *rmrr; + u16 bdf; +- int ret, i; ++ unsigned int i; + + if ( !pdev->domain ) + return -EINVAL; + +- ret = domain_context_unmap(pdev->domain, devfn, pdev); +- if ( ret ) +- return ret; ++ drhd = domain_context_unmap(pdev->domain, devfn, pdev); ++ if ( IS_ERR(drhd) ) ++ return PTR_ERR(drhd); + + for_each_rmrr_device ( rmrr, bdf, i ) + { +@@ -2126,6 +2162,13 @@ static int intel_iommu_remove_device(u8 + rmrr->end_address, 0); + } + ++ if ( drhd ) ++ { ++ iommu_free_domid(pdev->arch.pseudo_domid, ++ drhd->iommu->pseudo_domid_map); ++ pdev->arch.pseudo_domid = DOMID_INVALID; ++ } ++ + return 0; + } + +--- a/xen/drivers/passthrough/vtd/iommu.h ++++ b/xen/drivers/passthrough/vtd/iommu.h +@@ -538,6 +538,7 @@ struct iommu { + struct msi_desc msi; + struct intel_iommu *intel; + struct list_head ats_devices; ++ unsigned long *pseudo_domid_map; /* "pseudo" domain id bitmap */ + unsigned long *domid_bitmap; /* domain id bitmap */ + u16 *domid_map; /* domain id mapping array */ + }; +--- a/xen/drivers/passthrough/x86/iommu.c ++++ b/xen/drivers/passthrough/x86/iommu.c +@@ -246,6 +246,53 @@ void arch_iommu_domain_destroy(struct domain *d) + } + } + ++unsigned long *__init iommu_init_domid(void) ++{ ++ if ( !iommu_quarantine ) ++ return ZERO_BLOCK_PTR; ++ ++ BUILD_BUG_ON(DOMID_MASK * 2U >= UINT16_MAX); ++ ++ return xzalloc_array(unsigned long, ++ BITS_TO_LONGS(UINT16_MAX - DOMID_MASK)); ++} ++ ++domid_t iommu_alloc_domid(unsigned long *map) ++{ ++ /* ++ * This is used uniformly across all IOMMUs, such that on typical ++ * systems we wouldn't re-use the same ID very quickly (perhaps never). 
++ */ ++ static unsigned int start; ++ unsigned int idx = find_next_zero_bit(map, UINT16_MAX - DOMID_MASK, start); ++ ++ ASSERT(pcidevs_locked()); ++ ++ if ( idx >= UINT16_MAX - DOMID_MASK ) ++ idx = find_first_zero_bit(map, UINT16_MAX - DOMID_MASK); ++ if ( idx >= UINT16_MAX - DOMID_MASK ) ++ return DOMID_INVALID; ++ ++ __set_bit(idx, map); ++ ++ start = idx + 1; ++ ++ return idx | (DOMID_MASK + 1); ++} ++ ++void iommu_free_domid(domid_t domid, unsigned long *map) ++{ ++ ASSERT(pcidevs_locked()); ++ ++ if ( domid == DOMID_INVALID ) ++ return; ++ ++ ASSERT(domid > DOMID_MASK); ++ ++ if ( !__test_and_clear_bit(domid & DOMID_MASK, map) ) ++ BUG(); ++} ++ + /* + * Local variables: + * mode: C +--- a/xen/include/public/xen.h ++++ b/xen/include/public/xen.h +@@ -584,6 +584,9 @@ DEFINE_XEN_GUEST_HANDLE(mmuext_op_t); + /* Idle domain. */ + #define DOMID_IDLE xen_mk_uint(0x7FFF) + ++/* Mask for valid domain id values */ ++#define DOMID_MASK xen_mk_uint(0x7FFF) ++ + #ifndef __ASSEMBLY__ + + typedef uint16_t domid_t; diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-09.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-09.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-09.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-09.patch 2022-06-14 12:21:14.000000000 +0100 @@ -0,0 +1,48 @@ +From: Jan Beulich +Subject: IOMMU/x86: drop TLB flushes from quarantine_init() hooks + +The page tables just created aren't hooked up yet anywhere, so there's +nothing that could be present in any TLB, and hence nothing to flush. +Dropping this flush is, at least on the VT-d side, a prereq to per- +device domain ID use when quarantining devices, as dom_io isn't going +to be assigned a DID anymore: The warning in get_iommu_did() would +trigger. + +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Reviewed-by: Roger Pau Monné +Reviewed-by: Kevin Tian + +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -943,8 +943,6 @@ int __init amd_iommu_quarantine_init(str + out: + spin_unlock(&hd->arch.mapping_lock); + +- amd_iommu_flush_all_pages(d); +- + /* Pages leaked in failure case */ + return level ? -ENOMEM : 0; + } +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -2804,7 +2804,6 @@ static int __init intel_iommu_quarantine + struct dma_pte *parent; + unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH); + unsigned int level = agaw_to_level(agaw); +- int rc; + + if ( hd->arch.pgd_maddr ) + { +@@ -2905,10 +2904,8 @@ static int __init intel_iommu_quarantine + out: + spin_unlock(&hd->arch.mapping_lock); + +- rc = iommu_flush_iotlb_all(d); +- + /* Pages leaked in failure case */ +- return level ? -ENOMEM : rc; ++ return level ? -ENOMEM : 0; + } + + const struct iommu_ops intel_iommu_ops = { diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-10.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-10.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-10.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-10.patch 2022-06-14 12:31:12.000000000 +0100 @@ -0,0 +1,41 @@ +From: Jan Beulich +Subject: AMD/IOMMU: abstract maximum number of page table levels + +We will want to use the constant elsewhere. 
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant + +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +@@ -183,7 +183,7 @@ static inline int amd_iommu_get_paging_m + while ( max_frames > PTE_PER_TABLE_SIZE ) + { + max_frames = PTE_PER_TABLE_ALIGN(max_frames) >> PTE_PER_TABLE_SHIFT; +- if ( ++level > 6 ) ++ if ( ++level > IOMMU_MAX_PT_LEVELS ) + return -ENOMEM; + } + +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-defs.h +@@ -115,6 +115,8 @@ + #define IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_MASK 0xFFFFF000 + #define IOMMU_DEV_TABLE_PAGE_TABLE_PTR_LOW_SHIFT 12 + ++#define IOMMU_MAX_PT_LEVELS 6 ++ + /* DeviceTable Entry[63:32] */ + #define IOMMU_DEV_TABLE_GV_SHIFT 23 + #define IOMMU_DEV_TABLE_GV_MASK 0x800000 +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -477,7 +477,7 @@ static int iommu_pde_from_dfn(struct dom + table = hd->arch.root_table; + level = hd->arch.paging_mode; + +- BUG_ON( table == NULL || level < 1 || level > 6 ); ++ BUG_ON( table == NULL || level < 1 || level > IOMMU_MAX_PT_LEVELS ); + + /* + * A frame number past what the current page tables can represent can't diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-11.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-11.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-11.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa400-4.12-11.patch 2022-06-16 00:58:02.000000000 +0100 @@ -0,0 +1,891 @@ +From: Jan Beulich +Subject: IOMMU/x86: use per-device page tables for quarantining + +Devices with RMRRs / unity mapped regions, due to it being unspecified +how/when these memory regions may be accessed, may not be left +disconnected from the mappings of these regions (as long as it's not +certain that the device has been fully quiesced). Hence even the page +tables used when quarantining such devices need to have mappings of +those regions. This implies installing page tables in the first place +even when not in scratch-page quarantining mode. + +This is CVE-2022-26361 / part of XSA-400. + +While for the purpose here it would be sufficient to have devices with +RMRRs / unity mapped regions use per-device page tables, extend this to +all devices (in scratch-page quarantining mode). This allows the leaf +pages to be mapped r/w, thus covering also memory writes (rather than +just reads) issued by non-quiescent devices. + +Set up quarantine page tables as late as possible, yet early enough to +not encounter failure during de-assign. This means setup generally +happens in assign_device(), while (for now) the one in deassign_device() +is there mainly to be on the safe side. + +In VT-d's DID allocation function don't require the IOMMU lock to be +held anymore: All involved code paths hold pcidevs_lock, so this way we +avoid the need to acquire the IOMMU lock around the new call to +context_set_domain_id(). 
+ +Signed-off-by: Jan Beulich +Reviewed-by: Paul Durrant +Reviewed-by: Kevin Tian +Reviewed-by: Roger Pau Monné + +--- a/xen/arch/x86/mm/p2m.c ++++ b/xen/arch/x86/mm/p2m.c +@@ -1239,7 +1239,7 @@ int set_identity_p2m_entry(struct domain + struct p2m_domain *p2m = p2m_get_hostp2m(d); + int ret; + +- if ( !paging_mode_translate(p2m->domain) ) ++ if ( !paging_mode_translate(d) ) + { + if ( !need_iommu(d) ) + return 0; +--- a/xen/include/asm-x86/pci.h ++++ b/xen/include/asm-x86/pci.h +@@ -1,6 +1,8 @@ + #ifndef __X86_PCI_H__ + #define __X86_PCI_H__ + ++#include ++ + #define CF8_BDF(cf8) ( ((cf8) & 0x00ffff00) >> 8) + #define CF8_ADDR_LO(cf8) ( (cf8) & 0x000000fc) + #define CF8_ADDR_HI(cf8) ( ((cf8) & 0x0f000000) >> 16) +@@ -20,7 +22,18 @@ struct arch_pci_dev { + * them don't race (de)initialization and hence don't strictly need any + * locking. + */ ++ union { ++ /* Subset of struct arch_iommu's fields, to be used in dom_io. */ ++ struct { ++ uint64_t pgd_maddr; ++ } vtd; ++ struct { ++ struct page_info *root_table; ++ } amd; ++ }; + domid_t pseudo_domid; ++ mfn_t leaf_mfn; ++ struct page_list_head pgtables_list; + }; + + int pci_conf_write_intercept(unsigned int seg, unsigned int bdf, +--- a/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h ++++ b/xen/include/asm-x86/hvm/svm/amd-iommu-proto.h +@@ -52,7 +52,8 @@ + int amd_iommu_update_ivrs_mapping_acpi(void); + +-int amd_iommu_quarantine_init(struct domain *d); ++int amd_iommu_quarantine_init(struct pci_dev *pdev); ++void amd_iommu_quarantine_teardown(struct pci_dev *pdev); + + /* mapping functions */ + int __must_check amd_iommu_map_page(struct domain *d, unsigned long gfn, + unsigned long mfn, unsigned int flags); +--- a/xen/drivers/passthrough/amd/iommu_map.c ++++ b/xen/drivers/passthrough/amd/iommu_map.c +@@ -883,62 +883,148 @@ void amd_iommu_share_p2m(struct domain * + } + } + +-int __init amd_iommu_quarantine_init(struct domain *d) ++static int fill_qpt(uint64_t *this, unsigned int level, ++ struct page_info *pgs[IOMMU_MAX_PT_LEVELS], ++ struct pci_dev *pdev) + { +- struct domain_iommu *hd = dom_iommu(d); ++ unsigned int i; ++ int rc = 0; ++ ++ for ( i = 0; !rc && i < PTE_PER_TABLE_SIZE; ++i ) ++ { ++ uint32_t *pte = (uint32_t *)&this[i]; ++ uint64_t *next; ++ ++ if ( !get_field_from_reg_u32(pte[0], IOMMU_PTE_PRESENT_MASK, ++ IOMMU_PTE_PRESENT_SHIFT) ) ++ { ++ if ( !pgs[level] ) ++ { ++ /* ++ * The pgtable allocator is fine for the leaf page, as well as ++ * page table pages, and the resulting allocations are always ++ * zeroed. ++ */ ++ pgs[level] = alloc_amd_iommu_pgtable(); ++ if ( !pgs[level] ) ++ { ++ rc = -ENOMEM; ++ break; ++ } ++ ++ page_list_add(pgs[level], &pdev->arch.pgtables_list); ++ ++ if ( level ) ++ { ++ next = __map_domain_page(pgs[level]); ++ rc = fill_qpt(next, level - 1, pgs, pdev); ++ unmap_domain_page(next); ++ } ++ } ++ ++ /* ++ * PDEs are essentially a subset of PTEs, so this function ++ * is fine to use even at the leaf. 
++ */ ++ set_iommu_pde_present(pte, mfn_x(page_to_mfn(pgs[level])), level, ++ true, true); ++ } ++ else if ( level && ++ get_field_from_reg_u32(pte[0], ++ IOMMU_PDE_NEXT_LEVEL_MASK, ++ IOMMU_PDE_NEXT_LEVEL_SHIFT) ) ++ { ++ paddr_t addr_hi = get_field_from_reg_u32(pte[1], ++ IOMMU_PTE_ADDR_HIGH_MASK, ++ IOMMU_PTE_ADDR_HIGH_SHIFT); ++ paddr_t addr_lo = get_field_from_reg_u32(pte[0], ++ IOMMU_PTE_ADDR_LOW_MASK, ++ IOMMU_PTE_ADDR_LOW_SHIFT); ++ unsigned long mfn = (addr_hi << (32 - PAGE_SHIFT)) | addr_lo; ++ ++ page_list_add(mfn_to_page(_mfn(mfn)), &pdev->arch.pgtables_list); ++ next = map_domain_page(_mfn(mfn)); ++ rc = fill_qpt(next, level - 1, pgs, pdev); ++ unmap_domain_page(next); ++ } ++ } ++ ++ return rc; ++} ++ ++int amd_iommu_quarantine_init(struct pci_dev *pdev) ++{ ++ struct domain_iommu *hd = dom_iommu(dom_io); + unsigned long end_gfn = + 1ul << (DEFAULT_DOMAIN_ADDRESS_WIDTH - PAGE_SHIFT); + unsigned int level = amd_iommu_get_paging_mode(end_gfn); +- uint64_t *table; ++ unsigned int req_id = get_dma_requestor_id(pdev->seg, ++ PCI_BDF2(pdev->bus, pdev->devfn)); ++ const struct ivrs_mappings *ivrs_mappings = get_ivrs_mappings(pdev->seg); ++ int rc; + +- if ( hd->arch.root_table ) ++ ASSERT(pcidevs_locked()); ++ ASSERT(!hd->arch.root_table); ++ ++ ASSERT(pdev->arch.pseudo_domid != DOMID_INVALID); ++ ++ if ( pdev->arch.amd.root_table ) + { +- ASSERT_UNREACHABLE(); ++ clear_domain_page(pdev->arch.leaf_mfn); + return 0; + } + +- spin_lock(&hd->arch.mapping_lock); +- +- hd->arch.root_table = alloc_amd_iommu_pgtable(); +- if ( !hd->arch.root_table ) +- goto out; +- +- table = __map_domain_page(hd->arch.root_table); +- while ( level ) ++ pdev->arch.amd.root_table = alloc_amd_iommu_pgtable(); ++ if ( !pdev->arch.amd.root_table ) ++ return -ENOMEM; ++ ++ /* Transiently install the root into DomIO, for iommu_identity_mapping(). */ ++ hd->arch.root_table = pdev->arch.amd.root_table; ++ ++ rc = amd_iommu_reserve_domain_unity_map(dom_io, ++ ivrs_mappings[req_id].unity_map, ++ 0); ++ ++ iommu_identity_map_teardown(dom_io); ++ hd->arch.root_table = NULL; ++ ++ if ( rc ) ++ printk("%04x:%02x:%02x.%u: quarantine unity mapping failed\n", ++ pdev->seg, pdev->bus, ++ PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn)); ++ else + { +- struct page_info *pg; +- unsigned int i; ++ uint64_t *root; ++ struct page_info *pgs[IOMMU_MAX_PT_LEVELS] = {}; + +- /* +- * The pgtable allocator is fine for the leaf page, as well as +- * page table pages, and the resulting allocations are always +- * zeroed. +- */ +- pg = alloc_amd_iommu_pgtable(); +- if ( !pg ) +- break; ++ spin_lock(&hd->arch.mapping_lock); + +- for ( i = 0; i < PTE_PER_TABLE_SIZE; i++ ) +- { +- uint32_t *pde = (uint32_t *)&table[i]; ++ root = __map_domain_page(pdev->arch.amd.root_table); ++ rc = fill_qpt(root, level - 1, pgs, pdev); ++ unmap_domain_page(root); + +- /* +- * PDEs are essentially a subset of PTEs, so this function +- * is fine to use even at the leaf. 
+- */ +- set_iommu_pde_present(pde, mfn_x(page_to_mfn(pg)), level - 1, +- false, true); +- } ++ pdev->arch.leaf_mfn = page_to_mfn(pgs[0]); + +- unmap_domain_page(table); +- table = __map_domain_page(pg); +- level--; ++ spin_unlock(&hd->arch.mapping_lock); + } +- unmap_domain_page(table); + +- out: +- spin_unlock(&hd->arch.mapping_lock); ++ if ( rc ) ++ amd_iommu_quarantine_teardown(pdev); ++ ++ return rc; ++} ++ ++void amd_iommu_quarantine_teardown(struct pci_dev *pdev) ++{ ++ struct page_info *pg; ++ ++ ASSERT(pcidevs_locked()); ++ ++ if ( !pdev->arch.amd.root_table ) ++ return; ++ ++ while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) ) ++ free_amd_iommu_pgtable(pg); + +- /* Pages leaked in failure case */ +- return level ? -ENOMEM : 0; ++ pdev->arch.amd.root_table = NULL; + } +--- a/xen/drivers/passthrough/amd/pci_amd_iommu.c ++++ b/xen/drivers/passthrough/amd/pci_amd_iommu.c +@@ -148,6 +148,8 @@ static int __must_check amd_iommu_setup_ + u8 bus = pdev->bus; + struct domain_iommu *hd = dom_iommu(domain); + const struct ivrs_mappings *ivrs_dev; ++ const struct page_info *root_pg; ++ domid_t domid; + + BUG_ON(!hd->arch.paging_mode || !iommu->dev_table.buffer); + +@@ -170,14 +172,25 @@ static int __must_check amd_iommu_setup_ + dte = iommu->dev_table.buffer + (req_id * IOMMU_DEV_TABLE_ENTRY_SIZE); + ivrs_dev = &get_ivrs_mappings(iommu->seg)[req_id]; + ++ if ( domain != dom_io ) ++ { ++ root_pg = hd->arch.root_table; ++ domid = domain->domain_id; ++ } ++ else ++ { ++ root_pg = pdev->arch.amd.root_table; ++ domid = pdev->arch.pseudo_domid; ++ } ++ + spin_lock_irqsave(&iommu->lock, flags); + + if ( !is_translation_valid((u32 *)dte) ) + { + /* bind DTE to domain page-tables */ + rc = amd_iommu_set_root_page_table( +- dte, page_to_maddr(hd->arch.root_table), +- domain->domain_id, hd->arch.paging_mode, sr_flags); ++ dte, page_to_maddr(root_pg), domid, ++ hd->arch.paging_mode, sr_flags); + if ( rc ) + { + ASSERT(rc < 0); +@@ -191,8 +204,7 @@ static int __must_check amd_iommu_setup_ + + amd_iommu_flush_device(iommu, req_id); + } +- else if ( amd_iommu_get_root_page_table(dte) != +- page_to_maddr(hd->arch.root_table) ) ++ else if ( amd_iommu_get_root_page_table(dte) != page_to_maddr(root_pg) ) + { + /* + * Strictly speaking if the device is the only one with this requestor +@@ -205,8 +217,8 @@ static int __must_check amd_iommu_setup_ + rc = -EOPNOTSUPP; + else + rc = amd_iommu_set_root_page_table( +- dte, page_to_maddr(hd->arch.root_table), +- domain->domain_id, hd->arch.paging_mode, sr_flags); ++ dte, page_to_maddr(root_pg), domid, ++ hd->arch.paging_mode, sr_flags); + if ( rc < 0 ) + { + spin_unlock_irqrestore(&iommu->lock, flags); +@@ -225,6 +237,7 @@ static int __must_check amd_iommu_setup_ + * intended anyway. 
+ */ + !pdev->domain->is_dying && ++ pdev->domain != dom_io && + (any_pdev_behind_iommu(pdev->domain, pdev, iommu) || + pdev->phantom_stride) ) + printk(" %04x:%02x:%02x.%u: reassignment may cause %pd data corruption\n", +@@ -245,9 +258,8 @@ static int __must_check amd_iommu_setup_ + AMD_IOMMU_DEBUG("Setup I/O page table: device id = %#x, type = %#x, " + "root table = %#"PRIx64", " + "domain = %d, paging mode = %d\n", +- req_id, pdev->type, +- page_to_maddr(hd->arch.root_table), +- domain->domain_id, hd->arch.paging_mode); ++ req_id, pdev->type, page_to_maddr(root_pg), ++ domid, hd->arch.paging_mode); + + ASSERT(pcidevs_locked()); + +@@ -292,7 +304,7 @@ int __init amd_iov_detect(void) + + int amd_iommu_alloc_root(struct domain_iommu *hd) + { +- if ( unlikely(!hd->arch.root_table) ) ++ if ( unlikely(!hd->arch.root_table) && hd != dom_iommu(dom_io) ) + { + hd->arch.root_table = alloc_amd_iommu_pgtable(); + if ( !hd->arch.root_table ) +@@ -402,7 +414,10 @@ void amd_iommu_disable_domain_device(str + + AMD_IOMMU_DEBUG("Disable: device id = %#x, " + "domain = %d, paging mode = %d\n", +- req_id, domain->domain_id, ++ req_id, ++ get_field_from_reg_u32(((uint32_t *)dte)[2], ++ IOMMU_DEV_TABLE_DOMAIN_ID_MASK, ++ IOMMU_DEV_TABLE_DOMAIN_ID_SHIFT), + dom_iommu(domain)->arch.paging_mode); + } + spin_unlock_irqrestore(&iommu->lock, flags); +@@ -631,6 +646,8 @@ static int amd_iommu_remove_device(u8 de + + amd_iommu_disable_domain_device(pdev->domain, iommu, devfn, pdev); + ++ amd_iommu_quarantine_teardown(pdev); ++ + iommu_free_domid(pdev->arch.pseudo_domid, iommu->domid_map); + pdev->arch.pseudo_domid = DOMID_INVALID; + +--- a/xen/drivers/passthrough/iommu.c ++++ b/xen/drivers/passthrough/iommu.c +@@ -380,19 +380,19 @@ int iommu_iotlb_flush_all(struct domain + return rc; + } + +-static int __init iommu_quarantine_init(void) ++int iommu_quarantine_dev_init(device_t *dev) + { + const struct domain_iommu *hd = dom_iommu(dom_io); +- int rc; +- +- rc = iommu_domain_init(dom_io); +- if ( rc ) +- return rc; + +- if ( !hd->platform_ops->quarantine_init ) ++ if ( !iommu_quarantine || !hd->platform_ops->quarantine_init ) + return 0; + +- return hd->platform_ops->quarantine_init(dom_io); ++ return hd->platform_ops->quarantine_init(dev); ++} ++ ++static int __init iommu_quarantine_init(void) ++{ ++ return iommu_domain_init(dom_io); + } + + int __init iommu_setup(void) +--- a/xen/drivers/passthrough/pci.c ++++ b/xen/drivers/passthrough/pci.c +@@ -1469,6 +1469,13 @@ static int assign_device(struct domain * + msixtbl_init(d); + } + ++ if ( pdev->domain != dom_io ) ++ { ++ rc = iommu_quarantine_dev_init(pci_to_dev(pdev)); ++ if ( rc ) ++ goto done; ++ } ++ + pdev->fault.count = 0; + + if ( (rc = hd->platform_ops->assign_device(d, devfn, pci_to_dev(pdev), flag)) ) +@@ -1515,9 +1522,16 @@ int deassign_device(struct domain *d, u1 + return -ENODEV; + + /* De-assignment from dom_io should de-quarantine the device */ +- target = ((pdev->quarantine || iommu_quarantine) && +- pdev->domain != dom_io) ? +- dom_io : hardware_domain; ++ if ( (pdev->quarantine || iommu_quarantine) && pdev->domain != dom_io ) ++ { ++ ret = iommu_quarantine_dev_init(pci_to_dev(pdev)); ++ if ( ret ) ++ return ret; ++ ++ target = dom_io; ++ } ++ else ++ target = hardware_domain; + + while ( pdev->phantom_stride ) + { +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ b/xen/drivers/passthrough/vtd/iommu.c +@@ -43,6 +43,12 @@ + #include "vtd.h" + #include "../ats.h" + ++#define DEVICE_DOMID(d, pdev) ((d) != dom_io ? 
(d)->domain_id \ ++ : (pdev)->arch.pseudo_domid) ++#define DEVICE_PGTABLE(d, pdev) ((d) != dom_io \ ++ ? dom_iommu(d)->arch.pgd_maddr \ ++ : (pdev)->arch.vtd.pgd_maddr) ++ + /* Possible unfiltered LAPIC/MSI messages from untrusted sources? */ + bool __read_mostly untrusted_msi; + +@@ -78,13 +84,18 @@ static int get_iommu_did(domid_t domid, + + #define DID_FIELD_WIDTH 16 + #define DID_HIGH_OFFSET 8 ++ ++/* ++ * This function may have "context" passed as NULL, to merely obtain a DID ++ * for "domid". ++ */ + static int context_set_domain_id(struct context_entry *context, + domid_t domid, struct iommu *iommu) + { + unsigned long nr_dom, i; + int found = 0; + +- ASSERT(spin_is_locked(&iommu->lock)); ++ ASSERT(pcidevs_locked()); + + nr_dom = cap_ndoms(iommu->cap); + i = find_first_bit(iommu->domid_bitmap, nr_dom); +@@ -110,8 +121,13 @@ static int context_set_domain_id(struct + } + + set_bit(i, iommu->domid_bitmap); +- context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET); +- context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET; ++ ++ if ( context ) ++ { ++ context->hi &= ~(((1 << DID_FIELD_WIDTH) - 1) << DID_HIGH_OFFSET); ++ context->hi |= (i & ((1 << DID_FIELD_WIDTH) - 1)) << DID_HIGH_OFFSET; ++ } ++ + return 0; + } + +@@ -179,8 +195,12 @@ static void check_cleanup_domid_map(stru + const struct pci_dev *exclude, + struct iommu *iommu) + { +- bool found = any_pdev_behind_iommu(d, exclude, iommu); ++ bool found; + ++ if ( d == dom_io ) ++ return; ++ ++ found = any_pdev_behind_iommu(d, exclude, iommu); + /* + * Hidden devices are associated with DomXEN but usable by the hardware + * domain. Hence they need considering here as well. +@@ -1441,7 +1461,7 @@ int domain_context_mapping_one( + domid = iommu->domid_map[prev_did]; + if ( domid < DOMID_FIRST_RESERVED ) + prev_dom = rcu_lock_domain_by_id(domid); +- else if ( domid == DOMID_IO ) ++ else if ( pdev ? domid == pdev->arch.pseudo_domid : domid > DOMID_MASK ) + prev_dom = rcu_lock_domain(dom_io); + if ( !prev_dom ) + { +@@ -1618,15 +1638,12 @@ int domain_context_mapping_one( + { + if ( !prev_dom ) + domain_context_unmap_one(domain, iommu, bus, devfn, +- domain->domain_id); ++ DEVICE_DOMID(domain, pdev)); + else if ( prev_dom != domain ) /* Avoid infinite recursion. 
*/ +- { +- hd = dom_iommu(prev_dom); + domain_context_mapping_one(prev_dom, iommu, bus, devfn, pdev, +- domain->domain_id, +- hd->arch.pgd_maddr, ++ DEVICE_DOMID(prev_dom, pdev), ++ DEVICE_PGTABLE(prev_dom, pdev), + mode & MAP_WITH_RMRR); +- } + } + + if ( prev_dom ) +@@ -1643,7 +1660,7 @@ static int domain_context_mapping(struct + { + struct acpi_drhd_unit *drhd; + const struct acpi_rmrr_unit *rmrr; +- paddr_t pgd_maddr = dom_iommu(domain)->arch.pgd_maddr; ++ paddr_t pgd_maddr = DEVICE_PGTABLE(domain, pdev); + domid_t orig_domid = pdev->arch.pseudo_domid; + int ret = 0; + unsigned int i, mode = 0; +@@ -1666,7 +1683,7 @@ static int domain_context_mapping(struct + break; + } + +- if ( domain != pdev->domain ) ++ if ( domain != pdev->domain && pdev->domain != dom_io ) + { + if ( pdev->domain->is_dying ) + mode |= MAP_OWNER_DYING; +@@ -1707,8 +1724,8 @@ static int domain_context_mapping(struct + printk(VTDPREFIX "d%d:PCIe: map %04x:%02x:%02x.%u\n", + domain->domain_id, seg, bus, + PCI_SLOT(devfn), PCI_FUNC(devfn)); +- ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pdev, domain->domain_id, pgd_maddr, ++ ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, pdev, ++ DEVICE_DOMID(domain, pdev), pgd_maddr, + mode); + if ( ret > 0 ) + ret = 0; +@@ -1732,8 +1749,8 @@ static int domain_context_mapping(struct + PCI_SLOT(devfn), PCI_FUNC(devfn)); + + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- pdev, domain->domain_id, pgd_maddr, +- mode); ++ pdev, DEVICE_DOMID(domain, pdev), ++ pgd_maddr, mode); + if ( ret < 0 ) + break; + prev_present = ret; +@@ -1759,8 +1776,8 @@ static int domain_context_mapping(struct + */ + if ( ret >= 0 ) + ret = domain_context_mapping_one(domain, drhd->iommu, bus, devfn, +- NULL, domain->domain_id, pgd_maddr, +- mode); ++ NULL, DEVICE_DOMID(domain, pdev), ++ pgd_maddr, mode); + + /* + * Devices behind PCIe-to-PCI/PCIx bridge may generate different +@@ -1775,8 +1792,8 @@ static int domain_context_mapping(struct + if ( !ret && pdev_type(seg, bus, devfn) == DEV_TYPE_PCIe2PCI_BRIDGE && + (secbus != pdev->bus || pdev->devfn != 0) ) + ret = domain_context_mapping_one(domain, drhd->iommu, secbus, 0, +- NULL, domain->domain_id, pgd_maddr, +- mode); ++ NULL, DEVICE_DOMID(domain, pdev), ++ pgd_maddr, mode); + + if ( ret ) + { +@@ -1912,7 +1929,7 @@ static const struct acpi_drhd_unit *doma + domain->domain_id, seg, bus, + PCI_SLOT(devfn), PCI_FUNC(devfn)); + ret = domain_context_unmap_one(domain, iommu, bus, devfn, +- domain->domain_id); ++ DEVICE_DOMID(domain, pdev)); + if ( !ret && devfn == pdev->devfn && ats_device(pdev, drhd) > 0 ) + disable_ats_device(pdev); + +@@ -1923,7 +1940,7 @@ static const struct acpi_drhd_unit *doma + printk(VTDPREFIX "d%d:PCI: unmap %04x:%02x:%02x.%u\n", + domain->domain_id, seg, bus, PCI_SLOT(devfn), PCI_FUNC(devfn)); + ret = domain_context_unmap_one(domain, iommu, bus, devfn, +- domain->domain_id); ++ DEVICE_DOMID(domain, pdev)); + if ( ret ) + break; + +@@ -1932,18 +1949,12 @@ static const struct acpi_drhd_unit *doma + if ( find_upstream_bridge(seg, &tmp_bus, &tmp_devfn, &secbus) < 1 ) + break; + ++ ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, ++ DEVICE_DOMID(domain, pdev)); + /* PCIe to PCI/PCIx bridge */ +- if ( pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE ) +- { +- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, +- domain->domain_id); +- if ( !ret ) +- ret = domain_context_unmap_one(domain, iommu, secbus, 0, +- domain->domain_id); +- } +- 
else /* Legacy PCI bridge */ +- ret = domain_context_unmap_one(domain, iommu, tmp_bus, tmp_devfn, +- domain->domain_id); ++ if ( !ret && pdev_type(seg, tmp_bus, tmp_devfn) == DEV_TYPE_PCIe2PCI_BRIDGE ) ++ ret = domain_context_unmap_one(domain, iommu, secbus, 0, ++ DEVICE_DOMID(domain, pdev)); + + break; + +@@ -1980,6 +1991,25 @@ static void iommu_domain_teardown(struct + spin_unlock(&hd->arch.mapping_lock); + } + ++static void quarantine_teardown(struct pci_dev *pdev, ++ const struct acpi_drhd_unit *drhd) ++{ ++ struct page_info *pg; ++ ++ ASSERT(pcidevs_locked()); ++ ++ if ( !pdev->arch.vtd.pgd_maddr ) ++ return; ++ ++ while ( (pg = page_list_remove_head(&pdev->arch.pgtables_list)) ) ++ free_domheap_page(pg); ++ ++ pdev->arch.vtd.pgd_maddr = 0; ++ ++ if ( drhd ) ++ cleanup_domid_map(pdev->arch.pseudo_domid, drhd->iommu); ++} ++ + static int __must_check intel_iommu_map_page(struct domain *d, + unsigned long gfn, + unsigned long mfn, +@@ -2162,6 +2192,8 @@ static int intel_iommu_remove_device(u8 + rmrr->end_address, 0); + } + ++ quarantine_teardown(pdev, drhd); ++ + if ( drhd ) + { + iommu_free_domid(pdev->arch.pseudo_domid, +@@ -2798,60 +2830,139 @@ static void vtd_dump_p2m_table(struct do + vtd_dump_p2m_table_level(hd->arch.pgd_maddr, agaw_to_level(hd->arch.agaw), 0, 0); + } + +-static int __init intel_iommu_quarantine_init(struct domain *d) ++static int fill_qpt(struct dma_pte *this, unsigned int level, ++ paddr_t maddrs[6], struct pci_dev *pdev) + { +- struct domain_iommu *hd = dom_iommu(d); +- struct dma_pte *parent; ++ unsigned int i; ++ int rc = 0; ++ ++ for ( i = 0; !rc && i < PTE_NUM; ++i ) ++ { ++ struct dma_pte *pte = &this[i], *next; ++ ++ if ( !dma_pte_present(*pte) ) ++ { ++ if ( !maddrs[level] ) ++ { ++ /* ++ * The pgtable allocator is fine for the leaf page, as well as ++ * page table pages, and the resulting allocations are always ++ * zeroed. 
++ */ ++ maddrs[level] = alloc_pgtable_maddr(NULL, 1); ++ if ( !maddrs[level] ) ++ { ++ rc = -ENOMEM; ++ break; ++ } ++ ++ page_list_add(maddr_to_page(maddrs[level]), ++ &pdev->arch.pgtables_list); ++ ++ if ( level ) ++ { ++ next = map_vtd_domain_page(maddrs[level]); ++ rc = fill_qpt(next, level - 1, maddrs, pdev); ++ unmap_vtd_domain_page(next); ++ } ++ } ++ ++ dma_set_pte_addr(*pte, maddrs[level]); ++ dma_set_pte_readable(*pte); ++ dma_set_pte_writable(*pte); ++ } ++ else if ( level && !dma_pte_superpage(*pte) ) ++ { ++ page_list_add(maddr_to_page(dma_pte_addr(*pte)), ++ &pdev->arch.pgtables_list); ++ next = map_vtd_domain_page(dma_pte_addr(*pte)); ++ rc = fill_qpt(next, level - 1, maddrs, pdev); ++ unmap_vtd_domain_page(next); ++ } ++ } ++ ++ return rc; ++} ++ ++static int intel_iommu_quarantine_init(struct pci_dev *pdev) ++{ ++ struct domain_iommu *hd = dom_iommu(dom_io); ++ paddr_t maddr; + unsigned int agaw = width_to_agaw(DEFAULT_DOMAIN_ADDRESS_WIDTH); + unsigned int level = agaw_to_level(agaw); ++ const struct acpi_drhd_unit *drhd; ++ const struct acpi_rmrr_unit *rmrr; ++ unsigned int i, bdf; ++ bool rmrr_found = false; ++ int rc; + +- if ( hd->arch.pgd_maddr ) ++ ASSERT(pcidevs_locked()); ++ ASSERT(!hd->arch.pgd_maddr); ++ ++ if ( pdev->arch.vtd.pgd_maddr ) + { +- ASSERT_UNREACHABLE(); ++ clear_domain_page(pdev->arch.leaf_mfn); + return 0; + } + +- spin_lock(&hd->arch.mapping_lock); ++ drhd = acpi_find_matched_drhd_unit(pdev); ++ if ( !drhd ) ++ return -ENODEV; + +- hd->arch.pgd_maddr = alloc_pgtable_maddr(NULL, 1); +- if ( !hd->arch.pgd_maddr ) +- goto out; ++ maddr = alloc_pgtable_maddr(NULL, 1); ++ if ( !maddr ) ++ return -ENOMEM; + +- parent = map_vtd_domain_page(hd->arch.pgd_maddr); +- while ( level ) +- { +- uint64_t maddr; +- unsigned int offset; ++ rc = context_set_domain_id(NULL, pdev->arch.pseudo_domid, drhd->iommu); + +- /* +- * The pgtable allocator is fine for the leaf page, as well as +- * page table pages, and the resulting allocations are always +- * zeroed. +- */ +- maddr = alloc_pgtable_maddr(NULL, 1); +- if ( !maddr ) ++ /* Transiently install the root into DomIO, for iommu_identity_mapping(). 
*/ ++ hd->arch.pgd_maddr = maddr; ++ ++ for_each_rmrr_device ( rmrr, bdf, i ) ++ { ++ if ( rc ) + break; + +- for ( offset = 0; offset < PTE_NUM; offset++ ) ++ if ( rmrr->segment == pdev->seg && ++ bdf == PCI_BDF2(pdev->bus, pdev->devfn) ) + { +- struct dma_pte *pte = &parent[offset]; ++ rmrr_found = true; + +- dma_set_pte_addr(*pte, maddr); +- dma_set_pte_readable(*pte); ++ rc = iommu_identity_mapping(dom_io, p2m_access_rw, ++ rmrr->base_address, rmrr->end_address, ++ 0); ++ if ( rc ) ++ printk(XENLOG_ERR VTDPREFIX ++ "%04x:%02x:%02x.%u: RMRR quarantine mapping failed\n", ++ pdev->seg, pdev->bus, ++ PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn)); + } +- iommu_sync_cache(parent, PAGE_SIZE); ++ } + +- unmap_vtd_domain_page(parent); +- parent = map_vtd_domain_page(maddr); +- level--; ++ iommu_identity_map_teardown(dom_io); ++ hd->arch.pgd_maddr = 0; ++ pdev->arch.vtd.pgd_maddr = maddr; ++ ++ if ( !rc ) ++ { ++ struct dma_pte *root; ++ paddr_t maddrs[6] = {}; ++ ++ spin_lock(&hd->arch.mapping_lock); ++ ++ root = map_vtd_domain_page(maddr); ++ rc = fill_qpt(root, level - 1, maddrs, pdev); ++ unmap_vtd_domain_page(root); ++ ++ pdev->arch.leaf_mfn = maddr_to_mfn(maddrs[0]); ++ ++ spin_unlock(&hd->arch.mapping_lock); + } +- unmap_vtd_domain_page(parent); + +- out: +- spin_unlock(&hd->arch.mapping_lock); ++ if ( rc ) ++ quarantine_teardown(pdev, drhd); + +- /* Pages leaked in failure case */ +- return level ? -ENOMEM : 0; ++ return rc; + } + + const struct iommu_ops intel_iommu_ops = { +--- a/xen/drivers/passthrough/vtd/iommu.h ++++ b/xen/drivers/passthrough/vtd/iommu.h +@@ -532,7 +532,7 @@ struct iommu { + u32 nr_pt_levels; + u64 cap; + u64 ecap; +- spinlock_t lock; /* protect context, domain ids */ ++ spinlock_t lock; /* protect context */ + spinlock_t register_lock; /* protect iommu register handling */ + u64 root_maddr; /* root entry machine address */ + struct msi_desc msi; +--- a/xen/include/xen/iommu.h ++++ b/xen/include/xen/iommu.h +@@ -139,7 +139,7 @@ typedef int iommu_grdm_t(xen_pfn_t start + struct iommu_ops { + int (*init)(struct domain *d); + void (*hwdom_init)(struct domain *d); +- int (*quarantine_init)(struct domain *d); ++ int (*quarantine_init)(device_t *dev); + int (*add_device)(u8 devfn, device_t *dev); + int (*enable_device)(device_t *dev); + int (*remove_device)(u8 devfn, device_t *dev); +@@ -178,6 +178,7 @@ int __must_check iommu_suspend(void); + void iommu_resume(void); + void iommu_crash_shutdown(void); + int iommu_get_reserved_device_memory(iommu_grdm_t *, void *); ++int iommu_quarantine_dev_init(device_t *dev); + + void iommu_share_p2m_table(struct domain *d); + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-1.patch 2022-06-16 01:11:07.000000000 +0100 @@ -0,0 +1,170 @@ +From: Andrew Cooper +Subject: x86/pv: Clean up _get_page_type() + +Various fixes for clarity, ahead of making complicated changes. + + * Split the overflow check out of the if/else chain for type handling, as + it's somewhat unrelated. + * Comment the main if/else chain to explain what is going on. Adjust one + ASSERT() and state the bit layout for validate-locked and partial states. + * Correct the comment about TLB flushing, as it's backwards. 
The problem + case is when writeable mappings are retained to a page becoming read-only, + as it allows the guest to bypass Xen's safety checks for updates. + * Reduce the scope of 'y'. It is an artefact of the cmpxchg loop and not + valid for use by subsequent logic. Switch to using ACCESS_ONCE() to treat + all reads as explicitly volatile. The only thing preventing the validated + wait-loop being infinite is the compiler barrier hidden in cpu_relax(). + * Replace one page_get_owner(page) with the already-calculated 'd' already in + scope. + +No functional change. + +This is part of XSA-401 / CVE-2022-26362. + +Signed-off-by: Andrew Cooper +Signed-off-by: George Dunlap +Reviewed-by: Jan Beulich +Reviewed-by: George Dunlap + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index ad89bfb45fff..96738b027827 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -2954,16 +2954,17 @@ static int _put_page_type(struct page_info *page, unsigned int flags, + static int _get_page_type(struct page_info *page, unsigned long type, + bool preemptible) + { +- unsigned long nx, x, y = page->u.inuse.type_info; ++ unsigned long nx, x; + int rc = 0, iommu_ret = 0; + + ASSERT(!(type & ~(PGT_type_mask | PGT_pae_xen_l2))); + ASSERT(!in_irq()); + +- for ( ; ; ) ++ for ( unsigned long y = ACCESS_ONCE(page->u.inuse.type_info); ; ) + { + x = y; + nx = x + 1; ++ + if ( unlikely((nx & PGT_count_mask) == 0) ) + { + gdprintk(XENLOG_WARNING, +@@ -2971,8 +2972,15 @@ static int _get_page_type(struct page_info *page, unsigned long type, + mfn_x(page_to_mfn(page))); + return -EINVAL; + } +- else if ( unlikely((x & PGT_count_mask) == 0) ) ++ ++ if ( unlikely((x & PGT_count_mask) == 0) ) + { ++ /* ++ * Typeref 0 -> 1. ++ * ++ * Type changes are permitted when the typeref is 0. If the type ++ * actually changes, the page needs re-validating. ++ */ + struct domain *d = page_get_owner(page); + + if ( d && shadow_mode_enabled(d) ) +@@ -2983,8 +2991,8 @@ static int _get_page_type(struct page_info *page, unsigned long type, + { + /* + * On type change we check to flush stale TLB entries. It is +- * vital that no other CPUs are left with mappings of a frame +- * which is about to become writeable to the guest. ++ * vital that no other CPUs are left with writeable mappings ++ * to a frame which is intending to become pgtable/segdesc. + */ + cpumask_t *mask = this_cpu(scratch_cpumask); + +@@ -2996,7 +3004,7 @@ static int _get_page_type(struct page_info *page, unsigned long type, + + if ( unlikely(!cpumask_empty(mask)) && + /* Shadow mode: track only writable pages. */ +- (!shadow_mode_enabled(page_get_owner(page)) || ++ (!shadow_mode_enabled(d) || + ((nx & PGT_type_mask) == PGT_writable_page)) ) + { + perfc_incr(need_flush_tlb_flush); +@@ -3017,7 +3025,14 @@ static int _get_page_type(struct page_info *page, unsigned long type, + } + else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) ) + { +- /* Don't log failure if it could be a recursive-mapping attempt. */ ++ /* ++ * else, we're trying to take a new reference, of the wrong type. ++ * ++ * This (being able to prohibit use of the wrong type) is what the ++ * typeref system exists for, but skip printing the failure if it ++ * looks like a recursive mapping, as subsequent logic might ++ * ultimately permit the attempt. 
++ */ + if ( ((x & PGT_type_mask) == PGT_l2_page_table) && + (type == PGT_l1_page_table) ) + return -EINVAL; +@@ -3036,18 +3051,46 @@ static int _get_page_type(struct page_info *page, unsigned long type, + } + else if ( unlikely(!(x & PGT_validated)) ) + { ++ /* ++ * else, the count is non-zero, and we're grabbing the right type; ++ * but the page hasn't been validated yet. ++ * ++ * The page is in one of two states (depending on PGT_partial), ++ * and should have exactly one reference. ++ */ ++ ASSERT((x & (PGT_type_mask | PGT_count_mask)) == (type | 1)); ++ + if ( !(x & PGT_partial) ) + { +- /* Someone else is updating validation of this page. Wait... */ ++ /* ++ * The page has been left in the "validate locked" state ++ * (i.e. PGT_[type] | 1) which means that a concurrent caller ++ * of _get_page_type() is in the middle of validation. ++ * ++ * Spin waiting for the concurrent user to complete (partial ++ * or fully validated), then restart our attempt to acquire a ++ * type reference. ++ */ + do { + if ( preemptible && hypercall_preempt_check() ) + return -EINTR; + cpu_relax(); +- } while ( (y = page->u.inuse.type_info) == x ); ++ } while ( (y = ACCESS_ONCE(page->u.inuse.type_info)) == x ); + continue; + } +- /* Type ref count was left at 1 when PGT_partial got set. */ +- ASSERT((x & PGT_count_mask) == 1); ++ ++ /* ++ * The page has been left in the "partial" state ++ * (i.e., PGT_[type] | PGT_partial | 1). ++ * ++ * Rather than bumping the type count, we need to try to grab the ++ * validation lock; if we succeed, we need to validate the page, ++ * then drop the general ref associated with the PGT_partial bit. ++ * ++ * We grab the validation lock by setting nx to (PGT_[type] | 1) ++ * (i.e., non-zero type count, neither PGT_validated nor ++ * PGT_partial set). ++ */ + nx = x & ~PGT_partial; + } + +@@ -3094,6 +3137,13 @@ static int _get_page_type(struct page_info *page, unsigned long type, + } + + out: ++ /* ++ * Did we drop the PGT_partial bit when acquiring the typeref? If so, ++ * drop the general reference that went along with it. ++ * ++ * N.B. validate_page() may have have re-set PGT_partial, not reflected in ++ * nx, but will have taken an extra ref when doing so. ++ */ + if ( (x & PGT_partial) && !(nx & PGT_partial) ) + put_page(page); + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa401-4.13-2.patch 2022-06-09 13:09:25.000000000 +0100 @@ -0,0 +1,171 @@ +From: Andrew Cooper +Subject: x86/pv: Fix ABAC cmpxchg() race in _get_page_type() + +_get_page_type() suffers from a race condition where it incorrectly assumes +that because 'x' was read and a subsequent a cmpxchg() succeeds, the type +cannot have changed in-between. Consider: + +CPU A: + 1. Creates an L2e referencing pg + `-> _get_page_type(pg, PGT_l1_page_table), sees count 0, type PGT_writable_page + 2. Issues flush_tlb_mask() +CPU B: + 3. Creates a writeable mapping of pg + `-> _get_page_type(pg, PGT_writable_page), count increases to 1 + 4. Writes into new mapping, creating a TLB entry for pg + 5. Removes the writeable mapping of pg + `-> _put_page_type(pg), count goes back down to 0 +CPU A: + 7. Issues cmpxchg(), setting count 1, type PGT_l1_page_table + +CPU B now has a writeable mapping to pg, which Xen believes is a pagetable and +suitably protected (i.e. read-only). 
The TLB flush in step 2 must be deferred +until after the guest is prohibited from creating new writeable mappings, +which is after step 7. + +Defer all safety actions until after the cmpxchg() has successfully taken the +intended typeref, because that is what prevents concurrent users from using +the old type. + +Also remove the early validation for writeable and shared pages. This removes +race conditions where one half of a parallel mapping attempt can return +successfully before: + * The IOMMU pagetables are in sync with the new page type + * Writeable mappings to shared pages have been torn down + +This is part of XSA-401 / CVE-2022-26362. + +Reported-by: Jann Horn +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich +Reviewed-by: George Dunlap + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index 96738b027827..ee91c7fe5f69 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -3005,46 +3005,12 @@ static int _get_page_type(struct page_info *page, unsigned long type, + * Type changes are permitted when the typeref is 0. If the type + * actually changes, the page needs re-validating. + */ +- struct domain *d = page_get_owner(page); +- +- if ( d && shadow_mode_enabled(d) ) +- shadow_prepare_page_type_change(d, page, type); + + ASSERT(!(x & PGT_pae_xen_l2)); + if ( (x & PGT_type_mask) != type ) + { +- /* +- * On type change we check to flush stale TLB entries. It is +- * vital that no other CPUs are left with writeable mappings +- * to a frame which is intending to become pgtable/segdesc. +- */ +- cpumask_t *mask = this_cpu(scratch_cpumask); +- +- BUG_ON(in_irq()); +- cpumask_copy(mask, d->dirty_cpumask); +- +- /* Don't flush if the timestamp is old enough */ +- tlbflush_filter(mask, page->tlbflush_timestamp); +- +- if ( unlikely(!cpumask_empty(mask)) && +- /* Shadow mode: track only writable pages. */ +- (!shadow_mode_enabled(d) || +- ((nx & PGT_type_mask) == PGT_writable_page)) ) +- { +- perfc_incr(need_flush_tlb_flush); +- flush_tlb_mask(mask); +- } +- +- /* We lose existing type and validity. */ + nx &= ~(PGT_type_mask | PGT_validated); + nx |= type; +- +- /* +- * No special validation needed for writable pages. +- * Page tables and GDT/LDT need to be scanned for validity. +- */ +- if ( type == PGT_writable_page || type == PGT_shared_page ) +- nx |= PGT_validated; + } + } + else if ( unlikely((x & (PGT_type_mask|PGT_pae_xen_l2)) != type) ) +@@ -3125,6 +3091,46 @@ static int _get_page_type(struct page_info *page, unsigned long type, + return -EINTR; + } + ++ /* ++ * One typeref has been taken and is now globally visible. ++ * ++ * The page is either in the "validate locked" state (PGT_[type] | 1) or ++ * fully validated (PGT_[type] | PGT_validated | >0). ++ */ ++ ++ if ( unlikely((x & PGT_count_mask) == 0) ) ++ { ++ struct domain *d = page_get_owner(page); ++ ++ if ( d && shadow_mode_enabled(d) ) ++ shadow_prepare_page_type_change(d, page, type); ++ ++ if ( (x & PGT_type_mask) != type ) ++ { ++ /* ++ * On type change we check to flush stale TLB entries. It is ++ * vital that no other CPUs are left with writeable mappings ++ * to a frame which is intending to become pgtable/segdesc. ++ */ ++ cpumask_t *mask = this_cpu(scratch_cpumask); ++ ++ BUG_ON(in_irq()); ++ cpumask_copy(mask, d->dirty_cpumask); ++ ++ /* Don't flush if the timestamp is old enough */ ++ tlbflush_filter(mask, page->tlbflush_timestamp); ++ ++ if ( unlikely(!cpumask_empty(mask)) && ++ /* Shadow mode: track only writable pages. 
*/ ++ (!shadow_mode_enabled(d) || ++ ((nx & PGT_type_mask) == PGT_writable_page)) ) ++ { ++ perfc_incr(need_flush_tlb_flush); ++ flush_tlb_mask(mask); ++ } ++ } ++ } ++ + if ( unlikely((x & PGT_type_mask) != type) ) + { + /* Special pages should not be accessible from devices. */ +@@ -3149,13 +3155,25 @@ static int _get_page_type(struct page_info *page, unsigned long type, + + if ( unlikely(!(nx & PGT_validated)) ) + { +- if ( !(x & PGT_partial) ) ++ /* ++ * No special validation needed for writable or shared pages. Page ++ * tables and GDT/LDT need to have their contents audited. ++ * ++ * per validate_page(), non-atomic updates are fine here. ++ */ ++ if ( type == PGT_writable_page || type == PGT_shared_page ) ++ page->u.inuse.type_info |= PGT_validated; ++ else + { +- page->nr_validated_ptes = 0; +- page->partial_flags = 0; +- page->linear_pt_count = 0; ++ if ( !(x & PGT_partial) ) ++ { ++ page->nr_validated_ptes = 0; ++ page->partial_flags = 0; ++ page->linear_pt_count = 0; ++ } ++ ++ rc = alloc_page_type(page, type, preemptible); + } +- rc = alloc_page_type(page, type, preemptible); + } + + out: diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-1.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-1.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-1.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-1.patch 2022-06-09 13:09:25.000000000 +0100 @@ -0,0 +1,43 @@ +From: Andrew Cooper +Subject: x86/page: Introduce _PAGE_* constants for memory types + +... rather than opencoding the PAT/PCD/PWT attributes in __PAGE_HYPERVISOR_* +constants. These are going to be needed by forthcoming logic. + +No functional change. + +This is part of XSA-402. + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/xen/include/asm-x86/page.h b/xen/include/asm-x86/page.h +index c1e92937c073..7269ae89b880 100644 +--- a/xen/include/asm-x86/page.h ++++ b/xen/include/asm-x86/page.h +@@ -320,6 +320,14 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t); + + #define PAGE_CACHE_ATTRS (_PAGE_PAT | _PAGE_PCD | _PAGE_PWT) + ++/* Memory types, encoded under Xen's choice of MSR_PAT. */ ++#define _PAGE_WB ( 0) ++#define _PAGE_WT ( _PAGE_PWT) ++#define _PAGE_UCM ( _PAGE_PCD ) ++#define _PAGE_UC ( _PAGE_PCD | _PAGE_PWT) ++#define _PAGE_WC (_PAGE_PAT ) ++#define _PAGE_WP (_PAGE_PAT | _PAGE_PWT) ++ + /* + * Debug option: Ensure that granted mappings are not implicitly unmapped. 
+ * WARNING: This will need to be disabled to run OSes that use the spare PTE +@@ -338,8 +346,8 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t); + #define __PAGE_HYPERVISOR_RX (_PAGE_PRESENT | _PAGE_ACCESSED) + #define __PAGE_HYPERVISOR (__PAGE_HYPERVISOR_RX | \ + _PAGE_DIRTY | _PAGE_RW) +-#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_PCD) +-#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_PCD | _PAGE_PWT) ++#define __PAGE_HYPERVISOR_UCMINUS (__PAGE_HYPERVISOR | _PAGE_UCM) ++#define __PAGE_HYPERVISOR_UC (__PAGE_HYPERVISOR | _PAGE_UC) + + #define MAP_SMALL_PAGES _PAGE_AVAIL0 /* don't use superpages mappings */ + diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-2.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-2.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-2.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-2.patch 2022-06-09 13:09:25.000000000 +0100 @@ -0,0 +1,204 @@ +From: Andrew Cooper +Subject: x86: Don't change the cacheability of the directmap + +Changeset 55f97f49b7ce ("x86: Change cache attributes of Xen 1:1 page mappings +in response to guest mapping requests") attempted to keep the cacheability +consistent between different mappings of the same page. + +The reason wasn't described in the changelog, but it is understood to be in +regards to a concern over machine check exceptions, owing to errata when using +mixed cacheabilities. It did this primarily by updating Xen's mapping of the +page in the direct map when the guest mapped a page with reduced cacheability. + +Unfortunately, the logic didn't actually prevent mixed cacheability from +occurring: + * A guest could map a page normally, and then map the same page with + different cacheability; nothing prevented this. + * The cacheability of the directmap was always latest-takes-precedence in + terms of guest requests. + * Grant-mapped frames with lesser cacheability didn't adjust the page's + cacheattr settings. + * The map_domain_page() function still unconditionally created WB mappings, + irrespective of the page's cacheattr settings. + +Additionally, update_xen_mappings() had a bug where the alias calculation was +wrong for mfn's which were .init content, which should have been treated as +fully guest pages, not Xen pages. + +Worse yet, the logic introduced a vulnerability whereby necessary +pagetable/segdesc adjustments made by Xen in the validation logic could become +non-coherent between the cache and main memory. The CPU could subsequently +operate on the stale value in the cache, rather than the safe value in main +memory. + +The directmap contains primarily mappings of RAM. PAT/MTRR conflict +resolution is asymmetric, and generally for MTRR=WB ranges, PAT of lesser +cacheability resolves to being coherent. The special case is WC mappings, +which are non-coherent against MTRR=WB regions (except for fully-coherent +CPUs). + +Xen must not have any WC cacheability in the directmap, to prevent Xen's +actions from creating non-coherency. (Guest actions creating non-coherency is +dealt with in subsequent patches.) As all memory types for MTRR=WB ranges +inter-operate coherently, so leave Xen's directmap mappings as WB. + +Only PV guests with access to devices can use reduced-cacheability mappings to +begin with, and they're trusted not to mount DoSs against the system anyway. + +Drop PGC_cacheattr_{base,mask} entirely, and the logic to manipulate them. 
+Shift the later PGC_* constants up, to gain 3 extra bits in the main reference +count. Retain the check in get_page_from_l1e() for special_pages() because a +guest has no business using reduced cacheability on these. + +This reverts changeset 55f97f49b7ce6c3520c555d19caac6cf3f9a5df0 + +This is CVE-2022-26363, part of XSA-402. + +Signed-off-by: Andrew Cooper +Reviewed-by: George Dunlap + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index ee91c7fe5f69..859646b670a8 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -786,24 +786,6 @@ bool is_iomem_page(mfn_t mfn) + return (page_get_owner(page) == dom_io); + } + +-static int update_xen_mappings(unsigned long mfn, unsigned int cacheattr) +-{ +- int err = 0; +- bool alias = mfn >= PFN_DOWN(xen_phys_start) && +- mfn < PFN_UP(xen_phys_start + xen_virt_end - XEN_VIRT_START); +- unsigned long xen_va = +- XEN_VIRT_START + ((mfn - PFN_DOWN(xen_phys_start)) << PAGE_SHIFT); +- +- if ( unlikely(alias) && cacheattr ) +- err = map_pages_to_xen(xen_va, _mfn(mfn), 1, 0); +- if ( !err ) +- err = map_pages_to_xen((unsigned long)mfn_to_virt(mfn), _mfn(mfn), 1, +- PAGE_HYPERVISOR | cacheattr_to_pte_flags(cacheattr)); +- if ( unlikely(alias) && !cacheattr && !err ) +- err = map_pages_to_xen(xen_va, _mfn(mfn), 1, PAGE_HYPERVISOR); +- return err; +-} +- + #ifndef NDEBUG + struct mmio_emul_range_ctxt { + const struct domain *d; +@@ -1008,47 +990,14 @@ get_page_from_l1e( + goto could_not_pin; + } + +- if ( pte_flags_to_cacheattr(l1f) != +- ((page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base) ) ++ if ( (l1f & PAGE_CACHE_ATTRS) != _PAGE_WB && is_xen_heap_page(page) ) + { +- unsigned long x, nx, y = page->count_info; +- unsigned long cacheattr = pte_flags_to_cacheattr(l1f); +- int err; +- +- if ( is_xen_heap_page(page) ) +- { +- if ( write ) +- put_page_type(page); +- put_page(page); +- gdprintk(XENLOG_WARNING, +- "Attempt to change cache attributes of Xen heap page\n"); +- return -EACCES; +- } +- +- do { +- x = y; +- nx = (x & ~PGC_cacheattr_mask) | (cacheattr << PGC_cacheattr_base); +- } while ( (y = cmpxchg(&page->count_info, x, nx)) != x ); +- +- err = update_xen_mappings(mfn, cacheattr); +- if ( unlikely(err) ) +- { +- cacheattr = y & PGC_cacheattr_mask; +- do { +- x = y; +- nx = (x & ~PGC_cacheattr_mask) | cacheattr; +- } while ( (y = cmpxchg(&page->count_info, x, nx)) != x ); +- +- if ( write ) +- put_page_type(page); +- put_page(page); +- +- gdprintk(XENLOG_WARNING, "Error updating mappings for mfn %" PRI_mfn +- " (pfn %" PRI_pfn ", from L1 entry %" PRIpte ") for d%d\n", +- mfn, get_gpfn_from_mfn(mfn), +- l1e_get_intpte(l1e), l1e_owner->domain_id); +- return err; +- } ++ if ( write ) ++ put_page_type(page); ++ put_page(page); ++ gdprintk(XENLOG_WARNING, ++ "Attempt to change cache attributes of Xen heap page\n"); ++ return -EACCES; + } + + return 0; +@@ -2541,25 +2490,10 @@ static int mod_l4_entry(l4_pgentry_t *pl4e, + */ + static int cleanup_page_mappings(struct page_info *page) + { +- unsigned int cacheattr = +- (page->count_info & PGC_cacheattr_mask) >> PGC_cacheattr_base; + int rc = 0; + unsigned long mfn = mfn_x(page_to_mfn(page)); + + /* +- * If we've modified xen mappings as a result of guest cache +- * attributes, restore them to the "normal" state. +- */ +- if ( unlikely(cacheattr) ) +- { +- page->count_info &= ~PGC_cacheattr_mask; +- +- BUG_ON(is_xen_heap_page(page)); +- +- rc = update_xen_mappings(mfn, 0); +- } +- +- /* + * If this may be in a PV domain's IOMMU, remove it. 
+ * + * NB that writable xenheap pages have their type set and cleared by +diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h +index 320c6cd19669..db09849f73f8 100644 +--- a/xen/include/asm-x86/mm.h ++++ b/xen/include/asm-x86/mm.h +@@ -64,22 +64,19 @@ + /* Set when is using a page as a page table */ + #define _PGC_page_table PG_shift(3) + #define PGC_page_table PG_mask(1, 3) +- /* 3-bit PAT/PCD/PWT cache-attribute hint. */ +-#define PGC_cacheattr_base PG_shift(6) +-#define PGC_cacheattr_mask PG_mask(7, 6) + /* Page is broken? */ +-#define _PGC_broken PG_shift(7) +-#define PGC_broken PG_mask(1, 7) ++#define _PGC_broken PG_shift(4) ++#define PGC_broken PG_mask(1, 4) + /* Mutually-exclusive page states: { inuse, offlining, offlined, free }. */ +-#define PGC_state PG_mask(3, 9) +-#define PGC_state_inuse PG_mask(0, 9) +-#define PGC_state_offlining PG_mask(1, 9) +-#define PGC_state_offlined PG_mask(2, 9) +-#define PGC_state_free PG_mask(3, 9) ++#define PGC_state PG_mask(3, 6) ++#define PGC_state_inuse PG_mask(0, 6) ++#define PGC_state_offlining PG_mask(1, 6) ++#define PGC_state_offlined PG_mask(2, 6) ++#define PGC_state_free PG_mask(3, 6) + #define page_state_is(pg, st) (((pg)->count_info&PGC_state) == PGC_state_##st) + + /* Count of references to this frame. */ +-#define PGC_count_width PG_shift(9) ++#define PGC_count_width PG_shift(6) + #define PGC_count_mask ((1UL< +Subject: x86: Split cache_flush() out of cache_writeback() + +Subsequent changes will want a fully flushing version. + +Use the new helper rather than opencoding it in flush_area_local(). This +resolves an outstanding issue where the conditional sfence is on the wrong +side of the clflushopt loop. clflushopt is ordered with respect to older +stores, not to younger stores. + +Rename gnttab_cache_flush()'s helper to avoid colliding in name. +grant_table.c can see the prototype from cache.h so the build fails +otherwise. + +This is part of XSA-402. + +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +Xen 4.16 and earlier: + * Also backport half of c/s 3330013e67396 "VT-d / x86: re-arrange cache + syncing" to split cache_writeback() out of the IOMMU logic, but without the + associated hooks changes. + +diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c +index 03f92c23dcaf..8568491c7ea9 100644 +--- a/xen/arch/x86/flushtlb.c ++++ b/xen/arch/x86/flushtlb.c +@@ -230,7 +230,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags) + if ( flags & FLUSH_CACHE ) + { + const struct cpuinfo_x86 *c = ¤t_cpu_data; +- unsigned long i, sz = 0; ++ unsigned long sz = 0; + + if ( order < (BITS_PER_LONG - PAGE_SHIFT) ) + sz = 1UL << (order + PAGE_SHIFT); +@@ -240,13 +240,7 @@ unsigned int flush_area_local(const void *va, unsigned int flags) + c->x86_clflush_size && c->x86_cache_size && sz && + ((sz >> 10) < c->x86_cache_size) ) + { +- alternative(ASM_NOP3, "sfence", X86_FEATURE_CLFLUSHOPT); +- for ( i = 0; i < sz; i += c->x86_clflush_size ) +- alternative_input(".byte " __stringify(NOP_DS_PREFIX) ";" +- " clflush %0", +- "data16 clflush %0", /* clflushopt */ +- X86_FEATURE_CLFLUSHOPT, +- "m" (((const char *)va)[i])); ++ cache_flush(va, sz); + flags &= ~FLUSH_CACHE; + } + else +@@ -262,3 +256,77 @@ unsigned int flush_area_local(const void *va, unsigned int flags) + + return flags; + } ++ ++void cache_flush(const void *addr, unsigned int size) ++{ ++ /* ++ * This function may be called before current_cpu_data is established. ++ * Hence a fallback is needed to prevent the loop below becoming infinite. 
++ */ ++ unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16; ++ const void *end = addr + size; ++ ++ addr -= (unsigned long)addr & (clflush_size - 1); ++ for ( ; addr < end; addr += clflush_size ) ++ { ++ /* ++ * Note regarding the "ds" prefix use: it's faster to do a clflush ++ * + prefix than a clflush + nop, and hence the prefix is added instead ++ * of letting the alternative framework fill the gap by appending nops. ++ */ ++ alternative_io("ds; clflush %[p]", ++ "data16 clflush %[p]", /* clflushopt */ ++ X86_FEATURE_CLFLUSHOPT, ++ /* no outputs */, ++ [p] "m" (*(const char *)(addr))); ++ } ++ ++ alternative("", "sfence", X86_FEATURE_CLFLUSHOPT); ++} ++ ++void cache_writeback(const void *addr, unsigned int size) ++{ ++ unsigned int clflush_size; ++ const void *end = addr + size; ++ ++ /* Fall back to CLFLUSH{,OPT} when CLWB isn't available. */ ++ if ( !boot_cpu_has(X86_FEATURE_CLWB) ) ++ return cache_flush(addr, size); ++ ++ /* ++ * This function may be called before current_cpu_data is established. ++ * Hence a fallback is needed to prevent the loop below becoming infinite. ++ */ ++ clflush_size = current_cpu_data.x86_clflush_size ?: 16; ++ addr -= (unsigned long)addr & (clflush_size - 1); ++ for ( ; addr < end; addr += clflush_size ) ++ { ++/* ++ * The arguments to a macro must not include preprocessor directives. Doing so ++ * results in undefined behavior, so we have to create some defines here in ++ * order to avoid it. ++ */ ++#if defined(HAVE_AS_CLWB) ++# define CLWB_ENCODING "clwb %[p]" ++#elif defined(HAVE_AS_XSAVEOPT) ++# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */ ++#else ++# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */ ++#endif ++ ++#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr)) ++#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT) ++# define INPUT BASE_INPUT ++#else ++# define INPUT(addr) "a" (addr), BASE_INPUT(addr) ++#endif ++ ++ asm volatile (CLWB_ENCODING :: INPUT(addr)); ++ ++#undef INPUT ++#undef BASE_INPUT ++#undef CLWB_ENCODING ++ } ++ ++ asm volatile ("sfence" ::: "memory"); ++} +diff --git a/xen/common/grant_table.c b/xen/common/grant_table.c +index cbb2ce17c001..709509e0fc9e 100644 +--- a/xen/common/grant_table.c ++++ b/xen/common/grant_table.c +@@ -3320,7 +3320,7 @@ gnttab_swap_grant_ref(XEN_GUEST_HANDLE_PARAM(gnttab_swap_grant_ref_t) uop, + return 0; + } + +-static int cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref) ++static int _cache_flush(const gnttab_cache_flush_t *cflush, grant_ref_t *cur_ref) + { + struct domain *d, *owner; + struct page_info *page; +@@ -3414,7 +414,7 @@ gnttab_cache_flush(XEN_GUEST_HANDLE_PARAM(gnttab_cache_flush_t) uop, + return -EFAULT; + for ( ; ; ) + { +- int ret = cache_flush(&op, cur_ref); ++ int ret = _cache_flush(&op, cur_ref); + + if ( ret < 0 ) + return ret; +diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h +index fbe951b2fad0..3defe9677f06 100644 +--- a/xen/drivers/passthrough/vtd/extern.h ++++ b/xen/drivers/passthrough/vtd/extern.h +@@ -77,7 +77,6 @@ int __must_check qinval_device_iotlb_sync(struct iommu *iommu, + struct pci_dev *pdev, + u16 did, u16 size, u64 addr); + +-unsigned int get_cache_line_size(void); + void flush_all_cache(void); + + u64 alloc_pgtable_maddr(struct acpi_drhd_unit *drhd, unsigned long npages); +diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c +index f051a55764b9..2bf5f02c08de 100644 +--- a/xen/drivers/passthrough/vtd/iommu.c ++++ 
b/xen/drivers/passthrough/vtd/iommu.c +@@ -31,6 +31,7 @@ + #include + #include + #include ++#include + #include + #include + #include +@@ -219,53 +220,10 @@ static int iommus_incoherent; + + static void sync_cache(const void *addr, unsigned int size) + { +- static unsigned long clflush_size = 0; +- const void *end = addr + size; +- + if ( !iommus_incoherent ) + return; + +- if ( clflush_size == 0 ) +- clflush_size = get_cache_line_size(); +- +- addr -= (unsigned long)addr & (clflush_size - 1); +- for ( ; addr < end; addr += clflush_size ) +-/* +- * The arguments to a macro must not include preprocessor directives. Doing so +- * results in undefined behavior, so we have to create some defines here in +- * order to avoid it. +- */ +-#if defined(HAVE_AS_CLWB) +-# define CLWB_ENCODING "clwb %[p]" +-#elif defined(HAVE_AS_XSAVEOPT) +-# define CLWB_ENCODING "data16 xsaveopt %[p]" /* clwb */ +-#else +-# define CLWB_ENCODING ".byte 0x66, 0x0f, 0xae, 0x30" /* clwb (%%rax) */ +-#endif +- +-#define BASE_INPUT(addr) [p] "m" (*(const char *)(addr)) +-#if defined(HAVE_AS_CLWB) || defined(HAVE_AS_XSAVEOPT) +-# define INPUT BASE_INPUT +-#else +-# define INPUT(addr) "a" (addr), BASE_INPUT(addr) +-#endif +- /* +- * Note regarding the use of NOP_DS_PREFIX: it's faster to do a clflush +- * + prefix than a clflush + nop, and hence the prefix is added instead +- * of letting the alternative framework fill the gap by appending nops. +- */ +- alternative_io_2(".byte " __stringify(NOP_DS_PREFIX) "; clflush %[p]", +- "data16 clflush %[p]", /* clflushopt */ +- X86_FEATURE_CLFLUSHOPT, +- CLWB_ENCODING, +- X86_FEATURE_CLWB, /* no outputs */, +- INPUT(addr)); +-#undef INPUT +-#undef BASE_INPUT +-#undef CLWB_ENCODING +- +- alternative_2("", "sfence", X86_FEATURE_CLFLUSHOPT, +- "sfence", X86_FEATURE_CLWB); ++ cache_writeback(addr, size); + } + + /* Allocate page table, return its machine address */ +diff --git a/xen/drivers/passthrough/vtd/x86/vtd.c b/xen/drivers/passthrough/vtd/x86/vtd.c +index 229938f3a812..2a18b76e800d 100644 +--- a/xen/drivers/passthrough/vtd/x86/vtd.c ++++ b/xen/drivers/passthrough/vtd/x86/vtd.c +@@ -48,11 +48,6 @@ void unmap_vtd_domain_page(void *va) + unmap_domain_page(va); + } + +-unsigned int get_cache_line_size(void) +-{ +- return ((cpuid_ebx(1) >> 8) & 0xff) * 8; +-} +- + void flush_all_cache() + { + wbinvd(); +diff --git a/xen/include/asm-x86/cache.h b/xen/include/asm-x86/cache.h +index 1f7173d8c72c..e4770efb22b9 100644 +--- a/xen/include/asm-x86/cache.h ++++ b/xen/include/asm-x86/cache.h +@@ -11,4 +11,11 @@ + + #define __read_mostly __section(".data.read_mostly") + ++#ifndef __ASSEMBLY__ ++ ++void cache_flush(const void *addr, unsigned int size); ++void cache_writeback(const void *addr, unsigned int size); ++ ++#endif ++ + #endif diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-4.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-4.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-4.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-4.patch 2022-06-16 01:38:06.000000000 +0100 @@ -0,0 +1,83 @@ +From: Andrew Cooper +Subject: x86/amd: Work around CLFLUSH ordering on older parts + +On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakely ordered with everything, +including reads and writes to the address, and LFENCE/SFENCE instructions. + +This creates a multitude of problematic corner cases, laid out in the manual. +Arrange to use MFENCE on both sides of the CLFLUSH to force proper ordering. + +This is part of XSA-402. 
+ +Signed-off-by: Andrew Cooper +Reviewed-by: Jan Beulich + +diff --git a/xen/arch/x86/cpu/amd.c b/xen/arch/x86/cpu/amd.c +index b77fa1929733..aa1b9d0dda6b 100644 +--- a/xen/arch/x86/cpu/amd.c ++++ b/xen/arch/x86/cpu/amd.c +@@ -624,6 +624,14 @@ static void init_amd(struct cpuinfo_x86 *c) + if (!cpu_has_lfence_dispatch) + __set_bit(X86_FEATURE_MFENCE_RDTSC, c->x86_capability); + ++ /* ++ * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with ++ * everything, including reads and writes to address, and ++ * LFENCE/SFENCE instructions. ++ */ ++ if (!cpu_has_clflushopt) ++ setup_force_cpu_cap(X86_BUG_CLFLUSH_MFENCE); ++ + switch(c->x86) + { + case 0xf ... 0x11: +diff --git a/xen/arch/x86/flushtlb.c b/xen/arch/x86/flushtlb.c +index 8568491c7ea9..6f3f5ab1a3c4 100644 +--- a/xen/arch/x86/flushtlb.c ++++ b/xen/arch/x86/flushtlb.c +@@ -257,6 +257,13 @@ unsigned int flush_area_local(const void *va, unsigned int flags) + return flags; + } + ++/* ++ * On pre-CLFLUSHOPT AMD CPUs, CLFLUSH is weakly ordered with everything, ++ * including reads and writes to address, and LFENCE/SFENCE instructions. ++ * ++ * This function only works safely after alternatives have run. Luckily, at ++ * the time of writing, we don't flush the caches that early. ++ */ + void cache_flush(const void *addr, unsigned int size) + { + /* +@@ -266,6 +273,8 @@ void cache_flush(const void *addr, unsigned int size) + unsigned int clflush_size = current_cpu_data.x86_clflush_size ?: 16; + const void *end = addr + size; + ++ alternative("", "mfence", X86_BUG_CLFLUSH_MFENCE); ++ + addr -= (unsigned long)addr & (clflush_size - 1); + for ( ; addr < end; addr += clflush_size ) + { +@@ -281,7 +290,9 @@ void cache_flush(const void *addr, unsigned int size) + [p] "m" (*(const char *)(addr))); + } + +- alternative("", "sfence", X86_FEATURE_CLFLUSHOPT); ++ alternative_2("", ++ "sfence", X86_FEATURE_CLFLUSHOPT, ++ "mfence", X86_BUG_CLFLUSH_MFENCE); + } + + void cache_writeback(const void *addr, unsigned int size) +diff --git a/xen/include/asm-x86/cpufeatures.h b/xen/include/asm-x86/cpufeatures.h +index b9d3cac97538..a8222e978cd9 100644 +--- a/xen/include/asm-x86/cpufeatures.h ++++ b/xen/include/asm-x86/cpufeatures.h +@@ -44,6 +44,7 @@ XEN_CPUFEATURE(SC_VERW_IDLE, X86_SYNTH(25)) /* VERW used by Xen for idle */ + #define X86_BUG(x) ((FSCAPINTS + X86_NR_SYNTH) * 32 + (x)) + + #define X86_BUG_FPU_PTRS X86_BUG( 0) /* (F)X{SAVE,RSTOR} doesn't save/restore FOP/FIP/FDP. */ ++#define X86_BUG_CLFLUSH_MFENCE X86_BUG( 2) /* MFENCE needed to serialise CLFLUSH */ + + /* Total number of capability words, inc synth and bug words. */ + #define NCAPINTS (FSCAPINTS + X86_NR_SYNTH + X86_NR_BUG) /* N 32-bit words worth of info */ diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-5.patch xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-5.patch --- xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-5.patch 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/patches/xsa402-4.13-5.patch 2022-06-16 22:26:27.000000000 +0100 @@ -0,0 +1,160 @@ +From: Andrew Cooper +Subject: x86/pv: Track and flush non-coherent mappings of RAM + +There are legitimate uses of WC mappings of RAM, e.g. for DMA buffers with +devices that make non-coherent writes. The Linux sound subsystem makes +extensive use of this technique. + +For such usecases, the guest's DMA buffer is mapped and consistently used as +WC, and Xen doesn't interact with the buffer. 
+ +However, a mischevious guest can use WC mappings to deliberately create +non-coherency between the cache and RAM, and use this to trick Xen into +validating a pagetable which isn't actually safe. + +Allocate a new PGT_non_coherent to track the non-coherency of mappings. Set +it whenever a non-coherent writeable mapping is created. If the page is used +as anything other than PGT_writable_page, force a cache flush before +validation. Also force a cache flush before the page is returned to the heap. + +This is CVE-2022-26364, part of XSA-402. + +Reported-by: Jann Horn +Signed-off-by: Andrew Cooper +Reviewed-by: George Dunlap +Reviewed-by: Jan Beulich + +diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c +index 859646b670a8..f5eeddce5867 100644 +--- a/xen/arch/x86/mm.c ++++ b/xen/arch/x86/mm.c +@@ -1085,6 +1085,15 @@ get_page_from_l1e( + return -EACCES; + } + ++ /* ++ * Track writeable non-coherent mappings to RAM pages, to trigger a cache ++ * flush later if the target is used as anything but a PGT_writeable page. ++ * We care about all writeable mappings, including foreign mappings. ++ */ ++ if ( !boot_cpu_has(X86_FEATURE_XEN_SELFSNOOP) && ++ (l1f & (PAGE_CACHE_ATTRS | _PAGE_RW)) == (_PAGE_WC | _PAGE_RW) ) ++ set_bit(_PGT_non_coherent, &page->u.inuse.type_info); ++ + return 0; + + could_not_pin: +@@ -2516,6 +2525,19 @@ static int cleanup_page_mappings(struct page_info *page) + } + } + ++ /* ++ * Flush the cache if there were previously non-coherent writeable ++ * mappings of this page. This forces the page to be coherent before it ++ * is freed back to the heap. ++ */ ++ if ( __test_and_clear_bit(_PGT_non_coherent, &page->u.inuse.type_info) ) ++ { ++ void *addr = __map_domain_page(page); ++ ++ cache_flush(addr, PAGE_SIZE); ++ unmap_domain_page(addr); ++ } ++ + return rc; + } + +@@ -3068,6 +3090,22 @@ static int _get_page_type(struct page_info *page, unsigned long type, + if ( unlikely(!(nx & PGT_validated)) ) + { + /* ++ * Flush the cache if there were previously non-coherent mappings of ++ * this page, and we're trying to use it as anything other than a ++ * writeable page. This forces the page to be coherent before we ++ * validate its contents for safety. ++ */ ++ if ( (nx & PGT_non_coherent) && type != PGT_writable_page ) ++ { ++ void *addr = __map_domain_page(page); ++ ++ cache_flush(addr, PAGE_SIZE); ++ unmap_domain_page(addr); ++ ++ page->u.inuse.type_info &= ~PGT_non_coherent; ++ } ++ ++ /* + * No special validation needed for writable or shared pages. Page + * tables and GDT/LDT need to have their contents audited. + * +diff --git a/xen/arch/x86/pv/grant_table.c b/xen/arch/x86/pv/grant_table.c +index 0325618c9883..81c72e61ed55 100644 +--- a/xen/arch/x86/pv/grant_table.c ++++ b/xen/arch/x86/pv/grant_table.c +@@ -109,7 +109,17 @@ int create_grant_pv_mapping(uint64_t addr, mfn_t frame, + + ol1e = *pl1e; + if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, mfn_x(gl1mfn), curr, 0) ) ++ { ++ /* ++ * We always create mappings in this path. However, our caller, ++ * map_grant_ref(), only passes potentially non-zero cache_flags for ++ * MMIO frames, so this path doesn't create non-coherent mappings of ++ * RAM frames and there's no need to calculate PGT_non_coherent. 
++ */ ++ ASSERT(!cache_flags || is_iomem_page(frame)); ++ + rc = GNTST_okay; ++ } + + out_unlock: + page_unlock(page); +@@ -294,7 +304,18 @@ int replace_grant_pv_mapping(uint64_t addr, mfn_t frame, + l1e_get_flags(ol1e), addr, grant_pte_flags); + + if ( UPDATE_ENTRY(l1, pl1e, ol1e, nl1e, mfn_x(gl1mfn), curr, 0) ) ++ { ++ /* ++ * Generally, replace_grant_pv_mapping() is used to destroy mappings ++ * (n1le = l1e_empty()), but it can be a present mapping on the ++ * GNTABOP_unmap_and_replace path. ++ * ++ * In such cases, the PTE is fully transplanted from its old location ++ * via steal_linear_addr(), so we need not perform PGT_non_coherent ++ * checking here. ++ */ + rc = GNTST_okay; ++ } + + out_unlock: + page_unlock(page); +diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h +index db09849f73f8..82d0fd6104a2 100644 +--- a/xen/include/asm-x86/mm.h ++++ b/xen/include/asm-x86/mm.h +@@ -48,8 +48,12 @@ + #define _PGT_partial PG_shift(8) + #define PGT_partial PG_mask(1, 8) + ++/* Has this page been mapped writeable with a non-coherent memory type? */ ++#define _PGT_non_coherent PG_shift(9) ++#define PGT_non_coherent PG_mask(1, 9) ++ + /* Count of uses of this frame as its current type. */ +-#define PGT_count_width PG_shift(8) ++#define PGT_count_width PG_shift(9) + #define PGT_count_mask ((1UL< /dev/null && [ -d /boot/grub ]; then + update-grub || : + fi + ;; + + abort-upgrade|abort-remove|abort-deconfigure) + ;; + + *) + echo "postinst called with unknown argument \`$1'" >&2 + exit 1 + ;; +esac + +#DEBHELPER# + +exit 0 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.bug-control xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.bug-control --- xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.bug-control 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.bug-control 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,2 @@ +# autogenerated, do not edit +Submit-As: src:xen diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postinst xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postinst --- xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postinst 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postinst 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,24 @@ +#!/bin/bash +# autogenerated, do not edit + +set -e + +case "$1" in + configure) + if command -v update-grub > /dev/null && [ -d /boot/grub ]; then + update-grub || : + fi + ;; + + abort-upgrade|abort-remove|abort-deconfigure) + ;; + + *) + echo "postinst called with unknown argument \`$1'" >&2 + exit 1 + ;; +esac + +#DEBHELPER# + +exit 0 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postrm xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postrm --- xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postrm 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-4.11.postrm 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,24 @@ +#!/bin/bash +# autogenerated, do not edit + +set -e + +case "$1" in + remove) + if command -v update-grub > /dev/null && [ -d /boot/grub ]; then + update-grub || : + fi + ;; + + purge|upgrade|failed-upgrade|abort-install|abort-upgrade|disappear) + ;; + + *) + echo "postrm called with unknown argument \`$1'" >&2 + exit 1 + ;; +esac + +#DEBHELPER# + +exit 0 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-V-F.install.vsn-in xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-V-F.install.vsn-in --- 
xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-V-F.install.vsn-in 2020-02-28 13:14:00.000000000 +0000 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-hypervisor-V-F.install.vsn-in 2022-06-19 07:08:13.000000000 +0100 @@ -1,4 +1,3 @@ - usr/lib/debug/xen* usr/lib/debug/ # ^ The xen* wildcard excludes the shim symbols. The shim is treated # as part of the toolstack - see xen-utils-V.install.vsn-in. diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.bug-control xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.bug-control --- xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.bug-control 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.bug-control 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,2 @@ +# autogenerated, do not edit +Submit-As: src:xen diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.install xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.install --- xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.install 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.install 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,8 @@ +# autogenerated, do not edit +usr/lib/xen-4.11/bin +usr/lib/xen-4.11/lib/python + +usr/lib/xen-4.11/boot +usr/lib/debug/usr/lib/xen-*/boot/* usr/lib/debug/xen-syms-4.11-shim +# ^ Yes, the upstream build system really does install the shim symbols +# file in debian/tmp/usr/lib/debug/usr/lib/xen-4.11/boot/xen-shim-syms diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.lintian-overrides xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.lintian-overrides --- xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.lintian-overrides 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.lintian-overrides 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,14 @@ +# autogenerated, do not edit +statically-linked-binary usr/lib/xen-4.11/boot/hvmloader +statically-linked-binary usr/lib/xen-4.11/boot/xen-shim + +binary-has-unneeded-section usr/lib/xen-4.11/boot/xen-shim .note +# ^ that section is certainly needed for the tools etc. to be able +# to load it! + +binary-from-other-architecture usr/lib/debug/xen-syms-4.11-shim/xen-shim-syms +# ^ this is a symbols file for the shim + +binary-or-shlib-defines-rpath usr/lib/xen-4.11/lib/python/fsimage.so /usr/lib/xen-4.11/lib/x86_64-linux-gnu +# ^ this module needs to load the libfsimage .so from within +# the xen-utils private directory. less +/fsimage debian/rules diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.postinst xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.postinst --- xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.postinst 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.postinst 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,25 @@ +#!/bin/sh +# autogenerated, do not edit + +set -e + +case "$1" in + configure) + update-alternatives --remove xen-default /usr/lib/xen-4.11 + if [ -x "/etc/init.d/xen" ]; then + invoke-rc.d xen start || exit $? 
+ fi + ;; + + abort-upgrade|abort-remove|abort-deconfigure) + ;; + + *) + echo "postinst called with unknown argument \`$1'" >&2 + exit 1 + ;; +esac + +#DEBHELPER# + +exit 0 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.prerm xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.prerm --- xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.prerm 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.prerm 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,25 @@ +#!/bin/bash +# autogenerated, do not edit + +set -e + +case "$1" in + remove|upgrade) + update-alternatives --remove xen-default /usr/lib/xen-4.11 + if [ -x "/etc/init.d/xen" ]; then + invoke-rc.d xen stop || exit $? + fi + ;; + + deconfigure|failed-upgrade) + ;; + + *) + echo "prerm called with unknown argument \`$1'" >&2 + exit 1 + ;; +esac + +#DEBHELPER# + +exit 0 diff -Nru xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.README.Debian xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.README.Debian --- xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.README.Debian 1970-01-01 01:00:00.000000000 +0100 +++ xen-4.11.3+24-g14b62ab3e5/debian/xen-utils-4.11.README.Debian 2022-06-17 09:11:17.000000000 +0100 @@ -0,0 +1,49 @@ +# autogenerated, do not edit +Xen for Debian +============== + +Config behaviour +---------------- + +The Debian packages change the behaviour of some config options. + +The options "kernel", "initrd" and "loader" search in the Xen private boot +directory (/usr/lib/xen-$version/boot) first. "bootloader" and "device_model" +also search the Xen private bin directory (/usr/lib/xen-$version/bin). This +means that the following entries will properly find anything: + loader = 'hvmloader' + bootloader = 'pygrub' + +Network setup +------------- + +The Debian package of Xen doesn't change the network setup in any way. This +differs from the upstream version, which overwrites the main network card +(eth0) with a bridge setup and may break the network at this point. + +To set up a bridge, please follow the instructions in the manpage for +bridge-utils-interfaces(5). + +Loop devices +------------ + +If you plan to host virtual domains with file-backed block devices (i.e. the +ones xen-tools creates by default), be careful about two issues: + +1. Maximum number of loop devices + By default the loop driver supports a maximum of 8 loop devices. Of + course since every Xen domain uses at least two (one for the data and one + for the swap) this number is absolutely insufficient. You should increase + it by adding a file named local-loop in /etc/modprobe.d containing the + string "options loop max_loop=128", if the loop driver is compiled as a + module, or by appending the string max_loop=128 to your kernel parameters + if the driver is in-kernel. Of course you can increase or decrease the + number 128 as you see fit. + +2. Driver loading (only if loop is compiled as a module) + Normally the loop driver gets loaded when the first loop device is + accessed. When using udev, though, the loop devices get created only + after the driver gets loaded. This means that Xen will fail if the loop + driver is not already loaded when it tries to start a file-backed virtual + domain. To fix this just add "loop" in your /etc/modules file, thus + forcing it to be loaded at boot time.
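
Editor's illustration: xsa402-4.13-3.patch above introduces a common cache_flush() helper (align the start address down to the CLFLUSH line size, flush every line in the range, then fence so the flushes are ordered against later stores), and xsa402-4.13-4.patch wraps it in MFENCEs on pre-CLFLUSHOPT AMD parts. The following standalone C sketch shows the same flush pattern in userspace; it is an analogue under stated assumptions, not Xen code. The 64-byte line size, the unconditional SFENCE, and the demo_cache_flush()/main() names are assumptions made for the example — Xen instead reads x86_clflush_size from CPUID and patches CLFLUSHOPT/CLWB and the fences in via alternatives.

/*
 * Minimal userspace sketch of the cache_flush() pattern from
 * xsa402-4.13-3.patch (assumptions: fixed 64-byte line size, plain
 * CLFLUSH, unconditional SFENCE).
 *
 * Build on x86-64: gcc -O2 -o cache-flush-demo cache-flush-demo.c
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>          /* _mm_clflush(), _mm_sfence() */

#define DEMO_CLFLUSH_SIZE 64    /* assumed cache-line size */

static void demo_cache_flush(const void *addr, size_t size)
{
    /* Align down so the first (possibly partial) line is covered too. */
    const char *p = (const char *)((uintptr_t)addr &
                                   ~(uintptr_t)(DEMO_CLFLUSH_SIZE - 1));
    const char *end = (const char *)addr + size;

    for ( ; p < end; p += DEMO_CLFLUSH_SIZE )
        _mm_clflush(p);         /* write back and invalidate one line */

    /*
     * Unconditionally order the flushes before later stores; Xen only
     * emits SFENCE (or MFENCE on affected AMD parts) via alternatives.
     */
    _mm_sfence();
}

int main(void)
{
    /* 64-byte aligned buffer so the aligned-down start stays in bounds. */
    char *buf = aligned_alloc(DEMO_CLFLUSH_SIZE, 4096);
    size_t i;

    if ( !buf )
        return 1;

    for ( i = 0; i < 4096; i++ )
        buf[i] = (char)i;

    /* Deliberately unaligned start, as callers of cache_flush() may pass. */
    demo_cache_flush(buf + 5, 1000);

    printf("flushed ~%u bytes starting at %p\n", 1000u, (void *)(buf + 5));
    free(buf);
    return 0;
}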