A memory page poisoned from the hypervisor level is no longer readable.
The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:
Program terminated with signal SIGBUS, Bus error.
#0 _mm256_loadu_si256
#1 buffer_zero_avx2
#2 select_accel_fn
#3 buffer_is_zero
#4 save_zero_page
#5 ram_save_target_page_legacy
#6 ram_save_host_page
#7 ram_find_and_save_block
#8 ram_save_iterate
#9 qemu_savevm_state_iterate
#10 migration_iteration_run
#11 migration_thread
#12 qemu_thread_start
To avoid this VM crash during the migration, prevent the migration
when a known hardware poison exists on the VM.
Signed-off-by: William Roche <william.roche@oracle.com>
Link: https://lore.kernel.org/r/20240130190640.139364-2-william.roche@oracle.com
Signed-off-by: Peter Xu <peterx@redhat.com>
kvm_get_radix_page_info() is only defined for ppc targets (in
target/ppc/kvm.c). The declaration is not useful in other targets,
reduce its scope.
Rename using the 'kvmppc_' prefix following other declarations
from target/ppc/kvm_ppc.h.
Suggested-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20231003070427.69621-2-philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
NR_IOBUS_DEVS was increased to 200 in Linux 2.6.34. By Linux 3.5 it had
increased to 1000 and later ioeventfds were changed to not count against
the limit. But the earlier limit of 200 would already be enough for
kvm_check_many_ioeventfds() to be true, so remove the check.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is a remnant of pre-VFIO device assignment; it is not defined
anymore by Linux and not used by QEMU.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_IRQFD was introduced in Linux 2.6.32, and since then it has always been
available on architectures that support an in-kernel interrupt controller.
We can require it unconditionally.
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM_IRQFD was introduced in Linux 2.6.32, and since then it has always been
available on architectures that support an in-kernel interrupt controller.
We can require it unconditionally.
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This was introduced in KVM in Linux 3.5, we can require it unconditionally
in kvm_irqchip_send_msi(). However, not all architectures have to implement
it so check it only in x86, the only architecture that ever had MSI injection
but not KVM_CAP_SIGNAL_MSI.
ARM uses it to detect the presence of the ITS emulation in the kernel,
introduced in Linux 4.8. Assume that it's there and possibly fail when
realizing the arm-its-kvm device.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let's return the number of free slots instead of only checking if there
is a free slot. While at it, check all address spaces, which will also
consider SMM under x86 correctly.
This is a preparation for memory devices that consume multiple memslots.
Message-ID: <20230926185738.277351-5-david@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
kvm_has_pit_state2() is only defined for x86 targets (in
target/i386/kvm/kvm.c). Its declaration is pointless on
all other targets. Have it return a boolean.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-13-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_get_apic_state() is only defined for x86 targets (in
hw/i386/kvm/apic.c). Its declaration is pointless on all
other targets.
Since we include "linux-headers/asm-x86/kvm.h", no need
to forward-declare 'struct kvm_lapic_state'.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-12-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_arch_get_supported_cpuid() / kvm_arch_get_supported_msr_feature()
are only defined for x86 targets (in target/i386/kvm/kvm.c). Their
declarations are pointless on other targets.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20230904124325.79040-11-philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Changes the signature of the target-defined functions for
inserting/removing kvm hw breakpoints. The address and length arguments
are now of vaddr type, which both matches the type used internally in
accel/kvm/kvm-all.c and makes the api target-agnostic.
Signed-off-by: Anton Johansson <anjo@rev.ng>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230807155706.9580-4-anjo@rev.ng>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
kvm_arch_get_default_type() returns the default KVM type. This hook is
particularly useful to derive a KVM type that is valid for "none"
machine model, which is used by libvirt to probe the availability of
KVM.
For MIPS, the existing mips_kvm_type() is reused. This function ensures
the availability of VZ which is mandatory to use KVM on the current
QEMU.
Cc: qemu-stable@nongnu.org
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-id: 20230727073134.134102-2-akihiko.odaki@daynix.com
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
[PMM: added doc comment for new function]
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Commit 1e05888ab5 ("sysemu/kvm: Remove unused headers") was
a bit overzealous while cleaning "sysemu/kvm.h" headers:
kvm_arch_post_run() returns a MemTxAttrs type, so depends on
"exec/memattrs.h" for its definition.
Fixes: 1e05888ab5 ("sysemu/kvm: Remove unused headers")
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230619074153.44268-5-philmd@linaro.org>
Several hypervisor capabilities in KVM are target-specific. When exposed
to QEMU users as accelerator properties (i.e. -accel kvm,prop=value), they
should not be available for all targets.
Add a hook for targets to add their own properties to -accel kvm, for
now no such property is defined.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220929072014.20705-3-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As HW virtualization requires specific support to handle breakpoints
lets push out special casing out of the core gdbstub code and into
AccelOpsClass. This will make it easier to add other accelerator
support and reduces some of the stub shenanigans.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Mads Ynddal <mads@ynddal.dk>
Message-Id: <20220929114231.583801-45-alex.bennee@linaro.org>
When AMD's SEV memory encryption is in use, flash memory banks (which are
initialed by pc_system_flash_map()) need to be encrypted with the guest's
key, so that the guest can read them.
That's abstracted via the kvm_memcrypt_encrypt_data() callback in the KVM
state.. except, that it doesn't really abstract much at all.
For starters, the only call site is in code specific to the 'pc'
family of machine types, so it's obviously specific to those and to
x86 to begin with. But it makes a bunch of further assumptions that
need not be true about an arbitrary confidential guest system based on
memory encryption, let alone one based on other mechanisms:
* it assumes that the flash memory is defined to be encrypted with the
guest key, rather than being shared with hypervisor
* it assumes that that hypervisor has some mechanism to encrypt data into
the guest, even though it can't decrypt it out, since that's the whole
point
* the interface assumes that this encrypt can be done in place, which
implies that the hypervisor can write into a confidential guests's
memory, even if what it writes isn't meaningful
So really, this "abstraction" is actually pretty specific to the way SEV
works. So, this patch removes it and instead has the PC flash
initialization code call into a SEV specific callback.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
When split irqchip support was introduced, the meaning of
kvm_irqchip_in_kernel() changed: now it only means the LAPIC is
in kernel. The PIC, IOAPIC, and PIT might be in userspace if
irqchip=split was set. Update the doc comment to reflect that.
While at it, remove the "the user asked us" part in
kvm_irqchip_is_split() doc comment. That macro has nothing to do
with existence of explicit user-provided options.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20200922203612.2178370-1-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Some typedefs and macros are defined after the type check macros.
This makes it difficult to automatically replace their
definitions with OBJECT_DECLARE_TYPE.
Patch generated using:
$ ./scripts/codeconverter/converter.py -i \
--pattern=QOMStructTypedefSplit $(git grep -l '' -- '*.[ch]')
which will split "typdef struct { ... } TypedefName"
declarations.
Followed by:
$ ./scripts/codeconverter/converter.py -i --pattern=MoveSymbols \
$(git grep -l '' -- '*.[ch]')
which will:
- move the typedefs and #defines above the type check macros
- add missing #include "qom/object.h" lines if necessary
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-9-ehabkost@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20200831210740.126168-10-ehabkost@redhat.com>
Message-Id: <20200831210740.126168-11-ehabkost@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
This is majorly only for X86 because that's the only one that supports
split irqchip for now.
When the irqchip is split, we face a dilemma that KVM irqfd will be
enabled, however the slow irqchip is still running in the userspace.
It means that the resamplefd in the kernel irqfds won't take any
effect and it will miss to ack INTx interrupts on EOIs.
One example is split irqchip with VFIO INTx, which will break if we
use the VFIO INTx fast path.
This patch can potentially supports the VFIO fast path again for INTx,
that the IRQ delivery will still use the fast path, while we don't
need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
callers of vfio_eoi() hook). However the EOI of the INTx will still
need to be done from the userspace by caching all the resamplefds in
QEMU and kick properly for IOAPIC EOI broadcast.
This is tricky because in this case the userspace ioapic irr &
remote-irr will be bypassed. However such a change will greatly boost
performance for assigned devices using INTx irqs (TCP_RR boosts 46%
after this patch applied).
When the userspace is responsible for the resamplefd kickup, don't
register it on the kvm_irqfd anymore, because on newer kernels (after
commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
irqchip and resamplefd. This will make sure that the fast path will
work for all supported kernels.
https://patchwork.kernel.org/patch/10738541/#22609933
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20200318145204.74483-5-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a SIGBUS signal handler. In this handler, it checks the SIGBUS type,
translates the host VA delivered by host to guest PA, then fills this PA
to guest APEI GHES memory, then notifies guest according to the SIGBUS
type.
When guest accesses the poisoned memory, it will generate a Synchronous
External Abort(SEA). Then host kernel gets an APEI notification and calls
memory_failure() to unmapped the affected page in stage 2, finally
returns to guest.
Guest continues to access the PG_hwpoison page, it will trap to KVM as
stage2 fault, then a SIGBUS_MCEERR_AR synchronous signal is delivered to
Qemu, Qemu records this error address into guest APEI GHES memory and
notifes guest using Synchronous-External-Abort(SEA).
In order to inject a vSEA, we introduce the kvm_inject_arm_sea() function
in which we can setup the type of exception and the syndrome information.
When switching to guest, the target vcpu will jump to the synchronous
external abort vector table entry.
The ESR_ELx.DFSC is set to synchronous external abort(0x10), and the
ESR_ELx.FnV is set to not valid(0x1), which will tell guest that FAR is
not valid and hold an UNKNOWN value. These values will be set to KVM
register structures through KVM_SET_ONE_REG IOCTL.
Signed-off-by: Dongjiu Geng <gengdongjiu@huawei.com>
Signed-off-by: Xiang Zheng <zhengxiang9@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Xiang Zheng <zhengxiang9@huawei.com>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-id: 20200512030609.19593-10-gengdongjiu@huawei.com
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Both gsi_handler() and kvm_pc_gsi_handler() have the same content,
except one comment. Move the comment, and de-duplicate the code.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVMState struct is opaque, so provide accessors for the fields
that will be moved from current_machine to the accelerator. For now
they just forward to the machine object, but this will change.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Awareness of an in kernel irqchip is usually local to the machine and its
top-level interrupt controller. However, in a few cases other things need
to know about it. In particular vfio devices need this in order to
accelerate interrupt delivery.
If interrupt routing is changed, such devices may need to readjust their
connection to the KVM irqchip. pci_bus_fire_intx_routing_notifier() exists
to do just this.
However, for the pseries machine type we have a situation where the routing
remains constant but the top-level irq chip itself is changed. This occurs
because of PAPR feature negotiation which allows the guest to decide
between the older XICS and newer XIVE irq chip models (both of which are
paravirtualized).
To allow devices like vfio to adjust to this change, introduce a new
notifier for the purpose kvm_irqchip_change_notify().
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
VMX requires 64-bit feature words for the IA32_VMX_EPT_VPID_CAP
and IA32_VMX_BASIC MSRs. (The VMX control MSRs are 64-bit wide but
actually have only 32 bits of information).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>