@steff22 Wow, great news! Kudos to the Xen & XCP-ng dev teams
Posts made by tuxen
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
The bios disables internal IPMI video when an ext GPU card is connected, even though the int GPU is selected as primary GPU in the bios. So I only see the XCP-ng startup on screen, no xsconsole. Have tried without a screen connected to the ext GPU; same error then
I suggest calling ASRock support and reporting this behavior.
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
no. 2 Have tried pressing Detect only to be told that there is no more screen. Have only tried reboot
Could you try a shutdown/start cycle after the driver installation?
@steff22 said in Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work:
At first I thought there was something wrong with the bios. But this works with Vmware esxi and proxmox.
Considering it worked with the same XCP-ng version but on different hardware, I'm more inclined to suspect a Xen incompatibility with the Nvidia + some AMD motherboards combo. If you search the forum, you'll find mixed results about that.
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 I have some questions:
- Is the host being powered up with a monitor or a dummy plug (headless) already attached to the dGPU?
- Right after the driver installation succeeds (showing that the device is working OK) and without rebooting the VM, what happens if you click the Detect button in the display settings window?
- Instead of a reboot, did you try a VM shutdown/start cycle the first time after the driver installation?
Nonetheless, if the same dGPU card works normally on another XCP-ng host, a possible Xen passthrough incompatibility with that AM5 board should be considered. For example:
- CSM/UEFI GPU compat issues, as referenced by @Teddy-Astie
- Beta BIOS with broken IOMMU or lacking features (e.g. ACS support for PCI/PCIe isolation)
Tux
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 After reading this Blue Iris topic, I wonder if it's related. As of Xen 4.15, there was a change in MSR handling that can cause a guest crash when it tries to access those registers. XCP-ng 8.3 ships Xen 4.17. The issue also seems to be CPU vendor/model dependent.
https://xcp-ng.org/forum/topic/8873/windows-blue-iris-xcp-ng-8-3
It's worth testing the solution provided there (a VM shutdown/start cycle is required for it to take effect):
xe vm-param-add uuid=<VM-UUID> param-name=platform msr-relaxed=true
Replace <VM-UUID> with your W10 VM's UUID.
Tux
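As a sanity check afterwards, the platform map can be inspected with the standard xe CLI (a sketch; <VM-UUID> is a placeholder, run on the host):

```shell
# The platform map should now list msr-relaxed: true
xe vm-param-get uuid=<VM-UUID> param-name=platform

# To undo the workaround later (shutdown/start the VM again afterwards):
xe vm-param-remove uuid=<VM-UUID> param-name=platform param-key=msr-relaxed
```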
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 Weird bug. Is that W10 VM a fresh install on Xen? It seems that the driver or the dGPU is timing out somehow. It could be related to PCIe power management (ASPM), but I'm not sure. You could try booting dom0 with pcie_aspm=off just for testing:
/opt/xensource/libexec/xen-cmdline --set-dom0 "pcie_aspm=off"
reboot
Another option that comes to mind is to compare the VM attributes on Proxmox and try to spot any VM config differences by setting/unsetting the PCI Express option.
Tux
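In case it helps, the flag can be inspected and later removed with the same helper (a sketch; I'm assuming the --get-dom0/--delete-dom0 options of XCP-ng's xen-cmdline script):

```shell
# Inspect what's currently set on the dom0 kernel command line:
/opt/xensource/libexec/xen-cmdline --get-dom0 pcie_aspm

# Once testing is done, drop the flag and reboot again:
/opt/xensource/libexec/xen-cmdline --delete-dom0 pcie_aspm
reboot
```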
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 Ah, you should try to reproduce the BSOD and then run xl dmesg. I was wondering why there's no error in the log this time.
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 Ok, let's boot Xen in verbosity=all mode:
/opt/xensource/libexec/xen-cmdline --set-xen "loglvl=all guest_loglvl=all" reboot
After the VM BSODs, post the xl dmesg output.
Tux
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@Teddy-Astie @steff22 For the Windows VM, Xen is indeed triggering a guest crash:
(XEN) [ 1022.240112] d1v2 VIRIDIAN GUEST_CRASH: 0x116 0xffffdb8ffaf76010 0xfffff8077938e9f0 0xffffffffc0000001 0x4
I also noticed that dom0 memory was autoset to only 2.6G (out of 32G total), which might be low for a more resource-hungry dGPU. Before booting Xen in debug mode, could we rule this out by testing a non-persistent boot change to a higher value (e.g. 8G)?
- In the first GRUB menu option "XCP-ng", press the <e> key to edit the boot line
- Locate and set: dom0_mem=8192M,max:8192M
- Press <F10> to boot
- Check dom0 free memory (free -m). It must be within the 7000-8000 range.
- Boot the Windows VM.
Tux
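If the higher value fixes it, the change can be made persistent with the xen-cmdline helper (a sketch; dom0_mem is a Xen boot parameter, so it goes on the Xen command line):

```shell
# Persist the dom0 memory setting across reboots on XCP-ng:
/opt/xensource/libexec/xen-cmdline --set-xen "dom0_mem=8192M,max:8192M"
reboot
```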
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 What's the output of lspci -k and xl pci-assignable-list?
Also, the outputs of the system logs regarding GPU and IOMMU initialization would be very useful:
egrep -i '(nvidia|vga|video|pciback)' /var/log/kern.log
xl dmesg
Tux
-
RE: Gpu passthrough on Asrock rack B650D4U3-2L2Q will not work
@steff22 Assuming the xen-pciback.hide was previously set, could you try this workaround (no guarantee it'll work, since each motherboard and BIOS has its quirks):
/opt/xensource/libexec/xen-cmdline --set-dom0 "pci=realloc"
reboot
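For completeness, hiding the dGPU from dom0 usually looks like this (a sketch with a hypothetical PCI address; take the real BDF from lspci):

```shell
# Hide the dGPU from dom0 so Xen can assign it (0000:01:00.0 is a made-up BDF):
/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:01:00.0)"
reboot
```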
-
RE: MAP_DUPLICATE_KEY error in XOA backup - VM's wont START now!
@jshiells said in MAP_DUPLICATE_KEY error in XOA backup - VM's wont START now!:
@tuxen no sorry, great idea but we are not seeing any errors like that in kern.log. this problem when it happens is across several xen hosts all at the same time. it would be wild if all of the xen hosts were having hardware issues during the small window of time this problem happened in. if it was one xen server then i would look at hardware but its all of them, letting me believe its XOA, a BUG in xcp-ng or a storage problem (even though we have seen no errors or monitoring blips at all on the truenas server)
In a cluster with shared resources, it only takes one unstable host or a crashed PID left holding an open shared resource to cause an unclean state cluster-wide. If the hosts and VMs aren't protected by an HA/STONITH mechanism that automatically releases the resources, a qemu crash might keep one or more VDIs in an invalid, blocked state and affect SR scans done by the master. Failed SR scans may prevent SR-dependent actions (e.g. VM boot, snapshots, GC kicking in, etc.).
I'm not saying there isn't a bug somewhere but running MEMTEST/CPU tests on the host that triggered the bad RIP error would be my starting point of investigation. Just to be sure.
-
RE: MAP_DUPLICATE_KEY error in XOA backup - VM's wont START now!
@jshiells Did you also check /var/log/kern.log for hardware errors? I'm seeing the qemu process crashing with a bad RIP (instruction pointer) value, which screams hardware issue, IMO. Just a single bit flipping in memory is enough to cause unpleasant surprises. I hope the servers are using ECC memory. I'd run a memtest and some CPU stress tests on that server.
Some years ago, I had a two-socket Dell server with one bad core (no errors reported at boot). When the Xen scheduler ran a task on that core... boom, host crash.
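To picture why a single flipped bit is so destructive, here's a toy shell sketch (made-up value, nothing XCP-ng-specific):

```shell
# Toy illustration of a single-bit memory flip (hypothetical stored value):
addr=$(( 0x12345678 ))          # pretend this is part of a stored pointer
flipped=$(( addr ^ (1 << 40) )) # one bit flipped by faulty (non-ECC) RAM
printf 'before: 0x%X\n' "$addr"
printf 'after:  0x%X\n' "$flipped"
printf 'offset: %d bytes\n' $(( flipped - addr ))
```

One flipped high bit moves the value about 1 TiB away; if that were a pointer, the next dereference or jump lands nowhere sane.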
-
RE: AMD Radeon S7150x2 - Not being seen by VMs
@cunrun @jorge-gbs Any init errors in dom0 /var/log/kern.log regarding the GIM driver? Also, if you search the topics here covering this specific GPU, there were mixed results booting dom0 with pci=realloc,assign-busses. It might be worth a try.
-
RE: Windows 2022 VM - Reboot triggered - VM shuts down
@KPS One thing is clear to me: the reboot is triggering a VM shutdown due to a system crash (the kernel errors and memory dump files being a lead). Without a detailed stack trace (like Linux's kernel panic), and given the difficulty in reproducing the issue, troubleshooting is very hard. One last thing I'd check is /var/log/daemon.log within the VM shutdown time window.
RE: Windows 2022 VM - Reboot triggered - VM shuts down
@KPS I was exactly thinking about an after-hours task doing heavy storage I/O (e.g. data replication or ETL-like workloads). Under this scenario, a forced reboot might cause some sort of file-system corruption due to uncommitted data being lost.
Now, another possible source of trouble comes to mind: automatic Windows Update. Is this service active? I'm not a Windows expert, but a forced reboot during a system update might also cause unexpected behavior.
Seeing all those errors, it seems that some system file or DLL got corrupted and needs a repair. I strongly recommend taking a snapshot before running a system repair.
-
RE: Windows 2022 VM - Reboot triggered - VM shuts down
@KPS When that forced reboot command is issued, is the VM:
- under intensive I/O?
- running a started backup job?
-
RE: AMD Radeon S7150x2 - Not being seen by VMs
@cunrun Are the XCP-ng host and the Windows Server 2019 VM booting in legacy/BIOS or UEFI mode? Since the FirePro was launched when legacy/BIOS was still the standard, I'd try that mode (if you haven't yet).
-
RE: Memory Consumption goes higher day by day
@dhiraj26683 Seeing the htop output, there are some HA-LIZARD PIDs running. So, yes, there's "extra stuff" installed on dom0.
HA-LIZARD uses the TGT iSCSI driver, which in turn has an implicit write-cache option enabled by default if not set [1][2]. Is this option disabled in /etc/tgt/targets.conf?
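For reference, a minimal targets.conf sketch with the write cache turned off (hypothetical IQN and backing store; double-check the syntax against your tgt version's man page):

```
# /etc/tgt/targets.conf (fragment)
<target iqn.2024-01.local.example:lun0>
    backing-store /dev/vg0/iscsi_lv
    write-cache off
</target>
```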
[1] https://www.halizard.com/ha-iscsi
[2] https://manpages.debian.org/testing/tgt/targets.conf.5.en.html
-
RE: Intel Xeon W-2145 CPU on SuperMicro & failing xenpm get-cpufreq-para
@gecant I don't have any Supermicro server to test (mostly Dell/HPe) but checking the mobo manual [1], sadly, there's no profile for tweaking the power settings. After checking the available options, try this config:
Advanced >> CPU Configuration >> Advanced Power Management Configuration
  > CPU P State Control
      SpeedStep (PStates)        [Enable]
      EIST PSD Function          [HW_ALL]
      Turbo Mode                 [Enabled]
  > Hardware PM State Control
      Hardware P-States          [Native Mode]
  > CPU C State Control
      Autonomous Core C-State    [Disable]
      CPU C6 Report              [Enable]
      Enhanced Halt State (C1E)  [Disable (performance)] or [Enable (powersave)]
  > Package C State Control
      Package C State            [C0/C1 (performance)] or [C6(Retention) state (powersave)]
[1] https://www.supermicro.com/manuals/motherboard/C420/MNL-2005.pdf
-
RE: Very scary host reboot issue
@darabontors some additional tests that I could think of:
- Minimum WG MTU on the client side (MTU=1280);
- OPNsense with emulated e1000 interfaces (bypasses the PV driver but not OVS). It'll keep the VM 'agile' (hot-migrate capable) but at a big performance cost;
- The latest OPNsense version, 23.7.5.
As for the latest version, I found this important info posted by the devs about a change in the MTU code [1]:
Today introduces a change in MTU handling for parent interfaces mostly
noticed by PPPoE use where the respective MTU values need to fit the
parent plus the additional header of the VLAN or PPPoE. Should the
MTU already be misconfigured to a smaller value it will be used as
configured so check your configuration and clear the MTU value if you
want the system to decide about the effective parent MTU size.
(...)
Hope it helps.
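For the first test, the client-side MTU can be pinned in the WireGuard config (a fragment sketch; key and address are placeholders):

```
# Client-side WireGuard config fragment (hypothetical key/address):
[Interface]
PrivateKey = <client-private-key>
Address = 10.77.0.2/32
MTU = 1280
```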