Posts made by dredknight | XCP-ng and XO forum

dredknight

@Linus-S here is something you can do:

run a virtual environment on your PC (Virtualbox?) that can export the disk in VHD or QCOW2 format.
install debian 12
build the XOA from sources
once installed shut the vm down and export the disk
if the disk is in QCOW2 format, covert it to VHD using qemu

qemu-img convert -O vpc  xoa-disk.qcow2 xoa-disk.vhd

import the disk in the restricted hypervisor environment
create a VM from it
(bonus) Make a backup of the VM so you have it in handy for other less fortunate days

dredknight

@Ascar B) is your way to go. VMs with PCI passthrough will fail migration (A) as the passthrough device itself cannot be migrated.

dredknight

@olivierlambert I was just trying to describe it better. If that is the case then all is good.

dredknight

@olivierlambert the additional IP is used when one have a dedicated network for ISCSI network traffic. Here are some examples;

Case 1
imagine you have host 3 interfaces:

eth0 is xcp management - 10.10.10.10
eth1 is trunk for vlans - <trunk>
eth2 is free and you want to use its bandwidth to attach your new ISCSI Storage. This way you can get the maximum bandwidth possible as eth2 is not overlapping with eth0. ISCSI configuration requires assigned IP on the interface but does not require it to be of type management (as eth0).

Case 2
Same as the previous one except the user decides to setup the ISCSI network ip on management interface.

dredknight

@creoleMalady one more suggestion for the sake of an alternative solution. If you insist on managing multiple servers not in host (even though running them in a pool would equalize the CPU features to the oldest CPU available) you can use Cloudstack as a management platform to do the templates and networking for you. It dynamically transfers templates and assigns networks on demand on the hosts as you provision vms on them.

The disadvantage is that you need to learn Cloudstack which can be much of an overhead by its own if you dont want to go too deep.

dredknight

FYI - it was a CPU issue.
We changed the CPUs between servers and it moved with them.

Thanks for the tips everyone!

dredknight

@olivierlambert said in Weird kern.log errors:

That's indeed a clue if it happens ONLY on this machine if you run the same VMs on others without triggering any problem, and if you have exact same versions of XCP-ng/BIOS/firmware and the same hardware

Reply

All VMs spin VMs from the same image (Using CloudStack on top) and all workload pushed on them is the same.
If one VM was the issue it would make sense the VM is at fault, but in the current case both VMs on that server get broken at the same time.

All servers have 2 SSDs in RAID 1. In the last iteration we did not use RAID and placed 1 VM on different SSD disk just in case the issue can come from there. The problem still appeared in the same way as before.

We are thinking of other ways to hold tests at the moment. Will keep you updated :).

dredknight

@olivierlambert host = physical server. I fixed it in the last post.

dredknight

Yes only on this one it happens in the span of 2-5 days after complete reset.
From user perspective we know the following.

Server has 2 vms running on 80-100% CPU utilization (each server has 64 cores assigned, the server has 2 EPYC CPUs for a total of 128 cores).
After the issue occurs one of the VM becomes unresponsive and the other is kind of fine - you can login but no commands can be executed on it. For example you type "top" or "df -h" press enter and it stays like that indefinitely with no output.

One tip I got, though a bit far-fetched, is it can be a "cosmic ray" behaviour. I dont know about that but so far nothing else can be tracked.

dredknight

@Danp yes all servers came in one batch and had the same hardware and software installed at the time.
By all, I mean 36 in total same hardware and same workload. So it is just 1/36 error rate which pretty much means hardware
issue for me, but yet there is no diagnostic report of it which is quite peculiar.

dredknight

@olivierlambert here are the versions we have:

BIOS 2.10.2
I- drac - 6.10.30.0
Ethernet - 30.6.24
RAID backplane - 4.36
RAID Slot - 51.16.0-4076

Devices have latest firmware, updated 2 weeks ago as recommended by Dell.

dredknight

after some more debugging we found this issues reported by dmesg

dmesg |grep WARN -A 5
--
[126340.468112] WARNING: CPU: 11 PID: 1311 at arch/x86/xen/multicalls.c:130 xen_mc_flush+0x1aa/0x1c0
[126340.468113] Modules linked in: ebtable_nat arptable_filter arp_tables xt_set ip_set_hash_net ip_set nfnetlink tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache ebt_arp ebt_ip ebtable_filter ebtables xt_physdev br_netfilter bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c dm_multipath iptable_filter sr_mod cdrom sunrpc nls_iso8859_1 nls_cp437 vfat fat dm_mod uas usb_storage dcdbas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod ahci libahci
[126340.468140]  i40e(O) xhci_pci libata megaraid_sas(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[126340.468145] CPU: 11 PID: 1311 Comm: forkexecd Tainted: G        W  O      4.19.0+1 #1
[126340.468146] Hardware name: Dell Inc. PowerEdge C6525/04DK47, BIOS 2.10.2 10/25/2022
[126340.468147] RIP: e030:xen_mc_flush+0x1aa/0x1c0
--
[126340.468351] WARNING: CPU: 12 PID: 0 at arch/x86/xen/multicalls.c:130 xen_mc_flush+0x1aa/0x1c0
[126340.468358] Modules linked in: ebtable_nat arptable_filter arp_tables xt_set ip_set_hash_net ip_set nfnetlink tun rpcsec_gss_krb5 auth_rpcgss oid_registry nfsv4 nfs lockd grace fscache ebt_arp ebt_ip ebtable_filter ebtables xt_physdev br_netfilter bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc bonding bridge 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c dm_multipath iptable_filter sr_mod cdrom sunrpc nls_iso8859_1 nls_cp437 vfat fat dm_mod uas usb_storage dcdbas crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper sg i2c_piix4 ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter ip_tables x_tables hid_generic usbhid hid sd_mod ahci libahci
[126340.468423]  i40e(O) xhci_pci libata megaraid_sas(O) xhci_hcd scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
[126340.468438] CPU: 12 PID: 0 Comm: swapper/12 Tainted: G        W  O      4.19.0+1 #1
[126340.468439] Hardware name: Dell Inc. PowerEdge C6525/04DK47, BIOS 2.10.2 10/25/2022
[126340.468443] RIP: e030:xen_mc_flush+0x1aa/0x1c0

dredknight

Thank you both

We run memtest twice, but nothing was reported.

Hardware is:

host: Dell PowerEdge C6525
CPU: 2 x EPYC 7532 32 Core
Memory: 16 x 16GB 3200 MHz DDR4
Disks: 2 x 512 GB SSDs in RAID 1
RAID: Dell Perc H345

BIOS is v. 2.10.2
XCP is version 8.2.1 latest update (yum updated 10 days ago when it was reinstalled). Couldnt find where to get the exact patch number.

We have other servers with the same setup but just this one is having those issues.

dredknight

Environment: XCP-NG 8.2 with a few debian vms that squeeze 100% of the host CPUs. 2 SSD disks in raid 1.

Have anyone seen such errors in XCP kern.log?

May  3 10:22:27 xcpfuz135 kernel: [760888.661906] tapdisk[18876]: segfault at 7f2fc3ca9630 ip 00007f2fc3ca9630 sp 00007ffd030e3888 error 14
May  3 10:22:27 xcpfuz135 kernel: [760888.661918] Code: Bad RIP value.
May  3 11:58:13 xcpfuz135 kernel: [766634.433681] xcp-rrdd-xenpm[1492]: segfault at 7f316d5187d8 ip 00007f316d5187d8 sp 00007fff780db5b8 error 15
May  3 11:58:13 xcpfuz135 kernel: [766634.433687] Code: 00 00 02 04 00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00 d8 87 51 6d 31 7f 00 00 00 14 00 00 00 00 00 00 <08> 88 51 6d 31 7f 00 00 20 88 51 6d 31 7f 00 00 c8 88 51 6d 31 7f
May  3 12:25:52 xcpfuz135 kernel: [768293.718628] tapdisk[21047]: segfault at 7f625725d630 ip 00007f625725d630 sp 00007ffe541135e8 error 14 in zero[7fb8921de000+202000]
May  3 12:25:52 xcpfuz135 kernel: [768293.718646] Code: Bad RIP value.
May  3 13:20:43 xcpfuz135 kernel: [771584.713117] xapi4: port 2(vif2.0) entered disabled state
May  3 13:20:43 xcpfuz135 kernel: [771585.015102] xapi4: port 2(vif2.0) entered disabled state
May  3 13:20:43 xcpfuz135 kernel: [771585.015571] device vif2.0 left promiscuous mode
May  3 13:20:43 xcpfuz135 kernel: [771585.015588] xapi4: port 2(vif2.0) entered disabled state
May  3 13:21:44 xcpfuz135 kernel: [771645.993918] block tdc: sector-size: 512/512 capacity: 209715200
May  3 13:21:45 xcpfuz135 kernel: [771646.097319] block tdg: sector-size: 512/512 capacity: 2097152
May  3 13:21:45 xcpfuz135 kernel: [771646.098822] block tdh: sector-size: 512/512 capacity: 62914560
May  3 13:21:45 xcpfuz135 kernel: [771646.755689] xapi4: port 2(vif4.0) entered blocking state
May  3 13:21:45 xcpfuz135 kernel: [771646.755693] xapi4: port 2(vif4.0) entered disabled state
May  3 13:21:45 xcpfuz135 kernel: [771646.755853] device vif4.0 entered promiscuous mode
May  3 13:21:45 xcpfuz135 kernel: [771646.971247] xapi4: port 3(tap4.0) entered blocking state
May  3 13:21:45 xcpfuz135 kernel: [771646.971251] xapi4: port 3(tap4.0) entered disabled state
May  3 13:21:45 xcpfuz135 kernel: [771646.971373] device tap4.0 entered promiscuous mode
May  3 13:21:45 xcpfuz135 kernel: [771646.978912] xapi4: port 3(tap4.0) entered blocking state
May  3 13:21:45 xcpfuz135 kernel: [771646.978914] xapi4: port 3(tap4.0) entered forwarding state
May  3 13:22:07 xcpfuz135 kernel: [771668.627952] xapi4: port 3(tap4.0) entered disabled stat
May  3 13:22:07 xcpfuz135 kernel: [771668.628380] device tap4.0 left promiscuous mode
May  3 13:22:07 xcpfuz135 kernel: [771668.628415] xapi4: port 3(tap4.0) entered disabled state
May  3 13:22:19 xcpfuz135 kernel: [771680.603894] vif vif-4-0 vif4.0: Guest Rx ready
May  3 13:22:19 xcpfuz135 kernel: [771680.604048] xapi4: port 2(vif4.0) entered blocking state
May  3 13:22:19 xcpfuz135 kernel: [771680.604050] xapi4: port 2(vif4.0) entered forwarding state
May  3 15:49:08 xcpfuz135 kernel: [780489.913642] tapdisk[26502]: segfault at 7f2fc3ca9630 ip 00007f2fc3ca9630 sp 00007ffe201f1ac8 error 14 in zero (deleted)[7f3f31b69000+20000]
May  3 15:49:08 xcpfuz135 kernel: [780489.913660] Code: Bad RIP value.
May  3 15:54:57 xcpfuz135 kernel: [780838.914536] tapdisk[19705]: segfault at 7fff1e11cc50 ip 00007fff1e11cc50 sp 00007ffc52bbdf08 error 14
May  3 15:54:57 xcpfuz135 kernel: [780838.914549] Code: Bad RIP value.
May  3 15:55:01 xcpfuz135 kernel: [780842.957102] tapdisk[26429]: segfault at 7f0668b77630 ip 00007f0668b77630 sp 00007ffc2dfedfb8 error 14 in zero[7f14475ad000+202000]
May  3 15:55:01 xcpfuz135 kernel: [780842.957123] Code: Bad RIP value.

There are issues in the VM instances - they are up but systemd commands do not work anymore (ps aux, top, choose your pick).That started after the segfault issues above. Some instance logs:

[12561.047902] systemd[1]: systemd-journald.service: Killing process 75421 (systemd-journal) with signal SIGKILL.
[12561.062143] systemd[1]: systemd-journald.service: Killing process 75425 (systemd-journal) with signal SIGKILL.
[12561.076665] systemd[1]: systemd-journald.service: Killing process 75428 (systemd-journal) with signal SIGKILL.
[12561.090558] systemd[1]: systemd-journald.service: Killing process 75430 (systemd-journal) with signal SIGKILL.
[12651.228990] systemd[1]: systemd-journald.service: Processes still around after final SIGKILL. Entering failed mode.
[12651.244675] systemd[1]: systemd-journald.service: Failed with result 'timeout'.
[12651.255652] systemd[1]: systemd-journald.service: Unit process 859 (systemd-journal) remains running after unit stopped.
[12651.270724] systemd[1]: systemd-journald.service: Unit process 75415 (systemd-journal) remains running after unit stopped.
[12651.285416] systemd[1]: systemd-journald.service: Unit process 75418 (systemd-journal) remains running after unit stopped.
[12651.299654] systemd[1]: systemd-journald.service: Unit process 75421 (systemd-journal) remains running after unit stopped.
[12651.313796] systemd[1]: systemd-journald.service: Unit process 75425 (systemd-journal) remains running after unit stopped.
[12651.327886] systemd[1]: systemd-journald.service: Unit process 75428 (systemd-journal) remains running after unit stopped.
[12651.341851] systemd[1]: systemd-journald.service: Unit process 75430 (systemd-journal) remains running after unit stopped.
[12651.355831] systemd[1]: systemd-journald.service: Unit process 75437 (systemd-journal) remains running after unit stopped.
[12651.370338] systemd[1]: Failed to start Journal Service.
[12651.378681] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 8.
[12651.390940] systemd[1]: Stopped Journal Service.
[12651.397801] systemd[1]: systemd-journald.service: Found left-over process 859 (systemd-journal) in control group while starting unit. Ignoring.
[12651.414744] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.429011] systemd[1]: systemd-journald.service: Found left-over process 75415 (systemd-journal) in control group while starting unit. Ignoring.
[12651.446003] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.460405] systemd[1]: systemd-journald.service: Found left-over process 75418 (systemd-journal) in control group while starting unit. Ignoring.
[12651.477450] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.492410] systemd[1]: systemd-journald.service: Found left-over process 75421 (systemd-journal) in control group while starting unit. Ignoring.
[12651.509964] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.524694] systemd[1]: systemd-journald.service: Found left-over process 75425 (systemd-journal) in control group while starting unit. Ignoring.
[12651.541827] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.556536] systemd[1]: systemd-journald.service: Found left-over process 75428 (systemd-journal) in control group while starting unit. Ignoring.
[12651.575905] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.592435] systemd[1]: systemd-journald.service: Found left-over process 75430 (systemd-journal) in control group while starting unit. Ignoring.
[12651.612706] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.629075] systemd[1]: systemd-journald.service: Found left-over process 75437 (systemd-journal) in control group while starting unit. Ignoring.
[12651.647205] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[12651.893002] systemd[1]: Starting Journal Service...

Reinstalling the whole environment (xcp-ng + vms) will allow it to run for several days and then this issue will come back. We think it is a hardware issue but diagnostics show no such problem.

dredknight

@olivierlambert we couldn't find a reason either.
We will run more tests in the following weeks and report if we find anything of value.

dredknight

@olivierlambert this is just 1 single host in a single cluster. Only local SSD storage.

It is managed by Cloudstack, we are doing high performance tests on that server so nothing of value on it. I thought logs can help find the issue if such actually exists.

dredknight

Reuploaded logs again because tgz was not allowed format. Added .txt extention so just remove it and extract.

dredknight

Hey everyone,
bumped into this topic as we had similar issue with one of our test vms. full day of logs attached logs.tgz.txt

We found out that on the 5th of April one of the VMs shutdown. You can see the issues in the log after 14:10:00.

grep error <the log file>
to see the specific messages.

We are still investigating and not yet sure what is the problem but it seems like it is related to storage.

XCP is latest version 8.2.1.

dredknight

@olivierlambert its looks like it was not from XCP but from the cloud provider managing it (Cloudstack). as soon as max restrictions were removed i could assign all 128 threads as vCPUs to a single VM.

dredknight

alright i managed to put my hands on such host.

out of 128 cores (including hyperthreading) i managed to create a Debian 11 vm with 108 cores. if i go higher i get an error.

I assume the error is due to some threads being exclusive for the hypervisor. Its possible that is the actual limit too though.

Not that important though just got curious how far i can push it.