Sorry, the title should be:
XCP-ng 8.3 freezes and/or reboots on an HGR-SAP-3 from OVH
Hi all,
This is my first time with such a high-end setup, and it is a big step forward for me.
HGR-SAP-3:
2 x 480 GB SATA SSD, software RAID 1
12 x 3.84 TB SAS SSD, hardware RAID 50
96 CPU: 2 x Intel Xeon Gold 6248R - 24c/48t - 3 GHz/4 GHz
768 GB RAM
PCI devices :
4x MT27800 Family [ConnectX-5] (Mellanox Technologies ConnectX-5 EN network interface card for OCP2.0, Type 1, with host management, 25GbE dual-port SFP28, PCIe3.0 x8, no bracket Halogen free ; MCX542B-ACAN)
RAID bus controller MegaRAID SAS-3 3324 [Intruder]
VMs in production:
14 VMs
56 vCPUs
339 GiB of RAM
6.8 TiB of disk (because of ZFS)
100% Windows environment
All VMs are on my RAID 50 ZFS disk
(zpool create -o ashift=12 -m // zfs set dedup=on // zfs set atime=off // zfs set sync=disabled (I removed the names))
XCP-ng 8.3 with all patches applied, and 16 GiB of memory for XCP-ng itself (because I use an NFS SR)
This is my first time with XCP-ng 8.3 and ZFS. I created a pool with one host in it and bonded the network cards.
I don't have a good backup practice for my VMs yet, so I use replication.
If I don't run any backup, XCP-ng is fine.
I have about 400 GB of orphan VDIs.
When I ran a replication of the VMs this weekend, the system froze.
I connected through the IPMI and saw these messages (transcribed from my screenshot):
[ 0.099269] **CPU0: Unexpected LVT thermal interrupt!**
[ 0.099271] do_IRQ: 0.162 No irq handler for vector
[ 0.598711] efi: EFI_MEMMAP is not enabled.
[ 2.561621] Out of memory: Kill process 117 (systemd-journal) score 18 or sacrifice child
[ 2.561759] Killed process 117 (systemd-journal) total-vm:34984kB, anon-rss: 276kB, file-rss:2572kB, shmem-rss:1208kB
[ 2.562721] Out of memory: Kill process 281 (dracut-initqueu) score 13 or sacrifice child
[ 2.562781] Killed process 291 (udevadm) total-vm:32860kB, anon-rss: 280kB, file-rss:2404kB, shmem-rss: 0kB
[ 2.565352] Out of memory: Kill process 281 (dracut-initqueu) score 13 or sacrifice child
[ 2.565410] Killed process 281 (dracut-initqueu) total-vm: 12124kB, anon-rss:596kB, file-rss:2452kB, shmem-rss:0kB
[ 2.567166] Out of memory: Kill process 355 (systemd-journal) score 13 or sacrifice child
[ 2.567224] Killed process 355 (systemd-journal) total-vm: 30888kB, anon-rss: 252kB, file-rss:2596kB, shmem-rss:0kB
[FAILED] Failed to start Journal Service.
[ 2.568151] Out of memory: Kill process 277 (systemd-udevd) score 12 or sacrifice child
[ 2.568209] Killed process 277 (systemd-udevd) total-vm: 38980kB, anon-rss: 420kB, file-rss:2224kB, shmem-rss:0kB
See 'systemctl status systemd-journald.service' for details.
[ OK ] Stopped Journal Service.
Starting Journal Service...
[ 2.570273] Out of memory: Kill process 357 (systemd) score 3 or sacrifice child
[ 2.570824] Out of memory: Kill process 359 (systemd) score 3 or sacrifice child
[ 2.570322] Killed process 357 (systemd) total-vm: 42884kB, anon-rss:636kB, file-rss: 0kB, shmem-rss: 0kB
[ 2.570872] Killed process 359 (systemd) total-vm: 42884kB, anon-rss:636kB, file-rss:0kB, shmem-rss: 0kB
[ 2.571758] mlx5_core 0000:86:00.0: mlx5_function_setup: 1486: (pid 277): failed to allocate init pages
[ OK ] Started dracut pre-mount hook.
[FAILED] Failed to start Journal Service.
See 'systemctl status systemd-journald.service' for details.
[ OK ] Stopped Journal Service.
Starting Journal Service...
[ OK ] Started Journal Service.
[ 2.596733] mlx5_core 0000:86:00.0: probe_one: 2136: (pid 277): mlx5_init_one failed with error code -12
[ 2.600905] mlx5_core 0000:1a:00.0: mlx5e_create_netdev:7195: (pid 277): mlx5e_priv_init failed, err=-12
[ 2.601049] mlx5_core 0000:1a:00.0: mlx5e_probe: 7482: (pid 277): mlx5e_create_netdev failed
[ 2.601205] mlx5_core 0000:1a:00.1: mlx5e_create_netdev:7195: (pid 277): mlx5e_priv_init failed, err=-12
[ 2.601348] mlx5_core 0000:1a:00.1: mlx5e_probe: 7482: (pid 277): mlx5e_create_netdev failed
[ 2.838490] megaraid_sas 0000:3b:00.0: Could not allocate memory for map info megasas_allocate_raid_maps:1620
[ 2.838786] megaraid_sas 0000:3b:00.0: Failed from megasas_init_fw 6392
[***] A start job is running for dev-disk-by\x2dlabel-root\x2dxrdbfu.device (1d 15h 49min 32s no limit)
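In case it helps anyone who wants to go through a saved copy of output like this, here is a minimal sketch of how the out-of-memory kills and the "-12" driver failures could be tallied. It assumes a POSIX shell with grep, sed, sort and uniq. The function name summarize_oom and the file name console.log are made up for the example, and -12 is read as ENOMEM, as on Linux.

```shell
#!/bin/sh
# Sketch only: summarize OOM-killer activity and -12 (ENOMEM) driver errors
# found in a saved console/kernel log. "summarize_oom" is a hypothetical helper.

summarize_oom() {
    log="$1"

    # Count "Killed process" lines per process name (the text in parentheses).
    echo "OOM kills per process:"
    grep 'Killed process' "$log" \
        | sed -n 's/.*Killed process [0-9]* (\([^)]*\)).*/\1/p' \
        | sort | uniq -c | sort -rn

    # Count lines where a driver reports error -12 (ENOMEM on Linux).
    echo "Driver lines reporting -12 (ENOMEM):"
    grep -c -e 'error code -12' -e 'err=-12' "$log"
}

# Example call (console.log is a placeholder file name):
#   summarize_oom console.log
```

The idea is only to get a quick count per process and per error type before reading the full logs; it does not explain why the memory ran out.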
I can't find the "CPU0: Unexpected LVT thermal interrupt!" message anywhere in the XCP-ng logs.
I have kept all the logs.
I powered the server off and on to reboot XCP-ng (at 8 am).
But with autostart "On" for all VMs and an automatic "Garbage Collector for SR" task, I had a lot of trouble:
the VMs were very slow and the garbage collection task was running at the same time.
XCP-ng froze and crashed, with or without a reboot, 4 times between 12:00 and 13:00.
After the second or third reboot I had time to disable autostart on all VMs and start the VMs one by one.
I disabled the backup task, and XCP-ng has not rebooted since Monday afternoon.
This is my second server from OVH. The first one had another bug, so I was able to exchange it, but in fact I have the same freeze and reboot problem.
I am only sure of the result (freeze and/or crash), not of what causes it in the first place.
The final consequence was XCP-ng running out of memory.
I have hundreds of MB of logs if you need anything more.
Perhaps I'm wrong and it is a hardware incompatibility (network card, hardware RAID...)?
I have a Zabbix SNMP probe.
Thanks