XCP-ng 7.5 - MegaRAID SAS 9240-8i hang/reboot issue.
-
@r1 Any kernel upgrade is a pain, regardless of how seamless some distros try to make it. I've been using XS for years and it's always been rock-solid in my production environment. This is the first time I've ever experienced a critical issue. Unfortunately it translates to an unstable system, so there is no telling when the machine is going to suffer a "critical-error" and hang at reboot.
My low-level familiarity with XS/XCP-ng does not extend to the depth where I would attempt a self-guided kernel upgrade. However, if there is a development fork where they are testing the 4.14 or 4.15 kernel, I would be inclined to evaluate it.
-
@olivierlambert said in XCP-ng 7.5 - MegaRAID SAS 9240-8i hang/reboot issue.:
- Same issue with XenServer 7.5?
YES! (Just got around to testing it.)
-
Let us see if a newer kernel would help. There is also the option of backporting the newer driver to the older kernel, possibly with code changes. Both will be experimental though!
-
Same thing happens in XS 7.6 when "upgrading" from XS 7.5. Interestingly enough, post upgrade the upper-left corner says XenServer 7.5 but the stats field and XenCenter report 7.6.
Same thing happens in a clean installation of XS 7.6 too.
-
@mpyusko Let me see if we can build driver 07.703.05.00-rc1 for your XCP-NG 7.5/6, will let you know if it becomes available.
-
@mpyusko Please get the driver from link and
[root@xcp-ng-rjv ~]# yum install megaraid_sas-07.703.05.00-1.x86_64.rpm
[root@xcp-ng-rjv ~]# rmmod megaraid_sas
[root@xcp-ng-rjv ~]# modprobe megaraid_sas
Then check your lspci output.
// Additional info
[root@xcp-ng-rjv ~]# modinfo /usr/lib/modules/4.4.0+10/weak-updates/megaraid_sas/megaraid_sas.ko
filename: /usr/lib/modules/4.4.0+10/weak-updates/megaraid_sas/megaraid_sas.ko
description: Avago MegaRAID SAS Driver
author: megaraidlinux.pdl@avagotech.com
version: 07.703.05.00
license: GPL
srcversion: 2A8AB66F9A16F0542FC2173
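To confirm that the rmmod/modprobe cycle really picked up the new module, the version field can be pulled out of the modinfo output and compared with the one in the RPM name. A minimal sketch; the helper name module_version is hypothetical and not part of any XCP-ng tooling.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "version:" field from
# modinfo-style text read on stdin. The "srcversion:" line is ignored
# because its first field is not exactly "version:".
module_version() {
    awk '$1 == "version:" { print $2; exit }'
}

# Example use on the host (not run here):
#   modinfo megaraid_sas | module_version
#   [ "$(modinfo megaraid_sas | module_version)" = "07.703.05.00" ] && echo "new driver in place"
```

This only inspects what modinfo resolves on disk. To see what the running kernel actually loaded, /sys/module/megaraid_sas/version can be read while the module is loaded.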
-
For the record...
@mpyusko said in XCP-ng 7.5 - MegaRAID SAS 9240-8i hang/reboot issue.:
Same thing happens in XS 7.6 when "upgrading" from XS 7.5. Interestingly enough, post upgrade the upper-left corner says XenServer 7.5 but the stats field and XenCenter report 7.6.
Same thing happens in a clean installation of XS 7.6 too.
XS 7.5
***** megaraid_sas Version Info *****
version: 07.701.18.00-rc1
srcversion: 550B32DFFACE241631510C5
vermagic: 4.4.0+10 SMP mod_unload modversions
XS 7.6
***** megaraid_sas Version Info *****
version: 07.701.18.00-rc1
srcversion: 550B32DFFACE241631510C5
vermagic: 4.4.0+10 SMP mod_unload modversions
-
@r1 said in XCP-ng 7.5 - MegaRAID SAS 9240-8i hang/reboot issue.:
get the driver from link and
Try with this driver.
-
@r1 said in XCP-ng 7.5 - MegaRAID SAS 9240-8i hang/reboot issue.:
@mpyusko Please get the driver from link and
[root@xcp-ng-rjv ~]# yum install megaraid_sas-07.703.05.00-1.x86_64.rpm
[root@xcp-ng-rjv ~]# rmmod megaraid_sas
[root@xcp-ng-rjv ~]# modprobe megaraid_sas
I did what you requested....
[root@vincent Downloads]# yum install megaraid_sas-07.703.05.00-1.x86_64.rpm
Loaded plugins: fastestmirror
Cannot open: megaraid_sas-07.703.05.00-1.x86_64.rpm. Skipping.
Error: Nothing to do
[root@vincent Downloads]# rpm -Uhv megaraid_sas-07.703.05.00-1.x86_64.rpm
error: megaraid_sas-07.703.05.00-1.x86_64.rpm: not an rpm package (or package manifest):
[root@vincent Downloads]#
-
Can you post the
# ls -lh
and
# md5sum
output of it?
-
@r1 said in XCP-ng 7.5 - MegaRAID SAS 9240-8i hang/reboot issue.:
Can you post the
# ls -lh
and
# md5sum
output of it?
-rw-r--r-- 1 root root 40K Oct 5 13:28 megaraid_sas-07.703.05.00-1.x86_64.rpm
e1e232eab5d90308144bf3c47665cedd megaraid_sas-07.703.05.00-1.x86_64.rpm
-
You seem to have downloaded something wrong.
My output is
[root@xcp-ng-rjv ~]# wget "https://github.com/rushikeshjadhav/MegaRAID-SAS-07.703.05.00/raw/master/megaraid_sas-07.703.05.00-1.x86_64.rpm"
[root@xcp-ng-rjv ~]# ls -lh megaraid_sas-07.703.05.00-1.x86_64.rpm
-rw-r--r-- 1 root root 388K Oct 4 21:26 megaraid_sas-07.703.05.00-1.x86_64.rpm
[root@xcp-ng-rjv ~]# md5sum megaraid_sas-07.703.05.00-1.x86_64.rpm
ef3064607545e0d390445f9e82ab8930 megaraid_sas-07.703.05.00-1.x86_64.rpm
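A 40K file where 388K is expected usually means something other than the package was saved. One plausible cause is that a web page was downloaded instead of the raw file, though that is an assumption the thread does not confirm. A small sketch of two checks that could be run before yum install: the first looks for the RPM lead magic bytes (ed ab ee db), the second compares the MD5 digest against the value quoted above. Both helper names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking a downloaded RPM before installing it.

# Succeeds if the file begins with the RPM lead magic bytes ed ab ee db.
looks_like_rpm() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "edabeedb" ]
}

# Succeeds if the MD5 digest of file $1 equals the hex string in $2.
md5_matches() {
    actual=$(md5sum "$1" | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# Example use (not run here):
#   f=megaraid_sas-07.703.05.00-1.x86_64.rpm
#   looks_like_rpm "$f" && md5_matches "$f" ef3064607545e0d390445f9e82ab8930 \
#       && yum install "$f"
```

The magic-byte check is deliberately crude; it only catches the "this is not an RPM at all" case that rpm -Uhv reported, while the digest comparison catches a truncated or altered download.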
-
@mpyusko Did you happen to check this?
-
Just got to it again....
***** ahci Version Info *****
version: 3.0
srcversion: 35F0A9078B4BB938E54A1E7
vermagic: 4.4.0+10 SMP mod_unload modversions
***** megaraid_sas Version Info *****
version: 07.703.05.00
srcversion: 2A8AB66F9A16F0542FC2173
vermagic: 4.4.0+10 SMP mod_unload modversions
lspci -v output
[root@vincent nfs]# lspci -v -s 07:00.0
07:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)
        Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9240-8i
        Flags: bus master, fast devsel, latency 0, IRQ 40
        I/O ports at ec00 [size=256]
        Memory at df2bc000 (64-bit, non-prefetchable) [size=16K]
        Memory at df2c0000 (64-bit, non-prefetchable) [size=256K]
        Expansion ROM at df200000 [disabled] [size=256K]
        Capabilities: [50] Power Management version 3
        Capabilities: [68] Express Endpoint, MSI 00
        Capabilities: [d0] Vital Product Data
        Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [c0] MSI-X: Enable+ Count=15 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [138] Power Budgeting <?>
        Capabilities: [150] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: megaraid_sas
[root@vincent nfs]#
lspci -vv output
[root@vincent nfs]# lspci -vv -s 07:00.0
07:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)
        Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9240-8i
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 40
        Region 0: I/O ports at ec00 [size=256]
        Region 1: Memory at df2bc000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at df2c0000 (64-bit, non-prefetchable) [size=256K]
        Expansion ROM at df200000 [disabled] [size=256K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
And then same result. Ugh.
-
@mpyusko If I understood correctly,
lspci -vv -s 07:00.0
is crashing the host, even on megaraid_sas version 07.703.05.00? But the Kali Linux host does not crash on the same megaraid_sas version.
To resolve this, do you have console access to the host, or remote KVM?
I would suggest booting your host in "XCP-ng in Safe Mode". This menu comes up when the host starts to boot; instead of the default "XCP-ng", choose "XCP-ng in Safe Mode".
This will allow us to see the messages generated in kern.log or on screen about the crash, which should point right to the problem.
Meanwhile, if you have any stack trace logs in kern.log, please share those.
-
Yes, you are understanding correctly. I have root, console, iDRAC, KVM and physical access to the machine.
The SEL reports:
Normal 0.000202 Mon Oct 15 2018 03:24:03 An OEM diagnostic event has occurred.
Normal 0.000201 Mon Oct 15 2018 03:24:03 An OEM diagnostic event has occurred.
Normal 0.000200 Mon Oct 15 2018 03:24:03 An OEM diagnostic event has occurred.
Normal 0.000199 Mon Oct 15 2018 03:24:03 An OEM diagnostic event has occurred.
Non-Recoverable 0.000198 Mon Oct 15 2018 03:24:03 CPU 1 machine check detected.
Normal 0.000197 Mon Oct 15 2018 03:24:00 An OEM diagnostic event has occurred.
Critical 0.000196 Mon Oct 15 2018 03:24:00 A bus fatal error was detected on a component at bus 0 device 9 function 0.
Critical 0.000195 Mon Oct 15 2018 03:23:59 A bus fatal error was detected on a component at slot 3.
Please note, I have tried changing slots; the same issue occurs, and the SEL reports accordingly. The kern.log does not have any applicable output, neither in "normal" mode nor in "safe mode". The same applies to dmesg. I'm running tail -f on both. If there is any output, it's not being logged or displayed.
Under "safe mode" the output is:
# lspci -vv -s 07:00.0
07:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)
        Subsystem: LSI Logic / Symbios Logic MegaRAID SAS 9240-8i
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin A routed to IRQ 40
        Region 0: I/O ports at ec00 [size=256]
        Region 1: Memory at df2bc000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at df2c0000 (64-bit, non-prefetchable) [size=256K]
        Expansion ROM at df200000 [disabled] [size=256K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [68] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
I really don't feel I should be having this issue since it is all mainstream, enterprise hardware. The only thing "odd" about this server is that I pulled out the PERC 6/i controller and installed a brand new LSI controller because my drives exceed the 2TB limit of the PERC.
Even when idle in Maintenance Mode, it will still randomly reboot with the same SEL output. This makes it too unstable to run for production, or even a dev environment. It could be minutes, hours, or days between random reboots, probably due to the kernel accessing the controller for some health check or something. In Maintenance Mode there are no VMs running, just XCP-ng, and that's it.
The system is on a conditioned power source with battery backup, so I am ruling out dips and spikes. The iDRAC also reports on power quality, usage, and health. Everything is good. As I said before, this does not happen under Kali. I probably have other boot flashes for other OSes and distros I can try. But the fact is, if it was hardware related, then it would never be stable.
-
We'll have a more recent kernel to test, thanks to @r1's work. This could be interesting to test I suppose.
-
Debian Stretch reports:
# modinfo megaraid_sas | grep version
version: 06.811.02.00-rc1
srcversion: 64B34706678212A7A9CC1B1
vermagic: 4.9.0-7-amd64 SMP mod_unload modversions
and it completed successfully.
Unfortunately, the NIC drivers are not configured for this boot flash, so I can't copy and paste the console output.
-
@mpyusko I'm surprised that the host is "rebooting" even in "safe mode"; that's not expected behavior.
Step 1:
I know of only one situation, in the earlier Xen days (3.x), where BIOS CPU C-states were causing the CPU to black out, resulting in an undetectable crash. To rule this out, please set your server to "performance mode" in the BIOS so that it does not try to enter power-save mode randomly.
Step 2:
Please update the Xen line in your grub config to
multiboot2 /boot/xen.gz noreboot no-mce ....
Step 3:
A newer version of the driver, 07.707.00.00, was released on 12 Sept. I'll make that available for you so we can isolate the "driver" as the reason for the crash.
Step 4:
I have a test kernel which may fix this issue, or at least help us isolate the "kernel" as the reason for the crash.
[root@xcp-ng-kernel ~]# modinfo /usr/lib/modules/4.9.133/kernel/drivers/scsi/megaraid/megaraid_sas.ko | grep -i ver
filename: /usr/lib/modules/4.9.133/kernel/drivers/scsi/megaraid/megaraid_sas.ko
description: Avago MegaRAID SAS Driver
version: 06.811.02.00-rc1
srcversion: E452D341082401C48444BC7
vermagic: 4.9.133 SMP mod_unload modversions
[root@xcp-ng-kernel ~]# uname -a
Linux xcp-ng-kernel 4.9.133 #1 SMP Sun Oct 14 15:48:31 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
[root@xcp-ng-kernel ~]#
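For the grub change in the second step, here is a rough sketch of one way to add the two options without hand-editing. It assumes GNU sed and that the Xen line looks like "multiboot2 /boot/xen.gz ..."; the helper name add_xen_opts is hypothetical, and the config path varies (commonly /boot/grub/grub.cfg on BIOS installs, which is an assumption to verify on your host).

```shell
#!/bin/sh
# Hypothetical helper: insert "noreboot no-mce" right after /boot/xen.gz on
# every "multiboot2 /boot/xen.gz" line of the given grub config.
# Lines that already contain "noreboot" are left alone, so re-running is safe.
add_xen_opts() {
    cfg=$1
    # Keep the first backup only, so a second run does not overwrite it.
    [ -e "$cfg.bak" ] || cp -p "$cfg" "$cfg.bak"
    sed -i '/multiboot2[[:space:]]*\/boot\/xen\.gz/ {
        /noreboot/!s|/boot/xen\.gz|& noreboot no-mce|
    }' "$cfg"
}

# Example use (not run here; the path is an assumption):
#   add_xen_opts /boot/grub/grub.cfg
#   grep multiboot2 /boot/grub/grub.cfg
```

This touches every menu entry that boots Xen, including the safe-mode entry. To revert, copy the .bak file back over the config.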
-
@olivierlambert It will be. Kali with 4.15 and Debian with 4.9 both do not exhibit the issue. However Xenserver and XCP-ng both do. I'd be interested to compare their compiler settings as to what they do and do not include.
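One way to do that comparison, assuming each system ships its kernel build configuration as a plain-text file (often /boot/config-<version>, which is an assumption to check per distro), is to diff the CONFIG_ lines. A minimal sketch; the helper name kconfig_diff is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list CONFIG_ options that differ between two kernel
# config files. Lines prefixed "<" appear only in the first file,
# lines prefixed ">" only in the second. Comment lines are ignored.
kconfig_diff() {
    a=$(mktemp)
    b=$(mktemp)
    grep '^CONFIG_' "$1" | sort > "$a"
    grep '^CONFIG_' "$2" | sort > "$b"
    comm -23 "$a" "$b" | sed 's/^/< /'
    comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
}

# Example use (not run here; file names are assumptions):
#   kconfig_diff config-4.4.0+10 config-4.9.0-7-amd64 | grep -E 'PCIE|MEGARAID'
```

Filtering on PCIe- and MegaRAID-related options keeps the output manageable, since two kernels this far apart will differ in a very large number of unrelated settings.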