XCP-ng

    the_jest

    @the_jest


    Latest posts made by the_jest

    • RE: Can't restart stopped VMs; unclear error message

      I have not seen warnings to reboot after previous patchings (nor this one, that I recall).

      In any case, I did reboot, and everything is fine, and I'm sorry to have wasted time and attention on this! Thank you.
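For anyone hitting the same symptom after patching: a reboot is generally needed when an update replaces the kernel that is currently running. The sketch below is one rough way to check for that situation. It assumes the usual Linux layout where modules live under /lib/modules/<release>, and the helper name is made up for illustration. Whether this was the actual mechanism here is a guess, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name, not part of XCP-ng).
# Reports whether a modules directory exists for a given kernel release.
#   $1 = kernel release (default: the running kernel, from uname -r)
#   $2 = modules root   (default: /lib/modules; an assumption about the layout)
kernel_modules_present() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    if [ -d "$root/$release" ]; then
        echo "modules found for $release"
    else
        echo "no modules for $release"
    fi
}

kernel_modules_present
```

If the running release has no modules directory, the running kernel cannot load modules it has not already loaded, and rebooting into the installed kernel would be the natural next step.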

      posted in XCP-ng
      the_jest
    • RE: Can't restart stopped VMs; unclear error message

      @Danp
      Grr. OK I'll reboot later today (after some other tasks are finished) and report back.

      I am worried about nuking the currently-working VMs. It would obviously be good to have another host for exactly such purposes. Thank you.

    • RE: Can't restart stopped VMs; unclear error message

      @Danp
      Yes, I did restart the toolstack; sorry for not saying this. It successfully restarted, but did not resolve the problem. Meanwhile the existing VMs do continue to work.

      I also updated all the patches (which I do semi-regularly as well).

      I have tried to look at the log files, but there's nothing obvious there that makes sense to me. I'm happy to share this here or elsewhere as appropriate; also happy to try anything else.

      Thank you for the attention.

    • Can't restart stopped VMs; unclear error message

      I have an XCP-ng 8.3.0 installation, on just a single host, that has been extremely stable for some time, usually running 5–6 VMs. I needed to restart one VM to update some packages; it shut down cleanly but did not restart. When I tried to start it manually, I got:

      [19:38 xcpng2 ~]# xe vm-start uuid=[FOO]
      The server failed to handle your request, due to an internal error. The given message may give details useful for debugging the problem.
      message: xenopsd internal error: Unix.Unix_error(Unix.ENOENT, "open", "/dev/net/tun")
      

      Subsequently, it seems that another VM stopped on its own, and trying to restart that gives the same message. I tried to create a new VM out of curiosity, and that also won't start, with the same message.

      Meanwhile there are three existing VMs that are running just fine; two of them are very active and are handling a lot of load and network traffic without a problem. I'd prefer not to restart the host, since I don't want to interrupt these VMs merely in the hope that a power-cycle will help.

      How can I evaluate this problem? Searching for this error message hasn't led me to anything useful.
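One way to start evaluating an error like this is to check the two things the message points at: whether /dev/net/tun exists, and whether the tun kernel module is loaded. The following is a minimal sketch, assuming a standard Linux dom0 where loaded modules are listed in /proc/modules. The function names are invented for illustration and are not XCP-ng tooling.

```shell
#!/bin/sh
# Hypothetical helpers (illustrative names, not part of XCP-ng).

# Report whether the TUN/TAP control device exists as a character device.
#   $1 = device path (default: /dev/net/tun, the path from the error message)
check_tun_device() {
    dev="${1:-/dev/net/tun}"
    if [ -c "$dev" ]; then
        echo "present: $dev"
    else
        echo "missing: $dev"
    fi
}

# Report whether "tun" appears in a /proc/modules-style listing.
# Note: a driver built into the kernel would not be listed there.
#   $1 = listing file (default: /proc/modules)
check_tun_module() {
    modules="${1:-/proc/modules}"
    if grep -q '^tun ' "$modules" 2>/dev/null; then
        echo "tun module: loaded"
    else
        echo "tun module: not loaded"
    fi
}

check_tun_device
check_tun_module
```

If the device is missing and the module is not loaded, running `modprobe tun` as root is the usual way to try loading it; any error it prints is a further clue.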

    • RE: Diagnosing frequent crashes on host

      @olivierlambert

      @olivierlambert said in Diagnosing frequent crashes on host:

      Maybe the usage is slightly different from when it was "more solid", and the problem is now triggered more easily. Is your XCP-ng fully up to date?

      No; as said originally, I'm still on 8.2.1. I have been concerned about moving to 8.3 because it's a new installation, and I don't want to screw it up, but I'm willing to accept that it's the right thing to do.

    • RE: Diagnosing frequent crashes on host

      @andyhhp

      @andyhhp said in Diagnosing frequent crashes on host:

      Ok, so it's a logical bug in Linux. Have you updated the dom0 kernel recently? Can you revert back to the older build and see if that changes the behaviour?

      I haven't updated the kernel, or the installation in general, since I first installed it, almost a year ago. It's never been rock-solid, but in recent weeks it has become much worse, even though there has been no system change in that time.

      If I were going to make any kind of change, I assume the sensible thing to do would be to upgrade to 8.3?

      I've been getting the impression that people have a low opinion of this particular system, esp. with regard to thermal management. I'm not sure what to do about this; I don't really need this exact hardware but I don't know what the correct replacement should be, if I want to just ditch it.

    • RE: Diagnosing frequent crashes on host

      @andyhhp

      @andyhhp said in Diagnosing frequent crashes on host:

      @the_jest said in Diagnosing frequent crashes on host:

      but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)

      Shot down is correct. It is the past tense of "Shoot down", because the companion message you get when something went wrong is "Failed to shoot down $CPUS", and is the single most valuable print message I've ever inserted into the code.

      My apologies!

      @andyhhp said in Diagnosing frequent crashes on host:

      The snippet of xen.log you've posted suggests it's a linux kernel crash, so look at dom0.log, and right at the end.

      The last consecutive block of messages (timewise, i.e. everything from the same millisecond through the end of the log) is:

      [  19701.650235]   WARN: Call Trace:
      [  19701.650238]   WARN:  <IRQ>
      [  19701.650241]   WARN:  xen_evtchn_do_upcall+0x27/0x50
      [  19701.650245]   WARN:  xen_do_hypervisor_callback+0x29/0x40
      [  19701.650248]   WARN:  </IRQ>
      [  19701.650250]   WARN: RIP: e030:xen_hypercall_sched_op+0xa/0x20
      [  19701.650253]   WARN: Code: 51 41 53 b8 1c 00 00 00 0f 05 41 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 51 41 53 b8 1d 00 00 00 0f 05 <41> 5b 59 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc
      [  19701.650259]   WARN: RSP: e02b:ffffc900400e7eb0 EFLAGS: 00000246
      [  19701.650261]   WARN: RAX: 0000000000000000 RBX: ffff8881db639d00 RCX: ffffffff810013aa
      [  19701.650264]   WARN: RDX: ffffffff8203d250 RSI: 0000000000000000 RDI: 0000000000000001
      [  19701.650267]   WARN: RBP: 0000000000000004 R08: 0000000000000008 R09: 000011eef4cd8cc2
      [  19701.650269]   WARN: R10: 0000000000007ff0 R11: 0000000000000246 R12: 0000000000000000
      [  19701.650272]   WARN: R13: 0000000000000000 R14: ffff8881db639d00 R15: ffff8881db639d00
      [  19701.650276]   WARN:  ? xen_hypercall_sched_op+0xa/0x20
      [  19701.650279]   WARN:  ? xen_safe_halt+0xc/0x20
      [  19701.650282]   WARN:  ? default_idle+0x1a/0x140
      [  19701.650284]   WARN:  ? do_idle+0x1ea/0x260
      [  19701.650287]   WARN:  ? cpu_startup_entry+0x6f/0x80
      [  19701.650289]   WARN: Modules linked in: tun nfsv3 nfs_acl nfs lockd grace fscache bnx2fc(O) cnic(O) uio fcoe libfcoe libfc scsi_transport_fc openvswitch nsh nf_nat_ipv6 nf_nat_ipv4 nf_conncount nf_nat 8021q garp mrp stp llc ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter dm_multipath sunrpc nls_iso8859_1 nls_cp437 intel_powerclamp crct10dif_pclmul vfat crc32_pclmul ghash_clmulni_intel fat pcbc dm_mod aesni_intel aes_x86_64 crypto_simd cryptd glue_helper video backlight ip_tables x_tables hid_generic usbhid hid xhci_pci nvme igc(O) xhci_hcd i40e(O) nvme_core scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh_alua scsi_mod efivarfs ipv6 crc_ccitt
      [  19701.650330]   WARN: ---[ end trace 79b40169d24b8e01 ]---
      [  19701.650333]   WARN: RIP: e030:__xen_evtchn_do_upcall+0x82/0x90
      [  19701.650335]   WARN: Code: 66 90 f6 c4 02 75 23 80 3b 00 75 d7 65 ff 05 85 89 ba 7e 48 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 09 48 83 c4 18 5b 5d c3 <0f> 0b e8 77 aa bf ff 0f 1f 80 00 00 00 00 0f 1f 44 00 00 e9 66 ff
      [  19701.650341]   WARN: RSP: e02b:ffff8881dc503fb8 EFLAGS: 00010002
      [  19701.650344]   WARN: RAX: 0000000000000000 RBX: ffff8881dc514100 RCX: 000000008518dd93
      [  19701.650346]   WARN: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881db6aa800
      [  19701.650349]   WARN: RBP: 0000000000000004 R08: 00000000000035c6 R09: ffff8881db003210
      [  19701.650352]   WARN: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [  19701.650355]   WARN: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
      [  19701.650361]   WARN: FS:  0000000000000000(0000) GS:ffff8881dc500000(0000) knlGS:0000000000000000
      [  19701.650364]   WARN: CS:  e033 DS: 002b ES: 002b CR0: 0000000080050033
      [  19701.650367]   WARN: CR2: 00007fffc3789c78 CR3: 00000001d81ca000 CR4: 0000000000040660
      [  19701.650371]  EMERG: Kernel panic - not syncing: Fatal exception in interrupt
      
      

      Thank you for looking this over.
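Pulling the tail of a crash log out by hand gets tedious when there are several crash folders to compare. The sketch below prints everything from the last "Call Trace:" line to the end of a file, which should roughly match the kind of block quoted above. The function name is made up, and it assumes the log marks traces with the literal text "Call Trace:" as in this excerpt.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name).
# Prints the portion of a log from the last "Call Trace:" line to the end.
# Prints nothing if the file contains no such line.
#   $1 = path to a log file, e.g. a dom0.log from a folder under /var/crash
last_call_trace() {
    awk '
        /Call Trace:/ { n = 0; found = 1 }   # restart the buffer at each new trace
        found         { buf[n++] = $0 }      # keep lines once a trace has been seen
        END           { for (i = 0; i < n; i++) print buf[i] }
    ' "$1"
}

# Usage (the crash folder name is a placeholder):
#   last_call_trace /var/crash/<dir>/dom0.log
```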

    • RE: Diagnosing frequent crashes on host

      @DustinB
      Sorry, I should have been more clear. The cabinet is not sealed, i.e. it's got a loose grating on the entire front, and in addition to that, it is actively ventilated, with internal fans in the back of the cabinet pulling air through. And the box itself is not covered up in any way.

      I do not have any additional coolers to use.

    • RE: Diagnosing frequent crashes on host

      @DustinB
      It's less than a year old. I don't know how to evaluate whether the ventilation is "adequate"; the existing ventilation is exposed (i.e. not covered up), and it's inside a cabinet that is itself ventilated, but it's not in a freezingly air-conditioned space.

      I have tried to install lm_sensors, but it doesn't detect any sensors, so I'm not sure how to tell if it's regularly overheating.
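When lm_sensors finds nothing, another place to look on Linux is the sysfs thermal interface, which reports each zone's temperature in millidegrees Celsius. This is only a sketch: the helper name is invented, and it is an open question whether dom0 under Xen exposes any thermal zones on this hardware, so empty output is a possible result.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name).
# Prints each thermal zone's temperature in whole degrees Celsius.
#   $1 = sysfs thermal directory (default: /sys/class/thermal)
print_zone_temps() {
    base="${1:-/sys/class/thermal}"
    for f in "$base"/thermal_zone*/temp; do
        [ -r "$f" ] || continue              # glob matched nothing, or file unreadable
        milli=$(cat "$f")                    # value is in millidegrees Celsius
        zone=$(basename "$(dirname "$f")")
        echo "$zone: $((milli / 1000)) C"
    done
}

print_zone_temps
```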

    • RE: Diagnosing frequent crashes on host

      @olivierlambert
      Thanks. I ran two full passes of memtest, and it's totally clean.

      I don't know what I'm looking for, but I looked through some random folders in /var/crash, and the last bit of the most recent xen.log was:

      (XEN) [19617.315490] CPU4: Temperature/speed normal
      (XEN) [19617.315492] CPU5: Temperature/speed normal
      (XEN) [19622.518692] CPU5: Temperature above threshold
      (XEN) [19622.518693] CPU4: Temperature above threshold
      (XEN) [19622.518694] CPU5: Running in modulated clock mode
      (XEN) [19622.518695] CPU4: Running in modulated clock mode
      (XEN) [19707.409611] Executing kexec image on cpu5
      (XEN) [19707.410615] Shot down all CPUs
      

      There's no reason why the CPU should be running hot; this was a largely unloaded system, but I figured I'd mention it. (Also, "Shot down" should be "Shut down".)
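To see whether the thermal messages show up before every crash or only this one, the same two message types can be pulled out of each saved xen.log. A minimal sketch, assuming the messages are worded exactly as in the excerpt above; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name).
# Lists thermal-throttling lines from a Xen log, matching the message
# wording seen in the excerpt above.
#   $1 = path to a xen.log file
thermal_events() {
    grep -E 'Temperature above threshold|Running in modulated clock mode' "$1"
}

# Usage (the crash folder name is a placeholder):
#   thermal_events /var/crash/<dir>/xen.log
```

Comparing the timestamps of these lines with the time of the kexec line in each log would show how closely the throttling precedes the crash.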
