Desperately searching for a solution to xe command timeouts and xcp-ng crashes
- 
 Hi all, 
 For a few weeks now I have been seeing weird and faulty behavior on our xcp-ng cluster, which runs halizard and iscsi-ha with DRBD mirroring of the storage partition. This setup worked without any changes or problems for two years; now it is crashing repeatedly.
 After the cluster has started successfully and DRBD has finished updating the clustered partition, everything works fine for a few minutes or half an hour (xe host-list and xe vm-list respond quickly and normally, as they should). Then, after about 15 minutes, every request to xe host-list, xe vm-list, or any other xe command takes an incredibly long time (1 minute or longer) before a result is delivered. If this goes on even longer, xe becomes totally unresponsive. DNS resolution is working flawlessly.
 If I switch off the second cluster server, everything runs from one server and there are no timeouts so far. So this is somehow related to the pool communication. I have no idea what could be wrong. Naturally, this behaviour kills any XCP-ng Center or Xen Orchestra connection, and it is not possible to work with the cluster. The virtual machines are working normally. Today I did a cold start of the cluster, but no joy. Does anybody have an idea how to debug or fix this? Kind regards
 Christoph
- 
 Hi, I would start by reading the XAPI log, xensource.log, which is where everything XAPI does is recorded. See https://docs.xcp-ng.org/Troubleshooting for more details.
- 
 @olivierlambert Thank you, I was already doing this. I do not really understand what is going on...
 The server is on a private network, not accessible from the bad internet, so I do not understand why there is a session authentication failure. This happens every 2 seconds:
 # tail -f /var/log/xensource.log
 Jan 10 13:04:23 ahbxen1 xapi: [debug||963 |org.xen.xapi.xenops.classic events D:ae7d342ca10e|xenops] Processing event: ["Vm","5d51c38b-5260-2903-ae21-4bbe607fb99c"]
 Jan 10 13:04:23 ahbxen1 xapi: [debug||963 |org.xen.xapi.xenops.classic events D:ae7d342ca10e|xenops] xenops event on VM 5d51c38b-5260-2903-ae21-4bbe607fb99c
 Jan 10 13:04:23 ahbxen1 xenopsd-xc: [debug||72617 |org.xen.xapi.xenops.classic events D:ae7d342ca10e|xenops_server] VM.stat 5d51c38b-5260-2903-ae21-4bbe607fb99c
 Jan 10 13:04:23 ahbxen1 xapi: [debug||963 |org.xen.xapi.xenops.classic events D:ae7d342ca10e|xenops] xenopsd event: processing event for VM 5d51c38b-5260-2903-ae21-4bbe607fb99c
 Jan 10 13:04:23 ahbxen1 xapi: [debug||963 |org.xen.xapi.xenops.classic events D:ae7d342ca10e|xenops] Supressing VM.allowed_operations update because guest_agent data is largely the same
 Jan 10 13:04:23 ahbxen1 xapi: [debug||963 |org.xen.xapi.xenops.classic events D:ae7d342ca10e|xenops] xenopsd event: Updating VM 5d51c38b-5260-2903-ae21-4bbe607fb99c domid 14 guest_agent
 Jan 10 13:04:27 ahbxen1 xapi: [debug||5755278 INET :::80||dummytaskhelper] task dispatch:session.logout D:cfcb0866d8ec created by task D:b4fe67a1bb65
 Jan 10 13:04:30 ahbxen1 xapi: [debug||5755279 UNIX /var/lib/xcp/xapi||cli] xe pool-param-get uuid=Stopping ha-lizard= (via= systemctl):= [= OK= ]= param-name=other-config param-key=XenCenter.CustomFields.ha-lizard-enabled username=root password=(omitted)
 Jan 10 13:04:30 ahbxen1 xapi: [ info||5755279 UNIX /var/lib/xcp/xapi|session.login_with_password D:655f9bce2d1c|xapi] Session.create trackid=24d93f0af567eed5f81fb70f2557487e pool=false uname=root originator=cli is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
 Jan 10 13:04:32 ahbxen1 xcp-rrdd: [ info||7 ||rrdd_main] memfree has changed to 5363956 in domain 6
 Jan 10 13:04:35 ahbxen1 xapi: [ info||5755280 INET :::80|session.login_with_password D:2d12df15927e|xapi] Failed to locally authenticate user root from HTTP request from Internet with User-Agent: xmlrpclib.py/1.0.1 (by www.pythonware.com): Authentication failure
 Jan 10 13:04:35 ahbxen1 xapi: [debug||5755282 UNIX /var/lib/xcp/xapi||cli] xe host-list name-label=ahbxen1 minimal=true username=root password=(omitted)
 Jan 10 13:04:35 ahbxen1 xapi: [ info||5755282 UNIX /var/lib/xcp/xapi|session.login_with_password D:87466e0161dd|xapi] Session.create trackid=249a86916b75d708a2c52adb1f011eed pool=false uname=root originator=cli is_local_superuser=true auth_user_sid= parent=trackid=9834f5af41c964e225f24279aefe4e49
 Jan 10 13:04:40 ahbxen1 xapi: [debug||1038 scanning_thread|SR scanner D:f2340ef7fc82|xapi_sr] Automatically scanning SRs = [ OpaqueRef:cf346a4f-4981-412d-a057-9b386d8bd2d6 ]
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] session.login_with_password D:2d12df15927e failed with exception Server_error(SESSION_AUTHENTICATION_FAILED, [ root; Authentication failure ])
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] Raised Server_error(SESSION_AUTHENTICATION_FAILED, [ root; Authentication failure ])
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 1/8 xapi Raised at file ocaml/xapi/xapi_session.ml, line 405
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 2/8 xapi Called from file ocaml/xapi/xapi_session.ml, line 40
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 3/8 xapi Called from file ocaml/xapi/xapi_session.ml, line 40
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 4/8 xapi Called from file ocaml/xapi/server_helpers.ml, line 83
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 5/8 xapi Called from file ocaml/xapi/server_helpers.ml, line 99
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 6/8 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 24
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 7/8 xapi Called from file lib/xapi-stdext-pervasives/pervasiveext.ml, line 35
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace] 8/8 xapi Called from file lib/backtrace.ml, line 177
 Jan 10 13:04:40 ahbxen1 xapi: [error||5755280 INET :::80||backtrace]
 Jan 10 13:04:42 ahbxen1 xapi: [ info||5755283 UNIX /var/lib/xcp/xapi||cli] xe message-create name=HA-Lizard - xe_wrapper priority=1 body=xe_wrapper: COMMAND: xe pool-param-get pool-uuid=c1dbc848-aa29-1603-2af7-078466842ac2 username=root password=(omitted)
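 With a log this noisy, it can help to tally which xe sub-commands are hitting XAPI and how often the authentication failure is raised, rather than reading the tail by eye. Below is a rough Python sketch of that idea. The function name summarize and the two patterns are my own assumptions, derived only from the line format in the excerpt above; this is not an XCP-ng or HA-Lizard tool.

```python
import re
from collections import Counter

# Assumed pattern for xapi "cli" entries as they appear in the excerpt,
# e.g. "...||cli] xe host-list name-label=ahbxen1 ..."
CLI_RE = re.compile(r"\|\|cli\] xe (\S+)")

# Each failed login produces one "Raised Server_error(...)" backtrace line.
AUTH_FAIL_MARKER = "Raised Server_error(SESSION_AUTHENTICATION_FAILED"


def summarize(lines):
    """Count xe sub-commands and authentication failures in log lines."""
    commands = Counter()
    auth_failures = 0
    for line in lines:
        match = CLI_RE.search(line)
        if match:
            commands[match.group(1)] += 1
        if AUTH_FAIL_MARKER in line:
            auth_failures += 1
    return commands, auth_failures
```

 It could be fed an open file handle for /var/log/xensource.log; a sub-command whose count keeps climbing between runs would be a candidate for whatever is flooding XAPI.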
- 
 - Do you have the same root password on both machines?
 - Do you have enough disk space on both hosts?
 - Double-check that you don't have an IP conflict somewhere.
 
- 
 It seems to be a ha-lizard related problem. As I said, this cluster ran flawlessly for 2 years; these weird problems only appeared in the last 4 weeks. There seems to be a REST API request with a bad password, although it is possible to send those xe commands without a password... Kind regards
 Christoph
- 
 @olivierlambert 
 disk space - ok
 network - no conflicts
 root password - the same on both machines
 Kind regards
 Christoph
- 
 So go check with the HA-Lizard people, then.
- 
 @chcnetconsulting Are you running the latest version of HA-Lizard? They are very responsive to issues if you contact them. And, yes, see if you notice anything odd in the logs. Also make sure your hosts are properly time-synched with each other.
- 
 @tjkreidl Hi, it is 2.2.3-1, the latest version. All of a sudden, it seems that the cluster is working again... no timeouts anymore. Totally weird. BUT XCP-ng Center takes a long time to synchronize the hosts.
 Before the timeouts ended, I had fixed a bug in /etc/ha-lizard/ha-lizard.func at line 645, where they were calling a pool-param-get that never worked. Apparently this was running much too fast and spawning so many processes that the REST API was dead. The fix got rid of a million error notifications in the error log. After restarting ha-lizard (service ha-lizard restart), everything returned to normal.
 Although the timeouts are history, the wrong query which creates the backtrace (authentication error) is not fixed yet. This is probably an issue the guys at ha-lizard know how to fix. Thank you for helping me out!
 kind regards
 Christoph
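 One detail in the log excerpt earlier in this thread is that the failing call was logged as xe pool-param-get uuid=Stopping ha-lizard= (via= systemctl):= ..., which suggests that something other than a pool UUID ended up in the uuid argument. I do not know what line 645 of ha-lizard.func actually looks like, so the following is not HA-Lizard's code and not the fix described above. It is only a Python sketch of the general guard: validate the UUID before building the xe call. The helper names looks_like_uuid and pool_param_get_args are hypothetical; the xe arguments (uuid, param-name, param-key) are the ones visible in the log.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal form, as used by the pool UUID in the log.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def looks_like_uuid(value):
    """Return True only if the whole string is a canonical UUID."""
    return UUID_RE.fullmatch(value.strip().lower()) is not None


def pool_param_get_args(pool_uuid, param_name, param_key):
    """Build the argument list for `xe pool-param-get`, refusing bad UUIDs."""
    if not looks_like_uuid(pool_uuid):
        raise ValueError(f"refusing to call xe: {pool_uuid!r} is not a UUID")
    return [
        "xe",
        "pool-param-get",
        f"uuid={pool_uuid.strip()}",
        f"param-name={param_name}",
        f"param-key={param_key}",
    ]
```

 The returned list could be passed to subprocess.run. Failing early on a bad value means a broken variable produces one clear error instead of a stream of xe calls that can never succeed.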
- 
 @chcnetconsulting Glad to hear and nice debugging work! Yes, the HA-Lizard folks are very responsive and I'm sure will have this taken care of in the next release. 
 I published an article originally on xenserver.org back in 2016 on tests and improvements to HA-Lizard I did in cooperation with the company, but alas, the chart with all the findings didn't translate properly when taken over by this site. I may have the original squirreled away somewhere.
 https://xenserver.pl/citrix-xenserver/xenserver-high-availability-alternative-ha-lizard-2/9122
- 
 Just to mention here that your problem(s) have been addressed in the most recent version of HA-Lizard (2.3.1). Simply upgrade and you should be happy again.

