Health alerts/alarms

olivierlambert

Well, part of the job is on you guys first: give an explicit list of the things you want to be alerted about, because otherwise it's too vague and we'd be developing something in the dark.

simonebertucelli

@olivierlambert

Good morning Olivier,
I think it is important to understand why we need email alerts: the small companies we support currently run VMware ESXi and are monitored with third-party tools, which the fantastic XOA could replace.
But there we do not have any software such as Nagios, PRTG or anything else, so the only method that works right now is email alerts.
Since we intend to migrate all VMware environments to XCP-ng + XOA, it is useful to keep track of the various installations.
One example is to alert on everything in XOA that is not green, or that turns from green to yellow or red: the status of the licence when it is approaching expiry or has expired;
the status of VMs when they are blocked or are overusing available resources (CPU, RAM, etc.); the status of network connections; and the status of remote storage (SMB/iSCSI/NFS).
In short, anything that should be brought to an IT manager's attention by email.
I will be happy to test the new features in my lab.
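
The "turns from green to yellow or red" idea can be sketched as a comparison between two status snapshots. This is only a minimal illustration in Python: the `Status` levels, the item names and the `find_degradations` helper are hypothetical and are not part of XOA or XCP-ng.

```python
from enum import IntEnum


class Status(IntEnum):
    """Health levels ordered by severity (hypothetical names)."""
    GREEN = 0
    YELLOW = 1
    RED = 2


def find_degradations(previous, current):
    """Return (item, old, new) for every item whose status got worse.

    Items missing from `previous` are treated as GREEN, so an item that
    first appears as YELLOW or RED is reported as well.
    """
    alerts = []
    for item, new in sorted(current.items()):
        old = previous.get(item, Status.GREEN)
        if new > old:
            alerts.append((item, old, new))
    return alerts


# Example snapshots (made-up item names).
previous = {"license": Status.GREEN, "sr-nfs": Status.GREEN, "vm-web": Status.YELLOW}
current = {"license": Status.YELLOW, "sr-nfs": Status.GREEN, "vm-web": Status.RED}

for item, old, new in find_degradations(previous, current):
    print(f"ALERT {item}: {old.name} -> {new.name}")
```

Only worsening transitions are reported here; whether a recovery (red back to green) should also send an email is a separate design choice.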

billcouper

@olivierlambert There already seem to be some alerts configured inside the XCP-ng servers? Once I configure email alerts using XCP-ng Center (just setting the SMTP server and email addresses), I do get notifications about items like multipath and bond status (e.g. path lost, NIC disconnected, bond status changed, etc.). I had thought that initially it would be good if we could simply enable the built-in alerts using XOA, the same way that XCP-ng Center enables them.

If you were going to develop a standalone alarm/alert system in XOA, that's a giant kettle of fish.

These are just some of the things I'd like to see alerts for (and be able to trigger some type of action on, e.g. 'send email' or an API call).

NIC down (that was previously up)
Bond link count changed
Host unresponsive
Host patches required
Host restart required
Host memory usage
Pool memory usage
SR path lost (one of multiple paths went down)
SR all paths down
SR not connected to all pool hosts
SR capacity (with configurable thresholds as % or GB)
SR latency
Remote storage connectivity lost
Remote storage capacity (% or GB)
VM CPU co-stop (or Xen equivalent)
VM Management Agent detected
VM Snapshot count
VDI remains attached to control domain after backup
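
As an illustration of the "configurable thresholds as % or GB" item in the list above, here is a minimal Python sketch. The `CapacityThreshold` class is hypothetical, it alerts on free space (alerting on used space would work the same way), and treating "GB" as 2**30 bytes is an assumption.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # assumption: "GB" is interpreted as 2**30 bytes


@dataclass(frozen=True)
class CapacityThreshold:
    """Alert when free space drops below `value`, given in "%" or "GB"."""
    value: float
    unit: str  # "%" or "GB"

    def is_breached(self, size_bytes: int, used_bytes: int) -> bool:
        if size_bytes <= 0:
            raise ValueError("size_bytes must be positive")
        free_bytes = size_bytes - used_bytes
        if self.unit == "%":
            return free_bytes / size_bytes * 100 < self.value
        if self.unit == "GB":
            return free_bytes / GIB < self.value
        raise ValueError(f"unknown unit: {self.unit!r}")


# Example: a 1000 GiB SR with 950 GiB used, so 50 GiB (5%) is free.
size, used = 1000 * GIB, 950 * GIB
print(CapacityThreshold(10, "%").is_breached(size, used))   # less than 10% free
print(CapacityThreshold(20, "GB").is_breached(size, used))  # less than 20 GB free
```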

Being able to apply the monitoring granularly and hierarchically is very important. For example, I might set a global latency alarm that applies to all SRs. But maybe one of my pools is 'special', so I apply a stricter latency alarm at that pool level (the rest of the pools can use the global threshold). Then inside that pool, one SR in particular might be unique and need its own latency alarm configured at the SR level.
So the inheritance of the alarms for SRs should be Global > Pool > SR.

Thinking about this kind of granular requirement with hierarchical inheritance, it seems like it could be possible in XO6, thanks to the new tree view, which could potentially allow setting alert policies at each level.
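
The Global > Pool > SR inheritance described above can be sketched as a "most specific setting wins" lookup. This is a rough Python illustration only: `resolve_threshold`, the pool and SR names, and the millisecond values are all made up and do not reflect how XO6 stores anything.

```python
def resolve_threshold(
    global_default: float,
    pool_overrides: dict[str, float],
    sr_overrides: dict[str, float],
    pool: str,
    sr: str,
) -> float:
    """Return the effective threshold: SR override, else pool, else global."""
    if sr in sr_overrides:
        return sr_overrides[sr]
    if pool in pool_overrides:
        return pool_overrides[pool]
    return global_default


# Hypothetical latency thresholds in milliseconds.
GLOBAL_LATENCY_MS = 20.0
POOL_LATENCY_MS = {"special-pool": 10.0}
SR_LATENCY_MS = {"unique-sr": 5.0}

for pool, sr in [
    ("normal-pool", "sr-a"),       # falls back to the global threshold
    ("special-pool", "sr-b"),      # uses the pool-level threshold
    ("special-pool", "unique-sr"), # uses the SR-level threshold
]:
    limit = resolve_threshold(GLOBAL_LATENCY_MS, POOL_LATENCY_MS, SR_LATENCY_MS, pool, sr)
    print(f"{pool}/{sr}: alert above {limit} ms")
```

Only the exceptions need to be configured: every pool or SR without an override silently follows the level above it.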

olivierlambert

Adding @marcungeschikts to the loop so we can discuss this later, see how we could tackle it, and check whether we have the resources to do it.

If you have pro support and business cases for it, it might accelerate the priority. We will also try to see how much effort is needed to give you a rough idea of the cost.

splastunov

Maybe netdata will cover everything?

There are no default alerts, but you can easily create them yourself.

Also, it is very easy to deploy a "parent" netdata node and stream metrics to it from all hosts (maybe this part could be integrated into the free version of XOA?).
You do not need a netdata cloud account for this solution.

billcouper

@olivierlambert This post was created more than six months ago to find out about exposing what I thought was existing functionality in the XCP-ng hosts. The XCP-ng Center Windows app allows you to configure SMTP settings for alerts, and it fires off emails about things like multipath count etc. I assumed the alert mechanism is already built into XCP-ng and 'Center' is just configuring it. If that is true, then it would be nice to expose this existing functionality via Xen Orchestra, so we no longer feel the need to keep XCP-ng Center hanging around.

@olivierlambert Please don't go and develop a completely new monitoring solution based on my request. However, IF you were to undertake such a mammoth task, then I am more than happy to put forth my wish list of what I would like to see alerts for (almost none of which are performance related).

@splastunov I looked at netdata but it's primarily performance metrics. I'm not too interested in performance metrics at the host level (other than memory consumption). Plus, I don't see a way to create triggers/thresholds and fire alerts off.

With regard to more comprehensive monitoring that can fire off alerts, I have worked out my own methods. At the moment I am happy using Zabbix to monitor Xen Orchestra, with agents on the XCP-ng hosts for additional monitoring. I am happy to provide info about the Zabbix setup if anyone wants it.

simonebertucelli

@billcouper
Forgive me, but I think that a native alert system is better than just an external system via SNMP. There are definitely services or states that are not handled by the SNMP protocol. In any case, XOA is a great product, but without an email alert system it is like great food without salt.

billcouper

@simonebertucelli SNMP? No. Xen Orchestra via API. Zabbix Agent on hosts for additional detail.

Edit: It works just like monitoring a vCenter Server via API - all of the hosts and VMs appear in Zabbix. I am using this Zabbix template https://github.com/bufanda/zabbix--template-xenorchestra/tree/main

KPS

@billcouper
This is mostly the list of "Veeam ONE" alerts. This is a great start.
Additional things are:

High total CPU for VM for x mins
Unusual backup size/duration
VM-snapshot age above limit
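
Two of these checks can be sketched in a few lines of Python: "high CPU for x mins" as a sustained-threshold test, and "snapshot age above limit" as a timestamp comparison. Both helpers are hypothetical, and the one-sample-per-minute layout of the CPU series is an assumption.

```python
from datetime import datetime, timedelta, timezone


def sustained_above(samples, threshold, minutes):
    """True if the last `minutes` samples (one per minute) all exceed threshold."""
    if minutes <= 0 or len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])


def snapshot_too_old(created_at, max_age, now):
    """True if the snapshot is older than `max_age` at time `now`."""
    return now - created_at > max_age


# CPU usage in percent, oldest first, one sample per minute (made-up data).
cpu = [40, 95, 96, 97, 98, 99]
print(sustained_above(cpu, threshold=90, minutes=5))  # last 5 samples are all above 90
print(sustained_above(cpu, threshold=90, minutes=6))  # the 40 breaks the 6-minute window

now = datetime(2024, 1, 10, tzinfo=timezone.utc)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(snapshot_too_old(created, timedelta(days=7), now))  # 9 days old, limit is 7
```

Requiring every sample in the window to exceed the threshold avoids alerting on short spikes; an average over the window would be a looser alternative.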

jshiells

@billcouper I 100% agree with you.

I will add to this that it would be amazing if RX and TX NIC/SFP errors could be made visible in the interface as well. The reason I mention this is that we recently had a problem on one of our hosts' storage links where, due to a dirty fiber, we had some problems (VMs going read-only, crashes, very poor IO). Due to how dom0 sits as a VM, we could not get stats off a NIC using ethtool or SNMP. It took WAY too much troubleshooting to figure out that our issues were caused by a dirty fiber on the server side.

The more native/local self-monitoring and alerting XCP-ng and XOA can do, the better.