techjeff

@HeMaN You're correct that no changes have been published yet.

We were under the impression that we had found an undocumented requirement, but I was reminded that giving each host the same certificate is not the best practice. xcp-ng should be able to handle each host having its own certificate as long as their respective certificate authorities are trusted.

In any case, I need to do some more testing to narrow down the exact cause of the issues that I was seeing. I have been working on a systemd service and timer with a few supporting scripts that automatically renew certificates by making ACME requests to my local, private CA, specifically adding support for additional SANs (previous iterations just used the system's FQDN).

Specifically, I want to test each server with SANs that correspond to each of its IP addresses and FQDNs, deploy them using xe host-server-certificate-install, then perform packet captures as needed to determine why the Xapi#getResource /rrd_updates (on xcp-ng-1) 0% task is getting stuck.
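To make the per-host SAN idea concrete, here is a minimal sketch of how a host's names could be split into DNS and IP SAN entries before requesting a certificate. The function name and the example hostname and address are hypothetical, not taken from my actual scripts.

```python
import ipaddress


def build_san_entries(names):
    """Split a host's names into DNS and IP SAN entries (hypothetical helper)."""
    dns, ips = [], []
    for name in names:
        try:
            # Anything that parses as an IPv4/IPv6 address becomes an IP SAN.
            ips.append(str(ipaddress.ip_address(name)))
        except ValueError:
            # Everything else is treated as a DNS name.
            dns.append(name.lower())
    return {"dns": sorted(set(dns)), "ip": sorted(set(ips))}


# Placeholder host data, for illustration only.
print(build_san_entries(["xcp-ng-1.example.lan", "192.0.2.11"]))
```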

So far, life has gotten a bit in the way, so I haven't dedicated the time to testing this, but I hope to get back to this soon.

techjeff

@florent the CR job was completed with health checks. The issue appears to be fixed in the fix_cr_healthcheck branch.

techjeff

@Andrew @Danp - I discovered my boneheaded oversight

I have been developing a one-shot systemd service with a timer for automatically renewing my short-lived XO certificate daily within a prescribed window of time using my local Step CA instance's ACME provisioner. It has been a fun project and I now have mostly automatic renewal of short-lived certificates using what is essentially a private letsencrypt server that uses the CA that I generated.

I just looked at the logs and realized that I completely overlooked the fact that step-renewer.service starts at midnight plus a random delay of between 0 and 5 minutes. This service gets a new certificate and then restarts xo-server.service, and that is what has been interrupting my backups.
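For anyone wanting to sanity-check their own schedules, this is a rough sketch of the overlap I missed: a job collides with the renewal if its run time intersects the window in which the timer may fire. The function name and the job times are illustrative assumptions; only the midnight start and the 5 minute maximum delay come from my setup.

```python
from datetime import datetime, timedelta


def overlaps_restart_window(job_start, job_duration, window_start, max_random_delay):
    """Return True if a job's run time intersects the possible restart window."""
    job_end = job_start + job_duration
    window_end = window_start + max_random_delay
    # Two closed intervals intersect when each one starts before the other ends.
    return job_start <= window_end and window_start <= job_end


midnight = datetime(2025, 3, 3, 0, 0)
# A hypothetical backup starting at 23:50 and running for 30 minutes.
print(overlaps_restart_window(datetime(2025, 3, 2, 23, 50), timedelta(minutes=30),
                              midnight, timedelta(minutes=5)))
```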

Facepalm

With mixed feelings, I can confidently report that this issue is entirely self-induced. Thanks for the assistance narrowing down the cause!

techjeff

@stormi @olivierlambert Please see my submitted PR and please provide feedback.

techjeff

@irtaza9 happy to help!

techjeff

@olivierlambert Thank you and your team again for your commitment to this fantastic FOSS tool and for allowing me to build it myself!

I very much appreciate the personal touch of my issues being triaged by the CEO and Co-Founder. It's refreshing to see an executive officer stay in touch with their customer base.

techjeff

@olivierlambert as pointed out by @psafont on my PR #216,

I believe there is no such technical requirement, when following a redirect the
new request should be done against a different IP/host and the TLS connection renegotiated with that host, meaning none of the hosts' certs should have identifying information from the other one.

I think I need to dive deeper and hope to find a log message related to the lingering Xapi#getResource /rrd_updates tasks.

techjeff

@olivierlambert @julien-f

As a final confirmation, since my last message, I generated one certificate for each of my 3 hosts. Each certificate only contained the DNS and IP SANs for that specific host. I then deployed each of the 3 certificates to their respective hosts using xe host-server-certificate-install without issue.

Like I mentioned in my previous post, I am not getting the self signed certificate in certificate chain error because I have properly configured xo-server to trust my CA cert.

However, I am now back to getting the endless Xapi#getResource /rrd_updates (on xcp-ng-1) 0% tasks every minute that last for ~24 hours (unless I run xe-toolstack-restart to clear the list.)

I then redeployed my "Pool Certificate" (contains all DNS/IP SANs for all hosts) to each host from the master, executed xe-toolstack-restart and now all is working without issue. I must admit, this is much easier to maintain than trying to maintain 1:1 cert:host.

In conclusion, as you mentioned in your previous reply @olivierlambert, it does appear that a person must sign 1 certificate per pool and that certificate must be configured with Subject Alternative Names for each DNS name and IP used by all hosts in the pool.
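If you do go the shared "Pool Certificate" route, a quick check like the sketch below can confirm that the certificate's SAN list covers every name and IP of every host. This is only an illustration under my own assumptions: the function name, hostnames, and addresses are made up, and it compares plain strings rather than parsing a real certificate.

```python
def missing_sans(cert_sans, host_names):
    """Return {host: [names not covered by the certificate SANs]} (hypothetical helper)."""
    cert = {san.lower() for san in cert_sans}
    gaps = {}
    for host, names in host_names.items():
        missing = sorted(n for n in names if n.lower() not in cert)
        if missing:
            gaps[host] = missing
    return gaps


# Placeholder pool data, for illustration only.
pool = {
    "xcp-ng-1": ["xcp-ng-1.example.lan", "192.0.2.11"],
    "xcp-ng-2": ["xcp-ng-2.example.lan", "192.0.2.12"],
}
sans = ["xcp-ng-1.example.lan", "192.0.2.11", "xcp-ng-2.example.lan"]
print(missing_sans(sans, pool))
```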

Thanks again for working with me on this!

techjeff

@olivierlambert said in Commands in Xen Orchestra Jobs no longer working:

The best solution is to rely on XOA, you know

I agree with @olivierlambert! Personally, I'm using XO from sources in my home lab environment -- nothing is "production" and I'm mostly having fun trying to give back to the open source community.

techjeff

@olivierlambert - I'm not sure who would be the best person at Vates to ping or whether there is another channel I should be using to request enhancements. I'm happy to be directed to the correct place if that's not here.

Despite the fact that I brought this upon myself... I do think that it would be nice if Xen Orchestra could improve the error handling/messaging for situations where a task fails due to an invalid object UUID. It seems like the UI is already making a simple XAPI call to look up the name-label of the SR; when that lookup fails, the schedule shows the invalid/unknown UUID in red text with a red triangle.

techjeff

It's also not hard to "copy" a template from one pool to another. So if you create your "golden image" template, you can just copy that template to another pool.

You can see the template Intangible Debian Bookworm 12 (Cloud Init)_2023-09-26T21:48:00.318Z that I originally created in my "performance" pool, then later when I set up my "efficiency" pool, I simply copied to an SR in my "efficiency" pool.

Screenshot 2025-03-03 153911.png

In order for a pool to utilize a template, the template needs to be within one of the shared SRs within that pool. Once it has been copied to an SR in the destination pool, that pool can now create new VMs using that template.

techjeff

@DustinB said in Invalid Health Check SR causes Bakup to fail with no error:

but I hadn't made any changes to the shares or the underlying storage on that host so I really wasn't sure what could have caused it.

But you did make a change to the pool, you

Correct... And, I spoke somewhat ambiguously. I was using the term "host" in the generic sense to describe the TrueNAS Scale that was hosting my backup SRs not in the sense of a proper xcp-ng host. In retrospect, NAS would have been more appropriate.

I have 2 TrueNASs, tns-01 and tns-02. tns-01 is the "primary" with solid state drives which hosts both the Old SR I had deleted and the new SR with which I replaced it. tns-02 is the "backup" with spinning drives and it hosts the SR where my backups are stored.

My backup jobs back up to the Remotes on tns-02, but I use the primary SR backed with solid state drives for restoring health checks because I don't want to wait all night.

So I was confused because I hadn't modified any of the Remotes or shares or anything on tns-02, but because my backup jobs use the old SR that I had removed from tns-01, it failed and didn't give me much information to figure out why.

If I wanted to externalize the responsibility, I would probably attribute it to the Health Check configuration being inside the schedule configuration which has always seemed not intuitive to me, though that might just be my brain

techjeff

@DustinB Yup, that's correct. I did it to myself! I overlooked that the SR I had removed was being utilized for restoring Health Check vms in that Backup job...and a few others too--yay homelab fun! lol

Naturally, when I attempted to run the backup job it failed, presumably because it detected that the UUID of the Health Check SR was invalid / not in the database; however, the error I got was essentially a default or fallback without any context-specific details. This feels like XO attempted to run the job, detected that the UUID wasn't valid, but didn't have a specific error message to describe the exception or erroneous situation that was caught/encountered.

I agree that it would be extra nice if the cautionary yellow triangle used to denote warnings elsewhere in the application could be used to denote a backup job with one or more "invalid" configuration entries.

Also, my guess is that XAPI is unaware of the Health Checks beyond Xen Orchestra using discrete calls to facilitate the health check process. If that's the case, I suspect the error is generated by Xen Orchestra itself, and I have hope that the XO devs could simply add an additional call, xe sr-list uuid={health-check-sr-uuid} for example, to validate that the SR does in fact exist.
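To illustrate the kind of pre-flight check I mean, here is a small sketch built around the xe sr-list call. It is not how Xen Orchestra works internally (XO talks to XAPI directly rather than shelling out to xe), and the function name and the injectable run parameter are my own assumptions so the logic can be exercised without a host.

```python
import subprocess


def sr_exists(sr_uuid, run=None):
    """Return True if `xe sr-list uuid=<sr_uuid> --minimal` reports that UUID."""
    if run is None:
        # Default runner: call the xe CLI on an XCP-ng host and capture stdout.
        def run(args):
            return subprocess.run(
                args, capture_output=True, text=True, check=False
            ).stdout

    output = run(["xe", "sr-list", f"uuid={sr_uuid}", "--minimal"])
    # --minimal prints a comma-separated list of UUIDs, or nothing when no SR matches.
    return sr_uuid in output.strip().split(",")


# Example with a fake runner standing in for a host where the SR was deleted.
print(sr_exists("00000000-0000-0000-0000-000000000000", run=lambda args: "\n"))
```

A job could run a check like this before starting and fail with a message naming the missing SR instead of an empty result.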

I do quality assurance testing and report bugs for a living, but I'm not familiar with this exact codebase, so my message is intended for illustrative, inspirational purposes.

techjeff

TL-DR - does your Health Check SR still exist? It turns out mine didn't!

This is a story about finding an unhandled edge case in the Xen Orchestra Backup [NG...? it's just the backup tool now and we don't call it "NG" anymore, right?] utility: When you delete the SR to which your backup job restores vms for Health Checks, it fails without much helpful information.

On the latest commit to master:

Screenshot from 2025-03-02 22-44-52.png

So I recently moved my vm disks to a new SR, made the new SR the pool default, and removed the previous SR. Then I noticed that my backups were failing and I was getting no error message. It was quite strange. I decided to update my Xen Orchestra "community edition" (installed using the ronivay XenOrchestraInstallerUpdater tool) to the latest master commit, but the issue was still happening.

Screenshot from 2025-03-02 21-55-55.png

An example log from this evening before I solved the mystery:

{
  "data": {
    "mode": "delta",
    "reportWhen": "failure"
  },
  "id": "1740978958075",
  "jobId": "9017a533-4a2a-42ad-9319-cba19247e062",
  "jobName": "Daily Delta Backup of step-ca at 7:05pm",
  "message": "backup",
  "scheduleId": "a2229c74-dc47-42f6-90fd-a86ef7e6529d",
  "start": 1740978958075,
  "status": "failure",
  "end": 1740978959190,
  "result": {}
}
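What makes this log entry hard to diagnose is that the status is "failure" while "result" is empty. As a rough sketch (the function names are mine, not part of XO), entries like this can be flagged when scanning exported logs:

```python
import json

SAMPLE = json.loads("""
{
  "id": "1740978958075",
  "message": "backup",
  "start": 1740978958075,
  "status": "failure",
  "end": 1740978959190,
  "result": {}
}
""")


def is_silent_failure(entry):
    """True when a log entry failed but carries no error details in `result`."""
    return entry.get("status") == "failure" and not entry.get("result")


def duration_ms(entry):
    """Run time in milliseconds, from the epoch-millisecond start/end fields."""
    return entry["end"] - entry["start"]


print(is_silent_failure(SAMPLE), duration_ms(SAMPLE))
```

The job above ran for barely more than a second, which fits a failure during setup rather than during the actual transfer.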

And when I went to the log entry under the Settings menu in Xen Orchestra, I saw an empty error message and this text when I clicked the eyeball icon to display details:

Screenshot from 2025-03-02 21-59-18.png

Screenshot from 2025-03-02 21-55-32.png

And it wasn't just one backup job either, over the course of the next day 3/4 backup jobs that all point to different shares on the same backup host were all failing -- but I hadn't made any changes to the shares or the underlying storage on that host so I really wasn't sure what could have caused it. Anyway it was the end of the weekend and time to go to bed.

This is all in my homelab, so it's not a big deal if I miss backups for a few days. I was doing this on a weekend near the end of February, a few days before the monthly release, which is probably when a number of last-minute approved commits get merged. I figured I would wait a few days for the dust to settle and that it would sort itself out once I updated again after the next official release at the end of the month.

Just tonight I decided to update to the latest Xen Orchestra again, and my jobs still failed, like immediately with no error message. I did a bit of googling and found One of the backups fail with no error.

After skimming through I noticed their reported results were really similar to mine, but I hadn't restored from a backup and I didn't know what was wrong. I figured it would be just as easy to follow the advice given there, however: recreate the job.

As I was referencing the schedule of the original job I noticed that the Health Check SR was in red text and just showed an unknown uuid which is when I realized that my backup jobs were still configured to restore Health Check VMs to the SR that I had destroyed and I had forgotten to update my jobs to restore to the new SR.

Screenshot from 2025-03-02 21-46-21.png

So, I am partially sharing a learning experience, and partially reporting that the error handling for this situation ought to be improved.

I have replicated this many times in my instance and I would be happy to provide any logs that might be useful beyond what I've already included.

Anyway, thanks for making this awesome project open-source so people like me can tinker at home.

techjeff

@tjkreidl I'm using the "Backup" feature of Xen Orchestra that was previously called Backup-ng, IIRC. A person creates a backup job that defines: the type of backup, i.e. Delta, Continuous Replication, etc. (those are the old names; the terminology is going to be changing with XO6/XOLite); the destination "remote" for the backup; either a discrete list of VMs to back up or "smart mode", which is dynamic based on pools to in/exclude and VM tags to in/exclude; and lastly a schedule, which has an option to perform a health check (XO restores the backed-up VM to the SR of your choice, waits for it to boot successfully, then deletes the restored VM since it was only temporary). The schedule displays the equivalent cron syntax, but I'm not sure whether it is actually implemented by cron or just displayed that way as a convenience.

AFAIK, the backup tool is a higher-level abstraction built on top of XAPI, but with additional niceties, like health checks in this particular case.

My two overlapping jobs are both using "smart mode" to determine the list of VMs to back up based on the tags assigned to the VMs and they both perform health checks. The first is a Delta backup that starts at midnight and usually completes fairly quickly, but sometimes it runs later than 2am when my other backup job starts (Continuous Replication to the local storage of one of my xcp-ng hosts).

The issue I'm encountering is that sometimes the second backup begins before the first is finished and sometimes a healthcheck VM is in the middle of booting which results in the second backup job including that healthcheck VM in the list of VMs that it needs to backup. Later, by the time the second backup gets around to actually backing up the healthcheck VM, that VM will have been deleted (the health check is complete), but the second backup job doesn't know that it was deleted, so when it starts making XAPI calls against that healthcheck VM's UUID, XAPI responds indicating that no VM exists with that UUID and XO reports the INVALID_UUID for that particular VM in the backup. Thankfully the backup job is smart enough to know that only that VM failed and it continues with the other VMs.
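A simplified model of that race, and of one possible mitigation, is sketched below: the VM list is resolved once up front, and each VM is re-checked right before its backup so that a vanished health check VM is skipped rather than reported as a failure. This is purely illustrative and not XO's code; vm_exists and backup_one are hypothetical callables standing in for the real XAPI operations.

```python
def backup_vms(selected_uuids, vm_exists, backup_one):
    """Back up each selected VM, skipping any that disappeared after selection."""
    done, skipped = [], []
    for uuid in selected_uuids:
        if not vm_exists(uuid):
            # The VM was deleted after the list was built (e.g. a finished health check).
            skipped.append(uuid)
            continue
        backup_one(uuid)
        done.append(uuid)
    return done, skipped


# The list was resolved while a temporary health check VM was still booting...
selected = ["vm-a", "healthcheck-tmp", "vm-b"]
# ...but by the time the job reaches it, only the real VMs remain.
still_present = {"vm-a", "vm-b"}
print(backup_vms(selected, still_present.__contains__, lambda uuid: None))
```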

techjeff

Thanks for the suggestion, @tjkreidl.

I'm not sure what commands I would run with this cron/systemd job/service.

I assume I would need to utilize the XO API calls to determine the list of running backups and then kill the second if the first is still running. The issue I see with your suggestion is that my backup log would end up with many failures when I currently only get just one, if any.

While this home-lab thing is a hobby and platform for learning, I have a feeling that your suggestion would require that I invest time into learning how to build, and then building, a Rube Goldberg machine that would result in me becoming dependent upon it. Alternatively, I could let the seemingly amenable devs work on my low-hanging suggested improvement to their relatively new feature: backup health checks. I suppose I could also look into submitting a pull request.

Regardless, these backups don't hold anything critical per se; only the feeling of satisfaction I get from maintaining moderately resilient backups (I can't afford "3-2-1", but I can afford "2") and getting that sweet notification from my xcp-ng hosted internal mail server that the backup was successful. TBH, I could lose "everything" and not really lose anything because I still have the knowledge and experience and it would give me the excuse to practice setting things up again from scratch.

Also, solutions like adding/upgrading hardware to speed up backups are not options at this point in life due to financial, electrical, and space limitations. As it stands, all of my hardware is 5-10+ years old, second-hand (probably 3rd, 4th or more in some cases--several pieces were donated to GoodWill they were so poorly valued several years ago), and I have only a single 20A 120V circuit breaker powering all lights and outlets in the upstairs of the apartment -- the joys of being an American millennial that graduated high school with little familial wealth just before the great recession that has never managed to get a degree

The neat thing is that these computers give me computational power and learning potential while heating our apartment instead of turning on a heater which only consumes money. I really need to move the computational heaters downstairs for more effective heating.. one of these days!

TL;DR - After some consideration I don't think your suggestion fits my use case, but it did provide for a good thought experiment!

techjeff

@djingo, FWIW, configuring with XO (from sources) has worked just fine for me. I actually just tested this the other day because I saw that my log server wasn't getting anything. When I looked at the pool settings I realized that I had made a typo. After I fixed it the logs started flowing.

TL;DR, I think just setting the remote syslog host is enough, but I could be wrong.

The host's logs might have some insights you can glean: https://xcp-ng.org/docs/troubleshooting.html#log-files

techjeff

@florent thank you! Please let me know if you would like any more information or further assistance from me. As of yet, the scenario I described is just a theory, as I wanted to get feedback about whether it is a reasonable hypothesis before attempting to conclusively replicate it.

Also, I realized the other day that I had a typo in my remote syslog host address (for who knows how long--apparently I don't check my logs often, which I'm calling a sign of reliable tools and setup), so I don't have logs beyond the backup report, which doesn't give much more information than the UUID of a VM that doesn't exist anymore.

In any case, now that my logging is fixed, if I see this happen again, I'll try to gather more details and share them.

techjeff
