@R2rho
Faulty gear always sucks. But who would've guessed that two separate systems would produce the same problems? Highly unlikely, but never impossible.
Good luck with the RMA
So I was doing more backup-testing today.
When doing a "Full Backup" with the subsequent health check enabled, the job was interrupted. However, the backup itself was successful; it was the health check portion that failed/was interrupted. This doesn't seem to have triggered the retry function.
I propose that health checks be included in the retry criteria, but with the added granularity that a retry only repeats the operation that actually failed. If the backup itself failed, retry and continue the job from there. If it was the health check, there's no need to redo the entire backup, so only the health check needs to be rerun.
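To make the idea concrete, here's a rough sketch in TypeScript, with made-up names and shapes (I have no idea how XO structures its jobs internally) of what "retry only the failed phase" could look like:

```typescript
// Rough sketch: resume a backup job from the phase that failed instead of
// rerunning every phase. Phase names and shapes are assumptions, not XO's.
type Phase = "backup" | "healthCheck";

interface PhaseResult {
  phase: Phase;
  ok: boolean;
}

// Run phases in order, skipping any phase that already succeeded on a
// previous attempt; stop at the first failure so a retry can resume there.
function runJob(
  phases: Phase[],
  run: (p: Phase) => boolean,
  alreadyDone: Set<Phase> = new Set()
): PhaseResult[] {
  const results: PhaseResult[] = [];
  for (const phase of phases) {
    if (alreadyDone.has(phase)) {
      results.push({ phase, ok: true }); // succeeded earlier, don't redo it
      continue;
    }
    const ok = run(phase);
    results.push({ phase, ok });
    if (!ok) break; // a retry would pick up from this phase
  }
  return results;
}
```

So in my case the backup phase would be marked as done, and the retry would only rerun the health check.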
If I'm wrong about how things work, then feel free to correct me. I'm only spitballing ideas based on my understanding of what I'm seeing.
Thanks
Screenshot of one of the interrupted backup reports:
Well, unfortunately I've got nothing... Extremely weird indeed.
Given that the BIOS and everything else is updated to the latest possible version, the first thing I'd do with these kinds of symptoms is disable all power management and/or C-states in the BIOS.
Some combinations of OS and hardware just don't work properly together.
If nothing else, it's an easy, non-intrusive test to do.
Update: I see that your motherboard has an IPMI interface. If the issues happen again after you've disabled power management/C-states, you could use the IPMI's remote functionality to hopefully get some more info from the sensors and such.
Looking a tiny bit further: the same discrepancy is present with Disaster Recovery too, being referenced as Full Replication (formerly: Disaster Recovery).
XO and the docs conflict with each other over what the backup function should actually be called. See attached pic for an example.
This is truly a niche situation, but I noticed that when I have VMs without any disks attached, the Rolling Snapshot schedule doesn't remove snapshots in accordance with the schedule's Snapshot retention.
So I'm guessing the schedule only looks at cleaning up snapshots of disks. But since the snapshots are actually of the entire VM, maybe this should be taken into account as well?
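Just to illustrate what I mean, here's a rough sketch of retention cleanup that works on VM-level snapshots directly (the `Snapshot` shape and function name are made up for illustration, not XO's actual data model):

```typescript
// Rough sketch of retention cleanup applied to VM-level snapshots directly.
interface Snapshot {
  id: string;
  created: number; // creation time, epoch ms
}

// Keep only the newest `retention` snapshots, whether or not the VM has any
// disks, and return the ids of the ones that should be removed.
function snapshotsToDelete(snapshots: Snapshot[], retention: number): string[] {
  return [...snapshots]
    .sort((a, b) => b.created - a.created) // newest first
    .slice(retention)
    .map(s => s.id);
}
```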
If this is working as intended, then just ignore this post.
I finally have some new hardware to play with, and I'm noticing that the Health Check fails due to the vGPU being busy with the actual host.
INTERNAL_ERROR(xenopsd internal error: Cannot_add(0000:81:00.0, Xenctrlext.Unix_error(4, "16: Device or resource busy")))
My suggestion is that, for the sake of Health Checks, any attached PCIe devices should be unassigned. If it's crucial that they stay attached, then maybe have an opt-in checkbox, either on the VM or next to the Health Check portion of backups?
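Roughly what I'm imagining, sketched in TypeScript (the types and field names here are made up, this isn't XO's or xenopsd's API):

```typescript
// Rough sketch of the idea: detach passed-through PCIe devices before the
// health-check boot, then always restore them afterwards.
interface Vm {
  id: string;
  attachedPci: string[]; // e.g. "0000:81:00.0" from the error above
}

function withPciDetached<T>(vm: Vm, action: (vm: Vm) => T): T {
  const saved = vm.attachedPci;
  vm.attachedPci = []; // detach so the host's copy of the device stays free
  try {
    return action(vm); // e.g. boot the health-check copy here
  } finally {
    vm.attachedPci = saved; // restore the original assignment, even on error
  }
}
```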
The address field doesn't trim trailing whitespace. Not a deal breaker, but it did take me a couple of minutes to find out why my copy/pasted address was giving me errors.
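For reference, the fix could be as simple as normalizing the input before validation. A minimal sketch (`normalizeAddress` is my name for it, not an actual XO function):

```typescript
// Minimal sketch: trim a pasted address before validating it, so
// leading/trailing whitespace doesn't cause spurious errors.
function normalizeAddress(input: string): string {
  return input.trim();
}
```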
I'm unsure if Threadripper is affected, but the EPYCs have a problem with networking speeds, and I'm quite sure they haven't found the root cause yet.