  • RE: Switching to XCP-NG, want to hear your problems

    @crazyadm1n It's pretty good; I'm not able to max it out with XCP-ng on either iSCSI or NFS, so it doesn't really matter.
    The PowerStore supports a native NFS server, so it is really easy to set up a new filesystem with an NFS share, which actually makes it easier to set up than iSCSI with multipathing.
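
    For anyone scripting this, the SR creation is a single xe command. A minimal sketch in Python, assuming a hypothetical NFS export at 10.0.0.10:/xcp-sr and that it runs on a pool host where the xe CLI is available:

    ```python
    import subprocess

    # Create a shared NFS SR on XCP-ng via the xe CLI.
    # Server IP, export path, and label are illustrative placeholders.
    subprocess.run(
        [
            "xe", "sr-create",
            "type=nfs",
            "shared=true",
            "content-type=user",
            "name-label=PowerStore-NFS",
            "device-config:server=10.0.0.10",
            "device-config:serverpath=/xcp-sr",
            "device-config:nfsversion=4.1",  # optional; omit to use the default
        ],
        check=True,
    )
    ```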

    posted in Migrate to XCP-ng
  • RE: Switching to XCP-NG, want to hear your problems

    Welcome to the forum!
    We are running both XCP-ng and VMware, and there are a ton of things that have annoyed us in XCP-ng, although it does work and does the job if you have basic needs.
    As long as you're not doing anything "special", it will work and it will be reliable and stable.

    Some things we have learned and are living with:

    1. Disk I/O is bad compared to VMware, but it works and it's not "too bad".
    2. Live storage migration is a bit slow, but since you're using shared storage this isn't an issue.
    3. Always use NFS if you can; we're also using Dell storage (PowerStore series) with NFS and it is working really well. Dedup ratio is at 6.7:1 and performance is pretty good.
    4. There are issues with Secure Boot, so we're not using it.
    5. VMs may "lag" if they have a lot of RAM when you migrate between hosts; in our case we're running a lot of WSFC SQL Server failover clusters and a migration can trigger a failover. To resolve that we've been tweaking the registry in our Windows installations running WSFC.
    6. Backup performance compared to Veeam is very slow: in VMware we're doing 9 Gbit/s, while in XCP-ng it's about 50-80 MB/s.

    posted in Migrate to XCP-ng
  • RE: The writer IncrementalRemoteWriter has failed the step writer.beforeBackup()

    Usually this error is caused by a merge process from a previous backup job still running in the background.

    posted in Backup
  • RE: Some HA Questions Memory Error, Parallel Migrate, HA for all VMs,

    @vahric-0 Then I am out of suggestions, sorry.

    posted in Management
  • RE: Migrate VMWare VM to XCP-ng fails 'Cannot read properties of undefined (reading 'errored')'

    @fluxtor No, I don't know if/when this functionality will be added.

    posted in Management
  • RE: stunnel sdn cert error when adding host to pool

    @CJ IDK. You could try restarting XO Server to see if that makes a difference.

    posted in Management
  • RE: Migrate VMWare VM to XCP-ng fails 'Cannot read properties of undefined (reading 'errored')'

    Warm migration is not currently supported with VMFS6.

    posted in Management
  • RE: stunnel sdn cert error when adding host to pool

    Have you tried disabling the plugin?

    posted in Management
  • Failing Backups: Trying To Find Root Cause

    This may be something I should open a ticket for, but I wanted to post here publicly first since it could benefit others.

    One of the environments I manage has consistent backup failures, and I haven't been able to get to the root cause of them, so this post will probably be long, with lots of details. The short of it is that I think it's only happening to large VMs, but I can't figure out why; the majority fail on "clean VM directory" and show missing VHDs or missing parent VHDs.

    To start, this setup has 2 backup jobs that run for all VMs nightly: one is uploaded to Backblaze and the other is sent over SMB to a TrueNAS machine.

    I have a similar setup in my lab at home, and it has never failed once. But all my VMs are under 100GB, while this other environment has some that are more than 2TB, which is why I'm starting to think that is the root cause.

    XOA version is 5.93.1, so not 100% up to date (I will update shortly), but this has been an ongoing issue for months now, so I don't think it's version specific.

    Backup Schedules

    First I want to explain my schedules in detail, then I'll go into the errors we are seeing.

    Both schedules back up the same set of VMs, 2 of which are slightly over 2TB in size (spread across several VHDs).

    Backblaze Backup

    • This one is set up to run every night
    • Concurrency of 2
    • Timeout of 72 hours (since the VMs are large I set the timeout very high, but usually this finishes within a few hours, sometimes taking around 10)
    • Full Backup Interval is 15
    • NBD is enabled and set to 4 connections per disk
    • Speed is limited to 500MiB/s (this is never hit though)
    • Snapshot mode is normal
    • Schedule is set to run every weekday at 5PM with a retention of 14, and force full backup is disabled
    • Worth noting these B2 bucket settings:
      • Lifecycle is set to keep only the last version of the file (the plan is to adjust this more later)
      • Object lock is enabled but no default is set, so nothing should be getting locked

    SMB NAS Backup

    • Concurrency of 1
    • Full Backup Interval of 30
    • NBD is disabled, number of connections is 1
    • Snapshot mode is normal
    • Schedule is set to run every weekday at 8PM with a retention of 7
    • This NAS does back up this VM directory itself (an additional backup I run), but those runs start at 7PM and I have it set to snapshot the dataset before backing it up, so in theory anything XCP-ng is touching shouldn't be messed with
      • I've been able to confirm TrueNAS's "snapshot first" feature (which runs before the backup starts) takes a snapshot, backs up the data from that snapshot, then deletes the snapshot; the whole point is to prevent file locking on a directory that other things are accessing (the pattern is sketched below)
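
    That snapshot-first pattern is straightforward to reproduce outside TrueNAS if anyone wants to replicate it. A minimal sketch, assuming a hypothetical ZFS dataset tank/xcp-backups and rsync as the copy step; TrueNAS does the equivalent internally:

    ```python
    import subprocess

    DATASET = "tank/xcp-backups"    # hypothetical dataset name
    SNAP = f"{DATASET}@replication"

    # 1. Snapshot the dataset so the copy sees a frozen, consistent view.
    subprocess.run(["zfs", "snapshot", SNAP], check=True)
    try:
        # 2. Copy from the snapshot's read-only path (.zfs/snapshot/<name>),
        #    so files XO is actively writing are never locked or half-read.
        src = f"/mnt/{DATASET}/.zfs/snapshot/replication/"
        subprocess.run(["rsync", "-a", src, "backup-host:/backups/xcp/"], check=True)
    finally:
        # 3. Drop the snapshot once the copy is done.
        subprocess.run(["zfs", "destroy", SNAP], check=True)
    ```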

    I know the backup retention periods etc. are a bit odd here; if we think that could be causing an issue I'm happy to adjust them, as I was planning on reworking retention sometime soon anyway. But as far as I can tell it shouldn't cause a major problem.

    The Errors

    Backblaze

    • Several VMs, including smaller ones, are seeing this issue, which maybe means my theory that this is a large-VM-specific issue is wrong?
    • It always happens during the clean VM directory process
    • Last log I have is 3 VMs with the below:
      • UUID is Duplicated
      • Orphan Merge State
      • Parent VHD is missing (several times for each VM)
      • Unexpected number of entries in backup cache
      • Some VHDs linked to the backup are missing
    • On all of these, the Backblaze "transfer" section of the logs is green and successful, but the clean VM directory step is not; it seems the merge is failing
    • Retrying VMs will sometimes work, but other times they just fail again (a sketch for auditing the bucket follows this list)
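
    Given the lifecycle rule mentioned above, one way to narrow this down is to list what actually exists in the bucket, including hidden and older versions, and check whether the VHDs the clean step reports as missing were removed by the lifecycle rule or were never uploaded at all. A minimal sketch using the b2sdk Python package; the bucket name and the xo-vm-backups/ prefix are assumptions to adjust:

    ```python
    from b2sdk.v2 import B2Api, InMemoryAccountInfo

    api = B2Api(InMemoryAccountInfo())
    api.authorize_account("production", "KEY_ID", "APP_KEY")  # your B2 credentials
    bucket = api.get_bucket_by_name("xo-backups")             # hypothetical bucket name

    # Walk every file version (not just the latest) under the backup prefix,
    # so VHDs hidden or deleted by the lifecycle rule become visible.
    for version, _ in bucket.ls("xo-vm-backups/", latest_only=False, recursive=True):
        if version.file_name.endswith(".vhd"):
            print(version.file_name, version.action, version.upload_timestamp)
    ```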

    SMB

    • Only seems to happen with big VMs; they will work fine for a while (several weeks), then start erroring out
    • The only fix I've found is to wipe the entire VM's directory on the NAS so the backup starts fresh
    • The error is always parent VHD is missing (with a path to a VHD that, as far as I can tell, exists; there's a checker sketch below this list)
    • That is followed by an "EBUSY: resource busy or locked, unlink (vhd path)"
    • It's always a VHD that starts with a period, so ".2024**********.vhd"
    • Checking the NAS via shell, the file definitely exists and has the same permissions as everything else in the directory
    • Now another super interesting thing: if I go to the VM Restore page and select the one that failed on SMB, it shows no original key backup, like so (top/most recent to bottom):
      • Incremental
      • Incremental
      • Incremental
      • Incremental
      • Key
      • Incremental
      • Incremental

    So as you can see, there is no original Key for the last 2 incrementals.
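
    Since the chain metadata lives in each VHD's own header, the "parent missing" claim can be checked independently of XO. A minimal sketch that reads the parent name straight from the VHD format's dynamic-disk header (offsets per the published VHD spec); point it at one VM's backup directory on the NAS. It resolves parents naively in the same directory, which may not match XO's exact resolution logic:

    ```python
    import os
    import struct
    import sys

    def vhd_parent(path):
        """Return the parent's unicode name if `path` is a differencing
        VHD, else None. Offsets follow the published VHD format spec."""
        with open(path, "rb") as f:
            footer = f.read(512)                      # footer copy at offset 0
            if footer[0:8] != b"conectix":
                raise ValueError(f"{path}: not a VHD")
            disk_type = struct.unpack(">I", footer[60:64])[0]
            if disk_type != 4:                        # 4 = differencing disk
                return None
            f.seek(struct.unpack(">Q", footer[16:24])[0])
            header = f.read(1024)                     # dynamic disk header
            if header[0:8] != b"cxsparse":
                raise ValueError(f"{path}: bad dynamic disk header")
            # Parent Unicode Name: 512 bytes of UTF-16BE at header offset 64.
            return header[64:576].decode("utf-16-be").rstrip("\x00")

    backup_dir = sys.argv[1]
    for name in sorted(os.listdir(backup_dir)):
        if name.endswith(".vhd"):
            parent = vhd_parent(os.path.join(backup_dir, name))
            if parent and not os.path.exists(os.path.join(backup_dir, parent)):
                print(f"{name}: parent {parent!r} is MISSING")
    ```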

    Any ideas as to what could be causing this? I'm thinking they might be 2 entirely separate issues; it's just odd that they're both happening.

    I will do what I can to troubleshoot this directly as well and update this post with anything else I find.

    posted in Backup