@acebmxer so, NBD it was...
Holy moly, you have some good network performance!
What kind of SR at the source, and what remote at the destination?
What about the PIFs?
@acebmxer At the bottom of the pool's Advanced tab, is the backup network set to the NBD-enabled network accessible by both the hosts and XOA?
@acebmxer I have a new case where I managed to force the "fell back to a full" error...
i'll create a new topic for this
In the meantime, if you can, do a toolstack restart on your pool when no tasks are ongoing.
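A minimal sketch of that check-then-restart sequence, assuming a root shell on the pool master (these are the standard XCP-ng host commands):

```shell
# Make sure nothing is running before touching the toolstack.
# An empty result means no pending or in-progress XAPI tasks.
xe task-list

# Restart the toolstack (XAPI); running VMs are not affected.
xe-toolstack-restart
```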
Your backups with NBD could be better (spoiler alert: iptables rules...)
Thanks.
The old snapshots are being removed as the total never increases beyond 16, so when a new snapshot is added, the old one is removed.
Immediately removed, yes, but then garbage collection takes place.
And perhaps, with 19 × 16 GC operations to process, it can't all be done within one hour before the next CR is launched, and so on...
@florent was finally able to read the pull
/clap! The fix seems totally legit and consistent with XOA's RAM ramping up!
When will this be officially published, so we can disable the daily reboot of XOA & the XO proxies?
@McHenry could you screenshot the Health page, where we can see the chain length?
@McHenry I don't think more than 3 snapshots triggers an error; I just tested on one VM.
It is not recommended for "in production" VMs, but for a CR destination it's OK (as you would need to start a copy anyway).
Your problem, failing CR jobs, is probably due to garbage collection not finishing within the one-hour timeframe when the chain is long.
@simonp patched tonight; a job that took 3 hours yesterday took only 1 tonight.
So, big improvement!
I need to raise concurrency back up to 2 or 4 on some jobs to see if I can squeeze more out of the backup window.
Perhaps "in the context of an ongoing RPU, do not start halted VMs"?
or "boot only halted VMs that have HA enabled" ?
but I can imagine corner cases where this is not wanted.
some chicken & egg problem.
@stormi indeed.
But the restarting host is, by design, empty of all VMs because of the evacuation process.
The stopped VMs are on other hosts, so it is strange to see them booting when the restarted host comes back online.
PS: as I wrote that, I realized my "error": halted VMs are on no host at all, just attached to the pool.
I did verify, and yes:
ha-reboot-vm-on-internal-shutdown ( RW): true
It is enabled on our pool, but there is no HA:
ha-enabled ( RO): false
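For reference, a hedged sketch of how to read those flags with `xe`. Treat the exact object as an assumption: `ha-enabled` is a pool parameter, and I am assuming `ha-reboot-vm-on-internal-shutdown` lives on the pool object too on recent XAPI versions, but the scope may differ on yours.

```shell
# Assumption: both flags are pool-level parameters on this XAPI version.
POOL=$(xe pool-list --minimal)
xe pool-param-get uuid="$POOL" param-name=ha-enabled
xe pool-param-get uuid="$POOL" param-name=ha-reboot-vm-on-internal-shutdown
```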
@stormi no, we do not use HA; it's disabled on the pool and on the VMs.
@olivierlambert having done an RPU yesterday on 3 hosts, I still have the "bug" where some VMs with "auto start" switched ON, but halted on purpose, boot when a host is rebooted.
We do stop some VMs during the RPU to shorten the migration times, but at every host reboot, they DO start up.
Annoying. Not critical, but annoying.
@oliv77 you should leave your XOA in the right pool, I guess.
As the networks are defined at pool level, you would need to be sure the networks in pool A (and the management IPs of the hosts in pool A) are reachable from an XOA-A placed in pool B.
It can be done, but why the headache?
A master being down has no impact on VMs running on other hosts in the pool.
If pool A's master is down, there is no bonus to having XOA-A in pool B: you are still blind as long as pool A doesn't recover its master.
@MajorP93 sounds promising
I didn't patch; I'm waiting for the official release.
Can you tell me if your jobs have a lot of concurrency configured?
For the time being I had to lower concurrency to 1 (it was 4 or 6) to mitigate the added time of simultaneous merges.
Eager to see if, with the patches, we can pump concurrency back up in the jobs.
@McHenry 19 VMs means 19 chains of 16 VDIs.
At each hourly run, a new snapshot is created (a few minutes) and the oldest one is merged/garbage-collected into the first snap (time undetermined).
I guess 19 merges plus chain garbage collection cannot be done in the one-hour timeframe before the next CR starts.
you possibly have a chain growing
Can you check the unhealthy VDI section in DASHBOARD/HEALTH at 11 am?
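The back-of-the-envelope math behind that guess, as a sketch. The per-VM snapshot and per-VDI merge durations here are hypothetical placeholders; measure your own, as real merge time depends on VDI size, SR type, and load.

```shell
# Rough estimate: can an hourly CR schedule keep up with coalesce/GC work?
vms=19          # VMs in the CR job (19 chains)
merge_min=4     # ASSUMED minutes to coalesce one VDI
snap_min=2      # ASSUMED minutes to snapshot one VM

work=$(( vms * (merge_min + snap_min) ))
echo "work per run: ${work} min vs 60 min window"
if [ "$work" -gt 60 ]; then
    echo "GC falls behind -> chain length grows run after run"
fi
```

With these assumed numbers each run needs more than the hour it has, so the backlog (and the chain) grows until a run fails.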
It's better IMO to have a solid backup less frequently than have them fail on a regular basis.
totally agree.
@tjkreidl either skip or wait until possible
I'm used to Veeam Backup & Replication, which is very resilient to these corner cases. On VMware, if it sees that a datastore has too many snapshots, or that some backup resource is not ready yet (you can throttle the number of active workers per repository or per proxy), Veeam will just wait for availability and keep going.
The problem with this approach is that it can shift the schedule in time from when you expect the CR or backup to happen.
But skipping altogether can be a problem if @mchenry needs compliance with a certain number of replicas.
Waiting vs. skipping: in a perfect world, the devs would give us a switch to choose our destiny.
PS: I know XO Backup is not meant to map 100% onto Veeam's features, but some of those features would really augment the XO Backup experience. They would just have to take the Xen environment into account (there is no GC in a VMware infrastructure).