Maybe we could just fork xe-guest-utilities and figure out how to add it in there? Or, if we want a distinct package, how about a more generic name like xcp-guest-agent? The rationale is that the agent's purpose can be broader than just snapshot/quiesce. Watches and communication with xenstore open up lots of possibilities in domU.
Posts
-
RE: Proof-of-concept for VSS-like quiesce agent in Linux
-
Proof-of-concept for VSS-like quiesce agent in Linux
Dear XCP-ng community,
In my unending quest for bulletproof backups, I've stumbled upon a technique which might be of interest to XCP-ng developers and users. In short, it's a method for freezing filesystem I/O inside a domU during a snapshot event. It could potentially be extended into a hook provider, allowing not just blocking of filesystem I/O but coordinating the execution of any tasks moments before a snapshot is created, like activating database locks, unmounting filesystems, etc.
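To make the hook-provider idea a little more concrete, here is a minimal sketch of how pre- and post-snapshot tasks might be coordinated. Nothing here exists in the agent: the HookRunner class and its method names are hypothetical, and hooks are plain Python callables rather than external scripts.

```python
class HookRunner:
    """Hypothetical hook provider: runs 'pre' callbacks before a snapshot
    and the matching 'post' callbacks afterwards, newest first."""

    def __init__(self):
        self._hooks = []    # (name, pre, post) in registration order
        self._entered = []  # hooks whose pre step succeeded

    def register(self, name, pre, post):
        self._hooks.append((name, pre, post))

    def run_pre(self):
        """Run pre-hooks in registration order; stop at the first failure."""
        for hook in self._hooks:
            name, pre, _post = hook
            try:
                pre()
            except Exception as exc:
                print(f"pre-hook {name} failed: {exc}")
                return False
            self._entered.append(hook)
        return True

    def run_post(self):
        """Undo only the hooks whose pre step succeeded, in reverse order."""
        while self._entered:
            name, _pre, post = self._entered.pop()
            try:
                post()
            except Exception as exc:
                print(f"post-hook {name} failed: {exc}")
```

A database lock would be registered before the filesystem freeze so that it is released after the thaw. If run_pre() returns False, calling run_post() rolls back whatever had already been done.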
Requirements
Because the agent leverages the existing vm-snapshot-with-quiesce XAPI call in XenServer/XCP-ng, no modifications whatsoever are required in dom0. In domU, the Python-based agent requires the psutil and subprocess modules, as well as a slightly-modified version of pyxs. The upstream version needs a small fix to work around a bug in XenBus. The agent uses fsfreeze, a standard tool included in util-linux, to freeze the guest filesystems. Supported filesystems are ext3, ext4, xfs, jfs and reiserfs.
Disclaimer
The code below is just a rough proof-of-concept. It's far from being well tested and is missing lots of stuff like better error handling. I'm sharing it as an experiment for the benefit of the community. If you decide to test this on your own systems, make sure you have good backups. I'm not responsible for any damage to your data and/or systems!
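As one illustration of the error handling the proof-of-concept lacks, the sketch below guarantees that every filesystem that was frozen is thawed again, even if something fails in between. It assumes a restructured agent in which the wait for the snapshot happens inside the with-block, which is not how the event loop in this post is organised. The names frozen and run_fsfreeze are made up for this example.

```python
import subprocess
from contextlib import contextmanager


def run_fsfreeze(flag, mountpoint):
    """Call fsfreeze with -f (freeze) or -u (unfreeze); True on success."""
    return subprocess.run(["/sbin/fsfreeze", flag, mountpoint]).returncode == 0


@contextmanager
def frozen(mountpoints, run=run_fsfreeze):
    """Freeze each mountpoint, yield the ones actually frozen, always thaw them."""
    done = []
    try:
        for mp in mountpoints:
            if run("-f", mp):
                done.append(mp)
            else:
                print("Error, unable to freeze", mp)
        yield done
    finally:
        # Thaw in reverse order, even if the body of the with-block raised.
        for mp in reversed(done):
            if not run("-u", mp):
                print("Error, unable to unfreeze", mp)
```

The run parameter exists only so the logic can be exercised with a fake runner, without root privileges or real mountpoints.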
Code
import psutil
import subprocess
from pyxs import Client

with Client(xen_bus_path="/dev/xen/xenbus") as c:
    domid = c.read(b"domid")
    vsspath = c.read(b"vss")
    dompath = c.get_domain_path(domid)
    print("Got domain path:", dompath.decode('ascii'))
    print("Got VSS path:", vsspath.decode('ascii'))

    print("Enabling quiesce feature")
    c.write(dompath + b"/control/feature-quiesce", b"1")

    print("Establishing watches")
    m = c.monitor()
    m.watch(dompath + b"/control/snapshot/action", b"")
    m.watch(vsspath + b"/status", b"")

    # Mountpoints frozen for the snapshot currently in progress
    fslist = []

    print("Waiting for snapshot signal...")
    for wpath, _token in m.wait():
        if wpath == dompath + b"/control/snapshot/action" and c.exists(dompath + b"/control/snapshot/action"):
            action = c.read(dompath + b"/control/snapshot/action")
            if action == b"create-snapshot":
                print("Received snapshot-create event")
                # Acknowledge VSS request
                c.delete(dompath + b"/control/snapshot/action")
                c.write(vsspath + b"/status", b"provider-initialized")

                # Construct list of VDIs to snapshot
                devlist = []
                vdilist = []
                vbdlist = c.list(dompath + b"/device/vbd")
                for vbd in vbdlist:
                    state = c.read(dompath + b"/device/vbd/" + vbd + b"/state")
                    devtype = c.read(dompath + b"/device/vbd/" + vbd + b"/device-type")
                    # Only connected (state 4) disks are considered
                    if state == b"4" and devtype == b"disk":
                        backend = c.read(dompath + b"/device/vbd/" + vbd + b"/backend")
                        vdiuuid = c.read(backend + b"/sm-data/vdi-uuid")
                        devlist.append(c.read(backend + b"/dev").decode('ascii'))
                        vdilist.append(vdiuuid)

                # Populate VDI snapshot list
                for vdi in vdilist:
                    c.mkdir(vsspath + b"/snapshot/" + vdi)

                # Freeze filesystems
                print("Begin freezing filesystems")
                fslist = []
                for p in psutil.disk_partitions():
                    if p.fstype not in ['ext3', 'ext4', 'xfs', 'jfs', 'reiserfs']:
                        continue
                    for d in devlist:
                        if p.device.startswith("/dev/" + d):
                            r = subprocess.run(["/sbin/fsfreeze", "-f", p.mountpoint])
                            if r.returncode == 0:
                                fslist.append(p.mountpoint)
                                print("Successfully froze " + p.mountpoint)
                            else:
                                print("Error, unable to freeze " + p.mountpoint)

                # Instruct snapwatchd to create VM snapshot
                print("Sending create-snapshots signal...")
                c.write(vsspath + b"/status", b"create-snapshots")

        elif wpath == vsspath + b"/status" and c.exists(vsspath + b"/status"):
            status = c.read(vsspath + b"/status")
            if status == b"snapshots-created":
                print("Received snapshot-created event")
                # Unfreeze filesystems
                for f in fslist:
                    r = subprocess.run(["/sbin/fsfreeze", "-u", f])
                    if r.returncode == 0:
                        print("Successfully unfroze", f)
                    else:
                        print("Error, unable to unfreeze", f)  # this should not happen ...
                c.write(vsspath + b"/status", b"create-snapshotinfo")

            elif status == b"snapshotinfo-created":
                print("Received snapshotinfo-created event")
                # Create fake VSS transport id (Windows-only...)
                c.mkdir(dompath + b"/control/snapshot/snapid")
                c.write(dompath + b"/control/snapshot/snapid/0", b"0")
                c.write(dompath + b"/control/snapshot/snapid/1", b"1")
                # Record snapshot uuid
                snapuuid = c.read(vsspath + b"/snapuuid")
                print("Snapshot created:", snapuuid)
                c.write(dompath + b"/control/snapshot/snapuuid", snapuuid)
                # Signal snapshot creation
                print("Sending snapshot-created signal...")
                c.write(dompath + b"/control/snapshot/status", b"snapshot-created")
                # Cleanup vsspath
                print("Cleaning up")
                c.delete(vsspath + b"/snapshot")
                c.delete(vsspath + b"/snaptype")
                c.delete(vsspath + b"/snapinfo")
                c.delete(vsspath + b"/snapuuid")
                c.delete(vsspath + b"/status")

            elif status == b"snapshots-failed":
                print("Received snapshots-failed event")
                # Unfreeze filesystems
                for f in fslist:
                    r = subprocess.run(["/sbin/fsfreeze", "-u", f])
                    if r.returncode == 0:
                        print("Successfully unfroze", f)
                    else:
                        print("Error, unable to unfreeze", f)  # this should not happen ...
                # Signal snapshot error
                print("Sending snapshot-error event...")
                c.write(dompath + b"/control/snapshot/status", b"snapshot-error")
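The message sequence buried in the agent code can be summarised as a small lookup table. This is only a reading aid: every node name and value is taken from the code above, while the table layout and the names HANDSHAKE and reply_for are made up for this summary. It lists only what the guest agent does; the other side of the conversation is whatever dom0 (snapwatchd, going by the comment in the code) writes in response.

```python
# Watched nodes: ACTION and RESULT are relative to the domain path,
# STATUS is relative to the vss path read at startup.
ACTION = "control/snapshot/action"
RESULT = "control/snapshot/status"
STATUS = "status"

# (watched node, value seen) -> (work done by the agent, writes in order)
HANDSHAKE = {
    (ACTION, "create-snapshot"): (
        "list VDIs and freeze filesystems",
        [(STATUS, "provider-initialized"), (STATUS, "create-snapshots")],
    ),
    (STATUS, "snapshots-created"): (
        "unfreeze filesystems",
        [(STATUS, "create-snapshotinfo")],
    ),
    (STATUS, "snapshotinfo-created"): (
        "record snapid/snapuuid and clean up the vss path",
        [(RESULT, "snapshot-created")],
    ),
    (STATUS, "snapshots-failed"): (
        "unfreeze filesystems",
        [(RESULT, "snapshot-error")],
    ),
}


def reply_for(node, value):
    """Return (work, writes) for an event, or None if the agent ignores it."""
    return HANDSHAKE.get((node, value))
```

Any other value on a watched node falls through without a reply, which matches the if/elif chains in the agent.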
Resources
The key resource that allowed me to develop this was the xenstore command, available in both domU and dom0, in particular xenstore ls, which is unfortunately only available in dom0. Logs in /var/log/SMlog and /var/log/xensource.log were also invaluable, as was the source code for both XAPI and the snapwatchd storage-manager (sm) component.
Having a Windows Server guest VM on hand was also useful to understand the xenstore messaging task sequence of a quiesce snapshot using the Citrix Xen VSS provider.
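Since xenstore ls is not available in the guest, a rough stand-in can be built from the same list and read calls the agent already uses. This is an untested sketch: walk is a made-up helper, it assumes a pyxs-style client whose list() and read() take and return bytes, and it does not handle errors for nodes the guest is not allowed to read.

```python
def walk(client, path):
    """Yield (path, value) for `path` and every node below it, depth first."""
    yield path, client.read(path)
    for child in client.list(path):
        yield from walk(client, path.rstrip(b"/") + b"/" + child)


# Possible use inside the guest, with the Client from the agent code:
#
#   with Client(xen_bus_path="/dev/xen/xenbus") as c:
#       dompath = c.get_domain_path(c.read(b"domid"))
#       for node, value in walk(c, dompath + b"/control"):
#           print(node.decode('ascii'), "=", value)
```

Because the helper only depends on list() and read(), it can be tried against a fake client object before running it on a real guest.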
Development
I'm curious to see if this can actually be useful to the XCP-ng community.
If you're curious and familiar with Xen, please do test it and report back!
-
RE: Backup Job HTTP connection abruptly closed
For the record, since upgrading to 5.63 the issue hasn't re-occurred at all.