CEPH FS Storage Driver
-
Hi @r1, I've only had a quick play so far, but it appears to work quite well.
I've shut down a Ceph MDS node and it fails over to one of the other nodes and keeps working away.
I am running my Ceph cluster in VMs (at the moment one per host, with the second SATA controller on the motherboard passed through), so I have shut down all nodes, rebooted my XCP-ng server, booted up the Ceph cluster and was able to reconnect the Ceph SR, so that's all good.
Unfortunately, at the moment the patch breaks reconnecting the NFS SR (I think because the scan function is broken?), but reverting with
yum reinstall sm
fixes that, so it's not a big issue, but it will make things a bit more difficult after a host reboot until that's sorted.
-
@jmccoy555 said in CEPH FS Storage Driver:
Unfortunately at the moment the patch breaks reconnecting the NFS SR,
It wasn't intended to... I'll recheck this.
-
@r1 OK, I'm only mentioning it because it didn't reconnect (the NFS server was up prior to boot), and if you try to add a new one, the Scan button returns an error.
-
Hi @r1 Just tried this on my pool of two servers, but no luck. Should it work? I've verified the same command on a host not in a pool and it works fine.
[14:20 xcp-ng-bad-1 /]# xe sr-create type=nfs device-config:server=10.10.1.141,10.10.1.142,10.10.1.143 device-config:serverpath=/xcp device-config:options=name=xcp,secretfile=/etc/ceph/xcp.secret name-label=CephFS
Error: Required parameter not found: host-uuid
[14:20 xcp-ng-bad-1 /]# xe sr-create type=nfs device-config:server=10.10.1.141,10.10.1.142,10.10.1.143 device-config:serverpath=/xcp device-config:options=name=xcp,secretfile=/etc/ceph/xcp.secret name-label=CephFS host-uuid=c6977e4e-972f-4dcc-a71f-42120b51eacf
Error code: SR_BACKEND_FAILURE_140
Error parameters: , Incorrect DNS name, unable to resolve.,
I've verified mount.ceph works, and I can manually mount with the
mount
command. From /var/log/SMlog:
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] lock: opening lock file /var/lock/sm/fa472dc0-f80b-b667-99d8-0b36cb01c5d4/sr
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] lock: acquired /var/lock/sm/fa472dc0-f80b-b667-99d8-0b36cb01c5d4/sr
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] sr_create {'sr_uuid': 'fa472dc0-f80b-b667-99d8-0b36cb01c5d4', 'subtask_of': 'DummyRef:|cf1a6d4a-fca9-410e-8307-88e3421bff4e|SR.create', 'args': ['0'], 'host_ref': 'OpaqueRef:5527aabc-8bd0-416e-88bf-b6a0cb2b72b1', 'session_ref': 'OpaqueRef:49495fa6-ec85-4340-b59d-ee1f037c0bb7', 'device_config': {'server': '10.10.1.141', 'SRmaster': 'true', 'serverpath': '/xcp', 'options': 'name=xcp,secretfile=/etc/ceph/xcp.secret'}, 'command': 'sr_create', 'sr_ref': 'OpaqueRef:f57d72be-8465-4a79-87c4-84a34c93baac'}
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] _testHost: Testing host/port: 10.10.1.141,2049
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] _testHost: Connect failed after 2 seconds (10.10.1.141) - [Errno 111] Connection refused
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] Raising exception [108, Unable to detect an NFS service on this target.]
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] lock: released /var/lock/sm/fa472dc0-f80b-b667-99d8-0b36cb01c5d4/sr
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] ***** generic exception: sr_create: EXCEPTION <class 'SR.SROSError'>, Unable to detect an NFS service on this target.
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] return self._run_locked(sr)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] rv = self._run(sr, target)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 323, in _run
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] return sr.create(self.params['sr_uuid'], long(self.params['args'][0]))
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/NFSSR", line 198, in create
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] util._testHost(self.dconf['server'], NFSPORT, 'NFSTarget')
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/util.py", line 915, in _testHost
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] raise xs_errors.XenError(errstring)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369]
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] ***** NFS VHD: EXCEPTION <class 'SR.SROSError'>, Unable to detect an NFS service on this target.
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 372, in run
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] ret = cmd.run(sr)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] return self._run_locked(sr)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] rv = self._run(sr, target)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/SRCommand.py", line 323, in _run
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] return sr.create(self.params['sr_uuid'], long(self.params['args'][0]))
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/NFSSR", line 198, in create
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] util._testHost(self.dconf['server'], NFSPORT, 'NFSTarget')
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] File "/opt/xensource/sm/util.py", line 915, in _testHost
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369] raise xs_errors.XenError(errstring)
Apr 4 14:32:02 xcp-ng-bad-1 SM: [25369]
-
@jmccoy555 said in CEPH FS Storage Driver:
device-config:server=10.10.1.141,10.10.1.142,10.10.1.143
Are you sure that's supposed to work?
Try using only one IP address and see if the command works as intended.
-
@r1 said in CEPH FS Storage Driver:
while for CEPHSR we can use # mount.ceph addr1,addr2,addr3,addr4:remotepath localpath
Yep, as far as I know that is how you configure the failover, and, as I said, it works (with one or more IPs) from a host not in a pool.
P.S. Yes, I also tried with one IP just to be sure.
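For reference, the manual mount that works looks roughly like this, using the monitor addresses from my sr-create attempts above (the mount point is just an example):
# mkdir -p /mnt/cephfs
# mount.ceph 10.10.1.141,10.10.1.142,10.10.1.143:/xcp /mnt/cephfs -o name=xcp,secretfile=/etc/ceph/xcp.secret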
-
@jmccoy555 Can you try the latest patch?
Before applying it, restore to a normal state:
# yum reinstall sm
# cd /
# wget "https://gist.githubusercontent.com/rushikeshjadhav/ea8a6e15c3b5e7f6e61fe0cb873173d2/raw/dabe5c915b30a0efc932cab169ebe94c17d8c1ca/ceph-8.1.patch"
# patch -p0 < ceph-8.1.patch
# yum install centos-release-ceph-nautilus --enablerepo=extras
# yum install ceph-common
Note: Keep the secret in
/etc/ceph/admin.secret
with permission 600.
To handle the NFS port conflict, specifying the port is mandatory, e.g.
device-config:serverport=6789
Ceph Example:
# xe sr-create type=nfs device-config:server=10.10.10.10,10.10.10.26 device-config:serverpath=/ device-config:serverport=6789 device-config:options=name=admin,secretfile=/etc/ceph/admin.secret name-label=Ceph
NFS Example:
# xe sr-create type=nfs device-config:server=10.10.10.5 device-config:serverpath=/root/nfs name-label=NFS
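For completeness, a minimal sketch of preparing the secret file (the actual key comes from the cluster, e.g. via ceph auth get-key on a Ceph node; the value below is a placeholder):
# mkdir -p /etc/ceph
# echo "<base64 key from ceph auth get-key client.admin>" > /etc/ceph/admin.secret
# chmod 600 /etc/ceph/admin.secret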
-
@r1 Tried it from my pool...
[14:38 xcp-ng-bad-1 /]# xe sr-create type=nfs device-config:server=10.10.1.141,10.10.1.142,10.10.1.143 device-config:serverpath=/xcp device-config:serverport=6789 device-config:options=name=xcp,secretfile=/etc/ceph/xcp.secret name-label=CephFS host-uuid=c6977e4e-972f-4dcc-a71f-42120b51eacf
Error code: SR_BACKEND_FAILURE_1200
Error parameters: , not all arguments converted during string formatting,
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] lock: opening lock file /var/lock/sm/a6e19bdc-0831-4d87-087d-86fca8cfb6fd/sr
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] lock: acquired /var/lock/sm/a6e19bdc-0831-4d87-087d-86fca8cfb6fd/sr
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] sr_create {'sr_uuid': 'a6e19bdc-0831-4d87-087d-86fca8cfb6fd', 'subtask_of': 'DummyRef:|572cd61e-b30c-48cb-934f-d597218facc0|SR.create', 'args': ['0'], 'host_ref': 'OpaqueRef:5527aabc-8bd0-416e-88bf-b6a0cb2b72b1', 'session_ref': 'OpaqueRef:e83f61b2-b546-4f22-b14f-31b5d5e7ae4f', 'device_config': {'server': '10.10.1.141,10.10.1.142,10.10.1.143', 'serverpath': '/xcp', 'SRmaster': 'true', 'serverport': '6789', 'options': 'name=xcp,secretfile=/etc/ceph/xcp.secret'}, 'command': 'sr_create', 'sr_ref': 'OpaqueRef:f77a27d8-d427-4c68-ab26-059c1c576c30'}
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] _testHost: Testing host/port: 10.10.1.141,6789
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] ['/usr/sbin/rpcinfo', '-p', '10.10.1.141']
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] FAILED in util.pread: (rc 1) stdout: '', stderr: 'rpcinfo: can't contact portmapper: RPC: Remote system error - Connection refused
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] '
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] Unable to obtain list of valid nfs versions
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] lock: released /var/lock/sm/a6e19bdc-0831-4d87-087d-86fca8cfb6fd/sr
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] ***** generic exception: sr_create: EXCEPTION <type 'exceptions.TypeError'>, not all arguments converted during string formatting
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] return self._run_locked(sr)
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] rv = self._run(sr, target)
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 323, in _run
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] return sr.create(self.params['sr_uuid'], long(self.params['args'][0]))
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/NFSSR", line 222, in create
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] raise exn
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906]
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] ***** NFS VHD: EXCEPTION <type 'exceptions.TypeError'>, not all arguments converted during string formatting
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 372, in run
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] ret = cmd.run(sr)
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 110, in run
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] return self._run_locked(sr)
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 159, in _run_locked
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] rv = self._run(sr, target)
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/SRCommand.py", line 323, in _run
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] return sr.create(self.params['sr_uuid'], long(self.params['args'][0]))
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] File "/opt/xensource/sm/NFSSR", line 222, in create
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906] raise exn
Apr 6 14:38:26 xcp-ng-bad-1 SM: [8906]
Will try from my standalone host with a reboot later to see if the NFS reconnect issue has gone.
-
@jmccoy555 I have the
rpcbind
service running. Can you check on your Ceph node?
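If it isn't running there, something like this on the Ceph node should confirm and enable it (a sketch, assuming a systemd-based distribution):
# systemctl status rpcbind
# systemctl enable --now rpcbind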
-
@r1 Yep, that's it:
rpcbind
is needed. I have a very minimal Debian 10 VM hosting my Ceph (in Docker containers), as is now the way with Octopus.
I also had to swap the
host-uuid=
with
shared=true
for it to connect to all hosts within the pool (might be useful for the notes; see the example below).
Will test, check that everything is good after a reboot, and report back.
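For the notes, the working pool-wide form is along these lines (same parameters as my earlier attempt, with host-uuid= dropped in favour of shared=true):
# xe sr-create type=nfs shared=true device-config:server=10.10.1.141,10.10.1.142,10.10.1.143 device-config:serverpath=/xcp device-config:serverport=6789 device-config:options=name=xcp,secretfile=/etc/ceph/xcp.secret name-label=CephFS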
-
Just to report back... So far so good. I've moved over a few VDIs and not had any problems.
I've rebooted hosts and Ceph nodes and all is good.
NFS is also all good now.
Hope this gets merged soon so I don't have to worry about updates.
On a side note, I've also set up two pools, one of SSDs and one of HDDs, using File Layouts to assign different directories (VM SRs) to different pools.
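In case it's useful to anyone, CephFS file layouts are assigned with extended attributes on directories; a minimal sketch (the pool and directory names here are made up):
# setfattr -n ceph.dir.layout.pool -v ssd-pool /mnt/cephfs/sr-ssd
# setfattr -n ceph.dir.layout.pool -v hdd-pool /mnt/cephfs/sr-hdd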
-
@jmccoy555 Glad to know. I don't have much knowledge of File Layouts, but that looks good.
The NFS edits won't be merged, as they were just for a POC. I'm working on a dedicated CephFS SR driver which hopefully won't be affected by
sm
or other upgrades. Keep watching this space.
-
We can write a simple "driver" like we did for Gluster
-
@olivierlambert With the (experimental) CephFS driver added in 8.2.0, and given this warning in the documentation:
WARNING This way of using Ceph requires installing ceph-common inside dom0 from outside the official XCP-ng repositories. It is reported to be working by some users, but isn't recommended officially (see Additional packages). You will also need to be **careful about system updates and upgrades.**
are there any plans to put ceph-common into the official XCP-ng repositories to make updates less scary?
I have been testing this for almost 8 months now. First with only one or two VMs, now with about 8-10 smaller VMs. The Ceph cluster itself runs as 3 VMs (themselves not stored on CephFS) with SATA controllers passed through on 3 different hosts.
This has been working great, except in situations where the XCP-ng hosts are unable to reach the Ceph cluster. At one point the Ceph nodes had crashed (my fault), but I was unable to restart them because all VM operations were blocked, taking forever without ever succeeding, even though the Ceph nodes themselves are not stored on the inaccessible SR. To me it seems the XCP-ng hosts keep trying to connect endlessly, never timing out, which makes them unresponsive.
-
Hi,
Short term: no. Longer term when we have SMAPIv3: very likely, yes, at least as a community driver.
What about performance? Can you describe your setup and config in more detail?
-
I'm about to deploy the latest Ceph on 45Drives hardware and will use 8.2, finally with a decent amount of network backbone, to start building a new virtual world. I've been using NFS over CephFS on a single gigabit public network and a single gigabit private network, and it performs OK for what we do, but it cannot do any failover or live migration of VMs. This should alleviate those issues as well as give me many more options for snapshots and recovery.
So, on the latest 8.2 patches and updates, what do I need to do other than install ceph-common? Will the Ceph repository show up in XCP-ng Center or Xen Orchestra?
I've had power outages and UPS failures and this stuff just self-heals; the only issue has been mounting the CephFS after boot and then restarting NFS to recover the NFS repositories, and it just comes up. It's scalable and way less trouble to deal with than fibre SANs or iSCSI.
-
@scboley https://xcp-ng.org/docs/storage.html#cephfs
Once you do the manual stuff it will show up like any other SR in Xen Orchestra etc.
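Roughly, the manual stuff boils down to installing ceph-common in dom0 and then creating a cephfs-type SR; a sketch only (check the linked docs for the exact, current commands and device-config keys, as these may differ):
# yum install centos-release-ceph-nautilus --enablerepo=extras
# yum install ceph-common
# mkdir -p /etc/ceph
# <copy the client key into /etc/ceph/admin.secret and chmod it to 600>
# xe sr-create type=cephfs shared=true name-label=CephFS device-config:server=<monitor address> device-config:serverpath=/xcp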
-
@jmccoy555 Should I go ahead and update 8.2 to the latest patches first before doing this? I have yet to run a single patch on XCP-ng over many years; is it straightforward?
-
@scboley I would assume so, but I can't say yes. I don't think it was available before 8.2 without following the above.
-
@jmccoy555 I'm talking about 8.2.1, 8.2.2 and so forth. Is that a simple yum update on the system? I've just left it at the default version and never updated; I was on 7.6 for a long time and just took it all to 8.2, with one straggler XenServer 6.5 still in production. I've loved the stability I've had with XCP-ng without even messing with it at all.