[DEPRECATED] SMAPIv3 - Feedback & Bug reports

tjkreidl

@olivierlambert said in SMAPIv3 - Feedback & Bug reports:

You need to get rid of SMAPIv1 concepts. If you meant "iSCSI block" support, the answer for right now is no.

It's a brand new approach, so we'll take the time to find the best one and avoid all the mess SMAPIv1 had on block devices (no thin provisioning, race conditions, etc.).

I think the next big "device type" support might be raw (passing a whole disk without any extra layer to the guest).

Ages ago (in the 1980s), I experimented with raw disk I/O on VAX systems using QIO calls. Yes, it's fast, but it also doesn't take bad blocks or deteriorating disk sectors into account. I can't recall offhand if there was a way to at least update bad block lists or if you had to start from scratch.

Are there better mechanisms these days to handle such things as read/write errors and re-allocation to good blocks if bad blocks are detected on a running system?

Reference: https://www.tech-insider.org/vms/research/acrobat/7808.pdf

Andrew

@tjkreidl In days gone by, drives used to have a bad sector list printed on the case (SMD/MFM/RLL). It would also be stored on the drive for quick reference. When you formatted the drive, the software would use the bad sector list and then add to it during formatting tests. These sectors were "allocated" in the filesystem so they would not be used for normal storage. DOS and Unix support a hidden bad block list for this.

As time progressed, the controllers got smarter and bad sector avoidance moved from the OS to the controllers. The systems were able to map out bad blocks into spare sectors or tracks. As the controllers became integrated onto the drives (SCSI, IDE, etc.), the drives mapped out bad sectors automatically, hid this from the OS, and offered a continuous range of good blocks to the OS. This is why systems have moved to LBA and don't use Head/Track/Sector.

So data block X is always data block X, even if the drive moved it somewhere else. The OS does not know or care.

This contiguous whole disk range of good blocks exists today with flash storage and is automatically and dynamically handled by the flash controllers. As the flash blocks fail (or just get near failure) and get reallocated the spare block count decreases. When spare blocks reach 0 (zero, none) most flash drives force a read-only mode and the device has reached end of life. Hard drives also have a limited number of spare blocks. SMART tools can be used to check how healthy a drive is.
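
To make the remapping idea concrete, here is a minimal toy sketch in Python. It is not how any real drive firmware works; the class name, the layout (spares placed after the visible range) and the "read-only when spares hit zero" rule are simplifying assumptions taken from the description above.

```python
class ToyDrive:
    """Toy model: stable LBAs in front of a finite pool of spare blocks."""

    def __init__(self, visible_blocks: int, spare_blocks: int) -> None:
        self.visible_blocks = visible_blocks
        # Assumption: spare physical blocks sit right after the visible range.
        self.spares = list(range(visible_blocks, visible_blocks + spare_blocks))
        self.remap: dict[int, int] = {}    # LBA -> spare physical block
        self.media: dict[int, bytes] = {}  # physical block -> stored data
        self.read_only = False

    def physical(self, lba: int) -> int:
        """Translate an LBA to the physical block currently backing it."""
        if not 0 <= lba < self.visible_blocks:
            raise IndexError(f"LBA {lba} is out of range")
        return self.remap.get(lba, lba)

    def write(self, lba: int, data: bytes) -> None:
        if self.read_only:
            raise PermissionError("spare pool exhausted, drive is read-only")
        self.media[self.physical(lba)] = data

    def read(self, lba: int) -> bytes:
        return self.media.get(self.physical(lba), b"")

    def retire(self, lba: int) -> None:
        """Move the block behind `lba` to a spare, as a controller would."""
        if not self.spares:
            raise OSError("no spare blocks left, unrecoverable")
        old = self.physical(lba)
        new = self.spares.pop(0)
        if old in self.media:
            self.media[new] = self.media.pop(old)
        self.remap[lba] = new
        if not self.spares:
            self.read_only = True  # end of life once spares reach zero
```

The caller only ever uses `read` and `write` with the same LBA; `retire` changes what `physical` returns without the caller noticing, which is the "OS does not know or care" part.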

So today RAW drive/storage devices are not really raw but managed by the device and storage controller (flash, SATA, SAS, RAID, etc.) to provide good blocks. An I/O failure is very bad, as it indicates a true unrecoverable failure and that it is time to replace the drive.

tjkreidl

@Andrew Thank you for that, much appreciated. Although I was aware of this process for SSD drives, I did not know that spinning disks had become that much smarter in the interim (~40 years!). But in any case, raw drives are very powerful if you have decent code to access them and the overhead can be appreciably less than with formatted drives.

Forza

@olivierlambert Hi. I'm also eager to see how the new v3 is progressing. From my company's point of view, being able to compact VDIs using guest trim/unmap is very valuable, as it minimises storage space usage and improves backup/restore speeds.

olivierlambert

A big blog post is coming soon. I need to check with @matiasvl about trim passing via raw tapdisk datapath.

swivvle

Please let us know when we can test that new zfs-ng!

Chmura

SMAPI v3 looks very exciting. Unfortunately, tapdisk is still at the bottom of the stack, and it has one limitation, but a very serious one: no I/O or bandwidth limit ;(

olivierlambert

It's not obvious/100% sure that tapdisk is the bottleneck

Chmura

@olivierlambert
Hmm, if we create a volume plugin that combines Linux cgroups (IOPS/bandwidth limit) with a filesystem (ZFS block device, zvol), that would be one possible workaround no matter what's at the bottom.
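
A rough sketch of what the cgroup side could look like, assuming cgroup v2 and its `io.max` interface file. The helper names and the device numbers are placeholders, and whether a limit set this way would actually throttle guest I/O going through tapdisk on Xen is an open assumption, not something tested here.

```python
from pathlib import Path
from typing import Optional


def format_io_max(
    major: int,
    minor: int,
    *,
    rbps: Optional[int] = None,
    wbps: Optional[int] = None,
    riops: Optional[int] = None,
    wiops: Optional[int] = None,
) -> str:
    """Build one line in the cgroup v2 io.max format: 'MAJ:MIN key=value ...'."""
    limits = {"rbps": rbps, "wbps": wbps, "riops": riops, "wiops": wiops}
    # Only emit the keys that were set; omitted keys are left as they are.
    parts = [f"{key}={value}" for key, value in limits.items() if value is not None]
    if not parts:
        raise ValueError("at least one limit is required")
    return f"{major}:{minor} " + " ".join(parts)


def apply_io_max(cgroup_dir: Path, line: str) -> None:
    """Write the limit into a cgroup (needs root and the io controller enabled)."""
    (cgroup_dir / "io.max").write_text(line + "\n")


# Placeholder device numbers; a real zvol would have its own major:minor.
example = format_io_max(8, 16, wbps=10 * 1024 * 1024, wiops=500)
```

The process doing the I/O for the disk would also have to be placed in that cgroup, which is a separate step not shown here.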

olivierlambert

I think you are oversimplifying how storage works in Xen. It's not KVM.

See https://xcp-ng.org/blog/2022/07/27/grant-table-in-xen/ for more details.

netracerx

Sorry to resurrect an old topic, wasn't sure if updates were being made to a new topic. Wanted to ask how goes the implementation, and if there is code (or will be code) to support SMAPIv3 in Xen Orchestra (xoce/XOA), even as a development version? If there's a newer topic that I missed, please point me that direction! Thanks!

olivierlambert

Nothing new right now (yet). For now, migrating stuff from Python2 to 3 is taking its toll…

nikade

@olivierlambert sorry for piggybacking on an old thread but I thought it would be best to keep it together.

We are (like many others) looking for alternatives to our VMware platform. We're already using XCP and feel that it could be a good alternative once XOSTOR is ready.

One thing that we and others (for example, I read this in pretty much every thread on Reddit) are struggling with is the 2 TiB VDI limit. Many on-prem enterprises are running file servers or large SQL servers which require a big VDI (which is not ideal, I know).
Is this being resolved in SMAPIv3?
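
For context on where a 2 TiB figure can come from: the limit is commonly attributed to the VHD format addressing data with 32-bit sector offsets and 512-byte sectors. That attribution is an assumption on my side, not something stated in this thread, but the arithmetic is easy to check:

```python
SECTOR_SIZE = 512   # bytes per sector (assumed VHD sector size)
OFFSET_BITS = 32    # assumed width of a sector offset field
TIB = 2 ** 40


def max_addressable_bytes(offset_bits: int, sector_size: int) -> int:
    """Largest byte range reachable with `offset_bits`-wide sector offsets."""
    return (2 ** offset_bits) * sector_size


limit = max_addressable_bytes(OFFSET_BITS, SECTOR_SIZE)
print(limit / TIB)  # 2.0
```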

olivierlambert

Short answer: yes. Our goal is to have a local SMAPIv3 SR available in 8.3 in the "short term" to demonstrate what's already doable with it. It will likely be ZFS-based behind the scenes, allowing you to use any VDI size while enjoying ZFS perks (compression).

nikade

This sounds great!
So it is really that close to becoming a reality?
Can't wait for it to be released. This will probably be a huge performance increase as well.

olivierlambert

It's a bit more subtle than this. SMAPIv3 provides a decoupling between the volume and the data path. It means you can use whatever way to store your volumes (like a regular SMAPIv1 driver for example) BUT also choose the datapath (tapdisk with VHD or raw is the only choice in v1).

With v3, you could use any other datapath, like qemu-dp or other future solutions.

Since VHD isn't mandatory for snapshots or anything else (as long as you implement a way to do it yourself), it allows you to delegate some operations to the storage itself.

Here is an example with ZFS-ng driver: https://xcp-ng.org/blog/2022/09/23/zfs-ng-an-intro-on-smapiv3/

In short, we use tapdisk in raw mode, and let ZFS handle the snapshots and so on.
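
The decoupling described above could be pictured roughly like this. It is only an illustrative sketch: `VolumePlugin`, `Datapath`, `FakeZfsVolumes`, the `zvol://` URI scheme and the returned strings are all made up for the example and are not the real SMAPIv3 interfaces or the zfs-ng driver. The ZFS commands are only recorded as strings, never executed.

```python
from dataclasses import dataclass, field
from typing import Protocol


class VolumePlugin(Protocol):
    """Hypothetical volume side: where and how disks are stored."""
    def create(self, name: str, size: int) -> str: ...
    def snapshot(self, uri: str, snap_name: str) -> str: ...


class Datapath(Protocol):
    """Hypothetical datapath side: how a stored disk reaches the guest."""
    def attach(self, uri: str) -> str: ...


@dataclass
class FakeZfsVolumes:
    """Stand-in for a ZFS-backed volume plugin; it only records commands."""
    pool: str
    commands: list[str] = field(default_factory=list)

    def create(self, name: str, size: int) -> str:
        self.commands.append(f"zfs create -V {size} {self.pool}/{name}")
        return f"zvol://{self.pool}/{name}"

    def snapshot(self, uri: str, snap_name: str) -> str:
        # The snapshot is delegated to the storage, no VHD chain involved.
        dataset = uri.removeprefix("zvol://")
        self.commands.append(f"zfs snapshot {dataset}@{snap_name}")
        return f"zvol://{dataset}@{snap_name}"


class RawTapdisk:
    def attach(self, uri: str) -> str:
        return f"tapdisk[raw] <- {uri}"


class QemuDp:
    def attach(self, uri: str) -> str:
        return f"qemu-dp <- {uri}"


@dataclass
class StorageRepository:
    """Pairs any volume plugin with any datapath."""
    volumes: VolumePlugin
    datapath: Datapath

    def provision(self, name: str, size: int) -> str:
        return self.datapath.attach(self.volumes.create(name, size))
```

Swapping `RawTapdisk()` for `QemuDp()` changes how the disk is served without touching the volume code, and `snapshot` never needs to know which datapath is in use.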

The first driver will be available in XCP-ng 8.3, with some features still missing (no migration path from v1 to v3, no storage migration and no backup). We are prioritizing a way for XO to back up this first implementation. This way, you could back up a SMAPIv1-based VM disk and restore it on SMAPIv3 ZFS-ng, providing a "cold/warm" migration to it.

nikade

Alright!
So a clean installation of 8.3 with SMAPIv3 would probably be the best way to test it once 8.3 is released.
If you were to summarize the 3 biggest differences (may be positive or negative) between SMAPIv1 and v3, what would those be?

I'm thinking one would be this migration limitation, but other than that?

olivierlambert

The differences will depend on the "when". In the end, the goal is to keep the flexibility of v1 (i.e. live storage motion between any kind of storage repository, CBT/diff capabilities, etc.) without any of its current limitations (no 2TiB limit, potential new/other datapaths).

To me, the best thing about SMAPIv3 is the flexibility (which is also a challenge, since it has to appear seamless to you as a user regardless of the storage you use). But this flexibility could offer fast paths and offload to devices.

E.g. right now with SMAPIv1, it's ultra flexible, but the whole thing is managed by the dom0 (we have to VHD everything, coalesce a chain, garbage collect snapshots, etc.).
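
For readers who haven't met the chain/coalesce idea, here is a toy model of it. It is not the VHD format or the real garbage collector: `Layer`, `read` and `coalesce` are invented for illustration, and each layer is just a dict of the blocks written since its parent was frozen.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Layer:
    """One node of a toy snapshot chain; holds only blocks written at this level."""
    name: str
    blocks: dict[int, bytes] = field(default_factory=dict)
    parent: Optional["Layer"] = None


def read(layer: Layer, block: int) -> bytes:
    """Walk from the leaf towards the base until some layer has the block."""
    node: Optional[Layer] = layer
    while node is not None:
        if block in node.blocks:
            return node.blocks[block]
        node = node.parent
    return b""  # never written anywhere in the chain


def chain_length(layer: Layer) -> int:
    length = 0
    node: Optional[Layer] = layer
    while node is not None:
        length += 1
        node = node.parent
    return length


def coalesce(node: Layer, child: Layer) -> None:
    """Merge `node` into its parent, then re-point `child` past it."""
    if node.parent is None:
        raise ValueError("the base layer has no parent to merge into")
    if child.parent is not node:
        raise ValueError("child must sit directly on top of node")
    node.parent.blocks.update(node.blocks)  # newer data wins over the parent's
    child.parent = node.parent
```

Reads from the leaf return the same data before and after `coalesce`; only the chain gets shorter. In this model all of that work happens on the host side, which is the cost being described here.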

With SMAPIv3, you can do the same, but also, in some cases, delegate to specific hardware. Let's take an example: a Pure Storage flash array. Those things have an API, so you could have a specific driver talking to the array to do snapshots and so on. So no more coalesce to deal with on XCP-ng: the storage will do it for you. That's just an example, but it gives a degree of freedom to provide many different capabilities, some of them more "native" to a storage tech.

The downside is that it's up to us to develop a universal way to move from one storage to another, and to export your data too.

nikade

Sounds like there will be a lot to think about!
I'm just happy this is finally happening. It will be a huge improvement for everyone, including new users who haven't had to struggle with the coalesce train in the past.

swivvle

@olivierlambert as someone who only uses zfs for vm storage on all of their xcp-ng hosts, this makes me very happy.