Your own GPU-Powered LLMs with XCP-ng

Tutorial Jul 28, 2025

The recent wave of large language models (LLMs) has been transformative, but also incredibly centralized. Most people experience them through APIs provided by big cloud vendors, with all the usual trade-offs: latency, cost, lack of control, and privacy concerns.

But what if you could run your own LLM locally, right next to your data, with full performance and zero dependency on cloud access?

In this article, we’ll guide you through building a small local LLM virtual machine running on XCP-ng. The best part? You get near bare-metal performance, but inside a VM, which means all the benefits of flexibility, isolation, and ease of management without sacrificing speed, and all while keeping your other VMs running alongside it.

What we'll cover

In the next sections, we’ll walk you through:

  1. Preparing your XCP-ng host: hardware considerations (especially for GPU usage)
  2. Creating the VM: choosing the right Linux distro and VM config
  3. Configuring GPU passthrough: unlocking acceleration for AI workloads
  4. Installing your LLM stack: from Open WebUI to models like Mistral or LLaMA
  5. Access and testing: how to use your local AI assistant from your browser or API

By the end, you'll have your own private AI, running fast and securely on your XCP-ng infrastructure.

Our setup

This isn’t a purpose-built AI server, far from it! We're using an 8-year-old Lenovo ThinkSystem SR650 7X06 2U server we managed to get for free (though you can find refurbished units between $500 and $800 depending on the configuration).

It’s running a single Xeon Silver 4112 and 64 GiB of RAM: nothing fancy by today’s standards. But the real star here is the Nvidia Tesla T4 with 16 GiB of VRAM. This GPU is about the same age as the server and can also be found fairly cheap on the refurbished market.

💡
This GPU is well-suited for rack servers thanks to its passive (fanless) design. However, in a typical PC case or homelab setup, it can overheat easily, so keep that in mind. If you're building a consumer-grade setup, you're probably better off with a consumer-grade GPU too!

Yes, the hardware is old: but that’s the whole point. You don’t need cutting-edge gear or a hyperscaler budget to run local LLMs effectively. Plus, as a 2U machine, it’s surprisingly quiet: definitely not a datacenter monster.

And most importantly, it shows how XCP-ng can help you extract maximum value from this kind of hardware, thanks to virtualization and PCI passthrough.

XCP-ng installation and configuration

Installing XCP-ng is very straightforward; if you want to see all the steps, please read our documentation: https://docs.xcp-ng.org/installation/install-xcp-ng/

The next step is to deploy XOA, or to connect an existing XOA to your freshly installed host:

Let's go back to the XO 5 view to easily enable PCI passthrough for our Tesla T4 card. In the host's Advanced view, click on the line containing the GPU PCI device.

Clicking the toggle will prompt you for a reboot:

After you click OK, your host will automatically reboot.

After the reboot, you should have the device toggled on:
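
If you prefer the command line, or want to script this step, the same thing can be done from dom0. A rough sketch, where 0000:af:00.0 is a placeholder for your card's PCI address as reported by lspci (double-check the exact procedure against the XCP-ng PCI passthrough documentation for your release):

lspci | grep -i nvidia   # find the card's PCI address
/opt/xensource/libexec/xen-cmdline --set-dom0 "xen-pciback.hide=(0000:af:00.0)"   # placeholder address
reboot   # dom0 needs a reboot for the change to apply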

Great! Now we can quickly create our test VM.

VM creation

Let’s be honest: we’re lazy, and we’re not in the mood to install Debian manually. Thankfully, we don’t have to.

We used XOA Hub to deploy a ready-to-go Debian 12 VM image with cloud-init support. Just head over to the Hub tab in Xen Orchestra and pick:

Now we'll create the new Debian VM from this template, with some reasonable space, RAM and CPUs:

One important thing here: we unchecked the "boot at VM creation" option. Why? Because before starting the VM, we want to attach the GPU via PCI passthrough:

That’s it: the GPU is now assigned to your VM and will be passed through directly to the guest. When the VM boots, it will see the GPU as if it were running on bare metal.
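
If you ever want to script this attachment instead of clicking through the UI, the same can be done with xe from dom0. A sketch, assuming the VM is named Test-LLM like ours and using the placeholder PCI address 0000:af:00.0 (verify the syntax against the XCP-ng documentation for your release):

xe vm-list name-label=Test-LLM params=uuid   # find the VM's uuid
xe vm-param-set uuid=<vm_uuid> other-config:pci=0/0000:af:00.0   # attach the device for passthrough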

This is one of the most powerful things about XCP-ng: clean, simple passthrough without needing to mess with obscure kernel options or edit XML files.

Thanks to the cloud-init integration, your SSH key was automatically injected, and the root disk was resized to the full 500 GiB we defined at creation. No extra steps needed.

Just wait a few seconds after boot, then SSH into the machine with: ssh debian@<VM_IP_address>
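
If you want to double-check that cloud-init did its job, a couple of standard commands are enough (nothing XCP-ng-specific here):

lsblk   # the virtual disk should show the full size set at VM creation
df -h /   # the root filesystem should have been grown to match
cat ~/.ssh/authorized_keys   # the SSH key injected by cloud-init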

Let's do a first check:

root@Test-LLM:/home/debian# lspci 
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

The GPU is here, it's a go for the next step!

VM configuration

Now that the basics are ready, we need to install the Nvidia drivers with CUDA support so we can use the card for LLMs. As root (run sudo -s first):

apt install linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg -i cuda-keyring_1.1-1_all.deb
apt update
apt install nvidia-driver-cuda nvidia-kernel-dkms

You must reboot to load the driver. After that, we can check if it works with a simple nvidia-smi:

root@Test-LLM:/home/debian# nvidia-smi 
Thu Jul 24 17:31:23 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       On  |   00000000:00:05.0 Off |                  Off |
| N/A   60C    P8             15W /   70W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Great, we are ready to work on the LLM part!

Ollama & Open WebUI

Now that our VM is up and running (with the GPU attached) it’s time to set up the LLM side of things. We’ll split this into two steps:

  1. Install Ollama directly on the VM.
    This will let us quickly test the GPU and run models from the CLI to make sure everything works as expected.
  2. Install Open WebUI using Docker.
    This gives you a clean, browser-based interface to interact with Ollama and your models—ideal for regular use or sharing with teammates.

Why both? Because Ollama makes it trivial to download and serve models, and Open WebUI makes it actually pleasant to use them.

Let’s start with the basics: testing GPU support with Ollama.

Ollama

It's trivial to install in our VM:

apt install curl
curl -fsSL https://ollama.com/install.sh | sh
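
On a systemd-based distro like our Debian 12 image, the install script also registers Ollama as a service, so you can quickly confirm it is up before pulling anything:

systemctl status ollama   # the service created by the install script
ollama --version   # confirms the CLI is available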

That's it! Now let's check that the GPU is actually used by running a quick test with deepseek-r1:

ollama pull deepseek-r1

You can find all the available models by name on the Ollama website: https://ollama.com

When it's done, we can start interacting with it:

# ollama run deepseek-r1
>>> Send a message (/? for help)

You can type anything and watch the GPU usage in another terminal to confirm it works. For example, type Hi! I'm just testing the model while running watch nvidia-smi in a second terminal:

You can see which process is using the GPU, as well as the power draw.
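
The interactive CLI is handy for a first test, but Ollama also exposes a local REST API on port 11434, which is what Open WebUI will use in the next step. A minimal sketch with curl, assuming the deepseek-r1 model we just pulled:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1",
  "prompt": "Hi! I am just testing the model.",
  "stream": false
}'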

Great! Now that we have Ollama working, we'll add a web UI on top of it: Open WebUI.

💡
Tip: you can also use nvtop (in "contrib" repo) to monitor the GPU usage.
nvtop showing a request to the GPU (the big bulge)
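
Since nvtop lives in Debian's contrib section, you may need to enable that component first. A sketch for Debian 12 (adapt it to how your APT sources are laid out):

# add "contrib" to the components in /etc/apt/sources.list, for example:
#   deb http://deb.debian.org/debian bookworm main contrib non-free-firmware
apt update
apt install nvtop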

Open WebUI

We'll use Open WebUI for our LLM setup. It's relatively easy to install and use, while providing a web UI that's close to the most popular products, like ChatGPT. The easiest approach is to use Docker, so first follow the usual Debian Docker installation. Alternatively, you can use Podman.
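
For reference, the usual Debian Docker installation looks roughly like this at the time of writing (as root, following Docker's official APT repository instructions; double-check against the Docker documentation):

apt install ca-certificates curl
install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/debian/gpg -o /etc/apt/keyrings/docker.asc
chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/debian bookworm stable" > /etc/apt/sources.list.d/docker.list
apt update
apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Once Docker is in place, pulling and running Open WebUI is a two-liner: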

docker pull ghcr.io/open-webui/open-webui:main
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main
💡
Note: the --add-host option maps host.docker.internal to the host gateway, so the Open WebUI container can reach the Ollama server running directly on the VM.
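
If your Ollama instance runs on another machine or port, you can also point Open WebUI at it explicitly with the OLLAMA_BASE_URL environment variable (the <ollama_host> below is a placeholder):

docker run -d -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://<ollama_host>:11434 \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main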

Now you can reach your VM's IP address on the chosen port, in my case http://192.168.1.90:3000.

Click on "Get started" and fill the form:

Congrats, you are in!

Since we already installed DeepSeek, it should already be available:

With Ollama installed and Open WebUI running in Docker, you now have a fully functional local LLM setup, GPU-accelerated and easy to use.

Open WebUI adds a polished user experience on top of Ollama:

  • A clean web interface
  • User management
  • Model-level permissions
  • And other handy features

Most importantly, everything runs locally. You’re not sending prompts or data to any external service. Even though you’re not training your own model, you can still leverage techniques like RAG (Retrieval Augmented Generation) to improve your model’s responses, using your own data, on your own terms.

This setup gives you the best of both worlds: the flexibility of VMs, the performance of bare metal, and the control of running on-prem.

Got GPU? Got XCP-ng? Then you’ve got private AI, ready to go!

Olivier Lambert

Vates CEO & co-founder, Xen Orchestra and XCP-ng project creator. Enthusiast entrepreneur and Open Source advocate. A very happy Finnish Lapphund owner.