vLLM has become a popular choice for teams that need fast, efficient large language model inference on Ubuntu. It is designed for high-throughput model serving, supports an OpenAI-compatible API, and helps make better use of GPU memory through its optimized attention and KV cache handling. If you want to run LLMs locally, power internal AI tools, or deploy a production-ready inference server, vLLM is an excellent option.
This guide explains how to install vLLM on Ubuntu, verify your GPU environment, launch an API server, test inference requests, and tune the setup for better performance and reliability.
Why vLLM stands out for LLM serving
Traditional serving stacks typically preallocate one contiguous KV cache slab sized for the maximum context length, so most of that memory sits idle whenever a request is shorter. vLLM addresses this with PagedAttention, a memory management strategy that stores the KV cache in small fixed-size blocks, similar to virtual memory paging in an operating system. This lets the runtime allocate KV cache incrementally, batch requests continuously, and serve many concurrent requests with higher throughput.
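As a toy illustration (not vLLM's actual implementation), block-based allocation reserves memory in small increments instead of one maximum-length slab per request:

```python
# Toy sketch of paged KV cache allocation. Tokens fill fixed-size blocks,
# so a sequence only reserves memory in block-sized increments.
BLOCK_SIZE = 16  # illustrative block size, not vLLM's default

def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of KV cache blocks a sequence of num_tokens occupies."""
    return -(-num_tokens // block_size)  # ceiling division

# A 100-token sequence occupies 7 blocks (112 slots),
# not a slab sized for a 4096-token maximum context.
print(blocks_needed(100))  # 7
```

The memory saving comes from the gap between the slab a request *could* need and the blocks it actually uses, which is what lets vLLM pack more concurrent sequences onto the same GPU.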
Key advantages of vLLM include:
- Higher inference throughput for large language models
- Improved GPU memory efficiency
- Continuous batching for concurrent requests
- Compatibility with the OpenAI API format
- Support for production LLM deployment workflows
Prerequisites for installing vLLM on Ubuntu
Before you begin, make sure your server or workstation meets the typical requirements for vLLM deployment.
- Ubuntu 20.04 or Ubuntu 22.04
- An NVIDIA GPU with sufficient VRAM, ideally 16 GB or more
- NVIDIA driver version 525 or newer
- CUDA 12.1 or later
- Python 3.9 through 3.12
- At least 32 GB of system memory for smoother model loading
If your GPU has limited VRAM, you can still run models by selecting a smaller checkpoint or using quantized model formats such as AWQ, GPTQ, or bitsandbytes.
Step 1: Confirm GPU and CUDA availability
Start by checking that Ubuntu can see your NVIDIA GPU and that the driver stack is working correctly.
nvidia-smi
nvcc --version || nvidia-smi | grep "CUDA Version"
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
These commands help verify GPU status, CUDA support, and available video memory. If nvidia-smi fails, resolve the NVIDIA driver issue before installing vLLM.
Step 2: Create a Python virtual environment
Using a dedicated Python environment is the safest way to install vLLM and its dependencies without affecting other projects.
sudo apt-get update && sudo apt-get install -y python3 python3-pip python3-venv
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip
This prepares an isolated environment for your vLLM installation on Ubuntu.
Step 3: Install vLLM
Once the environment is active, install the vLLM package with pip.
pip install vllm
python3 -c "import vllm; print(vllm.__version__)"
The installation may take some time because it can pull in large dependencies, including PyTorch and CUDA-related libraries. After installation, the version check confirms that the package is available and working.
Step 4: Prepare a model from Hugging Face
vLLM can load models directly from Hugging Face Hub or from a local directory. Many teams allow vLLM to download the model automatically on first launch, but pre-downloading can reduce startup delays.
pip install huggingface_hub
huggingface-cli login
You only need to authenticate if the model is gated or requires a usage agreement. To download a smaller test model ahead of time, run:
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='meta-llama/Llama-3.2-1B-Instruct', local_dir='/opt/models/llama-3.2-1b-instruct')"
A lightweight model is ideal for validating your Ubuntu vLLM setup before moving to larger checkpoints.
Step 5: Start the vLLM OpenAI-compatible API server
One of the biggest advantages of vLLM is that it can expose models through an API that mirrors the OpenAI interface. This makes it easier to reuse existing applications with minimal code changes.
To launch the server using a remote Hugging Face model reference (recent vLLM releases also provide the equivalent vllm serve shorthand for the same entrypoint):
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0 --port 8000
To launch the server with a model stored locally:
python3 -m vllm.entrypoints.openai.api_server --model /opt/models/llama-3.2-1b-instruct --host 0.0.0.0 --port 8000 --served-model-name llama-3.2-1b
The --served-model-name value is the model identifier clients will use in API requests.
Useful vLLM server options
For production LLM inference, you will often want to adjust memory usage, context length, or multi-GPU behavior.
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --max-model-len 4096 --gpu-memory-utilization 0.90 --dtype bfloat16 --max-num-seqs 256 --enable-chunked-prefill --api-key your-secret-key
Important parameters include:
- tensor-parallel-size to distribute inference across multiple GPUs
- max-model-len to limit context size and reduce memory pressure
- gpu-memory-utilization to define how aggressively vLLM uses available VRAM
- dtype to control numerical precision based on your GPU architecture
- max-num-seqs to tune concurrency
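To see why max-model-len matters for memory pressure, it helps to estimate how the KV cache grows with context length. A back-of-the-envelope sketch using Llama-3-8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) in bfloat16:

```python
# Rough KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
# Figures below are for Llama-3-8B with grouped-query attention, stored in bfloat16.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2  # bfloat16

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_per_token)  # 131072 bytes, i.e. 128 KiB per token

# A single sequence at the full 4096-token context holds ~0.5 GiB of KV cache,
# so max-model-len directly caps per-request memory.
ctx = 4096
print(kv_per_token * ctx / 2**30)  # 0.5
```

This is only the KV cache; model weights and activation buffers come on top, which is what gpu-memory-utilization budgets against.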
How to test the vLLM API server
After the server is running, send a request to confirm that inference works correctly.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama-3.2-1b","messages":[{"role":"user","content":"Explain what a KV cache is in 2 sentences."}],"max_tokens":200,"temperature":0.7}'
You can also test classic text completion behavior:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model":"llama-3.2-1b","prompt":"The capital of France is","max_tokens":50}'
To view the list of models currently exposed by the service:
curl http://localhost:8000/v1/models
This is especially useful when checking whether your served-model-name matches the model name sent by client applications.
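The response follows the standard OpenAI list shape. A short sketch of pulling the served IDs out of it (the sample payload below is abridged, not a full server response):

```python
import json

# Abridged example of a /v1/models response body from the server.
sample = '{"object": "list", "data": [{"id": "llama-3.2-1b", "object": "model"}]}'

# Each entry's "id" is the name clients must send in the "model" field.
served_ids = [m["id"] for m in json.loads(sample)["data"]]
print(served_ids)  # ['llama-3.2-1b']
```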
Connecting a Python client to vLLM
Because vLLM uses an OpenAI-compatible interface, you can point many standard OpenAI SDK integrations at your local Ubuntu server simply by changing the base URL.
If your application already uses an OpenAI client library, update the endpoint to http://localhost:8000/v1 and use the model name exposed by your server. This makes vLLM attractive for internal copilots, RAG systems, and private AI infrastructure where data should remain on-premises.
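As a dependency-free sketch of what such a request looks like on the wire (the helper name and defaults here are my own, not a vLLM API), using only the standard library; the official OpenAI SDK works the same way once its base_url points at the server:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, api_key=None, max_tokens=200):
    """Build an OpenAI-style chat completion request for a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "llama-3.2-1b", "Hello!")
# response = urllib.request.urlopen(req)  # uncomment once the server is running
```

Swapping between OpenAI's hosted API and your local server then comes down to the base URL, the model name, and the API key.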
Using quantized models to reduce VRAM usage
Quantization is a practical way to run larger models on smaller GPUs. vLLM supports several quantized model formats that can lower memory requirements while preserving useful inference speed.
Example with AWQ:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-13B-chat-AWQ --quantization awq --max-model-len 4096
Example with GPTQ:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-13B-chat-GPTQ --quantization gptq
Example with bitsandbytes:
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --quantization bitsandbytes --load-format bitsandbytes --tensor-parallel-size 4
If you are deploying vLLM on Ubuntu with limited GPU capacity, quantized models can make the difference between a failed launch and a stable inference service.
Running vLLM as a systemd service
For persistent deployments, it is better to run the server as a managed Linux service. This allows automatic restarts and system boot integration.
Create a service file at /etc/systemd/system/vllm.service with your preferred environment variables, model path, and startup command. Then reload systemd and enable the service.
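A minimal unit-file sketch to adapt (the user, virtual environment path, model path, and cache directory below are placeholders for your own setup):

```ini
[Unit]
Description=vLLM OpenAI-compatible API server
After=network-online.target

[Service]
User=vllm
Environment=HF_HOME=/opt/hf-cache
ExecStart=/home/vllm/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model /opt/models/llama-3.2-1b-instruct \
    --host 0.0.0.0 --port 8000 \
    --served-model-name llama-3.2-1b
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Restart=always gives you automatic recovery from crashes, and pointing ExecStart at the virtual environment's interpreter keeps the service tied to the isolated install from Step 2.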
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f
This setup is useful for production Ubuntu servers hosting local LLM APIs behind internal networks, reverse proxies, or application gateways.
Benchmarking vLLM throughput
Performance testing helps you determine how many prompts per second your infrastructure can support. Benchmarking is especially important when tuning concurrency, prompt size, and model selection.
The benchmark scripts live in the vLLM source repository rather than in the pip package, so clone it first:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install aiohttp
python3 benchmarks/benchmark_throughput.py --backend vllm --model meta-llama/Llama-3.2-1B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 10
python3 benchmarks/benchmark_serving.py --backend openai-chat --model llama-3.2-1b --base-url http://localhost:8000 --num-prompts 100
These tests can help identify the best values for throughput, latency, and request batching in your Ubuntu vLLM deployment.
Troubleshooting common vLLM issues on Ubuntu
If the model fails to load because of insufficient memory, reduce the context length or use quantization.
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-1B-Instruct --max-model-len 2048 --gpu-memory-utilization 0.80
If model downloads are very slow, enable faster Hugging Face transfer support.
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
If the server starts but requests fail, confirm the served model name and inspect service logs.
curl http://localhost:8000/v1/models
sudo journalctl -u vllm --since "5 minutes ago" -f
You may also notice the first request is slower than later requests. This is normal because CUDA kernels and internal caches often need an initial warm-up pass before steady-state performance is reached.
Best practices for production use
If you plan to use vLLM in production on Ubuntu, consider the following recommendations:
- Start with a smaller model to validate the environment
- Use a dedicated virtual environment for package isolation
- Serve the API behind a reverse proxy for TLS and access control
- Set an API key when exposing endpoints to other systems
- Benchmark different values for concurrency and context length
- Use quantized models when VRAM is constrained
- Run the service through systemd for stability and automatic restarts
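For the reverse-proxy recommendation, one common option is nginx. A minimal site-config sketch (the hostname and certificate paths are placeholders) that terminates TLS and forwards API traffic to the local vLLM port:

```nginx
# Hypothetical nginx site config for fronting a local vLLM server with TLS.
server {
    listen 443 ssl;
    server_name llm.internal.example.com;      # placeholder hostname
    ssl_certificate     /etc/ssl/certs/llm.crt;
    ssl_certificate_key /etc/ssl/private/llm.key;

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;       # full /v1/... path is forwarded
        proxy_set_header Host $host;
        proxy_read_timeout 300s;                # long generations stream slowly
    }
}
```

Keeping vLLM bound to localhost behind the proxy means TLS, access control, and rate limiting live in one place instead of in the inference server.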
Conclusion
Installing vLLM on Ubuntu is a strong choice for organizations and developers who want fast, efficient local LLM inference with an OpenAI-compatible API. With proper GPU support, a clean Python environment, and the right model configuration, you can build a reliable AI serving stack for chatbots, internal assistants, RAG pipelines, and privacy-sensitive workloads.
By combining vLLM’s optimized inference engine with Ubuntu’s stability, you get a flexible foundation for scalable language model serving. Whether you are experimenting with a small test model or operating a larger multi-GPU deployment, the steps above provide a practical path to getting vLLM running smoothly in development and production.