vLLM has become a popular choice for teams that need fast, efficient large language model inference on Ubuntu. It is designed for high-throughput model serving, supports an OpenAI-compatible API, and helps make better use of GPU memory through its optimized attention and KV cache handling. If you want to run LLMs locally, power internal AI tools, or deploy a production-ready inference server, vLLM is an excellent option.
This guide explains how to install vLLM on Ubuntu, verify your GPU environment, launch an API server, test inference requests, and tune the setup for better performance and reliability.
Why vLLM stands out for LLM serving
Traditional serving stacks typically preallocate one contiguous KV cache slab sized for the maximum context length, so most of that memory sits idle whenever a request is shorter. vLLM addresses this with PagedAttention, a memory management strategy that stores the KV cache in small fixed-size blocks, similar to virtual memory paging in an operating system. This lets the runtime allocate KV cache incrementally, batch requests continuously, and serve many concurrent requests with higher throughput.
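As a toy illustration (not vLLM's actual implementation), block-based allocation reserves memory in small increments instead of one maximum-length slab per request:

```python
# Toy sketch of paged KV cache allocation. Tokens fill fixed-size blocks,
# so a sequence only reserves memory in block-sized increments.
BLOCK_SIZE = 16  # illustrative block size, not vLLM's default

def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of KV cache blocks a sequence of num_tokens occupies."""
    return -(-num_tokens // block_size)  # ceiling division

# A 100-token sequence occupies 7 blocks (112 slots),
# not a slab sized for a 4096-token maximum context.
print(blocks_needed(100))  # 7
```

The memory saving comes from the gap between the slab a request *could* need and the blocks it actually uses, which is what lets vLLM pack more concurrent sequences onto the same GPU.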
Key advantages of vLLM include:
- Higher inference throughput for large language models
- Improved GPU memory efficiency
- Continuous batching for concurrent requests
- Compatibility with the OpenAI API format
- Support for production LLM deployment workflows
Prerequisites for installing vLLM on Ubuntu
Before you begin, make sure your server or workstation meets the typical requirements for vLLM deployment.
- Ubuntu 20.04 or Ubuntu 22.04
- An NVIDIA GPU with sufficient VRAM, ideally 16 GB or more
- NVIDIA driver version 525 or newer
- CUDA 12.1 or later
- Python 3.9 through 3.12
- At least 32 GB of system memory for smoother model loading
If your GPU has limited VRAM, you can still run models by selecting a smaller checkpoint or using quantized model formats such as AWQ, GPTQ, or bitsandbytes.
Step 1: Confirm GPU and CUDA availability
Start by checking that Ubuntu can see your NVIDIA GPU and that the driver stack is working correctly.
nvidia-smi
nvcc --version || nvidia-smi | grep "CUDA Version"
nvidia-smi --query-gpu=memory.total,memory.free --format=csv
These commands help verify GPU status, CUDA support, and available video memory. If nvidia-smi fails, resolve the NVIDIA driver issue before installing vLLM.
Step 2: Create a Python virtual environment
Using a dedicated Python environment is the safest way to install vLLM and its dependencies without affecting other projects.
sudo apt-get update && sudo apt-get install -y python3 python3-pip python3-venv
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install --upgrade pip
This prepares an isolated environment for your vLLM installation on Ubuntu.
Step 3: Install vLLM
Once the environment is active, install the vLLM package with pip.
pip install vllm
python3 -c "import vllm; print(vllm.__version__)"
The installation may take some time because it can pull in large dependencies, including PyTorch and CUDA-related libraries. After installation, the version check confirms that the package is available and working.
Step 4: Prepare a model from Hugging Face
vLLM can load models directly from Hugging Face Hub or from a local directory. Many teams allow vLLM to download the model automatically on first launch, but pre-downloading can reduce startup delays.
pip install huggingface_hub
huggingface-cli login
You only need to authenticate if the model is gated or requires a usage agreement. To download a smaller test model ahead of time, run:
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='meta-llama/Llama-3.2-1B-Instruct', local_dir='/opt/models/llama-3.2-1b-instruct')"
A lightweight model is ideal for validating your Ubuntu vLLM setup before moving to larger checkpoints.
Step 5: Start the vLLM OpenAI-compatible API server
One of the biggest advantages of vLLM is that it can expose models through an API that mirrors the OpenAI interface. This makes it easier to reuse existing applications with minimal code changes.
To launch the server using a remote Hugging Face model reference (recent vLLM releases also provide the equivalent vllm serve shorthand for the same entrypoint):
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-1B-Instruct --host 0.0.0.0 --port 8000
To launch the server with a model stored locally:
python3 -m vllm.entrypoints.openai.api_server --model /opt/models/llama-3.2-1b-instruct --host 0.0.0.0 --port 8000 --served-model-name llama-3.2-1b
The --served-model-name value is the model identifier clients will use in API requests.
Useful vLLM server options
For production LLM inference, you will often want to adjust memory usage, context length, or multi-GPU behavior.
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --max-model-len 4096 --gpu-memory-utilization 0.90 --dtype bfloat16 --max-num-seqs 256 --enable-chunked-prefill --api-key your-secret-key
Important parameters include:
- tensor-parallel-size to distribute inference across multiple GPUs
- max-model-len to limit context size and reduce memory pressure
- gpu-memory-utilization to define how aggressively vLLM uses available VRAM
- dtype to control numerical precision based on your GPU architecture
- max-num-seqs to tune concurrency
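To see why max-model-len matters for memory pressure, it helps to estimate how the KV cache grows with context length. A back-of-the-envelope sketch using Llama-3-8B's published architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) in bfloat16:

```python
# Rough KV cache footprint: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value.
# Figures below are for Llama-3-8B with grouped-query attention, stored in bfloat16.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_value = 2  # bfloat16

kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
print(kv_per_token)  # 131072 bytes, i.e. 128 KiB per token

# A single sequence at the full 4096-token context holds ~0.5 GiB of KV cache,
# so max-model-len directly caps per-request memory.
ctx = 4096
print(kv_per_token * ctx / 2**30)  # 0.5
```

This is only the KV cache; model weights and activation buffers come on top, which is what gpu-memory-utilization budgets against.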
How to test the vLLM API server
After the server is running, send a request to confirm that inference works correctly.
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama-3.2-1b","messages":[{"role":"user","content":"Explain what a KV cache is in 2 sentences."}],"max_tokens":200,"temperature":0.7}'
You can also test classic text completion behavior:
curl http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model":"llama-3.2-1b","prompt":"The capital of France is","max_tokens":50}'
To view the list of models currently exposed by the service:
curl http://localhost:8000/v1/models
This is especially useful when checking whether your served-model-name matches the model name sent by client applications.
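The response follows the standard OpenAI list shape. A short sketch of pulling the served IDs out of it (the sample payload below is abridged, not a full server response):

```python
import json

# Abridged example of a /v1/models response body from the server.
sample = '{"object": "list", "data": [{"id": "llama-3.2-1b", "object": "model"}]}'

# Each entry's "id" is the name clients must send in the "model" field.
served_ids = [m["id"] for m in json.loads(sample)["data"]]
print(served_ids)  # ['llama-3.2-1b']
```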
Connecting a Python client to vLLM
Because vLLM uses an OpenAI-compatible interface, you can point many standard OpenAI SDK integrations at your local Ubuntu server simply by changing the base URL.
If your application already uses an OpenAI client library, update the endpoint to http://localhost:8000/v1 and use the model name exposed by your server. This makes vLLM attractive for internal copilots, RAG systems, and private AI infrastructure where data should remain on-premises.
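As a dependency-free sketch of what such a request looks like on the wire (the helper name and defaults here are my own, not a vLLM API), using only the standard library; the official OpenAI SDK works the same way once its base_url points at the server:

```python
import json
import urllib.request

def build_chat_request(base_url, model, prompt, api_key=None, max_tokens=200):
    """Build an OpenAI-style chat completion request for a local vLLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers=headers,
        method="POST",
    )

req = build_chat_request("http://localhost:8000", "llama-3.2-1b", "Hello!")
# response = urllib.request.urlopen(req)  # uncomment once the server is running
```

Swapping between OpenAI's hosted API and your local server then comes down to the base URL, the model name, and the API key.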
Using quantized models to reduce VRAM usage
Quantization is a practical way to run larger models on smaller GPUs. vLLM supports several quantized model formats that can lower memory requirements while preserving useful inference speed.
Example with AWQ:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-13B-chat-AWQ --quantization awq --max-model-len 4096
Example with GPTQ:
python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Llama-2-13B-chat-GPTQ --quantization gptq
Example with bitsandbytes:
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B-Instruct --quantization bitsandbytes --load-format bitsandbytes --tensor-parallel-size 4
If you are deploying vLLM on Ubuntu with limited GPU capacity, quantized models can make the difference between a failed launch and a stable inference service.
Running vLLM as a systemd service
For persistent deployments, it is better to run the server as a managed Linux service. This allows automatic restarts and system boot integration.
Create a service file at /etc/systemd/system/vllm.service with your preferred environment variables, model path, and startup command. Then reload systemd and enable the service.
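A minimal unit-file sketch to adapt (the user, virtual environment path, model path, and cache directory below are placeholders for your own setup):

```ini
[Unit]
Description=vLLM OpenAI-compatible API server
After=network-online.target

[Service]
User=vllm
Environment=HF_HOME=/opt/hf-cache
ExecStart=/home/vllm/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
    --model /opt/models/llama-3.2-1b-instruct \
    --host 0.0.0.0 --port 8000 \
    --served-model-name llama-3.2-1b
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Restart=always gives you automatic recovery from crashes, and pointing ExecStart at the virtual environment's interpreter keeps the service tied to the isolated install from Step 2.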
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo journalctl -u vllm -f
This setup is useful for production Ubuntu servers hosting local LLM APIs behind internal networks, reverse proxies, or application gateways.
Benchmarking vLLM throughput
Performance testing helps you determine how many prompts per second your infrastructure can support. Benchmarking is especially important when tuning concurrency, prompt size, and model selection.
The benchmark scripts live in the vLLM source repository rather than in the pip package, so clone it first:
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install aiohttp
python3 benchmarks/benchmark_throughput.py --backend vllm --model meta-llama/Llama-3.2-1B-Instruct --dataset ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000 --request-rate 10
python3 benchmarks/benchmark_serving.py --backend openai-chat --model llama-3.2-1b --base-url http://localhost:8000 --num-prompts 100
These tests can help identify the best values for throughput, latency, and request batching in your Ubuntu vLLM deployment.
Troubleshooting common vLLM issues on Ubuntu
If the model fails to load because of insufficient memory, reduce the context length or use quantization.
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-1B-Instruct --max-model-len 2048 --gpu-memory-utilization 0.80
If model downloads are very slow, enable faster Hugging Face transfer support.
pip install hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
If the server starts but requests fail, confirm the served model name and inspect service logs.
curl http://localhost:8000/v1/models
sudo journalctl -u vllm --since "5 minutes ago" -f
You may also notice the first request is slower than later requests. This is normal because CUDA kernels and internal caches often need an initial warm-up pass before steady-state performance is reached.
Best practices for production use
If you plan to use vLLM in production on Ubuntu, consider the following recommendations:
- Start with a smaller model to validate the environment
- Use a dedicated virtual environment for package isolation
- Serve the API behind a reverse proxy for TLS and access control
- Set an API key when exposing endpoints to other systems
- Benchmark different values for concurrency and context length
- Use quantized models when VRAM is constrained
- Run the service through systemd for stability and automatic restarts
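For the reverse-proxy recommendation, one common option is nginx. A minimal site-config sketch (the hostname and certificate paths are placeholders) that terminates TLS and forwards API traffic to the local vLLM port:

```nginx
# Hypothetical nginx site config for fronting a local vLLM server with TLS.
server {
    listen 443 ssl;
    server_name llm.internal.example.com;      # placeholder hostname
    ssl_certificate     /etc/ssl/certs/llm.crt;
    ssl_certificate_key /etc/ssl/private/llm.key;

    location /v1/ {
        proxy_pass http://127.0.0.1:8000;       # full /v1/... path is forwarded
        proxy_set_header Host $host;
        proxy_read_timeout 300s;                # long generations stream slowly
    }
}
```

Keeping vLLM bound to localhost behind the proxy means TLS, access control, and rate limiting live in one place instead of in the inference server.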
Conclusion
Installing vLLM on Ubuntu is a strong choice for organizations and developers who want fast, efficient local LLM inference with an OpenAI-compatible API. With proper GPU support, a clean Python environment, and the right model configuration, you can build a reliable AI serving stack for chatbots, internal assistants, RAG pipelines, and privacy-sensitive workloads.
By combining vLLM’s optimized inference engine with Ubuntu’s stability, you get a flexible foundation for scalable language model serving. Whether you are experimenting with a small test model or operating a larger multi-GPU deployment, the steps above provide a practical path to getting vLLM running smoothly in development and production.