Ollama Speed Test: Windows vs Linux (in WSL2)

Earlier this week, I stumbled upon a Reddit post discussing the performance differences between Ollama running natively on Windows and Ollama running in Linux under WSL2, so I thought I would test it out for myself.

Disclaimer: While I wouldn’t consider my testing to be 100% scientific, I did my best to keep the comparison fair and the results as accurate as possible.

Here is how I set up the test:

  • I used the latest version of Ollama on both operating systems – version 0.3.14
  • I downloaded the same model to both – llama3.2:latest, Parameters: 3.21B, Quantization: Q4_K_M (a quick way to confirm these details on each side is sketched just after this list)
  • For my GPU I am using an NVIDIA GeForce RTX 4080 with 16 GB GDDR6X
  • I used the basic Ollama prompt instead of a web front end like Open WebUI
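
For what it’s worth, a quick way to confirm that both servers report the same Ollama version and the same model build is to query the REST API. This is only a sketch, not part of the test itself, and both endpoint addresses are assumptions based on the setup described in this post:

import requests

# Both endpoints are assumptions: the Windows server ends up on 11435 later in
# this post, and the WSL2 instance is assumed to be on the default 11434.
ENDPOINTS = {
    "Windows": "http://127.0.0.1:11435",
    "WSL2": "http://127.0.0.1:11434",
}

for name, base in ENDPOINTS.items():
    version = requests.get(f"{base}/api/version", timeout=10).json()["version"]
    # /api/show reports model metadata; older Ollama builds used the key "name"
    # instead of "model" in this request body.
    show = requests.post(f"{base}/api/show", json={"model": "llama3.2"}, timeout=30).json()
    details = show.get("details", {})
    print(f"{name}: Ollama {version}, {details.get('parameter_size')} parameters, "
          f"{details.get('quantization_level')} quantization")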

For the Windows portion of the testing, I started by installing Ollama for Windows.

Since my Linux instance was still running at the time, I had to point the Windows Ollama API at a non-default port using the OLLAMA_HOST environment variable, and then start the server.

C:\Users\dschmitz>set OLLAMA_HOST=127.0.0.1:11435

C:\Users\dschmitz>ollama serve
2024/11/02 08:44:37 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11435 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\dschmitz\\.ollama\\models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-11-02T08:44:37.726-05:00 level=INFO source=images.go:754 msg="total blobs: 0"
time=2024-11-02T08:44:37.726-05:00 level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-11-02T08:44:37.726-05:00 level=INFO source=routes.go:1205 msg="Listening on 127.0.0.1:11435 (version 0.3.14)"
time=2024-11-02T08:44:37.727-05:00 level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm_v6.1]"
time=2024-11-02T08:44:37.727-05:00 level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-11-02T08:44:37.727-05:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2024-11-02T08:44:37.727-05:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=10 efficiency=0 threads=20
time=2024-11-02T08:44:37.939-05:00 level=INFO source=types.go:123 msg="inference compute" id=GPU-7494b07a-24c6-9c1e-a630-54a4e412eed2 library=cuda variant=v12 compute=8.9 driver=12.6 name="NVIDIA GeForce RTX 4080" total="16.0 GiB" available="14.7 GiB"

Once the Windows Ollama server was running, I opened a second command prompt, and started my testing using the Ollama prompt.
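
A note if you replicate this: the ollama command-line client reads the same OLLAMA_HOST variable, so it needs to be set in the second prompt as well; otherwise the client connects to whatever is listening on the default port 11434. The same applies when scripting against a non-default port. Here is a minimal sketch with the ollama Python package (purely illustrative, not what produced the timings below):

from ollama import Client

# Point the client at the non-default port the Windows server was started on.
# The host value is specific to the setup in this post.
win_client = Client(host="http://127.0.0.1:11435")

# Listing the local models is a quick sanity check that you are talking to the
# server you think you are (and that llama3.2 is present) before timing anything.
print(win_client.list())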

As you can see from the output below, I set it to verbose mode so that it prints the statistics at the bottom of each result, like this:

C:\Users\dschmitz>ollama run llama3.2
>>> /set verbose
Set 'verbose' mode.
>>> write a long story about little red riding hood
Once upon a time, in a small village nestled at the edge of a dense forest, there lived a young girl named Little
Red Riding Hood.
...
...
total duration:       7.2977744s
load duration:        17.5168ms
prompt eval count:    34 token(s)
prompt eval duration: 284.146ms
prompt eval rate:     119.66 tokens/s
eval count:           1005 token(s)
eval duration:        6.99479s
eval rate:            143.68 tokens/s
>>> Send a message (/? for help)
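
The eval rate that verbose mode reports is simply the eval count divided by the eval duration. If you would rather collect these numbers programmatically than copy them out of the terminal, the /api/generate endpoint returns the same counters, with durations in nanoseconds. This is just a sketch of that approach rather than the method I used for the tables below, and the endpoint and prompt list are placeholders:

import requests

BASE = "http://127.0.0.1:11435"  # Windows server from above; swap in the WSL2 endpoint to compare
PROMPTS = [
    "write a long story about little red riding hood",
    "write a game in python to play guess a number",
]

for prompt in PROMPTS:
    resp = requests.post(
        f"{BASE}/api/generate",
        json={"model": "llama3.2", "prompt": prompt, "stream": False},
        timeout=600,
    ).json()
    # eval_count is the number of generated tokens and eval_duration is in
    # nanoseconds, so tokens per second is count / (duration / 1e9).
    tokens = resp["eval_count"]
    seconds = resp["eval_duration"] / 1e9
    print(f"{prompt[:40]:<40} {tokens:5d} tokens  {tokens / seconds:6.2f} tokens/s")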

I ran the same prompts multiple times on both Windows and Linux running within WSL2, and here are the results of the first round.

Prompt | Win Tokens | Win Tokens/s | Linux Tokens | Linux Tokens/s | Tok/s Difference
write a long story about little red riding hood | 1005 | 143.68 | 1114 | 124.37 | 15.52%
write a really long extended story about little red riding hood with lots of imagery and details | 1945 | 133.75 | 1931 | 125.09 | 6.92%
write a game in python to play guess a number | 549 | 133.97 | 401 | 126.76 | 5.68%
In Ubuntu Linux, what does the grep command do? | 569 | 137.61 | 477 | 121.24 | 13.50%
Averages | 1017 | 137.25 | 980.75 | 124.37 | 10.41%
First round

Just to make sure that there wasn’t any interference from the two Ollama instances running at the same time, for the second round I stopped the Linux instance while testing on Windows, and vice versa.

Prompt | Win Tokens | Win Tokens/s | Linux Tokens | Linux Tokens/s | Tok/s Difference
write a long story about little red riding hood | 1516 | 143.40 | 1138 | 127.11 | 12.81%
write a really long extended story about little red riding hood with lots of imagery and details | 1676 | 145.74 | 1921 | 122.45 | 19.02%
write a game in python to play guess a number | 273 | 144.79 | 277 | 128.28 | 12.87%
In Ubuntu Linux, what does the grep command do? | 432 | 138.09 | 476 | 127.73 | 8.11%
Averages | 974.25 | 143.005 | 953 | 126.3925 | 13.20%
Second round

Conclusion

Going into this test, I was expecting the WSL instance to carry some degree of overhead due to the WSL2 virtualization, but the mixed comments on Reddit left me unsure how large it would be. In my opinion, a 10-13% difference in tokens per second doesn’t make that much of a difference for everyday use. On the other hand, if you want to squeeze every last drop of performance out of your GPU, then running Ollama natively on Windows seems to be the way to go.
